Praise from the Experts
“I wish that Ron Cody’s new book, Learning SAS® by Example: A Programmer’s Guide, had been around when I was first learning SAS! This impressive book steadily moves the reader from SAS novice to SAS authority in 26 well-written, well-paced chapters. All of the basics are covered in this book; the fundamentals of the DATA step, getting data into and out of SAS, conditional and iterative processing, SAS functions and formats, important SAS procedures, creating frequency and tabular reports, and generating output in publishable formats such as RTF and PDF.
“This book is well suited for people who are just learning SAS, as well as for people who would like a broad-based SAS reference. There are plenty of solid examples that emphasize the material in the chapters. Each chapter ends with exercises that test the reader’s understanding of the concepts and the SAS code presented. Because of its pacing, breadth, and scope, I have no doubt that Learning SAS® by Example: A Programmer’s Guide will become the preferred book for learning SAS software in business, government, and educational institutions the world over.”
Michael A. Raithel
Senior Systems Analyst
Westat
“It is refreshing to find the highlights of SAS all in one book, essentially a ‘soup to nuts’ of SAS concepts! The author has organized the book well. It starts with simple applications of SAS and gradually works up to advanced topics like ODS and SQL. As the name implies, every concept is presented with lucid examples.
“This book is ideal for programmers who are novice SAS users but cannot afford the time to learn SAS in a formal manner. Even advanced users of SAS can benefit from the excellent examples and refresh their memories of long-forgotten details of options and syntax.”
Priya Suresh
Research Programmer Analyst
RTI International
Research Triangle Park, North Carolina
“Teachers and students will love the detailed code examples and numerous problems. The book is destined to become a classic like Ron’s Applied Statistics and the SAS® Programming Language, which is now in its fifth edition.”
Ian Whitlock
“Ron Cody’s newest book can serve not only as a textbook for a SAS programming course, but also as a reference for intermediate-level programmers. The book begins with basic data input methods and advances through merging data sets and more complex tasks.
“I would recommend Learning SAS® by Example: A Programmer’s Guide to anyone wishing to learn or refresh his/her SAS programming skills.”
Karol H. Katz, M.S.
Programmer Analyst
Yale University School of Medicine
“Ron Cody has earned a reputation as an excellent lecturer, instructor, and author. His most recent work, Learning SAS® by Example: A Programmer’s Guide, does not merely adhere to his historic level of excellence; it raises it to a new level.
“Ron has succeeded in condensing the major components of Base SAS into one easy-to-read book. Using simple examples, he illustrates the use of the DATA step and many of its components, basic PROCs, and even macros and ODS.
“Learning SAS® by Example: A Programmer’s Guide is now at the top of my list of texts to assist programmers in teaching themselves SAS.”
Andrew T. Kuligowski
SouthEast SAS Users Group
Learning SAS® by Example
A Programmer’s Guide
Ron Cody
The correct bibliographic citation for this manual is as follows: Cody, Ron. 2007. Learning SAS® by Example: A Programmer’s Guide. Cary, NC: SAS Institute Inc.
Learning SAS® by Example: A Programmer’s Guide
Copyright © 2007, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-59994-165-3
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.
U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
1st printing, February 2007
SAS® Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hardcopy books, visit the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
Contents
List of Programs
Preface
Acknowledgments
Part 1 Getting Started
Chapter 1 What Is SAS?
1.1 Introduction
1.2 Getting Data into SAS
1.3 A Sample SAS Program
1.4 SAS Names
1.5 SAS Data Sets and SAS Data Types
1.6 The SAS Display Manager and SAS Enterprise Guide
1.7 Problems
Chapter 2 Writing Your First SAS Program
2.1 A Simple Program to Read Raw Data and Produce a Report
2.2 Enhancing the Program
2.3 More on Comment Statements
2.4 How SAS Works (a Look Inside the “Black Box”)
2.5 Problems
Part 2 DATA Step Processing
Chapter 3 Reading Raw Data from External Files
3.1 Introduction
3.2 Reading Data Values Separated by Blanks
3.3 Specifying Missing Values with List Input
3.4 Reading Data Values Separated by Commas (CSV Files)
3.5 Using an Alternative Method to Specify an External File
3.6 Reading Data Values Separated by Delimiters Other Than Blanks or Commas
3.7 Placing Data Lines Directly in Your Program (the DATALINES Statement)
3.8 Specifying INFILE Options with the DATALINES Statement
3.9 Reading Raw Data from Fixed Columns—Method 1: Column Input
3.10 Reading Raw Data from Fixed Columns—Method 2: Formatted Input
3.11 Using a FORMAT Statement in a DATA Step versus in a Procedure
3.12 Using Informats with List Input
3.13 Supplying an INFORMAT Statement with List Input
3.14 Using List Input with Embedded Delimiters
3.15 Problems
Chapter 4 Creating Permanent SAS Data Sets
4.1 Introduction
4.2 SAS Libraries—The LIBNAME Statement
4.3 Why Create Permanent SAS Data Sets?
4.4 Examining the Descriptor Portion of a SAS Data Set Using PROC CONTENTS
4.5 Listing All the SAS Data Sets in a SAS Library Using PROC CONTENTS
4.6 Viewing the Descriptor Portion of a SAS Data Set Using the SAS Explorer
4.7 Viewing the Data Portion of a SAS Data Set Using PROC PRINT
4.8 Viewing the Data Portion of a SAS Data Set Using the SAS VIEWTABLE Window
4.9 Using a SAS Data Set as Input to a DATA Step
4.10 DATA _NULL_: A Data Set That Isn’t
4.11 Problems
Chapter 5 Creating Formats and Labels
5.1 Adding Labels to Your Variables
5.2 Using Formats to Enhance Your Output
5.3 Regrouping Values Using Formats
5.4 More on Format Ranges
5.5 Storing Your Formats in a Format Library
5.6 Permanent Data Set Attributes
5.7 Accessing a Permanent SAS Data Set with User-Defined Formats
5.8 Displaying Your Format Definitions
5.9 Problems
Chapter 6 Reading and Writing Data from an Excel Spreadsheet
6.1 Introduction
6.2 Using the Import Wizard to Convert a Spreadsheet to a SAS Data Set
6.3 Creating an Excel Spreadsheet from a SAS Data Set
6.4 Using an Engine to Read an Excel Spreadsheet
6.5 Using the SAS Output Delivery System to Convert a SAS Data Set to an Excel Spreadsheet
6.6 Problems
Chapter 7 Performing Conditional Processing
7.1 Introduction
7.2 The IF and ELSE IF Statements
7.3 The Subsetting IF Statement
7.4 The IN Operator
7.5 Using a SELECT Statement for Logical Tests
7.6 Using Boolean Logic (AND, OR, and NOT Operators)
7.7 A Caution When Using Multiple OR Operators
7.8 The WHERE Statement
7.9 Some Useful WHERE Operators
7.10 Problems
Chapter 8 Performing Iterative Processing: Looping
8.1 Introduction
8.2 DO Groups
8.3 The Sum Statement
8.4 The Iterative DO Loop
8.5 Other Forms of an Iterative DO Loop
8.6 DO WHILE and DO UNTIL Statements
8.7 A Caution When Using DO UNTIL Statements
8.8 LEAVE and CONTINUE Statements
8.9 Problems
Chapter 9 Working with Dates
9.1 Introduction
9.2 How SAS Stores Dates
9.3 Reading Date Values from Raw Data
9.4 Computing the Number of Years between Two Dates
9.5 Demonstrating a Date Constant
9.6 Computing the Current Date
9.7 Extracting the Day of the Week, Day of the Month, Month, and Year from a SAS Date
9.8 Creating a SAS Date from Month, Day, and Year Values
9.9 Substituting the 15th of the Month when the Day Value Is Missing
9.10 Using Date Interval Functions
9.11 Problems
Chapter 10 Subsetting and Combining SAS Data Sets
10.1 Introduction
10.2 Subsetting a SAS Data Set
10.3 Creating More Than One Subset Data Set in One DATA Step
10.4 Adding Observations to a SAS Data Set
10.5 Interleaving Data Sets
10.6 Combining Detail and Summary Data
10.7 Merging Two Data Sets
10.8 Omitting the BY Statement in a Merge
10.9 Controlling Observations in a Merged Data Set
10.10 More Uses for IN= Variables
10.11 When Does a DATA Step End?
10.12 Merging Two Data Sets with Different BY Variable Names
10.13 Merging Two Data Sets with Different BY Variable Data Types
10.14 One-to-One, One-to-Many, and Many-to-Many Merges
10.15 Updating a Master File from a Transaction File
10.16 Problems
Chapter 11 Working with Numeric Functions
11.1 Introduction
11.2 Functions That Round and Truncate Numeric Values
11.3 Functions That Work with Missing Values
11.4 Setting Character and Numeric Values to Missing
11.5 Descriptive Statistics Functions
11.6 Computing Sums within an Observation
11.7 Mathematical Functions
11.8 Computing Some Useful Constants
11.9 Generating Random Numbers
11.10 Special Functions
11.11 Functions That Return Values from Previous Observations
11.12 Problems
Chapter 12 Working with Character Functions
12.1 Introduction
12.2 Determining the Length of a Character Value
12.3 Changing the Case of Characters
12.4 Removing Characters from Strings
12.5 Joining Two or More Strings Together
12.6 Removing Leading or Trailing Blanks
12.7 Using the COMPRESS Function to Remove Characters from a String
12.8 Searching for Characters
12.9 Searching for Individual Characters
12.10 Searching for Words in a String
12.11 Searching for Character Classes
12.12 Using the NOT Functions for Data Cleaning
12.13 Describing a Real Blockbuster Data Cleaning Function
12.14 Extracting Part of a String
12.15 Dividing Strings into Words
12.16 Comparing Strings
12.17 Performing a Fuzzy Match
12.18 Substituting Characters or Words
12.19 Problems
Chapter 13 Working with Arrays
13.1 Introduction
13.2 Setting Values of 999 to a SAS Missing Value for Several Numeric Variables
13.3 Setting Values of NA and ? to a Missing Character Value
13.4 Converting All Character Values to Lowercase
13.5 Using an Array to Create New Variables
13.6 Changing the Array Bounds
13.7 Temporary Arrays
13.8 Loading the Initial Values of a Temporary Array from a Raw Data File
13.9 Using a Multidimensional Array for Table Lookup
13.10 Problems
Part 3 Presenting and Summarizing Your Data
Chapter 14 Displaying Your Data
14.1 Introduction
14.2 The Basics
14.3 Changing the Appearance of Your Listing
14.4 Changing the Appearance of Values
14.5 Controlling the Observations That Appear in Your Listing
14.6 Adding Additional Titles and Footnotes to Your Listing
14.7 Changing the Order of Your Listing
14.8 Sorting by More Than One Variable
14.9 Labeling Your Column Headings
14.10 Adding Subtotals and Totals to Your Listing
14.11 Making Your Listing Easier to Read
14.12 Adding the Number of Observations to Your Listing
14.13 Double-Spacing Your Listing
14.14 Listing the First n Observations of Your Data Set
14.15 Problems
Chapter 15 Creating Customized Reports
15.1 Introduction
15.2 Using PROC REPORT
15.3 Selecting Variables to Include in Your Report
15.4 Comparing Detail and Summary Reports
15.5 Producing a Summary Report
15.6 Demonstrating the FLOW Option of PROC REPORT
15.7 Using Two Grouping Variables
15.8 Changing the Order of Variables in the COLUMN Statement
15.9 Changing the Order of Rows in a Report
15.10 Applying the ORDER Usage to Two Variables
15.11 Creating a Multi-Column Report
15.12 Producing Report Breaks
15.13 Using a Nonprinting Variable to Order a Report
15.14 Computing a New Variable with PROC REPORT
15.15 Computing a Character Variable in a COMPUTE Block
15.16 Creating an ACROSS Variable with PROC REPORT
15.17 Modifying the Column Label for an ACROSS Variable
15.18 Using an ACROSS Usage to Display Statistics
15.19 Problems
Chapter 16 Summarizing Your Data
16.1 Introduction
16.2 PROC MEANS—Starting from the Beginning
16.3 Adding a BY Statement to PROC MEANS
16.4 Using a CLASS Statement with PROC MEANS
16.5 Applying a Format to a CLASS Variable
16.6 Deciding between a BY Statement and a CLASS Statement
16.7 Creating Summary Data Sets Using PROC MEANS
16.8 Outputting Other Descriptive Statistics with PROC MEANS
16.9 Asking SAS to Name the Variables in the Output Data Set
16.10 Outputting a Summary Data Set: Including a BY Statement
16.11 Outputting a Summary Data Set: Including a CLASS Statement
16.12 Using Two CLASS Variables with PROC MEANS
16.13 Selecting Different Statistics for Each Variable
16.14 Problems
Chapter 17 Counting Frequencies
17.1 Introduction
17.2 Counting Frequencies
17.3 Selecting Variables for PROC FREQ
17.4 Using Formats to Label the Output
17.5 Using Formats to Group Values
17.6 Problems Grouping Values with PROC FREQ
17.7 Displaying Missing Values in the Frequency Table
17.8 Changing the Order of Values in PROC FREQ
17.9 Producing Two-Way Tables
17.10 Requesting Multiple Two-Way Tables
17.11 Producing Three-Way Tables
17.12 Problems
Chapter 18 Creating Tabular Reports
18.1 Introduction
18.2 A Simple PROC TABULATE Table
18.3 Describing the Three PROC TABULATE Operators
18.4 Using the Keyword ALL
18.5 Producing Descriptive Statistics
18.6 Combining CLASS and Analysis Variables in a Table
18.7 Customizing Your Table
18.8 Demonstrating a More Complex Table
18.9 Computing Row and Column Percentages
18.10 Displaying Percentages in a Two-Dimensional Table
18.11 Computing Column Percentages
18.12 Computing Percentages on Numeric Variables
18.13 Understanding How Missing Values Affect PROC TABULATE Output
18.14 Problems
Chapter 19 Introducing the Output Delivery System
19.1 Introduction
19.2 Sending SAS Output to an HTML File
19.3 Creating a Table of Contents
19.4 Selecting a Different HTML Style
19.5 Choosing Other ODS Destinations
19.6 Selecting or Excluding Portions of SAS Output
19.7 Sending Output to a SAS Data Set
19.8 Problems
Chapter 20 Generating High-Quality Graphics
20.1 Introduction
20.2 Some Basic Concepts
20.3 Producing Simple Bar Charts Using PROC GCHART
20.4 Creating Pie Charts
20.5 Creating Bar Charts for a Continuous Variable
20.6 Creating Charts with Values Representing Categories
20.7 Creating Bar Charts Representing Sums
20.8 Creating Bar Charts Representing Means
20.9 Adding Another Variable to the Chart
20.10 Producing Scatter Plots
20.11 Connecting Points
20.12 Connecting Points with a Smooth Line
20.13 Problems
Part 4 Advanced Topics
Chapter 21 Using Advanced INPUT Techniques
21.1 Introduction
21.2 Handling Missing Values at the End of a Line
21.3 Reading Short Data Lines
21.4 Reading External Files with Lines Longer Than 256 Characters
21.5 Detecting the End of the File
21.6 Reading a Portion of a Raw Data File
21.7 Reading Data from Multiple Files
21.8 Reading Data from Multiple Files Using a FILENAME Statement
21.9 Reading External Filenames from a Data File
21.10 Reading Multiple Lines of Data to Form One Observation
21.11 Reading Data Conditionally (the Single Trailing @ Sign)
21.12 More Examples of the Single Trailing @ Sign
21.13 Creating Multiple Observations from One Line of Input
21.14 Using Variable and Informat Lists
21.15 Using Relative Column Pointers to Read a Complex Data Structure Efficiently
21.16 Problems
Chapter 22 Using Advanced Features of User-Defined Formats and Informats
22.1 Introduction
22.2 Using Formats to Recode Variables
22.3 Using Formats with a PUT Function to Create New Variables
22.4 Creating User-Defined Informats
22.5 Reading Character and Numeric Data in One Step
22.6 Using Formats (and Informats) to Perform Table Lookup
22.7 Using a SAS Data Set to Create a Format
22.8 Updating and Maintaining Your Formats
22.9 Using Formats within Formats
22.10 Using Multilabel Formats
22.11 Using the INPUTN Function to Perform a More Complicated Table Lookup
22.12 Problems
Chapter 23 Restructuring SAS Data Sets
23.1 Introduction
23.2 Converting a Data Set with One Observation per Subject to a Data Set with Several Observations per Subject: Using a DATA Step
23.3 Converting a Data Set with Several Observations per Subject to a Data Set with One Observation per Subject: Using a DATA Step
23.4 Converting a Data Set with One Observation per Subject to a Data Set with Several Observations per Subject: Using PROC TRANSPOSE
23.5 Converting a Data Set with Several Observations per Subject to a Data Set with One Observation per Subject: Using PROC TRANSPOSE
23.6 Problems
Chapter 24 Working with Multiple Observations per Subject
24.1 Introduction
24.2 Identifying the First or Last Observation in a Group
24.3 Counting the Number of Visits Using PROC FREQ
24.4 Counting the Number of Visits Using PROC MEANS
24.5 Computing Differences between Observations
24.6 Computing Differences between the First and Last Observation in a BY Group Using the LAG Function
24.7 Computing Differences between the First and Last Observation in a BY Group Using a RETAIN Statement
24.8 Using a Retained Variable to “Remember” a Previous Value
24.9 Problems
Chapter 25 Introducing the SAS Macro Language
25.1 Introduction
25.2 Macro Variables: What Are They?
25.3 Some Built-In Macro Variables
25.4 Assigning Values to Macro Variables with a %LET Statement
25.5 Demonstrating a Simple Macro
25.6 A Word about Tokens
25.7 Another Example of Using a Macro Variable as a Prefix
25.8 Using a Macro Variable to Transfer a Value between DATA Steps
25.9 Problems
Chapter 26 Introducing the Structured Query Language
26.1 Introduction
26.2 Some Basics
26.3 Joining Two Tables (Merge)
26.4 Left, Right, and Full Joins
26.5 Concatenating Data Sets
26.6 Using Summary Functions
26.7 Demonstrating an ORDER Clause
26.8 An Example of Fuzzy Matching
26.9 Problems
Solutions to Odd-Numbered Problems
Index
List of Programs
Programs in Chapter 1
1-1 A sample SAS program
1-2 An alternative version of Program 1-1
Programs in Chapter 2
2-1 Your first SAS program
2-2 Enhancing the program
2-3 Example of a fancy comment using the asterisk style
2-4 Example of a fancy comment using the /* */ style
2-5 Incorrect nesting of /* */ style comments
Programs in Chapter 3
3-1 Demonstrating list input with blanks as delimiters
3-2 Adding PROC PRINT to list the observations in the data set
3-3 Reading data from a comma-separated values (CSV) file
3-4 Using a FILENAME statement to identify an external file
3-5 Demonstrating the DATALINES statement
3-6 Using INFILE options with DATALINES
3-7 Demonstrating column input
3-8 Demonstrating formatted input
3-9 Demonstrating a FORMAT statement
3-10 Rerunning Program 3-9 with a different format
3-11 Using informats with list input
3-12 Supplying an INFORMAT statement with list input
3-13 Demonstrating the ampersand modifier for list input
Programs in Chapter 4
4-1 Creating a permanent SAS data set
4-2 Using PROC CONTENTS to examine the descriptor portion of a SAS data set
4-3 Demonstrating the VARNUM option of PROC CONTENTS
4-4 Using a LIBNAME in a new SAS session
4-5 Using PROC CONTENTS to list the names of all the SAS data sets in a SAS library
4-6 Using PROC PRINT to list the data portion of a SAS data set
4-7 Using observations from a SAS data set as input to a new SAS data set
4-8 Demonstrating a DATA _NULL_ step
Programs in Chapter 5
5-1 Adding labels to variables in a SAS data set
5-2 Using PROC FORMAT to create user-defined formats
5-3 Adding a FORMAT statement in PROC PRINT
5-4 Regrouping values using a format
5-5 Applying the new format to several variables with PROC FREQ
5-6 Creating a permanent format library
5-7 Adding LABEL and FORMAT statements in the DATA step
5-8 Running PROC CONTENTS on a data set with labels and formats
5-9 Using a user-defined format
5-10 Displaying format definitions in a user-created library
5-11 Demonstrating a SELECT statement with PROC FORMAT
Programs in Chapter 6
6-1 Using PROC PRINT to list the first four observations in a data set
6-2 Using the FIRSTOBS= and OBS= options together
6-3 Reading a spreadsheet using an XLS engine
6-4 Using ODS to convert a SAS data set into a CSV file (to be read by Excel)
Programs in Chapter 7
7-1 First attempt to group ages into age groups (incorrect)
7-2 Corrected program to group ages into age groups
7-3 An alternative to Program 7-2
7-4 Demonstrating a subsetting IF statement
7-5 Demonstrating a SELECT statement when a select-expression is missing
7-6 Combining various Boolean operators
7-7 A caution on the use of multiple OR operators
7-8 Using a WHERE statement to subset a SAS data set
Programs in Chapter 8
8-1 Example of a program that does not use a DO group
8-2 Demonstrating a DO group
8-3 Attempt to create a cumulative total
8-4 Adding a RETAIN statement to Program 8-3
8-5 Third attempt to create cumulative total
8-6 Using a sum statement to create a cumulative total
8-7 Using a sum statement to create a counter
8-8 Program without iterative loops
8-9 Demonstrating an iterative DO loop
8-10 Using an iterative DO loop to make a table of squares and square roots
8-11 Using an iterative DO loop to graph an equation
8-12 Using character values for DO loop index values
8-13 Demonstrating a DO UNTIL loop
8-14 Demonstrating that a DO UNTIL loop always executes at least once
8-15 Demonstrating a DO WHILE statement
8-16 Demonstrating that DO WHILE loops are evaluated at the top
8-17 Combining a DO UNTIL and iterative DO loop
8-18 Demonstrating the LEAVE statement
8-19 Demonstrating a CONTINUE statement
Programs in Chapter 9
9-1 Program to read dates from raw data
9-2 Adding a FORMAT statement to format each of the date values
9-3 Compute a person's age in years
9-4 Demonstrating a date constant
9-5 Using the TODAY function to return the current date
9-6 Extracting the day of the week, day of the month, month, and year from a SAS date
9-7 Using the MDY function to create a SAS date from month, day, and year values
9-8 Substituting the 15th of the month when a day value is missing
9-9 Demonstrating the INTCK function
9-10 Using the INTNX function to compute dates 6 months after discharge
9-11 Demonstrating the SAMEDAY alignment with the INTNX function
Programs in Chapter 10
10-1 Subsetting a SAS data set using a WHERE statement
10-2 Demonstrating a DROP= data set option
10-3 Creating two data sets in one DATA step
10-4 Using a SET statement to combine observations from two data sets
10-5 Using a SET statement on two data sets containing different variables
10-6 Interleaving data sets
10-7 Combining detail and summary data: using a conditional SET statement
10-8 Merging two SAS data sets
10-9 Demonstrating the IN= data set option
10-10 Using IN= variables to select IDs that are in both data sets
10-11 More examples of using IN= variables
10-12 Demonstrating when a DATA step ends
10-13 Merging two data sets by renaming a variable in one data set
10-14 Merging two data sets when the BY variables are different data types
10-15 An alternative to Program 10-14
10-16 Updating a master file from a transaction file
Programs in Chapter 11
11-1 Demonstrating the ROUND and INT truncation functions
11-2 Testing for missing numeric and character values (without the MISSING function)
11-3 Demonstrating the MISSING function
11-4 Demonstrating the N, MEAN, MIN, and MAX functions
11-5 Finding the sum of the three largest values in a list of variables
11-6 Using the SUM function to compute totals
11-7 Demonstrating the ABS, SQRT, EXP, and LOG functions
11-8 Computing some useful constants with the CONSTANT function
11-9 Using the RANUNI function to randomly select observations
11-10 Using PROC SURVEYSELECT to obtain a random sample
11-11 Using the INPUT function to perform a character-to-numeric conversion
11-12 Demonstrating the PUT function
11-13 Demonstrating the LAG and LAGn functions
11-14 Demonstrating what happens when you execute a LAG function conditionally
11-15 Using the LAG function to compute interobservation differences
11-16 Demonstrating the DIF function
Programs in Chapter 12
12-1 Determining the length of a character value
12-2 Changing values to uppercase
12-3 Converting multiple blanks to a single blank and demonstrating the PROPCASE function
12-4 Demonstrating the concatenation functions
12-5 Demonstrating the TRIM, LEFT, and STRIP functions
12-6 Using the COMPRESS function to remove characters from a string
12-7 Demonstrating the COMPRESS modifiers
12-8 Demonstrating the FIND and COMPRESS functions
12-9 Demonstrating the FINDW function
12-10 Demonstrating the ANYDIGIT function
12-11 Demonstrating the NOT functions for data cleaning
12-12 Using the VERIFY function for data cleaning
12-13 Using the SUBSTR function to extract substrings
12-14 Demonstrating the SCAN function
12-15 Using the SCAN function to extract the last name
12-16 Demonstrating the COMPARE function
12-17 Clarifying the use of the colon modifier with the COMPARE function
12-18 Using the SPEDIS function to perform a fuzzy match
12-19 Demonstrating the TRANSLATE function
12-20 Using the TRANWRD function to standardize an address
Programs in Chapter 13
13-1 Converting values of 999 to a SAS missing value—without using arrays
13-2 Converting values of 999 to a SAS missing value—using arrays
13-3 Rewriting Program 13-2 using the CALL MISSING routine
13-4 Converting values of NA and ? to missing character values
13-5 Converting all character values in a SAS data set to lowercase
13-6 Using an array to create new variables
13-7 Changing the array bounds
13-8 Using a temporary array to score a test
13-9 Loading the initial values of a temporary array from a raw data file
13-10 Loading a two-dimensional, temporary array with data values
Programs in Chapter 14
14-1 PROC PRINT using all the defaults
14-2 Controlling which variables appear in the listing
14-3 Using an ID statement to omit the Obs column
14-4 Adding a FORMAT statement to PROC PRINT
14-5 Controlling which observations appear in the listing (WHERE statement)
14-6 Using the IN operator in a WHERE statement
14-7 Adding titles and footnotes to your listing
14-8 Using PROC SORT to change the order of your observations
14-9 Demonstrating the DESCENDING option of PROC SORT
14-10 Sorting by more than one variable
14-11 Using labels as column headings with PROC PRINT
14-12 Using a BY statement in PROC PRINT
14-13 Adding totals and subtotals to your listing
14-14 Using an ID statement and a BY statement in PROC PRINT
14-15 Demonstrating the N= option with PROC PRINT
14-16 Double-spacing your listing
14-17 Listing the first five observations of your data set
14-18 Forcing variable labels to print horizontally
Programs in Chapter 15
15-1 Listing of Medical using PROC PRINT
15-2 Using PROC REPORT (all defaults)
15-3 Adding a COLUMN statement to PROC REPORT
15-4 Using PROC REPORT with only numeric variables
15-5 Using DEFINE statements to define a display usage
15-6 Specifying a GROUP usage to create a summary report
15-7 Demonstrating the FLOW option with PROC REPORT
15-8 Explicitly defining usage for every variable
15-9 Demonstrating the effect of two variables with GROUP usage
15-10 Reversing the order of variables in the COLUMN statement
15-11 Demonstrating the ORDER usage of PROC REPORT
15-12 Applying the ORDER usage for two variables
15-13 Creating a multi-column report
15-14 Requesting a report break (RBREAK statement)
15-15 Demonstrating the BREAK statement of PROC REPORT
15-16 Using a nonprinting variable to order the rows of a report
15-17 Computing a new variable with PROC REPORT
15-18 Demonstrating an ACROSS usage in PROC REPORT
Programs in Chapter 16
16-1 PROC MEANS with all the defaults
16-2 Adding a VAR statement and requesting specific statistics with PROC MEANS
16-3 Adding a BY statement to PROC MEANS
16-4 Using a CLASS statement with PROC MEANS
16-5 Demonstrating the effect of a formatted CLASS variable
16-6 Creating a summary data set using PROC MEANS
16-7 Outputting more than one statistic with PROC MEANS
16-8 Demonstrating the OUTPUT option AUTONAME
16-9 Adding a BY statement to PROC MEANS
16-10 Adding a CLASS statement to PROC MEANS
16-11 Adding the NWAY option to PROC MEANS
16-12 Using two CLASS variables with PROC MEANS
16-13 Adding the CHARTYPE procedure option to PROC MEANS
16-14 Using the _TYPE_ variable to select cell means
16-15 Using a DATA step to create separate summary data sets
16-16 Selecting different statistics for each variable using PROC MEANS
Programs in Chapter 17
17-1 Counting frequencies: one-way tables using PROC FREQ
17-2 Adding a TABLES statement to PROC FREQ
17-3 Adding formats to Program 17-2
17-4 Using formats to group values
17-5 Demonstrating a problem in how PROC FREQ groups values
17-6 Fixing the grouping problem
17-7 Demonstrating the effect of the MISSING option of PROC FREQ
17-8 Demonstrating the ORDER= option of PROC FREQ
17-9 Demonstrating the ORDER= formatted, data, and freq options
17-10 Requesting a two-way table
17-11 Requesting a three-way table with PROC FREQ
Programs in Chapter 18
18-1 PROC TABULATE with all the defaults and a single CLASS variable
18-2 Demonstrating concatenation with PROC TABULATE
18-3 Demonstrating table dimensions with PROC TABULATE
18-4 Demonstrating the nesting operator with PROC TABULATE
18-5 Adding the keyword ALL to your table request
18-6 Using PROC TABULATE to produce descriptive statistics
18-7 Specifying statistics on an analysis variable with PROC TABULATE
18-8 Specifying more than one descriptive statistic with PROC TABULATE
18-9 Combining CLASS and analysis variables in a table
18-10 Associating a different format with each variable in a table
18-11 Renaming keywords with PROC TABULATE
18-12 Eliminating the N column in a PROC TABULATE table
18-13 Demonstrating a more complex table
18-14 Computing percentages in a one-dimensional table
18-15 Improving the appearance of output from Program 18-14
18-16 Counts and percentages in a two-dimensional table
18-17 Using COLPCTN to compute column percentages
18-18 Computing percentages on a numeric value
18-19 Demonstrating the effect of missing values on CLASS variables
18-20 Missing values on a CLASS variable that is not used in the table
18-21 Adding the PROC TABULATE procedure option MISSING
18-22 Demonstrating the MISSTEXT= TABLES option
Programs in Chapter 19
19-1 Sending SAS output to an HTML file
19-2 Creating a table of contents for HTML output
19-3 Choosing a style for HTML output
19-4 Using an ODS SELECT statement to restrict PROC UNIVARIATE output
19-5 Using the ODS TRACE statement to identify output objects
19-6 Using ODS to send procedure output to a SAS data set
19-7 Using an output data set to create a simplified report
Programs in Chapter 20
20-1 Producing a simple bar chart using PROC GCHART
20-2 Creating a simple pie chart
20-3 Creating a bar chart for a continuous variable
20-4 Selecting your own midpoints for the chart
20-5 Demonstrating the need for the DISCRETE option of PROC GCHART
20-6 Demonstrating the DISCRETE option of PROC GCHART
20-7 Creating a bar chart where the height of the bars represents sums
20-8 Creating a bar chart where the height of the bars represents means
20-9 Adding another variable to the chart
20-10 Demonstrating the SUBGROUP= option
20-11 Creating a simple scatter plot using all the defaults
20-12 Changing the plotting symbol and controlling the axis ranges
20-13 Joining the points with straight lines (first attempt)
20-14 Using the JOIN option on a sorted data set
20-15 Drawing a smooth line through your data points
Programs in Chapter 21
21-1 Missing values at the end of a line with list input
21-2 Reading a raw data file with short records
21-3 Demonstrating the INFILE PAD option
21-4 Demonstrating the END= option in the INFILE statement
21-5 Demonstrating the OBS= INFILE option to read the first three lines of data
21-6 Using the OBS= and FIRSTOBS= INFILE options together
21-7 Using the END= option to read data from multiple files
21-8 Reading external filenames from an external file
21-9 Reading external filenames using a DATALINES statement
21-10 Reading multiple lines of data to create one observation
21-11 Using an alternate method of reading multiple lines of data to form one SAS observation
21-12 Incorrect attempt to read a file of mixed record types
21-13 Using a trailing @ to read a file with mixed record types
21-14 Another example of a trailing @ sign
21-15 Creating one observation from one line of data
21-16 Creating several observations from one line of data
Programs in Chapter 22
22-1 Using a format to recode a variable
22-2 Using a format and a PUT function to create a new variable
22-3 Demonstrating a user-written informat
22-4 Demonstrating informat options UPCASE and JUST
22-5 A traditional approach to reading a combination of character and numeric data
22-6 Using an enhanced numeric informat to read a combination of character and numeric data
22-7 Another example of an enhanced numeric informat
22-8 Using formats and informats to perform a table lookup
22-9 Creating a test data set that will be used to make a CNTLIN data set
22-10 Creating a CNTLIN data set from an existing SAS data set
22-11 Using the CNTLIN= created data set
22-12 Adding an OTHER category to your format
22-13 Updating an existing format using a CNTLOUT= data set option
22-14 Demonstrating nested formats
22-15 Using the nested format in a DATA step
22-16 Creating a MULTILABEL format
22-17 Using a MULTILABEL format with PROC MEANS
22-18 Using the PRELOADFMT, PRINTMISS, and MISSTEXT options with PROC TABULATE
22-19 Partial program showing how to create several informats
22-20 Creating several informats with a single CNTLIN data set
22-21 Using a SELECT statement to display the contents of two informats
22-22 Using user-defined informats to perform a table lookup using the INPUTN function
Programs in Chapter 23
23-1 Creating a data set with several observations per subject from a data set with one observation per subject
23-2 Creating a data set with one observation per subject from a data set with several observations per subject
23-3 Using PROC TRANSPOSE to convert a data set with one observation per subject into a data set with several observations per subject (first attempt)
23-4 Using PROC TRANSPOSE to convert a data set with one observation per subject into a data set with several observations per subject
23-5 Using PROC TRANSPOSE to convert a SAS data set with several observations per subject into one with one observation per subject
Programs in Chapter 24
24-1 Creating FIRST. and LAST. Variables
24-2 Counting the number of visits per patient using a DATA step
24-3 Using PROC FREQ to count the number of observations in a BY group
24-4 Using the RENAME= and DROP= data set options to control the output data set
24-5 Using PROC MEANS to count the number of observations in a BY group
24-6 Computing differences between observations
24-7 Computing differences between the first and last observation in a BY group
24-8 Demonstrating the use of retained variables
24-9 Using a retained variable to “remember” a previous value
Programs in Chapter 25
25-1 Using an automatic macro variable to include a date and time in a title
25-2 Assigning a value to a macro variable with a %LET statement
25-3 Another example of using a %LET statement
25-4 Writing a simple macro
25-5 Program 25-4 with documentation added
25-6 Demonstrating a problem with resolving a macro variable
25-7 Program 25-6 corrected
25-8 Using a macro variable as a prefix (incorrect version)
25-9 Using a macro variable as a prefix (corrected version)
25-10 Using macro variables to transfer values from one DATA step to another
Programs in Chapter 26
26-1 Demonstrating a simple query from a single data set
26-2 Using an asterisk to select all the variables in a data set
26-3 Using PROC SQL to create a SAS data set
26-4 Joining two tables (Cartesian product)
26-5 Renaming the two Subj variables
26-6 Using aliases to simplify naming variables
26-7 Performing an inner join
26-8 Demonstrating a left, right, and full join
26-9 Concatenating two tables
26-10 Using a summary function in PROC SQL
26-11 Demonstrating an ORDER clause
26-12 Using PROC SQL to perform a fuzzy match
Preface

This book covers most SAS programming techniques, from the basics to intermediate and advanced topics. As indicated by the title, every programming technique is illustrated by one or more examples. Each chapter contains problems so that it could also be used as a textbook in a college course. Solutions to the odd-numbered problems are included in the book, while a complete set of solutions is available online for instructors.

The four major sections cover getting started, DATA step processing, presenting and summarizing your data, and advanced topics (such as reading unstructured data and using multilabel and embedded formats).

Anyone who programs using SAS, from beginner to intermediate and advanced users, will find this book helpful. The example approach makes it unique. If you ask most programmers how they learn to program or understand a SAS feature, they will tell you that they want to see an example. Every example in the book is followed by a detailed explanation of how the program works.

Because this book covers so many diverse topics, it should be an ideal reference for SAS users. There is enough detail, in chapters covering such topics as PROC REPORT or PROC TABULATE, that for many users, this book will be all they need (in addition to SAS OnlineDoc, of course).
Acknowledgments

As I have mentioned in previous books, this is the fun part. Most of the work is done and I get to thank all the people who helped make this book possible. Even though there is only one name on the cover of this book (mine), that doesn't mean that I am the only one who put in a lot of effort.

In particular, I had reviewers who did high-level reviews of preliminary chapters. So, thanks to Richard Bell, Cynthia Zender, Ginny Piechota, Kelly Gray, and Jason Secosky. Other reviewers read every word (and checked every program) in the final draft. This is a monumental job and I am really grateful to these dedicated folks. Let's give a round of applause to Paul Grant, Lynn Mackey, Linda Jolley, Susan Marcella, and Mike Zdeb.

Judy Whatley has been the acquisitions editor for quite a few of my books. She is one of the reasons that writing a book for SAS Press is so enjoyable and rewarding. She even lets me get away with some of my corny jokes and is my advocate for the slightly unorthodox pictures you see on the back of my books. Thanks, Judy!

Finally, the production staff worked hard to produce the finished book. They include Mary Beth Steinbach, managing editor; Kathy Restivo, copy editor; Candy Farrell and Jennifer Dilley, technical publishing specialists; Patrice Cherry, cover designer; Denise Jones, CD-ROM production specialist; and Liz Villani and Shelly Goodin, marketing specialists.

Ron Cody, Winter 2007
Part 1
Getting Started

Chapter 1 What Is SAS?
Chapter 2 Writing Your First SAS Program
Chapter 1
What Is SAS?

1.1 Introduction
1.2 Getting Data into SAS
1.3 A Sample SAS Program
1.4 SAS Names
1.5 SAS Data Sets and SAS Data Types
1.6 The SAS Display Manager and SAS Enterprise Guide
1.7 Problems
1.1 Introduction

SAS is a collection of modules that are used to process and analyze data. It began in the late ’60s and early ’70s as a statistical package (the name SAS originally stood for Statistical Analysis System). However, unlike many competing statistical packages, SAS is also an extremely powerful, general-purpose programming language. We see SAS as the predominant software in the pharmaceutical industry and most Fortune 500 companies. In recent years, it has been enhanced to provide state-of-the-art data mining tools and programs for Web development and analysis. This book covers most of the basic data management and programming tools provided in Base SAS. Statistical procedures are not covered here.¹

The only way to really learn a programming language is to write lots of programs, make some errors, correct the errors, and then make some more. You can download all the programs and data files used in this book from this book’s companion Web site at http://support.sas.com/cody and from the CD that accompanies this book.

If you already have access to SAS at work or school, you are ready to go. If you are learning SAS on your own and do not have a copy of SAS to play with, we highly recommend that you obtain the SAS Learning Edition 4.1. This is a relatively inexpensive, fully functional version of SAS that was developed primarily for students for learning purposes only. Anyone can buy it through SAS Publishing, Amazon.com, or other retailers. It has a preset expiration date of 12/31/08. With it, you can use the SAS Enterprise Guide 4.1 point-and-click interface, or write and modify SAS code using the SAS Program Editor. You will be able to run any program in this book using the SAS Learning Edition; it is an ideal way to learn SAS.
1.2 Getting Data into SAS

SAS can read data from almost any source. Common sources of data are raw text files, Microsoft Office Excel spreadsheets, Access databases, and most of the common database systems such as DB2 and Oracle. Most of this book uses either text files or Excel spreadsheets as data sources.
1.3 A Sample SAS Program

Let’s start out with a simple SAS program that reads data from a text file and produces some basic reports to give you an overview of the structure of SAS programs.
¹ See Ron Cody and Jeffrey K. Smith, Applied Statistics and the SAS Programming Language, 5th ed. (Englewood Cliffs, NJ: Prentice Hall, 2005), which is available from SAS Press, for details on using SAS for statistical analysis.
For this example, we have a text file with data on vegetable seeds. Each line of the file contains the following pieces of information (separated by spaces):

Vegetable name
Product code
Days to germination
Number of seeds
Price

In SAS terminology, each piece of information is called a variable. (Other database systems, and sometimes SAS, use the term column.) A few sample lines from the file are shown here:

File c:\books\learning\veggies.txt
Cucumber 50104-A 55 30   195
Cucumber 51789-A 56 30   225
Carrot   50179-A 68 1500 395
Carrot   50872-A 65 1500 225
Corn     57224-A 75 200  295
Corn     62471-A 80 200  395
Corn     57828-A 66 200  295
Eggplant 52233-A 70 30   225
In this example, each line of data produces what SAS calls an observation (also referred to as a row in other systems). A complete SAS program to read this data file and produce a list of the data, a frequency count showing the number of entries for each vegetable, the average price per seed, and the average number of days until germination is shown here:
Program 1-1 A sample SAS program
*SAS Program to read veggie data file and to produce
 several reports;

options nocenter nonumber;

data veg;
   infile "c:\books\learning\veggies.txt";
   input Name $ Code $ Days Number Price;
   CostPerSeed = Price / Number;
run;

title "List of the Raw Data";
proc print data=veg;
run;

title "Frequency Distribution of Vegetable Names";
proc freq data=veg;
   tables Name;
run;

title "Average Cost of Seeds";
proc means data=veg;
   var Price Days;
run;
At this point in the book, we won’t explain every line of the program—we’ll just give an overview. SAS programs often contain DATA steps and PROC steps. DATA steps are parts of the program where you can read or write the data, manipulate the data, and perform calculations. PROC (short for procedure) steps are parts of your program where you ask SAS to run one or more of its procedures to produce reports, summarize the data, generate graphs, and much more. DATA steps begin with the word DATA and PROC steps begin with the word PROC. Most DATA and PROC steps end with a RUN statement (more on this later). SAS processes each DATA or PROC step completely and then goes on to the next step. SAS also contains global statements that affect the entire SAS environment and remain in effect from one DATA or PROC step to another. In the program above, the OPTIONS and TITLE statements are examples of global statements. It is important to keep in mind that the actions of global statements remain in effect until they are changed by another global statement or until you end your SAS session. All SAS programs, whether part of DATA or PROC steps, are made up of statements. Here is the rule: all SAS statements end with semicolons. This is an important rule because if you leave out a semicolon where one is needed, the program may not run correctly, resulting in hard-to-interpret error messages.
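The persistence of global statements can be seen in a small sketch. It is not one of the book's numbered programs: it reads three of the sample lines as in-stream data with a DATALINES statement (an assumption made here so the example runs without the external file), and the last title text is made up for illustration.

```sas
*Sketch showing that a TITLE statement stays in effect across steps;
options nocenter nonumber;

data veg;
   input Name $ Code $ Days Number Price;
   CostPerSeed = Price / Number;
datalines;
Cucumber 50104-A 55 30 195
Carrot 50179-A 68 1500 395
Corn 57224-A 75 200 295
;

title "List of the Raw Data";
proc print data=veg;
run;

*No new TITLE statement here, so this report keeps the previous title;
proc freq data=veg;
   tables Name;
run;

*A new TITLE statement replaces the old one from this point on;
title "Average Price and Days to Germination";
proc means data=veg;
   var Price Days;
run;
```

The PROC FREQ report should carry the title "List of the Raw Data" because nothing replaced it before that step ran, which is why placing a TITLE statement before every procedure is a sensible habit.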
Let’s discuss some of the basic rules of SAS statements. First, they can begin in any column and can span several lines, if necessary. Because a semicolon determines the end of a SAS statement, you can place more than one statement on a single line (although this is not recommended as a matter of style). To help make this clear, let’s look at some of the statements in Program 1-1. You could write the DATA step as shown in Program 1-2. Although this program is identical to the original, notice that it doesn’t look organized, making it hard to read. Notice, too, that spacing is not critical either, though it is useful for legibility. It is a common practice to start each SAS statement on a new line and to indent each statement within a DATA or PROC step by several spaces (this author likes three spaces).
Program 1-2 An alternative version of Program 1-1
data veg; infile "c:\books\learning\veggies.txt"; input
Name $ Code $ Days Number Price; CostPerSeed = Price / Number; run;
Another thing to notice about this program is that SAS is not case sensitive. Well, this is almost true. Of course, references to external files must match the rules of your particular operating system. So, if you are running SAS under UNIX or Linux, file names will be case sensitive. As you will see later, you get to name the variables in a SAS data set. The variable names in Program 1-1 are Name, Code, Days, Number, Price, and CostPerSeed. Although SAS doesn’t care whether you write these names in uppercase, lowercase, or mixed case, it does “remember” the case of each variable the first time it encounters that variable and uses that form of the variable name when producing printed reports.
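A short sketch can make the case rules concrete. The data set name, the two variables, and the in-stream data lines are chosen here for illustration only; the keywords are deliberately written in mixed case.

```sas
*Keywords, data set names, and variable names may be typed in any case;
DATA Veg;
   INPUT Name $ Price;
DATALINES;
Carrot 395
Corn 295
;

title "Names Are Not Case Sensitive";
proc print data=VEG;
   *NAME and price refer to the variables first written as Name and Price;
   var NAME price;
run;
```

Because the variables were first written as Name and Price in the INPUT statement, the listing should use those spellings as column headings, even though the VAR statement spells them differently.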
1.4 SAS Names

SAS names follow a simple naming rule: All SAS variable names and data set names can be no longer than 32 characters and must begin with a letter or the underscore ( _ ) character. The remaining characters in the name may be letters, digits, or the underscore character. Characters such as dashes and spaces are not allowed. Here are some valid and invalid SAS names.

Valid SAS Names
Parts
LastName
First_Name
Ques5
Cost_per_Pound
DATE
time
X12Y34Z56

Invalid SAS Names
8_is_enough        Begins with a number
Price per Pound    Contains blanks
Month-total        Contains an invalid character ( - )
Num%               Contains an invalid character (%)
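If you would like SAS itself to check candidate names against this rule, one possible sketch uses the NVALID function. This function is not discussed in this chapter, so treat its use here as an assumption about your SAS release; the data set name Name_Check and the variables Candidate and Valid are hypothetical.

```sas
*Sketch: testing candidate names with the NVALID function;
data name_check;
   length Candidate $ 20;
   *Formatted input reads all 20 columns, including embedded blanks;
   input Candidate $char20.;
   *With the V7 argument, NVALID applies the traditional naming rule
    and returns 1 for a valid name and 0 otherwise;
   Valid = nvalid(Candidate, 'v7');
datalines;
LastName
Cost_per_Pound
8_is_enough
Price per Pound
Month-total
Num%
;

title "Checking Candidate SAS Names";
proc print data=name_check;
run;
```

Comparing the Valid column with the lists above is a quick way to confirm your reading of the naming rule.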
1.5 SAS Data Sets and SAS Data Types

We will talk a lot about SAS data sets throughout this book. For now, you need to know that when SAS reads data from anywhere (for example, raw data, spreadsheets), it stores the data in its own special form called a SAS data set. Only SAS can read and write SAS data sets. If you opened a SAS data set with another program (Microsoft Word, for example), it would not be a pretty sight: it would consist of some recognizable characters and many funny-looking graphics characters. In other words, it would look like nonsense. Even if SAS is reading data from Oracle tables or DB2, it is actually converting the data into SAS data set format in the background.

The good news is that you don’t ever have to worry about how SAS is storing its data or the structure of a SAS data set. However, it is important to understand that SAS data sets contain two parts: a descriptor portion and a data portion. Not only does SAS store the actual data values for you, it stores information about these values (things like storage lengths, labels, and formats). We’ll discuss that more later.

SAS has only two types of variables: character and numeric. This makes it much simpler to use and understand than some other programs that have many more data types (for example, integer, long integer, and logical). SAS determines a fixed storage length for every variable. Most SAS users never need to think about storage lengths for numerical values; they are stored in 8 bytes (about 14 or 15 significant digits, depending on your operating system) if you don’t specify otherwise. The majority of SAS users will never have to change this default value (it can lead to complications and should only be considered by experienced SAS programmers). Each character value (data stored as letters, special characters, and numerals) is assigned a fixed storage length explicitly by program statements or by various rules that SAS has about the length of character values.
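To look at the two portions of a data set yourself, a sketch along the following lines may help. PROC CONTENTS and the LENGTH and LABEL statements are not covered in this chapter, so they are introduced here only as an illustration; the lengths and the label text are arbitrary choices, and the data lines are read in-stream rather than from the external file.

```sas
*Sketch: viewing the descriptor and data portions of a data set;
data veg;
   *Explicit storage lengths for the two character variables;
   length Name $ 10 Code $ 7;
   input Name $ Code $ Days Number Price;
   label Days = "Days to germination";
datalines;
Cucumber 50104-A 55 30 195
Eggplant 52233-A 70 30 225
;

title "Descriptor Portion of Data Set Veg";
proc contents data=veg;
run;

title "Data Portion of Data Set Veg";
proc print data=veg;
run;
```

The PROC CONTENTS report lists each variable with its type, length, and label. The numeric variables should show the default length of 8, while Name and Code show the lengths requested in the LENGTH statement.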
1.6 The SAS Display Manager and SAS Enterprise Guide

Because SAS runs on many different platforms (mainframes, microcomputers running various Microsoft operating systems, UNIX, and Linux), the way you write and run programs will vary. You might use a general-purpose text editor on a mainframe to write a SAS program, submit it, and send the output back to a terminal or to a file. On PCs, you might use the SAS Display Manager, where you write your program in the Enhanced Editor (Editor window), see any error messages and comments about your program and the data in the Log window, and view your output in the Output window. In addition to the Enhanced Editor, an older program, simply called the Program Editor, is available for Windows and UNIX users.

As an alternative to the Display Manager, you may enter the SAS environment using SAS Enterprise Guide, which is a front-end to SAS that allows you to use a menu-driven system to write SAS programs and produce reports.

There are many excellent books published by SAS that offer detailed instructions on how to run SAS programs on each specific platform and the appropriate access method into SAS. This book concentrates on how to write SAS programs. You will find that SAS programs, regardless of what computer or operating system you are using, look basically the same. Typically, the only changes you need to make to migrate a SAS program from one platform to another are the way you describe external data sources and where you store SAS programs and output.
1.7 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Identify which of the following variable names are valid SAS names:
Height
HeightInCentimeters
Height_in_centimeters
Wt-Kg
x123y456
76Trombones
MiXeDCasE

2. In the following list, classify each data set name as valid or invalid:
Clinic
clinic
work
hyphens-in-the-name
123GO
Demographics_2006

3. You have a data set consisting of Student ID, English, History, Math, and Science test scores on 10 students.
a. The number of variables is __________
b. The number of observations is __________

4. True or false:
a. You can place more than one SAS statement on a single line.
b. You can use several lines for a single SAS statement.
c. SAS has three data types: character, numeric, and integer.
d. OPTIONS and TITLE statements are considered global statements.

5. What is the default storage length for SAS numeric variables (in bytes)?
Chapter 2
Writing Your First SAS Program

2.1 A Simple Program to Read Raw Data and Produce a Report
2.2 Enhancing the Program
2.3 More on Comment Statements
2.4 How SAS Works (a Look Inside the “Black Box”)
2.5 Problems
2.1 A Simple Program to Read Raw Data and Produce a Report

Let’s start out with a simple program to read data from a text file and produce some basic summaries. Then we’ll go on to enhance the program.

The task: you have data values in a text file. These values represent Gender (M or F), Age, Height, and Weight. Each data value is separated from the next by one or more blanks. You want to produce two reports: one showing the frequencies for Gender (how many Ms and Fs); the other showing the average age, height, and weight for all the subjects. Here is a listing of the raw data file that you want to analyze:

File c:\books\learning\mydata.txt
M 50 68 155
F 23 60 101
M 65 72 220
F 35 65 133
M 15 71 166
Here is the program:
Program 2-1 Your first SAS program
data demographic;
   infile "c:\books\learning\mydata.txt";
   input Gender $ Age Height Weight;
run;

title "Gender Frequencies";
proc freq data=demographic;
   tables Gender;
run;

title "Summary Statistics";
proc means data=demographic;
   var Age Height Weight;
run;
Notice that this program consists of one DATA step followed by two PROC steps. As we mentioned in Chapter 1, the DATA step begins with the word DATA. In this program, the name of the SAS data set being created is Demographic. The next line (the INFILE statement) tells SAS where the data values are coming from. In this example, the text file mydata.txt is in the folder c:\books\learning on a Windows-based system.

The INPUT statement shown here is one of four different methods that SAS has for reading raw data. This program uses the list input method, appropriate for data values separated by delimiters. The default data delimiter for SAS is the blank. SAS can also read data separated by any other delimiter (for example, commas, tabs) with a minor change to the INFILE statement. When you use the list input method for reading data, you only need to list the names you want to give each data value. SAS calls these variable names. As mentioned in Chapter 1, these names must conform to the SAS naming convention.

Notice the dollar sign ($) following the variable name Gender. A dollar sign following a variable name tells SAS that values for that variable are character values. Without a dollar sign, SAS assumes values are numbers and should be stored as SAS numeric values.

Finally, the DATA step ends with a RUN statement. You will see later that, depending on the platform on which you are running your SAS program, RUN statements are not always necessary. In Program 2-1 we placed a blank line between each step to make the program easier to read. Feel free to include blank lines whenever you wish to make the program more readable.

There are several TITLE statements in this program. You will see this statement in many of the SAS programs in this book. As you may have guessed, the text following the keyword TITLE (placed in single or double quotes) is printed at the top of each page of SAS output. Statements such as the TITLE statement are called global statements. The term global refers to the fact that the operations these statements perform are not tied to one single DATA or PROC step. They affect the entire SAS environment. In addition, the operations performed by these global statements remain in effect until they are changed. For example, if you have a single TITLE statement in the beginning of your program, that title will head every page of output from that point on until you write a new TITLE statement. It is a good practice to place a TITLE statement before every procedure that produces output to make it easy for someone to read and understand the information on the page. If you exit your SAS session, your titles are all reset and you need to submit new TITLE statements if you want them to appear.

The FREQ procedure (also called PROC FREQ) is one of the many built-in SAS procedures. As the name implies, this procedure counts frequencies of data values.
To tell this procedure which variables to count frequencies on, you add an additional statement—the TABLES (or TABLE) statement. Following the word TABLES, you list those variables for which you want frequency counts. You could actually omit this statement but, if you did, PROC FREQ would compute frequencies for every variable in your data set.
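Two of the points made so far can be tried in one small variation of Program 2-1. The sketch below is not from the book: it assumes the same five subjects stored as comma-separated values, reads them in-stream with a DATALINES statement, and uses the DSD option on the INFILE statement as one possible form of the "minor change" needed for a different delimiter. It also leaves out the TABLES statement on purpose.

```sas
*Sketch: comma-separated input, and PROC FREQ with no TABLES statement;
data demographic;
   *DSD makes the comma the delimiter for list input;
   infile datalines dsd;
   input Gender $ Age Height Weight;
datalines;
M,50,68,155
F,23,60,101
M,65,72,220
F,35,65,133
M,15,71,166
;

title "Frequencies for Every Variable";
proc freq data=demographic;
run;
```

With no TABLES statement, the output should contain a separate frequency table for each of Gender, Age, Height, and Weight, which is rarely what you want for continuous values and is a good reason to name your variables explicitly.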
PROC MEANS is another built-in SAS procedure that computes means (averages) as well as some other statistics such as the minimum and maximum value of each variable. A VAR (short for variables) statement supplies PROC MEANS with a list of analysis variables (which must be numeric) for which you want to compute these statistics. Without a VAR statement, PROC MEANS computes statistics on every numeric variable in your data set.

Depending on whether you are working in an interactive SAS session under Windows or UNIX or if you are on a mainframe, the actual mechanics of submitting your program may differ slightly. For this example, and most of the examples in this book, we will assume you are running SAS in a windowing environment. Given that, you can submit your program by using the menu system (Run → Submit), by pressing the appropriate function key (F3 in Windows), or by clicking the Submit icon (picture of a running person). Here is what your screen would look like after you have typed this program into the editor:
Now, after you submit your program, SAS does its magic and you see the following:
What you see are the Output and Log windows. (The exact appearance of these windows will vary, depending on how you have set up SAS.) The Output window (the top one) shows part of the output. To see it all, you can click on this window to make it active (alternatives: use a function key or select an item from the View menu on the Menu bar) and then scroll up or down. You can also click the Print icon to send this output to your printer. The output from this program is shown next:
Gender Frequencies

The FREQ Procedure

                                   Cumulative    Cumulative
Gender    Frequency     Percent     Frequency       Percent
-----------------------------------------------------------
F                 2       40.00             2         40.00
M                 3       60.00             5        100.00

Summary Statistics

The MEANS Procedure

Variable    N            Mean         Std Dev         Minimum         Maximum
-----------------------------------------------------------------------------
Age         5      37.6000000      20.2188031      15.0000000      65.0000000
Height      5      67.2000000       4.8682646      60.0000000      72.0000000
Weight      5     155.0000000      44.0056815     101.0000000     220.0000000
-----------------------------------------------------------------------------
The first part of the output shows the frequency counts of Gender produced by PROC FREQ. You see that there are two females and three males. Next (actually on another page) are the summary statistics produced by PROC MEANS. Here you see the mean (average) along with some other statistics for the three variables Age, Height, and Weight. Notice the two titles correspond to the text you placed on the TITLE statement. Note: For most of the output in this book, a system option called NOCENTER was used so that the output is left-justified. By default, SAS centers all output. The Log window is very important. It is here that you see any error messages if you have made any mistakes in writing your program. In this example, there were no mistakes (a rarity for this author), so you see only the original program along with some information about the data file that was read and some timing information from each of the two procedures that were run. A complete listing of the Log window follows:
1    data demographic;
2       infile "c:\books\learning\mydata.txt";
3       input Gender $ Age Height Weight;
4    run;

NOTE: The infile "c:\books\learning\mydata.txt" is:
      File Name=c:\books\learning\mydata.txt,
      RECFM=V,LRECL=256

NOTE: 5 records were read from the infile "c:\books\learning\mydata.txt".
      The minimum record length was 11.
      The maximum record length was 13.
NOTE: The data set WORK.DEMOGRAPHIC has 5 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

5
6    title "Gender Frequencies";
7    proc freq data=demographic;
8       tables Gender;
9    run;

NOTE: There were 5 observations read from the data set WORK.DEMOGRAPHIC.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           0.01 seconds
      cpu time            0.02 seconds

10
11   title "Summary Statistics";
12   proc means data=demographic;
13      var Age Height Weight;
14   run;

NOTE: There were 5 observations read from the data set WORK.DEMOGRAPHIC.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
Let’s spend a moment looking over the log. First, you see that the data came from the mydata.txt file located in the c:\books\learning folder. Next, you see a note showing that five records (lines) of data were read and that the shortest line was 11 characters long and the longest was 13. The next note indicates that SAS created a data set called Work.Demographic. The Demographic part makes sense because that is the name you used on the DATA statement. The Work part is the way SAS tells you that this is a temporary data set—when you end the SAS session, this data set will self-destruct (and the secretary will disavow all knowledge of your actions). You see later how to make SAS data sets permanent. Also, as part of this note, you see that the Work.Demographic data set has five observations and four variables. The SAS term observations is analogous to rows in a table. The SAS term variables is analogous to columns in a table. In this example, each observation corresponds to the data collected on each subject and each variable corresponds to each item of information you collected on each subject. The remaining notes show the real and CPU time used by SAS to process each procedure.
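The two-level name in the log can be tried out directly. The following is a minimal sketch, assuming the DATA step above has already been run in the current SAS session and that Work is still the default library (the usual setup). Under those assumptions the one-level name Demographic and the two-level name Work.Demographic refer to the same temporary data set, so both steps should list the same five observations. The titles are made up for this illustration.

```sas
*One-level name: SAS supplies the Work library by default;
title "Using the one-level name";
proc print data=demographic;
run;

*Two-level name: the library is spelled out;
title "Using the two-level name";
proc print data=work.demographic;
run;
```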
2.2 Enhancing the Program

At this point, it would be a good idea to access SAS somewhere, enter this program (you will probably want to change the name of the folder where you are storing your data file), and submit it. Now, let’s enhance the program so you can learn some more about how SAS works. For this version of the program, you will add a comment statement and compute a new variable based on the height and weight data. Here is the program:
Program 2-2 Enhancing the program

*Program name: demog.sas stored in the c:\books\learning folder.
 Purpose: The program reads in data on height and weight
 (in inches and pounds, respectively) and computes a body
 mass index (BMI) for each subject.
 Programmer: Ron Cody
 Date Written: May 2006;

data demographic;
   infile "c:\books\learning\mydata.txt";
   input Gender $ Age Height Weight;
   *Compute a body mass index (BMI);
   BMI = (Weight / 2.2) / (Height*.0254)**2;
run;
The statement beginning with an asterisk (*) is called a comment statement. It enables you to include comments for yourself or others reading your program later. One way of writing a SAS comment is to start with an asterisk, write as many comment lines as you like, and end the statement (as you do all SAS statements) with a semicolon. Comments are not only useful for others trying to read and understand your program; they are useful to you as well. Just imagine trying to understand a section of a long program that you wrote a year ago and now need to correct or modify. Trust me, you will be glad you commented your program. You should usually include information about the file name used to store the program, the purpose of the program, and the date you wrote the program as well as the date and purpose of any changes you made to the program.

The statement that starts with BMI= is called an assignment statement. It is an instruction to perform the computation on the right-hand side of the equal sign and assign the resulting value to the variable named on the left. In this example, you are creating a new variable named BMI that is defined as a person’s weight (in kilograms) divided by the square of the person’s height (in meters). Because the raw data are in pounds and inches, the expression divides Weight by 2.2 to convert pounds to kilograms and multiplies Height by .0254 to convert inches to meters. BMI (body mass index) is a useful index of obesity. Medical researchers often use BMI when computing the health risks of various diseases (such as heart attacks).

This assignment statement uses three of the basic arithmetic operators used by SAS: the forward slash (/) for division, the asterisk (*) for multiplication, and the double asterisk (**) for exponentiation. This is a good time to mention the full set of arithmetic operators. They are as follows:

Operator   Description      Priority
+          Addition         Lowest
-          Subtraction      Lowest
*          Multiplication   Next Highest
/          Division         Next Highest
**         Exponentiation   Highest
-          Negation         Highest
The same rules you learned about the order of algebraic operations in school apply to SAS arithmetic operators. That is, multiplication and division occur before addition and subtraction. In the previous table, the two highest priority operations occur before all others; the next highest operations occur before the lowest. For example, the value of x in the following assignment statement is 14:

x = 2 + 3 * 4;

If you want to multiply the sum of 2 + 3 by 4, you need to use parentheses like this:

x = (2 + 3) * 4;

When you include parentheses in your expression, all operations within the parentheses are performed first. In this example, because parentheses surround the addition operation, the 2 and 3 are added together first and then multiplied by 4, yielding a value of 20. As a further example of how the priority of arithmetic operators works, take a look at the expression here that uses each of the different operators:

x = 2**3 + 4 * -5;

Because exponentiation and negation occur first, you have the following equation:

x = 8 + 4 * -5;

This gives you:

8 + (-20) = -12
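The three expressions above can be checked in a short DATA step. This is only a sketch: the variable names x1, x2, and x3 are made up for the illustration, and DATA _NULL_ is used so that the step runs without creating a data set. The PUT statement writes the values to the Log window.

```sas
data _null_;
   x1 = 2 + 3 * 4;       *multiplication first, then addition;
   x2 = (2 + 3) * 4;     *parentheses first, then multiplication;
   x3 = 2**3 + 4 * -5;   *exponentiation and negation first;
   put x1= x2= x3=;      *writes x1=14 x2=20 x3=-12 to the log;
run;
```

When in doubt about the order of evaluation, adding parentheses, as in the second statement, makes the intent explicit.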
2.3 More on Comment Statements

Another way to add a comment to a SAS program is to start it with a slash star (/*) and end it with a star slash (*/). You may even embed comments of this type within a SAS statement. For example, you could write:

input Gender $ Age /* age is in years */ Height Weight;
If you are using a mainframe computer, you may want to avoid starting your /* in column one because the operating system will interpret it as job control language (JCL) and terminate your SAS job.
You can get fancy, if you want, and make your comments even more elaborate, as shown in Program 2-3.
Program 2-3 Example of a fancy comment using the asterisk style

*---------------------------------------------------------*
| Program Name: DEMOG.SAS stored in the c:\books\learning |
| folder                                                  |
| Purpose: The program reads in data on height and weight |
| (in inches and pounds, respectively) and computes a body|
| mass index (BMI) for each subject.                      |
*---------------------------------------------------------*;
Because this statement begins with an asterisk and ends with a semicolon, it represents a comment. It doesn’t matter that there are asterisks within the comment itself. You can also make a fancy comment using the /* */ style. For example, Program 2-4 represents a single comment.
Program 2-4 Example of a fancy comment using the /* */ style

/*********************************************************
Program name: demog.sas stored in the c:\books\learning
folder.
Purpose: The program reads in data on height and weight
(in inches and pounds, respectively) and computes a body
mass index (BMI) for each subject.
**********************************************************/

or

/*********************************************************\
| This is another way to make a fancy-looking comment     |
| using the slash star - star slash form.                 |
\*********************************************************/
Be sure that you do not nest the /* */ style comments. For example, you would get an error if you wrote Program 2-5.
Program 2-5 Incorrect nesting of /* */ style comments

/* This comment contains a /* style */ comment embedded
within another comment. Notice that the first star slash
ends the comment and the remaining portion of the
comment will cause a syntax error */
2.4 How SAS Works (a Look Inside the “Black Box”)

This is a good time to explain some of the inner workings of SAS as it processes a DATA step. Looking again at Program 2-2, let’s “play computer.” SAS processes DATA steps in two stages: a compile stage and an execution stage. Here’s how it works.

SAS recognizes the keyword DATA and understands that it needs to process a DATA step. In the compile stage, it does some important housekeeping tasks. First, it prepares an area to store the SAS data set (Demographic). It checks the input file (described by the INFILE statement) and determines various attributes of this file (such as the length of each record). Next, it sets aside a place in memory called the input buffer, where it will place each record (line) of data as it is read from the input file. It then reads each line of the program, checks for invalid syntax, and determines the names of all the variables that are in the data set. Depending on your INPUT statement (or other SAS statements), SAS determines whether each variable is character or numeric and the storage length of each variable. This information is called the descriptor portion of the data set. In this compile stage, no data is read from the input file and no logical statements are evaluated. Each line is processed in order from the top to the bottom.

In this example, SAS sees the first four variables listed in the INPUT statement, decides that Gender is character (because of the dollar sign ($) following the name), and sets the storage length of each of these variables. Because no lengths are specified by the program, each variable is given a default length (8 bytes for the character and numeric variables). Eight bytes for a character variable means you can store values with up to eight characters. Eight bytes for numeric variables means that SAS can store numbers with approximately 14 or 15 significant figures (depending on the operating system). It is important to realize that the 8 bytes used to store numeric values does not limit you to numbers with eight digits.

The information about each of the variables is stored in a reserved area of memory called the Program Data Vector (PDV for short). Think of the PDV as a set of post office boxes, with one box per variable, with information affixed to each box showing the variable name, type (character or numeric), and storage length. Some additional pieces of information are also stored for each variable. We’ll discuss these later when we discuss more advanced programming techniques.
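One way to look at the descriptor portion yourself is PROC CONTENTS, a standard SAS procedure that this chapter has not introduced. The sketch below assumes that Program 2-2 has already been run, so that the temporary data set Demographic exists; the title text is made up for the illustration. The listing should show each variable's name, its type (Char or Num), and its storage length (8 for every variable in this example).

```sas
*List the descriptor portion of the data set created by Program 2-2;
title "Descriptor portion of DEMOGRAPHIC";
proc contents data=demographic;
run;
```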
It helps to picture the PDV like this:

Name:      Gender       Age        Height     Weight
Type:      Character    Numeric    Numeric    Numeric
Length:    8 bytes      8 bytes    8 bytes    8 bytes
Value:

This shows that each variable has a name, a type, and a storage length. The Value row is used to store the value for each of these variables. Next, SAS sees the assignment statement defining a new variable called BMI. Because BMI is defined by an arithmetic operation, SAS decides that this variable is numeric, uses the default storage length for numerics (8 bytes), and adds it to the PDV.

Name:      Gender       Age        Height     Weight     BMI
Type:      Character    Numeric    Numeric    Numeric    Numeric
Length:    8 bytes      8 bytes    8 bytes    8 bytes    8 bytes
Value:
SAS has reached the bottom of the DATA step and the compile stage is complete. Now it begins the execution stage. The first step is to set all the values in the PDV to a missing value. This happens before SAS reads in new data to ensure that there is a clean slate and that no values are left over from a previous operation. SAS uses blanks to represent missing character values and periods to represent missing numeric values. Therefore, you can now picture the PDV like this:

Name:      Gender       Age        Height     Weight     BMI
Type:      Character    Numeric    Numeric    Numeric    Numeric
Length:    8 bytes      8 bytes    8 bytes    8 bytes    8 bytes
Value:                  .          .          .          .
The first line of data from the input file is copied to the input buffer.

M 50 68 155
An internal pointer that keeps track of the current record in the input file now moves to the next line. In this example, the values in the text file are separated by one or more blanks. This arrangement of data values is called delimited data and the method that SAS uses to read this type of data is called list input. SAS expects blanks as the default delimiter but, as you will see later, you can tell SAS if your file contains other delimiters (such as commas) between the data values.
SAS reads each number until it reaches a delimiter (blank) and then moves along until it finds the next number. The values in the input buffer are now copied to the PDV as follows:

Name:      Gender       Age        Height     Weight     BMI
Type:      Character    Numeric    Numeric    Numeric    Numeric
Length:    8 bytes      8 bytes    8 bytes    8 bytes    8 bytes
Value:     M            50         68         155        .
Next, BMI is evaluated by substituting the values in the PDV for Height and Weight and evaluating the equation. This value is then added to the PDV:

Name:      Gender       Age        Height     Weight     BMI
Type:      Character    Numeric    Numeric    Numeric    Numeric
Length:    8 bytes      8 bytes    8 bytes    8 bytes    8 bytes
Value:     M            50         68         155        23.616947202
SAS has reached the bottom of the DATA step (because it sees the RUN statement, an explicit step boundary). Note that SAS would sense the end of the DATA step without a RUN statement if the next line were a DATA or PROC statement (an implicit step boundary). As a matter of style, it is preferable to end each DATA or PROC step with a RUN statement.

At this point the values in the PDV are written to the SAS data set (Demographic), forming the first observation. There is, by default, an implied OUTPUT statement at the bottom of each DATA step. SAS returns to the top of the DATA step (the line following the DATA statement) and sees that there are more lines of data to read (when it executes the INPUT statement). It repeats the process of setting values in the PDV to missing, reading new data values, computing the BMI, and outputting observations to the SAS data set. This continues until the INPUT statement reads the end-of-file marker. You can think of a DATA step as a loop that continues until all data values have been read.

At this time, you may find this discussion somewhat tedious. However, as you learn more advanced programming techniques, you should review this discussion; it can really help you understand the more advanced and subtle features of SAS programming.
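To watch this loop in action, you can ask SAS to write the contents of the PDV to the log on every pass. The following is a sketch rather than part of the original program: it uses the DATALINES statement (covered in the next chapter) so that it runs without an external file, only the first three data lines are included, and the label text in the PUT statements is made up. PUT _ALL_ writes every variable in the PDV, including the automatic variables _N_ (the iteration counter) and _ERROR_.

```sas
data demographic;
   *Top of the loop: the values have just been set to missing;
   put "Before INPUT: " _all_;
   input Gender $ Age Height Weight;
   BMI = (Weight / 2.2) / (Height*.0254)**2;
   *Bottom of the loop: these values are about to be written out;
   put "After BMI:    " _all_;
   output;   *explicit form of the implied OUTPUT statement;
datalines;
M 50 68 155
F 23 60 101
M 65 72 220
;
```

In the log, the "Before INPUT" line should appear once more than the "After BMI" line, because the final pass ends when the INPUT statement finds no more data.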
2.5 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. You have a text file called stocks.txt containing a stock symbol, a price, and the number of shares. Here are some sample lines of data:

   File stocks.txt
   AMGN 67.66 100
   DELL 24.60 200
   GE 34.50 100
   HPQ 32.32 120
   IBM 82.25 50
   MOT 30.24 100

   a. Using this raw data file, create a temporary SAS data set (Portfolio). Choose your own variable names for the stock symbol, price, and number of shares. In addition, create a new variable (call it Value) equal to the stock price times the number of shares. Include a comment in your program describing the purpose of the program, your name, and the date the program was written.
   b. Write the appropriate statements to compute the average price and the average number of shares of your stocks.

2. Given the program here, add the necessary statements to compute four new variables:

   a. Weight in kilograms (1 kg = 2.2 pounds). Name this variable WtKg.
   b. Height in centimeters (1 inch = 2.54 cm). Name this variable HtCm.
   c. Average blood pressure (call it AveBP) equal to the diastolic blood pressure plus one-third the difference of the systolic blood pressure minus the diastolic blood pressure.
   d. A variable (call it HtPolynomial) equal to 2 times the height squared plus 1.5 times the height cubed.
   Here is the program for you to modify:

   data prob2;
      input ID $
            Height /* in inches */
            Weight /* in pounds */
            SBP    /* systolic BP */
            DBP    /* diastolic BP */;
   < place your statements here >
   datalines;
   001 68 150 110 70
   002 73 240 150 90
   003 62 101 120 80
   ;
   title "Listing of PROB2";
   proc print data=prob2;
   run;
   Note: This program uses a DATALINES statement, which enables you to include the input data directly in the program. You can read more about this statement in the next chapter.

3. You are given an equation to predict electromagnetic field (EMF) strength, as follows:

   $\mathrm{EMF} = 1.45 \times V + (R/E) \times V^{3} - 125$

   If your SAS data set contains variables called V, R, and E, write a SAS assignment statement to compute the EMF strength.

4. What is wrong with this program?

   001 data new-data;
   002    infile prob4data.txt;
   003    input x1 x2
   004    y1 = 3(x1) + 2(x2);
   005    y2 = x1 / x2;
   006    new_variable_from_x1_and_x2 = x1 + x2 - 37;
   007 run;

   Note: Line numbers are for reference only; they are not part of the program.
Part 2
DATA Step Processing

Chapter 3   Reading Raw Data from External Files
Chapter 4   Creating Permanent SAS Data Sets
Chapter 5   Creating Formats and Labels
Chapter 6   Reading and Writing Data from an Excel Spreadsheet
Chapter 7   Performing Conditional Processing
Chapter 8   Performing Iterative Processing: Looping
Chapter 9   Working with Dates
Chapter 10  Subsetting and Combining SAS Data Sets
Chapter 11  Working with Numeric Functions
Chapter 12  Working with Character Functions
Chapter 13  Working with Arrays
Chapter 3
Reading Raw Data from External Files

3.1  Introduction
3.2  Reading Data Values Separated by Blanks
3.3  Specifying Missing Values with List Input
3.4  Reading Data Values Separated by Commas (CSV Files)
3.5  Using an Alternative Method to Specify an External File
3.6  Reading Data Values Separated by Delimiters Other Than Blanks or Commas
3.7  Placing Data Lines Directly in Your Program (the DATALINES Statement)
3.8  Specifying INFILE Options with the DATALINES Statement
3.9  Reading Raw Data from Fixed Columns—Method 1: Column Input
3.10 Reading Raw Data from Fixed Columns—Method 2: Formatted Input
3.11 Using a FORMAT Statement in a DATA Step versus in a Procedure
3.12 Using Informats with List Input
3.13 Supplying an INFORMAT Statement with List Input
3.14 Using List Input with Embedded Delimiters
3.15 Problems
3.1 Introduction

One way to provide SAS with data is to have SAS read the data from a text file and create a SAS data set. Some SAS users already have data in SAS data sets. If this is your case, you can skip this chapter! SAS has different ways of reading data from text files and, depending on how the data values are arranged, you can choose an input method that is most convenient. You have already seen one method, called list input, that was used in the introductory program in Chapter 2. This chapter discusses list input as well as two other methods that are appropriate for data arranged in fixed columns. Some of the more advanced aspects of reading raw data are covered in Chapter 21.
3.2 Reading Data Values Separated by Blanks

One of the easiest methods of reading data is called list input. By default, SAS assumes that data values are separated by one or more blanks. Task: you have a raw data file called mydata.txt stored in your c:\books\learning folder. It is shown here:

File c:\books\learning\mydata.txt
M 50 68 155
F 23 60 101
M 65 72 220
F 35 65 133
M 15 71 166

These values represent gender, age, height (in inches), and weight (in pounds). Notice that this file meets the criteria for list input: each data value is separated from the next by one or more blanks. Program 3-1 reads data from this file and creates a SAS data set.
Program 3-1 Demonstrating list input with blanks as delimiters

data demographics;
   infile 'c:\books\learning\mydata.txt';
   input Gender $ Age Height Weight;
run;
The INFILE statement tells SAS where to find the data. The INPUT statement contains the variable names you want to associate with each data value. The order of these names matches the order of the values in the file. The dollar sign ($) following Gender tells SAS that Gender is a character variable. To see that this program works properly, let's add a PRINT procedure (PROC PRINT) step to list the observations in the SAS data set (details on PROC PRINT can be found in Chapter 14).
Program 3-2 Adding PROC PRINT to list the observations in the data set

title "Listing of data set DEMOGRAPHICS";
proc print data=demographics;
run;
Here is the output from PROC PRINT:

Listing of data set DEMOGRAPHICS

Obs    Gender    Age    Height    Weight
  1      M        50      68        155
  2      F        23      60        101
  3      M        65      72        220
  4      F        35      65        133
  5      M        15      71        166
Each column represents a variable in the data set and each row represents the data on a single person (an observation). The first column, labeled Obs (short for observation), is generated by PROC PRINT. The values in this column go from 1 to the number of observations in the data set. The order of rows in this list reflects the order that the observations were created in the DATA step. If you change the order of the observations or add new observations to the data set, the numbers in the Obs column may change. The order of the variables (columns) reflects the order that the variables were encountered in the DATA step.
3.3 Specifying Missing Values with List Input

What would happen if you didn't have a value for Age for the second subject in your file? Your data file would look like this:

File c:\books\learning\mydata.txt (with a missing value in line 2)
M 50 68 155
F 60 101
M 65 72 220
F 35 65 133
M 15 71 166
It should be obvious that this will cause errors. SAS reads the value 60 for the Age and 101 for the Height. Because there are no more values on the second line of data, SAS goes to the next line and attempts to read the M as a Weight value (and causes a data error message in the log). Clearly, you need a way to tell SAS that there is a missing value for Age in the second line. One way to do this is to use a period to represent the missing value, like this:

File c:\books\learning\mydata.txt (using a period to represent a missing value)
M 50 68 155
F . 60 101
M 65 72 220
F 35 65 133
M 15 71 166
You must separate the period from the values around it by one or more spaces because a space is the default delimiter character. SAS now assigns a missing value for Age for the second subject. By the way, a missing value is not the same as a 0. This is important because if you asked SAS to compute the mean (average) Age for all the subjects, it would average only the non-missing values. You can use a period to represent a missing character or numeric value when you use list input.
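The difference between a missing value and a 0 can be seen in a small test. The sketch below is an illustration, not one of the book's numbered programs: it places the data lines in the program with a DATALINES statement (described in Section 3.7) instead of reading mydata.txt, and it requests the N, NMISS, and MEAN statistics from PROC MEANS. With the period in the second line, PROC MEANS should report N = 4, NMISS = 1, and a mean Age of 41.25, which is (50 + 65 + 35 + 15) / 4.

```sas
data demographics;
   input Gender $ Age Height Weight;
datalines;
M 50 68 155
F . 60 101
M 65 72 220
F 35 65 133
M 15 71 166
;

*Only the four non-missing ages are averaged;
title "Mean Age with one missing value";
proc means data=demographics n nmiss mean;
   var Age;
run;
```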
3.4 Reading Data Values Separated by Commas (CSV Files)

A common way to store data on Windows and UNIX platforms is in comma-separated values (CSV) files. These files use commas instead of blanks as data delimiters. They may or may not enclose character values in quotes. The file mydata.csv contains the same values as the file mydata.txt. It is shown here.

File c:\books\learning\mydata.csv
"M",50,68,155
"F",23,60,101
"M",65,72,220
"F",35,65,133
"M",15,71,166
Program 3-3 reads this file and creates a SAS data set. We will use the same name for the SAS data set as before (Demographics).
Program 3-3 Reading data from a comma-separated values (CSV) file

data demographics;
   infile 'c:\books\learning\mydata.csv' dsd;
   input Gender $ Age Height Weight;
run;
Notice the INFILE statement in this example. The DSD (delimiter-sensitive data) following the file name is an INFILE option. It performs several functions. First, it changes the default delimiter from a blank to a comma. Next, if there are two delimiters in a row, it assumes there is a missing value between. Finally, if character values are placed in quotes (single or double quotes), the quotes are stripped from the value. That’s a lot of mileage for just three letters! The INPUT statement is identical to Program 3-1 as is the resulting SAS data set.
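The handling of two delimiters in a row is easy to try. The following sketch is an assumption-laden variation on Program 3-3: the data lines are placed in the program and read through the reserved DATALINES file reference (see Sections 3.7 and 3.8), and the second line has been altered so that the Age value is absent. Under the DSD option, the two adjacent commas should produce a missing Age for the second observation, and the quotes around "M" and "F" should not appear in the listing.

```sas
data demographics;
   infile datalines dsd;
   input Gender $ Age Height Weight;
datalines;
"M",50,68,155
"F",,60,101
M,65,72,220
;

title "Listing of data set DEMOGRAPHICS";
proc print data=demographics;
run;
```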
3.5 Using an Alternative Method to Specify an External File

The INFILE statement in Program 3-3 used the actual file name (placed in quotes) to specify your raw data file. An alternative method is to use a separate FILENAME statement to identify the file and to use this reference (called a fileref) in your INFILE statement instead of the actual file name. Program 3-4 is identical to Program 3-3 except for the way it references the external file.
Program 3-4 Using a FILENAME statement to identify an external file

filename preston 'c:\books\learning\mydata.csv';

data demographics;
   infile preston dsd;
   input Gender $ Age Height Weight;
run;
The name following the FILENAME statement (Preston, in this example) is an alias for the actual file name. For certain operating environments, the fileref can be created outside of SAS (for example, in a DD statement in JCL on a mainframe). Notice also that the fileref (Preston) in the INFILE statement is not placed in quotes. This is how SAS knows that Preston is not the name of a file but rather a reference to it.
3.6 Reading Data Values Separated by Delimiters Other Than Blanks or Commas

Remember that the default data delimiter for list input is a blank. Using the INFILE option DSD changes the default to a comma. What if you have a file with other delimiters, such as tabs or colons? No problem! You only need to add the DLM= option to the INFILE statement. For example, the following lines of data use colons as delimiters.

Example of a file using colon delimiters:
M:50:68:155
F:23:60:101
M:65:72:220
F:35:65:133
M:15:71:166

To read this file, you could use this INFILE statement:

infile 'file-description' dlm=':';

You can spell out the name of the DELIMITER= option instead of using the abbreviation DLM= if you like, for example:

infile 'file-description' delimiter=':';

You can use the DSD and DLM= options together. This combination of options performs all the actions requested by the DSD option (see Section 3.4) but overrides the default delimiter (comma) with a delimiter of your choice.

infile 'file-description' dsd dlm=':';

Tabs present a particularly interesting problem. What character do you place between the quotes on the DLM= option? You cannot simply press the TAB key. Instead, you need to represent the tab by its hexadecimal equivalent. For ASCII files (the coding method used on Windows platforms and UNIX operating systems; it stands for American Standard Code for Information Interchange), you would use the following:

infile 'file-description' dlm='09'x;

For EBCDIC files (used on most mainframe computers; it stands for Extended Binary Coded Decimal Interchange Code), this would be the following statement:

infile 'file-description' dlm='05'x;
Note: These two values are called hexadecimal constants. If you know (or look up) the hexadecimal value of any character, you can represent it in a SAS statement by placing the hexadecimal value in single or double quotes and following the value immediately (no space) by an upper- or lowercase x.
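As a quick check of the DSD and DLM= combination, here is a sketch that reads colon-delimited lines placed directly in the program. It assumes the reserved DATALINES file reference described in Section 3.8, and the second data line has been changed (two colons in a row) to show the missing-value behavior that DSD adds. With these options, Age for the second observation should be missing.

```sas
data demographics;
   *DSD supplies the missing-value handling, DLM= sets the delimiter;
   infile datalines dsd dlm=':';
   input Gender $ Age Height Weight;
datalines;
M:50:68:155
F::60:101
M:65:72:220
;

title "Listing of data set DEMOGRAPHICS";
proc print data=demographics;
run;
```

For a tab-delimited external file, the same INFILE statement would use dlm='09'x (ASCII) in place of dlm=':'.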
3.7 Placing Data Lines Directly in Your Program (the DATALINES Statement)

Suppose you want to write a short test program in SAS. Instead of having to place your data in an external file, you can place your lines of data directly in your SAS program by using a DATALINES statement. For example, if you want to read data from the text file mydata.txt (blank delimited data with values for Gender, Age, Height, and Weight), but you don’t want to go to the trouble of writing the external file, you could use Program 3-5.

Program 3-5 Demonstrating the DATALINES statement

data demographic;
   input Gender $ Age Height Weight;
datalines;
M 50 68 155
F 23 60 101
M 65 72 220
F 35 65 133
M 15 71 166
;
As you can see from this example, the INFILE statement was removed and a DATALINES statement was added. Following DATALINES are your lines of data. Finally, a semicolon is used to end the DATA step. (Note: You may either use a single semicolon or a RUN statement to end the DATA step.) The lines of data must be the last element in the DATA step—any assignment statement must come before the lines of data. While you would probably not use DATALINES in a real application, it is extremely useful when you want to write short test programs. As a historical note, the DATALINES statement used to be called the CARDS statement. If you don’t know what a computer card is, ask an old person. By the way, you can still use the word CARDS in place of DATALINES if you want.
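The rule about statement order can be illustrated by adding the BMI computation from Chapter 2 to Program 3-5. This is a sketch with only two data lines and a made-up title; the point is that the assignment statement sits between the INPUT statement and the DATALINES statement, not after the data.

```sas
data demographic;
   input Gender $ Age Height Weight;
   *Assignment statements go before the DATALINES statement;
   BMI = (Weight / 2.2) / (Height*.0254)**2;
datalines;
M 50 68 155
F 23 60 101
;

title "Listing of DEMOGRAPHIC with BMI";
proc print data=demographic;
run;
```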
3.8 Specifying INFILE Options with the DATALINES Statement

What if you use DATALINES and want to use one or more of the INFILE options, such as DLM= or DSD? You can use many of the INFILE options with DATALINES by using a reserved file reference called DATALINES. For example, if you wanted to run Program 3-3 without an external data file, you could use Program 3-6.

Program 3-6 Using INFILE options with DATALINES

data demographics;
   infile datalines dsd;
   input Gender $ Age Height Weight;
datalines;
"M",50,68,155
"F",23,60,101
"M",65,72,220
"F",35,65,133
"M",15,71,166
;
3.9 Reading Raw Data from Fixed Columns—Method 1: Column Input

Many raw data files store specific information in fixed columns. This has several advantages over data values separated by delimiters. First, you don’t have to worry about missing values. If you do not have a value, you can leave the appropriate columns blank. Next, when you write your INPUT statement, you can choose which variables to read and in what order to read them.

The simplest method for reading data in fixed columns is called column input. This method of input can read character data and standard numeric values. By standard numeric values, we mean positive or negative numbers as well as numbers in exponential form (for example, 3.4E3 means 3.4 times 10 to the 3rd power). This form of input cannot handle values with commas or dollar signs. With this form of input, you can read dates only as character values. Now for an example.
You have a raw data file called bank.txt in a folder called c:\books\learning on your Windows-based computer. A data description for this file follows.

File c:\books\learning\bank.txt

Variable   Description            Starting Column   Ending Column   Data Type
Subj       Subject Number                1                3         Character
DOB        Date of Birth                 4               13         Character
Gender     Gender                       14               14         Character
Balance    Bank Account Balance         15               21         Numeric

File c:\books\learning\bank.txt

         1         2
1234567890123456789012345   <- Columns (not part of the file)
-------------------------
00110/21/1955M   1145
00211/18/2001F  18722
00305/07/1944M 123.45
00407/25/1945F -12345
Program 3-7 is a SAS program that reads data values from this file.
Program 3-7 Demonstrating column input

data financial;
   infile 'c:\books\learning\bank.txt';
   input Subj    $  1-3
         DOB     $  4-13
         Gender  $    14
         Balance   15-21;
run;
As you can see from this example, you specify a variable name, a dollar sign if the variable is a character value, the starting column, and the ending column (if the value takes more than one column). In this program, the number of columns you specify for each character variable determines the number of bytes SAS uses to store these values; for numeric variables, SAS will always use 8 bytes to store these values, regardless of how many columns you specify in your INPUT statement. (There are advanced techniques to change the storage length for numeric variables—and these techniques should be used only when you need to save storage space and you understand the possible problems that can result.)
Notice that this program uses a separate line for each variable. This is not necessary, but it makes the program more readable. You could have written the program like this:

data financial;
   infile 'c:\books\learning\bank.txt';
   input Subj $ 1-3 DOB $ 4-13 Gender $ 14 Balance 15-21;
run;
It just doesn’t look as nice and is harder to read. This is a good time to recommend that you get into good habits in writing your SAS programs. It is amazing how much easier it is to read and understand a program where some care is taken in its appearance.

You can use PROC PRINT to examine the observations in the Financial data set as follows:

title "Listing of FINANCIAL";
proc print data=financial;
run;
The resulting listing is:

Listing of FINANCIAL

Obs  Subj  DOB         Gender    Balance
  1  001   10/21/1955  M         1145.00
  2  002   11/18/2001  F        18722.00
  3  003   05/07/1944  M          123.45
  4  004   07/25/1945  F       -12345.00
It is important to remember that the date of birth (DOB) is a character value in this data set. To create a more useful, numerical SAS date, you need to use formatted input, the next type of input to be described.
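Section 3.9 noted that column input lets you choose which variables to read and in what order. The sketch below illustrates that point under stated assumptions: the data set name balance_only is hypothetical, and two records from bank.txt are copied into a DATALINES block so the example runs without the external file.

```sas
* Sketch only: in-stream copies of two bank.txt records;
data balance_only;
   input Balance 15-21     /* read the last field first       */
         Subj  $  1-3;     /* then go back to columns 1 to 3  */
datalines;
00110/21/1955M   1145
00211/18/2001F  18722
;

title "Listing of BALANCE_ONLY";
proc print data=balance_only;
run;
```

DOB and Gender are simply not mentioned in the INPUT statement, so they are not read, and Balance comes before Subj in the resulting data set.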
3.10 Reading Raw Data from Fixed Columns, Method 2: Formatted Input

Formatted input also reads data from fixed columns. It can read both character and standard numeric data as well as nonstandard numerical values, such as numbers with dollar signs and commas, and dates in a variety of formats. Formatted input is the most common and powerful of all the input methods. Any time you have nonstandard data in fixed columns, you should consider using formatted input to read the file.

Let’s start with the same raw data file (bank.txt) that was used in Program 3-7. First examine the program, and then read the explanation.
Program 3-8 Demonstrating formatted input

data financial;
   infile 'c:\books\learning\bank.txt';
   input @1  Subj    $3.
         @4  DOB     mmddyy10.
         @14 Gender  $1.
         @15 Balance 7.;
run;
The @ (at) signs in the INPUT statement are called column pointers, and they do just that. For example, @4 says to SAS, go to column 4. Following the variable names are SAS informats. Informats are built-in instructions that tell SAS how to read a data value. The choice of which informat to use is dictated by the data.

Two of the most basic informats are w.d and $w. The w.d informat reads standard numeric values. The w tells SAS how many columns to read. The optional d tells SAS that there is an implied decimal point in the value. For example, if you have the number 123 and you read it with a 3.0 informat, SAS stores the value 123.0. If you read the same number with a 3.1 informat, SAS stores the value 12.3. If the number you are reading already has a decimal point in it (this counts as one of the columns to be read), SAS ignores the d portion of the informat. So, if you read the value 1.23 with a 4.1 informat, SAS stores a value of 1.23.

The $w. informat tells SAS to read w columns of character data. In this program, Subj is read as character data and takes up three columns; values of Gender take up a single column.

Now it’s time to read the date. The MMDDYY10. informat tells SAS that the date you are reading is in the mm/dd/yyyy form. SAS reads the date and converts the value into a SAS date. SAS stores dates as numeric values equal to the number of days from January 1, 1960. So, if you read the value 01/01/1960 with the MMDDYY10. informat, SAS stores a value of 0. The date 01/02/1960 read with the same informat would result in a value of 1, and so forth. SAS knows all about leap years and correctly converts any date from 1582 to way into the future (1582 is the year Pope Gregory started the Gregorian calendar; dates before this are not defined in SAS).

So, getting back to our example, since date values are in the mm/dd/yyyy form and start in column 4, you use @4 to move the column pointer to column 4 and the MMDDYY10. informat to tell SAS to read the next 10 columns as a date in this form. SAS then computes the number of days from January 1, 1960, corresponding to each of the date values.

Let’s see what happens when we use PROC PRINT to see the contents of this data set.

title "Listing of FINANCIAL";
proc print data=financial;
run;
This code produces the following output:

Listing of FINANCIAL

Obs  Subj     DOB  Gender    Balance
  1  001    -1533  M         1145.00
  2  002    15297  F        18722.00
  3  003    -5717  M          123.45
  4  004    -5273  F       -12345.00
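The informat rules described above can be checked without a raw data file. The following is a small sketch, not one of the book's programs: it assumes the INPUT function (which applies an informat to a character string) and uses DATA _NULL_ with a PUT statement only to write the results to the SAS log.

```sas
* Sketch: apply informats to literal strings with the INPUT function;
data _null_;
   a  = input('123', 3.);                /* no implied decimal places        */
   b  = input('123', 3.1);               /* one implied decimal place        */
   c  = input('1.23', 4.1);              /* explicit decimal point is kept   */
   d0 = input('01/01/1960', mmddyy10.);  /* day zero of the SAS calendar     */
   d1 = input('01/02/1960', mmddyy10.);  /* one day later                    */
   put a= b= c= d0= d1=;
run;
```

Following the rules stated in the text, the log line should show 123, 12.3, 1.23, 0, and 1 for the five variables.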
Well, the dates (variable DOB) look rather strange. What you are seeing are the actual values SAS is storing for each DOB. You need a way to display these dates in a more traditional form, such as the way the dates were displayed in the raw data file (10/21/1955, in the first observation) or in some other form (such as 21Oct1955). While you are at it, why not add dollar signs and commas to the Balance figures?

You can accomplish both of these tasks by associating a format with each of these two variables. There are many built-in formats in SAS that allow you to display dates and financial values in easily readable ways. You associate these formats with the appropriate variables in a FORMAT statement. Program 3-9 shows how to add a FORMAT statement to PROC PRINT.
Program 3-9 Demonstrating a FORMAT statement

title "Listing of FINANCIAL";
proc print data=financial;
   format DOB     mmddyy10.
          Balance dollar11.2;
run;
Here you are using the MMDDYY10. format to print the DOB values and the dollar11.2 format to print the Balance values. Notice the period in each of the formats. All SAS formats need to end either in a period or in a period followed by a number. The 11.2 following the dollar format says to allow up to 11 columns to print the Balance values (including the dollar sign, the decimal point, and possibly a comma or a minus sign). The 2 following the period says to include two decimal places after the decimal point. Here is the revised output:

Listing of FINANCIAL

Obs  Subj  DOB         Gender      Balance
  1  001   10/21/1955  M         $1,145.00
  2  002   11/18/2001  F        $18,722.00
  3  003   05/07/1944  M           $123.45
  4  004   07/25/1945  F       $-12,345.00
It is important to remember that the formats only affect the way these values appear in printed output—the internal values are not changed. To be sure that you understand what formats do, let’s repeat Program 3-9 and use another format for date of birth (DOB).
Program 3-10 Rerunning Program 3-9 with a different format

title "Listing of FINANCIAL";
proc print data=financial;
   format DOB     date9.
          Balance dollar11.2;
run;
This produces the resulting output:

Listing of FINANCIAL

Obs  Subj  DOB        Gender      Balance
  1  001   21OCT1955  M         $1,145.00
  2  002   18NOV2001  F        $18,722.00
  3  003   07MAY1944  M           $123.45
  4  004   25JUL1945  F       $-12,345.00
The DATE9. format, as you can see, prints dates as a two-digit day of the month, a three-character month abbreviation, and a four-digit year. This format helps avoid confusion between the month-day-year and day-month-year formats used in the United States and Europe, respectively. Notice also that the DOLLAR11.2 format makes the Balance figures much easier to read. This is a good place to mention that the COMMAw.d format is useful for displaying large numbers where you don’t need or want dollar signs.
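To compare COMMAw.d with DOLLARw.d side by side, here is a minimal sketch. The data set name balances and the copy variable Balance_Comma are hypothetical; the values are three of the balances from the listings above, entered as in-stream data.

```sas
* Sketch: the same values shown with COMMAw.d and DOLLARw.d;
data balances;
   input Balance;
   Balance_Comma = Balance;   /* copy, so a second format can be shown */
datalines;
1145
18722
-12345
;

title "COMMA versus DOLLAR formats";
proc print data=balances;
   format Balance       dollar11.2
          Balance_Comma comma10.2;
run;
```

Both columns hold identical internal values; only the displayed form differs.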
3.11 Using a FORMAT Statement in a DATA Step versus in a Procedure

Program 3-9 demonstrated using a FORMAT statement in a procedure. Placing a FORMAT statement here associates the formats and variables only for that procedure. It is usually more useful to place your FORMAT statement in the DATA step. When you do this, there is a permanent association of the formats and variables in the data set. You can override any permanent format by placing a FORMAT statement in a particular procedure where you would like a different format. You will usually want to place all of your date formats in a DATA step because no one wants to see unformatted SAS dates.
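The difference between the two placements can be sketched as follows. This is an illustration rather than one of the book's programs: the data set name financial_fmt is made up, and two bank.txt records are supplied as in-stream data so the example is self-contained.

```sas
* Sketch: formats assigned in the DATA step are stored with the data set;
data financial_fmt;
   input @1  Subj    $3.
         @4  DOB     mmddyy10.
         @14 Gender  $1.
         @15 Balance 7.;
   format DOB mmddyy10. Balance dollar11.2;
datalines;
00110/21/1955M   1145
00211/18/2001F  18722
;

title "Uses the formats stored in the data set";
proc print data=financial_fmt;
run;

title "Overrides the DOB format for this procedure only";
proc print data=financial_fmt;
   format DOB date9.;
run;
```

The first PROC PRINT needs no FORMAT statement at all; the second changes only how DOB is displayed in that one listing, and Balance keeps its stored format.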
3.12 Using Informats with List Input

Suppose you have a blank- or comma-delimited file containing dates and character values longer than 8 bytes (or other values that require an informat). One way to provide informats with list input is to follow each variable name in your INPUT statement with a colon, followed by the appropriate informat. To see how this works, suppose you want to read this CSV file:

File: c:\books\learning\list.csv

"001","Christopher Mullens",11/12/1955,"$45,200"
"002","Michelle Kwo",9/12/1955,"$78,123"
"003","Roger W. McDonald",1/1/1960,"$107,200"
Variables in this file represent a subject number (Subj), Name, date of birth (DOB), and yearly salary (Salary). You need to supply informats for Name (length is greater than 8 bytes), DOB (you need a date informat here), and Salary (this is a nonstandard numeric value—with a dollar sign and commas). Program 3-11 shows one way to supply the appropriate informats for these variables.
Program 3-11 Using informats with list input

data list_example;
   infile 'c:\books\learning\list.csv' dsd;
   input Subj   : $3.
         Name   : $20.
         DOB    : mmddyy10.
         Salary : dollar8.;
   format DOB date9. Salary dollar8.;
run;
You see here that there is a colon preceding each informat. This colon (called an informat modifier) tells SAS to use the informat supplied but to stop reading the value for this variable when a delimiter is encountered. Do not forget the colons because without them SAS may read past a delimiter to satisfy the width specified in the informat. This program would also work if the informat for Subj were omitted and the variable name was followed by a dollar sign (to signify that Subj is a character variable). However, the Subj variable would then be stored in 8 bytes (the default length for character variables with list input). By providing the $3. informat, you tell SAS to use 3 bytes to store this variable.
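The alternative just described, a plain dollar sign after Subj, can be tried with a short sketch. The data set name list_default is hypothetical, two records of list.csv are supplied as in-stream data, and PROC CONTENTS (covered in Chapter 4) is used here only to display the storage lengths.

```sas
* Sketch: Subj read with a plain dollar sign instead of the $3. informat;
data list_default;
   infile datalines dsd;
   input Subj   $
         Name   : $20.
         DOB    : mmddyy10.
         Salary : dollar8.;
   format DOB date9. Salary dollar8.;
datalines;
"001","Christopher Mullens",11/12/1955,"$45,200"
"002","Michelle Kwo",9/12/1955,"$78,123"
;

* The variable list should report a length of 8 for Subj;
proc contents data=list_default;
run;
```

The values of Subj are read correctly either way; the only difference is the number of bytes used to store them.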
3.13 Supplying an INFORMAT Statement with List Input

Another way to supply informats when using list input is to use an INFORMAT statement before the INPUT statement. Following the keyword INFORMAT, you list each variable and the informat you want to use to read each variable. You may also use a single informat for several variables if you follow a list of variables by a single informat. To see how this works, see Program 3-12, which is Program 3-11 rewritten to use an INFORMAT statement.
Program 3-12 Supplying an INFORMAT statement with list input

data list_example;
   informat Subj   $3.
            Name   $20.
            DOB    mmddyy10.
            Salary dollar8.;
   infile 'c:\books\learning\list.csv' dsd;
   input Subj Name DOB Salary;
   format DOB date9. Salary dollar8.;
run;
This program uses an INFORMAT statement to associate an informat to each of the variables. When choosing informats for your variables, be sure to make the length long enough to accommodate the longest data value you will encounter. Notice that the INPUT statement does not require anything other than the variable names because each variable already has an assigned informat. A listing from PROC PRINT confirms that all is well:

Listing of LIST_EXAMPLE

Obs  Subj  Name                 DOB        Salary
  1  001   Christopher Mullens  12NOV1955   $45,200
  2  002   Michelle Kwo         12SEP1955   $78,123
  3  003   Roger W. McDonald    01JAN1960  $107,200
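Section 3.13 also mentions that one informat can follow a list of variables. A minimal sketch of that form is shown here; the data set name visits, the variables FirstVisit and LastVisit, and the data values are all invented for illustration.

```sas
* Sketch: one informat shared by two date variables (hypothetical data);
data visits;
   informat Subj $3.
            FirstVisit LastVisit mmddyy10.;
   infile datalines dsd;
   input Subj FirstVisit LastVisit;
   format FirstVisit LastVisit date9.;
datalines;
"001",01/15/2006,03/20/2006
"002",02/01/2006,02/28/2006
;

title "Listing of VISITS";
proc print data=visits;
run;
```

The FORMAT statement uses the same shorthand: a list of variables followed by a single format.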
3.14 Using List Input with Embedded Delimiters

What if the previous CSV file used blanks instead of commas as delimiters and there were no quotes around each character value? Here's what the file would look like:

File c:\books\learning\list.txt

001 Christopher Mullens 11/12/1955 $45,200
002 Michelle Kwo 9/12/1955 $78,123
003 Roger W. McDonald 1/1/1960 $107,200

Houston, we have a problem! If you try to read this file with list input, the blank(s) in the Name field will trigger the end of the variable. SAS, in its infinite wisdom, came up with a novel solution: the ampersand (&) informat modifier. The ampersand, like the colon, says to use the supplied informat, but the delimiter is now two or more blanks instead of just one. So, if you use an ampersand modifier to read the list.txt file here, you need to use the ampersand modifier following Name. You also need to have two or more spaces between the end of the name and the date of birth. Here is the modified file:

File c:\books\learning\list.txt

001 Christopher Mullens  11/12/1955 $45,200
002 Michelle Kwo  9/12/1955 $78,123
003 Roger W. McDonald  1/1/1960 $107,200
And here is the program using the ampersand modifier:
Program 3-13 Demonstrating the ampersand modifier for list input

data list_example;
   infile 'c:\books\learning\list.txt';
   input Subj   : $3.
         Name   & $20.
         DOB    : mmddyy10.
         Salary : dollar8.;
   format DOB date9. Salary dollar8.;
run;
As you can see, the INPUT statement is one of the most powerful and versatile SAS statements. Please refer to Chapter 25 to learn even more about the ability of SAS to read raw data.
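For readers who want to try the ampersand modifier without creating list.txt, here is a sketch that supplies the same three records as in-stream data. The data set name list_amp is hypothetical; note that each name is followed by exactly two blanks, which is what ends the Name field.

```sas
* Sketch: in-stream version of list.txt with two blanks after each name;
data list_amp;
   input Subj   : $3.
         Name   & $20.
         DOB    : mmddyy10.
         Salary : dollar8.;
   format DOB date9. Salary dollar8.;
datalines;
001 Christopher Mullens  11/12/1955 $45,200
002 Michelle Kwo  9/12/1955 $78,123
003 Roger W. McDonald  1/1/1960 $107,200
;

title "Listing of LIST_AMP";
proc print data=list_amp;
run;
```

The single blanks inside each name are kept as part of the value because only two or more consecutive blanks count as a delimiter for a variable read with the ampersand modifier.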
3.15 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. You have a text file called scores.txt containing information on gender (M or F) and four test scores (English, history, math, and science). Each data value is separated from the others by one or more blanks. Here is a listing of the data file:

File scores.txt

M 80 82 85 88
F 94 92 88 96
M 96 88 89 92
F 95 . 92 92

   a. Write a DATA step to read in these values. Choose your own variable names. Be sure that the value for Gender is stored in 1 byte and that the four test scores are numeric.
   b. Include an assignment statement computing the average of the four test scores.
   c. Write the appropriate PROC PRINT statements to list the contents of this data set.

2. You are given a CSV (comma-separated values) file called political.csv containing state, political party, and age. A listing of this file is shown here:

File political.csv

"NJ",Ind,55
"CO",Dem,45
"NY",Rep,23
"FL",Dem,66
"NJ",Rep,34

   a. Write a SAS program to create a temporary SAS data set called Vote. Use the variable names State, Party, and Age. Age should be stored as a numeric variable; State and Party should be stored as character variables.
   b. Include a procedure to list the observations in this data set.
   c. Include a procedure to compute frequencies for Party.
3. You are given a text file where dollar signs were used as delimiters. To indicate missing values, two dollar signs were entered. Values in this file represent last name, employee number, and annual salary. Here is a listing of this file:

File company.txt

Roberts$M234$45000
Chien$M74777$$
Walters$$75000
Rogers$F7272$78131

Using this data file as input, create a temporary SAS data set called Company with the variables LastName (character), EmpNo (character), and Salary (numeric).

4. Repeat Problem 2 using a FILENAME statement to create a fileref instead of using the file name on the INFILE statements.

5. You want to create a test data set that uses a DATALINES statement to read in values for X and Y. In the DATA step, you want to create a new variable, Z, equal to 100 + 50X + 2X² - 25Y + Y². Use the following (X,Y) data pairs: (1,2), (3,6), (5,9), and (9,11).

6. You have a text file called bankdata.txt with data values arranged as follows:

Variable  Description     Starting Column  Ending Column  Data Type
Name      Name            1                15             Char
Acct      Account number  16               20             Char
Balance   Acct balance    21               26             Num
Rate      Interest rate   27               30             Num

Create a temporary SAS data set called Bank using this data file. Use column input to specify the location of each value. Include in this data set a variable called Interest computed by multiplying Balance by Rate. List the contents of this data set using PROC PRINT.

Here is a listing of the text file:

File bankdata.txt

Philip Jones   V1234 4322.32
Nathan Philips V1399 15202.45
Shu Lu         W8892 451233.45
Betty Boop     V7677 50002.78
7. You have a text file called geocaching.txt with data values arranged as follows:

Variable  Description        Starting Column  Ending Column  Data Type
Name      Cache name         1                20             Char
LongDeg   Longitude degrees  21               22             Num
LongMin   Longitude minutes  23               28             Num
LatDeg    Latitude degrees   29               30             Num
LatMin    Latitude minutes   31               36             Num

Here is a listing of the file:

File geocaching.txt

Higgensville Hike   4030.2937446.539
Really Roaring      4027.4047442.147
Cushetunk Climb     4037.0247448.014
Uplands Trek        4030.9907452.794

Create a temporary SAS data set called Cache using this data file. Use column input to read the data values. To learn about geocaching (treasure hunting with a hand-held GPS), go to www.geocaching.com. The author and his wife use the geocaching name “Jan and the Man.” Check it out.

8. Repeat Problem 6 using formatted input to read the data values instead of column input.

9. Repeat Problem 7 using formatted input to read the data values instead of column input.

10. You are given a text file called stockprices.txt containing information on the purchase and sale of stocks. The data layout is as follows:
Variable   Description       Starting Column  Length  Type
Stock      Stock symbol      1                4       Char
PurDate    Purchase date     5                10      mm/dd/yyyy
PurPrice   Purchase price    15               6       Dollar signs and commas
Number     Number of shares  21               4       Num
SellDate   Selling date      25               10      mm/dd/yyyy
SellPrice  Selling price     35               6       Dollar signs and commas

A listing of the data file is:

File stockprices.txt

IBM   5/21/2006 $80.0 10007/20/2006 $88.5
CSCO04/05/2005 $17.5 20009/21/2005 $23.6
MOT 03/01/2004 $14.7 50010/10/2006 $19.9
XMSR04/15/2006 $28.4 20004/15/2007 $12.7
BBY 02/15/2005 $45.2 10009/09/2006 $56.8

Create a SAS data set (call it Stocks) by reading the data from this file. Use formatted input. Compute several new variables as follows:

Variable   Description           Computation
TotalPur   Total purchase price  Number times PurPrice
TotalSell  Total selling price   Number times SellPrice
Profit     Profit                TotalSell minus TotalPur
Print out the contents of this data set using PROC PRINT.
11. You have a CSV file called employee.csv. This file contains the following information:

Variable  Description    Desired Informat
ID        Employee ID    $3.
Name      Employee name  $20.
Depart    Department     $8.
DateHire  Hire date      MMDDYY10.
Salary    Yearly salary  DOLLAR8.

Use list input to read data from this file. You will need an informat to read most of these values correctly (for example, DateHire needs a date informat). You can do this in either of two ways. First is to include an INFORMAT statement to associate each variable with the appropriate informat. The other is to use the colon modifier and supply the informats directly in the INPUT statement. Create a temporary SAS data set (Employ) from this data file. Use PROC PRINT to list the observations in your data set and the appropriate procedure to compute frequencies for the variable Depart. A listing of the raw data file is:

File employee.csv

123,"Harold Wilson",Acct,01/15/1989,$78,123.
128,"Julia Child",Food,08/29/1988,$89,123
007,"James Bond",Security,02/01/2000,$82,100
828,"Roger Doger",Acct,08/15/1999,$39,100
900,"Earl Davenport",Food,09/09/1989,$45,399
906,"James Swindler",Acct,12/21/1978,$78,200
Chapter 4

Creating Permanent SAS Data Sets

4.1 Introduction
4.2 SAS Libraries: The LIBNAME Statement
4.3 Why Create Permanent SAS Data Sets?
4.4 Examining the Descriptor Portion of a SAS Data Set Using PROC CONTENTS
4.5 Listing All the SAS Data Sets in a SAS Library Using PROC CONTENTS
4.6 Viewing the Descriptor Portion of a SAS Data Set Using the SAS Explorer
4.7 Viewing the Data Portion of a SAS Data Set Using PROC PRINT
4.8 Viewing the Data Portion of a SAS Data Set Using the SAS VIEWTABLE Window
4.9 Using a SAS Data Set as Input to a DATA Step
4.10 DATA _NULL_: A Data Set That Isn’t
4.11 Problems
4.1 Introduction

SAS procedures cannot read raw data files or spreadsheets directly. One way or another, they need the data in SAS data sets. Remember that SAS DATA steps can create SAS data sets. You can also have SAS convert data from other sources, such as Microsoft Office Excel, Oracle, and DB2. This conversion process can be automated by using the Import Wizard (on Windows platforms) or by using data access engines, which automatically convert the data into a form SAS can process. This chapter describes how to make your SAS data set permanent and how to determine the contents of a SAS data set.
4.2 SAS Libraries: The LIBNAME Statement

When you write a DATA statement such as

data test;

SAS creates a temporary SAS data set called Test. When you close your SAS session, this data set disappears. SAS data set names actually have two-part names in the form:

libref.data-set-name

The part of the name before the period is called a libref (short for library reference), and this tells SAS where to store (or retrieve) the data set. The part of the name after the period identifies the name you want to give the data set.

Up to now, all the programming examples in this book used a data set name without a period. When you use a name like Test in the DATA statement, SAS uses a default libref called Work that SAS creates automatically every time you open a SAS session. For example, if you write a DATA statement such as

data test;

SAS adds the default libref Work, so this DATA statement is equivalent to

data work.test;
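This equivalence is easy to confirm. The sketch below is an illustration only: the variable x is arbitrary, and it assumes the default configuration in which one-level names go to the Work library (that is, no USER library has been defined).

```sas
* Sketch: a one-level name and the WORK two-level name refer to the same data set;
data test;
   x = 1;
run;

title "Listing of WORK.TEST";
proc print data=work.test;
run;
```

The data set is created with the one-level name and then listed with the two-level name.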
All that is required to make your SAS data sets permanent is to create your own libref using a LIBNAME statement and use that libref in the two-level SAS data set name.
Suppose you want to create a permanent SAS data set called Test_Scores in your c:\books\learning folder. You could use the following program.
Program 4-1 Creating a permanent SAS data set

libname mozart 'c:\books\learning';

data mozart.test_scores;
   length ID $ 3 Name $ 15;
   input ID $ Score1-Score3 Name $;
datalines;
1 90 95 98 Milton
2 78 77 75 Washington
3 88 91 92 Smith
;
The LIBNAME statement starts with the LIBNAME keyword and then specifies the name of the library (called a libref), followed by the directory or folder where you want to store your permanent SAS data sets. The libref you use must not be more than 8 characters in length and must be a valid SAS name. When you run this program, data set Test_Scores becomes a permanent SAS data set in the c:\books\learning folder.

It is important to remember that any libref that you create exists only for your current SAS session. If you open a new SAS session, you need to issue a new LIBNAME statement. A good way to think of a libref is as an alias for the name of the folder (on Windows or UNIX platforms). On mainframe computers, a SAS library is actually a single file that can hold multiple SAS data sets.

If you run Program 4-1 on a Windows platform, the SAS data set Test_Scores will be stored as the file test_scores.sas7bdat in the c:\books\learning folder. The file extension stands for SAS binary data version 7. You may wonder why there is a 7 rather than a 9 in the file extension when this data set was created using SAS 9. Since the structure of SAS data sets has not changed since SAS 7, SAS has maintained the same file extension it used in SAS 7.
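Two other forms of the LIBNAME statement are handy when you work with librefs in a session. The following sketch assumes the c:\books\learning folder already exists; it is not one of the book's numbered programs.

```sas
* Sketch: assign, inspect, and release a libref;
libname mozart 'c:\books\learning';

* Write the attributes of the library to the SAS log;
libname mozart list;

* Remove the libref. The files in the folder are not deleted;
libname mozart clear;
```

Clearing a libref only removes the alias for the current session. The permanent data sets remain in the folder and can be reached again by issuing another LIBNAME statement.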
4.3 Why Create Permanent SAS Data Sets?

If your data sets are small, you may choose to create them each time you start a SAS session. However, it takes considerable computer resources to create SAS data sets and it makes more sense to make your data sets permanent if you plan to use them more than once, especially if they are large.
4.4 Examining the Descriptor Portion of a SAS Data Set Using PROC CONTENTS

A SAS data set consists of two parts: a descriptor portion and a data portion. One way to examine the descriptor portion of a SAS data set is by using PROC CONTENTS. If you want to see the descriptor portion of the Test_Scores data set, submit the following program:
Program 4-2 Using PROC CONTENTS to examine the descriptor portion of a SAS data set

title "The Descriptor Portion of Data Set TEST_SCORES";
proc contents data=Mozart.test_scores;
run;
The resulting output is shown next:

The Descriptor Portion of Data Set TEST_SCORES

The CONTENTS Procedure

Data Set Name        MOZART.TEST_SCORES             Observations          3
Member Type          DATA                           Variables             5
Engine               V9                             Indexes               0
Created              Tue, Sep 20, 2005 03:45:58 PM  Observation Length    48
Last Modified        Tue, Sep 20, 2005 03:45:58 PM  Deleted Observations  0
Protection                                          Compressed            NO
Data Set Type                                       Sorted                NO
Label
Data Representation  WINDOWS_32
Encoding             wlatin1 Western (Windows)

Engine/Host Dependent Information

Data Set Page Size          4096
Number of Data Set Pages    1
First Data Page             1
Max Obs per Page            84
Obs in First Data Page      3
Number of Data Set Repairs  0
File Name                   c:\books\learning\test_scores.sas7bdat
Release Created             9.0101M3
Host Created                XP_PRO

Alphabetic List of Variables and Attributes

#  Variable  Type  Len
1  ID        Char    3
2  Name      Char   15
3  Score1    Num     8
4  Score2    Num     8
5  Score3    Num     8
The output displays information about the data set, such as the number of variables, the number of observations, and the creation and modification dates. It also displays information about the SAS version used to create the data set. In addition, it displays information on each of the variables in the data set: the variable name, type, and storage length.

A quick look at this output shows that the data set Test_Scores has 3 observations and 5 variables. The list of variables is in alphabetical order. It shows that ID and NAME are character variables stored in 3 and 15 bytes, respectively; the three SCORE variables are numeric and are stored in 8 bytes each.

Note: The list is in alphabetical order because all the variable names are in lowercase. If you also have uppercase variable names, they will be grouped first in the list, followed by any lowercase names.

The title at the top of the page is created by using a TITLE statement as shown in Program 4-2. You may use either single or double quotes to enclose your title. If the title contains any single quotation marks (or apostrophes), you should use double quotation marks. If your title does not contain any apostrophes, you can actually omit the quotation marks altogether. However, as a matter of style, you may want to use quotation marks on all your TITLE statements.

A more useful way to list variable information is to list them in the order the variables are stored in the SAS data set, rather than alphabetically. To create such a list, use the VARNUM option of PROC CONTENTS, like this:
Program 4-3 Demonstrating the VARNUM option of PROC CONTENTS

title "The Descriptor Portion of Data Set TEST_SCORES";
proc contents data=Mozart.test_scores varnum;
run;
Output from this program is identical to the previous output except that the variable list is now in the same order as the variables in the data set. This portion of the output is shown below to demonstrate the effect of this option:

Variables in Creation Order

#  Variable  Type  Len
1  ID        Char    3
2  Name      Char   15
3  Score1    Num     8
4  Score2    Num     8
5  Score3    Num     8
It is important to remember that if you have just opened a new SAS session, you must reissue a LIBNAME statement if you want to access a previously created SAS data set or to create a new one. You may use any library name (libref) you want each time you open a SAS session, although in practice you usually use the same library reference each time. For example, if you open up a new SAS session, you can submit the following statements to obtain information on the Test_Scores data set:
Program 4-4 Using a LIBNAME in a new SAS session

libname proj99 'c:\books\learning';

title "Descriptor Portion of Data Set TEST_SCORES";
proc contents data=proj99.test_scores varnum;
run;
4.5 Listing All the SAS Data Sets in a SAS Library Using PROC CONTENTS

You can use PROC CONTENTS to list the names of all the SAS data sets in a SAS library (folder). To do this, use the following program:
Program 4-5 Using PROC CONTENTS to list the names of all the SAS data sets in a SAS library

title "Listing All the SAS Data Sets in a Library";
proc contents data=Mozart._all_ nods;
run;
The keyword _ALL_ is used in place of a data set name. The NODS option gives you the name of the SAS data sets only, omitting the detail listing for each data set. A sample listing (showing three data sets) is shown here:

Listing All the SAS Data Sets in a Library

The CONTENTS Procedure

Directory

Libref         MOZART
Engine         V9
Physical Name  c:\books\learning
File Name      c:\books\learning

#  Name         Member Type  File Size  Last Modified
1  CLINIC       DATA         5120       20Sep05:16:27:33
2  PATIENTS     DATA         5120       20Sep05:16:27:33
3  TEST_SCORES  DATA         5120       20Sep05:15:45:58
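The same technique works for the temporary Work library, which makes it easy to try without a permanent folder. In this sketch the data set names one and two and their variables are arbitrary placeholders.

```sas
* Sketch: two small temporary data sets, then a directory listing of WORK;
data one;
   x = 1;
run;

data two;
   y = 2;
run;

title "Listing All the SAS Data Sets in WORK";
proc contents data=work._all_ nods;
run;
```

Removing the NODS option would add the full descriptor output for each data set after the directory listing.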
4.6 Viewing the Descriptor Portion of a SAS Data Set Using the SAS Explorer

If you are running SAS in a Windows environment, you can use the SAS Explorer to display similar information to that produced by PROC CONTENTS. This is quite easy to do. First, click on the Explorer tab to the left of your editor window:
This brings up the following window:
The Libraries icon shows the built-in libraries plus any libraries you have created using LIBNAME statements.
The Work library contains all of your temporary SAS data sets. Selecting a library enables you to see all the SAS data sets stored there.
A right-click on the data set icon brings up a menu that includes a choice to see the variables (columns) in the data set and their attributes.
The Columns tab shows the same information you can obtain by running PROC CONTENTS. The order of the variables in the list is the same as the order you will see when using the VARNUM option.
The Details tab displays the same information you see in part of the output from PROC CONTENTS.
4.7 Viewing the Data Portion of a SAS Data Set Using PROC PRINT

As you have seen in several programs, PROC PRINT can be used to list the data in a SAS data set. Although there are a number of options to control how this listing appears, you can use it with all the defaults to get a quick listing of your data set. Here is the code to list the data portion of data set Test_Scores:
Program 4-6 Using PROC PRINT to list the data portion of a SAS data set

title "Listing of TEST_SCORES";
proc print data=Mozart.test_scores;
run;
This code generates the following output:

Listing of TEST_SCORES

Obs  ID  Name        Score1  Score2  Score3
  1  1   Milton          90      95      98
  2  2   Washington      78      77      75
  3  3   Smith           88      91      92
This listing displays all the variables and all the observations in the Test_Scores data set. Program 4-6 is an example of a procedure that uses all the default actions. That is, you did not specify any details such as which variables to print or other controllable aspects of this procedure. Chapter 14 describes how to add options and statements to PROC PRINT to customize your report.
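As a brief preview of the kind of control mentioned above, the sketch below selects variables with a VAR statement and suppresses the Obs column with the NOOBS option. It is an illustration only: it builds a temporary copy of the test score data so that it runs without the permanent library.

```sas
* Sketch: a temporary copy of the test score data;
data test_scores;
   length ID $ 3 Name $ 15;
   input ID $ Score1-Score3 Name $;
datalines;
1 90 95 98 Milton
2 78 77 75 Washington
3 88 91 92 Smith
;

title "Selected Variables from TEST_SCORES";
proc print data=test_scores noobs;
   var Name Score1-Score3;
run;
```

The VAR statement also sets the order of the columns, so Name appears first even though it is stored after ID in the data set.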
4.8 Viewing the Data Portion of a SAS Data Set Using the SAS VIEWTABLE Window
Following the initial steps in Section 4.6, you can bring up the SAS VIEWTABLE window to list the observations in a SAS data set. Instead of right-clicking on the data set name, use the left mouse button (double-click) to open the SAS viewer, as shown here:
This action opens your data set in a spreadsheet type view like this:
The SAS Viewer
You can drag and drop columns, sort, or hide columns using your mouse, very much as you do with an Excel spreadsheet. It is important to close this window before attempting to modify the data set because the open viewer prevents any changes.
4.9 Using a SAS Data Set as Input to a DATA Step
Besides raw data files, SAS data sets can also be used as input to a DATA step. For example, you might want to use the information in an existing SAS data set to compute new variables. Consider the data set Test_Scores (stored in the c:\books\learning folder). This data set contains the variables ID, Name, and Score1–Score3 (three test scores). Suppose you want to compute an average score for each subject in this data set. Program 4-7 performs this task:
Program 4-7 Using observations from a SAS data set as input to a new SAS data set
data new;
   set learn.test_scores;
   AveScore = mean(of score1-score3);
run;
title "Listing of Data Set NEW";
proc print data=new;
   var ID Score1-Score3 AveScore;
run;
The key to this program is the SET statement. You can think of a SET statement as an INPUT statement, except that you are reading observations from a SAS data set instead of lines from a raw data file. There is a difference, however. Variables that are read from a raw data file or created by assignment statements in the DATA step are initialized to a missing value at the start of each iteration of the DATA step. Variables that are read from SAS data sets are not set to missing values during each iteration of the DATA step; they are said to be retained. In Program 4-7, the variables ID, Name, and Score1–Score3 are retained; the variable AveScore is not. This fact is not a concern to us here, but it can be used to advantage in more advanced programs.
The assignment statement that creates the AveScore variable uses the MEAN function to compute the mean of the three Score variables. You can read more about the MEAN function in Chapter 11. For now, you should notice that the variable list Score1–Score3 is preceded by the word of. This is typical of many SAS statistical functions that can take a variable list as an argument. Without the word of, the dash would be interpreted as a minus sign, so the MEAN function would receive a single argument and return the difference of Score1 and Score3.
Listing of Data Set NEW
Obs   ID   Score1   Score2   Score3   AveScore
  1   1        90       95       98    94.3333
  2   2        78       77       75    76.6667
  3   3        88       91       92    90.3333
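The effect of the word of can be checked with a short DATA _NULL_ step. This is only a sketch: the variable names WithOf and WithoutOf are made up for illustration, and the scores are hard-coded from the first observation of Test_Scores rather than read from the data set.

```sas
data _null_;
   /* Hard-coded copy of the first observation's scores (illustration only) */
   Score1 = 90;
   Score2 = 95;
   Score3 = 98;
   /* Variable list: mean of the three scores */
   WithOf = mean(of Score1-Score3);
   /* No OF: the dash is a minus sign, so this is the mean of one value, Score1 - Score3 */
   WithoutOf = mean(Score1-Score3);
   put WithOf= WithoutOf=;
run;
```

WithOf should be about 94.33, matching the first row of the listing, while WithoutOf should be -8 (that is, 90 minus 98).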
4.10 DATA _NULL_: A Data Set That Isn’t
There are many applications where you want to process observations in a SAS data set, perhaps to print out data errors or to produce a report, and you don’t need to create a new data set. You can use the data set name _NULL_ for these applications. The reserved data set name _NULL_ tells SAS not to create a data set. It enables you to process observations from an existing data set without the overhead of creating a new data set. Here is an example. You have a permanent SAS data set (Test_Scores) and you want to create a list of all the IDs of students who achieved a score of 95 or higher on any of the tests. You could create a new SAS data set and use PROC PRINT to list these students or you could do it more efficiently with a DATA _NULL_ step, like this:
Program 4-8 Demonstrating a DATA _NULL_ step
data _null_;
   set learn.test_scores;
   if score1 ge 95 or score2 ge 95 or score3 ge 95 then
      put ID= Score1= Score2= Score3=;
run;
The IF statement checks if any of the three test scores is greater than or equal to 95. If so, the PUT statement writes out the values of ID and the three test scores. A PUT statement writes text to a location of your choice: an external text file, the SAS log, or the OUTPUT window. In Program 4-8, an output location is not specified so the default location, the SAS log, is used. Here is a listing of the SAS log after running this program:
40   data _null_;
41      set learn.test_scores;
42      if score1 ge 95 or score2 ge 95 or score3 ge 95 then
43         put ID= Score1= Score2= Score3=;
44   run;
ID=1 Score1=90 Score2=95 Score3=98
NOTE: There were 3 observations read from the data set LEARN.TEST_SCORES.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
Placing PUT statements in a DATA step is an excellent way to help debug SAS programs. You can examine the values of your variables at any place in the DATA step. (You can also use the SAS debugger, available on the PC platform, for this purpose.) If you want to send the output to a file called c:\books\learning\highscores.txt, you would need to place a FILE statement before the PUT statement, as follows:
   file 'c:\books\learning\highscores.txt';
A FILE statement is somewhat like an INFILE statement: just as INFILE tells SAS where an INPUT statement reads from, FILE works in concert with a PUT statement, telling SAS the destination of the text you are outputting. If you want the results of the PUT statement to be written to the output device (on a PC, this would be the OUTPUT window), you can use the reserved file reference PRINT, like this:
   file print;
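Putting the pieces together, Program 4-8 with a FILE statement added might look like the following sketch. It assumes the libref learn has already been assigned with a LIBNAME statement and that the folder c:\books\learning exists and is writable.

```sas
data _null_;
   set learn.test_scores;
   /* FILE comes before PUT so that PUT knows where to write */
   file 'c:\books\learning\highscores.txt';
   if score1 ge 95 or score2 ge 95 or score3 ge 95 then
      put ID= Score1= Score2= Score3=;
run;
```

Replacing the quoted file name with the reserved word print (file print;) would send the same lines to the output device instead of to the text file.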
DATA _NULL_ steps are sometimes used to create custom reports. As a matter of fact, this type of report is referred to as DATA _NULL_ reporting. To control how SAS writes this output, you can use pointers and formats to specify exactly what columns to write to and how the values are to be formatted (see footnote 1).
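As a rough illustration of pointers and formats in a DATA _NULL_ report, the sketch below writes a small table to the output device. The column positions (1, 10, 20, 30) and the 3. format are arbitrary choices made for this example, and the libref learn is assumed to be assigned already.

```sas
data _null_;
   set learn.test_scores;
   file print;
   /* Write column headings once, before the first observation */
   if _n_ = 1 then put @1 'ID' @10 'Score1' @20 'Score2' @30 'Score3';
   /* @n moves the pointer to column n; 3. writes each score in 3 columns */
   put @1 ID @10 Score1 3. @20 Score2 3. @30 Score3 3.;
run;
```

Each @ pointer fixes where a value starts, so the columns line up regardless of the length of the values written before them.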
4.11 Problems
Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.
1. Run the program here to create a permanent SAS data set called Perm. You will need to modify the program to specify a folder where you want to place this data set. Run PROC CONTENTS on this data set and then use the SAS Explorer to investigate the properties of this data set as well.
Footnote 1: See Michele M. Burlew, SAS Guide to Report Writing: Examples, Second Edition (Cary, NC: SAS Institute Inc., 2005), for detailed information on how to use DATA _NULL_ for report writing.
libname learn 'c:\your-folder-name';
data learn.perm;
   input ID : $3. Gender : $1. DOB : mmddyy10. Height Weight;
   label DOB = 'Date of Birth'
         Height = 'Height in inches'
         Weight = 'Weight in pounds';
   format DOB date9.;
datalines;
001 M 10/21/1946 68 150
002 F 5/26/1950 63 122
003 M 5/11/1981 72 175
004 M 7/4/1983 70 128
005 F 12/25/2005 30 40
;
2. Run PROC PRINT on the data set you created in Problem 1. Use the SAS VIEWTABLE window to open this data set and compare the headings in the window to the column headings from your PROC PRINT. What is the difference?
3. Run this program to create a permanent SAS data set called Survey2007. Close your SAS session, open up a new session, and write the statements necessary to compute the mean age.
* Write your LIBNAME statement here;
data -fill in your data set name here- ;
   input Age Gender $ (Ques1-Ques5)($1.);
   /* See Chapter 21, Section 14 for a discussion of variable lists and format lists used above */
datalines;
23 M 15243
30 F 11123
42 M 23555
48 F 55541
55 F 42232
62 F 33333
68 M 44122
;
* Write your libname statement here;
proc means data= - insert the correct data set name -;
   var Age;
run;
Chapter 5
Creating Formats and Labels
5.1 Adding Labels to Your Variables
5.2 Using Formats to Enhance Your Output
5.3 Regrouping Values Using Formats
5.4 More on Format Ranges
5.5 Storing Your Formats in a Format Library
5.6 Permanent Data Set Attributes
5.7 Accessing a Permanent SAS Data Set with User-Defined Formats
5.8 Displaying Your Format Definitions
5.9 Problems
5.1 Adding Labels to Your Variables
If you are using SAS to produce listings and reports for others, you will want to make the output more readable and attractive. SAS formats and labels help you do this. They also help you to remember what each variable represents.
Many SAS procedures use variable labels to improve readability. You can create labels either in a DATA or PROC step. As an example, you can add labels to the variables in the Test_Scores data set like this:
Program 5-1 Adding labels to variables in a SAS data set
libname learn 'c:\books\learning';
data learn.test_scores;
   length ID $ 3 Name $ 15;
   input ID $ Score1-Score3;
   label ID = 'Student ID'
         Score1 = 'Math Score'
         Score2 = 'Science Score'
         Score3 = 'English Score';
datalines;
1 90 95 98
2 78 77 75
3 88 91 92
;
Labels are created with a LABEL statement. Following the keyword LABEL, you enter a variable name, followed by an equal sign, followed by your label, placed in single or double quotes. Labels can be up to 256 characters long (255 on UNIX platforms). You may continue with variable names and labels for as many variables as you want. Just make sure that you complete the LABEL statement with a semicolon. When you run certain SAS procedures, these labels are printed along with the variable names. For example, here is output from PROC MEANS, giving statistics on the three test scores:
Test Score Statistics
The MEANS Procedure
Variable   Label            N    Mean   Minimum   Maximum
---------------------------------------------------------
Score1     Math Score       3    85.3      78.0      90.0
Score2     Science Score    3    87.7      77.0      95.0
Score3     English Score    3    88.3      75.0      98.0
---------------------------------------------------------
Notice how the labels improve the readability of this output. If you include your LABEL statement in the DATA step, the labels remain associated with the respective variables; if you include your LABEL statement in a PROC step, the labels are used only for that procedure. This is because the label created in a DATA step is stored in the descriptor portion of the SAS data set.
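The difference between the two placements can be seen by putting a LABEL statement in a PROC step. The sketch below assumes the data set learn.test_scores from Program 5-1 exists; the label text here is invented for the example and is used only for this one procedure.

```sas
title "Test Score Statistics";
proc means data=learn.test_scores n mean min max maxdec=1;
   var Score1-Score3;
   /* These labels apply to this PROC MEANS run only (hypothetical label text) */
   label Score1 = 'First Exam'
         Score2 = 'Second Exam'
         Score3 = 'Third Exam';
run;
```

A later procedure run on the same data set without a LABEL statement would again show the labels stored in the descriptor portion (Math Score, Science Score, English Score).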
5.2 Using Formats to Enhance Your Output
SAS provides built-in formats to improve the appearance of printed output. For example, you can print financial data with dollar signs or add commas to large numbers. You can also create your own formats. For example, if you have a variable called Gender with values of M and F, you can format these values so that they print as Male and Female. If you have a variable representing age, you can use formats to display the values as age groups instead of actual ages. You can have one format for each variable or use one format for a group of variables. You create user-defined formats with PROC FORMAT; you associate your formats (or SAS built-in formats) with one or more variables in a FORMAT statement. A SAS data set called Survey shows how formats can be used. Here is a listing of this data set, without any formats:
Data Set SURVEY
ID    Gender   Age   Salary   Ques1   Ques2   Ques3   Ques4   Ques5
001   M         23    28000   1       2       1       2       3
002   F         55    76123   4       5       2       1       1
003   M         38    36500   2       2       2       2       1
004   F         67   128000   5       3       2       2       4
005   M         22    23060   3       3       3       4       2
006   M         63    90000   2       3       5       4       3
007   F         45    76100   5       3       4       3       3
Let’s see how formats can improve the readability of this listing:
Program 5-2 Using PROC FORMAT to create user-defined formats
proc format;
   value $gender 'M' = 'Male'
                 'F' = 'Female'
                 ' ' = 'Not entered'
                 other = 'Miscoded';
   value age low-29 = 'Less than 30'
             30-50 = '30 to 50'
             51-high = '51+';
   value $likert '1' = 'Strongly disagree'
                 '2' = 'Disagree'
                 '3' = 'No opinion'
                 '4' = 'Agree'
                 '5' = 'Strongly agree';
run;
You should notice several things about this procedure. First, you use a VALUE statement to create each user-defined format. Next, formats used with character variables start with a dollar sign. Following the format name are either unique values or ranges, an equal sign, and then the text you want to associate with each value or range of values. Rules concerning format names are the same as those for SAS variable names, with the exception that these names cannot end in a numeral. (SAS versions prior to SAS 9 allowed only 8-character format names.)
The first format to be defined is $GENDER. Format names do not need to be related to a variable name, but calling this format $GENDER makes it easier to remember that you will use it later to alter how the Gender values will be printed in SAS output. Values for Gender are stored as M and F. Associating the $GENDER format with the variable Gender results in M displaying as Male, F displaying as Female, and missing values displaying as Not entered. The keyword other in the VALUE statement causes the text Miscoded to be printed for any value other than M, F, or a missing value.
The format AGE is used to group ages into three categories. Notice that it is OK to use the same name for a format and a variable. (SAS knows that a name containing a period is a format.) If you apply this format to the variable Age, the age groups are printed instead of the actual ages. Remember that the internal values of SAS variables are not changed because they have been associated with a format. The format affects only how values print or, in some cases, how SAS procedures process a variable. (For example, PROC FREQ computes frequencies of formatted values rather than raw values; PROC MEANS uses formatted values for variables listed in the CLASS statement, and so forth.)
In the AGE format, the keywords LOW and HIGH refer to the lowest nonmissing value and the highest value, respectively. Note: The keyword LOW when used with character formats includes missing values. The last format, $LIKERT, is used to substitute the appropriate text for the numbers 1 (strongly disagree) to 5 (strongly agree).
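To see that a format changes how a procedure groups values without changing the stored values, you could run something like the following sketch. It assumes the libref learn is assigned and that learn.survey contains the seven observations listed earlier; the AGE format is repeated here so the example stands on its own.

```sas
proc format;
   value age low-29  = 'Less than 30'
             30-50   = '30 to 50'
             51-high = '51+';
run;

title "Age Groups from Formatted Values";
proc freq data=learn.survey;
   tables Age / nocum;
   format Age age.;   /* PROC FREQ counts the formatted values */
run;

title "Mean Age (internal values are unchanged)";
proc means data=learn.survey n mean min max;
   var Age;           /* statistics are computed from the actual ages */
run;
```

With the seven ages shown in the listing of Survey, PROC FREQ should report 2, 2, and 3 observations in the three age groups, while PROC MEANS still works with the original numeric ages.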
Let’s first see what happens if you place a FORMAT statement in PROC PRINT, as follows:
Program 5-3 Adding a FORMAT statement in PROC PRINT
title "Data Set SURVEY with Formatted Values";
proc print data=learn.survey;
   id ID;
   var Gender Age Salary Ques1-Ques5;
   format Gender $gender.
          Age age.
          Ques1-Ques5 $likert.
          Salary dollar11.2;
run;
Here the formats $GENDER and AGE are used to format the variables Gender and Age, respectively. The format $LIKERT formats the five variables Ques1 through Ques5. Notice that each format is followed by a period, just the same as built-in SAS formats. The format for Salary, DOLLAR11.2, is a SAS format. The name dollar indicates that you want to use the dollar format (which adds a dollar sign and commas to the value); the number 11 tells SAS to print a value using 11 columns; the 2 following the decimal point tells SAS that you want to print two digits to the right of the decimal point. The largest value for Salary using the DOLLAR11.2 format would be: $999,999.99
It is a good idea to make the total width a bit larger than you think you need, just in case your data contains a larger number than you expect. SAS has a set of rules that will allow numbers to print when you have not allocated enough columns. However, it is better to ensure you have enough columns and not be concerned with what happens when the format is too small. Before we show you the output, notice the ID statement in Program 5-3. When you include an ID statement in PROC PRINT, the variable (or variables) you list show up in the first column (or columns) of your report, replacing the Obs column that SAS usually displays in the first column. If you list a variable in an ID statement, don’t also list it in the VAR statement. If you do, it appears twice on the listing. If you have an ID variable such as Subject or ID, it is recommended that you use an ID statement.
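One way to experiment with format widths is to write a single value to the SAS log with several widths. This is only a sketch: the value is the largest one that fits DOLLAR11.2 according to the text above, and the widths 14 and 8 are arbitrary choices for comparison.

```sas
data _null_;
   Salary = 999999.99;
   put Salary dollar11.2;   /* exactly 11 columns: $999,999.99 */
   put Salary dollar14.2;   /* wider than needed: extra leading blanks */
   put Salary dollar8.2;    /* too narrow: SAS applies its own fallback rules */
run;
```

Comparing the three lines in the log shows why allowing a few extra columns is the safer choice.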
Here is the listing:
Data Set SURVEY with Formatted Values
ID    Gender   Age                 Salary   Ques1               Ques2
001   Male     Less than 30    $28,000.00   Strongly disagree   Disagree
002   Female   51+             $76,123.00   Agree               Strongly agree
003   Male     30 to 50        $36,500.00   Disagree            Disagree
004   Female   51+            $128,000.00   Strongly agree      No opinion
005   Male     Less than 30    $23,060.00   No opinion          No opinion
006   Male     51+             $90,000.00   Disagree            No opinion
007   Female   30 to 50        $76,100.00   Strongly agree      No opinion

ID    Ques3               Ques4               Ques5
001   Strongly disagree   Disagree            No opinion
002   Disagree            Strongly disagree   Strongly disagree
003   Disagree            Disagree            Strongly disagree
004   Disagree            Disagree            Agree
005   No opinion          Agree               Disagree
006   Strongly agree      Agree               No opinion
007   Agree               No opinion          No opinion
5.3 Regrouping Values Using Formats
You can use formats to group various values together. For example, suppose you want to see the survey results, but instead of looking at the five possible responses for Questions 1 through 5, you want to group the values 1 and 2 (strongly disagree and disagree) together and the values 4 and 5 (agree and strongly agree) together to make three categories for each question. You can accomplish this by creating a new format, as in Program 5-4:
Program 5-4 Regrouping values using a format
proc format;
   value $three '1','2' = 'Disagreement'
                '3' = 'No opinion'
                '4','5' = 'Agreement';
run;
You can then apply this to the Question variables in a procedure, as follows:
Program 5-5 Applying the new format to several variables with PROC FREQ
proc freq data=learn.survey;
   title "Question Frequencies Using the $three Format";
   tables Ques1-Ques5;
   format Ques1-Ques5 $three.;
run;
PROC FREQ, as you saw in Chapter 2, is used to count frequencies for the variables listed in the TABLES statement (Ques1–Ques5 in this case). Because of the FORMAT statement in this procedure, the tables have only three categories rather than the original five. Here is a partial listing of the output:
Question Frequencies Using the $three Format (partial listing)
The FREQ Procedure
                                       Cumulative   Cumulative
Ques1          Frequency    Percent     Frequency      Percent
--------------------------------------------------------------
Disagreement           3      42.86             3        42.86
No opinion             1      14.29             4        57.14
Agreement              3      42.86             7       100.00

                                       Cumulative   Cumulative
Ques2          Frequency    Percent     Frequency      Percent
--------------------------------------------------------------
Disagreement           2      28.57             2        28.57
No opinion             4      57.14             6        85.71
Agreement              1      14.29             7       100.00
5.4 More on Format Ranges
When you define a format, you can specify individual values or ranges to the left of the equal sign in your VALUE statement. As an example of how flexible this approach is, consider that you have a variable called Grade with values of A, B, C, D, F, I, and W. The following VALUE statement creates a format that places these grades into six categories:
   value $gradefmt 'A' - 'C' = 'Passing'
                   'D' = 'Borderline'
                   'F' = 'Failing'
                   'I','W' = 'Incomplete or withdrew'
                   ' ' = 'Not recorded'
                   other = 'Miscoded';
Here you see that grades A, B, or C will be formatted as Passing, D as Borderline, F as Failing, I or W as Incomplete or withdrew, missing values as Not recorded, and any other value as Miscoded. You may leave the quotes off the character ranges and the labels if you want. However, as a matter of style, we recommend that you use single or double quotes here. In Program 5-2, the ranges for the AGE format were defined like this:
   value age low-29 = 'Less than 30'
             30-50 = '30 to 50'
             51-high = '51+';
This is fine if this format is used with integer values. However, suppose you used this format with a variable that could take on values such as 29.5? This value falls between the two ranges low-29 and 30-50. You can make sure there are no cracks in your ranges like this:
   value age low-<30 = 'Less than 30'
             30-<51 = '30 to less than 51'
             51-high = '51+';
The first range includes all values less than 30 (which would include 29.5). The second range includes values from 30 to less than 51, and the last range includes values of 51 and above. You can also use a less than (<) sign on the left side of a range, which excludes the starting value from that range. For example, take a look at the following format:
   value age low-30 = 'Less than or equal to 30'
             30<-51 = 'Greater than 30 to 51'
             51<-high = 'Greater than 51';
So, if you know that your format may be used with noninteger values, be sure that there are no cracks in your ranges.
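A quick way to look for cracks is to apply a format to a few test values with the PUT function and inspect the result in the log. In this sketch, the format names AGEGAP and AGENOGAP and the variable names Gap and NoGap are made up for the comparison; they hold the two sets of ranges discussed above.

```sas
proc format;
   /* Ranges with cracks between 29 and 30 and between 50 and 51 */
   value agegap   low-29  = 'Less than 30'
                  30-50   = '30 to 50'
                  51-high = '51+';
   /* Ranges without cracks */
   value agenogap low-<30 = 'Less than 30'
                  30-<51  = '30 to less than 51'
                  51-high = '51+';
run;

data _null_;
   do Age = 29, 29.5, 30, 50.5, 51;
      Gap   = put(Age, agegap.);     /* formatted text using the first set of ranges */
      NoGap = put(Age, agenogap.);   /* formatted text using the second set */
      put Age= Gap= NoGap=;
   end;
run;
```

The values 29.5 and 50.5 match no range in AGEGAP, so Gap should show the number itself instead of a label, while NoGap should show a label for every test value.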
5.5 Storing Your Formats in a Format Library
As we mentioned earlier, if you place LABEL and FORMAT statements in the DATA step, the labels and formats become permanently associated with their respective variables. If you have user-defined formats with permanent SAS data sets, it is important to make your formats permanent also. Here are the steps to do this:
1. Create a library reference (libref) to indicate where you want to store your SAS formats. This can be the same library where you store your data sets.
2. Use the option LIBRARY=libref when you run PROC FORMAT. (Remember, you have to run this procedure only once.)
As an example, suppose you want to make the formats created in Program 5-2 permanent and save them in the c:\books\learning\formats folder. Note: On mainframe computers, the libraries are single sequential files. Program 5-6 creates a permanent format library for you.
Program 5-6 Creating a permanent format library
libname myfmts 'c:\books\learning\formats';
proc format library=myfmts;
   value $gender 'M' = 'Male'
                 'F' = 'Female'
                 ' ' = 'Not entered'
                 other = 'Miscoded';
   value age low-29 = 'Less than 30'
             30-50 = '30 to 50'
             51-high = '51+';
   value $likert '1' = 'Strongly disagree'
                 '2' = 'Disagree'
                 '3' = 'No opinion'
                 '4' = 'Agree'
                 '5' = 'Strongly agree';
run;
If you run this program on a Windows-based system, a file called formats.sas7bcat will be created in the folder specified by the libref.
5.6 Permanent Data Set Attributes
If you add your LABEL and FORMAT statements in the DATA step, the labels and formats become permanently associated with their respective variables. This makes for a very convenient way to document a data set. Another user could use PROC CONTENTS or the SAS Explorer to list the labels and formats used with each variable. Anytime you want to use a SAS data set with associated user-defined formats, you need to tell SAS where to look for these formats. By default, SAS will only look for its own formats, formats in the Work library (i.e., temporary formats), or formats in a library with the special name Library. If you want SAS to also look in one of your own libraries, you need to set the FMTSEARCH= system option. You can list one or more libraries for SAS to search using this option. For example, if you want to use the formats you placed in the Myfmts library, you would need to submit the following code:
   options fmtsearch=(myfmts);
If you do this, SAS first looks in the Work library, then the library called Library, and then the Myfmts library. If you want SAS to look in the Myfmts library before it looks in either of the other two libraries, you can name them in the FMTSEARCH= option like this:
   options fmtsearch=(myfmts work library);
Now, SAS searches the Myfmts library first and then the Work and Library libraries. Program 5-7 demonstrates how to make a permanent SAS data set with user-defined formats. (For this example, assume you have already created a permanent format library in your c:\books\learning\formats folder.)
Program 5-7 Adding LABEL and FORMAT statements in the DATA step
libname learn 'c:\books\learning';
libname myfmts 'c:\books\learning\formats';
options fmtsearch=(myfmts);
data learn.survey;
   infile 'c:\books\learning\survey.txt' pad;
   input ID : $3. Gender : $1. Age Salary (Ques1-Ques5)(: $1.);
   format Gender $gender.
          Age age.
          Ques1-Ques5 $likert.
          Salary dollar10.0;
   label ID = 'Subject ID'
         Gender = 'Gender'
         Age = 'Age as of 1/1/2006'
         Salary = 'Yearly Salary'
         Ques1 = 'The governor is doing a good job?'
         Ques2 = 'The property tax should be lowered'
         Ques3 = 'Guns should be banned'
         Ques4 = 'Expand the Green Acre program'
         Ques5 = 'The school needs to be expanded';
run;
Now, run PROC CONTENTS on this data set, as follows:
Program 5-8 Running PROC CONTENTS on a data set with labels and formats
title "Data set SURVEY";
proc contents data=learn.survey varnum;
run;
You obtain a listing that helps document the data set, like this (partial listing):
Variables in Creation Order
#   Variable   Type   Len   Format      Label
1   ID         Char     3               Subject ID
2   Gender     Char     1   $GENDER.    Gender
3   Age        Num      8   AGE.        Age as of 1/1/2006
4   Salary     Num      8   DOLLAR10.   Yearly Salary
5   Ques1      Char     1   $LIKERT.    The governor is doing a good job?
6   Ques2      Char     1   $LIKERT.    The property tax should be lowered
7   Ques3      Char     1   $LIKERT.    Guns should be banned
8   Ques4      Char     1   $LIKERT.    Expand the Green Acre program
9   Ques5      Char     1   $LIKERT.    The school needs to be expanded
5.7 Accessing a Permanent SAS Data Set with User-Defined Formats
If you want to use a permanent SAS data set that has user-defined formats, the only requirement is to remember to tell SAS where to find the formats. If you forget the FMTSEARCH= system option, you will get an error message telling you that SAS cannot find the formats. If you give a copy of a SAS data set with user-defined formats to another user, be sure to give them a copy of the format library as well. (On a PC platform, you need to give them a copy of the file formats.sas7bcat.) Here is an example of a program to compute frequencies on the variables Ques1–Ques5 in the permanent SAS data set Survey:
Program 5-9 Using a user-defined format
libname learn 'c:\books\learning';
libname myfmts 'c:\books\learning\formats';
options fmtsearch=(myfmts);
title "Using User-defined Formats";
proc freq data=learn.survey;
   tables Ques1-Ques5 / nocum;
run;
Once you submit the FMTSEARCH= option, you can use your own formats just as if they were built-in SAS formats.
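If you are unsure whether the option is in effect in your current session, one possible check is to ask PROC OPTIONS for its current value, as in this sketch (the libref and folder are the ones used in Program 5-9):

```sas
libname myfmts 'c:\books\learning\formats';
options fmtsearch=(myfmts);

/* Writes the current setting of FMTSEARCH= to the SAS log */
proc options option=fmtsearch;
run;
```

The log should show the list of libraries you named, which confirms where SAS will look for user-defined formats in addition to its default locations.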
5.8 Displaying Your Format Definitions
A useful PROC FORMAT option is FMTLIB. This option creates a listing of each format in the specified library with the ranges and labels. As an example, if you want to display the definitions of all the formats in your Myfmts library, you would submit the following code:
Program 5-10 Displaying format definitions in a user-created library
title "Format Definitions in the MYFMTS Library";
proc format library=myfmts fmtlib;
run;
You obtain a table like this:
+--------------------------------------------------------------------------+
| FORMAT NAME: AGE   LENGTH: 12   NUMBER OF VALUES: 3                      |
| MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH 12   FUZZ: STD           |
+----------------+----------------+----------------------------------------+
|START           |END             |LABEL (VER. V7|V8 11OCT2005:14:17:09)   |
+----------------+----------------+----------------------------------------+
|LOW             |              29|Less than 30                            |
|              30|              50|30 to 50                                |
|              51|HIGH            |51+                                     |
+----------------+----------------+----------------------------------------+
+--------------------------------------------------------------------------+
| FORMAT NAME: $GENDER   LENGTH: 11   NUMBER OF VALUES: 4                  |
| MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH 11   FUZZ: 0             |
+----------------+----------------+----------------------------------------+
|START           |END             |LABEL (VER. V7|V8 11OCT2005:14:17:09)   |
+----------------+----------------+----------------------------------------+
|                |                |Not entered                             |
|F               |F               |Female                                  |
|M               |M               |Male                                    |
|**OTHER**       |**OTHER**       |Miscoded                                |
+----------------+----------------+----------------------------------------+
+--------------------------------------------------------------------------+
| FORMAT NAME: $LIKERT   LENGTH: 17   NUMBER OF VALUES: 5                  |
| MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH 17   FUZZ: 0             |
+----------------+----------------+----------------------------------------+
|START           |END             |LABEL (VER. V7|V8 11OCT2005:14:17:09)   |
+----------------+----------------+----------------------------------------+
|1               |1               |Strongly disagree                       |
|2               |2               |Disagree                                |
|3               |3               |No opinion                              |
|4               |4               |Agree                                   |
|5               |5               |Strongly agree                          |
+----------------+----------------+----------------------------------------+
If you want to see only specific formats in a format library, you can add a SELECT statement to your PROC FORMAT. You list the formats you want displayed following the keyword SELECT. When you use a SELECT statement, you do not also have to include the FMTLIB option. For example, to display only the AGE and $LIKERT formats, you could use the following program:
Program 5-11 Demonstrating a SELECT statement with PROC FORMAT
proc format library=myfmts;
   select age $likert;
run;
There is also an EXCLUDE statement that enables you to name the formats you do not want to see displayed. Please refer to Chapter 22 for more advanced uses of both SAS and user-defined formats.
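For comparison with Program 5-11, an EXCLUDE statement might be used as in the following sketch. It assumes the Myfmts library contains the three formats created in Program 5-6; the FMTLIB option is written out here for clarity.

```sas
title "All Formats in MYFMTS Except $GENDER";
proc format library=myfmts fmtlib;
   /* List every format in the library except the one named here */
   exclude $gender;
run;
```

With only AGE, $GENDER, and $LIKERT in the library, this should display the same two formats that Program 5-11 selects.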
5.9 Problems
Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.
1. Run the program here to create a temporary SAS data set called Voter:
data voter;
   input Age Party : $1. (Ques1-Ques4)($1. + 1);
datalines;
23 D 1 1 2 2
45 R 5 5 4 1
67 D 2 4 3 3
39 R 4 4 4 4
19 D 2 1 2 1
75 D 3 3 2 3
57 R 4 3 4 4
;
Add formats for Age (0–30, 31–50, 51–70, 71+), Party (D = Democrat, R = Republican), and Ques1–Ques4 (1=Strongly Disagree, 2=Disagree, 3=No Opinion, 4=Agree, 5=Strongly Agree). In addition, label Ques1–Ques4 as follows:
Ques1   The president is doing a good job
Ques2   Congress is doing a good job
Ques3   Taxes are too high
Ques4   Government should cut spending
Note: Use PROC PRINT to list the observations in this data set and PROC FREQ to list frequencies for the four questions. (The default action of PROC PRINT is to head each column with a variable name, not the label. To use labels as column headings, use the LABEL option with PROC PRINT.)
2. You want to see frequencies for Questions 1 to 4 from the previous question. However, you want only three categories: Generally Disagree (combine Strongly Disagree and Disagree), No Opinion, and Generally Agree (combine Agree and Strongly Agree). Accomplish this using a new format for Ques1–Ques4.
3. Run the following program to create a SAS data set called Colors (see Chapter 21 for a discussion of the double at signs [@@] in the INPUT statement):
data colors;
   input Color : $1. @@;
datalines;
R R B G Y Y . . B G R B G Y P O O V V B
;
Use a format to group the colors as follows:
R, B, G = Group 1
Y, O = Group 2
Missing = Not Given
All others = Group 3
Use PROC FREQ to list the frequencies of the color groups.
4. Make a permanent SAS data set from data set Voter in Problem 1. Place this data set in a folder of your choice. Make the labels and formats permanent attributes in this data set and make your formats permanent as well (place them in the same library as the data set). Use the FMTLIB option with PROC FORMAT when you run this procedure.
5. Write the necessary statements to make three permanent formats in a library of your choice. Use the FMTLIB option to list each of these formats. The formats are defined as follows:
YESNO     1 = Yes, 0 = No
$YESNO    Y = Yes, N = No
$Gender   M = Male, F = Female
age20yr   low-20 = 1, 21-40 = 2, 41-60 = 3, 61-80 = 4, 81-high = 5
Chapter 6

Reading and Writing Data from an Excel Spreadsheet

6.1 Introduction
6.2 Using the Import Wizard to Convert a Spreadsheet to a SAS Data Set
6.3 Creating an Excel Spreadsheet from a SAS Data Set
6.4 Using an Engine to Read an Excel Spreadsheet
6.5 Using the SAS Output Delivery System to Convert a SAS Data Set to an Excel Spreadsheet
6.6 Problems
6.1 Introduction

It is quite common to be given a Microsoft Office Excel spreadsheet as your data source. Luckily, SAS has several methods to easily convert a spreadsheet into a SAS data set. One way is to convert the spreadsheet into a comma-separated values (CSV) file and to read the file using INFILE statements (using the DSD option; see Chapter 3 for more information) and INPUT statements. However, if you have licensed SAS/ACCESS Interface to PC Files, you can have SAS do the conversion automatically.
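The CSV route mentioned above might look like the following minimal sketch. The file name wages.csv, the variable names, and the informats are assumptions made for illustration; the right choices depend on how the spreadsheet was saved.

```sas
/* Assumption: wages.csv is a hypothetical file saved from Excel, */
/* with column headings in the first row.                         */
data wages_csv;
   infile 'c:\books\learning\wages.csv' dsd firstobs=2 truncover;
   /* The colon modifier applies an informat while still using    */
   /* the comma delimiter supplied by the DSD option.             */
   input Subject : $3.
         Date : mmddyy10.
         Hours_Worked
         Rate_Per_Hour
         Wage : dollar12.;
   format Date date9. Wage dollar10.2;
run;
```

FIRSTOBS=2 skips the heading row. If the dates in your file are written in another style, substitute the matching informat.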
6.2 Using the Import Wizard to Convert a Spreadsheet to a SAS Data Set

The spreadsheet here was created using Excel. The Wage column (E) was computed by multiplying the Hours Worked column (C) by the Rate Per Hour column (D). Notice that the column headings are not valid SAS variable names. In addition, this spreadsheet has two worksheets: one named Temporary (for temporary workers) and the other named Permanent (for permanent employees).
The first step is to select Import Data from the SAS File menu, as shown here:
This brings up a screen where you can select from a variety of formats (Excel, Access, dBase, Lotus, and several others). Select Microsoft Excel from the pull-down menu.
Click Next to bring up the next screen. Here you can either type in the name for the Excel file you want to read, or select Browse to obtain a list of the Excel spreadsheets on your system.
If you have multiple worksheets, you can select the one you want to import. The default worksheet name is Sheet1$ (SAS places a dollar sign after the worksheet name). Any named ranges that have been created in the worksheet are also listed. Worksheet names end in a $; the names of named ranges don’t.
Once you have selected the table you want to import, click Next. You are presented with a screen where you can enter a library (Work, if you want a temporary SAS data set, or a libref created with a LIBNAME statement) and the name of the SAS data set (labeled as Member in the destination screen). See the following:
If you then select Finish, SAS imports your spreadsheet. If you select Next, you can choose to save the conversion program (PROC IMPORT). If you save this program (you will be prompted for the name of a file in which to save the program), you can run it later to perform your conversion. You are done. That’s all there is to it. In this example, you chose the Work library and called your SAS data set Wages_Temporary. It is a good idea to list the contents of this data set using the SAS viewer or PROC PRINT before attempting to use it for reports or further analysis. If the data set is large, you can list the first few observations in the data set by using the OBS=n data set option to limit the number of observations you want to process. For example, to see the first four observations in data set Wages_Temporary, you could run the following program:
Program 6-1 Using PROC PRINT to list the first four observations in a data set

title "The First Four Observations of WAGES_TEMPORARY";
proc print data=wages_temporary(obs=4);
run;
Here is the output:

The First Four Observations of WAGES_TEMPORARY

Obs   Subject   Date        Hours_Worked   Rate_Per_Hour   Wage
  1   M12       12DEC2006             40           15.25     $610.00
  2   F34       16NOV2006             35           25.57     $894.95
  3   M76       03JUN2006             10           54.00     $540.00
  4   F87       06APR2006             45           35.00   $1,575.00
Notice that SAS has created valid variable names from the Excel column headings. It replaces any invalid characters in column headings (in this case, blanks) with an underscore (_).

It is also a good idea to use the SAS Explorer or PROC CONTENTS to view the type (character or numeric) and length of each variable in this data set. You may need to use PROC DATASETS to change the format of one or more variables or to write a short DATA step to perform a character-to-numeric conversion for variables that you want to be numeric but, for one reason or another, wound up as character.

Using the OBS=n data set option is a very useful way to check the data values of very large data sets. By the way, while we are on the topic, you can combine the OBS=n data set option with the FIRSTOBS=m option. The value of FIRSTOBS= defines the first observation you want to process; the value of the OBS= option is the last observation you want to process. It is a good idea to think of OBS= as LASTOBS. Suppose you want to list observations 100 through 110 in a very large SAS data set called Verybig in a library with a libref of Project. You would combine the FIRSTOBS= and OBS= options like this:
Program 6-2 Using the FIRSTOBS= and OBS= options together

title "Observations 100 through 110 in VERYBIG";
proc print data=project.verybig(firstobs=100 obs=110);
run;
Program 6-2 results in a listing containing 11 observations.
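The conversion program that the Import Wizard offers to save is a PROC IMPORT step. The following is a minimal sketch of what such a step might look like for the Temporary worksheet, assuming the workbook is stored as c:\books\learning\wages.xls and that SAS/ACCESS Interface to PC Files is licensed. The statements the wizard actually writes can differ by SAS version.

```sas
/* Sketch only: the exact statements saved by the wizard may differ */
proc import out=work.wages_temporary
            datafile='c:\books\learning\wages.xls'
            dbms=excel replace;
   sheet='Temporary';   /* worksheet to read                        */
   getnames=yes;        /* use the first row for variable names     */
run;
```

REPLACE overwrites an existing Work.Wages_Temporary data set, so omit it if you want SAS to stop rather than overwrite.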
6.3 Creating an Excel Spreadsheet from a SAS Data Set

Before we leave this chapter, let’s see how you can use the Export Wizard to convert a SAS data set into an Excel spreadsheet. From the File menu, select Export Data and then select Microsoft Excel. Suppose you want to convert your Sales data set (located in the c:\books\learning folder) to an Excel spreadsheet. First, you need to be sure you have a library reference to the folder. For example, issue the following statement:

libname learn 'c:\books\learning';
Then, follow these steps: From the File menu, select Export Data.
On the next screen, enter the LIBNAME and the name of the SAS data set you want to export (Sales in this example) and click Next.
Next, select Microsoft Excel from the pull-down menu.
You can browse or enter the name of the Excel spreadsheet.
You can also name the specific table.
Select Finish and SAS writes the data to an Excel spreadsheet (or select Next to write the PROC EXPORT statements to a file). Here, the file is called sales.xls (in the c:\books\learning folder) and the table name is Table1.
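The statements that the Export Wizard can save form a PROC EXPORT step. A minimal sketch for this example, assuming SAS/ACCESS Interface to PC Files is licensed, might read as follows. The wizard's own output may contain different or additional statements.

```sas
libname learn 'c:\books\learning';

/* Sketch only: write LEARN.SALES to a worksheet named Table1 */
proc export data=learn.sales
            outfile='c:\books\learning\sales.xls'
            dbms=excel replace;
   sheet='Table1';
run;
```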
6.4 Using an Engine to Read an Excel Spreadsheet

You can have SAS treat an Excel spreadsheet as if it were a SAS data set by using the XLS engine. As an example, suppose you want to run a SAS procedure with the data in a spreadsheet called wages.xls in your c:\books\learning folder. The following LIBNAME statement enables you to access this spreadsheet directly:

libname readit 'c:\books\learning\wages.xls';
You can now access any of the worksheets within this file. This particular spreadsheet file contains two worksheets, Temporary and Permanent. Suppose you want to compute the mean of Wage and Hours Worked in the Permanent worksheet. Here are the SAS statements to do that:
Program 6-3 Reading a spreadsheet using an XLS engine

title "Statistics from Sales Spreadsheet";
proc means data=readit.'Permanent$'n mean;
   var Wage Hours_Worked;
run;
There are several important points to notice in this program. First, because SAS requires you to follow the worksheet name with a dollar sign and because dollar signs are not normally allowed in SAS data set names, you need to use a name literal to do this. In Program 6-3, you place the worksheet name (Permanent$) in single quotes and follow this with an n. This notation allows you to use invalid characters as part of SAS names. Next, remember that the column heading Hours Worked is not a valid SAS variable name and that SAS substitutes an underscore for each invalid character in a variable name. Thus, you need to refer to this column as Hours_Worked. Here is the output from PROC MEANS:

Statistics from Sales Spreadsheet

The MEANS Procedure

Variable        Label                   Mean
--------------------------------------------
Wage            Wage                 2568.40
Hours_Worked    Hours Worked      37.2000000
--------------------------------------------
Once you have created the appropriate LIBNAME statement, you can treat your spreadsheets as if they were SAS data sets.
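As one illustration of treating a worksheet like a SAS data set, the sketch below copies the Temporary worksheet into a temporary SAS data set and then releases the libref. The data set name Temporary_Copy is an assumption chosen for this example.

```sas
libname readit 'c:\books\learning\wages.xls';

/* Copy the worksheet into WORK (Temporary_Copy is a hypothetical name) */
data temporary_copy;
   set readit.'Temporary$'n;
run;

/* Clearing the libref releases the connection to the spreadsheet file */
libname readit clear;
```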
6.5 Using the SAS Output Delivery System to Convert a SAS Data Set to an Excel Spreadsheet

You can use the Output Delivery System (ODS) to create CSV files that Excel can open directly. For more information on ODS, refer to Chapter 19.
As an example, suppose you want to send the contents of the permanent SAS data set Survey to Excel. Program 6-4 creates a CSV file from the SAS data set:
Program 6-4 Using ODS to convert a SAS data set into a CSV file (to be read by Excel)

libname learn 'c:\books\learning';
ods csv file='c:\books\learning\odsexample.csv';
proc print data=learn.survey noobs;
run;
ods csv close;
The ODS CSV statement opens the CSV file as an output destination. Notice that the NOOBS option of PROC PRINT is used to remove the OBS column from the output. It is important to close the file with an ODS CSV CLOSE statement following the PROC PRINT. Here is a listing of the CSV file:

File c:\books\learning\odsexample.csv

"ID","Gender","Age","Salary","Ques1","Ques2","Ques3","Ques4","Ques5"
001,"M",23,28000,1,2,1,2,3
002,"F",55,76123,4,5,2,1,1
003,"M",38,36500,2,2,2,2,1
004,"F",67,128000,5,3,2,2,4
005,"M",22,23060,3,3,3,4,2
006,"M",63,90000,2,3,5,4,3
007,"F",45,76100,5,3,4,3,3
You can open this file directly into Excel with the following result:
As you can see, SAS can read and write Excel data very easily. Be sure to check the resulting files following a transfer to ensure that data values, especially dates, were processed properly. In addition, if you are creating a SAS data set, be sure to run PROC CONTENTS or use the SAS Explorer to verify variable types and lengths.
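The check and the character-to-numeric conversion mentioned in this chapter could be sketched as follows. It assumes, purely for illustration, that Hours_Worked arrived in Work.Wages_Temporary as a character variable; Hours_Char and Wages_Fixed are hypothetical names.

```sas
/* List variable names, types, and lengths */
proc contents data=work.wages_temporary;
run;

/* Hypothetical case: Hours_Worked was imported as character */
data wages_fixed;
   set work.wages_temporary(rename=(Hours_Worked=Hours_Char));
   /* The INPUT function reads the character value with a numeric informat */
   Hours_Worked = input(Hours_Char, 12.);
   drop Hours_Char;
run;
```

Renaming the original variable on the SET statement lets the new numeric variable keep the original name.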
6.6 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Download the spreadsheet drugtest.xls from the SAS Web site. Use the SAS Explorer to convert this to a temporary SAS data set called Drugtest. Use PROC PRINT to list the observations in this data set.

2. Run the following program to create a CSV file. Substitute a folder of your choice for the one specified in the program:

data soccer;
   input Team : $20. Wins Losses;
datalines;
Readington 20 3
Raritan 10 10
Branchburg 3 18
Somerville 5 18
;
options nodate nonumber;
title;
ods listing close;
ods csv file='c:\books\learning\soccer.csv';
proc print data=soccer noobs;
run;
ods csv close;
ods listing;
Open Excel on your computer and open the CSV file (you will have to change the file type to .csv). It should look like this:
Save this as a spreadsheet using the File > Save As pull-down menu and naming the file soccer.xls. Now, use the SAS Import Wizard to convert this spreadsheet into a permanent SAS data set called Soccer in a folder of your choice.

3. Read the file soccer.xls created in Problem 2 using an XLS engine. The table name is SOCCER. (You will need to use a name constant 'SOCCER$'n to read this file.)
Chapter 7

Performing Conditional Processing

7.1 Introduction
7.2 The IF and ELSE IF Statements
7.3 The Subsetting IF Statement
7.4 The IN Operator
7.5 Using a SELECT Statement for Logical Tests
7.6 Using Boolean Logic (AND, OR, and NOT Operators)
7.7 A Caution When Using Multiple OR Operators
7.8 The WHERE Statement
7.9 Some Useful WHERE Operators
7.10 Problems
7.1 Introduction

This chapter describes the tools that allow programs to “make decisions” based on data values. For example, you may want to read a value of Age and create a variable that represents age groups. You may want to determine if values for a particular variable are within predefined limits. Programs that perform any of these operations require conditional processing: the ability to make logical decisions based on data values.
7.2 The IF and ELSE IF Statements

Two of the basic tools for conditional processing are the IF and ELSE IF statements. To understand how these statements work, suppose you have collected data on a group of students based on the following variables:

   Age (in years)
   Gender (recorded as M or F)
   Midterm (grade on the midterm exam)
   Quiz (quiz grade from F to A+)
   FinalExam (grade on the final exam)
Your first task is to create a new variable that represents age groups. Here is a first attempt. Note: This program is not correct.
Program 7-1 First attempt to group ages into age groups (incorrect)

data conditional;
   length Gender $ 1 Quiz $ 2;
   input Age Gender Midterm Quiz FinalExam;
   if Age lt 20 then AgeGroup = 1;
   if Age ge 20 and Age lt 40 then AgeGroup = 2;
   if Age ge 40 and Age lt 60 then AgeGroup = 3;
   if Age ge 60 then AgeGroup = 4;
datalines;
21 M 80 B- 82
. F 90 A 93
35 M 87 B+ 85
48 F . . 76
59 F 95 A+ 97
15 M 88 . 93
67 F 97 A 91
. M 62 F 67
35 F 77 C- 77
49 M 59 C 81
;
title "Listing of CONDITIONAL";
proc print data=conditional noobs;
run;
A complete list of the logical comparison operators is displayed in the following table.

Logical Comparison         Mnemonic   Symbol
Equal to                   EQ         =
Not equal to               NE         ^= or ~= or ¬= *
Less than                  LT         <
Less than or equal to      LE         <=
Greater than               GT         >
Greater than or equal to   GE         >=
Equal to one in a list     IN

* The symbol you use depends on the type of terminal.
Let’s follow the logic of Program 7-1. The first IF statement asks if Age is less than 20. When the logical expression following the keyword IF is true, the statement following the word THEN is executed; if the expression is not true, the program continues to process the statements in the DATA step.

There is one serious problem with this program’s logic and it relates to how SAS treats missing numeric values. Missing numeric values are treated logically as the most negative number you can reference on your computer. Therefore, the first IF statement will be true for missing values as well as for all ages less than 20. This is a very important point. It is a good example of a program that has no syntax errors, runs without any warning or error messages in the log, and produces incorrect results.

There are several ways to prevent your missing values from being included in AgeGroup 1. Here are several options:

if Age lt 20 and Age ne . then AgeGroup = 1;
if Age ge 0 and Age lt 20 then AgeGroup = 1;
if 0 le Age lt 20 then AgeGroup = 1;
if Age lt 20 and not missing(Age) then AgeGroup = 1;
All of these statements result in the correct value for AgeGroup. Subjects with a missing value for Age will also have a missing value for AgeGroup. The first IF statement uses the fact that you refer to a numeric missing value in a DATA step by a period. The last IF statement uses the MISSING function. This function returns a value of TRUE if the argument (the variable in parentheses) is missing, and FALSE if the argument is not missing. This program can be improved further. If a person is less than 20 years of age, the first IF statement is true and AgeGroup is set to 1. All the remaining IF statements are still executed (although they will all be false and AgeGroup remains 1). So, a better way to write this program is to change all the IF statements after the first one to ELSE IF statements. Here is what the corrected program looks like:
Program 7-2 Corrected program to group ages into age groups

data conditional;
   length Gender $ 1 Quiz $ 2;
   input Age Gender Midterm Quiz FinalExam;
   if Age lt 20 and not missing(Age) then AgeGroup = 1;
   else if Age ge 20 and Age lt 40 then AgeGroup = 2;
   else if Age ge 40 and Age lt 60 then AgeGroup = 3;
   else if Age ge 60 then AgeGroup = 4;
datalines;
Using this logic, when any of the IF statements is true, all the following ELSE statements are not evaluated. This saves on processing time. An alternative way to write this program is to test for a missing value in the first IF statement and use the ELSE statements to advantage, as follows:
Program 7-3 An alternative to Program 7-2

data conditional;
   length Gender $ 1 Quiz $ 2;
   input Age Gender Midterm Quiz FinalExam;
   if missing(Age) then AgeGroup = .;
   else if Age lt 20 then AgeGroup = 1;
   else if Age lt 40 then AgeGroup = 2;
   else if Age lt 60 then AgeGroup = 3;
   else if Age ge 60 then AgeGroup = 4;
datalines;
When you write a program like this, you need to “play computer” and make sure your logic is correct. For example, what if a person is 25 years old? The first IF statement is false because Age is not missing. The next ELSE IF statement is evaluated and found to be false as well. Finally, the third IF statement is evaluated and AgeGroup is set equal to 2. Because this IF statement is true, all the remaining ELSE IF statements are skipped. If you are working with very large data sets and want to squeeze every last drop out of the efficiency tank, you should place the IF statements in order, from the ones most likely to have a true condition to the ones least likely to have a true condition. This increases efficiency because SAS skips testing all the ELSE conditions when a previous IF condition is true.
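One way to “play computer” with help from SAS is to run the logic on a few boundary values. The sketch below is an illustration, not part of the original programs; the data set name Check_Missing and the variable LessThan20 are hypothetical.

```sas
data check_missing;
   input Age;
   /* A comparison returns 1 (true) or 0 (false);                 */
   /* a missing Age compares as less than 20, so this is 1 there  */
   LessThan20 = (Age lt 20);
   /* Same logic as Program 7-3 */
   if missing(Age) then AgeGroup = .;
   else if Age lt 20 then AgeGroup = 1;
   else if Age lt 40 then AgeGroup = 2;
   else if Age lt 60 then AgeGroup = 3;
   else if Age ge 60 then AgeGroup = 4;
datalines;
.
15
25
59
60
;
title "Checking the Age Group Logic";
proc print data=check_missing noobs;
run;
```

In the listing, the observation with a missing Age should show LessThan20 equal to 1 and AgeGroup missing, which is the behavior that Program 7-1 got wrong.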
7.3 The Subsetting IF Statement

If you want to take a subset of data from a raw data file or from an existing SAS data set, you can use a special form of an IF statement called a subsetting IF. As an example, let’s use the raw data from Program 7-1, but restrict the resulting data set to females only. Here is the program:
Program 7-4 Demonstrating a subsetting IF statement

data females;
   length Gender $ 1 Quiz $ 2;
   input Age Gender Midterm Quiz FinalExam;
   if Gender eq 'F';
datalines;
21 M 80 B- 82
. F 90 A 93
35 M 87 B+ 85
48 F . . 76
59 F 95 A+ 97
15 M 88 . 93
67 F 97 A 91
. M 62 F 67
35 F 77 C- 77
49 M 59 C 81
;
title "Listing of FEMALES";
proc print data=Females noobs;
run;
Notice that there is no THEN following the IF in this program. If the condition is true, the program continues to the next statement; if the condition is false, control returns to the top of the DATA step. In this case, the only statement following the subsetting IF statement is a DATALINES statement. If the value of Gender is F, the end of the DATA step is reached and an automatic output occurs. If the value of Gender is not equal to F, control returns to the top of the DATA step and the automatic output does not occur. Here is the output:

Listing of FEMALES

Gender   Quiz   Age   Midterm   FinalExam
F        A        .        90          93
F                48         .          76
F        A+      59        95          97
F        A       67        97          91
F        C-      35        77          77
A more efficient way to write Program 7-4 would be to read a value of Gender first. Then if it is an F, continue reading the rest of the data values; if not, return to the top of the DATA step and do not output an observation. To see how this can be accomplished, look at Section 11 in Chapter 21.
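As a preview of that technique, here is a minimal sketch that uses a trailing at sign (@) to hold the data line after Gender is read. Because these data are read with list input, Age has to be read along with Gender. This is one possible arrangement and may differ from the version in Chapter 21.

```sas
data females;
   length Gender $ 1 Quiz $ 2;
   /* The trailing @ holds the line for the next INPUT statement */
   input Age Gender @;
   /* For non-females, stop here and return to the top           */
   if Gender eq 'F';
   /* Read the remaining values only for females                 */
   input Midterm Quiz FinalExam;
datalines;
21 M 80 B- 82
. F 90 A 93
35 M 87 B+ 85
48 F . . 76
;
title "Listing of FEMALES";
proc print data=females noobs;
run;
```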
7.4 The IN Operator

If you want to test if a value is one of several possible choices, you can use multiple OR operators, like this:

if Quiz = 'A+' or Quiz = 'A' or Quiz = 'B+' or Quiz = 'B'
   then QuizRange = 1;
else if Quiz = 'B-' or Quiz = 'C+' or Quiz = 'C'
   then QuizRange = 2;
else if not missing(Quiz) then QuizRange = 3;
These statements can be simplified by using the IN operator, like this:

if Quiz in ('A+' 'A' 'B+' 'B') then QuizRange = 1;
else if Quiz in ('B-' 'C+' 'C') then QuizRange = 2;
else if not missing(Quiz) then QuizRange = 3;

The list of values in parentheses following the IN operator can be separated by blanks or commas. The first line could also be written like this:

if Quiz in ('A+','A','B+','B') then QuizRange = 1;

You can also use the IN operator with numeric variables. For example, if you had a numeric variable called Subject and you wanted to list observations for Subject numbers 10, 22, 25, and 33, the following WHERE statement could be used:

where Subject in (10 22 25 33);

As with the example using character values, you may separate the values with spaces or commas. You can also specify a range of numeric values, using a colon to separate values in the list. For example, to list observations where Subject is a 10, 22–25, or 30, you can write:

where Subject in (10,22:25,30);
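To try the numeric form of the IN operator, you could run a small self-contained test such as the sketch below. The data set Subjects and its values are made up for this illustration.

```sas
/* Hypothetical test data: one Subject number per observation */
data subjects;
   input Subject @@;
datalines;
5 10 22 23 24 25 26 30 33
;
title "Subjects 10, 22 through 25, and 30";
proc print data=subjects noobs;
   where Subject in (10,22:25,30);
run;
```

Subjects 5, 26, and 33 are not in the list, so they should be absent from the listing.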
7.5 Using a SELECT Statement for Logical Tests

A SELECT statement provides an alternative to a series of IF and ELSE IF statements. Here is one way to use a SELECT statement:

select (AgeGroup);
   when (1) Limit = 110;
   when (2) Limit = 120;
   when (3) Limit = 130;
   otherwise;
end;
The expression following the SELECT keyword is referred to as a select-expression; the expression following a WHEN keyword is referred to as a when-expression. In this example, the select-expression (AgeGroup) is compared to each of the when-expressions. If the comparison is true, the statement following the when-expression is executed and control skips to the end of the SELECT group. If the comparison is false, the next when-expression is compared to the select-expression. If none of the comparisons is evaluated to be true, the statement following the OTHERWISE keyword is executed. As you can see in this example, the OTHERWISE statement can be a null statement. It is a good idea to include an OTHERWISE statement because, if you omit it and none of the preceding comparisons is true, the DATA step stops with an error.

You can place more than one value in the when-expression, like this:

select (AgeGroup);
   when (1) Limit = 110;
   when (2) Limit = 120;
   when (3,5) Limit = 130;
   otherwise;
end;
In this example, AgeGroup values of 3 or 5 will set Limit equal to 130.
To help clarify this concept, let’s follow some scenarios:

   If AgeGroup is equal to 1, Limit will be 110.
   If AgeGroup is equal to 3, Limit will be equal to 130.
   If AgeGroup is equal to 4, Limit will be a missing value (because it is set to a missing value in the PDV at each iteration of the DATA step and it is never assigned a value).

If you do not supply a select-expression, each WHEN statement is evaluated to determine if the when-expression is true or false. As an example, here is Program 7-3, rewritten using a SELECT statement:
Program 7-5 Demonstrating a SELECT statement when a select-expression is missing

data conditional;
   length Gender $ 1 Quiz $ 2;
   input Age Gender Midterm Quiz FinalExam;
   select;
      when (missing(Age)) AgeGroup = .;
      when (Age lt 20) AgeGroup = 1;
      when (Age lt 40) AgeGroup = 2;
      when (Age lt 60) AgeGroup = 3;
      when (Age ge 60) AgeGroup = 4;
      otherwise;
   end;
datalines;
Notice that there is no select-expression in this SELECT statement. Each when-expression is evaluated and, if true, the statement following the expression is executed.
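The scenarios for the SELECT group with a select-expression can be checked with a short, self-contained test. The data set name Limits and the input values are assumptions chosen for this sketch.

```sas
/* Hypothetical test data covering each WHEN plus the OTHERWISE case */
data limits;
   input AgeGroup @@;
   select (AgeGroup);
      when (1) Limit = 110;
      when (2) Limit = 120;
      when (3,5) Limit = 130;
      otherwise;            /* null statement: Limit stays missing */
   end;
datalines;
1 2 3 4 5
;
title "Listing of LIMITS";
proc print data=limits noobs;
run;
```

AgeGroup values of 3 and 5 should both show a Limit of 130, and an AgeGroup of 4 should show a missing Limit.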
7.6 Using Boolean Logic (AND, OR, and NOT Operators)

You can combine various logical operators (also known as Boolean operators) to form fairly complex statements. As an example, the data set Medical contains information on clinic, diagnosis (DX), and weight. A program to list all patients who were seen at the HMC clinic and who either had a diagnosis of 7 or 9 or weighed over 180 pounds demonstrates how to combine various Boolean operators:
Program 7-6 Combining various Boolean operators

title "Example of Boolean Expressions";
proc print data=learn.medical;
   where Clinic eq 'HMC' and (DX in ('7' '9') or Weight gt 180);
   id Patno;
   var Patno Clinic DX Weight VisitDate;
run;
Notice the parentheses around the two conditions separated by OR. The AND operator has precedence over the OR operator. That is, a statement such as the following:

if X and Y or Z;

is the same as this one:

if (X and Y) or Z;

If you want to perform the OR operation before the AND operation, use parentheses, like this:

if X and (Y or Z);

In Program 7-6, you want the clinic to be HMC and you want either the diagnosis to be one of two values or the weight to be over 180 pounds. You need the parentheses to first decide if either the diagnosis or weight condition is true before performing the AND operation with the Clinic variable. Even in cases where parentheses are not needed, it is fine to include them so that the logical statements are easier to read and understand.

The NOT operator has the highest precedence, so it is performed before AND. For example, the statement:

if X and not Y or Z;

is equivalent to:

if (X and (not Y)) or Z;
Here is the output from Program 7-6:

Example of Boolean Expressions

Patno   Clinic   DX    Weight   VisitDate
004     HMC      9        288   11/11/2006
050     HMC      123      199   07/06/2006
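The precedence rules described in this section can be confirmed with a small experiment. In this sketch, the data set Precedence and the variables NoParens, AndFirst, and OrFirst are hypothetical names; each holds 1 when its expression is true and 0 when it is false.

```sas
data precedence;
   input X Y Z;
   NoParens = (X and Y or Z);      /* evaluated as (X and Y) or Z */
   AndFirst = ((X and Y) or Z);
   OrFirst  = (X and (Y or Z));
datalines;
0 1 1
1 0 0
1 0 1
;
title "Checking Operator Precedence";
proc print data=precedence noobs;
run;
```

For the first observation (X=0, Y=1, Z=1), NoParens and AndFirst are both 1 while OrFirst is 0, which shows that the parentheses change the result.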
7.7 A Caution When Using Multiple OR Operators

Look at the short program here:

Program 7-7 A caution on the use of multiple OR operators

data believe_it_or_not;
   input X;
   if X = 3 or 4 then Match = 'Yes';
   else Match = 'No';
datalines;
3
7
.
;
title "Listing of BELIEVE_IT_OR_NOT";
proc print data=believe_it_or_not noobs;
run;

The programmer probably wanted to say:

if X = 3 or X = 4 then Match = 'Yes';
What happens when you run this program? Many folks would expect to see an error message in the log and would expect the program to terminate. Not so. Here is the output from this program:

Listing of BELIEVE_IT_OR_NOT

X    Match
3    Yes
7    Yes
.    Yes
In Program 7-7, there is one condition on either side of the OR operator: one is X = 3, the other is 4. In SAS, any numeric value other than 0 or missing is true. Therefore, 4 is evaluated as true and the expression X = 3 OR 4 is always true.
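One way to avoid this trap is to use the IN operator from Section 7.4, which leaves no stray constant on either side of an OR. The following sketch reruns the same three data values; the data set name Fixed_Match is hypothetical.

```sas
data fixed_match;
   input X;
   /* Equivalent to: if X = 3 or X = 4 */
   if X in (3, 4) then Match = 'Yes';
   else Match = 'No';
datalines;
3
7
.
;
title "Listing of FIXED_MATCH";
proc print data=fixed_match noobs;
run;
```

With this version, only the first observation (X = 3) should have Match equal to Yes.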
7.8 The WHERE Statement

If you are reading data from a SAS data set, you can use a WHERE statement to subset your data. For example, if you started with the SAS data set Conditional (Program 7-3), you could create a data set of all females with the following program:

Program 7-8 Using a WHERE statement to subset a SAS data set

data females;
   set conditional;
   where Gender eq 'F';
run;
In this example you could use either a WHERE or a subsetting IF statement. There are sometimes advantages to using a WHERE statement instead of a subsetting IF statement. You have a larger choice of operators that can be used with a WHERE statement (to be discussed next) and, if the input data set is indexed, the WHERE statement might be more efficient. You may also use a WHERE statement in a SAS procedure to subset the data being processed. Note: IF statements are not allowed inside SAS procedures.
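As an illustration of subsetting inside a procedure, the sketch below assumes the data set Conditional from Program 7-3 exists in the Work library. It shows a WHERE statement in PROC PRINT and, as an alternative form, the WHERE= data set option in PROC MEANS.

```sas
title "Females in CONDITIONAL";
proc print data=conditional noobs;
   where Gender eq 'F';
run;

/* The WHERE= data set option applies the same kind of condition */
title "Mean Exam Grades for Females";
proc means data=conditional(where=(Gender eq 'F')) mean;
   var Midterm FinalExam;
run;
```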
7.9 Some Useful WHERE Operators

The table here lists some of the useful operators that you can use with a WHERE statement.

Operator      Description                Example
IS MISSING    Matches a missing value    where Subj is missing
IS NULL       Equivalent to IS MISSING   where Subj is null
BETWEEN AND   An inclusive range         where age between 20 and 40
CONTAINS      Matches a substring        where Name contains 'Mac'
LIKE          Matching with wildcards    where Name like 'R_n%'
=*            Phonetic matching          where Name =* 'Nick'
Note: When using the LIKE operator, the underscore character takes the place of a single character, while the percent sign can be substituted for a string of any length (including a null string).

Here are some examples.

Expression                    Matches
where Gender is null          A missing character value
where Age is null             A missing numeric value
where Age is missing          A missing numeric value
where Age between 20 and 40   All values between 20 and 40, including 20 and 40
where Name contains 'mac'     macon immaculate
where Name like 'R_n%'        Ron Ronald Run Running
where Name =* 'Nick'          Nick Nack Nikki
Notes:

1. The IS NULL or IS MISSING expression matches a character or a numeric missing value.

2. The BETWEEN AND expression matches all the values greater than or equal to the first value and less than or equal to the second value. This works with character as well as numeric variables.

3. The CONTAINS expression matches any character value containing the given string.

4. The LIKE expression uses two wildcard operators. The underscore (_) is a place holder; enter as many underscores as you need to stand for the same number of characters. The percent (%) matches nothing or a string of any length.
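To experiment with these operators, you could build a small test data set and run one WHERE statement at a time. The data set Names and its values are made up for this sketch.

```sas
/* Hypothetical test data */
data names;
   input Name : $12. Age;
datalines;
Ron 25
Ronald 41
Running 38
macon .
immaculate 20
Nick 33
;
title "Names matching the pattern R_n%";
proc print data=names noobs;
   where Name like 'R_n%';
run;

title "Ages between 20 and 40, inclusive";
proc print data=names noobs;
   where Age between 20 and 40;
run;

title "Missing values of Age";
proc print data=names noobs;
   where Age is missing;
run;
```

The first listing should contain Ron, Ronald, and Running. The observation with a missing Age should appear only in the third listing.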
7.10 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Run the program here to create a temporary SAS data set called School:

data school;
   input Age Quiz : $1. Midterm Final;
   /* Add your statements here */
datalines;
12 A 92 95
12 B 88 88
13 C 78 75
13 A 92 93
12 F 55 62
13 B 88 82
;
Using IF and ELSE IF statements, compute two new variables as follows: Grade (numeric), with a value of 6 if Age is 12 and a value of 8 if Age is 13. The quiz grades have numerical equivalents as follows: A = 95, B = 85, C = 75, D = 70, and F = 65. Using this information, compute a course grade (Course) as a weighted average of the Quiz (20%), Midterm (30%) and Final (50%).

2. Using the SAS data set Hosp, use PROC PRINT to list observations for Subject values of 5, 100, 150, and 200. Do this twice, once using OR operators and once using the IN operator. Note: Subject is a numeric variable.
3. Using the Sales data set, list the observations for employee numbers (EmpID) 9888 and 0177. Do this two ways, one using OR operators and the other using the IN operator. Note: EmpID is a character variable.

4. Using the Sales data set, create a new, temporary SAS data set containing Region and TotalSales plus a new variable called Weight with values of 1.5 for the North Region, 1.7 for the South Region, and 2.0 for the West and East Regions. Use a SELECT statement to do this.

5. Starting with the Blood data set, create a new, temporary SAS data set containing all the variables in Blood plus a new variable called CholGroup. Define this new variable as follows:

   CholGroup   Chol
   Low         Low – 110
   Medium      111 – 140
   High        141 – High
Use a SELECT statement to do this.

6. Using the Sales data set, list all the observations where Region is North and Quantity is less than 60. Include in this list any observations where the customer name (Customer) is Pet's are Us.

7. Using the Bicycles data set, list all the observations for Road Bikes that cost more than $2,500 or Hybrids that cost more than $660. The variable Model contains the type of bike and UnitCost contains the cost.
Chapter 8

Performing Iterative Processing: Looping

8.1 Introduction
8.2 DO Groups
8.3 The Sum Statement
8.4 The Iterative DO Loop
8.5 Other Forms of an Iterative DO Loop
8.6 DO WHILE and DO UNTIL Statements
8.7 A Caution When Using DO UNTIL Statements
8.8 LEAVE and CONTINUE Statements
8.9 Problems
8.1 Introduction

Many programming tasks require that blocks of code be run more than once. SAS provides several ways to accomplish this. This chapter covers DO groups, DO loops, DO WHILE statements, and DO UNTIL statements.
8.2 DO Groups

To demonstrate a DO group, we start with a data file containing some information on students: their age, gender, midterm grade, quiz grade, and final exam grade. A listing of this data file follows:

File c:\books\learning\grades.txt

21 M 80 B- 82
.  F 90 A  93
35 M 87 B+ 85
48 F .  .  76
59 F 95 A+ 97
15 M 88 .  93
67 F 97 A  91
.  M 62 F  67
35 F 77 C- 77
49 M 59 C  81
You want to read values from this file and compute two new variables—age group (Agegrp) and a value (Grade) computed from the midterm and final exam grades. If the age is less than or equal to 39, you want to set Agegrp equal to Younger group and the grade to be computed as a weighted average of the midterm grade (40%) and the final exam grade (60%). If the age is greater than 39, you want to set Agegrp equal to Older group and compute the grade as a simple average of the midterm and final exam grades. Take a look at Program 8-1 that solves this problem:
Program 8-1 Example of a program that does not use a DO group

data grades;
   length Gender $ 1
          Quiz   $ 2
          AgeGrp $ 13;
   infile 'c:\books\learning\grades.txt' missover;
   input Age Gender Midterm Quiz FinalExam;
   if missing(Age) then delete;
   if Age le 39 then Agegrp = 'Younger group';
   if Age le 39 then Grade = .4*Midterm + .6*FinalExam;
   if Age gt 39 then Agegrp = 'Older group';
   if Age gt 39 then Grade = (Midterm + FinalExam)/2;
run;

title "Listing of GRADES";
proc print data=grades noobs;
run;
Notice that the first two IF statements and the last two IF statements test the same condition. You would like to be able to test a condition and then perform several operations. By using DO and END statements, you can do this. Program 8-2 works identically to Program 8-1 except it uses DO and END statements to make the code more efficient and easier to read:
Program 8-2 Demonstrating a DO group

data grades;
   length Gender $ 1
          Quiz   $ 2
          AgeGrp $ 13;
   infile 'c:\books\learning\grades.txt' missover;
   input Age Gender Midterm Quiz FinalExam;
   if missing(Age) then delete;
   if Age le 39 then do;
      Agegrp = 'Younger group';
      Grade = .4*Midterm + .6*FinalExam;
   end;
   else if Age gt 39 then do;
      Agegrp = 'Older group';
      Grade = (Midterm + FinalExam)/2;
   end;
run;

title "Listing of GRADES";
proc print data=grades noobs;
run;
All the statements between DO and END form a DO group. When the IF condition is true, all the statements in the DO group execute. A good way to think of this structure is “If the condition is true, do the following statements until you reach the end.” It is standard practice to indent all the statements in the DO group as shown here.
The DO group coding is not only more efficient than multiple IF statements, it is also easier to read. Here is a listing of data set Grades:

Listing of GRADES

                                                Final
Gender  Quiz  AgeGrp          Age  Midterm      Exam   Grade
M       B-    Younger group    21       80        82    81.2
M       B+    Younger group    35       87        85    85.8
F             Older group      48        .        76       .
F       A+    Older group      59       95        97    96.0
M             Younger group    15       88        93    91.0
F       A     Older group      67       97        91    94.0
F       C-    Younger group    35       77        77    77.0
M       C     Older group      49       59        81    70.0
You should take note that both of these programs work properly if there is a missing value for Age. The MISSING function returns a true value if its argument is a missing character or numeric value. The DELETE statement does two things: first, it prevents the current observation from being added to the data set, and second, it forces a return to the top of the DATA step.
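As a quick check of the MISSING function described above, the following sketch tests one missing numeric value and one missing character value and writes a message to the SAS log for each. The variable names Num and Chr are made up for this illustration, and the step uses DATA _NULL_ so that no data set is created:

```sas
data _null_;
   Num = .;      /* missing numeric value (hypothetical variable)   */
   Chr = ' ';    /* missing character value (hypothetical variable) */
   if missing(Num) then put 'Num is missing';
   if missing(Chr) then put 'Chr is missing';
run;
```

Both messages should appear in the log, because the same function handles either type of variable.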
8.3 The Sum Statement

The programs in the next few sections all make use of the sum statement. This seemingly simple statement is extremely useful (you may find you use it in a majority of your SAS programs), yet many programmers who use it don’t fully appreciate how it works. There are two primary uses for a sum statement: one is to accumulate totals such as a month-to-date total, and the other is to create a counter, that is, a variable that is incremented by a fixed amount on each iteration of the DATA step.

Suppose you have a data set with one observation for each day of the week, and you want a program that will read in these values and compute a cumulative sum. The following program creates a test data set (Revenue) and attempts to create the cumulative sum.
Program 8-3 Attempt to create a cumulative total (Note: This program does not work)

data revenue;
   input Day : $3.
         Revenue : dollar6.;
   Total = Total + Revenue; /* Note: this does not work */
   format Revenue Total dollar8.;
datalines;
Mon $1,000
Tue $1,500
Wed .
Thu $2,000
Fri $3,000
;
Here is the output:

Listing of REVENUE

Obs    Day    Revenue     Total
  1    Mon     $1,000         .
  2    Tue     $1,500         .
  3    Wed          .         .
  4    Thu     $2,000         .
  5    Fri     $3,000         .
Remember that variables read from raw data or created by assignment statements are initialized to a missing value for each iteration of the DATA step. On the first iteration of the DATA step, you are adding a missing value (Total) to a Revenue value ($1,000) and the result is a missing value. Using this same logic, you see that Total is missing for every observation. You can use a RETAIN statement to tell SAS not to do this. A RETAIN statement also enables you to set an initial value for a variable. Here is attempt number two (a bit better but still not there):
Program 8-4 Adding a RETAIN statement to Program 8-3

data revenue;
   retain Total 0;
   input Day : $3.
         Revenue : dollar6.;
   Total = Total + Revenue; /* Note: this does not work */
   format Revenue Total dollar8.;
datalines;
Mon $1,000
Tue $1,500
Wed .
Thu $2,000
Fri $3,000
;
In this program, Total is retained and initialized at 0. Here is the output:

Listing of REVENUE

Obs    Day    Revenue     Total
  1    Mon     $1,000    $1,000
  2    Tue     $1,500    $2,500
  3    Wed          .         .
  4    Thu     $2,000         .
  5    Fri     $3,000         .
Everything works fine until the program encounters a missing value for Revenue. This sets Total to a missing value, where it remains for the rest of the DATA step. You can fix this by adding a statement to test the value of Revenue before you attempt to add it to the cumulative Total, like this:
Program 8-5 Third attempt to create cumulative total

data revenue;
   retain Total 0;
   input Day : $3.
         Revenue : dollar6.;
   if not missing(Revenue) then Total = Total + Revenue;
   format Revenue Total dollar8.;
datalines;
Mon $1,000
Tue $1,500
Wed .
Thu $2,000
Fri $3,000
;
Here is the output:

Listing of REVENUE

Obs    Day    Revenue     Total
  1    Mon     $1,000    $1,000
  2    Tue     $1,500    $2,500
  3    Wed          .    $2,500
  4    Thu     $2,000    $4,500
  5    Fri     $3,000    $7,500
This time the program worked correctly. But there is an easier way: use a sum statement. A sum statement takes the following form:

variable + increment;
Notice there is a plus sign and no equal sign in this statement. That’s what identifies this as a sum statement to SAS. This statement does the following:
• Variable is retained
• Variable is initialized at 0
• Missing values (of increment) are ignored
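One way to picture what these three actions amount to is to write them out by hand. The sketch below is an illustration only, not a description of how SAS implements the sum statement, and the data set name revenue_by_hand is made up. It pairs a RETAIN statement with the SUM function, which ignores missing arguments:

```sas
data revenue_by_hand;               /* hypothetical data set name            */
   retain Total 0;                  /* Total is retained and starts at 0     */
   input Day : $3.
         Revenue : dollar6.;
   Total = sum(Total, Revenue);     /* SUM ignores a missing Revenue value   */
   format Revenue Total dollar8.;
datalines;
Mon $1,000
Tue $1,500
Wed .
Thu $2,000
Fri $3,000
;
```

Under these assumptions, Total should carry forward unchanged on Wednesday, just as it does in Program 8-5.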
So, rewriting Program 8-5, you have the following:
Program 8-6 Using a sum statement to create a cumulative total

data revenue;
   input Day : $3.
         Revenue : dollar6.;
   Total + Revenue;
   format Revenue Total dollar8.;
datalines;
Mon $1,000
Tue $1,500
Wed .
Thu $2,000
Fri $3,000
;
The output from this program is identical to the output from Program 8-5. Another very common use of a sum statement is to create counters, for example:
Program 8-7 Using a sum statement to create a counter

data test;
   input x;
   if missing(x) then MissCounter + 1;
datalines;
2
.
7
.
;
Here is the output:

Listing of TEST

                Miss
Obs    x     Counter
  1    2           0
  2    .           1
  3    7           1
  4    .           2
As you can see from the output, MissCounter is counting the number of missing values for x.
8.4 The Iterative DO Loop

Although it is useful to be able to execute a group of statements when a condition is true, there are times when you would like to execute a group of SAS statements multiple times. The program here is written without any iterative loops.
Program 8-8 Program without iterative loops

data compound;
   Interest = .0375;
   Total = 100;

   Year + 1;
   Total + Interest*Total;
   output;

   Year + 1;
   Total + Interest*Total;
   output;

   Year + 1;
   Total + Interest*Total;
   output;

   format Total dollar10.2;
run;

title "Listing of COMPOUND";
proc print data=compound noobs;
run;
The purpose of this program should be obvious: you want to compute the total amount of money you will have if you start with $100 and invest it at a 3.75% interest rate for 3 years. (Yes, there are formulas for compound interest as well as SAS functions, but this makes for a good example.) As we just discussed, the statement Year + 1; is a sum statement. It increments the value of Year by 1 each time it executes. The value of Total is also computed using a sum statement. Notice here that the increment can be an expression (Interest * Total).
The OUTPUT statement is an instruction for SAS to write out an observation to the output data set. An output usually occurs automatically at the bottom of the DATA step. But here, you want to output an observation each time you compute a new Total. When you include an OUTPUT statement anywhere in a DATA step, SAS does not execute an automatic output at the bottom of the DATA step. Next, notice that the group of three statements, starting with this sum statement, is repeated three times. This is a clue that there is probably a better way to write this program. That better way is to use an iterative DO loop, like this:
Program 8-9 Demonstrating an iterative DO loop

data compound;
   Interest = .0375;
   Total = 100;
   do Year = 1 to 3;
      Total + Interest*Total;
      output;
   end;
   format Total dollar10.2;
run;
When this program executes, Year is first set to 1, the lower limit in the iterative DO range. All the statements up to the END statement are executed and Year is incremented by 1 (the default increment value). SAS then tests if the new value of Year is between the lower and the upper limit (the value after the keyword TO). If it is, the statements in the DO group execute again; if not, the program continues at the first line following the END statement. Here is the output from Program 8-9:

Listing of COMPOUND

Interest       Total    Year
  0.0375     $103.75       1
  0.0375     $107.64       2
  0.0375     $111.68       3
One form of an iterative DO statement follows:

do index-variable = start to stop by increment;
If you leave off the increment, it defaults to 1.
Suppose you want to generate a table of the integers from 1 to 10, along with their squares and square roots. An iterative DO loop makes simple work of this, as follows:
Program 8-10 Using an iterative DO loop to make a table of squares and square roots

data table;
   do n = 1 to 10;
      Square = n*n;
      SquareRoot = sqrt(n);
      output;
   end;
run;

title "Table of Squares and Square Roots";
proc print data=table noobs;
run;
Notice that this program does not have any input data. It generates the value of n in the DO loop, computes the squares and square roots (SQRT is a square root function; it returns the square root of its argument), and outputs an observation to the data set. This continues for all the values from 1 to 10. Here is the output:

Table of Squares and Square Roots

                   Square
  n    Square        Root
  1         1     1.00000
  2         4     1.41421
  3         9     1.73205
  4        16     2.00000
  5        25     2.23607
  6        36     2.44949
  7        49     2.64575
  8        64     2.82843
  9        81     3.00000
 10       100     3.16228
What if you want a table where n has values of 0, 10, 20, 30, and so forth up to 100? The following DO statement does the trick:

do n = 0 to 100 by 10;
DO loops can also count backwards. For example, the following statement produces values of 10, 8, 6, 4, and 2 for the index variable:

do index = 10 to 1 by -2;
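To watch the backward count in a complete step, here is a minimal sketch that writes each value of the index variable to the SAS log. It uses DATA _NULL_ and PUT statements purely for illustration; neither is required by the DO statement itself:

```sas
data _null_;
   do index = 10 to 1 by -2;
      put index=;                   /* one log line per iteration: 10, 8, 6, 4, 2 */
   end;
   put 'After the loop: ' index=;   /* index has been decremented one more time   */
run;
```

The last PUT statement is there to show that the index variable keeps changing until the test against the stop value fails.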
When you use a negative increment value, the index value is decremented by this value for each iteration of the DO loop. When the value of the index variable is less than the stop value, the loop stops. Here is another example. You have an equation relating X and Y and want a graph of Y versus X for values of X from –10 to +10. Again, an iterative DO loop makes this very easy:
Program 8-11 Using an iterative DO loop to graph an equation

data equation;
   do X = -10 to 10 by .01;
      Y = 2*X**3 - 5*X**2 + 15*X - 8;
      output;
   end;
run;

symbol value=none interpol=sm width=2;
title "Plot of Y versus X";
proc gplot data=equation;
   plot Y * X;
run;
The DO loop starts X at –10 and increments the value by .01 until X reaches +10. The OUTPUT statement inside the loop writes an observation out to the data set for each iteration of the loop.
PROC GPLOT is used to plot the line. You can obtain information on PROC GPLOT in Chapter 20. Here is the graph:

[Graph: Plot of Y versus X]
8.5 Other Forms of an Iterative DO Loop

SAS provides several other methods of specifying how a DO loop operates. You can provide a list of numeric or character values following the index variable. Here are some examples:

do x = 1,2,5,10;
(values of x are: 1, 2, 5, and 10)

do month = 'Jan','Feb','Mar';
(values of month are: 'Jan', 'Feb', and 'Mar')

do n = 1,3, 5 to 9 by 2, 100 to 200 by 50;
(values of n are: 1, 3, 5, 7, 9, 100, 150, and 200)
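A quick way to confirm which values a DO statement generates is to write the index variable to the SAS log on each pass. The following sketch does this for the last of the three forms; the use of DATA _NULL_ and PUT is an assumption made for the illustration:

```sas
data _null_;
   /* a value list can mix single values with start-TO-stop-BY ranges */
   do n = 1, 3, 5 to 9 by 2, 100 to 200 by 50;
      put n=;    /* one log line for each value of n */
   end;
run;
```

The log should show one line for each of the eight values listed for n.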
If you use character values in the DO statement, the length of the first character value determines the storage length of the index variable. Therefore, you may need to use a LENGTH statement to set the storage length for the index variable.

Using character values for DO loop indices can be especially useful, as illustrated by this example. You have five scores for patients on a placebo drug and five scores for patients on an active drug. The raw data values are arranged like this:

250 222 230 210 199 166 183 123 129 234
An easy way to read these values is to use a DO loop with character values, like this:
Program 8-12 Using character values for DO loop index values

data easyway;
   do Group = 'Placebo','Active';
      do Subj = 1 to 5;
         input Score @;
         output;
      end;
   end;
datalines;
250 222 230 210 199 166 183 123 129 234
;
Before we discuss the DO loops in this program, let’s take a moment to tell you about the at (@) sign at the end of the INPUT statement. (This is referred to as a single trailing @ sign and you can find a detailed explanation in Chapter 21, Section 11.) To keep this program compact, it was convenient to place several scores on a single line. Without the @ sign, each time SAS executes an INPUT statement, it goes to a new line of data. The single trailing @ sign is an instruction to “hold the line” for another INPUT statement in the DATA step. Try running this program without the trailing @ sign to see what happens.

This program demonstrates two things: first, you can use character values in a DO loop, and second, you can nest one DO loop inside another. Let’s “play computer” and follow the execution of this program.

The outer loop first sets Group equal to Placebo. Next, the inner loop iterates five times, reading score values and outputting an observation each time (remember that Group is equal to Placebo for each of these five observations). When the inner loop is finished, control returns to the top of the outer loop where Group is set to Active. Five more values of Score are read and five observations are written to the SAS data set. Here is a listing of data set Easyway:

Listing of EASYWAY

Group      Subj    Score
Placebo       1      250
Placebo       2      222
Placebo       3      230
Placebo       4      210
Placebo       5      199
Active        1      166
Active        2      183
Active        3      123
Active        4      129
Active        5      234
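Program 8-12 needs no LENGTH statement because the longer value, Placebo, comes first in the list. If the shorter value came first, the index variable would be given a length of 6 and Placebo would be truncated. The sketch below shows one way to guard against this; the data set name order_matters and the reordered data line are made up for the illustration:

```sas
data order_matters;                  /* hypothetical data set name            */
   length Group $ 7;                 /* long enough to hold 'Placebo'         */
   do Group = 'Active','Placebo';    /* shorter value now comes first         */
      do Subj = 1 to 5;
         input Score @;              /* single trailing @ holds the line      */
         output;
      end;
   end;
datalines;
166 183 123 129 234 250 222 230 210 199
;
```

Removing the LENGTH statement from this sketch is a simple way to see the truncation for yourself.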
8.6 DO WHILE and DO UNTIL Statements

Instead of choosing a stopping value for an iterative DO loop, you can stop a loop when a condition is met or while a condition is true. Enter the DO UNTIL and DO WHILE statements.

Let’s revisit the compound interest problem from earlier in this chapter. Instead of asking how much money you have after x years, you want to know how many years you need to keep your $100 in the bank at 3.75% interest to double your money. The two programs here solve this problem:
Program 8-13 Demonstrating a DO UNTIL loop

data double;
   Interest = .0375;
   Total = 100;
   do until (Total ge 200);
      Year + 1;
      Total = Total + Interest*Total;
      output;
   end;
   format Total dollar10.2;
run;

title "Listing of DOUBLE";
proc print data=double noobs;
run;
The condition is placed in parentheses following the keyword UNTIL. In this example, the loop continues to repeat until the value of Total is greater than or equal to 200. Here is the output:

Listing of DOUBLE

Interest       Total    Year
  0.0375     $103.75       1
  0.0375     $107.64       2
  0.0375     $111.68       3
  0.0375     $115.87       4
  0.0375     $120.21       5
  0.0375     $124.72       6
  0.0375     $129.39       7
  0.0375     $134.25       8
  0.0375     $139.28       9
  0.0375     $144.50      10
  0.0375     $149.92      11
  0.0375     $155.55      12
  0.0375     $161.38      13
  0.0375     $167.43      14
  0.0375     $173.71      15
  0.0375     $180.22      16
  0.0375     $186.98      17
  0.0375     $193.99      18
  0.0375     $201.27      19
An important point to remember about DO UNTIL is that the condition, placed in parentheses after the keyword UNTIL, is tested at the bottom of the loop. Therefore, a DO UNTIL loop always executes at least once. To make this clear, suppose you started with $300. What happens when you run the program?
Program 8-14 Demonstrating that a DO UNTIL loop always executes at least once

data double;
   Interest = .0375;
   Total = 300;
   do until (Total gt 200);
      Year + 1;
      Total = Total + Interest*Total;
      output;
   end;
   format Total dollar10.2;
run;
The condition is true even before the loop starts, but because the condition is tested at the bottom of the loop, this program outputs one observation (as shown here):

Listing of DOUBLE

Interest       Total    Year
  0.0375     $311.25       1
An alternative to DO UNTIL is DO WHILE. As you might expect, a DO WHILE loop iterates as long as the condition following WHILE is true. There is another difference between DO WHILE and DO UNTIL—the WHILE condition is tested at the top of the loop rather than at the bottom. So, unlike a DO UNTIL block that always iterates at least once, a DO WHILE block does not execute even once if the condition is false. You can rewrite Program 8-13 using a DO WHILE statement, like this:
Program 8-15 Demonstrating a DO WHILE statement

data double;
   Interest = .0375;
   Total = 100;
   do while (Total le 200);
      Year + 1;
      Total = Total + Interest*Total;
      output;
   end;
   format Total dollar10.2;
run;

proc print data=double noobs;
   title "Listing of DOUBLE";
run;
The block of code between the DO WHILE and END statements executes as long as Total is less than or equal to 200. Output from this program is identical to the output from Program 8-13. To reinforce the idea that DO WHILE conditions are tested at the top of the loop, look at this program:
Program 8-16 Demonstrating that DO WHILE loops are evaluated at the top

data double;
   Interest = .0375;
   Total = 300;
   do while (Total lt 200);
      Year + 1;
      Total = Total + Interest*Total;
      output;
   end;
   format Total dollar10.2;
run;
Because the WHILE condition is never true, the statements inside the DO WHILE block never execute and the data set Double has no observations.
8.7 A Caution When Using DO UNTIL Statements

It is very important that the condition you place on a DO UNTIL statement becomes true at some point. For example, if you change the DO UNTIL statement in Program 8-14 to read as follows, the condition is never true and you have what is called an infinite loop:

do until (Total eq 200);
Depending on whether or not you are paying for your computer time, this could be a bad (expensive) thing. On a PC platform, you can usually interrupt the program by selecting the icon to stop a SAS program (the exclamation point on the task bar) or by simultaneously pressing the CTRL and C keys. The lesson here is to be very careful when using a DO UNTIL statement: make sure the condition you specify eventually returns a true value.
One way to prevent infinite loops is to combine a regular DO loop with an UNTIL condition. You could rewrite the program like this:
Program 8-17 Combining a DO UNTIL and iterative DO loop

data double;
   Interest = .0375;
   Total = 100;
   do Year = 1 to 100 until (Total gt 200);
      Total = Total + Interest*Total;
      output;
   end;
   format Total dollar10.2;
run;
There are two advantages to this structure: first, even if the UNTIL condition never becomes true, the loop ends when Year reaches 100, and second, you don’t have to assign a value to Year inside the loop as you did in Program 8-13.
8.8 LEAVE and CONTINUE Statements

The LEAVE statement inside a DO loop shifts control to the statement following the END statement at the bottom of the loop. The CONTINUE statement halts further statements within the DO loop from executing and continues iterations of the loop. Note: You can also use a LEAVE statement inside a SELECT group. Program 8-18 demonstrates how the LEAVE statement works.
Program 8-18 Demonstrating the LEAVE statement

data leave_it;
   Interest = .0375;
   Total = 100;
   do Year = 1 to 100;
      Total = Total + Interest*Total;
      output;
      if Total ge 200 then leave;
   end;
   format Total dollar10.2;
run;
In this program, the loop continues until Total is greater than or equal to 200. At this point, the LEAVE statement terminates the loop. To demonstrate a CONTINUE statement, take a look at the following program:
Program 8-19 Demonstrating a CONTINUE statement

data continue_on;
   Interest = .0375;
   Total = 100;
   do Year = 1 to 100 until (Total ge 200);
      Total = Total + Interest*Total;
      if Total le 150 then continue;
      output;
   end;
   format Total dollar10.2;
run;
As long as Total is less than or equal to 150, the CONTINUE statement causes execution to drop to the bottom of the loop (skipping the OUTPUT statement) and the loop continues. When Total is greater than 150, output occurs and the loop continues until Total is greater than or equal to 200. Thus, this program prints values of Total greater than 150 until Total reaches or exceeds 200. Here is the output:

Listing of CONTINUE_ON

Interest       Total    Year
  0.0375     $155.55      12
  0.0375     $161.38      13
  0.0375     $167.43      14
  0.0375     $173.71      15
  0.0375     $180.22      16
  0.0375     $186.98      17
  0.0375     $193.99      18
  0.0375     $201.27      19
As you saw in this chapter, iterative statements in SAS can make your programs shorter and easier to understand. They also allow you to write DATA steps that generate data for creating tables or plotting functions.
8.9 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Run the program here to create a temporary SAS data set called Vitals:

data vitals;
   input ID : $3.
         Age
         Pulse
         SBP
         DBP;
   label SBP = "Systolic Blood Pressure"
         DBP = "Diastolic Blood Pressure";
datalines;
001 23 68 120 80
002 55 72 188 96
003 78 82 200 100
004 18 58 110 70
005 43 52 120 82
006 37 74 150 98
007 . 82 140 100
;
Using this data set, create a new data set (NewVitals) with the following new variables:

For subjects less than 50 years of age:
   If Pulse is less than 70, set PulseGroup equal to Low; otherwise, set PulseGroup equal to High.
   If SBP is less than 130, set SBPGroup equal to Low; otherwise, set SBPGroup equal to High.

For subjects greater than or equal to 50 years of age:
   If Pulse is less than 74, set PulseGroup equal to Low; otherwise, set PulseGroup equal to High.
   If SBP is less than 140, set SBPGroup equal to Low; otherwise, set SBPGroup equal to High.

You may assume there are no missing values for Pulse or SBP.
2. Run the program here to create a temporary SAS data set (MonthSales):

data monthsales;
   input month sales @@;
   /* add your line(s) here */
datalines;
1 4000 2 5000 3 . 4 5500 5 5000 6 6000 7 6500 8 4500
9 5100 10 5700 11 6500 12 7500
;
Modify this program so that a new variable, SumSales, representing Sales to date, is added to the data set. Be sure that the missing value for Sales in month 3 does not result in a missing value for SumSales.

3. Modify the program here so that each observation contains a subject number (Subj), starting with 1:

data test;
   input Score1-Score3;
   /* add your line(s) here */
datalines;
90 88 92
75 76 88
88 82 91
72 68 70
;
4. Count the number of missing values for the variables A, B, and C in the Missing data set. Add the cumulative number of missing values to each observation (use variable names MissA, MissB, and MissC). Use the MISSING function to test for the missing values.

5. Create and print a data set with variables N and LogN, where LogN is the natural log of N (the function is LOG). Use a DO loop to create a table showing values of N and LogN for values of N going from 1 to 20.

6. Repeat Problem 5, except have the range of N go from 5 to 100 by 5.

7. Use an iterative DO loop to plot the following equation:

y = 3*x**2 - 5*x + 10

Use values of x from 0 to 10, with an increment of .10. Copy the GPLOT statements from Problem 8 or use PROC PLOT to display the resulting equation.
8. Use an iterative DO loop to plot the following equation:

Logit = log(p / (1 - p))
Use values of p from 0 to 1 (with a point at every .05). Using the following GPLOT statements will produce a very nice plot. (If you do not have SAS/GRAPH software, use PROC PLOT to plot your points.)

goptions reset=all
         ftext='arial'
         htext=1.0
         ftitle='arial/bo'
         htitle=1.5
         colors=(black);
symbol v=none i=sm;
title "Logit Plot";
proc gplot data=logitplot;
   plot Logit * p;
run;
quit;
9. You have the following seven values for temperatures for each day of the week, starting with Monday: 70, 72, 74, 76, 77, 78, and 85. Create a temporary SAS data set (Temperatures) with a variable (Day) equal to Mon, Tue, Wed, Thu, Fri, Sat, and Sun and a variable called Temp equal to the listed temperature values. Use a DO loop to create the Day variable.

10. You are testing three speed-reading methods (A, B, and C) by randomly assigning 10 subjects to each of the three methods. You are given the results as three lines of reading speeds, each line representing the results from each of the three methods, respectively. Here are the results:

250 255 256 300 244 268 301 322 256 333
267 275 256 320 250 340 345 290 280 300
350 350 340 290 377 401 380 310 299 399
Create a temporary SAS data set from these three lines of data. Each observation should contain Method (A, B, or C), and Score. There should be 30 observations in this data set. Use a DO loop to create the Method variable and remember to use a single trailing @ in your INPUT statement. Provide a listing of this data set using PROC PRINT.
11. You have daily temperatures for each hour of the day for two cities (Dallas and Houston). The 48 temperature values are strung out in several lines like this: 80 91 86 80 88 88
81 93 84 81 90 84
82 93 80 82 92 82
83 95 78 82 92 78
84 96 76 86 93 76
84 87 88 89 89 97 99 95 92 90 88 77 78 96 94 92 90 74
The first 24 values represent temperatures from Hour 1 to Hour 24 for Dallas and the next 24 values represent temperatures for Hour 1 to Hour 24 for Houston. Using the appropriate DO loops, create a data set (Temperature) with 48 observations, each observation containing the variables City, Hour, and Temp. Note: For this problem, you will need to use a single trailing @ on your INPUT statement (see Chapter 21, Section 21.11 for an explanation).

12. You place money in a fund that returns a compound interest of 4.25% annually. You deposit $1,000 every year. How many years will it take to reach $30,000? Do not use compound interest formulas. Rather, use “brute force” methods with DO WHILE or DO UNTIL statements to solve this problem.

13. You invest $1,000 a year at 4.25% interest, compounded quarterly. How many years will it take to reach $30,000? Do not use compound interest formulas. Rather, use “brute force” methods with DO WHILE or DO UNTIL statements to solve this problem.

14. Generate a table of integers and squares starting at 1 and ending when the square value is greater than 100. Use either a DO UNTIL or DO WHILE statement to accomplish this.
Chapter 9
Working with Dates

9.1 Introduction
9.2 How SAS Stores Dates
9.3 Reading Date Values from Raw Data
9.4 Computing the Number of Years between Two Dates
9.5 Demonstrating a Date Constant
9.6 Computing the Current Date
9.7 Extracting the Day of the Week, Day of the Month, Month, and Year from a SAS Date
9.8 Creating a SAS Date from Month, Day, and Year Values
9.9 Substituting the 15th of the Month when the Day Value Is Missing
9.10 Using Date Interval Functions
9.11 Problems
9.1 Introduction

Most data sets contain date information, such as a date of birth or a transaction date. SAS can read dates in almost any commonly used format. It can calculate intervals between dates and much, much more. Read on.
9.2 How SAS Stores Dates

SAS can read dates in almost any form, such as the following:

Date          Description
10/21/1950    Month – Day – Year
21/10/1950    Day – Month – Year
21Oct1950     Day – Month Abbreviation – Year
50294         Julian Date
However, SAS does not normally store dates in any of these forms. Instead, it converts all of these dates into a single number: the number of days from January 1, 1960. Dates after January 1, 1960, are positive integers; dates before January 1, 1960, are negative integers. For example, the following table shows some dates and the internal values stored by SAS:

Date                 SAS Internal Value
January 1, 1960                       0
January 2, 1960                       1
December 31, 1959                    -1
June 15, 2006                    16,967
October 21, 1950                 -3,359
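You can check the values in this table for yourself. The sketch below assumes the date constant notation ('ddMONyyyy'd), which is covered later in this chapter, and uses made-up variable names. It writes the unformatted values to the SAS log:

```sas
data _null_;
   /* date constants; the variable names are hypothetical */
   Day0     = '01JAN1960'd;
   DayAfter = '02JAN1960'd;
   DayPrior = '31DEC1959'd;
   Recent   = '15JUN2006'd;
   Birthday = '21OCT1950'd;
   /* no format is attached, so PUT writes the internal numeric values */
   put Day0= DayAfter= DayPrior= Recent= Birthday=;
run;
```

The numbers in the log should match the internal values in the table (without the commas).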
9.3 Reading Date Values from Raw Data

As you saw in Chapter 3, formatted input (or list input with an appropriate informat) allows SAS to read nonstandard numeric data. SAS has many built-in date informats that allow you to read dates in any of the forms in the table in Section 9.2 (and more) and automatically convert these date values into SAS dates. Let’s look at an example. You have a raw data file, as shown here:

File c:\books\learning\dates.txt

         1         2         3
1234567890123456789012345678901234567   Columns
001 10/21/1950 05122003 08/10/65 23Dec2005
002 01/01/1960 11122009 09/13/02 02Jan1960
The first three dates (starting in columns 5, 16, and 25) are in the month-day-year form; the last date (starting in column 34) starts with the day of the month, a three-letter month abbreviation, and a four-digit year. Notice that some of the dates include separators between the values, while others do not. Also, the third date value only has a two-digit year. Here is a program to read these dates:
Program 9-1 Program to read dates from raw data

data four_dates;
   infile 'c:\books\learning\dates.txt' truncover;
   input @1  Subject   $3.
         @5  DOB       mmddyy10.
         @16 VisitDate mmddyy8.
         @25 TwoDigit  mmddyy8.
         @34 LastDate  date9.;
run;
The TRUNCOVER option was added to the INFILE statement. You typically want to use either the TRUNCOVER or PAD option when reading raw data files with data in fixed columns (see Chapter 21). Both of these options prevent errors when you have a short record.

Each of the four dates is read with an appropriate date informat. As you can see, the first three dates are all in month-day-year form so they all use the MMDDYYw. informat. Because the date of birth (DOB) takes up 10 columns, MMDDYY10. is the proper informat to use. The second date (VisitDate) uses a format similar to the DOB, except that there are no separators between the month, day, and year values. Therefore, the number of columns occupied by these dates is 8. The third date uses separators between the values but only has two-digit year values. Therefore, these dates also take up 8 columns. Finally, the last date uses the day of the month, a three-character month abbreviation, and a four-digit year. The number of columns used for these dates is 9, and the informat name is DATE.

As these dates are read, SAS converts all of the dates to their corresponding numerical value. To see this, here is a listing, produced by PROC PRINT:

Listing of FOUR_DATES

                      Visit      Two     Last
Subject      DOB       Date    Digit     Date
001        -3359      15837     2048    16793
002            0      18213    15596        1
The numbers in this listing represent the internally stored values for the four dates that SAS read from the raw data and converted. These values are stored the same way SAS stores any numeric values. You can choose any SAS date format to display these dates—they do not have to match the informat that was used to read the dates. To demonstrate this, the next program adds a FORMAT statement in the DATA step, as follows:
Program 9-2 Adding a FORMAT statement to format each of the date values

data four_dates;
   infile 'c:\books\learning\dates.txt' truncover;
   input @1  Subject   $3.
         @5  DOB       mmddyy10.
         @16 VisitDate mmddyy8.
         @25 TwoDigit  mmddyy8.
         @34 LastDate  date9.;
   format DOB VisitDate date9.
          TwoDigit LastDate mmddyy10.;
run;
Two very popular date formats are DATE9. and MMDDYY10. Again, it is important to separate the way the dates are read from the way you choose to display them. Here is the listing, with formatted date values:

Listing of FOUR_DATES

Subject          DOB    VisitDate      TwoDigit      LastDate
001        21OCT1950    12MAY2003    08/10/1965    12/23/2005
002        01JAN1960    12NOV2009    09/13/2002    01/02/1960
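The separation between reading and displaying dates can also be tried without a raw data file. The sketch below is an illustration added here, not one of the book's numbered programs. It assumes the INPUT function, which applies an informat to a character string, and it writes to the log with PUT.

```sas
data _null_;
   /* INPUT applies an informat to a character string */
   DOB      = input('10/21/1950', mmddyy10.);
   LastDate = input('23Dec2005',  date9.);
   /* Without a format, PUT shows the stored day counts */
   put DOB= LastDate=;
   /* The display formats need not match the informats used above */
   put DOB= date9. LastDate= mmddyy10.;
run;
```

The first PUT should write DOB=-3359 LastDate=16793, matching the unformatted listing. The second should write DOB=21OCT1950 LastDate=12/23/2005.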
There are several things to notice about how SAS read these four date values. First, two of the dates (VisitDate and TwoDigit) both used the same informat (MMDDYY8.) because both of these date values occupied 8 columns. SAS was clever enough to realize that VisitDate did not use delimiters, so the year must have been four digits. The values for TwoDigit did include delimiters; therefore, the year values must have been two digits. All of these dates would also have been read correctly if the day of the month or the month values did not include leading 0s.

Next, notice the values for TwoDigit in this listing. The first observation shows the year as 1965, while the second observation shows the year as 2002. How did SAS figure out whether to make the first two digits 19 or 20? Whenever you use a two-digit year, it is impossible to know if the date is from 1900 to 1999 or from 2000 to 2099 (or some other century). There is a system option called YEARCUTOFF that enables SAS to compute these values. The default value for this option is 1920 in SAS 8 and SAS 9. This value determines the start of a 100-year interval that SAS uses when it encounters a two-digit year. With a YEARCUTOFF value of 1920, all two-digit years are in the interval from 1920 to 2019. That is why the first date (8/10/65) is given the value 8/10/1965 and the second date (9/13/02) is given the value 9/13/2002.

You might consider including a statement such as the following in each of your programs in case the version of SAS you are using changed the default value of this option:

options yearcutoff=1920;
Of course, the best practice is to always use four-digit years!
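To see the 100-year window at work, you can read a few two-digit years on either side of the boundary. This is a minimal sketch added for illustration; the date strings and variable names are made up, and the OPTIONS statement changes the setting for the rest of your SAS session.

```sas
options yearcutoff=1920;

data _null_;
   /* Two-digit years 20 through 99 fall in 1920-1999 */
   d1 = input('08/10/65', mmddyy8.);
   d2 = input('01/01/20', mmddyy8.);
   /* Two-digit years 00 through 19 fall in 2000-2019 */
   d3 = input('09/13/02', mmddyy8.);
   d4 = input('01/01/19', mmddyy8.);
   put d1= mmddyy10. d2= mmddyy10. d3= mmddyy10. d4= mmddyy10.;
run;
```

With a YEARCUTOFF value of 1920, the log should show the years 1965, 1920, 2002, and 2019. Changing the option value and rerunning the step is a quick way to see how the window shifts.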
9.4 Computing the Number of Years between Two Dates

Suppose you want to compute a person’s age, given his or her date of birth. The following program uses the Four_Dates data set as input and computes the subject’s Age as of the visit date (VisitDate):

Program 9-3 Compute a person's age in years

data ages;
   set four_dates;
   Age = yrdif(DOB,VisitDate,'Actual');
run;

title "Listing of AGES";
proc print data=ages;
   id Subject;
   var DOB VisitDate Age;
run;
Program 9-3 uses a SAS function, YRDIF, to compute the difference in years between the date of birth and the visit date. In this example, the YRDIF function computes the number of years between DOB and VisitDate. The third argument to this function (ACTUAL) tells SAS to use the actual number of days in each month and to take leap years into account when doing the calculation. You can specify values other than ACTUAL if you want to use 30-day months and 360-day years (to compute bond interest and other financial calculations, for example). [1] Here is the listing from PROC PRINT:

Listing of AGES

Subject          DOB    VisitDate        Age
001        21OCT1950    12MAY2003    52.5562
002        01JAN1960    12NOV2009    49.8630

[1] See Ron Cody, SAS Functions by Example (Cary, NC: SAS Institute Inc., 2004), for details on these other options.
If you want the Age as of the person’s last birthday (dropping any fractional part of a year), you can use the INT (integer) function:

Age = int(yrdif(DOB,VisitDate,'Actual'));

If you want to round the Age to the nearest year, you can use the ROUND function:

Age = round(yrdif(DOB,VisitDate,'Actual'));

You may see the following expression in some older programs:

Age = (VisitDate - DOB) / 365.25;
This expression gives an approximate value for the number of years between two dates (the .25 accounts for a leap year every four years). Because the YRDIF function has the ability to return an exact value, you should use it rather than this expression.
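The four ways of computing an age described above can be compared side by side. The following sketch is an added illustration: it plugs in the internal values for subject 001 taken from the unformatted FOUR_DATES listing, and the variable names Exact, Whole, Near, and Approx are arbitrary.

```sas
data _null_;
   DOB       = -3359;   /* 21OCT1950, from the FOUR_DATES listing */
   VisitDate = 15837;   /* 12MAY2003, from the FOUR_DATES listing */
   Exact  = yrdif(DOB, VisitDate, 'Actual');
   Whole  = int(Exact);                     /* age at last birthday     */
   Near   = round(Exact);                   /* age to the nearest year  */
   Approx = (VisitDate - DOB) / 365.25;     /* older approximate method */
   put Exact= Whole= Near= Approx=;
run;
```

Given the Age of 52.5562 shown for subject 001 in the listing of AGES, Whole should be 52 and Near should be 53. Approx should be close to Exact but not identical to it.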
9.5 Demonstrating a Date Constant

How do you compute a person’s age as of a certain date, January 1, 2006, for example? You need to be able to enter this date as the second argument of the YRDIF function. SAS allows you to enter dates in the DATA step by using a date constant. The date January 1, 2006, is written as follows:

'01Jan2006'd

The general form of a date constant is a one- or two-digit day of the month, a three-character month abbreviation, and a two- or four-digit year in single or double quotation marks, followed by an upper- or lowercase d. This is the only form allowed as a date constant. You cannot use '01/01/2006'd, for example. You can use date constants in any expression involving dates.
Rewriting Program 9-3 to compute Age as of January 1, 2006, you have the following program:
Program 9-4 Demonstrating a date constant

data ages;
   set four_dates;
   Age = yrdif(DOB,'01Jan2006'd,'Actual');
run;

title "Listing of AGES";
proc print data=ages;
   id Subject;
   var DOB Age;
   format Age 5.1;
run;

In this program, a date constant represents the date January 1, 2006. (You could use the expression Age = yrdif(DOB,16802,'Actual'); if you happen to know that January 1, 2006, is 16,802 days after January 1, 1960.) This program also formats Age so that it prints the value to the nearest 10th. Here is the listing:

Listing of AGES

Subject          DOB     Age
001        21OCT1950    55.2
002        01JAN1960    46.0
9.6 Computing the Current Date

Suppose you want to compute a quantity based on the current date. The TODAY function returns the value of the current date. Here is an example where the current date is substituted for January 1, 2006, in Program 9-4:

Program 9-5 Using the TODAY function to return the current date

data ages;
   set four_dates;
   Age = yrdif(DOB,today(),'Actual');
run;
Even though you do not pass any arguments to the TODAY function, the parentheses following the function name are required (so that SAS can tell you want to use the function rather than a variable named Today). The DATE function performs an identical task as the TODAY function. Feel free to use either name.
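As a quick check that TODAY and DATE return the same value, you can compare them in a DATA _NULL_ step. This is a small sketch added for illustration; the variable names are arbitrary, and the dates written to the log depend on when you run it.

```sas
data _null_;
   Now1 = today();
   Now2 = date();
   /* A comparison returns 1 when true and 0 when false */
   Same = (Now1 = Now2);
   put Now1= date9. Now2= date9. Same=;
run;
```

Unless the step happens to run exactly as the date changes at midnight, Same should be 1.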
9.7 Extracting the Day of the Week, Day of the Month, Month, and Year from a SAS Date

There are SAS functions to extract the day of the week, day of the month, month, and year from a SAS date. Program 9-6 demonstrates these functions:

Program 9-6 Extracting the day of the week, day of the month, month, and year from a SAS date

data extract;
   set four_dates;
   Day = weekday(DOB);
   DayOfMonth = day(DOB);
   Month = month(DOB);
   Year = year(DOB);
run;

title "Listing of EXTRACT";
proc print data=extract noobs;
   var DOB Day -- Year;
run;
The WEEKDAY function returns the day of the week, with 1 = Sunday, 2 = Monday, and so on, and the DAY function returns the day of the month (a number from 1 to 31). Be careful not to get these two functions mixed up. The MONTH function returns a number from 1 to 12 and the YEAR function returns a four-digit year value. If you haven’t seen it before, the variable list used in the VAR statement uses the double dash (--) form of a variable list. This VAR statement includes all of the variables from Day through Year in the order they are stored in the SAS data set. Be careful when using the double dash form of a variable list because the variable order may not be what you expect. It is best to generate a list of the variables in a SAS data set by running PROC CONTENTS (use the VARNUM option) before using this notation.
A listing of data set Extract is shown next:

Listing of EXTRACT

                     Day Of
      DOB    Day      Month    Month    Year
21OCT1950      7         21       10    1950
01JAN1960      6          1        1    1960
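The advice about checking variable order before using the double dash can be sketched as follows. This example is an addition to the text: it builds a one-observation data set (Extract_Demo, a hypothetical name) so that it runs on its own, and then uses the VARNUM option of PROC CONTENTS to list the variables in their stored order.

```sas
data extract_demo;
   DOB = '21Oct1950'd;
   Day = weekday(DOB);
   DayOfMonth = day(DOB);
   Month = month(DOB);
   Year = year(DOB);
   format DOB date9.;
run;

/* VARNUM lists variables in creation order rather than alphabetically */
proc contents data=extract_demo varnum;
run;

/* Day -- Year selects Day, DayOfMonth, Month, and Year, in that order */
proc print data=extract_demo noobs;
   var Day -- Year;
run;
```

If the assignment statements were written in a different order, the same VAR statement could select a different set of variables, which is why the PROC CONTENTS check is worth the extra step.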
9.8 Creating a SAS Date from Month, Day, and Year Values

A very useful function, MDY (month day year), allows you to create a SAS date value by supplying month, day, and year values. This function is especially useful if you have a SAS data set that contains these values but does not contain the corresponding SAS date value or if you have values for month, day, and year in a raw data file that do not conform to any of the SAS date informats. Data set Month_Day_Year contains the variables Month, Day, and Year. You can create a SAS date like this:

Program 9-7 Using the MDY function to create a SAS date from month, day, and year values

data mdy_example;
   set learn.month_day_year;
   Date = mdy(Month, Day, Year);
   format Date mmddyy10.;
run;
In this program, the three arguments in the MDY function are MONTH, DAY, and YEAR values from the input data set. Note: The YEAR values may be two- or four-digit values. If they are the former, the value returned by the MDY function depends on the value of the YEARCUTOFF option.
Here is a listing of a PROC PRINT of the data set MDY_Example:

Listing of MDY_EXAMPLE

Obs    Month    Day    Year          Date
  1       10     21    1950    10/21/1950
  2        1     15       5    01/15/2005
  3        3      .    2005             .
  4        5      7    2000    05/07/2000
Notice that the value of Year (5) in Observation 2 results in the date 01/15/2005 (because the default YEARCUTOFF value of 1920 was used). In Observation 3, where the day of the month is missing, the value for Date is also missing.
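You can confirm how MDY treats complete and incomplete arguments without the Month_Day_Year data set. The following is a minimal sketch with made-up variable names, using literal arguments instead of variables.

```sas
data _null_;
   FullDate   = mdy(10, 21, 1950);
   /* A missing day (.) makes the result missing */
   MissingDay = mdy(3, ., 2005);
   put FullDate= mmddyy10. MissingDay=;
run;
```

The log should show FullDate=10/21/1950 and MissingDay=. (a missing value), in line with Observations 1 and 3 of the listing.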
9.9 Substituting the 15th of the Month when the Day Value Is Missing

There are occasions where you have a missing value for the day of the month but still want to compute an approximate date. Many people use the 15th of the month to substitute for a missing Day value. You can use the Month_Day_Year data set from the previous section to demonstrate how this is done. Here is the program:

Program 9-8 Substituting the 15th of the month when a Day value is missing

data substitute;
   set learn.month_day_year;
   if missing(Day) then Date = mdy(Month,15,Year);
   else Date = mdy(Month,Day,Year);
   format Date mmddyy10.;
run;
Here the MISSING function tests if there is a missing value for the variable Day. If so, the number 15 is used as the second argument to the MDY function. The resulting listing shows the 15th of the month for the date in the third observation:

Listing of SUBSTITUTE

Obs    Month    Day    Year          Date
  1       10     21    1950    10/21/1950
  2        1     15       5    01/15/2005
  3        3      .    2005    03/15/2005
  4        5      7    2000    05/07/2000
9.10 Using Date Interval Functions

Two functions, INTCK and INTNX, deal with date intervals (such as months, quarters, years). The INTCK function computes the number of intervals between two dates; the INTNX function computes a date after a given number of intervals. First of all, nobody knows how to pronounce either of these two functions; you're on your own. Next, these functions can be very complicated. For a more complete treatment of these two functions, please see SAS Functions by Example [2] or SAS OnlineDoc [3].

To understand even the most basic use of these two functions, you must understand that they both deal with interval boundaries. For example, an interval boundary for the interval YEAR is January 1. A few examples make this clear. Look at the following table.

Expression                                     Value Returned
INTCK('year','01Jan2005'd,'31Dec2005'd)                     0
INTCK('year','31Dec2005'd,'01Jan2006'd)                     1
INTCK('month','01Jan2005'd,'31Jan2005'd)                    0
INTCK('month','31Jan2005'd,'01Feb2005'd)                    1
INTCK('qtr','25Mar2005'd,'15Apr2005'd)                      1

[2] See Ron Cody, SAS Functions by Example (Cary, NC: SAS Institute Inc., 2004), for a complete treatment of these two functions.
[3] See SAS OnlineDoc at http://support.sas.com/documentation/onlinedoc/index.html for more information on SAS functions.
As you can see, the INTCK function takes three arguments. The first is the desired interval (see the following table of interval values). The next two arguments represent a starting date and an ending date. The function returns the number of interval boundaries that have been crossed by going from the first date to the second date.

The first entry in the table starts at January 1, 2005, and ends at December 31, 2005. The value returned is 0 because a year boundary (January 1) was not crossed in going from the first date to the second. The next entry, from December 31, 2005, to January 1, 2006, returns a 1 because a year boundary was crossed (even though this represents only one day). The same logic holds for the two INTCK functions with MONTH as the interval. A month boundary is the first of any month. The last table entry asks how many quarters are crossed going from March 25, 2005 to April 15, 2005. Because the boundaries for quarters are January 1, April 1, July 1, and October 1, this expression returns a 1 (the boundary of April 1 was crossed).

A partial list of intervals supported by both the INTCK and INTNX functions is displayed in the following table.

Interval    Description
Week        Weekly intervals with Sunday as the first day of the week
Weekday     Weekdays (default Sat. and Sun. not included)
Qtr         Quarters (Jan. 1, April 1, July 1, and Oct. 1)
Semiyear    Semi-annual intervals
Year        Yearly intervals
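The boundary-counting rule is easy to verify by evaluating the five expressions from the table of INTCK examples in a single step. This sketch is an added illustration; the variable names are arbitrary.

```sas
data _null_;
   /* Same year: no January 1 boundary is crossed */
   Years1  = intck('year',  '01Jan2005'd, '31Dec2005'd);
   /* One day apart, but January 1 is crossed */
   Years2  = intck('year',  '31Dec2005'd, '01Jan2006'd);
   Months1 = intck('month', '01Jan2005'd, '31Jan2005'd);
   Months2 = intck('month', '31Jan2005'd, '01Feb2005'd);
   /* April 1 is a quarter boundary */
   Qtrs    = intck('qtr',   '25Mar2005'd, '15Apr2005'd);
   put Years1= Years2= Months1= Months2= Qtrs=;
run;
```

The log should show Years1=0 Years2=1 Months1=0 Months2=1 Qtrs=1, matching the table. Note that INTCK counts boundaries, not elapsed time, so it is not a substitute for YRDIF when you need an age.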
Here is an example. Data set Hosp contains 1,000 observations. Variables in this data set include a date of birth (DOB) and an admission date (AdmitDate). You want to see a graph showing the number of admissions for each quarter starting with January 1, 2003, and ending with June 30, 2006. Without the INTCK function, this task would require a lot of programming; with the INTCK function, this becomes a more straightforward problem. Here is one possible solution:
Program 9-9 Demonstrating the INTCK function

data frequency;
   set learn.hosp(keep=AdmitDate
                  where=(AdmitDate between '01Jan2003'd and
                                           '30Jun2006'd));
   Quarter = intck('qtr','01Jan2003'd,AdmitDate);
run;

title "Frequency of Admissions";
proc freq data=frequency noprint;
   tables Quarter / out=admit_per_quarter;
run;

goptions ftitle=swiss ftext=swiss;
symbol v=dot i=sm color=black width=2;
title height=2 "Frequency of Admissions From";
title2 height=2 "January 1, 2003 to June 30, 2006";
proc gplot data=admit_per_quarter;
   plot Count * Quarter;
run;
quit;
The key to this program is using the INTCK function to count the number of quarters from the starting date of January 1, 2003. You could simply use PROC FREQ to list the number of observations (visits) for each quarter. However, this program goes a bit further, hopefully to whet your appetite, and shows how you can use a combination of PROC FREQ and PROC GPLOT to produce a plot showing admission frequencies by quarter. In this program, PROC FREQ counts the number of observations for each quarter and outputs this value to a SAS data set (Admit_Per_Quarter). (Using PROC FREQ to output frequencies to a SAS data set is discussed in Chapter 24.) Finally, PROC GPLOT plots the distribution of admissions by quarter. Please see Chapter 20 for a discussion of PROC GPLOT. The resulting plot is shown next:
The converse of the INTCK function is INTNX. This function takes a starting date and returns the date after which a given number of interval boundaries have been crossed. Here is an example. You have a data set (Discharge) containing patient numbers (PatNo) and discharge dates (DischrDate). You want to notify each patient on the first day of the month 6 months after discharge. First, here is a listing of the Discharge data set:

Listing of DISCHARGE

Pat
 No    DischrDate
879    11/29/2003
880    11/30/2003
883    09/04/2003
884    08/28/2003
885    09/04/2003
886    08/26/2003
887    08/31/2003
888    08/25/2003
913    11/16/2003
914    11/15/2003
Program 9-10 computes notification dates:
Program 9-10 Using the INTNX function to compute dates 6 months after discharge

data followup;
   set learn.discharge;
   FollowDate = intnx('month',DischrDate,6);
   format FollowDate date9.;
run;

Here the INTNX function uses MONTH as the interval. Because no alignment is specified, it returns the first day of the month that is 6 months after the discharge date. A listing follows:

Listing of FOLLOWUP

Pat       Dischr       Follow
 No         Date         Date
879    29NOV2003    01MAY2004
880    30NOV2003    01MAY2004
883    04SEP2003    01MAR2004
884    28AUG2003    01FEB2004
885    04SEP2003    01MAR2004
886    26AUG2003    01FEB2004
887    31AUG2003    01FEB2004
888    25AUG2003    01FEB2004
913    16NOV2003    01MAY2004
914    15NOV2003    01MAY2004
If you want the patient to return on the same day of the month after an interval of 6 months has passed, you can add SAMEDAY, an optional fourth argument to the INTNX function. This fourth argument is referred to in SAS documentation as an alignment parameter. Choices are BEGINNING, MIDDLE, END, and SAMEDAY. SAMEDAY is a particularly useful option since you can find dates on the same day of the week or the same day of the month as the starting date. To see how this can help you recall the patients on the same day of the month, 6 months hence, the program is written like this:
Program 9-11 Demonstrating the SAMEDAY alignment with the INTNX function

data followup;
   set learn.discharge;
   FollowDate = intnx('month',DischrDate,6,'sameday');
   format FollowDate date9.;
run;

The resulting output is:

Listing of FOLLOWUP

Pat       Dischr       Follow
 No         Date         Date
879    29NOV2003    29MAY2004
880    30NOV2003    30MAY2004
883    04SEP2003    04MAR2004
884    28AUG2003    28FEB2004
885    04SEP2003    04MAR2004
886    26AUG2003    26FEB2004
887    31AUG2003    29FEB2004
888    25AUG2003    25FEB2004
913    16NOV2003    16MAY2004
914    15NOV2003    15MAY2004
The follow-up date is now on the same day of the month as the discharge date. As you can see, these two functions can be very confusing. And, we haven’t even mentioned multiple and shifted intervals! A parting word: be very careful when you use these two functions and try running your programs with test data.
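In the spirit of running these functions on test data first, the following sketch compares three alignments for a single awkward date, August 31, 2003 (patient 887 in the listings). It is an added illustration, and the variable names are arbitrary.

```sas
data _null_;
   Start = '31Aug2003'd;
   /* Default alignment is the beginning of the interval */
   FirstDay = intnx('month', Start, 6);
   /* SAMEDAY: February has no 31st, so the last day of the month is used */
   SameDay  = intnx('month', Start, 6, 'sameday');
   /* END: the last day of the month */
   LastDay  = intnx('month', Start, 6, 'end');
   put FirstDay= date9. SameDay= date9. LastDay= date9.;
run;
```

The log should show FirstDay=01FEB2004, and both SameDay and LastDay as 29FEB2004, because 2004 is a leap year. Dates near the end of a month are good test cases whenever you use SAMEDAY.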
9.11 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. You have several lines of data, consisting of a subject number and two dates (date of birth and visit date). The subject starts in column 1 (and is 3 bytes long), the date of birth starts in column 4 and is in the form mm/dd/yyyy, and the visit date starts in column 14 and is in the form nnmmmyyyy (see sample lines below). Read the following lines of data to create a temporary SAS data set called Dates. Format both dates using the DATE9. format. Include the subject’s age at the time of the visit in this data set.

0011021195011Nov2006
0020102195525May2005
0031225200525Dec2006

2. Using the following lines of data, create a temporary SAS data set called ThreeDates. Each line of data contains three dates, the first two in the form mm/dd/yyyy and the last in the form ddmmmyyyy. Name the three date variables Date1, Date2, and Date3. Format all three using the MMDDYY10. format. Include in your data set the number of years from Date1 to Date2 (Year12) and the number of years from Date2 to Date3 (Year23). Round these values to the nearest year. Here are the lines of data (note that the columns do not line up):

01/03/1950 01/03/1960 03Jan1970
05/15/2000 05/15/2002 15May2003
10/10/1998 11/12/2000 25Dec2005

3. You have several dates that range from 1910 to 2006 in a raw data file. Unfortunately, all of the dates only have two-digit years. Read these dates and be sure that the dates in the resulting data set (call it Dates1910_2006) are in this range. Hint: Remember the system option YEARCUTOFF. Here are the values (all starting in column 1).

01/01/11
02/23/05
03/15/15
05/09/06

4. Using the Hosp data set, compute the subjects’ ages two ways: as of January 1, 2006 (call it AgeJan1), and as of today’s date (call it AgeToday). The variable DOB represents the date of birth. Take the integer portion of both ages. List the first 10 observations.
5. Using the Hosp data set, compute the frequencies for the days of the week, months of the year, and year, corresponding to the admission dates (variable AdmitDate). Supply a format for the days of the week and months of the year. Use PROC FREQ to list these frequencies.

6. Using the Medical data set, compute frequencies for the days of the week for the date of the visit (VisitDate). Supply a format for the days of the week and months of the year.

7. Using the Hosp data set, list all the observations with admission dates (AdmitDate) before July 15, 2002. Write your statement so that if there were any missing values for AdmitDate, they are not included in this list.

8. Using the values for Day, Month, and Year in the raw data below, create a temporary SAS data set containing a SAS date based on these values (call it Date) and format this value using the MMDDYY10. format. Here are the Day, Month, and Year values:

25 12 2005
1 1 1960
21 10 1946

9. Repeat Problem 8, except use the following data. If there is a missing value for the day, substitute the 15th of the month.

25 12 2005
. 5 2002
12 8 2006

10. Using the Hosp data set, compute the number of months from the admission date (AdmitDate) to December 31, 2007 (call it MonthsDec). Also, compute the number of months from the admission date to today's date (call it MonthsToday). Use a date interval function to solve this problem. List the first 20 observations for your solution.

11. The data set Medical contains a variable called VisitDate. Create a temporary SAS data set (Interval) with the variables in Medical plus a new variable (Quarter), representing the number of quarters from January 1, 2006.

12. You want to see each patient in the Medical data set on the same day of the week 5 weeks after they visited the clinic (the variable name is VisitDate). Provide a listing of the patient number (Patno), the visit date, and the date for the return visit.
13. You want to see each patient in the Medical data set on the same day of the month 6 months after they visited the clinic (the variable name is VisitDate). Provide a listing of the patient number (Patno), the visit date, and the date of the return visit. Remember, the SAMEDAY alignment, when used with the MONTH interval, results in a date on the same day of the month.
Chapter 10

Subsetting and Combining SAS Data Sets

10.1  Introduction
10.2  Subsetting a SAS Data Set
10.3  Creating More Than One Subset Data Set in One DATA Step
10.4  Adding Observations to a SAS Data Set
10.5  Interleaving Data Sets
10.6  Combining Detail and Summary Data
10.7  Merging Two Data Sets
10.8  Omitting the BY Statement in a Merge
10.9  Controlling Observations in a Merged Data Set
10.10 More Uses for IN= Variables
10.11 When Does a DATA Step End?
10.12 Merging Two Data Sets with Different BY Variable Names
10.13 Merging Two Data Sets with Different BY Variable Data Types
10.14 One-to-One, One-to-Many, and Many-to-Many Merges
10.15 Updating a Master File from a Transaction File
10.16 Problems
10.1 Introduction

This chapter describes how to subset a SAS data set and how to combine data from several data sets into a single data set.

10.2 Subsetting a SAS Data Set

Subsetting a SAS data set involves selecting observations from one data set by defining selection criteria, usually in a WHERE or subsetting IF statement. As an example, suppose you want to select all observations from the permanent SAS data set Survey where the value of Gender is F. One way to do this is with a WHERE statement, as follows:

Program 10-1 Subsetting a SAS data set using a WHERE statement

data females;
   set learn.survey;
   where gender = 'F';
run;
Remember that the variables used in a WHERE statement must all come from a SAS data set. Variables that are created by reading raw data or from an assignment statement may not be used in this fashion. In Program 10-1, the data set Females contains all the observations in the data set Survey where Gender has a value of F. Here is a listing of this data set:
Listing of FEMALES

ID     Gender    Age    Salary    Ques1    Ques2    Ques3    Ques4    Ques5
002      F        55     76123      4        5        2        1        1
004      F        67    128000      5        3        2        2        4
007      F        45     76100      5        3        4        3        3
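The restriction on WHERE mentioned above (its variables must already exist in the input data set) is the main reason to reach for a subsetting IF instead. The sketch below is an added illustration. Survey_Demo is a hypothetical stand-in that holds only the three observations and four of the variables from the listing of FEMALES, and SalaryK is a made-up variable.

```sas
data survey_demo;
   input ID $ Gender $ Age Salary;
datalines;
002 F 55 76123
004 F 67 128000
007 F 45 76100
;

/* WHERE works because Age comes from the input data set */
data over_50;
   set survey_demo;
   where Age ge 50;
run;

/* SalaryK is created by an assignment statement, so use a subsetting IF */
data high_paid;
   set survey_demo;
   SalaryK = Salary / 1000;
   if SalaryK gt 100;
run;
```

Over_50 should contain subjects 002 and 004, and High_Paid should contain only subject 004. Replacing the subsetting IF with a WHERE statement on SalaryK would produce an error, because SalaryK is not a variable in Survey_Demo.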
Notice that the data set Females contains all the variables found in the data set Survey. If you do not need one or more variables from the input data set (the data set on the SET statement), you can use a DROP= or KEEP= data set option. For example, to omit SALARY from the new data set, you can use the following method:
Program 10-2 Demonstrating a DROP= data set option

data females;
   set learn.survey(drop=Salary);
   where gender = 'F';
run;
There is an important difference between using a DROP= data set option on the input data set and placing a DROP statement somewhere in the DATA step. In this example, the variable Salary is not present in the Program Data Vector (PDV). See Chapter 2, “How SAS Works (a Look inside the ‘Black Box’).” If you used a DROP statement listing Salary instead of the DROP= data set option, Salary would be in the PDV but not written out to the Females data set. If the input data set contains a large number of variables and you only want a few of these variables in the new data set, using the DROP= data set option is more efficient. However, you must remember that when you use the DROP= data set option, the variables in the drop list are not available in the DATA step.
10.3 Creating More Than One Subset Data Set in One DATA Step

You can create multiple SAS data sets from one input data set (something that SQL cannot do). Following the previous example, you can create a data set containing only data on females and one containing only data on males, in one step like this:

Program 10-3 Creating two data sets in one DATA step

data males females;
   set learn.survey;
   if gender = 'F' then output females;
   else if gender = 'M' then output males;
run;
Notice that you must name the data set following the OUTPUT statement. If you do not, SAS outputs the observation to all the data sets listed in the DATA statement.
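The difference between a named and an unnamed OUTPUT statement can be seen with a tiny test data set. This sketch is an addition to the text; Survey_Demo, Both_A, and Both_B are hypothetical names, and the three data lines are made up.

```sas
data survey_demo;
   input ID $ Gender $;
datalines;
001 M
002 F
003 M
;

/* Named OUTPUT: each observation goes to one data set only */
data males females;
   set survey_demo;
   if Gender = 'F' then output females;
   else if Gender = 'M' then output males;
run;

/* Unnamed OUTPUT: each observation goes to every data set listed */
data both_a both_b;
   set survey_demo;
   output;
run;
```

After this runs, Males should have 2 observations, Females 1, and Both_A and Both_B 3 each. The notes in the SAS log report these counts.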
10.4 Adding Observations to a SAS Data Set

Suppose you want to create a single data set from several similar data sets. For example, your company may collect data each month into separate data sets and you want to analyze a year’s worth of data. You can list as many data sets as you want on a SET statement and SAS will add all the observations together to form a single data set. For example, look at the listings of data sets One and Two here:

Listing of ONE

Obs    ID    Name         Weight
  1     7    Adams           210
  2     1    Smith           190
  3     2    Schneider       110
  4     4    Gregory          90

Listing of TWO

Obs    ID    Name         Weight
  1     9    Shea            120
  2     3    O'Brien         180
  3     5    Bessler         207
Each of the data sets contains the same variables. Also, the storage length for Name is the same in each of the two data sets. One way to combine these data sets is like this:
Program 10-4 Using a SET statement to combine observations from two data sets

data one_two;
   set one two;
run;

Here is the output:

Listing of ONE_TWO

Obs    ID    Name         Weight
  1     7    Adams           210
  2     1    Smith           190
  3     2    Schneider       110
  4     4    Gregory          90
  5     9    Shea            120
  6     3    O'Brien         180
  7     5    Bessler         207
All the observations from data set One are followed by all the observations from data set Two. SAS refers to this process as concatenating data sets. This seems simple enough. But what happens if you use the SET statement on two data sets that don’t contain all the same variables? To see what happens, here is data set Three:

Listing of THREE

Obs    ID    Gender    Name
  1    10      M       Horvath
  2    15      F       Stevens
  3    20      M       Brown
Data set Three contains a new variable, Gender, and does not contain the variable Weight. You can’t tell from the listing, but the variable Name in both data sets is the same length. Let’s use the SET statement on data sets One and Three and see what happens.
Program 10-5 Using a SET statement on two data sets containing different variables

data one_three;
   set one three;
run;

Here is the output:

Listing of ONE_THREE

Obs    ID    Name         Weight    Gender
  1     7    Adams           210
  2     1    Smith           190
  3     2    Schneider       110
  4     4    Gregory          90
  5    10    Horvath           .      M
  6    15    Stevens           .      F
  7    20    Brown             .      M
Looking at this output helps you understand what is going on. At compile time, SAS looks at every data set listed in the SET statement. First comes data set One. This brings ID, Name, and Weight into the PDV (and all their attributes). Next SAS looks at data set Three. Are there any variables in data set Three that are not already in the PDV? Yes, Gender. Therefore, Gender is added to the PDV. Let’s follow what goes on as Program 10-5 executes. First, all the variables in the PDV are set to missing. Then, each observation from data set One is read and then written to the output data set. The PDV is not set back to missing as each observation in data set One is processed, but this really doesn’t matter because each new observation replaces the values in the PDV. (This would be true even if one of the values in the current observation was missing.) After all the observations in data set One are read and written out to data set One_Three and SAS prepares to read observations from data set Three, all the variables in the PDV are again set to missing. This is important because data set Three does not contain the variable Weight. If this initialization (setting values in the PDV to missing) did not take place, every person coming from data set Three would wind up at 90 pounds (the last Weight value in data set One). SAS continues reading observations from data set Three until it reaches the end of the data set.
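If you do not have data sets One and Three at hand, you can rebuild them from the listings and watch the PDV behavior described above. This is a sketch added for illustration; the length of 10 for Name is an assumption, because the listings do not show the stored length.

```sas
data one;
   /* The :$10. informat gives Name a length of 10 (assumed) */
   input ID Name :$10. Weight;
datalines;
7 Adams 210
1 Smith 190
2 Schneider 110
4 Gregory 90
;

data three;
   input ID Gender $ Name :$10.;
datalines;
10 M Horvath
15 F Stevens
20 M Brown
;

data one_three;
   set one three;
run;

title "Listing of ONE_THREE";
proc print data=one_three;
run;
```

The variables should appear in the order ID, Name, Weight, Gender, because that is the order in which they enter the PDV. Weight should be missing for the three observations from Three, and Gender should be blank for the four observations from One.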
If the length of a variable is different in any of the input data sets, the length of the variable in the output data set is equal to the length that variable has in the first data set encountered in the DATA step. It is a good idea to check lengths of character variables when you are combining several data sets to be sure you will not truncate any values. If necessary, place a LENGTH statement before the SET statement to be sure the resulting length is adequate to hold all of your values. (Remember that the length of a character variable is determined as soon as that variable enters the PDV; it cannot be changed after that.) Finally, if you have a variable in two data sets, one character and the other numeric, SAS prints an error message in the log and the program terminates.
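The effect of mismatched lengths, and the LENGTH statement that prevents it, can be sketched with two one-observation data sets. The names Short, Long, Truncated, and Safe are hypothetical and are used only for this illustration.

```sas
data short;
   length Name $ 5;
   Name = 'Adams';
run;

data long;
   length Name $ 9;
   Name = 'Schneider';
run;

/* Name takes its length (5) from Short, the first data set on SET */
data truncated;
   set short long;
run;

/* The LENGTH statement comes first, so Name has length 9 */
data safe;
   length Name $ 9;
   set short long;
run;
```

In Truncated, the second value of Name should be cut off to Schne. In Safe, both values should be stored in full. Running PROC CONTENTS on each input data set beforehand is a simple way to spot this kind of mismatch.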
10.5 Interleaving Data Sets

There is another way to add observations from several data sets. If each of the data sets to be combined is already sorted, you can take advantage of that fact and wind up with a sorted output data set. All you need to do is to follow the SET statement with a BY statement. SAS selects observations from each of the input data sets in order, with the resulting data set already sorted. The advantage of this method is that you don’t have to sort the resulting data set. There are times when the resulting data set would be too large to sort conveniently or at all.

To demonstrate the interleaving of data sets, let’s combine data sets One and Two, with the resulting data set in ID order. Here is the code:
Program 10-6 Interleaving data sets

   proc sort data=one;
      by ID;
   run;

   proc sort data=two;
      by ID;
   run;

   data interleave;
      set one two;
      by ID;
   run;
Remember, each of the data sets in the SET statement must be in order of the BY variable(s). Here is the output:

Listing of INTERLEAVE

   ID   Name        Weight
    1   Smith          190
    2   Schneider      110
    3   O'Brien        180
    4   Gregory         90
    5   Bessler        207
    7   Adams          210
    9   Shea           120
Notice that this output data set is in ID order. There is a minor difference between interleaving the data sets and concatenating them and then sorting the result. When you use PROC SORT to sort a SAS data set, a sort flag is set (you can see this on the first page of output from PROC CONTENTS) and SAS does not re-sort this data set if you attempt to sort it again by the same BY variables. When you interleave data sets, this sort flag is not set (which should not cause you any problems).
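For comparison, here is a sketch of the concatenate-then-sort route mentioned above. The data set name Concat is made up for this illustration. Running PROC CONTENTS afterward lets you look at the sort information that PROC SORT records.

```sas
/* Concatenate: all of One followed by all of Two */
data concat;
   set one two;
run;

/* Sort the combined data set; this sets the sort flag */
proc sort data=concat;
   by ID;
run;

/* Inspect the data set attributes, including sort information */
proc contents data=concat;
run;
```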
10.6 Combining Detail and Summary Data

Suppose you have a SAS data set and want to express values of a variable as a percentage of the mean for all observations. For example, you want to express each value of cholesterol in the Blood data set as a percentage of the mean for all subjects. Problems like this involve combining detail data (data on individuals) with summary data (the mean of all subjects). One solution is to perform a SET operation conditionally. Here is an example:
Program 10-7 Combining detail and summary data: using a conditional SET statement

   proc means data=learn.blood noprint;
      var Chol;
      output out = means(keep=AveChol)
             mean = AveChol;
   run;

   data percent;
      set learn.blood(keep=Subject Chol);
      if _n_ = 1 then set means;
      PerChol = Chol / AveChol;
      format PerChol percent8.;
   run;
The PROC MEANS step creates a SAS data set (Means) with one observation and one variable. AveChol is the mean cholesterol value for all observations in the Blood data set. (You can read more about creating summary data sets in Chapter 16.) A listing of the data set Means follows:

Listing of Data Set MEANS

   Obs    AveChol
     1    201.435
To combine this single value with every observation in the Blood data set, you execute a SET statement conditionally. Here’s how it works. The first SET statement brings in an observation from the Blood data set. The automatic variable _n_ counts iterations of the DATA step. You can use this variable to conditionally perform the SET operation on the Means data set. During the first iteration of the DATA step, _n_ is equal to 1 and the first (and only) observation from the Means data set is brought into the PDV. The first observation in the data set Percent contains all the variables in the Blood data set plus the variable AveChol.

For the second iteration of the DATA step, the next observation from the Blood data set is brought into the PDV. Because _n_ is now equal to 2, the conditional SET statement does not execute. (Without the condition on the second SET statement, the DATA step would end when SAS tried to read a second observation from the Means data set; see Section 10.11 for details on when a DATA step ends.) However, because variables that come from SAS data sets are automatically retained (that is, the value in the PDV is not set to a missing value), the AveChol value is still in the PDV and it is added to the second observation. This process continues until the Blood data set reaches the end of file.

You may wonder why the computation of PerChol doesn’t multiply the result by 100. The answer is that the PERCENT format not only adds a percent sign to the value, it also multiplies the value by 100.
To help make this clear, here are listings of the first few observations in the Blood data set, the Means data set, and the Percent data set:

Listing of BLOOD - first 5 observations

   Subject    Chol
         1     258
         2       .
         3     184
         4       .
         5     187

Listing of MEANS

   Obs    AveChol
     1    201.435

Listing of PERCENT - first 5 observations

   Obs    Subject    Chol    AveChol    PerChol
     1          1     258    201.435       128%
     2          2       .    201.435         .
     3          3     184    201.435        91%
     4          4       .    201.435         .
     5          5     187    201.435        93%
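To see the effect of the PERCENT format in isolation, you could run a small sketch like the one below. It hard-codes the first Chol value and the mean from the listings above rather than reading the Blood data set. The ratio is roughly 1.28, and the format displays it as a percentage.

```sas
data _null_;
   /* Values copied from the listings above for illustration */
   Chol    = 258;
   AveChol = 201.435;
   PerChol = Chol / AveChol;      /* stored as a ratio, about 1.28 */
   put 'Unformatted: ' PerChol;
   put 'Formatted:   ' PerChol percent8.;
run;
```

The stored value is still the ratio; only the displayed value is scaled by 100.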
10.7 Merging Two Data Sets

SAS uses the term merge to describe the process of combining variables (columns) from two or more data sets. For example, you could have an employee data set (Employee) containing ID numbers and names. If you had another data set (Hours) containing ID numbers, along with a job class and the number of hours worked, you might want to add the name from the Employee data set to each observation in the Hours data set.
To demonstrate this process, take a look at a listing of the two data sets here:

Listing of EMPLOYEE

   ID   Name
    1   Smith
    2   Schneider
    4   Gregory
    5   Washington
    7   Adams

Listing of HOURS

   ID   JobClass   Hours
    1   A             39
    4   B             44
    5   A             35
    9   B             57
You want to merge the Employee and Hours data sets as follows:
Program 10-8 Merging two SAS data sets

   proc sort data=employee;
      by ID;
   run;

   proc sort data=hours;
      by ID;
   run;

   data combine;
      merge employee hours;
      by ID;
   run;
You first sort each data set by the variable or variables that link the two data sets. Next, you name each of the data sets in a MERGE statement. Be sure to follow the MERGE statement with a BY statement, naming the variable or variables that tell SAS which observations from the input data sets to combine into a single output observation.
Here is the data set Combine:

Listing of COMBINE

   ID   Name         JobClass   Hours
    1   Smith        A             39
    2   Schneider                   .
    4   Gregory      B             44
    5   Washington   A             35
    7   Adams                       .
    9                B             57
It’s pretty clear what is happening. When a value of the BY variable is present in both data sets, the merged observation contains the corresponding values from both data sets. When a BY value is absent from the Employee data set, Name will have a missing value; when a BY value is absent from the Hours data set, JobClass and Hours will have missing values.
10.8 Omitting the BY Statement in a Merge

This is a good time to discuss what happens when you omit a BY statement when performing a merge. Without the BY statement in Program 10-8, the first three observations in data set Employee would be matched with the first three observations in data set Hours. The result would be completely wrong. To be sure this is clear, here is a listing of the merge when you omit the BY statement:

Listing of COMBINE (when you omit the BY statement)

   ID   Name         JobClass   Hours
    1   Smith        A             39
    4   Schneider    B             44
    5   Gregory      A             35
    5   Washington                  .
    7   Adams                       .
Performing a merge without a BY statement is usually a mistake and almost always a bad idea. You can set a system option called MERGENOBY to values of NOWARN, WARN, or ERROR to control what happens when you attempt to perform a merge without a BY statement. The default value is NOWARN; that is, SAS performs the merge and does not issue a warning message. If you set this option to WARN, SAS still performs the merge, but a warning message is printed in the SAS log. Finally, if you set this option to ERROR, the program terminates if you attempt a merge that is not followed with a BY statement. At the very minimum, you should set this option to WARN, like this:

   options mergenoby = warn;

To be even safer, set it to ERROR, like this:

   options mergenoby = error;
10.9 Controlling Observations in a Merged Data Set

You may want the merged data set to contain only those employees who are in the Hours data set or you may want to see if any employees in the Hours data set are not listed in the Employee data set. SAS provides a method of controlling which observations you want in the merged data set. To demonstrate how this works, let’s merge the two data sets as before, but add a data set option called IN= and examine the result:
Program 10-9 Demonstrating the IN= data set option

   data new;
      merge employee(in=InEmploy)
            hours (in=InHours);
      by ID;
      file print;
      put ID= InEmploy= InHours= Name= JobClass= Hours=;
   run;
An IN= data set option follows each data set name. Following the keyword IN= is a variable name that you make up. In this example, the names InEmploy and InHours were chosen. These two variables are temporary variables (SAS will not add them to the output data set). They have a value of 1 (true) if the data set they refer to is making a contribution to the current observation and 0 (false) otherwise. Because these variables are not in the output data set, Program 10-9 uses a PUT statement to list the values of these variables, along with other variables in the data set. Here is the output:

Demonstrating the IN= Data Set Option

   ID=1 InEmploy=1 InHours=1 Name=Smith JobClass=A Hours=39
   ID=2 InEmploy=1 InHours=0 Name=Schneider JobClass= Hours=.
   ID=4 InEmploy=1 InHours=1 Name=Gregory JobClass=B Hours=44
   ID=5 InEmploy=1 InHours=1 Name=Washington JobClass=A Hours=35
   ID=7 InEmploy=1 InHours=0 Name=Adams JobClass= Hours=.
   ID=9 InEmploy=0 InHours=1 Name= JobClass=B Hours=57
Because Employee 1 is in both data sets, InEmploy and InHours are both equal to 1 in the first observation. Employee 2 is in the Employee data set but not in the Hours data set, so in the second observation InEmploy is equal to 1 and InHours is equal to 0. Finally, in the last observation (ID = 9), there is no corresponding ID in the Employee data set, so InEmploy is equal to 0. You can use IN= variables to control which observations are written to the output data set. For example, if you want only those employees who are in both data sets, you can add a subsetting IF statement to the DATA step, like this:
Program 10-10 Using IN= variables to select IDs that are in both data sets

   data combine;
      merge employee(in=InEmploy)
            hours(in=InHours);
      by ID;
      if InEmploy and InHours;
   run;

You can, alternatively, write the subsetting IF statement like this:

   if InEmploy = 1 and InHours = 1;
The subsetting IF statement in this program (see Chapter 7 for more details) allows observations in the COMBINE data set only where both InEmploy and InHours are true. When either of these IN= variables is false, no observation is written to the output data set.
The resulting data set contains only Employees 1, 4, and 5. Here is a listing of the data set Combine:

Listing of COMBINE

   ID   Name         JobClass   Hours
    1   Smith        A             39
    4   Gregory      B             44
    5   Washington   A             35
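Program 10-10 keeps only the IDs found in both data sets. A common variation, sketched below, keeps every employee in the Employee data set whether or not hours were submitted. The output data set name All_Employees is made up for this illustration.

```sas
data all_employees;
   merge employee(in=InEmploy)
         hours;
   by ID;
   /* Keep an observation only when Employee contributed to it */
   if InEmploy;
run;
```

With the sample data shown earlier, this version should keep IDs 1, 2, 4, 5, and 7, with missing values of JobClass and Hours for IDs 2 and 7. ID 9 is dropped because it appears only in the Hours data set.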
10.10 More Uses for IN= Variables

You can use IN= variables to check if an ID is missing from a particular data set as well. The program here accomplishes this:
Program 10-11 More examples of using IN= variables

   data in_both
        missing_name(drop = Name);
      merge employee(in=InEmploy)
            hours(in=InHours);
      by ID;
      if InEmploy and InHours then output in_both;
      else if InHours and not InEmploy then
         output missing_name;
   run;
In this program, you are creating two SAS data sets in one DATA step. Data set In_Both contains all observations where the ID variable was found in both data sets. Data set Missing_Name contains all the employees who were in the Hours data set but were not in the Employee data set. You can also write the ELSE IF statement like this:

   else if InHours = 1 and InEmploy = 0 then
      output missing_name;
Either way, you are asking for all IDs that are in the Hours data set and not in the Employee data set. Use whatever form of this statement makes most sense to you.
To be sure this is clear, here are the listings of both data sets:

Employees with Hours who are in Employee File

   ID   Name         JobClass   Hours
    1   Smith        A             39
    4   Gregory      B             44
    5   Washington   A             35

Employees who submitted hours but are not in file

   ID   JobClass   Hours
    9   B             57
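The same approach extends to the opposite case. The sketch below adds a third output data set for employees who are in the Employee data set but did not submit hours. The data set name No_Hours is an assumption made for this illustration.

```sas
data in_both
     missing_name(drop = Name)
     no_hours(keep = ID Name);
   merge employee(in=InEmploy)
         hours(in=InHours);
   by ID;
   if InEmploy and InHours then output in_both;
   else if InHours then output missing_name;  /* Hours only */
   else output no_hours;                      /* Employee only */
run;
```

In a merge, at least one IN= variable is true for every observation, so the final ELSE catches exactly the observations that came from Employee alone. With the sample data, those should be IDs 2 and 7.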
10.11 When Does a DATA Step End?

When any data set reaches an end of file, it signals the end of the DATA step. Take a look at the short program that follows:
Program 10-12 Demonstrating when a DATA step ends

   data short;
      input x;
   datalines;
   1
   2
   ;

   data long;
      input x;
   datalines;
   3
   4
   5
   6
   ;

   data new;
      set short;
      output;
      set long;
      output;
   run;
Data set Short has two observations and data set Long has four. How many observations are in data set New? Each SET statement keeps a pointer to keep track of which observation it is reading. In this program, an observation is first read from the Short data set, an observation is written out to the New data set, an observation is read from the Long data set, and another observation is written to the New data set. You might expect that this would continue until all the observations from both data sets were read. However, when the end of file on data set Short is encountered, it signals an end to the DATA step, with the result that data set New has only four observations, with values of x equal to 1, 3, 2, and 4 (in that order, because the two SET statements alternate). In more complicated programs, you need to be sure that your DATA step doesn’t end prematurely when an input data set reaches an end of file.
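If you do want to keep alternating until both data sets are exhausted, one possible approach is to guard each SET statement with an END= variable so that SAS never attempts to read past the end of either data set. This is only a sketch; the names New_All, EndShort, and EndLong are made up for this illustration.

```sas
data new_all;
   /* Read from Short only while it still has observations */
   if not EndShort then do;
      set short end=EndShort;
      output;
   end;
   /* Read from Long only while it still has observations */
   if not EndLong then do;
      set long end=EndLong;
      output;
   end;
   /* Stop explicitly once both data sets have been read */
   if EndShort and EndLong then stop;
run;
```

With the Short and Long data sets from Program 10-12, this version should write all six values of x. The explicit STOP is needed because no SET statement ever runs into an end of file to end the step for you.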
10.12 Merging Two Data Sets with Different BY Variable Names

In a perfect world, multiple data sets would all use the same name for the variable needed to put them together. However, you will sometimes find yourself trying to merge two data sets where the variable you want to use to join them has a different name in each data set. For example, one data set may call a variable ID and the other EmpID. Luckily, this is an easy problem to solve; you can use a RENAME= data set option to rename the variable in one data set to be consistent with the name in the other.
Here is an example: Data set Bert has two variables: ID (a character variable) and X (a numeric variable). Data set Ernie also has two variables: EmpNo (a character variable) and Y (a numeric variable). You want to merge these two data sets on the ID (or EmpNo) variable. Listings of these two data sets are shown here:

Listing of Data Set BERT

   ID       X
   123     90
   222     95
   333    100

Listing of Data Set ERNIE

   EmpNo      Y
   123      200
   222      205
   333      317
You can use the RENAME= data set option to either rename ID to EmpNo or vice versa. Here’s one solution:
Program 10-13 Merging two data sets by renaming a variable in one data set

   data sesame;
      merge bert
            ernie(rename=(EmpNo = ID));
      by ID;
   run;
Here, the name of the variable EmpNo from data set Ernie is changed to ID as it is read into the PDV so that the data sets can be merged on a common variable. (Names of variables in data set Ernie are not affected.) Data set Sesame contains three observations, as shown in the following listing:

Listing of SESAME

   ID       X      Y
   123     90    200
   222     95    205
   333    100    317
Notice that the name of the ID variable in the merged data set is ID.
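Going the other way is just as easy. The sketch below renames ID in data set Bert instead, so the merged data set carries the name EmpNo. The output data set name Sesame2 is made up for this illustration.

```sas
data sesame2;
   merge bert(rename=(ID = EmpNo))
         ernie;
   by EmpNo;
run;
```

Which data set you rename depends only on the variable name you want in the result; the stored data sets Bert and Ernie are unchanged either way.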
10.13 Merging Two Data Sets with Different BY Variable Data Types

A slightly more complicated situation occurs when the variable you want to use as the BY variable in a merge is a different data type in the two data sets. SAS performs a merge only when the BY variables in both data sets are the same type, either character or numeric. As an example, two divisions of your company store Social Security numbers differently (they do this just to make your life interesting). Division 1 stores the Social Security number as a numeric variable and Division 2 stores the Social Security number as a character variable. This is not an unusual situation. Here are listings of two sample data sets to demonstrate this problem:

Listing of Data Set DIVISION1

   SS           DOB          Gender
   111223333    11/14/1956   M
   123456789    05/17/1946   F
   987654321    04/01/1977   F

Listing of Data Set DIVISION2

   SS            JobCode   Salary
   111-22-3333   A10        45123
   123-45-6789   B5         35400
   987-65-4321   A20        87900
In this example, not only are the data types different, but one of the data sets (Division2) includes dashes in the SS values. Luckily, SAS can handle this easily. You can create a character variable based on the numeric value of SS in the Division1 data set or you can create a numeric variable based on the character variable in the Division2 data set. The following solution uses the first approach:
Program 10-14 Merging two data sets when the BY variables are different data types

   data division1c;
      set division1(rename=(SS = NumSS));
      SS = put(NumSS,ssn11.);
      drop NumSS;
   run;

   data both_divisions;
      ***Note: Both data sets already in order
         of BY variable;
      merge division1c division2;
      by SS;
   run;
Because the two data sets use the same name for the BY variable, first you have to rename this variable and then create a new character variable with the variable name of SS. The PUT function takes the variable named in the first argument, applies the format named in the second argument, and returns the formatted value (see Chapter 11 for more details on this function). The SSN11. format is a built-in SAS format that prints leading 0s and adds dashes as required for Social Security numbers. Thus, the variable SS in data set Division1C is identical in form to the SS variable in Division2. You can now proceed with the merge in the standard manner. Here is a listing of the resulting data set:

Listing of BOTH_DIVISIONS
Converting Numeric to Character

   DOB          Gender   SS            JobCode   Salary
   11/14/1956   M        111-22-3333   A10        45123
   05/17/1946   F        123-45-6789   B5         35400
   04/01/1977   F        987-65-4321   A20        87900
You may choose to create a numeric variable from the character value of SS in data set Division2 instead. The choice of which method to use depends on which data type you want in the merged data set. Here is the alternative approach—creating a numeric variable for the merge:
Program 10-15 An alternative to Program 10-14

   data division2n;
      set division2(rename=(SS = CharSS));
      SS = input(compress(CharSS,'-'),9.);
      ***Alternative:
         SS = input(CharSS,comma11.);
      drop CharSS;
   run;
Because the character value of SS contains dashes, first you need to use the COMPRESS function to remove the dashes from the value, and then use the INPUT function to perform the character-to-numeric conversion. The INPUT function takes the first argument (a character variable), and applies the informat specified in the second argument. A very clever alternative (inspired by a good friend, Mike Zdeb) is to use the comma informat directly in the INPUT function. It is not commonly known that this informat not only removes commas and dollar signs when reading a value, it also removes dashes. You can therefore use the COMMA11. informat in the INPUT function directly without first having to remove the dashes. You can now merge data sets Division1 and Division2N as before. We will not bother to show a listing of the result; it contains a numeric variable called SS.
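If you would like to convince yourself that the two conversions agree, a short test like the following sketch writes both results to the SAS log. It uses one hard-coded value taken from the Division2 listing; the variable names FromCompress and FromComma are made up for this illustration.

```sas
data _null_;
   CharSS = '123-45-6789';
   /* Remove the dashes first, then read nine digits */
   FromCompress = input(compress(CharSS,'-'),9.);
   /* Let the COMMA11. informat strip the dashes itself */
   FromComma    = input(CharSS,comma11.);
   put FromCompress= FromComma=;
run;
```

Both variables should hold the same numeric value, with no dashes.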
10.14 One-to-One, One-to-Many, and Many-to-Many Merges

In the merge examples shown previously, there was only one observation for each value of the BY variable in both data sets. This is called a one-to-one merge. You may have a situation where one data set has only one observation for each value of the BY variable, but the other has more than one observation for each value of the BY variable. This is referred to as a one-to-many merge. As an example, suppose you want to merge data set Bert with a data set with multiple values of ID. To demonstrate this, take a look at the following two data sets:

Listing of Data Set BERT

   ID       X
   123     90
   222     95
   333    100

Listing of Data Set OSCAR

   ID       Y
   123    200
   123    250
   222    205
   333    317
   333    400
   333    500
Notice that data set Oscar has multiple observations with the same ID. What happens when you perform a merge? The resulting data set is shown next:

Listing of COMBINE

   Obs    ID       X      Y
     1    123     90    200
     2    123     90    250
     3    222     95    205
     4    333    100    317
     5    333    100    400
     6    333    100    500
Inspection of the merged data set shows that the correct value of X is added to each observation in Oscar, based on the value of ID. This type of merge also works if you reverse the order of the two data sets in the MERGE statement.

A many-to-many merge has the potential to be catastrophic. If you have exactly the same number of observations for each BY value in the two data sets, a many-to-many merge works as expected. If you have more than one observation with a given BY value in each of the two data sets and the number of observations differs in the two data sets, you should not attempt to use a MERGE statement to combine the data sets. To see what happens in this situation, look at the two data sets here:

Listing of ONE

   Obs    ID       X
     1    123     90
     2    123     80
     3    222     95
     4    333    100
     5    333    150
     6    333    200

Listing of TWO

   Obs    ID      Y
     1    123     3
     2    123     4
     3    123     5
     4    222     6
     5    333     7
     6    333     8
An attempt to merge these two data sets, with ID as the BY variable, results in the following:

Listing of MANY_TO_MANY

   Obs    ID       X     Y
     1    123     90     3
     2    123     80     4
     3    123     80     5
     4    222     95     6
     5    333    100     7
     6    333    150     8
     7    333    200     8
Notice that the first two observations in the merged data set seem OK. Because there are only two observations with ID equal to 123 in data set One, SAS uses the last value for X (80) for all remaining observations in data set Two with ID equal to 123. This result is unlikely to be of use. Notice also that there are no error or warning messages to indicate that you are producing this strange result.
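If what you actually need is every pairing of observations that share an ID, PROC SQL (covered in Chapter 26) is one alternative to the MERGE statement. The following is only a sketch; the table name Many_To_Many_SQL is made up for this illustration.

```sas
proc sql;
   /* Join every row of One to every row of Two with the same ID */
   create table many_to_many_sql as
   select one.ID, one.X, two.Y
   from one, two
   where one.ID = two.ID;
quit;
```

With the two data sets listed above, this join should produce 13 rows: 2 x 3 = 6 for ID 123, 1 for ID 222, and 3 x 2 = 6 for ID 333. Whether that is the result you want depends on what the data represent.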
10.15 Updating a Master File from a Transaction File

If you have two data sets that have some common variables and you perform a data set merge, values in the second data set replace values in the first data set, even if the values in the second data set are missing values. If you use an UPDATE statement instead, missing values in the second data set do not replace values in the first data set. This makes the UPDATE statement perfect for updating values in a master data set from new values in a transaction data set.

Here is an example. You have a data set called Prices containing item codes, descriptions, and prices. Here is a listing of this data set:

Listing of Data Set PRICES

   ItemCode   Description          Price
   150        50 foot hose         19.95
   175        75 foot hose         29.95
   200        greeting card         1.99
   204        25 lb. grass seed    18.88
   208        40 lb. fertilizer    17.98
You also have a data set called New15Dec2005 with item codes and some new prices as follows:

Listing of Data Set NEW15DEC2005

   ItemCode   Price
   204        17.87
   175        25.11
   208            .
You want to use this data set to update the prices in data set Prices. You can use the UPDATE statement like this:
Program 10-16 Updating a master file from a transaction file

   proc sort data=prices;
      by ItemCode;
   run;

   proc sort data=new15dec2005;
      by ItemCode;
   run;

   data prices_15dec2005;
      update prices new15dec2005;
      by ItemCode;
   run;
Here is the result:

Listing of Data Set PRICES_15DEC2005

   ItemCode   Description          Price
   150        50 foot hose         19.95
   175        75 foot hose         25.11
   200        greeting card         1.99
   204        25 lb. grass seed    17.87
   208        40 lb. fertilizer    17.98
Only nonmissing values of Price in the transaction data set replaced values in the master file, as shown in this listing. You may want to take a look at Chapter 26, “PROC SQL,” to see other ways to combine data from multiple data sets.
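To see why UPDATE rather than MERGE is the right tool here, you could run the same two sorted data sets through a MERGE statement, as in the sketch below. The output data set name Prices_Merge is made up for this illustration.

```sas
data prices_merge;
   /* Same inputs as Program 10-16, but MERGE instead of UPDATE */
   merge prices new15dec2005;
   by ItemCode;
run;
```

Because a merge lets values from the second data set overwrite those from the first even when they are missing, item 208 should end up with a missing Price in Prices_Merge instead of keeping 17.98.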
10.16 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Using the SAS data set Blood, create two temporary SAS data sets called Subset_A and Subset_B. Include in both of these data sets a variable called Combined equal to .001 times WBC plus RBC. Subset_A should consist of observations from Blood where Gender is equal to Female and BloodType is equal to AB. Subset_B should consist of all observations from Blood where Gender is equal to Female, BloodType is equal to AB, and Combined is greater than or equal to 14.

2. Using the SAS data set Hosp, create a temporary SAS data set called Monday2002, consisting of observations from Hosp where the admission date (AdmitDate) falls on a Monday and the year is 2002. Include in this new data set a variable called Age, computed as the person’s age as of the admission date, rounded to the nearest year.

3. Using the SAS data set Blood, create two temporary SAS data sets by selecting all subjects with cholesterol levels (Chol) below 100. Place the male subjects in Lowmale and the female subjects in Lowfemale. Do this using a single DATA step. Note: Values for Gender are Male and Female. Careful, some of the cholesterol values are missing. Print the resulting data sets.

4. Using the SAS data set Bicycles, create two temporary SAS data sets as follows: Mountain_USA consists of all observations from Bicycles where Country is USA and Model is Mountain. Road_France consists of all observations from Bicycles where Country is France and Model is Road Bike. Print these two data sets.

5. Print out the observations in the two data sets Inventory and NewProducts. Next, create a new temporary SAS data set (Updated) containing all the observations in Inventory followed by all the observations in NewProducts. Sort the resulting data set and print out the observations.

6. Repeat Problem 5, except this time sort Inventory and NewProducts first (create two temporary SAS data sets for the sorted observations). Next, create a new, temporary SAS data set (Updated) by interleaving the two temporary, sorted SAS data sets. Print out the result.
7. Using the Gym data set, create a new, temporary SAS data set (Percent) that contains all the variables found in Gym plus a new variable (call it CostPercent) that represents the Cost as a percentage of the average cost for all subjects. Use PROC MEANS to create a SAS data set containing the mean cost (see Program 10-7 to see how to do this). Round this value to the nearest percent.

8. Run the program here to create a SAS data set called Markup:

   data markup;
      input manuf : $10. Markup;
   datalines;
   Cannondale 1.05
   Trek 1.07
   ;

Combine this data set with the Bicycles data set so that each observation in the Bicycles data set now has a markup value of 1.05 or 1.07, depending on whether the bicycle is made by Cannondale or Trek. In this new data set (call it Markup_Prices), create a new variable (NewTotal) computed as TotalCost times Markup.

9. Merge the Purchase and Inventory data sets to create a new, temporary SAS data set (Pur_Price) where the Price value found in the Inventory data set is added to each observation in the Purchase data set, based on the Model number (Model). There are some models in the Inventory data set that were not purchased (and, therefore, are not in the Purchase data set). Do not include these models in your new data set. Based on the variable Quantity in the Purchase data set, compute a total cost (TotalCost) equal to Quantity times Price in this data set as well.

10. Using the Purchase and Inventory data sets, provide a list of all Models (and the Price) that were not purchased.

11. Merge the Purchase and Inventory data sets without sorting either one and omitting the BY Model statement. Check the SAS log and list the observations in Purchase, Inventory, and the merged data set. Now, run this same program with the system option MERGENOBY set to WARN. Check the SAS log. Finally, set MERGENOBY to ERROR and run this program a third time. Check the SAS log. Note: You might want to give the merged data set a different name for each of the three runs so that it is clear which programs run and which programs do not run.

12. You want to merge two SAS data sets, Demographic and Survey1, based on an identifier. In Demographic, this identifier is called ID; in Survey1, the identifier is called Subj. Both are character variables.
13. You want to merge two SAS data sets, Demographic and Survey2, based on an identifier. In Demographic, this identifier is called ID and it is character; in Survey2, the identifier is also called ID, but it is numeric. Hint: If you choose to convert the numeric identifier to a character variable, use a Z3. format so that the leading 0s are present in the character value.

14. Data set Inventory contains two variables: Model (an 8-byte character variable) and Price (a numeric value). The price of Model M567 has changed to 25.95 and the price of Model X999 has changed to 35.99. Create a temporary SAS data set (call it NewPrices) by updating the prices in the Inventory data set.
Chapter 11

Working with Numeric Functions

11.1 Introduction
11.2 Functions That Round and Truncate Numeric Values
11.3 Functions That Work with Missing Values
11.4 Setting Character and Numeric Values to Missing
11.5 Descriptive Statistics Functions
11.6 Computing Sums within an Observation
11.7 Mathematical Functions
11.8 Computing Some Useful Constants
11.9 Generating Random Numbers
11.10 Special Functions
11.11 Functions That Return Values from Previous Observations
11.12 Problems
11.1 Introduction

SAS functions are an essential tool in DATA step programming. They perform such tasks as rounding numbers, computing dates from month-day-year values, summing and averaging the values of SAS variables, and hundreds of other tasks. This chapter focuses on functions that primarily operate on numeric values; the next chapter describes some of the functions that work with character data. Date functions are discussed separately in Chapter 9.

There is so much to say about SAS functions that one could write a whole book on the subject. These two chapters touch only the surface of the enormous power of SAS functions. For more information, see SAS Functions by Example¹ and SAS OnlineDoc².
11.2 Functions That Round and Truncate Numeric Values

Two of the most useful functions in this category are the ROUND and INT functions. As the names suggest, ROUND is used to round numbers, either to the nearest integer or to other values such as 10ths or 100ths. The INT function returns the integer portion of a numeric value. (That is different from Ents, which are talking trees in Hobbit land.)

As an example, suppose you have raw data values on age (in years) and weight (in pounds). You want a data set with age as of the last birthday (that is, throw away any fractional part of a year) and weight in pounds, rounded to the nearest pound. In addition, you also want to compute weight in kilograms, rounded to the nearest 10th. Here is the program:
¹ See Ron Cody, SAS Functions by Example (Cary, NC: SAS Institute Inc., 2004), for a detailed discussion of SAS functions.
² See SAS OnlineDoc at http://support.sas.com/documentation/onlinedoc/index.html for more information on SAS functions.
Program 11-1 Demonstrating the ROUND and INT truncation functions

   data truncate;
      input Age Weight;
      Age = int(Age);
      *Convert pounds to kilograms: 1 kg is about 2.2 lb;
      WtKg = round(Weight/2.2,.1);
      Weight = round(Weight);
   datalines;
   18.8 100.7
   25.12 122.4
   64.99 188
   ;

The ROUND function is used twice in this program. When it is used to round the WtKg values, it has two arguments, separated by a comma. The first argument is the value to be rounded; the second argument is the round-off unit. Typical values for round-off units are .1, .01, .001, and so forth. However, you can use other values here as well. For example, a round-off unit of 2 rounds values to the nearest even number, and a round-off unit of 100 rounds a value to the nearest 100. When ROUND is used with the Weight variable, there is no second argument. Therefore, the default action (rounding to the nearest whole number) occurs. Here is a listing of the resulting data set:

Listing of TRUNCATE

   Age    Weight    WtKg
    18       101    45.8
    25       122    55.6
    64       188    85.5
Notice that the fractional part of all the age values has been dropped and that the WtKg values are rounded to the nearest 10th and the Weight values to the nearest whole number. In this program, it was important to compute the value of WtKg before Weight was rounded.
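To make the round-off unit concrete, here is a minimal sketch (not one of the book's numbered programs) that applies ROUND and INT to a single value. The variable names and the value 1234.567 are made up purely for illustration.

```sas
data _null_;
   x = 1234.567;                /* made-up value for illustration      */
   Whole     = round(x);        /* no second argument: 1235            */
   Tenth     = round(x, .1);    /* nearest 10th: 1234.6                */
   Even      = round(x, 2);     /* nearest multiple of 2: 1234         */
   Hundred   = round(x, 100);   /* nearest 100: 1200                   */
   Truncated = int(x);          /* integer portion only: 1234          */
   put Whole= Tenth= Even= Hundred= Truncated=;
run;
```

Because this is a DATA _NULL_ step, no data set is created; the PUT statement simply writes the five results to the SAS log.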
11.3 Functions That Work with Missing Values
In SAS, you can refer to a missing numeric value with a period and a missing character value with a single blank (inside single or double quotes). For example, take a look at the short program here that tests to see if several variables contain a missing value:
Program 11-2 Testing for missing numeric and character values (without the MISSING function)
data test_miss;
   set learn.blood;
   if Gender = ' ' then MissGender + 1;
   if WBC = . then MissWBC + 1;
   if RBC = . then MissRBC + 1;
   if Chol lt 200 and Chol ne . then Level = 'Low ';
   else if Chol ge 200 then Level = 'High';
run;
Instead of using separate values for missing numeric and character values, you can use the MISSING function. This function returns a value of true if its argument is a missing value and false otherwise. The amazing thing about this function is that the argument can be either character or numeric, a real rarity among SAS functions. You can rewrite Program 11-2 using the MISSING function like this:
Program 11-3 Demonstrating the MISSING function
data test_miss;
   set learn.blood;
   if missing(Gender) then MissGender + 1;
   if missing(WBC) then MissWBC + 1;
   if missing(RBC) then MissRBC + 1;
   if Chol lt 200 and not missing(Chol) then Level = 'Low ';
   else if Chol ge 200 then Level = 'High';
run;
Besides making programs easier to read, the MISSING function also returns a value of true for the alternate numeric missing values (.A, .B, ..., .Z, and ._). These alternative numeric missing values can be useful when you have different categories of missing information. For example, you could assign .A to "Did not answer" and .B to "Not applicable." SAS treats all of these values as missing for numeric calculations but allows you to distinguish among them³.
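As a rough illustration of how alternative missing values might be used, the sketch below reads a small made-up survey. The data set name Survey, the variable names, and the data lines are all assumptions. It also relies on the MISSING statement, which is not discussed in this chapter, to let the letters A and B in the raw data be read as .A and .B.

```sas
data survey;
   missing A B;                    /* read the letters A and B as .A and .B      */
   input ID Answer;
   length Reason $ 14;
   if Answer = .A then Reason = 'Did not answer';
   else if Answer = .B then Reason = 'Not applicable';
   AnyMissing = missing(Answer);   /* 1 for ., .A, and .B; 0 for a real value    */
datalines;
1 5
2 A
3 B
4 .
;
```

The MISSING function flags observations 2, 3, and 4 alike, while the comparisons with .A and .B still let you tell the reasons apart.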
11.4 Setting Character and Numeric Values to Missing
The CALL routine CALL MISSING is a handy way to set one or more character and/or numeric variables to missing. When you use a variable list such as X1-X10 or Char1-Char5, you need to precede the list with the keyword OF. Here are some examples:

call missing(X,Y,Z,of A1-A10);
X, Y, and Z are numeric and A1-A10 are character. X, Y, and Z are set to a numeric missing value; A1-A10 are set to a character missing value.

call missing(of X1-X10);
X1-X10 are numeric and are set to a numeric missing value.

call missing(of _all_);
All variables defined up to the point of the CALL statement are set to missing.

call missing(of X1-X5,of Y1-Y5);
X1-X5 and Y1-Y5 are all numeric and are set to a numeric missing value.
For those readers interested in the details, the following is a brief discussion on the difference between SAS functions and CALL routines. A SAS function can return only a single value and, typically, the arguments do not change their value when the function is executed. A CALL routine, unlike a function, cannot be used in an assignment statement. Also, the arguments used in a CALL routine can change their value. Therefore, if you want to retrieve more than a single value, you need to use a CALL routine instead of a function⁴.
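The difference can be seen in a short sketch such as the one below, which uses only the SUM function and CALL MISSING. The variable names and values are made up for illustration.

```sas
data _null_;
   x = 3;
   y = 4;
   Total = sum(x, y);     /* function: returns one value; x and y keep their values   */
   put 'After SUM:          ' Total= x= y=;
   call missing(x, y);    /* CALL routine: returns nothing; it changes x and y itself  */
   put 'After CALL MISSING: ' Total= x= y=;
run;
```

The first PUT statement shows Total=7 with x and y unchanged; the second shows that x and y are now missing while Total is still 7.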
³ See SAS OnlineDoc at http://support.sas.com/documentation/onlinedoc/index.html for more information on alternative numeric missing values.
⁴ See Ron Cody, SAS Functions by Example (Cary, NC: SAS Institute Inc., 2004), for more information on CALL routines.
11.5 Descriptive Statistics Functions
One group of SAS functions is called descriptive statistics functions. While these functions are capable of computing statistical results such as means and standard deviations, many functions in this category perform extremely useful nonstatistical tasks.

Suppose you want to score a psychological test and the scoring instructions state that you should take the average (mean) of the first 10 questions (labeled Q1-Q10). This calculation should be performed only if there are seven or more non-missing values. In addition, you want to find the highest and lowest scores among the 10 questions. Here is a program that performs all of these tasks:
Program 11-4 Demonstrating the N, MEAN, MIN, and MAX functions
data psych;
   input ID $ Q1-Q10;
   if n(of Q1-Q10) ge 7 then Score = mean(of Q1-Q10);
   MaxScore = max(of Q1-Q10);
   MinScore = min(of Q1-Q10);
datalines;
001 4 1 3 9 1 2 3 5 . 3
002 3 5 4 2 . . . 2 4 .
003 9 8 7 6 5 4 3 2 1 5
;
What seemed like a difficult problem was solved in just three lines of SAS code, using some of the descriptive statistics functions. The N function returns the number of non-missing numeric values among its arguments. As with all of the functions in this category, you must precede any list of variables in the form Var1-Varn with the keyword OF. Without it, SAS assumes that you want to subtract the two values.

In this program, if there are seven or more non-missing values, the MEAN function computes the mean of the non-missing values. Otherwise, Score is a missing value. The MEAN function, as well as the other functions discussed in this section, ignores missing values. For example, subject 001 has nine non-missing values, so the mean is computed by adding up the nine values and dividing by 9.
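The effect of leaving out the keyword OF can be checked with a small sketch like this one. The three values are made up for illustration and are not taken from the Psych data.

```sas
data _null_;
   Q1 = 5; Q2 = .; Q3 = 3;       /* made-up responses; Q2 is missing              */
   CountList = n(of Q1-Q3);      /* variable list Q1, Q2, Q3: 2 non-missing       */
   CountDiff = n(Q1-Q3);         /* no OF: a single argument, 5 - 3, so N is 1    */
   Average   = mean(of Q1-Q3);   /* mean of 5 and 3, ignoring the missing Q2: 4   */
   put CountList= CountDiff= Average=;
run;
```

Neither form produces an error, which is why a forgotten OF can go unnoticed until the results look wrong.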
The MAX and MIN functions return the largest and smallest (non-missing) values of their arguments. Here is a listing of data set Psych:

Listing of PSYCH
ID   Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10    Score  MaxScore  MinScore
001   4   1   3   9   1   2   3   5   .    3  3.44444         9         1
002   3   5   4   2   .   .   .   2   4    .        .         5         2
003   9   8   7   6   5   4   3   2   1    5  5.00000         9         1
Another function similar to the N function is NMISS. This function returns the number of missing values in the list of variables. What if you want the second or third largest value in a group of values? The LARGEST function enables you to extract the nth largest value, given a list of variables. For example, to find the sum of the three largest scores in the Psych data set, you could use the SUM and LARGEST functions together like this:
Program 11-5 Finding the sum of the three largest values in a list of variables
data three_large;
   set psych(keep=ID Q1-Q10);
   SumThree = sum(largest(1,of Q1-Q10),
                  largest(2,of Q1-Q10),
                  largest(3,of Q1-Q10));
run;
The first argument of the LARGEST function tells SAS which value you want—1 gives you the largest value, 2 gives you the second largest, and so forth. This function ignores missing values, the same as the other descriptive statistics functions. Note: LARGEST(1, variable-list) is the same as MAX(variable-list).
Here is a listing of Three_Large:

Listing of THREE_LARGE
Obs  ID   Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  SumThree
  1  001   4   1   3   9   1   2   3   5   .    3        18
  2  002   3   5   4   2   .   .   .   2   4    .        13
  3  003   9   8   7   6   5   4   3   2   1    5        24
Are you surprised to know that SAS also has a SMALLEST function? SMALLEST(1, variable-list) gives you the smallest value in a list of variables (equal to the result of the MIN function); SMALLEST(2, variable-list) gives you the second smallest value, and so on.
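By analogy with Program 11-5, a sketch that adds the two smallest scores might look like the following. The data set name Two_Small and the variable SumTwoSmall are hypothetical, and the step assumes the Psych data set created by Program 11-4.

```sas
data two_small;
   set psych(keep=ID Q1-Q10);
   /* SMALLEST ignores missing values, just as LARGEST does */
   SumTwoSmall = sum(smallest(1,of Q1-Q10),
                     smallest(2,of Q1-Q10));
run;
```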
11.6 Computing Sums within an Observation
One way to compute the sum of several variables is to write a statement such as this:

SumCost = Cost1 + Cost2 + Cost3;

What if one of the Cost values is missing? This causes the sum to be missing. If you want to ignore missing values, you can either write some DATA step logic, or simply use the SUM function, like this:

SumCost = sum(of Cost1-Cost3);

If one or two of the Cost values are missing, SumCost is the sum of the non-missing values. If all three Cost values are missing, the SUM function returns a missing value. If you want the sum to be 0 if all the arguments are missing, a nice trick is to include a 0 in the list of arguments, like this:

SumCost = sum(0, of Cost1-Cost3);
You can also see how useful the SUM function is when you have a large number of variables to sum. Here is an example. You have a SAS data set (EndOfYear) containing Pay1–Pay12 and Extra1–Extra12. You want to compute the sum of these 24 values. In addition, you want the sum to be 0 if all 24 values are missing. Here is the program:
Program 11-6 Using the SUM function to compute totals
data sum;
   set learn.EndOfYear;
   Total = sum(0, of Pay1-Pay12, of Extra1-Extra12);
run;
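The three ways of adding values described in this section can be compared side by side in a small sketch. Here all three Cost values are deliberately set to missing; the result variable names are made up for illustration.

```sas
data _null_;
   Cost1 = .; Cost2 = .; Cost3 = .;     /* every value missing                    */
   Plus = Cost1 + Cost2 + Cost3;        /* plus signs: result is missing          */
   Sum1 = sum(of Cost1-Cost3);          /* SUM with all arguments missing: missing */
   Sum0 = sum(0, of Cost1-Cost3);       /* the extra 0 makes the result 0         */
   put Plus= Sum1= Sum0=;
run;
```

If you change Cost2 to a non-missing value and rerun the step, Plus is still missing, but both SUM results equal that value.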
11.7 Mathematical Functions
We'll start out with four straightforward functions: ABS (absolute value), SQRT (square root), EXP (exponentiation), and LOG (natural log). The following program demonstrates these four functions; a short explanation follows:
Program 11-7 Demonstrating the ABS, SQRT, EXP, and LOG functions
data math;
   input x @@;
   Absolute = abs(x);
   Square = sqrt(x);
   Exponent = exp(x);
   Natural = log(x);
datalines;
2 -2 10 100
;
To take an absolute value of a number, you throw away the minus sign (if the number is negative). Although you can raise a number to the .5 power to compute a square root, using the SQRT function may make your program easier to read (and to write). The EXP function raises e (the base of natural logarithms) to the value of its argument. The LOG function takes the natural logarithm of its argument. (If you need a base 10 log, use the LOG10 function.)

The @@ signs at the end of the INPUT statement are called a double trailing @. Notice that you are reading four values of x and creating four observations. Without the @@ at the end of the INPUT statement, SAS would go to a new line each time it reached the bottom of the DATA step. The @@ at the end of the line is an instruction to “hold the line” and keep reading values on the same line until there are no more values to read. You can read more about the double trailing @ in Chapter 21.
Here is a listing of the Math data set from Program 11-7:

Listing of MATH
  x   Absolute    Square    Exponent   Natural
  2          2    1.4142        7.39   0.69315
 -2          2         .        0.14         .
 10         10    3.1623    22026.47   2.30259
100        100   10.0000   2.6881E43   4.60517
11.8 Computing Some Useful Constants
The CONSTANT function returns values of commonly used mathematical constants such as pi and e. For a SAS programmer, perhaps the most useful feature of this function is its ability to compute the largest integer that can be stored exactly in a numeric variable of a given length (8 bytes or fewer). The next program demonstrates some of the more common uses of this function:
Program 11-8 Computing some useful constants with the CONSTANT function
data constance;
   Pi = constant('pi');
   e = constant('e');
   Integer3 = constant('exactint',3);
   Integer4 = constant('exactint',4);
   Integer5 = constant('exactint',5);
   Integer6 = constant('exactint',6);
   Integer7 = constant('exactint',7);
   Integer8 = constant('exactint',8);
run;
Pi and e are computed as shown here. To compute the largest integer that can be stored exactly in n bytes, you provide a second argument indicating the number of bytes. Output from this program is shown as follows:

Demonstrating the CONSTANT Function
     Pi         e   Integer3   Integer4    Integer5
3.14159   2.71828       8192    2097152   536870912

    Integer6    Integer7    Integer8
137438953472   3.5184E13   9.0072E15
To be sure the exact integer feature of this function is clear, if you use a LENGTH statement to set the length of a numeric variable to 3, the largest integer you can represent without losing accuracy is 8,192; with a length of 4, you can represent integers up to 2,097,152. (Remember that these values may vary, depending on your operating system.)
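One way to see the limit for yourself is a sketch along these lines. The data set and variable names are made up, and the value 8193 assumes a system where the 3-byte limit is 8,192, as in the output above; on other operating systems the limit (and even the minimum allowed length) may differ.

```sas
data short;
   length Small 3;     /* store Small in 3 bytes instead of the default 8        */
   Small = 8193;       /* one more than the 3-byte limit reported in the text    */
run;

data _null_;
   set short;
   put Small=;         /* compare the value read back from the data set with 8193 */
run;
```

The value is held in full precision while the first DATA step runs; any loss of accuracy occurs when the observation is written to the data set, which is why the check is done in a second step.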
11.9 Generating Random Numbers
We will discuss only one of the random number functions here: RANUNI. This function generates random numbers between 0 and 1. You may wonder why you would ever need to generate random numbers. Some possible uses are to create data sets for benchmarking, to select random samples, and to assign subjects randomly to two or more groups.

Computers do not really generate true random numbers. However, they are capable of generating series of numbers that are very close to random (called pseudo-random numbers by snobs). In order to get the computer started, you need to provide a seed number to be used to generate the first number in the random sequence. If you use 0 (or any negative number) as a seed, SAS uses the computer’s clock to supply the seed. If you choose any positive integer, that number is used as a seed. SAS recommends that you choose a seed of at least 7 digits. If you supply the seed, the program generates the same sequence of random numbers every time you run the program. With a 0 seed, the sequence is different every time you run the program.
Suppose you want to select approximately 10% of the observations from the Blood data set. You can use the RANUNI function to accomplish this, as follows:
Program 11-9 Using the RANUNI function to randomly select observations
data subset;
   set learn.blood;
   if ranuni(1347564) le .1;
run;
Because RANUNI returns random numbers between 0 and 1, approximately 10% of these numbers will be less than .1. Here is part of the SAS log after this program was run:

NOTE: There were 1000 observations read from the data set LEARN.BLOOD.
NOTE: The data set WORK.SUBSET has 104 observations and 7 variables.
Notice that the Subset data set is not exactly 10% of the Blood data set. If you need an exact 10% sample, there are a variety of methods that you can use. For more information, see SAS Functions by Example⁵ and SAS for Monte Carlo Studies: A Guide for Quantitative Researchers⁶.

Before leaving this section, it seems appropriate to mention that an easy way to obtain an exact random subset of a SAS data set is to use PROC SURVEYSELECT, which is available in SAS/STAT software. This procedure is quite flexible and offers many options for generating these subsets. As an example, to obtain a simple random sample of size 100 from the Blood data set, you could use the following program:
Program 11-10 Using PROC SURVEYSELECT to obtain a random sample
proc surveyselect data=learn.blood
                  out=subset
                  method=srs
                  sampsize=100;
run;
The procedure options DATA= and OUT= are pretty clear. METHOD= allows you to choose a method for selecting your sample (SRS is a simple random sample, a sample taken without replacement). SAMPSIZE= allows you to choose the size of the sample. One additional option, which is not used here, is SEED=. If you supply a value for this option, this value is used as the seed. For more details on this procedure, see SAS OnlineDoc.

⁵ See Ron Cody, SAS Functions by Example (Cary, NC: SAS Institute Inc., 2004).
⁶ See Xitao Fan et al., SAS for Monte Carlo Studies: A Guide for Quantitative Researchers (Cary, NC: SAS Institute Inc., 2002).
To generate random integers in the range from 1 to 10, you could use:

RandomInteger = int(ranuni(0)*10 + 1);

You might think this expression could result in values of 11. Because the values of RANUNI are strictly between 0 and 1 (it never returns exactly 0 or 1), RANUNI(0)*10 + 1 never reaches 11. Therefore the integer portion of this expression has a maximum value of 10. If you use the ROUND function to generate random integers, you may wind up with a series of integers where the probability for each value is not the same. For example, ROUND(RANUNI(0)*10) produces the end values 0 and 10 only about half as often as the values 1 through 9.
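The same idea extends to any range of integers. The following is a minimal sketch, assuming made-up limits Low and High and an arbitrary seed; the data set name Random_Range is hypothetical.

```sas
data random_range;
   Low  = 5;                 /* made-up lower limit                             */
   High = 15;                /* made-up upper limit                             */
   do i = 1 to 20;
      /* INT truncates a value in the interval from 0 up to (but not including) */
      /* High - Low + 1, so each integer from Low to High is equally likely     */
      RandomInteger = Low + int(ranuni(1234567)*(High - Low + 1));
      output;
   end;
   keep RandomInteger;
run;
```

Because a positive seed is supplied, the step produces the same 20 integers each time it is run; replacing the seed with 0 gives a different series on each run.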
11.10 Special Functions
The name special functions (a SAS category) makes these functions seem esoteric. That is far from the case. The INPUT and PUT functions, in particular, are extremely useful functions that you will use all the time.

The INPUT function enables you to “read” a character variable using a SAS or user-defined informat and assign the resulting value to a SAS variable. One of the most common uses of this function is to perform a character-to-numeric conversion. Another is to convert a date as a character string to a SAS date value.

Here is an example. You are given a SAS data set (Chars) that contains the variables Height, Weight, and Date. All three variables are character variables. You want to create a new data set called Nums with the same variables, except that they are numeric variables. A listing of data set Chars is shown here:

Listing of CHARS
Height   Weight   Date
58       155      10/21/1950
63       200      5/6/2005
45       79       11/12/2004
Although it may not be obvious from the listing, all three variables are character variables.
Program 11-11 shows you how to perform the conversion:
Program 11-11 Using the INPUT function to perform a character-to-numeric conversion
data nums;
   set learn.chars (rename=
      (Height = Char_Height
       Weight = Char_Weight
       Date = Char_Date));
   Height = input(Char_Height,8.);
   Weight = input(Char_Weight,8.);
   Date = input(Char_Date,mmddyy10.);
   drop Char_Height Char_Weight Char_Date;
run;
The technique used in this program is called swap and drop. Because the same variable cannot be both character and numeric, this technique allows you to wind up with numeric variables with the same names as the original character variables. You use the RENAME= data set option in the SET statement to rename all three variables. Next, you use the INPUT function to do the character-to-numeric conversions.

A good way to understand the INPUT function is to think about what an INPUT statement does. It reads character data from a raw data file using an informat to determine how the value should be read and assigns the result to a SAS variable. Think of the INPUT function as “reading” a value of a character variable according to whatever informat you supply as the second argument of the function.

The INPUT function, when used for the Height and Weight variables, uses an 8. informat. It doesn’t matter if this width is larger than you need. Because the dates are in the month-day-year form, you use the MMDDYY10. informat to do the conversion. Because you don’t want the original character values, you drop them. That’s why this technique is called swap and drop. Before we leave this function, we should point out that you can shorten the DROP statement as follows:

drop Char_:;
The colon notation is a SAS wildcard. This statement says to drop all variables that begin with Char_. The other special function we discuss here is the PUT function. Just as a PUT statement can send the formatted value of a variable to an external file, a PUT function takes a value (the first argument), formats this value using the format supplied (the second argument), and “writes” the result to a variable. The result of a PUT function is always a character value. One common use of a PUT function is to perform a numeric-to-character conversion.
The following program demonstrates some possible uses of the PUT function:
Program 11-12 Demonstrating the PUT function
proc format;
   value agefmt low-<20 = 'Group One'
                20-<40  = 'Group Two'
                40-high = 'Group Three';
run;

data convert;
   set learn.numeric;
   Char_Date = put(Date,date9.);
   AgeGroup = put(Age,agefmt.);
   Char_Cost = put(Cost,dollar10.);
   drop Age Cost;
run;
Data set Numeric contains three numeric variables: Date, Age, and Cost. In data set Convert, the three variables Char_Date, AgeGroup, and Char_Cost are all character variables. The second argument of the PUT function specifies the format to apply to the first argument. Therefore, Char_Date is a date in the DATE9. format, and Char_Cost is a value written using the DOLLAR10. format. Finally, AgeGroup applies a user-written format to place the age values into one of three groups. Here is a listing of the resulting data set. Note: Date was left unformatted so that it is clear that it is a true SAS date value.

Listing of CONVERT
  Date   Char_Date   AgeGroup      Char_Cost
 14898   15OCT2000   Group Two     $12,345
-13199   12NOV1923   Group Three   $39,393
11.11 Functions That Return Values from Previous Observations
Because SAS processes data from raw data files and SAS data sets line by line (or observation by observation), it is difficult to compare a value in the present observation with one from a previous observation. Two functions, LAG and DIF, are useful in this regard. Let’s start out with a short program that demonstrates how the LAG function works:
Program 11-13 Demonstrating the LAG and LAGn functions
data look_back;
   input Time Temperature;
   Prev_temp = lag(Temperature);
   Two_back = lag2(Temperature);
datalines;
1 60
2 62
3 65
4 70
;
A listing of data set Look_Back follows:

Listing of LOOK_BACK
Obs   Time   Temperature   Prev_temp   Two_back
  1      1            60           .          .
  2      2            62          60          .
  3      3            65          62         60
  4      4            70          65         62
As you can see from this listing, the LAG function returns the temperature from the previous time and the LAG2 function returns the temperature from the time before that. (There is a whole family of LAG functions: LAG, LAG2, LAG3, and so on.)

This program might give you the idea that the LAG function returns the value of its argument from the previous observation. This is not always true. The correct definition of the LAG function is that it returns the value of its argument the last time the LAG function executed. To help clarify this somewhat clunky sounding definition, see if you can predict the values of x and Last_x in the program that follows:
Program 11-14 Demonstrating what happens when you execute a LAG function conditionally
data laggard;
   input x @@;
   if x ge 5 then Last_x = lag(x);
datalines;
9 8 7 1 2 12
;
Here is a listing of data set Laggard:

Listing of LAGGARD
Obs    x   Last_x
  1    9        .
  2    8        9
  3    7        8
  4    1        .
  5    2        .
  6   12        7
OK, are you surprised? The value of Last_x in the first three observations is clear. But, what happened in Observation 6? To understand this, you need to read the definition carefully. The IF statement is not true in Observations 4 and 5; therefore, Last_x, which is set to a missing value at each iteration of the DATA step, remains missing. In Observation 6, the IF statement is true and the LAG function returns the value of x the last time this function executed, which was back at Observation 3, where x was equal to 7. The take-home message is this: “Be careful if you execute a LAG function conditionally.” In most cases, you want to execute the LAG function for each iteration of the DATA step. When you do, this function returns the value of its argument from the previous observation.
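If what you actually want is the value from the previous observation, one common remedy (sketched here, not taken from the book's programs) is to execute LAG on every iteration and apply the condition afterward. The data set name Laggard_Fixed and the helper variable Prev_x are hypothetical.

```sas
data laggard_fixed;
   input x @@;
   Prev_x = lag(x);                   /* executes on every iteration of the DATA step */
   if x ge 5 then Last_x = Prev_x;    /* the condition now affects only the assignment */
   drop Prev_x;
datalines;
9 8 7 1 2 12
;
```

With this version, Last_x in Observation 6 is 2, the value of x in Observation 5, rather than 7.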
A common use of the LAG function is to compute differences between observations. For example, you can modify Program 11-13 to compute the difference in temperature from one time to the next, as follows:
Program 11-15 Using the LAG function to compute interobservation differences
data diff;
   input Time Temperature;
   Diff_temp = Temperature - lag(Temperature);
datalines;
1 60
2 62
3 65
4 70
;
Here is a listing of Diff:

Listing of DIFF
Obs   Time   Temperature   Diff_temp
  1      1            60           .
  2      2            62           2
  3      3            65           3
  4      4            70           5
Programmers often use the form:

x - lag(x);

Because this form is so common, a set of DIF functions (DIF, DIF2, DIF3, and so on) is available. DIF(x) is equal to x - LAG(x). You could, therefore, rewrite Program 11-15 like this:
Program 11-16 Demonstrating the DIF function
data diff;
   input Time Temperature;
   Diff_temp = dif(Temperature);
datalines;
1 60
2 62
3 65
4 70
;
For more examples using the LAG and DIF functions, please see the examples in Chapter 24.
11.12 Problems
Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Using the SAS data set Health, compute the body mass index (BMI) defined as the weight in kilograms divided by the height (in meters) squared. Create four other variables based on BMI: 1) BMIRound is the BMI rounded to the nearest integer, 2) BMITenth is the BMI rounded to the nearest tenth, 3) BMIGroup is the BMI rounded to the nearest 5, and 4) BMITrunc is the BMI with a fractional amount truncated. Conversion factors you will need are: 1 Kg equals 2.2 Lbs and 1 inch = .0254 meters.

2. Count the number of missing values for WBC, RBC, and Chol in the Blood data set. Use the MISSING function to detect missing values.

3. Create a new, temporary SAS data set (Miss_Blood) based on the SAS data set Blood. Set Gender, RBC, and Chol to a missing value if WBC is missing. Use the MISSING function and the CALL MISSING routine in this program.

4. The SAS data set Psych contains an ID variable, 10 question responses (Ques1-Ques10), and 5 scores (Score1-Score5). You want to create a new, temporary SAS data set (Evaluate) containing the following:
   a. A variable called QuesAve computed as the mean of Ques1-Ques10. Perform this computation only if there are seven or more non-missing question values.
   b. If there are no missing Score values, compute the minimum score (MinScore), the maximum score (MaxScore), and the second highest score (SecondHighest).

5. The SAS data set Psych contains an ID variable, 10 question responses (Ques1-Ques10), and 5 scores (Score1-Score5). You want to create a new, temporary SAS data set (Evaluate) containing the following:
   a. A value (ScoreAve) consisting of the mean of the three highest Score values. If there are fewer than three non-missing score values, ScoreAve should be missing.
   b. An average of Ques1-Ques10 (call it QuesAve) if there are seven or more non-missing values.
   c. A composite score (Composit) equal to ScoreAve plus 10 times QuesAve.

6. Write a short DATA _NULL_ step to determine the largest integer you can store on your computer in 3, 4, 5, 6, and 7 bytes.

7. Given values of x, y, and z, compute the following (using a DATA _NULL_ step):
   a. AbsZ = absolute value of z
   b. Expx = e raised to the x power
   c. Circumference = 2 times pi times y
   Use values of x, y, and z equal to 10, 20, and -30, respectively. Round the values for b and c to the nearest .001.

8. Create a temporary SAS data set (Random) consisting of 1,000 observations, each with a random integer from 1 to 5. Make sure that all integers in the range are equally likely. Run PROC FREQ to test this assumption.

9. Using the random functions, create a temporary SAS data set (Fake) with 100 observations. Each observation should contain a subject number (Subj) starting from 1, a random gender (with approximately 40% females and 60% males), and a random age (integers from 10 to 50). Compute the frequencies for Gender and list the first 10 observations in the data set.

10. Data set Char_Num contains character variables Age and Weight and numeric variables SS and Zip. Create a new, temporary SAS data set called Convert with new variables NumAge and NumWeight that are numeric values of Age and Weight, respectively, and CharSS and CharZip that are character variables created from SS and Zip. CharSS should contain leading 0s and dashes in the appropriate places for Social Security numbers and CharZip should contain leading 0s.
    Hint: The Z5. format includes leading 0s for the ZIP code.

11. Repeat Problem 10, except this time use the same variable names for the converted variables.
    Hint: Swap and drop.

12. Using the Stocks data set (containing variables Date and Price), compute daily changes in the prices. Use the statements here to create the plot.
    Note: If you do not have SAS/GRAPH installed, use PROC PLOT and omit the GOPTIONS and SYMBOL statements.

    goptions reset=all colors=(black) ftext=swiss htitle=1.5;
    symbol1 v=dot i=smooth;
    title "Plot of Daily Price Differences";
    proc gplot data=difference;
       plot Diff*Date;
    run;
    quit;

13. Plot the daily stock prices in data set Stocks along with a moving average of the prices using a three-day moving average. Use the PLOT statements here to produce the plots.
    Note: If you do not have SAS/GRAPH installed, use PROC PLOT and omit the GOPTIONS and SYMBOL statements.

    goptions reset=all colors=(black) ftext=swiss htitle=1.5;
    symbol1 v=dot line=1 i=smooth;
    symbol2 v=square line=2 i=smooth;
    title "Plot of Price and Moving Average";
    proc gplot data=smooth;
       plot Price*Date
            Average*Date / overlay;
    run;
    quit;
Chapter 12
Working with Character Functions

12.1 Introduction
12.2 Determining the Length of a Character Value
12.3 Changing the Case of Characters
12.4 Removing Characters from Strings
12.5 Joining Two or More Strings Together
12.6 Removing Leading or Trailing Blanks
12.7 Using the COMPRESS Function to Remove Characters from a String
12.8 Searching for Characters
12.9 Searching for Individual Characters
12.10 Searching for Words in a String
12.11 Searching for Character Classes
12.12 Using the NOT Functions for Data Cleaning
12.13 Describing a Real Blockbuster Data Cleaning Function
12.14 Extracting Part of a String
12.15 Dividing Strings into Words
12.16 Comparing Strings
12.17 Performing a Fuzzy Match
12.18 Substituting Characters or Words
12.19 Problems
12.1 Introduction
This chapter covers some of the basic functions that work with character values. These functions enable you to search for character strings, take strings apart and put them back together again, and remove selected characters from a string. With SAS 9, you can search for or remove classes of characters, such as digits, punctuation marks, or space characters. You can even perform fuzzy matches between two character values, which is useful in matching names that may be misspelled. For more information, see SAS Functions by Example¹ and SAS OnlineDoc².
12.2 Determining the Length of a Character Value
The LENGTH function returns the length of a character value, not counting trailing blanks. A new function, LENGTHN, was added with SAS 9. This function is identical to LENGTH except it returns a length of 0 (instead of a length of 1) for a character missing value (also called a null string, hence the N at the end of the function name).

Here is an example. You have a SAS data set (Sales) that includes a variable called Name. You want to see all the names that are longer than 12 characters. Here is the program:
¹ See Ron Cody, SAS Functions by Example (Cary, NC: SAS Institute Inc., 2004), for a detailed discussion of SAS functions.
² See SAS OnlineDoc at http://support.sas.com/documentation/onlinedoc/index.html for more information on SAS functions.
Chapter 12: Working with Character Functions 213
Program 12-1 Determining the length of a character value
data long_names;
   set learn.sales;
   if lengthn(Name) gt 12;
run;
The subsetting IF statement is true whenever the number of characters in Name exceeds 12 (not counting trailing blanks but including any blanks within the string). As long as you are using SAS 9 or higher, we recommend that you use the LENGTHN function instead of the older LENGTH function. Another SAS 9 function, LENGTHC, returns the storage length of a string. You may want to test storage lengths when you are combining data from multiple files.
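The three length functions can be compared in a short sketch like the one below. The variable Name, its storage length of 20, and the value 'Cody' are made up for illustration.

```sas
data _null_;
   length Name $ 20;            /* storage length of 20 bytes                   */
   Name = 'Cody';
   Len  = length(Name);         /* 4: trailing blanks are not counted           */
   LenN = lengthn(Name);        /* 4: same as LENGTH for a non-missing value    */
   LenC = lengthc(Name);        /* 20: the storage length                       */
   put Len= LenN= LenC=;

   Name = ' ';                  /* character missing value                      */
   Len  = length(Name);         /* 1 for a missing value                        */
   LenN = lengthn(Name);        /* 0 for a missing value                        */
   put Len= LenN=;
run;
```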
12.3 Changing the Case of Characters
A common programming problem is matching values where the case of the two values may not be the same. Suppose you are presented with two SAS data sets: one has names in mixed case, and the other has names in uppercase. You want to merge the two data sets. Listings of data sets Mixed and Upper are shown here. Note: Data sets are already sorted by Name.

Listing of MIXED
Name             ID
Daniel Fields    123
Patrice Helms    233
Thomas Chien     998

Listing of UPPER
Name             DOB
DANIEL FIELDS    2194
PATRICE HELMS    10370
THOMAS CHIEN     14926
Here is a program to merge the two data sets:
Program 12-2 Changing values to uppercase
data mixed;
   set learn.mixed;
   Name = upcase(Name);
run;

data both;
   merge mixed learn.upper;
   by Name;
run;
As you might expect, the UPCASE function converts all letters to uppercase. There are two other functions, LOWCASE and PROPCASE, that convert letters to lowercase and proper case, respectively. Proper case capitalizes the first letter of every “word” and converts the remaining letters to lowercase. By word, we mean any consecutive letters separated by a delimiter. Default delimiters are blank, forward slash, hyphen, open parenthesis, period, and tab. You can specify delimiters as an optional second argument to the PROPCASE function if you want. Program 12-3 demonstrates the use of the PROPCASE function.
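As a small sketch of the optional second argument to PROPCASE, the following step converts a made-up name twice: once with the default delimiters and once with a delimiter list that includes the apostrophe. The variable names and the sample name are assumptions chosen for this illustration.

```sas
data _null_;
   length Name1 Name2 $ 20;
   /* Default delimiters: the apostrophe is not one of them, */
   /* so the b in o'brien stays lowercase                    */
   Name1 = propcase("mary o'brien-smith");
   /* Blank, hyphen, and apostrophe supplied as delimiters */
   Name2 = propcase("mary o'brien-smith"," -'");
   put Name1= / Name2=;
run;
```

Note that when you supply your own list, it replaces the default list, so you need to include the blank if you still want blanks to separate words.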
12.4 Removing Characters from Strings
The two functions in this category are COMPBL and COMPRESS. The former converts two or more blanks to a single blank; the latter removes blanks (default action) or characters that you specify from a character value. We demonstrate the COMPBL function (don’t try to pronounce COMPBL; you might hurt yourself, so just say compress blank) in a program to help standardize some addresses. In the listing of the addresses here, notice that several of the lines contain multiple blanks and there is inconsistent use of case as well.

Listing of ADDRESS
Name             Street                   City             State    Zip
ron   coDY       1178 HIGHWAY 480         camp verde       tx       78010
jason Tran       123 lake view drive      East Rockaway    ny       11518
Here is a program that converts all multiple blanks to a single blank; converts the case for Name, Street, and City to proper case; and converts the state abbreviations to uppercase:
Program 12-3 Converting multiple blanks to a single blank and demonstrating the PROPCASE function
data standard;
   set learn.address;
   Name = compbl(propcase(Name));
   Street = compbl(propcase(Street));
   City = compbl(propcase(City));
   State = upcase(State);
run;
Notice in this example how you can nest one function within another. The COMPBL function has as its argument the value returned by the PROPCASE function (in this example, the order doesn’t matter). The following listing shows the result:

Listing of STANDARD
Name          Street                 City             State    Zip
Ron Cody      1178 Highway 480       Camp Verde       TX       78010
Jason Tran    123 Lake View Drive    East Rockaway    NY       11518
12.5 Joining Two or More Strings Together
Putting strings together is called concatenation. For example, if you have separate variables representing a first and last name, you can concatenate them (with a blank in between) to create a variable containing the full name. The concatenation operator, || (or !!), has always been available to SAS programmers. For example, if One = ABC and Two = DEF, then One || Two is equal to ABCDEF. When you use the || operator, if you do not define the length of the resulting string beforehand, the length of this string is the sum of the lengths of the individual strings you are concatenating. Three new and very useful functions, CAT, CATS, and CATX, make this process much easier. The CAT function takes two or more arguments and concatenates them. It is almost the same as using the || operator, except that the length of the resulting string defaults to 200 if you do not define the length first. The CATS function strips off leading and trailing blanks before joining the strings (remember S for strip). The CATX function is similar to the CATS function, except you supply a separator as the first argument. When you use this function to concatenate strings, leading and trailing blanks are removed and the separator value is placed between each of the strings. Here are some examples:
Program 12-4 Demonstrating the concatenation functions
title "Demonstrating the Concatenation Functions";
data _null_;
   length Join Name1-Name4 $ 15;
   First = 'Ron   ';   *Three trailing blanks;
   Last = 'Cody ';
   Join = ':' || First || ':';
   Name1 = First || Last;
   Name2 = cat(First,Last);
   Name3 = cats(First,Last);
   Name4 = catx(' ',First,Last);
   file print;
   put Join= /
       Name1= /
       Name2= /
       Name3= /
       Name4= /;
run;
First the output, and then an explanation:

Demonstrating the Concatenation Functions
Join=:Ron   :
Name1=Ron   Cody
Name2=Ron   Cody
Name3=RonCody
Name4=Ron Cody
The variable Join is created by concatenating a colon to the front and back of the variable First. Notice that there are three blanks between Ron and the final colon. Name1 joins the two variables First and Last, but because there are three trailing blanks in First, there are three blanks between the two names. The CAT function gives you the same result as the || operator. Notice that the variable Name3 has no blanks between the names because the CATS function strips leading and trailing blanks before joining the strings. Finally, Name4 has a single blank between the first and last names because a blank was entered as the separator value.
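One more property of CATX is worth a short sketch: arguments that are completely blank are skipped, along with their separator, so you do not get doubled separators when a piece is missing. The variable names and values here are invented for the illustration.

```sas
data _null_;
   length Full1 Full2 $ 30;
   First  = 'Alfred';
   Middle = ' ';               /* no middle initial for this person */
   Last   = 'Newman';
   /* The blank Middle value is skipped: only one blank between names */
   Full1 = catx(' ',First,Middle,Last);
   /* Any separator value can be used, here a comma */
   Full2 = catx(',',Last,First);
   put Full1= / Full2=;
run;
```

Writing the same logic with the || operator would require testing Middle for a missing value yourself.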
12.6 Removing Leading or Trailing Blanks
Two old functions, TRIM and LEFT, and one newer function, STRIP, enable you to remove trailing blanks, leading blanks, or both from a character value. We will demonstrate these functions with a sample program:
Program 12-5 Demonstrating the TRIM, LEFT, and STRIP functions
data blanks;
   String = '   ABC  ';
   ***There are 3 leading and 2 trailing blanks in String;
   JoinLeft = ':' || left(String) || ':';
   JoinTrim = ':' || trim(String) || ':';
   JoinStrip = ':' || strip(String) || ':';
run;
Here is the output:

Listing of BLANKS
String      JoinLeft       JoinTrim     JoinStrip
   ABC      :ABC     :     :   ABC:     :ABC:
The variable String has three leading and two trailing blanks. A good way to know if a value has leading and/or trailing blanks is to concatenate a character (a colon, for example) to the beginning and end of a character value. Note: It is very hard to see trailing blanks on printed output—only experienced SAS programmers can do this! Between the colons in the three Join variables here, we have JoinLeft - ABC followed by five blanks, JoinTrim - three blanks followed by ABC, and JoinStrip - no leading or trailing blanks.
When you assign the result of any of these functions to a variable, the length of that variable does not change. So, if you create three variables like the following, the storage length of each of the three variables Left, Trim, and Strip is equal to 8 (the length of String):
   Left = left(String);
   Trim = trim(String);
   Strip = strip(String);
Left is ABC followed by five blanks. Trim is three blanks followed by ABC followed by two blanks. Even though the TRIM function returns a value with no trailing blanks, when this value is assigned to a variable with a length of 8, the trailing blanks reappear. Using this same logic, Strip is equal to ABC followed by five blanks.
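Because the blanks come back as soon as a value is stored, TRIM is most useful inside an expression, where its result is used before any padding takes place. Here is a minimal sketch; the variable names and lengths are assumptions for the example.

```sas
data _null_;
   length First Last $ 10 Full $ 21;
   First = 'Ron';     /* stored as Ron followed by seven blanks */
   Last  = 'Cody';
   /* TRIM removes the trailing blanks of First before the  */
   /* concatenation, so only the single blank literal       */
   /* separates the two names                               */
   Full = trim(First) || ' ' || Last;
   put Full=;
run;
```

Without the TRIM function, the seven trailing blanks in First would end up between the two names.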
12.7 Using the COMPRESS Function to Remove Characters from a String
Here’s an interesting (and fairly common) problem: you have a SAS data set (or a raw data file) containing phone numbers. These numbers came from various sources and are in various formats, as shown here:

Listing of PHONE
Phone
(908)232-4856
210.343.4757
(516) 343 - 9293
9342342345
You want to retain only the numerals (digits) in each of these phone numbers. To do this, you want to remove blanks, left and right parentheses, periods, and dashes from each of the values. The COMPRESS function allows you to select which characters you want to remove from a character value. Here is one way to accomplish this task:
Program 12-6 Using the COMPRESS function to remove characters from a string
data phone;
   length PhoneNumber $ 10;
   set learn.phone;
   PhoneNumber = compress(Phone,' ()-.');
   drop Phone;
run;
The first argument of the COMPRESS function is the character value from which you want to remove characters. The second argument is a list of characters you want to remove. If you omit the second argument, COMPRESS removes all blanks from the string. A listing of data set Phone is shown next:

Listing of PHONE
PhoneNumber
9082324856
2103434757
5163439293
9342342345

A third argument was added to the COMPRESS function in SAS 9. This third argument, called a modifier, allows you to add character classes such as digits or punctuation to the characters that are to be deleted, which are listed in the second argument. One of these modifiers (k) reverses the action of the COMPRESS function (to help remember this, k is for keep). When you use the k modifier, the COMPRESS function keeps the selected characters and removes everything else. This table shows some of the more useful modifiers.

Modifier   Action
d          Adds numerals (digits) to the list of characters to be deleted
a          Adds upper- and lowercase letters to the list of characters to be deleted
i          Ignores case
k          Keeps listed characters instead of removing them
s          Adds blanks, tabs, line-feeds, or carriage returns to the list of characters to be deleted
p          Adds punctuation to the list of characters to be deleted
Here are a few examples: If String = "X1y2Z3"

Function                        Description                               Value Returned
compress(String,,'a')           Removes all letters                       123
compress(String,,'kd')          Keeps digits (deletes everything else)    123
compress(String,'wxyz','i')     Removes wxyz and ignores case             123
compress("A?B C99",,'pd')       Removes punctuation and digits            AB C
Notice that if you want to use modifiers and you do not specify a second argument, you need to use two commas together to indicate that the modifiers are the third argument. Here is the previous program, rewritten using modifiers:
Program 12-7 Demonstrating the COMPRESS modifiers
data phone;
   length PhoneNumber $ 10;
   set learn.phone;
   PhoneNumber = compress(Phone,,'kd');
   *Keep only digits;
   drop Phone;
run;
The resulting data set is identical to the one listed previously. As you can see in this example, it is easier to use the k modifier to indicate what you want to keep, than it is to list all of the characters you want to remove.
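The k modifier also applies to any characters you list in the second argument, so you can keep a character class plus a few extra characters. The following sketch, which uses a made-up phone value and variable names, keeps the digits alone and then the digits together with the dash.

```sas
data _null_;
   length Phone Digits Dashed $ 20;
   Phone = '(908) 232-4856';
   Digits = compress(Phone,,'kd');      /* 9082324856 */
   Dashed = compress(Phone,'-','kd');   /* 908232-4856 */
   put Digits= / Dashed=;
run;
```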
12.8 Searching for Characters
There are a large number of SAS functions that allow you to search for individual characters or several characters together. Let’s start with a problem where you are given a SAS data set containing measurements in either English or metric units.
First, here is a listing of data set Mixed_Nuts:

Listing of MIXED_NUTS
Weight     Height
100Kgs.    59in
180lbs     60inches
88kg       150cm.
50KGS      160CM
Notice that the units are not consistently in upper- or lowercase, there may or may not be a period in the units, and they vary slightly. This looks like a difficult problem—but not when you have SAS character functions. Problems of this type are quite common. Take a look at the program here and then we will discuss the approach:
Program 12-8 Demonstrating the FIND and COMPRESS functions
data English;
   set learn.mixed_nuts(rename=
      (Weight = Char_Weight
       Height = Char_Height));
   if find(Char_Weight,'lb','i') then
      Weight = input(compress(Char_Weight,,'kd'),8.);
   else if find(Char_Weight,'kg','i') then
      Weight = 2.2*input(compress(Char_Weight,,'kd'),8.);
   if find(Char_Height,'in','i') then
      Height = input(compress(Char_Height,,'kd'),8.);
   else if find(Char_Height,'cm','i') then
      Height = input(compress(Char_Height,,'kd'),8.)/2.54;
   drop Char_:;
run;
Because you want to use the same variable names in the English data set as in the original Mixed_Nuts data set, you use the swap-and-drop approach discussed in the previous chapter. Next, the FIND function is used to search for the string lb. Here’s how this function works: In its simplest form, you supply this function with two arguments, like this:
   find(string, find-string)
Here, string is the string you want to search and find-string is the string you are looking for. For example, the value of the following expression is 2, the position in "what hat is that" where the letters hat first appear (inside the word what):
   find("what hat is that","hat")
If find-string is not found, the function returns a 0. An older function, INDEX, is identical to the FIND function as described here. However, the FIND function has some added capabilities; you can add two more arguments to the FIND function. One of these arguments is a modifier that alters how the search works. The most useful modifier is the i (ignore case) modifier. You can also specify a starting position at which to start the search. If you use a negative starting value, the search proceeds from right to left. To summarize, the FIND function has the following syntax:
   find(string, find-string, modifiers, starting-position)
In the previous program, the i modifier is used so that the search looks for upper- or lowercase values of LB or KG. If you are using a version earlier than SAS 9, you could replace the FIND function with the following:
   index(lowcase(Char_Weight),'lb')
Once you determine the units (pounds or kilograms), you use the COMPRESS function to extract the digits, and the INPUT function to do the character-to-numeric conversion. Finally, you drop all the variables that begin with Char_ (the colon is a wildcard). A listing of the resulting data set is shown next:

Listing of ENGLISH
Weight    Height
220.0     59.0000
180.0     60.0000
193.6     59.0551
110.0     62.9921
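The starting-position argument of FIND was described above but not used in the program. Here is a short sketch using the string from the earlier example; the variable names Pos1 through Pos4 are made up for the illustration.

```sas
data _null_;
   String = 'what hat is that';
   Pos1 = find(String,'hat');         /* 2: inside the word what */
   Pos2 = find(String,'hat',3);       /* 6: search starts at position 3 */
   Pos3 = find(String,'HAT','i',7);   /* 14: ignore case, start at 7 */
   Pos4 = find(String,'hat',-16);     /* negative start: search moves */
                                      /* from right to left           */
   put Pos1= Pos2= Pos3= Pos4=;
run;
```

A right-to-left search is a handy way to locate the last occurrence of a substring rather than the first.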
12.9 Searching for Individual Characters
A companion function, FINDC, searches a string for the first occurrence of a character from a list that you supply as the second argument to the function. The table here compares the two functions, FIND and FINDC.

Line   Function                        Value Returned
1      find('XYZCBA','ABC')            0
2      findc('XYZCBA','ABC')           4
3      findc('XYZCBA','A','B','C')     4
4      find('D','ABCDE')               0
5      findc('D','ABCDE')              1
6      find('XYZcba','ABC','i')        0
7      findc('XYZcba','ABC','i')       4
In Line 1 of the table, you are looking for the string ABC. Because this string does not appear in the first argument, the function returns a 0. Lines 2 and 3 of the table are equivalent. You may write the list of search characters all together as in Line 2 or specify them individually as in Line 3. In either case, the FINDC function returns a 4 (the position of the C). In Lines 4 and 5, the string being searched is the single character D. FIND returns a 0 because the string ABCDE does not appear in D, while FINDC returns a 1 because D, in position 1, is one of the characters in the list. The last two lines in the table demonstrate the i (ignore case) modifier that you can use with both of these functions.
12.10 Searching for Words in a String
The FINDW function (available in SAS 9.2 and later) is similar to the FIND function, except that it searches for words (hence the W in the function name). Note: If you are using a SAS version prior to 9.2, you can use the INDEXW function instead. Words are defined as a series of characters that start and end with a word boundary. The default word boundaries are the beginning of a string, the end of a string, and a blank. You can specify alternative delimiters as the third argument to the FINDW function.
The syntax for this function is as follows:
   findw(string, find-string, delimiters, modifiers, starting-position)
Here, string is the string you want to search, find-string is the string you are looking for, and delimiters supplies a list of word delimiters. The following table shows the difference between the two functions, FIND and FINDW.

Function                            Value Returned
find('there is the dog','the')      1 (position of the string the)
findw('there is the dog','the')     10 (start of the word the)
findw('pear:apple','apple',':')     6 (start of the word apple)
As an example, suppose you want to look for the name Roger in a list of character values. Here is the program:
Program 12-9 Demonstrating the FINDW function
data look_for_roger;
   input String $40.;
   if findw(String,'Roger') then Match = 'Yes';
   else Match = 'No';
datalines;
Will Rogers
Roger Cody
Was roger here?
Was Roger here?
;
Here is a listing of data set Look_For_Roger:

Listing of LOOK_FOR_ROGER
String             Match
Will Rogers        No
Roger Cody         Yes
Was roger here?    No
Was Roger here?    Yes
This program shows why the FINDW function is useful when you are looking for words that may be part of a longer string.
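If you also want to match the lowercase spelling, one possible variation is to supply a blank as the delimiter and add the i (ignore case) modifier as the fourth argument. This is a sketch of that idea; the data set name look_any_case is made up.

```sas
data look_any_case;
   input String $40.;
   /* Third argument: blank as the word delimiter */
   /* Fourth argument: i = ignore case            */
   if findw(String,'roger',' ','i') then Match = 'Yes';
   else Match = 'No';
datalines;
Will Rogers
Roger Cody
Was roger here?
Was Roger here?
;
```

With this change, only Will Rogers fails to match, because Rogers is a different word from Roger regardless of case.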
12.11 Searching for Character Classes
A group of functions that some folks call the ANY functions (ANYALNUM, ANYALPHA, ANYDIGIT, ANYPUNCT, and ANYSPACE) allow you to search for alphanumerics (upper- and lowercase letters and digits), alphas (all letters), digits, punctuation characters, and space characters (blank, tab, line feeds), respectively. As an example, suppose you have some ID values that contain letters and digits. You know that every ID contains upper- and lowercase letters, but they may or may not contain any digits. You want to read these values from a raw data file (id.txt) and place them into two SAS data sets, one for IDs containing no digits and the other for IDs containing both letters and digits. A listing of the id.txt file is as follows:

ABc123
XrayMan
142abc
Agent007
Terminator
Here is the program:
Program 12-10 Demonstrating the ANYDIGIT function
data only_alpha mixed;
   infile 'c:\books\learning\id.txt' truncover;
   input ID $10.;
   if anydigit(ID) then output mixed;
   else output only_alpha;
run;
If the argument of ANYDIGIT contains any digits, the function returns the position of the first one. If not, ANYDIGIT returns a 0. Therefore, when ID contains any digits, it returns a value greater than or equal to 1. Because all values that are not 0 or missing are considered true in logical expressions, these IDs are written out to the Mixed data set. IDs containing no digits wind up in the Only_Alpha data set. Here are the listings:
Listing of ONLY_ALPHA
ID
XrayMan
Terminator

Listing of MIXED
ID
ABc123
142abc
Agent007
There is an optional second argument for all of the ANY functions. This argument indicates the starting position of the character value at which to start looking for the specified characters. If this value is negative, the search proceeds from right to left.
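Here is a brief sketch of the starting-position argument, using one of the ID values from the file. The variable names First_Digit and Last_Digit are invented for the example.

```sas
data _null_;
   ID = 'ABc123';                    /* length of ID is 6 */
   First_Digit = anydigit(ID);       /* 4: first digit from the left */
   /* Negative start: begin at position 6 and move left, */
   /* so the last digit in the value is found first      */
   Last_Digit  = anydigit(ID,-6);    /* 6 */
   put First_Digit= Last_Digit=;
run;
```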
12.12 Using the NOT Functions for Data Cleaning
Another collection of functions, the NOT functions, are similar to the ANY functions. The difference is that these functions return the position of the first character in a string that does not belong to the specified class. That makes this collection of functions especially useful for checking a character value for characters that don’t belong. For example, if you expect a string to contain only digits, you can see if the NOTDIGIT function returns any value greater than 0. Let’s see some examples. You have a SAS data set (Cleaning) that contains three variables: Letters, Numerals, and Both. As you can probably tell from these variable names, Letters should contain only upper- and lowercase letters; Numerals should contain only digits, and Both can contain either letters or digits. You want to write a program to produce an error report. Here is a listing of the Cleaning data set:
Listing of CLEANING
Subject    Letters    Numerals    Both
1          Apple      12345       XYZ123
2          Ice9       123X        Abc.123
3          Help!      999         X1Y2Z3
And here is a program that produces the error report:
Program 12-11 Demonstrating the NOT functions for data cleaning
title "Data Cleaning Application";
data _null_;
   file print;
   set learn.cleaning;
   if notalpha(trim(Letters)) then put Subject= Letters=;
   if notdigit(trim(Numerals)) then put Subject= Numerals=;
   if notalnum(trim(Both)) then put Subject= Both=;
run;
It is very important to understand why you need to use the TRIM function in this program. Without the TRIM function, each of the three NOT functions used here would return the position of the first trailing blank in each of the character values. The three NOT functions are used to check for any invalid character type in each of the three variables. Here is the output:

Data Cleaning Application
Subject=2 Letters=Ice9
Subject=2 Numerals=123X
Subject=2 Both=Abc.123
Subject=3 Letters=Help!
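To make the role of TRIM concrete, here is a minimal sketch with a made-up value stored in a variable that is longer than the value itself. The variable names Untrimmed and Trimmed are assumptions for the example.

```sas
data _null_;
   length Numerals $ 5;
   Numerals = '123';                       /* stored as 123 plus two blanks */
   Untrimmed = notdigit(Numerals);         /* 4: the first trailing blank */
   Trimmed   = notdigit(trim(Numerals));   /* 0: no invalid characters */
   put Untrimmed= Trimmed=;
run;
```

Without TRIM, a perfectly valid value would be reported as an error simply because it is shorter than the storage length of the variable.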
12.13 Describing a Real Blockbuster Data Cleaning Function
One of the most useful SAS functions for checking that a character value contains only valid values is the VERIFY function. This function acts somewhat like the NOT functions: it returns the position in a string of the first invalid character and a 0 if it doesn’t find any invalid characters. How does the function know that characters are valid or invalid? Well, you supply it with a list of valid characters in the second argument. This is somewhat complicated to describe and easier to demonstrate. So, on to some examples. You want to check that a variable only contains the letters A through E and write out a message for any values that violate this rule. Here’s the program:
Program 12-12 Using the VERIFY function for data cleaning
data errors valid;
   input ID $ Answer : $5.;
   if verify(Answer,'ABCDE') then output errors;
   else output valid;
datalines;
001 AABDE
002 A5BBD
003 12345
;
The VERIFY function takes two arguments: the first is the character value you want to check, and the second is the list of valid characters. In this program, the VERIFY function returns the position of the first character in Answer that is not an A, B, C, D, or E. If there are no invalid characters, the function returns a 0. Therefore, the observations for ID 002 and 003 (values returned are 2 and 1, respectively) wind up in data set Errors; the observation for ID 001 winds up in data set Valid (because VERIFY returns a 0). If you are checking a value that may contain trailing blanks, remember to trim the value first.
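The reminder about trailing blanks can be sketched as follows. The value ABC is made up, and the variable names Check1 through Check3 are assumptions; the third check shows an alternative to TRIM, which is to include a blank in the list of valid characters.

```sas
data _null_;
   length Answer $ 5;
   Answer = 'ABC';                            /* two trailing blanks */
   Check1 = verify(Answer,'ABCDE');           /* 4: first trailing blank */
   Check2 = verify(trim(Answer),'ABCDE');     /* 0: blanks trimmed first */
   Check3 = verify(Answer,'ABCDE ');          /* 0: blank is in the list */
   put Check1= Check2= Check3=;
run;
```

The two approaches are not equivalent in general: adding a blank to the list also accepts embedded blanks, while TRIM removes only the trailing ones.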
12.14 Extracting Part of a String
It is useful to be able to extract one or more characters from a character variable. For example, an ID number such as NJ12M99 might contain a state abbreviation as the first two characters and a gender code in the 5th position. You might even want to extract the digits between the state code and the gender as well. The SUBSTR (stands for substring) function allows you to accomplish these tasks. Here is a program that extracts a state abbreviation, the digits in Positions 3–4 (and creates a numeric variable), the gender code in the 5th position, and the final digit or digits as a character variable:
Program 12-13 Using the SUBSTR function to extract substrings
data extract;
   input ID : $10. @@;
   length State $ 2 Gender $ 1 Last $ 5;
   State = substr(ID,1,2);
   Number = input(substr(ID,3,2),3.);
   Gender = substr(ID,5,1);
   Last = substr(ID,6);
datalines;
NJ12M99 NY76F4512 TX91M5
;
The second and third arguments of the SUBSTR function specify the starting position and the length of the substring. For example, the variable State is extracted from ID, starting in the first position for a length of 2. The variable Number is a bit more complicated. The SUBSTR function extracts two digits, starting at Position 3, and then uses the INPUT function to perform a character-to-numeric conversion. Gender is the 5th character of the ID. Notice that the SUBSTR function used to create the variable Last does not use a third argument. When the third argument is missing, the SUBSTR function extracts characters until the last non-blank character in the string. Here is a listing of the resulting data set:

Listing of EXTRACT
ID           State    Gender    Last    Number
NJ12M99      NJ       M         99      12
NY76F4512    NY       F         4512    76
TX91M5       TX       M         5       91
In this program, a LENGTH statement is used to define the lengths of State, Gender, and Last. Without this statement, the lengths of these three character variables would be equal to 10, which is the length of ID. Understanding why this is, is very important. SAS assigns a length to all character variables in the compile stage before any data values are read. In general, the starting position and the length of a substring could be read from data or computed in the DATA step. So, without a LENGTH statement, what length should SAS assign to the variables State, Gender, and Last? The longest substring you can extract from a string of length n is a string of length n, so if you don’t define a length, that’s exactly what SAS does.
By the way, you could create the variable State like this: State = ID;
Because you have set the length of State equal to 2, this assignment statement sets State equal to the first two characters in the ID. This is a useful trick that you will see from time to time in SAS programs.
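You can check the effect of the LENGTH statement with the LENGTHC function. In this sketch, State1 is created without a declared length and State2 with one; both names are hypothetical and exist only for the comparison.

```sas
data _null_;
   length ID $ 10 State2 $ 2;
   ID = 'NJ12M99';
   State1 = substr(ID,1,2);    /* no length declared for State1 */
   State2 = substr(ID,1,2);
   Len1 = lengthc(State1);     /* 10: the storage length of ID */
   Len2 = lengthc(State2);     /* 2: from the LENGTH statement */
   put Len1= Len2=;
run;
```

Both variables print as NJ, so the difference is easy to miss unless you look at the storage lengths.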
12.15 Dividing Strings into Words
The SCAN function is typically used to extract words from a string. However, as you will see in a later example, it can also be used to parse (separate) substrings separated by delimiters other than blanks. Let’s start with a common example, separating the first and last names from a variable that contains both:
Program 12-14 Demonstrating the SCAN function
data original;
   input Name $30.;
datalines;
Jeffrey Smith
Ron Cody
Alan Wilson
Alfred E. Newman
;

data first_last;
   set original;
   length First Last $ 15;
   First = scan(Name,1,' ');
   Last = scan(Name,2,' ');
run;
The SCAN function extracts the nth word from a string with blanks and most punctuation characters as the default delimiters. (The default delimiters differ in the ASCII and EBCDIC character sets. They are < ( + & ! $ * ) ; ^ - / , % | and < ( + | & ! $ * ) ; ¬ - / , % | ¢ respectively.) It is usually a good idea to specify the word delimiters you want to use as the third argument to the SCAN function. In this program, a blank is specified as the delimiter.
A LENGTH statement is used to define the length of First and Last. If you don’t include this statement, both First and Last will have a length of 200. Here is a listing of data set FIRST_LAST:

Listing of FIRST_LAST
Name                First      Last
Jeffrey Smith       Jeffrey    Smith
Ron Cody            Ron        Cody
Alan Wilson         Alan       Wilson
Alfred E. Newman    Alfred     E.
Notice that the middle initial was read as a last name for Mr. Newman. So, how do you solve this problem? You could use the SCAN function to look for a possible third word in this program. If you ask for a third word when there are only two, SCAN returns a missing character value. You could test each Name for a third word. If it is found, you can assume that the second word is an initial. However, there is an easier way. Here is a program that extracts the last name from the Name value and produces a list of names in alphabetical order:
Program 12-15 Using the SCAN function to extract the last name
data last;
   set original;
   length LastName $ 15;
   LastName = scan(Name,-1,' ');
run;

proc sort data=last;
   by LastName;
run;

title "Alphabetical list of names";
proc print data=last noobs;
   var Name;
run;
The trick is to use a negative value as the second argument to the SCAN function. A negative value causes the scan to proceed from right to left. Here is the result:

Alphabetical list of names
Name
Ron Cody
Alfred E. Newman
Jeffrey Smith
Alan Wilson
Notice that the names are in alphabetical order by last name.
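For comparison, the alternative mentioned earlier (testing for a third word) might look like the following sketch. It assumes a name has at most three words and that a second word followed by a third is a middle initial; the data set name first_last_check is made up.

```sas
data first_last_check;
   input Name $30.;
   length First Last $ 15;
   First = scan(Name,1,' ');
   /* SCAN returns a missing value when there is no third word */
   if missing(scan(Name,3,' ')) then Last = scan(Name,2,' ');
   else Last = scan(Name,3,' ');
datalines;
Jeffrey Smith
Ron Cody
Alan Wilson
Alfred E. Newman
;
```

The negative word count is shorter and also copes with names that have more than three words, which is why it is the easier solution.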
12.16 Comparing Strings
The COMPARE function allows you to compare two character values. With optional modifiers, you can ignore case and truncate a longer value to the length of a shorter value before making the comparison. To demonstrate the COMPARE function, suppose you want to check diagnosis codes that start with V450. One problem is that some of the codes may have the V in lowercase. You also want to match codes that start with V450 and are followed by a period and, optionally, additional digits, such as V450.100. While this is a relatively easy task using conventional DATA step programming, you can accomplish the comparison in a single statement using the COMPARE function. Take a look at the following program:
Program 12-16 Demonstrating the COMPARE function
data diagnosis;
   input Code $10.;
   if compare(Code,'V450','i:') eq 0 then Match = 'Yes';
   else Match = 'No';
datalines;
V450
v450
v450.100
V900
;
The first two arguments of the COMPARE function are the two character values you want to compare. The optional third argument allows you to specify modifiers. The i modifier specifies that you want to ignore case; the colon modifier specifies that you want to truncate the longer string to the length of the shorter string before making the comparison. COMPARE returns a 0 if there is a match (after applying the modifiers), and a non-0 value if the two values differ. The absolute value of the result tells you the position of the first character that differs between the two strings. The sign of this value tells you which of the two values comes first in the collating sequence. In practice, you simply want to know if the function returns a 0 or not. Let’s look at a listing of the Diagnosis data set:

Listing of DIAGNOSIS
Code        Match
V450        Yes
v450        Yes
v450.100    Yes
V900        No
Notice that the first three values resulted in a match. Be careful when you use the colon modifier. When SAS computes the length of the shorter string, it includes trailing blanks. Here is an example:
Program 12-17 Clarifying the use of the colon modifier with the COMPARE function
data _null_;
   String1 = 'ABC   ';   *Three trailing blanks;
   String2 = 'ABCXYZ';
   Compare1 = compare(String1,String2,':');
   Compare2 = compare(trim(String1),String2,':');
   put String1= String2= Compare1= Compare2=;
run;
Here is the output:

String1=ABC String2=ABCXYZ Compare1=-4 Compare2=0
String1 is ABC followed by three blanks. When you use the colon modifier to compare this value to String2, SAS sees the length of both strings as equal to six. If you want to compare only the value ABC (without the trailing blanks), use the TRIM function to remove the trailing blanks. In computing the value of Compare2, SAS trims String2 to a length of three (the length of String1 after you strip off the trailing blanks) before making the comparison. In case you are interested in why the value of Compare1 is –4, here is the explanation: the two strings differ in the fourth character. Because a blank comes before an X in the collating sequence, the value is negative.
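For reference, here is one way the diagnosis-code check could be written with conventional DATA step programming, as a sketch of the alternative mentioned at the start of this section. It uses the UPCASE function and the =: comparison operator, which compares two values after truncating the longer one to the length of the shorter one. The data set name diagnosis2 is made up.

```sas
data diagnosis2;
   input Code $10.;
   /* =: compares only the first 4 characters of the */
   /* uppercased code, the length of the constant    */
   if upcase(Code) =: 'V450' then Match = 'Yes';
   else Match = 'No';
datalines;
V450
v450
v450.100
V900
;
```

As with the COMPARE function, the first three codes match and V900 does not.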
12.17 Performing a Fuzzy Match
The SPEDIS function (stands for spelling distance) is used for fuzzy matching, which is comparing character values that may be spelled differently. The logic is a bit complicated, but using this function is quite easy. As an example, suppose you want to search a list of names to see if the name Friedman is in the list. You want to look for an exact match or names that are similar. Here is such a program:
Program 12-18 Using the SPEDIS function to perform a fuzzy match
data fuzzy;
   input Name $20.;
   Value = spedis(Name,'Friedman');
datalines;
Friedman
Freedman
Xriedman
Freidman
Friedmann
Alfred
FRIEDMAN
;
Here is a listing of data set Fuzzy:

Listing of FUZZY
Name         Value
Friedman     0
Freedman     12
Xriedman     25
Freidman     6
Friedmann    3
Alfred       100
FRIEDMAN     87
The SPEDIS function returns a 0 if the two arguments match exactly. The function assigns penalty points for each type of spelling error. For example, getting the first letter wrong is assigned more points than misspelling other letters. Interchanging two letters is a relatively small error, as is adding an extra letter to a word. Once the total number of penalty points has been computed, the resulting value is computed as a percentage of the length of the first argument. This makes sense because getting one letter wrong in a 3-letter word would be a more serious error than getting one letter wrong in a 10-letter word. Notice that the two character values evaluated by the SPEDIS function are case-sensitive (look at the last observation in the listing). If case may be a problem, use the UPCASE or LOWCASE function before testing the value with SPEDIS. To identify any name that is similar to Friedman, you could extract all names where the value returned by the SPEDIS function is less than some predetermined value. In the program here, values less than 15 or 20 would identify some reasonable misspellings of the name.
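Putting these two suggestions together, a sketch of a program that keeps only the likely matches might look like this. The cutoff of 15 is an arbitrary choice for the illustration, and the data set name possible_matches is made up.

```sas
data possible_matches;
   input Name $20.;
   /* Uppercase the name so that case differences */
   /* do not add penalty points                   */
   Value = spedis(upcase(Name),'FRIEDMAN');
   /* Subsetting IF: keep names at or below the cutoff */
   if Value le 15;
datalines;
Friedman
Freedman
Xriedman
Freidman
Friedmann
Alfred
FRIEDMAN
;
```

In practice, you would choose the cutoff by looking at the values returned for a sample of your own data.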
12.18 Substituting Characters or Words

The last two functions in this chapter (hurray!) are TRANSLATE and TRANWRD. TRANSLATE allows you to substitute one character value for another. TRANWRD lets you substitute one word for another (hence the name TRANWRD, for translate word).
In this first example, you want to substitute the values A, B, C, D, and E for the character values 1, 2, 3, 4, and 5.
Program 12-19 Demonstrating the TRANSLATE function

data trans;
   input Answer : $5.;
   Answer = translate(Answer,'ABCDE','12345');
datalines;
14325
AB123
51492
;
The resulting data set Trans contains the following three observations:

Listing of TRANS

Answer
ADCBE
ABABC
EAD9B
As you can see, the values 1–5 are replaced by A–E. Any other characters are not changed. The TRANSLATE function takes three arguments: the first argument is the character string you want to change; the second and third arguments are the to-string and the from-string. Each character in the to-string is substituted for the corresponding character in the from-string. In the program above, a 1 becomes an A, a 2 becomes a B, and so forth.

The TRANWRD function is used to substitute one word for another. This function is often used to standardize addresses (changing Street to St., Road to Rd., and so forth). Here is an example:
Program 12-20 Using the TRANWRD function to standardize an address

data address;
   infile datalines dlm=' ,';
   *Blanks or commas are delimiters;
   input #1 Name $30.
         #2 Line1 $40.
         #3 City & $20. State : $2. Zip : $5.;

   Name = tranwrd(Name,'Mr.',' ');
   Name = tranwrd(Name,'Mrs.',' ');
   Name = tranwrd(Name,'Dr.',' ');
   Name = tranwrd(Name,'Ms.',' ');
   Name = left(Name);

   Line1 = tranwrd(Line1,'Street','St.');
   Line1 = tranwrd(Line1,'Road','Rd.');
   Line1 = tranwrd(Line1,'Avenue','Ave.');
datalines;
Dr. Peter Benchley
123 River Road
Oceanside, NY 11518
Mr. Robert Merrill
878 Ocean Avenue
Long Beach, CA 90818
Mrs. Laura Smith
80 Lazy Brook Road
Flemington, NJ 08822
;
The arguments for the to-string and the from-string are not in the same order as in the TRANSLATE function. Here, the first argument to the function is the character variable where you want to make the substitutions, and the second and third arguments are the from-string and to-string, respectively. This seems to make more sense, at least to this author. Keep in mind that, unlike the TRANSLATE function, TRANWRD does not substitute one character for another; it substitutes one string (usually a word) for another.

Notice the use of the TRANWRD function to remove the titles (Mr., Mrs., Dr., and Ms.) from the names. Here you are substituting a blank for these strings. To remove the leading blank that results, you use the LEFT function to left-align the name.

If the length of the resulting variable is not defined before you use this function, SAS sets the length to 200. The reasoning here is that the resulting string could be longer than the original string after a substitution; therefore, the default length is made quite large. (Two hundred was not a completely arbitrary number: it was the maximum length of a character variable in SAS 6.) In this program, length is not an issue because the length of
the variables resulting from the TRANWRD function has already been defined and, because you are substituting shorter strings for longer ones, the original length is long enough to hold the new values.

Here is a listing of the Address data set. Notice that all the substitutions have been made and the titles (for example, Mr. and Mrs.) have all been removed.

Listing of ADDRESS

Name             Line1               City         State   Zip
Peter Benchley   123 River Rd.       Oceanside    NY      11518
Robert Merrill   878 Ocean Ave.      Long Beach   CA      90818
Laura Smith      80 Lazy Brook Rd.   Flemington   NJ      08822
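To make the points about argument order and default length concrete, here is a small sketch. The data set name Length_Demo and the variable names Fixed, Unsized, and Codes are hypothetical, chosen only for this illustration; the behavior of Unsized is what the discussion above leads you to expect, not something verified here.

```sas
data length_demo;
   length Fixed $ 20;                         /* length defined before TRANWRD is used */
   Street = '123 River Road';
   Fixed   = tranwrd(Street,'Road','Rd.');    /* TRANWRD: source, from-string, to-string */
   Unsized = tranwrd(Street,'Road','Rd.');    /* length not defined beforehand */
   Codes   = translate('14325','ABCDE','12345'); /* TRANSLATE: source, to-string, from-string */
run;

/* Inspect the storage lengths of Fixed and Unsized */
proc contents data=length_demo;
run;
```

PROC CONTENTS should report a length of 20 for Fixed and, following the rule described above, a length of 200 for Unsized.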
And so ends this rather long chapter. Even though this chapter only covered some of the more useful character functions, you can see that SAS has a tremendously powerful set of functions to manipulate character data.
12.19 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Look at the following program and determine the storage length of each of the variables:

   data storage;
      length A $ 4 B $ 4;
      Name = 'Goldstein';
      AandB = A || B;
      Cat = cats(A,B);
      if Name = 'Smith' then Match = 'No';
      else Match = 'Yes';
      Substring = substr(Name,5,2);
   run;

   A           _________________
   B           _________________
   Name        _________________
   AandB       _________________
   Cat         _________________
   Match       _________________
   Substring   _________________
2. Using the data set Mixed, create a temporary SAS data set (also called Mixed) with the following new variables:
   a. NameLow – Name in lowercase
   b. NameProp – Name in proper case
   c. (Bonus – difficult) NameHard – Name in proper case without using the PROPCASE function
3. Here is a listing of data set Names_And_More:

   Listing of Data Set LEARN.NAMES_AND_MORE

   Name               Phone            Height       Mixed
   Roger   Cody       (908)782-1234    5ft. 10in.   50 1/8
   Thomas Jefferson   (315) 848-8484   6ft. 1in.    23 1/2
   Marco Polo         (800)123-4567    5Ft. 6in.    40
   Brian Watson       (518)355-1766    5ft. 10in    89 3/4
   Michael DeMarco    (445)232-2233    6ft.         76 1/3

   Create a new, temporary SAS data set (Names_And_More) using the permanent SAS data set Names_And_More with the following changes:
   a. Name has only single blanks between the first and last name.
   b. Phone contains only digits (and is still a character value).

4. Data set Names_And_More contains a character variable called Height. As you can see in the listing in Problem 3, the heights are in feet and inches. Assume that these units can be in upper- or lowercase and there may or may not be a period following the units. Create a temporary SAS data set (Height) that contains a numeric variable (HtInches) that is the height in inches.
Note: One of the Height values is missing an inches value. Be sure that there are no character-to-numeric notes in the SAS log.

Hints: You can remove all the characters INinFTft and period from Height, leaving only the digits and the space between. You can then use the SCAN function to extract the feet and inches values.

5. Data set Names_And_More contains a character variable called Mixed that is either an integer or a mixed number (such as 50 1/8). Using this data set, create a new, temporary SAS data set with a numeric variable (Price) that has decimal values. This number should be rounded to the nearest .001.

   Hint: Use the SCAN function with blanks and forward slashes (/) as delimiters.

6. Data set Study (shown here) contains the character variables Group and Dose. Create a new, temporary SAS data set (Study) with a variable called GroupDose by putting these two values together, separated by a dash. The length of the resulting variable should be 6 (test this using PROC CONTENTS or the SAS Explorer). Make sure that there are no blanks (except trailing blanks) in this value. Try this problem two ways: first using one of the CAT functions, and second without using any CAT functions. Here is the listing:

   Listing of Data Set LEARN.STUDY

   Subj   Group   Dose   Weight    Subgroup
   001    A       Low    220lbs.   2
   002    A       High   90Kg.     1
   003    B       Low    88kg      1
   004    B       High   165lbs.   2
   005    A       Low    88kG      1
7. Data set Study contains a character variable (Group) and a numeric variable (Subgroup). Create a new, temporary SAS data set with these two variables plus a variable consisting of the Group, a dash, and the Subgroup (call it Combined). Try doing this with and without using any of the CAT functions. Be sure there are no conversion messages in your SAS log.

8. Notice in the listing of data set Study in Problem 6 that the variable called Weight contains units (either lbs or kgs). These units are not always consistent in case and may or may not contain a period. Assume an upper- or lowercase LB indicates pounds and an upper- or lowercase KG indicates kilograms. Create a new, temporary
SAS data set (Study) with a numeric variable also called Weight (careful here) that represents weight in pounds, rounded to the nearest 10th of a pound. Note: 1 kilogram = 2.2 pounds.

9. Using the Sales data set, create a temporary SAS data set (Spirited) containing all the observations from Sales where the string (not necessarily the word) SPIRIT in either upper-, lower-, or mixed case is part of the Customer value (variable name Customer).

10. Data set Errors contains character variables Subj (3 bytes) and PartNumber (8 bytes). (See the partial listing here.) Create a temporary SAS data set (Check1) with any observation in Errors that violates either of the following two rules: first, Subj should contain only digits, and second, PartNumber should contain only the uppercase letters L and S and digits. Here is a partial listing of Errors:

    Listing of Data Set LEARN.ERRORS

    Subj   Part Number   Name
    001    L1232         Nichole Brown
    0a2    L887X         Fred Beans
    003    12321         Alfred 2 Nice
    004    abcde         Mary Bumpers
    X89    8888S         Gill Sandford
11. List the subject number (Subj) for all observations in Errors where the Name contains a digit. (See a listing of the previous data set.)

12. List the subject number (Subj) for any observations in Errors where PartNumber contains an upper- or lowercase X or D.

13. Data set Social contains two variables, SS1 and SS2. These variables represent all possible combinations of Social Security numbers from two separate data sets. Using this data set, create two temporary SAS data sets: one, where SS1 is equal to SS2, and two, where SS1 is within a spelling distance of 25 of SS2. Call these data sets Exact and Within25, respectively.

    Hint: You can compute the spelling distance between two Social Security numbers just as you would between two names.
14. List all patients in the Medical data set where the word antibiotics is in the comment field (Comment).

15. Using the Names_And_More data set, create a temporary SAS data set containing the phone number (Phone) and the 3-digit area code (AreaCode). Be sure the length of AreaCode is 3. You may need to list a few observations in Names_And_More to see how Phone is stored.

16. Provide a list, in alphabetical order by last name, of the observations in the Names_And_More data set. Set the length of the last name to 15 and remove multiple blanks from Name.

    Note: The variable Name contains a first name, one or more spaces, and then a last name.

17. List the observations in data set Personal. Replace the first 7 digits of the Social Security number (SS) with asterisks and replace the last character (Position 5) of the account number (AcctNumber) with a dash. Do not include the variables Food1–Food8 in the list.
Chapter 13

Working with Arrays

13.1 Introduction
13.2 Setting Values of 999 to a SAS Missing Value for Several Numeric Variables
13.3 Setting Values of NA and ? to a Missing Character Value
13.4 Converting All Character Values to Lowercase
13.5 Using an Array to Create New Variables
13.6 Changing the Array Bounds
13.7 Temporary Arrays
13.8 Loading the Initial Values of a Temporary Array from a Raw Data File
13.9 Using a Multidimensional Array for Table Lookup
13.10 Problems
13.1 Introduction

Cody’s rule of SAS programming goes something like this: if you are writing a SAS program, and it is becoming very tedious, stop. There is a good chance that there is a SAS tool, perhaps arrays or macros, that will make your task less tedious. To the beginning programmer, arrays can be a bit frightening; to an experienced programmer, arrays can be a huge time saver. So, let’s get you over the fright and into saving time.

First of all, what is an array? A SAS array is a collection of elements (usually SAS variables) that allows you to write SAS statements referencing this group of variables.

Note: SAS arrays are different from arrays in many other programming languages. They do not hold values themselves; instead, they allow you to refer to a collection of SAS variables in a convenient manner.

It is much easier to understand what an array is by looking at a few simple examples. So, here goes.
13.2 Setting Values of 999 to a SAS Missing Value for Several Numeric Variables

Typically arrays are used to perform a similar operation on a group of variables. In this example, you have a SAS data set called SPSS that contains several numeric variables. The folks who created this data set used a value of 999 whenever there was a missing value. (Some statistical packages such as SPSS allow you to substitute a specific value for a missing value. It is common to use values such as 999 or 9999 to represent this missing value.) The following is a program that solves this problem without using arrays:
Program 13-1 Converting values of 999 to a SAS missing value, without using arrays

data new;
   set learn.SPSS;
   if Height = 999 then Height = .;
   if Weight = 999 then Weight = .;
   if Age = 999 then Age = .;
run;
Notice that you are writing the same SAS statement several times; the only thing that is changing is the name of the variable. You would like to be able to write something like this:

   if Height, Weight, or Age = 999 then Height, Weight or Age = .;
This is pretty much how an array works. Let’s first see the program using arrays, and then we’ll go through the explanation:
Program 13-2 Converting values of 999 to a SAS missing value, using arrays

data new;
   set learn.SPSS;
   array myvars{3} Height Weight Age;
   do i = 1 to 3;
      if myvars{i} = 999 then myvars{i} = .;
   end;
   drop i;
run;
The first thing you may notice is that the program with arrays is longer than the one without arrays! However, if you had 50 or 100 variables to process, the program using arrays would not be any longer.

The ARRAY statement is used to create the array. Following the keyword ARRAY is the name you choose for your array. Array names follow the same rules you use for SAS variables. In this example, you chose the name MYVARS as the array name. Following the array name, you place the number of elements (in this example, the three variables) in brackets. Finally, you list the variables you want to include in the array. You may use any of the SAS shorthand methods for referring to a list of variables here, such as Var1-Varn. This list of variables must be all numeric or all character; you cannot mix them.

You may also use square brackets [] or parentheses () following the array name. SAS documentation usually uses curly brackets {}. We recommend that you always use the same type of brackets (either straight or curly) when you use arrays. The reason: you can always tell when a program is referencing an array if you reserve a particular type of bracket for use only when you are writing an array element.

Once you have defined your array, you can use an array reference in place of a variable name anywhere in the DATA step. For example, MYVARS{2} can be used in place of the variable Weight. The number in the brackets following the array name is called a subscript, even though it is not true subscript notation.
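As a brief illustration of the shorthand variable list mentioned above, the following sketch applies the same 999-to-missing logic to ten variables. The data set name New_Many, the array name QUES, the variables Q1-Q10, and the data line are all made up for this example.

```sas
data new_many;
   input Q1-Q10;
   /* The numbered range list Q1-Q10 stands for Q1, Q2, ..., Q10 */
   array ques{10} Q1-Q10;
   do i = 1 to 10;
      if ques{i} = 999 then ques{i} = .;
   end;
   drop i;
datalines;
1 2 999 4 5 999 7 8 9 10
;
```

The ARRAY statement and the DO loop stay the same size no matter how many variables the list covers; only the element count and the upper bound of the loop change.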
By placing the array in a DO loop, you can process each variable in the array. Also, because you do not need or want the DO loop counter included in the SAS data set, you use a DROP statement.

Let’s “play computer” and follow the logic of this program. At the top of the DATA step, the SET statement brings in an observation from data set SPSS. Next, the DO loop counter starts at 1. The IF statement references the array element MYVARS{1}, which is equivalent to the variable Height. SAS checks if the value of Height is equal to 999 and, if so, replaces it with a SAS missing value. The DO loop continues for all three array elements. When the DO loop finishes, you are at the bottom of the DATA step. An observation is written out to the NEW data set and control returns to the top of the DATA step, where the next observation from data set SPSS is read. This process continues until there are no more observations to read from the SPSS data set.

The best way to get started writing arrays is to first write a few lines of code without an array. Next, write an ARRAY statement where the array elements are all the variables you want to process. Next, use one of the sample lines as a template and write that line, substituting your array name (with a subscript of your choice) for the variable name. Finally, place this line (or lines) of code inside a DO loop. (Don’t forget to drop the DO loop counter.) You are done.

Before we leave this section, this is a good time to mention that in either of the programs discussed previously, you could use the CALL MISSING routine to assign a missing value to a variable (see Chapter 11). Program 13-2 rewritten using this CALL routine is displayed in Program 13-3:
Program 13-3 Rewriting Program 13-2 using the CALL MISSING routine

data new;
   set learn.SPSS;
   array myvars{3} Height Weight Age;
   do i = 1 to 3;
      if myvars{i} = 999 then call missing(myvars{i});
   end;
   drop i;
run;
13.3 Setting Values of NA and ? to a Missing Character Value

As we mentioned in the previous section, SAS arrays must contain all numeric or all character variables. If the variables you want to include in a character array are already defined as character (for example, they are coming from a SET statement), you can write an ARRAY statement resembling the one in the previous section. However, if you want the array to contain new variables, you need to include a dollar sign ($) and, optionally, a length when you define the array. For example, to create an array of character variables Q1–Q20, each with a length of 2 bytes, you would write the following:

   array mychars{20} $ 2 Q1-Q20;
It is a good idea to include the dollar sign ($) in every character array, even if the array variables have been previously defined as character. As an example, the following program uses a character array to convert all values of NA (Not Applicable) or question mark (?) to a SAS missing value. Suppose you are given a SAS data set Chars and you want to create a new data set named Missing with these changes. Here is the program:
Program 13-4 Converting values of NA and ? to missing character values

data missing;
   set learn.chars;
   array char_vars{*} $ _character_;
   do loop = 1 to dim(char_vars);
      if char_vars{loop} in ('NA' '?') then
         call missing(char_vars{loop});
   end;
   drop loop;
run;
This program introduces a number of new features. First, an asterisk is used in place of the number of elements in the array. You can always use an asterisk here if you don’t know how many variables are in the array (as may be the case here where you haven’t counted the number of character variables in the Chars data set). SAS counts for you and computers are better at counting than people, anyway. Next, the keyword _CHARACTER_ is used as the variable list. Because this statement follows the SET statement, _CHARACTER_ includes all the character variables in the Chars data set. _CHARACTER_ references character variables that are present in the PDV at that point in the DATA step. For example, if you define character variables A, B, and C in the first three lines of a DATA step and then use the reference _CHARACTER_ followed by
defining character variables D and E, only variables A, B, and C are referenced by _CHARACTER_. In Program 13-4, if you place the ARRAY statement before the SET statement, the array does not reference any variables. Because you used an asterisk in place of the number of elements in the array, what value do you use for the upper bound of the DO loop? Luckily, the DIM function comes to the rescue. It returns the number of elements in an array. Finally, an IF statement checks for the values of NA or ? and sets them equal to a SAS missing value.
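The A-through-E scenario described above can be written out as a short sketch. The data set name Order_Demo, the array name FIRST_THREE, and the variable N_Elements are hypothetical; the variables are never given values, so SAS will likely write notes about uninitialized variables to the log.

```sas
data order_demo;
   length A B C $ 1;
   /* At this point only A, B, and C are defined as character variables */
   array first_three{*} $ _character_;
   /* D and E are defined after the ARRAY statement, so they are not included */
   length D E $ 1;
   /* DIM returns the number of elements in the array */
   N_Elements = dim(first_three);
   put N_Elements=;
run;
```

Following the rule described above, N_Elements should be 3 rather than 5, which shows why the position of the ARRAY statement in the DATA step matters when you use _CHARACTER_.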
13.4 Converting All Character Values to Lowercase

If you have data (either a raw data file or a SAS data set) where the data entry folks were careless (or didn’t set standards ahead of time), you may have a hodge-podge of upper-, lower-, or proper case values. This section describes a simple program to convert all character values to lowercase. First, here is a listing of data set Careless:

Listing of CARELESS

Score   Last Name   Ans1   Ans2   Ans3
100     COdY        A      b      c
65      sMITH       C      C      d
95      scerbo      D      e      D
The following program converts all of the character values in a SAS data set (Careless) to lowercase:
Program 13-5 Converting all character values in a SAS data set to lowercase

data lower;
   set learn.careless;
   array all_chars{*} _character_;
   do i = 1 to dim(all_chars);
      all_chars{i} = lowcase(all_chars{i});
   end;
   drop i;
run;
The logic of this program is similar to the previous program. You use the keyword _CHARACTER_ to reference all the character variables in data set Careless, and then use a DO loop to convert all the values to lowercase. Data set Lower, with all the character values in lowercase, is shown here:

Listing of LOWER

Score   Last Name   Ans1   Ans2   Ans3
100     cody        a      b      c
65      smith       c      c      d
95      scerbo      d      e      d
13.5 Using an Array to Create New Variables

You can include variables in an ARRAY statement that do not yet exist in your SAS data set. For example, if your SAS data set had variables Fahren1–Fahren24 containing 24 Fahrenheit temperatures, you could use an array to create 24 new variables (say Celsius1–Celsius24) with the Celsius equivalents. Here is a program that accomplishes this:
Program 13-6 Using an array to create new variables

data temp;
   input Fahren1-Fahren24 @@;
   array Fahren{24};
   array Celsius{24} Celsius1-Celsius24;
   do Hour = 1 to 24;
      Celsius{Hour} = (Fahren{Hour} - 32)/1.8;
   end;
   drop Hour;
datalines;
35 37 40 42 44 48 55 59 62 62 64 66
68 70 72 75 75 72 66 55 53 52 50 45
;
The variables Celsius1–Celsius24 are created by the ARRAY statement. Inside the DO loop, you convert each of the 24 Fahrenheit temperatures to Celsius. Data set Temp contains all 24 Fahrenheit and 24 Celsius temperatures. You may wonder where the variable list went in the FAHREN array. If you omit a variable list in an ARRAY statement and you include the number of elements following the array name, SAS automatically creates variable names for you, using the array name as the base and adding the numbers from 1 to n, where n is the number of elements in the array. In this program, SAS creates the variables Fahren1–Fahren24. You could have used this feature for the Celsius array as well.
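To illustrate the last point, here is a shortened sketch in which both ARRAY statements omit the variable list. The data set name Temp_Short and the three temperatures are assumptions made for this example; only three hours are used to keep it small.

```sas
data temp_short;
   input Fahren1-Fahren3;
   array Fahren{3};    /* refers to Fahren1-Fahren3, which already exist */
   array Celsius{3};   /* SAS creates Celsius1-Celsius3 automatically */
   do Hour = 1 to 3;
      Celsius{Hour} = (Fahren{Hour} - 32)/1.8;
   end;
   drop Hour;
datalines;
32 50 212
;
```

The resulting Celsius values should be approximately 0, 10, and 100. Omitting the variable list is convenient when you are happy with names built from the array name; list the variables explicitly when you want different names.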
13.6 Changing the Array Bounds

By default, SAS numbers the elements of an array starting from 1. There are times when it is useful to specify the beginning and ending values of the array elements. For example, if you have variables Income1999 to Income2006, it would be nice to have the array elements start with 1999 and end with 2006. The program that follows creates an array of the eight Income values, using the values of 1999 and 2006 as the array bounds, and computes the taxes for each of the eight years:
Program 13-7 Changing the array bounds

data account;
   input ID Income1999-Income2006;
   array income{1999:2006} Income1999-Income2006;
   array taxes{1999:2006} Taxes1999-Taxes2006;
   do Year = 1999 to 2006;
      Taxes{Year} = .25*Income{Year};
   end;
   drop Year;
   format Income1999-Income2006
          Taxes1999-Taxes2006 dollar10.;
datalines;
001 45000 47000 47500 48000 48000 52000 53000 55000
002 67130 68000 72000 70000 65000 52000 49000 40100
;
As you can see in this program, you specify the lower and upper bounds in the brackets following the array name and separate them with a colon.
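When the bounds are not 1 to n, you may prefer not to repeat them in the DO statement. The sketch below is a variation on Program 13-7, not the author's program: it uses the SAS functions LBOUND and HBOUND, which are not discussed in this chapter, to obtain the lower and upper bounds of the array. The data set name Account_Bounds is hypothetical.

```sas
data account_bounds;
   input ID Income1999-Income2006;
   array income{1999:2006} Income1999-Income2006;
   array taxes{1999:2006} Taxes1999-Taxes2006;
   /* LBOUND and HBOUND return the lower and upper bounds (1999 and 2006 here) */
   do Year = lbound(income) to hbound(income);
      taxes{Year} = .25*income{Year};
   end;
   drop Year;
datalines;
001 45000 47000 47500 48000 48000 52000 53000 55000
;
```

With this approach, changing the range of years requires editing only the ARRAY and INPUT statements, not the DO loop.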
13.7 Temporary Arrays

You can create an array that only has elements and no variables! As strange as this sounds, elements of temporary arrays are great places to store values or perform table lookups. If you want, you can assign the array elements initial values when you create the temporary array. Alternatively, you can load values into the temporary array in the DATA step. Either way, the values in the temporary array are automatically retained (that is, they are not set to missing values when the DATA step iterates). Thus, they are useful places to store values that you need during the execution of the DATA step.

We start out with an example that uses a temporary array to store the correct answer for each of 10 questions on a multiple-choice quiz. You can then score the quiz using the temporary array as the answer key. Here is the program:
Program 13-8 Using a temporary array to score a test

data score;
   array ans{10} $ 1;
   array key{10} $ 1 _temporary_
      ('A','B','C','D','E','E','D','C','B','A');
   input ID (Ans1-Ans10)($1.);
   RawScore = 0;
   do Ques = 1 to 10;
      RawScore + (key{Ques} eq Ans{Ques});
   end;
   Percent = 100*RawScore/10;
   keep ID RawScore Percent;
datalines;
123 ABCDEDDDCA
126 ABCDEEDCBA
129 DBCBCEDDEB
;
This program uses a temporary array (key) to hold the answers to the 10 quiz questions. The keyword _TEMPORARY_ tells SAS that this is a temporary array and the 10 values in parentheses are the initial values for each of the elements of this array. It is important to remember that there are no corresponding variables (Key1, Key2, and so on) in this DATA step. Also, because elements of a temporary array are retained, the 10 answer key values are available throughout the DATA step for scoring each of the student tests.

The scoring is done in a DO loop. A “trick” is used to do the scoring: a logical comparison is performed between the student answer and the corresponding answer key. If they match, the logical comparison returns a 1 and this is added to RawScore. If not, the result is a 0 and RawScore is not incremented. Here is a listing of the output:

Raw and Percent Scores

ID    Raw Score   Percent
123           7        70
126          10       100
129           4        40
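The scoring trick can be seen in isolation in the following small sketch, which is separate from the scoring program and uses literal values instead of array elements.

```sas
data _null_;
   RawScore = 0;
   RawScore + ('A' eq 'A');   /* true comparison returns 1 */
   RawScore + ('A' eq 'B');   /* false comparison returns 0 */
   put RawScore=;             /* writes RawScore=1 to the log */
run;
```

Each comparison in parentheses evaluates to 1 or 0, so the sum statement counts the number of matches without needing an IF statement.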
13.8 Loading the Initial Values of a Temporary Array from a Raw Data File

If you had a long test, you would probably prefer to load the answer key into the array elements by reading the values from a text file (especially if you are scoring the tests using an optical mark-sense reader). In the program that follows, you enter the answer key values on the first line of your data file:
Program 13-9 Loading the initial values of a temporary array from a raw data file

data score;
   array ans{10} $ 1;
   array key{10} $ 1 _temporary_;
   /* Load the temporary array elements */
   if _n_ = 1 then do Ques = 1 to 10;
      input key{Ques} $1. @;
   end;
   input ID (Ans1-Ans10)($1.);
   RawScore = 0;
   /* Score the test */
   do Ques = 1 to 10;
      RawScore + (key{Ques} eq Ans{Ques});
   end;
   Percent = 100*RawScore/10;
   keep ID RawScore Percent;
datalines;
ABCDEEDCBA
123 ABCDEDDDCA
126 ABCDEEDCBA
129 DBCBCEDDEB
;
Because you want to read the first line of the data file differently from the other lines, the value of _N_ (which counts iterations of the DATA step) can be used to ensure that the answer key values are read only once. The result of running this DATA step is identical to the data set you obtained from Program 13-8.
13.9 Using a Multidimensional Array for Table Lookup

SAS arrays can also be multidimensional. Instead of having a single index to identify elements of an array, you can, for example, use two indices (usually thought of as row and column indices) to identify an element. This is particularly useful when you want to retrieve a single value based on two selection criteria. To define a multidimensional array, you specify the number of elements in each dimension in the brackets following the array name, separated by commas. For example, to define an array called MULTI with three elements on the first dimension and five elements on the second dimension, you could use:

   array multi{3,5} X1-X15;
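As a rough illustration of how the variables X1-X15 map onto the rows and columns of MULTI, the following sketch stores the value 10*Row + Col in each element. The data set name Multi_Demo and the index names Row and Col are assumptions for this example. SAS assigns the variables row by row, so the second index varies fastest.

```sas
data multi_demo;
   array multi{3,5} X1-X15;
   /* DIM(array,n) returns the number of elements in dimension n */
   do Row = 1 to dim(multi,1);
      do Col = 1 to dim(multi,2);
         multi{Row,Col} = 10*Row + Col;
      end;
   end;
   /* X1 is multi{1,1}, X5 is multi{1,5}, X6 is multi{2,1}, X15 is multi{3,5} */
   put X1= X5= X6= X15=;
   drop Row Col;
run;
```

The PUT statement should write X1=11 X5=15 X6=21 X15=35 to the log, confirming that the first row holds X1-X5, the second X6-X10, and the third X11-X15.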
To determine the number of elements in a multidimensional array, you multiply the number of elements in each dimension. In this example, the 3 by 5 array has 3 times 5 equals 15 elements.

Suppose you have a table of benzene levels for each year (from 1944 to 1949) and each job code (A through E) at a rubber factory. You want to retrieve a benzene level, given a year and job code. You can see a solution using formats in Chapter 22 of this book. Here is an array solution.

To start, you need to create a two-dimensional array with one index representing the year and the other the job code. To make matters more convenient, you can make the index values for the years range from 1944 to 1949 rather than from 1 to 6. You also need to decide if you want to populate the array with the benzene values as part of the ARRAY statement or if you want to read those values from raw data or, perhaps, a SAS data set. Finally, you can use a regular array or a temporary array. The solution shown here loads the array from raw data and uses a temporary array to hold the benzene values.

You start with a data set (Expose) that holds a worker ID, the year worked, and the job code. Given this information, you want to look up the worker’s benzene exposure.
A listing of data set Expose is shown here:

Listing of EXPOSE

Worker   Year   Job Code
001      1944   B
002      1948   E
003      1947   C
005      1945   A
006      1948   D
Here is the table of benzene exposures by year and job code.

              Job Code
Year      A     B     C     D     E
1944    220   180   210   110    90
1945    202   170   208   100    85
1946    150   110   150    60    50
1947    105    56    88    40    30
1948     60    30    40    20    10
1949     45    22    22    10     8
The first step is to load a temporary array with these values, as follows:
Program 13-10 Loading a two-dimensional, temporary array with data values

data look_up;
   /******************************************************
   Create the array, the first index is the year and
   it ranges from 1944 to 1949. The second index is
   the job code (we're using 1-5 to represent job codes
   A through E).
   *******************************************************/
   array level{1944:1949,5} _temporary_;
   /* Populate the array */
   if _n_ = 1 then do Year = 1944 to 1949;
      do Job = 1 to 5;
         input level{Year,Job} @;
      end;
   end;
   set learn.expose;
   /* Compute the job code index from the JobCode value */
   Job = input(translate(Jobcode,'12345','ABCDE'),1.);
   Benzene = level{Year,Job};
   drop Job;
datalines;
220 180 210 110 90
202 170 208 100 85
150 110 150 60 50
105 56 88 40 30
60 30 40 20 10
45 22 22 10 8
;
There is a lot going on in this program so let’s take it one step at a time. The key is the two-dimensional array (LEVEL). The dimensions of the array are defined by the comma in the brackets following the array name. You can think of the first dimension as a row and the second dimension as a column in a table. Each row of raw data following the DATALINES statement represents a different year (starting from 1944) and each of the five columns represents the values for job codes A through E. Since you only want to populate the array once, you execute the nested DO loops when _N_ is equal to 1.

As mentioned earlier, the index values of the first dimension of the array range from 1944 to 1949. This saves you the trouble of computing the correct row value for a given year.

Finally, you use the keyword _TEMPORARY_ to declare the array to be temporary. This has several advantages. First, you don’t have to drop (or maintain in the PDV) 30 variables for each of the array elements. Next, these values are automatically retained so they are available for the duration of the DATA step. Finally, using temporary arrays is very efficient. They require less storage than regular variables, and all the values are stored in memory for rapid retrieval.

The only remaining problem is that the JobCode variable is a letter from A to E. You use the TRANSLATE function to convert each of the letters A to E to the character values 1 to 5. You then use the INPUT function to do the character-to-numeric conversion. To look up any benzene level, you simply obtain the array value corresponding to the Year and JobCode (converted to a number) values.
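The letter-to-index conversion can be tried on its own, outside the lookup program. In this sketch the data set name Job_Index and the three sample job codes are made up for illustration.

```sas
data job_index;
   input JobCode $1.;
   /* TRANSLATE turns A-E into the characters 1-5;    */
   /* INPUT with the 1. informat converts to a number */
   Job = input(translate(JobCode,'12345','ABCDE'),1.);
datalines;
A
C
E
;
```

The three observations should have Job values of 1, 3, and 5, which are exactly the column indices needed for the second dimension of the LEVEL array.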
Here is a listing of data set Look_Up:

Listing of LOOK_UP

                Job
Year   Worker   Code   Benzene
1944   001      B          180
1948   002      E           10
1947   003      C           88
1945   005      A          202
1948   006      D           20
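If the TRANSLATE and INPUT combination used in the lookup program is unfamiliar, a small stand-alone test can help. The following is only a sketch: it is not part of the original program, and it uses a DATA _NULL_ step with a hypothetical loop over the five job codes to write the converted index values to the SAS log.

```sas
/* Sketch only: check the JobCode-to-index conversion in isolation */
data _null_;
   length JobCode $ 1;
   do JobCode = 'A', 'B', 'C', 'D', 'E';
      /* TRANSLATE maps the letters A-E to the characters 1-5; */
      /* INPUT with the 1. informat converts that character    */
      /* to a numeric value usable as an array index           */
      Job = input(translate(JobCode,'12345','ABCDE'),1.);
      put JobCode= Job=;
   end;
run;
```

Each pass through the loop writes one line to the log showing a job code and its numeric index, which is the same value the lookup program uses as the second array index.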
As you have seen in this chapter, SAS arrays are powerful and flexible. You can change array bounds and create multidimensional arrays. Temporary arrays provide a convenient place to store values for efficient table lookup.
13.10 Problems
Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Using the SAS data set Survey1, create a new, temporary SAS data set (Survey1) where the values of the variables Ques1–Ques5 are reversed as follows: 1 → 5; 2 → 4; 3 → 3; 4 → 2; 5 → 1. Note: Ques1–Ques5 are character variables. Accomplish this using an array.

2. Redo Problem 1, except use data set Survey2. Note: Ques1–Ques5 are numeric variables.

3. Using the SAS data set Nines, create a new temporary SAS data set (Nonines) where all values of 999 are replaced by SAS missing values. Do this without explicitly naming the numeric variables in data set Nines (use _NUMERIC_ when you define your array).

4. Data set Survey2 has five numeric variables (Q1–Q5), each with values of 1, 2, 3, 4, or 5. You want to determine for each subject (observation) if they responded with a 5 on any of the five questions. This is easily done using the OR or the IN operators. However, for this question, use an array to check each of the five questions. Set variable (ANY5) equal to Yes if any of the five questions is a 5 and No otherwise.

5. The passing score on each of five tests is 65, 70, 60, 62, and 68. Using the data here, use a temporary array to count the number of tests passed by each student.

ID    Test 1   Test 2   Test 3   Test 4   Test 5
001       90       88       92       95       90
002       64       64       77       72       71
003       68       69       80       75       70
004       88       77       66       77       67
Part 3
Presenting and Summarizing Your Data

Chapter 14  Displaying Your Data
Chapter 15  Creating Customized Reports
Chapter 16  Summarizing Your Data
Chapter 17  Counting Frequencies
Chapter 18  Creating Tabular Reports
Chapter 19  Introducing the Output Delivery System
Chapter 20  Generating High-Quality Graphics

Chapter 14
Displaying Your Data

14.1 Introduction
14.2 The Basics
14.3 Changing the Appearance of Your Listing
14.4 Changing the Appearance of Values
14.5 Controlling the Observations That Appear in Your Listing
14.6 Adding Additional Titles and Footnotes to Your Listing
14.7 Changing the Order of Your Listing
14.8 Sorting by More Than One Variable
14.9 Labeling Your Column Headings
14.10 Adding Subtotals and Totals to Your Listing
14.11 Making Your Listing Easier to Read
14.12 Adding the Number of Observations to Your Listing
14.13 Double-Spacing Your Listing
14.14 Listing the First n Observations of Your Data Set
14.15 Problems
14.1 Introduction
This chapter shows you how to list the observations in a SAS data set using the PRINT procedure (PROC PRINT). In later chapters, you see fancier ways to display and summarize your data.

14.2 The Basics
You have already seen how you can use PROC PRINT to list the observations in a SAS data set. Let’s see how you can add some options and statements to this procedure to allow you more control over what is displayed. You have a SAS data set called Sales that contains the following information on your sales: an employee Name and ID, the region where the sale was made, the name of the company to whom the sale was made, the part number, the quantity, the unit cost of the item, and the total amount of the sale (Quantity times UnitCost). Program 14-1 shows PROC PRINT with all the defaults:

Program 14-1 PROC PRINT using all the defaults
title "Listing of SALES";
proc print data=learn.sales;
run;
You obtain a listing like the one here:

Listing of SALES

       Emp
Obs    ID      Name              Region    Customer
  1    1843    George Smith      North     Barco Corporation
  2    1843    George Smith      South     Cost Cutter's
  3    1843    George Smith      North     Minimart Inc.
  4    1843    George Smith      North     Barco Corporation
  5    1843    George Smith      South     Ely Corp.
  6    0177    Glenda Johnson    East      Food Unlimited
  7    0177    Glenda Johnson    East      Shop and Drop
  8    1843    George Smith      South     Cost Cutter's
  9    9888    Sharon Lu         West      Cost Cutter's
 10    9888    Sharon Lu         West      Pet's are Us
 11    0017    Jason Nguyen      East      Roger's Spirits
 12    0017    Jason Nguyen      South     Spirited Spirits
 13    0177    Glenda Johnson    North     Minimart Inc.
 14    0177    Glenda Johnson    East      Barco Corporation

                                          Total
Obs    Item     Quantity    UnitCost      Sales
  1    144L           50        8.99      449.5
  2    122           100        5.99      599.0
  3    188S            3     5199.00    15597.0
  4    908X            1     5129.00     5129.0
  5    122L           10       29.95      299.5
  6    188X          100        6.99      699.0
  7    144L          100        8.99      899.0
  8    855W            1     9109.00     9109.0
  9    122            50        5.99      299.5
 10    100W         1000        1.99     1990.0
 11    122L          500       39.99    19995.0
 12    407XX         100       19.95     1995.0
 13    777             5       10.50       52.5
 14    733             2    10000.00    20000.0
Note: By default, SAS centers all the titles and listings on your screen (or the printed page). If the system option NOCENTER is set, the titles and listings are left-aligned. The NOCENTER option was in effect when this program, and most of the other examples in this book, were run.
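As a brief illustration of the note above, here is a minimal sketch of setting the NOCENTER system option with an OPTIONS statement before running Program 14-1, and restoring the default afterward. Restoring CENTER at the end is an assumption about what you might want; it is not something the original examples do.

```sas
options nocenter;   /* left-align titles and listings */

title "Listing of SALES";
proc print data=learn.sales;
run;

options center;     /* return to the default centering */
```

System options stay in effect until you change them, so a single OPTIONS statement at the top of a session is usually enough.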
14.3 Changing the Appearance of Your Listing
You can control which variables appear in your listing, as well as the order of these variables, by supplying a VAR statement. List the variables you would like to see after the keyword VAR. The order of this list also controls the order in which the variables appear in the listing. So, if you run the code in Program 14-2, you obtain the output that follows it.
Program 14-2 Controlling which variables appear in the listing
title "Listing of SALES";
proc print data=learn.sales;
   var EmpID Customer TotalSales;
run;

Listing of SALES

       Emp                              Total
Obs    ID      Customer                 Sales
  1    1843    Barco Corporation        449.5
  2    1843    Cost Cutter's            599.0
  3    1843    Minimart Inc.          15597.0
  4    1843    Barco Corporation       5129.0
  5    1843    Ely Corp.                299.5
  6    0177    Food Unlimited           699.0
  7    0177    Shop and Drop            899.0
  8    1843    Cost Cutter's           9109.0
  9    9888    Cost Cutter's            299.5
 10    9888    Pet's are Us            1990.0
 11    0017    Roger's Spirits        19995.0
 12    0017    Spirited Spirits        1995.0
 13    0177    Minimart Inc.             52.5
 14    0177    Barco Corporation      20000.0
Your next step is to omit the Obs column and replace it with the employee ID. Use an ID statement to do this, as follows:

Program 14-3 Using an ID statement to omit the Obs column
title "Listing of SALES";
proc print data=learn.sales;
   id EmpID;
   var Customer TotalSales;
run;

Notice how the listing has changed.

Listing of SALES

Emp                              Total
ID      Customer                 Sales
1843    Barco Corporation        449.5
1843    Cost Cutter's            599.0
1843    Minimart Inc.          15597.0
1843    Barco Corporation       5129.0
1843    Ely Corp.                299.5
0177    Food Unlimited           699.0
0177    Shop and Drop            899.0
1843    Cost Cutter's           9109.0
9888    Cost Cutter's            299.5
9888    Pet's are Us            1990.0
0017    Roger's Spirits        19995.0
0017    Spirited Spirits        1995.0
0177    Minimart Inc.             52.5
0177    Barco Corporation      20000.0
The variable (or variables) you place in the ID statement replace the Obs column and are printed in the left-most columns of your listing. Notice that when you place a variable name in the ID statement, you do not also list it in the VAR statement. If you do, that variable appears twice in the listing. An alternative way to omit the Obs column is to use PROC PRINT with the option NOOBS. The advantage of using an ID statement is that the ID variable begins each page if you have more variables than will fit across a single page.
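For comparison with Program 14-3, here is a minimal sketch of the NOOBS alternative mentioned above. It is an illustration rather than one of the book's numbered programs. Because there is no ID statement, EmpID is listed in the VAR statement instead.

```sas
title "Listing of SALES";
proc print data=learn.sales noobs;
   /* NOOBS suppresses the Obs column; EmpID is an ordinary VAR variable here */
   var EmpID Customer TotalSales;
run;
```

The listing looks much like the one from Program 14-3, but EmpID is not repeated at the start of each page if the variables do not fit across a single page.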
14.4 Changing the Appearance of Values
Suppose you would like the TotalSales values to appear with dollar signs and commas. You can change the appearance of values in your listing by associating a format with one or more variables. SAS has many built-in formats that can add commas or dollar signs to numbers or display dates in different ways. Program 14-4 adds a FORMAT statement to list TotalSales with dollar signs and commas and Quantity with commas:

Program 14-4 Adding a FORMAT statement to PROC PRINT
proc print data=learn.sales;
   title "Listing of SALES";
   id EmpID;
   var Customer Quantity TotalSales;
   format TotalSales dollar10.2 Quantity comma7.;
run;

Notice the change in the Quantity and TotalSales columns.

Listing of SALES

Emp
ID      Customer             Quantity    TotalSales
1843    Barco Corporation          50       $449.50
1843    Cost Cutter's             100       $599.00
1843    Minimart Inc.               3    $15,597.00
1843    Barco Corporation           1     $5,129.00
1843    Ely Corp.                  10       $299.50
0177    Food Unlimited            100       $699.00
0177    Shop and Drop             100       $899.00
1843    Cost Cutter's               1     $9,109.00
9888    Cost Cutter's              50       $299.50
9888    Pet's are Us            1,000     $1,990.00
0017    Roger's Spirits           500    $19,995.00
0017    Spirited Spirits          100     $1,995.00
0177    Minimart Inc.               5        $52.50
0177    Barco Corporation           2    $20,000.00
14.5 Controlling the Observations That Appear in Your Listing
You can also control which observations appear in a listing by including a WHERE statement in the procedure. For example, suppose you want your listing to contain only observations where the Quantity is greater than 400. The following program would do the trick:

Program 14-5 Controlling which observations appear in the listing (WHERE statement)
title "Listing of SALES with Quantities greater than 400";
proc print data=learn.sales;
   where Quantity gt 400;
   id EmpID;
   var Customer Quantity TotalSales;
   format TotalSales dollar10.2 Quantity comma7.;
run;

Listing of SALES with Quantities greater than 400

Emp
ID      Customer           Quantity    TotalSales
9888    Pet's are Us          1,000     $1,990.00
0017    Roger's Spirits         500    $19,995.00
Suppose you want to see the sales for two employees: 1843 and 0177. Using an IN operator along with a WHERE statement is a convenient way to do this.

Program 14-6 Using the IN operator in a WHERE statement
title "Listing of SALES for Employees 1843 and 0177";
proc print data=learn.sales;
   where EmpID in ('1843' '0177');
   id EmpID;
   var Customer Quantity TotalSales;
   format TotalSales dollar10.2 Quantity comma7.;
run;
You can use the IN operator with either numeric or character values. You also have the choice of separating the values in the list with spaces or commas.
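To make both points concrete, here is a hedged sketch (not one of the book's numbered programs) that uses a comma-separated character list for EmpID and a space-separated numeric list for Quantity in the same WHERE statement. The title text and the choice of Quantity values are assumptions made for the illustration.

```sas
title "Selected Employees with Quantities of 50 or 100";
proc print data=learn.sales;
   /* Character list separated by commas; numeric list separated by spaces */
   where EmpID in ('1843','0177') and Quantity in (50 100);
   id EmpID;
   var Customer Quantity TotalSales;
   format TotalSales dollar10.2 Quantity comma7.;
run;
```

Character values in the list are quoted and numeric values are not, matching the type of the variable being tested.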
14.6 Adding Additional Titles and Footnotes to Your Listing
You can make your output more meaningful by adding additional title lines and by adding one or more footnotes to your listing. The TITLEn statement (where n is a number from 1 to 10) allows you to specify multiple title lines. Note that TITLE1 and TITLE are equivalent. The FOOTNOTEn statement allows you to specify from 1 to 10 footnotes, lines that appear at the bottom of the page. It is important to remember that once you issue a TITLE or FOOTNOTE statement, the titles or footnotes print on every page of your output until you change or cancel them. The short program here lists several variables from the Sales data set and adds several title and footnote lines:

Program 14-7 Adding titles and footnotes to your listing
title1 "The XYZ Company";
title3 "Sales Figures for Fiscal 2006";
title4 "Prepared by Roger Rabbit";
title5 "-----------------------------";
footnote "All sales figures are confidential";
proc print data=learn.sales;
   id EmpID;
   var Customer Quantity TotalSales;
   format TotalSales dollar10.2 Quantity comma7.;
run;

First, we show you the listing, followed by an explanation:

The XYZ Company

Sales Figures for Fiscal 2006
Prepared by Roger Rabbit
-----------------------------

Emp
ID      Customer             Quantity    TotalSales
1843    Barco Corporation          50       $449.50
1843    Cost Cutter's             100       $599.00
1843    Minimart Inc.               3    $15,597.00
1843    Barco Corporation           1     $5,129.00
1843    Ely Corp.                  10       $299.50
0177    Food Unlimited            100       $699.00
0177    Shop and Drop             100       $899.00
1843    Cost Cutter's               1     $9,109.00
9888    Cost Cutter's              50       $299.50
9888    Pet's are Us            1,000     $1,990.00
0017    Roger's Spirits           500    $19,995.00
0017    Spirited Spirits          100     $1,995.00
0177    Minimart Inc.               5        $52.50
0177    Barco Corporation           2    $20,000.00

All sales figures are confidential
Because the TITLE2 statement is missing, there is a blank line between the first and third title lines. If you submit a new TITLEn statement, it replaces the current TITLEn statement and all TITLE lines with higher values of n. For example, suppose you change the TITLE3 line in Program 14-7 to read:

title3 "New Sales Figures for 2006";

The title lines in your output now read as follows:

The XYZ Company

New Sales Figures for 2006

Line 3 has been replaced by the new text and Lines 4 and 5 are removed. The FOOTNOTEn statements work the same way. By the way, to cancel all title statements use the following:

title;

In a similar manner, use the following to cancel all footnote lines:

footnote;
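Putting these rules together, the following sketch (an illustration, not a numbered program from the book) sets titles and a footnote, runs one listing, and then cancels them before a second listing. The second PROC PRINT is included only to show that no title or footnote lines remain in effect.

```sas
title1 "The XYZ Company";
title3 "Sales Figures for Fiscal 2006";
footnote "All sales figures are confidential";

proc print data=learn.sales;
   id EmpID;
   var Customer Quantity TotalSales;
run;

title;      /* cancel all title lines    */
footnote;   /* cancel all footnote lines */

proc print data=learn.sales;   /* prints with no title or footnote lines */
   id EmpID;
   var Customer Quantity TotalSales;
run;
```

Because titles and footnotes persist until changed, canceling them at the end of a program is one way to keep them from appearing on unrelated output later in the session.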
14.7 Changing the Order of Your Listing
If you want to see your list in a particular sorted order, you can precede PROC PRINT with a PROC SORT statement. If you want to see your listing in order of TotalSales, the following program could be used:

Program 14-8 Using PROC SORT to change the order of your observations
proc sort data=learn.sales;
   by TotalSales;
run;

title "Listing of SALES";
proc print data=learn.sales;
   id EmpID;
   var Customer Quantity TotalSales;
   format TotalSales dollar10.2 Quantity comma7.;
run;

Your listing is now in order of TotalSales, starting from the lowest to the highest.

Listing of SALES

Emp
ID      Customer             Quantity    TotalSales
0177    Minimart Inc.               5        $52.50
1843    Ely Corp.                  10       $299.50
9888    Cost Cutter's              50       $299.50
1843    Barco Corporation          50       $449.50
1843    Cost Cutter's             100       $599.00
0177    Food Unlimited            100       $699.00
0177    Shop and Drop             100       $899.00
9888    Pet's are Us            1,000     $1,990.00
0017    Spirited Spirits          100     $1,995.00
1843    Barco Corporation           1     $5,129.00
1843    Cost Cutter's               1     $9,109.00
1843    Minimart Inc.               3    $15,597.00
0017    Roger's Spirits           500    $19,995.00
0177    Barco Corporation           2    $20,000.00
To see your listing in order from highest to lowest, precede the variable name (TotalSales) with the keyword DESCENDING, like this:

Program 14-9 Demonstrating the DESCENDING option of PROC SORT
proc sort data=learn.sales;
   by descending TotalSales;
run;

In Program 14-8, you replaced the original SAS data set with one sorted in order of TotalSales. In some cases, this is fine. However, if you do not want the original data set changed, add an OUT= option to your sort to specify an output data set. This is especially important if you subset the data when you are performing your sort. As an example, the following program creates a temporary SAS data set (Sales) in descending order of TotalSales:

proc sort data=learn.sales out=sales;
   by descending TotalSales;
run;

Notice that SAS is reading your permanent SAS data set (Learn.Sales) and creating a temporary SAS data set (Sales, stored in the Work library). It is fine (and sometimes convenient) to use the same name for each of the two data sets, as long as they are in different libraries.
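To see why OUT= matters when you subset during a sort, consider this sketch. It is an illustration under stated assumptions: the output data set name East_Sales and the choice of the East region are hypothetical. Without OUT=, the sorted subset would replace Learn.Sales and the observations for the other regions would be lost.

```sas
proc sort data=learn.sales out=east_sales;
   where Region = 'East';        /* keep only the East region        */
   by descending TotalSales;     /* highest sales first              */
run;                             /* Learn.Sales itself is unchanged  */
```

The temporary data set East_Sales holds only the subset, while the permanent data set keeps all of its original observations in their original order.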
14.8 Sorting by More Than One Variable
You can sort your data set by more than one variable. This is called a multi-level sort. As an example, the following PROC SORT statements sort your data by employee ID and, within each ID, in decreasing value of total sales:

Program 14-10 Sorting by more than one variable
proc sort data=learn.sales out=sales;
   by EmpID descending TotalSales;
run;

title "Sorting by More than One Variable";
proc print data=sales;
   id EmpID;
   var TotalSales Quantity;
   format TotalSales dollar10.2 Quantity comma7.;
run;

As you can see in the output here, the temporary data set Sales is now in EmpID order and, for each employee, in decreasing order of TotalSales:

Sorting by More than One Variable

Emp
ID      TotalSales    Quantity
0017    $19,995.00         500
0017     $1,995.00         100
0177    $20,000.00           2
0177       $899.00         100
0177       $699.00         100
0177        $52.50           5
1843    $15,597.00           3
1843     $9,109.00           1
1843     $5,129.00           1
1843       $599.00         100
1843       $449.50          50
1843       $299.50          10
9888     $1,990.00       1,000
9888       $299.50          50

You can see how a multi-level sort works by looking at this listing. Notice that the observations are in order of increasing EmpID (the default order) and, within each EmpID, in decreasing order of TotalSales.
14.9 Labeling Your Column Headings
If you want to make your listing a bit more readable, at least for non-programmer types, you may want to use variable labels instead of variable names as your column headings. You need to do two things to make this happen. First, you need to use a LABEL statement, either in your DATA step or as a statement following your PROC PRINT statement. Next, you need to add a LABEL option to your PROC PRINT statement. We add labels to the Sales listing to demonstrate this.

Program 14-11 Using labels as column headings with PROC PRINT
title "Using Labels as Column Headings";
proc print data=sales label;
   id EmpID;
   var TotalSales Quantity;
   label EmpID = "Employee ID"
         TotalSales = "Total Sales"
         Quantity = "Number Sold";
   format TotalSales dollar10.2 Quantity comma7.;
run;

Notice that we have added a LABEL statement to associate a label with each of the variable names, as well as a LABEL option to tell PROC PRINT to use the labels as column headings. If you forget the LABEL option, PROC PRINT will not use labels as column headings even if you have included a LABEL statement. The listing here is more readable than the previous listings that used variable names to head the columns:

Using Labels as Column Headings

Employee     Number         Total
   ID          Sold         Sales
  0017          500    $19,995.00
  0017          100     $1,995.00
  0177            2    $20,000.00
  0177          100       $899.00
  0177          100       $699.00
  0177            5        $52.50
  1843            3    $15,597.00
  1843            1     $9,109.00
  1843            1     $5,129.00
  1843          100       $599.00
  1843           50       $449.50
  1843           10       $299.50
  9888        1,000     $1,990.00
  9888           50       $299.50
14.10 Adding Subtotals and Totals to Your Listing
You can add subtotals and totals to your listing by including SUM and BY statements. In order to include a BY statement in PROC PRINT, remember that you need to have previously sorted your data set in the same order. For example, if you want to break down your listing by region, you first sort your data set by region and include a BY statement in PROC PRINT.

Program 14-12 Using a BY statement in PROC PRINT
proc sort data=learn.sales out=sales;
   by Region;
run;

title "Using Labels as Column Headings";
proc print data=sales label;
   by Region;
   id EmpID;
   var TotalSales Quantity;
   label EmpID = "Employee ID"
         TotalSales = "Total Sales"
         Quantity = "Number Sold";
   format TotalSales dollar10.2 Quantity comma7.;
run;

The listing from this program is shown next:

Using Labels as Column Headings

Region=East

Employee         Total     Number
   ID            Sales       Sold
  0177         $699.00        100
  0177         $899.00        100
  0017      $19,995.00        500
  0177      $20,000.00          2

Region=North

Employee         Total     Number
   ID            Sales       Sold
  1843         $449.50         50
  1843      $15,597.00          3
  1843       $5,129.00          1
  0177          $52.50          5

Region=South

Employee         Total     Number
   ID            Sales       Sold
  1843         $599.00        100
  1843         $299.50         10
  1843       $9,109.00          1
  0017       $1,995.00        100

Region=West

Employee         Total     Number
   ID            Sales       Sold
  9888         $299.50         50
  9888       $1,990.00      1,000
If you want to see each region on a separate page, include a PAGEBY statement as well as a BY statement in PROC PRINT. Make sure the variables following the PAGEBY keyword are the same as the variables following the BY keyword in PROC PRINT. You can easily modify Program 14-12 to include subtotals and totals by adding a SUM statement to your program, like this:
Program 14-13 Adding totals and subtotals to your listing
proc sort data=learn.sales out=sales;
   by Region;
run;

title "Adding Totals and Subtotals to Your Listing";
proc print data=sales label;
   by Region;
   id EmpID;
   var TotalSales Quantity;
   sum Quantity TotalSales;
   label EmpID = "Employee ID"
         TotalSales = "Total Sales"
         Quantity = "Number Sold";
   format TotalSales dollar10.2 Quantity comma7.;
run;

A partial listing from Program 14-13 is shown here:

Adding Totals and Subtotals to Your Listing (partial listing)

Region=East

Employee         Total     Number
   ID            Sales       Sold
  0177         $699.00        100
  0177         $899.00        100
  0017      $19,995.00        500
  0177      $20,000.00          2
--------    ----------    -------
 Region     $41,593.00        702

   .
   .
   .

Region=West

Employee         Total     Number
   ID            Sales       Sold
  9888         $299.50         50
  9888       $1,990.00      1,000
--------    ----------    -------
 Region      $2,289.50      1,050
            ==========    =======
            $77,113.00      2,022
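Returning to the PAGEBY statement mentioned earlier in this section, here is a minimal sketch of how it might be combined with the BY statement so that each region starts on a new page. This is an illustration, not one of the book's numbered programs, and the title text is an assumption.

```sas
proc sort data=learn.sales out=sales;
   by Region;
run;

title "Each Region on a Separate Page";
proc print data=sales label;
   by Region;
   pageby Region;        /* same variable as in the BY statement */
   id EmpID;
   var TotalSales Quantity;
   sum Quantity TotalSales;
   label EmpID = "Employee ID"
         TotalSales = "Total Sales"
         Quantity = "Number Sold";
   format TotalSales dollar10.2 Quantity comma7.;
run;
```

The PAGEBY variable must also appear in the BY statement, which is why the two statements name the same variable here.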
14.11 Making Your Listing Easier to Read
If you include a BY statement and an ID statement, each with the same variables, PROC PRINT does not repeat the variable in the first column if the value has not changed. If you want to see a listing in EmpID order, the listing will look better if you use EmpID as an ID variable and a BY variable. Here is the program:

Program 14-14 Using an ID statement and a BY statement in PROC PRINT
proc sort data=learn.sales out=sales;
   by EmpID;
run;

title "Using the Same Variable in an ID and BY Statement";
proc print data=sales label;
   by EmpID;
   id EmpID;
   var Customer TotalSales Quantity;
   label EmpID = "Employee ID"
         TotalSales = "Total Sales"
         Quantity = "Number Sold";
   format TotalSales dollar10.2 Quantity comma7.;
run;

Notice the improved appearance in the listing:

Using the Same Variable in an ID and BY Statement

Employee                              Total     Number
   ID       Customer                  Sales       Sold

  0017      Roger's Spirits      $19,995.00        500
            Spirited Spirits      $1,995.00        100

  0177      Food Unlimited          $699.00        100
            Shop and Drop           $899.00        100
            Minimart Inc.            $52.50          5
            Barco Corporation    $20,000.00          2

  1843      Barco Corporation       $449.50         50
            Cost Cutter's           $599.00        100
            Minimart Inc.        $15,597.00          3
            Barco Corporation     $5,129.00          1
            Ely Corp.               $299.50         10
            Cost Cutter's         $9,109.00          1

  9888      Cost Cutter's           $299.50         50
            Pet's are Us          $1,990.00      1,000
14.12 Adding the Number of Observations to Your Listing
By including the N= option of PROC PRINT, the total number of observations in your data set prints at the bottom of the listing. If you want to be even fancier, you can use the form N="your-label" to label this number with a label of your own choosing. Here is an example:

Program 14-15 Demonstrating the N= option with PROC PRINT
title "Demonstrating the N option of PROC PRINT";
proc print data=sales n="Total number of Observations:";
   id EmpID;
   var TotalSales Quantity;
   label EmpID = "Employee ID"
         TotalSales = "Total Sales"
         Quantity = "Number Sold";
   format TotalSales dollar10.2 Quantity comma7.;
run;

The resulting listing (shortened) is shown next:

Demonstrating the N option of PROC PRINT

Emp
ID      TotalSales    Quantity
0017    $19,995.00         500
0017     $1,995.00         100
  .
  .
  .
1843     $9,109.00           1
9888       $299.50          50
9888     $1,990.00       1,000

Total number of Observations: 14
You may wonder why the variable labels do not show up in this listing. Even though there is a LABEL statement, the LABEL option is not used so SAS uses the variable names to head the columns.
14.13 Double-Spacing Your Listing
The option DOUBLE, when used with PROC PRINT, double-spaces your listing, as shown in the following example:

Program 14-16 Double-spacing your listing
title "Double-Spacing Your Listing";
proc print data=sales double;
   id EmpID;
   var TotalSales Quantity;
   label EmpID = "Employee ID"
         TotalSales = "Total Sales"
         Quantity = "Number Sold";
   format TotalSales dollar10.2 Quantity comma7.;
run;

A partial listing of this output is shown here:

Double-Spacing Your Listing (partial listing)

Emp
ID      TotalSales    Quantity

0017    $19,995.00         500

0017     $1,995.00         100

0177       $699.00         100

  .
  .
  .

9888       $299.50          50

9888     $1,990.00       1,000
14.14 Listing the First n Observations of Your Data Set
A very useful data set option, OBS=n, allows you to list the first n observations of your data set. This is particularly useful when you have a very large data set and want to see just a few observations to be sure your program is running correctly. To demonstrate this, the program here lists the first five observations from the permanent SAS data set Sales:

Program 14-17 Listing the first five observations of your data set
title "First Five Observations from SALES";
proc print data=learn.sales(obs=5);
run;

OBS=n is a data set option. You can tell this because it is placed in parentheses following the data set name. There are many other data set options that we will discuss later in this book. Notice the listing here stops at Observation 5:

First Five Observations from SALES

                                                                     T
                                                                     o
                            C                        Q       U       t
                            u                        u       n       a
                      R     s                        a       i       l
    E                 e     t                        n       t       S
    m    N            g     o                 I      t       C       a
  O p    a            i     m                 t      i       o       l
  b I    m            o     e                 e      t       s       e
  s D    e            n     r                 m      y       t       s

  1 1843 George Smith North Barco Corporation 144L  50    8.99   449.5
  2 1843 George Smith South Cost Cutter's     122  100    5.99   599.0
  3 1843 George Smith North Minimart Inc.     188S   3 5199.00 15597.0
  4 1843 George Smith North Barco Corporation 908X   1 5129.00  5129.0
  5 1843 George Smith South Ely Corp.         122L  10   29.95   299.5
Notice another feature of this listing: the column headings are printed vertically. PROC PRINT sometimes prints variable names vertically if the variable names take up more space than the following lines of data and the LINESIZE option allows the data values to print on one line. If you would rather force PROC PRINT to keep the variable names horizontal, you can include the option HEADING=Horizontal (you can abbreviate this to HEADING=H). Here is Program 14-17 with this option included, followed by the listing:

Program 14-18 Forcing variable labels to print horizontally
title "First Five Observations from SALES";
proc print data=learn.sales(obs=5) heading=horizontal;
run;

First Five Observations from SALES

       Emp
Obs    ID      Name            Region    Customer
  1    1843    George Smith    North     Barco Corporation
  2    1843    George Smith    South     Cost Cutter's
  3    1843    George Smith    North     Minimart Inc.
  4    1843    George Smith    North     Barco Corporation
  5    1843    George Smith    South     Ely Corp.

                               Unit      Total
Obs    Item    Quantity        Cost      Sales
  1    144L          50        8.99      449.5
  2    122          100        5.99      599.0
  3    188S           3     5199.00    15597.0
  4    908X           1     5129.00     5129.0
  5    122L          10       29.95      299.5
You can combine another data set option, FIRSTOBS=, with OBS= to print observations starting from any point in the data set. For example, if you wrote the following, PROC PRINT would list Observations 4 through 7:

proc print data=learn.sales(firstobs=4 obs=7);
Keep in mind that OBS= is not the number of observations you want to print—it is the last observation to be processed. When you use FIRSTOBS= and OBS= together, think of OBS= as LASTOBS.
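As a complete, runnable version of the fragment above, here is a short sketch with a title and a RUN statement added. The title text is an assumption made for the illustration.

```sas
title "Observations 4 through 7 from SALES";
proc print data=learn.sales(firstobs=4 obs=7);   /* OBS= names the last observation, not a count */
run;
```

This lists four observations (4, 5, 6, and 7), not seven, because OBS=7 marks the last observation to be processed.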
Even though PROC PRINT is one of the simplest procedures in Base SAS, there are quite a few options that allow you some control over the appearance of the output. If you want more control over the appearance of a listing, you should consider using PROC REPORT. This procedure takes more programming effort than the PRINT procedure, but it provides much more control over the appearance of the output.
14.15 Problems
Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. List the first 10 observations in data set Blood. Include only the variables Subject, WBC (white blood cell), RBC (red blood cell), and Chol. Label the last three variables “White Blood Cells,” “Red Blood Cells,” and “Cholesterol,” respectively. Omit the Obs column, and place Subject in the first column. Be sure the column headings are the variable labels, not the variable names.

2. Using the data set Sales, create the report shown here:

Sales Figures from the SALES Data Set

                        Total
Region    Quantity      Sales
East           100      699.0
               100      899.0
               500    19995.0
                 2    20000.0
------    --------    -------
East           702    41593.0

  ...

West            50      299.5
              1000     1990.0
------    --------    -------
West          1050     2289.5
          ========    =======
              2022    77113.0

3. Use PROC PRINT (without any DATA steps) to create a listing like the one here. Note: The variables in the Hosp data set are Subject, AdmitDate (Admission Date), DischrDate (Discharge Date), and DOB (Date of Birth).

Selected Patients from HOSP Data Set
Admitted in September of 2004
Older than 83 years of age
--------------------------------------

           Date of       Admission     Discharge
Subject    Birth         Date          Date

401        03/21/1921    09/13/2004    09/22/2004

407        08/26/1920    09/13/2004    09/18/2004

409        01/01/1921    09/13/2004    10/02/2004

2577       04/30/1920    09/27/2004    09/27/2004

6889       10/26/1920    09/17/2004    09/22/2004

7495       02/11/1921    09/21/2004    09/22/2004

Number of Patients = 6

Hints: Variable labels replace variable names. Observations are double-spaced. The number of observations is printed at the bottom of the listing. There are four title lines (the last one being dashes).

4. List the first five observations from data set Blood. Print only variables Subject, Gender, and BloodType. Omit the Obs column.
Chapter 15
Creating Customized Reports

15.1 Introduction
15.2 Using PROC REPORT
15.3 Selecting Variables to Include in Your Report
15.4 Comparing Detail and Summary Reports
15.5 Producing a Summary Report
15.6 Demonstrating the FLOW Option of PROC REPORT
15.7 Using Two Grouping Variables
15.8 Changing the Order of Variables in the COLUMN Statement
15.9 Changing the Order of Rows in a Report
15.10 Applying the ORDER Usage to Two Variables
15.11 Creating a Multi-Column Report
15.12 Producing Report Breaks
15.13 Using a Nonprinting Variable to Order a Report
15.14 Computing a New Variable with PROC REPORT
15.15 Computing a Character Variable in a COMPUTE Block
15.16 Creating an ACROSS Variable with PROC REPORT
15.17 Modifying the Column Label for an ACROSS Variable
15.18 Using an ACROSS Usage to Display Statistics
15.19 Problems
15.1 Introduction
Although you can customize the output produced by PROC PRINT, there are times when you need a bit more control over the appearance of your report. PROC REPORT was developed to fit this need. Not only can you control the appearance of every column of your report, you can produce summary reports as well as detail listings. Two very useful features of PROC REPORT, multiple-panel reports and text wrapping within a column, are often the deciding factor in choosing PROC REPORT over PROC PRINT. First, let’s look at a listing of a data set called Medical using PROC PRINT. You can then see how this listing can be enhanced using PROC REPORT:

Program 15-1 Listing of Medical using PROC PRINT
title "Listing of Data Set MEDICAL from PROC PRINT";
proc print data=learn.medical;
   id Patno;
run;

This program produces the following listing:

Listing of Data Set MEDICAL from PROC PRINT

Patno    Clinic         VisitDate     Weight    HR     DX
001      Mayo Clinic    10/21/2006       120    78      7
003      HMC            09/01/2006       166    58      8
002      Mayo Clinic    10/01/2006       210    68      9
004      HMC            11/11/2006       288    88      9
007      Mayo Clinic    05/01/2006       180    54      7
050      HMC            07/06/2006       199    60    123

Patno    Comment
001      Patient has had a persistent cough for 3 weeks
003      Patient placed on beta-blockers on 7/1/2006
002      Patient has been on antibiotics for 10 days
004      Patient advised to lose some weight
007      This patient is always under high stress
050      Refer this patient to mental health for evaluation
15.2 Using PROC REPORT

Here is a report, using the same data set, produced by PROC REPORT:
Program 15-2 Using PROC REPORT (all defaults)

title "Using the REPORT Procedure";
proc report data=learn.medical nowd;
run;
The NOWD (no windows) option is an instruction not to enter the interactive windows editor after you run the procedure. Here is the output:

Using the REPORT Procedure

Pat                                       Heart
no   Clinic       Visit Date    Weight     Rate  DX
001  Mayo Clinic  10/21/2006       120       78  7
003  HMC          09/01/2006       166       58  8
002  Mayo Clinic  10/01/2006       210       68  9
004  HMC          11/11/2006       288       88  9
007  Mayo Clinic  05/01/2006       180       54  7
050  HMC          07/06/2006       199       60  123

Using the REPORT Procedure

Comment
Patient has had a persistent cough for 3 weeks
Patient placed on beta-blockers on 7/1/2006
Patient has been on antibiotics for 10 days
Patient advised to lose some weight
This patient is always under high stress
Refer this patient to mental health for evaluation
Except for the column widths, this report looks similar to the output from Program 15-1. Default column widths with PROC REPORT are computed as follows. For character variables, the column width is either the length of the character variable or the length of the formatted value (if the variable has a format). For numeric variables, the default column width is 9 or the width of a format (if the variable is formatted). You will see in a moment how to control the width of each column in the report. Another difference between PROC PRINT and PROC REPORT is that the default column headings for PROC PRINT are variable names and the default column headings for PROC REPORT are variable labels (if they exist) or variable names if a variable does not have a label.
15.3 Selecting the Variables to Include in Your Report

To specify which variables you want to include in your report, use a COLUMN statement. The COLUMN statement serves a similar function to the VAR statement of PROC PRINT: it allows you to select which variables you want in your report and the order in which they appear. In addition, you need to list variables that you create in COMPUTE blocks (discussed later in this chapter). As an example, look at Program 15-3.
Program 15-3 Adding a COLUMN statement to PROC REPORT

title "Adding a COLUMN Statement";
proc report data=learn.medical nowd;
   column Patno DX HR Weight;
run;
Here you are selecting four variables (Patno, DX, HR, and Weight) to be included in the report (in that order). Here is the output:

Adding a COLUMN Statement

Pat           Heart
no   DX        Rate     Weight
001  7           78        120
003  8           58        166
002  9           68        210
004  9           88        288
007  7           54        180
050  123         60        199
15.4 Comparing Detail and Summary Reports

Unlike PROC PRINT, PROC REPORT is capable of producing both detail reports (listings of all observations) and summary reports (reporting statistics such as sums and means).
By default, PROC REPORT produces detail reports for character variables and summary reports for numeric variables. However, when you include a mix of both types of variables in a single report (as in Program 15-3), you obtain a detail listing showing all observations. To help understand this somewhat complex idea, look at what happens when you include only numeric variables in a report.
Program 15-4 Using PROC REPORT with only numeric variables

title "Report with Only Numeric Variables";
proc report data=learn.medical nowd;
   column HR Weight;
run;
The resulting summary report is as follows:

Report with Only Numeric Variables

    Heart
     Rate     Weight
      406       1163
These numbers represent the SUM of the heart rates and weights for all the observations in data set Medical. You have learned two things here: first, the default usage for numeric variables is ANALYSIS (which produces a summary report), and second, the default summary statistic is SUM.

Learning how to control whether to produce a detail listing or a summary report, and what statistics to produce, leads you to the DEFINE statement. You can use a DEFINE statement to specify the usage for each variable: use DISPLAY to create a listing of all the observations, and use ANALYSIS to create a summary report. Suppose you want a detail listing of every person in the Medical data set instead of a summary report. You could use the following program:
Program 15-5 Using DEFINE statements to define a display usage

title "Display Usage for Numeric Variables";
proc report data=learn.medical nowd;
   column HR Weight;
   define HR     / display "Heart Rate" width=5;
   define Weight / display width=6;
run;
Several things have been added to this program. First, notice that there is now a DEFINE statement for each of the two numeric variables. Next, attributes for the variables are entered as options in the DEFINE statement (thus, they follow a slash after the variable name). The DEFINE option DISPLAY is an instruction to produce a detailed listing of all observations. Besides defining the usage as DISPLAY, the DEFINE statement for HR adds a label (placed in quotes) and a column width. Look at the output here to see the effect of these two DEFINE statements:

Display Usage for Numeric Variables

Heart
 Rate  Weight
   78     120
   58     166
   68     210
   88     288
   54     180
   60     199
You now see a detailed listing of every observation rather than a single summary statistic.
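The same DEFINE options also let you keep the ANALYSIS usage and change only the statistic. The following variation on Program 15-5 is a sketch, not one of the book's numbered programs: the title, labels, widths, and formats are chosen here for illustration. Because neither variable has a DISPLAY usage, the report should again collapse to a single row, this time holding means instead of sums.

```sas
* Illustrative sketch: title, labels, widths, and formats are assumptions;
title "Overall Means for Numeric Variables";
proc report data=learn.medical nowd;
   column HR Weight;
   * ANALYSIS usage with MEAN replaces the default SUM statistic;
   define HR     / analysis mean "Mean Heart Rate" width=10 format=6.1;
   define Weight / analysis mean "Mean Weight"     width=10 format=6.1;
run;
```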
15.5 Producing a Summary Report

For this example, you want to list the mean heart rate and mean weight for each clinic in the Medical data set. To do this, you need to use the GROUP usage for the variable Clinic. In addition, you need to specify MEAN as the statistic for heart rate and weight. Program 15-6 does the trick:
Program 15-6 Specifying a GROUP usage to create a summary report

title "Demonstrating a GROUP Usage";
proc report data=learn.medical nowd;
   column Clinic HR Weight;
   define Clinic / group width=11;
   define HR     / analysis mean "Average Heart Rate" width=12 format=5.;
   define Weight / analysis mean "Average Weight" width=12 format=6.;
run;
The MEAN statistic, along with a format, is specified for each of the numeric variables. The keyword ANALYSIS is optional because this is the default usage for numeric variables. You may place these options (label, width, format, and so on) in any order you like. The report generated by Program 15-6 is shown next:

Demonstrating a GROUP Usage

                  Average       Average
Clinic         Heart Rate        Weight
HMC                    69           218
Mayo Clinic            67           170
The figures in the report show the average heart rate and average weight for each of the two clinics.
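When you read group means, it often helps to know how many observations went into each one. As a sketch under stated assumptions (the title, the "Patients" label, and the width and format for the count column are made up for illustration), you can add the N statistic to the COLUMN statement of Program 15-6 to show the number of observations in each clinic:

```sas
* Illustrative sketch: the N column, its label, and its format are assumptions;
title "GROUP Usage with a Count Column";
proc report data=learn.medical nowd;
   * N requests the number of observations in each group;
   column Clinic n HR Weight;
   define Clinic / group width=11;
   define n      / "Patients" width=8 format=3.;
   define HR     / analysis mean "Average Heart Rate" width=12 format=5.;
   define Weight / analysis mean "Average Weight" width=12 format=6.;
run;
```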
15.6 Demonstrating the FLOW Option of PROC REPORT

One of the nice features of PROC REPORT is its ability to wrap lines of text within a column when you have long values. To demonstrate this feature, here is a report that includes a variable called Comment that is 50 characters long. Because a value this long would take up most of the width of a page, you can use the FLOW option to improve the appearance of your report. Here is the code to accomplish this:
Program 15-7 Demonstrating the FLOW option with PROC REPORT

title "Demonstrating the FLOW Option";
proc report data=learn.medical nowd headline split=' ' ls=74;
   column Patno VisitDate DX HR Weight Comment;
   define Patno     / "Patient Number" width=7;
   define VisitDate / "Visit Date" width=9 format=date9.;
   define DX        / "DX Code" width=4 right;
   define HR        / "Heart Rate" width=6;
   define Weight    / width=6;
   define Comment   / width=30 flow;
run;
This program has several new features. First, the FLOW option was added to the DEFINE statement for the Comment variable. This wraps the comment field within the defined column width of 30. The SPLIT= option is required to tell the program that you want to split the comments between words (blanks). Without this option, PROC REPORT would use other characters such as the slashes in the dates as possible line breaks. The option RIGHT, used for the DX variable, right-aligns the DX values. The default alignment for character variables is LEFT. Alignment options are LEFT, RIGHT, and CENTER. Here is the report:

Demonstrating the FLOW Option

Patient  Visit        DX   Heart
Number   Date       Code    Rate  Weight  Comment
---------------------------------------------------------------------
001      21OCT2006     7      78     120  Patient has had a persistent
                                          cough for 3 weeks
003      01SEP2006     8      58     166  Patient placed on
                                          beta-blockers on 7/1/2006
002      01OCT2006     9      68     210  Patient has been on
                                          antibiotics for 10 days
004      11NOV2006     9      88     288  Patient advised to lose some
                                          weight
007      01MAY2006     7      54     180  This patient is always under
                                          high stress
050      06JUL2006   123      60     199  Refer this patient to mental
                                          health for evaluation
This is a detail report, showing all the observations in the data set. You may wonder why this is so, when the default usage for numeric variables is ANALYSIS (with SUM as the default statistic). Because patient number (Patno) is a character variable (with a default usage of DISPLAY) and it is included in the report, the usage for the numeric variables in the report also has to be DISPLAY. Having one DISPLAY usage variable in the report forces the usage to be DISPLAY for the other variables. Some programmers prefer to explicitly code the usage for every variable in a report. Thus, Program 15-7 could be written like this:
Program 15-8 Explicitly defining usage for every variable

title "Demonstrating the FLOW Option";
proc report data=learn.medical nowd headline split=' ' ls=74;
   column Patno VisitDate DX HR Weight Comment;
   define Patno     / display "Patient Number" width=7;
   define VisitDate / display "Visit Date" width=9 format=date9.;
   define DX        / display "DX Code" width=4 right;
   define HR        / display "Heart Rate" width=6;
   define Weight    / display width=6;
   define Comment   / display width=30 flow;
run;
15.7 Using Two Grouping Variables

To demonstrate how you can nest one group within another, we use a data set called Bicycles. This data set contains the country where the bicycles were sold, the model of bicycle (road, mountain, or hybrid), the manufacturer, the number of units sold, and the total sales. The goal here is to see the sum of total sales broken down by country and model. To do this, both the Country and Model variables are defined with a GROUP usage. The order of these variables in the COLUMN statement (Country first, followed by Model) specifies that Country comes first and that Model is nested within Country. Here is the program:
Program 15-9 Demonstrating the effect of two variables with GROUP usage

title "Multiple GROUP Usages";
proc report data=learn.bicycles nowd headline ls=80;
   column Country Model Units TotalSales;
   define Country    / group width=14;
   define Model      / group width=13;
   define Units      / sum "Number of Units" width=8 format=comma8.;
   define TotalSales / sum "Total Sales (in thousands)" width=15
                       format=dollar10.;
run;
The option HEADLINE was added to PROC REPORT. This option places a line between the column headings and the remainder of the report. Notice that Country comes before Model in the COLUMN statement. Output from this program is shown next:

Multiple GROUP Usages

                                 Number      Total Sales
Country         Model          of Units   (in thousands)
--------------------------------------------------------
France          Hybrid            1,100             $594
                Mountain Bike     6,400           $8,799
                Road Bike         4,300          $11,830
Italy           Hybrid              700             $483
                Mountain Bike     3,400           $6,382
                Road Bike         4,500          $13,005
USA             Hybrid            4,500           $2,925
                Mountain Bike    10,000          $18,000
                Road Bike         7,000          $15,200
United Kingdom  Hybrid            1,300             $832
                Mountain Bike     1,211           $1,358
                Road Bike         3,644           $7,680
The values of both Country and Model are printed only when the value changes (thus making for a more readable report). The values in the last two columns are the sums of the two variables (Units and TotalSales) for each combination of Country and Model.
15.8 Changing the Order of Variables in the COLUMN Statement

Here is the same report, except that the order of the two variables Country and Model is reversed in the COLUMN statement and the manufacturer (Manuf) is added.
Program 15-10 Reversing the order of variables in the COLUMN statement

title "Multiple GROUP Usages";
proc report data=learn.bicycles nowd headline ls=80;
   column Model Country Manuf Units TotalSales;
   define Country    / group width=14;
   define Model      / group width=13;
   define Manuf      / width=12;
   define Units      / sum "Number of Units" width=8 format=comma8.;
   define TotalSales / sum "Total Sales (in thousands)" width=15
                       format=dollar10.;
run;
Notice that Model now comes first in the report and that Country is nested within Model.

Multiple GROUP Usages

                                                 Number      Total Sales
Model          Country         Manufacturer    of Units   (in thousands)
-------------------------------------------------------------------------
Hybrid         France          Trek               1,100             $594
               Italy           Trek                 700             $483
               USA             Trek               4,500           $2,925
               United Kingdom  Trek                 800             $392
                               Cannondale           500             $440
Mountain Bike  France          Trek               5,600           $7,280
                               Cannondale           800           $1,519
               Italy           Trek               3,400           $6,382
               USA             Trek               6,000           $7,200
                               Cannondale         4,000          $10,800
               United Kingdom  Trek               1,211           $1,358
Road Bike      France          Trek               3,400           $8,500
                               Cannondale           900           $3,330
               Italy           Trek               4,500          $13,005
               USA             Trek               5,000          $11,000
                               Cannondale         2,000           $4,200
               United Kingdom  Trek               2,444           $5,132
                               Cannondale         1,200           $2,548
15.9 Changing the Order of Rows in a Report

If you want a listing in sorted order using PROC PRINT, you must first sort the data set with PROC SORT. PROC REPORT allows you to request a report in sorted order within the procedure itself. This is accomplished by requesting the ORDER usage as an option in your DEFINE statement. As an example, suppose you want a listing of your Sales data set with the variables EmpID, Quantity, and TotalSales. In addition, you want this listing to be arranged in EmpID order. Here is the program:
Program 15-11 Demonstrating the ORDER usage of PROC REPORT

title "Listing from SALES in EmpID Order";
proc report data=learn.sales nowd headline;
   column EmpID Quantity TotalSales;
   define EmpID      / order "Employee ID" width=11;
   define Quantity   / width=8 format=comma8.;
   define TotalSales / "Total Sales" width=9 format=dollar9.;
run;
The keyword ORDER in the DEFINE statement for EmpID produces the report in order of ascending EmpID. This saves you the trouble of having to sort your data set prior to requesting the report. The resulting report is shown next:

Listing from SALES in EmpID Order

                           Total
Employee ID  Quantity      Sales
--------------------------------
0017              500    $19,995
                  100     $1,995
0177              100       $699
                  100       $899
                    5        $53
                    2    $20,000
1843               50       $450
                  100       $599
                    3    $15,597
                    1     $5,129
                   10       $300
                    1     $9,109
9888               50       $300
                1,000     $1,990
15.10 Applying the ORDER Usage to Two Variables

You can apply the ORDER usage to several variables. For example, suppose you want the same report as the one produced by Program 15-11, but you want to see the total sales for each employee in decreasing order of total sales. Here is the program:
Program 15-12 Applying the ORDER usage for two variables

title "Applying the ORDER Usage for Two Variables";
proc report data=learn.sales nowd headline;
   column EmpID Quantity TotalSales;
   define EmpID      / order "Employee ID" width=11;
   define TotalSales / descending order "Total Sales" width=9
                       format=dollar9.;
   define Quantity   / width=8 format=comma8.;
run;
Because you want the report to show total sales for each employee in decreasing order, you precede ORDER with the keyword DESCENDING. Remember, the order of the ORDER variables in the report is controlled by their order in the COLUMN statement.
Here is the report:

Applying the ORDER Usage for Two Variables

                           Total
Employee ID  Quantity      Sales
--------------------------------
0017              500    $19,995
                  100     $1,995
0177                2    $20,000
                  100       $899
                  100       $699
                    5        $53
1843                3    $15,597
                    1     $9,109
                    1     $5,129
                  100       $599
                   50       $450
                   10       $300
9888            1,000     $1,990
                   50       $300
As you can see, the report is in ascending order of employee ID and decreasing order of total sales.
15.11 Creating a Multi-Column Report

PROC REPORT can create telephone book style, multi-column reports. This is especially useful when you have only a few variables to report and you want to save paper. Data set Assign contains subject numbers (Subject) and groups (A, B, or C). Using the PANELS= option of PROC REPORT, you can print this report with multiple columns. If you specify a large number for the number of panels, PROC REPORT fits as many panels as possible for a given line size. Here is the program:
Program 15-13 Creating a multi-column report

title "Random Assignment - Three Groups";
proc report data=learn.assign nowd panels=99 headline ps=16;
   columns Subject Group;
   define Subject / display width=7;
   define Group   / width=5;
run;
The page size (PS) is set to 16 (which limits the number of lines per page to 16) so that you can see the effect of the PANELS= option. In this program, a DISPLAY usage was used for Subject (a numeric variable). This was not necessary because the default usage for Group (a character variable) is DISPLAY, and you would have obtained a detail report anyway. However, it is fine to include it. Here is the report:

Random Assignment - Three Groups

Subject  Group     Subject  Group     Subject  Group
--------------     --------------     --------------
      1  C              13  A              25  B
      2  B              14  A              26  C
      3  B              15  C              27  A
      4  A              16  A              28  A
      5  B              17  C              29  B
      6  A              18  A              30  B
      7  C              19  B              31  B
      8  B              20  B              32  B
      9  B              21  C              33  C
     10  A              22  A              34  C
     11  C              23  C              35  A
     12  C              24  C              36  A
15.12 Producing Report Breaks

PROC REPORT can produce totals and subtotals on ANALYSIS variables by using BREAK and RBREAK statements. RBREAK, which stands for report break, is used to report summary statistics (typically sums or means) at the top or bottom of a report. The BREAK statement is used to report summary statistics each time a GROUP or ORDER variable changes value.

Let’s look at a few examples. Using the Sales data set, you want to see totals for Quantity and TotalSales at the bottom of the report, with double lines printed above and below the values. The following program produces this report:
Program 15-14 Requesting a report break (RBREAK statement)

title "Producing Report Breaks";
proc report data=learn.sales nowd headline;
   column Region Quantity TotalSales;
   define Region     / width=6;
   define Quantity   / sum width=8 format=comma8.;
   define TotalSales / sum "Total Sales" width=9 format=dollar9.;
   rbreak after / dol dul summarize;
run;
Following the keyword RBREAK is a location, either BEFORE or AFTER. BEFORE places the summary statistics at the beginning of the report; AFTER places them at the end. The options DOL and DUL request double overlines and double underlines, respectively. (Other options are OL—overline and UL—underline.) The SUMMARIZE option prints a statistic (the one requested in the DEFINE statements) between these lines. The report produced by Program 15-14 is shown here:
Producing Report Breaks

                      Total
Region  Quantity      Sales
---------------------------
North         50       $450
South        100       $599
North          3    $15,597
North          1     $5,129
South         10       $300
East         100       $699
East         100       $899
South          1     $9,109
West          50       $300
West       1,000     $1,990
East         500    $19,995
South        100     $1,995
North          5        $53
East           2    $20,000
        ========  =========
           2,022    $77,113
        ========  =========
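If you would rather see the grand totals before the detail rows, the only change needed is the location keyword. The following is a sketch based on Program 15-14; the title and the choice of a double underline only (DUL) are illustrative, not taken from the book:

```sas
* Illustrative sketch: same report as Program 15-14 with totals at the top;
title "Report Break at the Top of the Report";
proc report data=learn.sales nowd headline;
   column Region Quantity TotalSales;
   define Region     / width=6;
   define Quantity   / sum width=8 format=comma8.;
   define TotalSales / sum "Total Sales" width=9 format=dollar9.;
   * BEFORE places the summary line ahead of the detail rows;
   rbreak before / dul summarize;
run;
```

Keep in mind that the line-drawing options (OL, UL, DOL, and DUL) apply to traditional listing output.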
To display summary statistics for each value of one or more GROUP or ORDER variables, use the BREAK statement. For example, if you want to see the total Quantity and TotalSales for each region in the Sales data set, use the following program:
Program 15-15 Demonstrating the BREAK statement of PROC REPORT

title "Producing Report Breaks";
proc report data=learn.sales nowd headline;
   column Region Quantity TotalSales;
   define Region     / order width=6;
   define Quantity   / sum width=8 format=comma8.;
   define TotalSales / sum "Total Sales" width=9 format=dollar9.;
   break after region / ol summarize skip;
run;
The BREAK statement uses the same location keywords as the RBREAK statement. With the BREAK statement, you also need to specify one or more GROUP or ORDER variables that determine where to print the summary statistics. The option SUMMARIZE prints the appropriate summary statistic after the break, and SKIP prints a blank line after the summary line. In Program 15-15, a line showing the sum of Quantity and TotalSales is printed for each change in the Region value. The output from this program is as follows:

Producing Report Breaks

                      Total
Region  Quantity      Sales
---------------------------
East         100       $699
             100       $899
             500    $19,995
               2    $20,000
------  --------  ---------
East         702    $41,593

North         50       $450
               3    $15,597
               1     $5,129
               5        $53
------  --------  ---------
North         59    $21,228

South        100       $599
              10       $300
               1     $9,109
             100     $1,995
------  --------  ---------
South        211    $12,003

West          50       $300
           1,000     $1,990
------  --------  ---------
West       1,050     $2,290
If you don’t want the summary line to contain the values of the BREAK variable(s), use the SUPPRESS option, like this:

   break after region / ol summarize suppress skip;
This results in a cleaner looking report.
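For reference, here is Program 15-15 again with the SUPPRESS option in place. This is a sketch assembled from the statements already shown; only the title text is new and is purely illustrative:

```sas
* Program 15-15 with SUPPRESS added to the BREAK statement (title is illustrative);
title "Report Breaks with the SUPPRESS Option";
proc report data=learn.sales nowd headline;
   column Region Quantity TotalSales;
   define Region     / order width=6;
   define Quantity   / sum width=8 format=comma8.;
   define TotalSales / sum "Total Sales" width=9 format=dollar9.;
   * SUPPRESS leaves the Region value off each summary line;
   break after region / ol summarize suppress skip;
run;
```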
15.13 Using a Nonprinting Variable to Order a Report

PROC REPORT can use a variable to order the rows of a report without including the variable in the report. For example, in data set Sales, the Name variable stores names as first name followed by last name. Now, if you want a report showing this variable but you want to arrange the rows alphabetically by last name, you can run the following program:
Program 15-16 Using a nonprinting variable to order the rows of a report

data temp;
   set learn.sales;
   length LastName $ 10;
   LastName = scan(Name,-1,' ');
run;

title "Listing Ordered by Last Name";
proc report data=temp nowd headline;
   column LastName Name EmpID TotalSales;
   define LastName   / group noprint;
   define Name       / group width=15;
   define EmpID      / "Employee ID" group width=11;
   define TotalSales / sum "Total Sales" width=9 format=dollar9.;
run;
In the short DATA step, the SCAN function extracts the last name from the Name variable. The second argument to the SCAN function tells the system which “word” to extract from the first argument. A negative value says to start scanning from the right; the third argument defines the word delimiters—in this case, a space.
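To see the SCAN function in isolation, you can run a short DATA _NULL_ step that writes its result to the SAS log. This is a minimal sketch: the name used here is one of the values that appears in the Sales listing, and the step itself is not part of the book's program.

```sas
* Illustrative sketch: extract the last word of a name and write it to the log;
data _null_;
   length LastName $ 10;
   Name = 'Glenda Johnson';
   * A word number of -1 asks for the first word counting from the right;
   LastName = scan(Name, -1, ' ');
   put LastName=;   * writes LastName=Johnson to the log;
run;
```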
The LastName variable must be listed in the COLUMN statement, even if you are not planning to include it in the report. In the DEFINE statement for LastName, you need to use the NOPRINT option. This option allows the report to be sorted by last name but removes this variable from the report. In the listing from Program 15-16, notice that the rows of the report are ordered by the last name:

Listing Ordered by Last Name

                                 Total
Name             Employee ID     Sales
---------------------------------------
Glenda Johnson   0177          $21,651
Sharon Lu        9888           $2,290
Jason Nguyen     0017          $21,990
George Smith     1843          $31,183
15.14 Computing a New Variable with PROC REPORT

One powerful feature of PROC REPORT is its ability to compute new variables. This makes PROC REPORT somewhat unique among SAS procedures. With most SAS procedures, you need to run a DATA step first if you want to perform any calculations. Suppose you want to report the weights of the patients in your Medical data set, but instead of reporting the weights in pounds (the units used in the data set), you want to see the weights in kilograms. Program 15-17 uses a compute block to accomplish this:
Program 15-17 Computing a new variable with PROC REPORT

title "Computing a New Variable";
proc report data=learn.medical nowd;
   column Patno Weight WtKg;
   define Patno  / display "Patient Number" width=7;
   define Weight / display noprint width=6;
   define WtKg   / computed "Weight in Kg" width=6 format=6.1;
   compute WtKg;
      WtKg = Weight / 2.2;
   endcomp;
run;
Notice several things about this program:

• The new variable (WtKg) and the variable used to compute it (Weight) are both listed in the COLUMN statement. It is important that you list Weight before WtKg in this statement: a computed value must follow the variable or variables used to define it.
• If you don’t want to include the original value of weight in the report, you use the keyword NOPRINT in the DEFINE statement for Weight.
• Use a usage of COMPUTED in the DEFINE statement for your new variable.
• Use COMPUTE and ENDCOMP statements to create a COMPUTE block. You define your new variable inside this block.
As you will see in the next example, this block can contain programming logic as well as arithmetic computations. Here is the output from Program 15-17:

Computing a New Variable

Patient  Weight
Number    in Kg
001        54.5
003        75.5
002        95.5
004       130.9
007        81.8
050        90.5
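Program 15-17 refers to Weight by its simple name because Weight has a DISPLAY usage. One point the program does not show: when a variable has an ANALYSIS usage, a COMPUTE block refers to it with a compound name of the form variable.statistic. The sketch below is a variation on Program 15-17, not one of the book's programs, in which Weight is given an ANALYSIS usage with the SUM statistic. Because Patno is still a DISPLAY variable, each row represents one observation, so Weight.sum is simply that patient's weight.

```sas
* Illustrative variation on Program 15-17: Weight has an ANALYSIS usage;
title "Computing a New Variable from an ANALYSIS Variable";
proc report data=learn.medical nowd;
   column Patno Weight WtKg;
   define Patno  / display "Patient Number" width=7;
   define Weight / analysis sum noprint;
   define WtKg   / computed "Weight in Kg" width=6 format=6.1;
   compute WtKg;
      * Compound name: variable name, a period, then the statistic;
      WtKg = Weight.sum / 2.2;
   endcomp;
run;
```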
15.15 Computing a Character Variable in a COMPUTE Block

As we mentioned in the previous example, you can include programming logic within a COMPUTE block. In this example, you want to create a new character variable (Rate) that has a value of Fast, Normal, or Slow, based on the heart rate (HR). Here is the program:
title "Creating a Character Variable in a COMPUTE Block";
proc report data=learn.medical nowd;
   column Patno HR Weight Rate;
   define Patno  / display "Patient Number" width=7;
   define HR     / display "Heart Rate" width=5;
   define Weight / display width=6;
   define Rate   / computed width=6;
   compute Rate / character length=6;
      if HR gt 75 then Rate = 'Fast';
      else if HR gt 55 then Rate = 'Normal';
      else if not missing(HR) then Rate = 'Slow';
   endcomp;
run;
As before, you include the variable you are computing in the COLUMN statement. The COMPUTE statement now includes the keyword CHARACTER to tell SAS that Rate is a character variable. The option LENGTH=6 defines the storage length for this variable. Next, the logical statements to compute Rate are sandwiched between the COMPUTE and ENDCOMP statements. Here is the output:

Creating a Character Variable in a COMPUTE Block

Patient  Heart
Number    Rate  Weight  Rate
001         78     120  Fast
003         58     166  Normal
002         68     210  Normal
004         88     288  Fast
007         54     180  Slow
050         60     199  Normal
As you can see, the new variable (Rate) is now included in the report.
15.16 Creating an ACROSS Variable with PROC REPORT

Besides creating simple column reports, PROC REPORT can also create a tabular style report with each unique value of a variable forming a new column in your report. This is accomplished by using an ACROSS usage for your variable. In the example that follows, each type of bicycle (hybrid, mountain, and road) in the Bicycles data set is used to create a separate column in the report:
Program 15-18 Demonstrating an ACROSS usage in PROC REPORT

***Demonstrating an Across Usage;
title "Demonstrating an ACROSS Usage";
proc report data=learn.bicycles nowd headline ls=80;
   column Model,Units Country;
   define Country / group width=14;
   define Model   / across "Model";
   define Units   / sum "# of Units" width=14 format=comma8.;
run;
Besides defining Model as an ACROSS variable, you need to tell the procedure what value you want to display for each row of the table. Notice the Model,Units term in the COLUMN statement. This is an instruction to PROC REPORT to report the number of units within each category of Model. The listing here should help to clarify how this works:

Demonstrating an ACROSS Usage

                    Model
        Hybrid   Mountain Bike       Road Bike
    # of Units      # of Units      # of Units  Country
--------------------------------------------------------------
         1,100           6,400           4,300  France
           700           3,400           4,500  Italy
         4,500          10,000           7,000  USA
         1,300           1,211           3,644  United Kingdom
15.17 Modifying the Column Label for an ACROSS Variable

When you begin and end the label text for an ACROSS variable with certain characters (such as a dash or underscore), SAS uses that character to extend the column label across all the columns created by the ACROSS variable. For example, replace the DEFINE statement in Program 15-18 with the following:

   define Model / across "- Model -";
The output looks like this:

Demonstrating an ACROSS Usage
Changing the Column Heading

------------------- Model --------------------
        Hybrid   Mountain Bike       Road Bike
    # of Units      # of Units      # of Units  Country
--------------------------------------------------------------
         1,100           6,400           4,300  France
           700           3,400           4,500  Italy
         4,500          10,000           7,000  USA
         1,300           1,211           3,644  United Kingdom
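Putting the pieces together, the complete program with the modified label might look like the following sketch. It is Program 15-18 with only the DEFINE statement for Model changed; the title text is an assumption and may not match the title used for the output above.

```sas
* Program 15-18 with the ACROSS label extended by dashes (title is an assumption);
title "Demonstrating an ACROSS Usage";
proc report data=learn.bicycles nowd headline ls=80;
   column Model,Units Country;
   define Country / group width=14;
   * The leading and trailing dashes are repeated to span all Model columns;
   define Model   / across "- Model -";
   define Units   / sum "# of Units" width=14 format=comma8.;
run;
```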
15.18 Using an ACROSS Usage to Display Statistics

You can use an ACROSS usage to create a report showing a summary statistic for each level of the ACROSS variable. Here is an example. You want to see the average white and red blood cell counts for each combination of gender, blood type, and age group in the Blood data set. You also want the age group variable to be displayed in separate columns of the report. Here is the program:
title "Average Blood Counts by Age Group";
proc report data=learn.blood nowd headline;
   column Gender BloodType AgeGroup,WBC AgeGroup,RBC;
   define Gender    / group width=8;
   define BloodType / group width=8 "Blood Group";
   define AgeGroup  / across "- Age Group -";
   define WBC       / analysis mean format=comma8.;
   define RBC       / analysis mean format=8.2;
run;
As you can see, the COLUMN statement shows that the mean value of WBC (white blood cells) and RBC (red blood cells) should be displayed for each value of AgeGroup. You also need to define WBC and RBC with an ANALYSIS usage with MEAN as the desired statistic. Here is the output:

Average Blood Counts by Age Group

                   --- Age Group ----  --- Age Group ----
          Blood         Old     Young       Old     Young
Gender    Group         WBC       WBC       RBC       RBC
----------------------------------------------------------
Female    A           7,162     7,310      5.40      5.58
          AB          7,556     7,251      5.05      5.96
          B           6,931     6,501      5.69      5.30
          O           7,033     7,071      5.56      5.48
Male      A           6,995     7,138      5.38      5.59
          AB          6,769     7,079      5.82      5.36
          B           7,082     6,807      5.41      5.43
          O           6,853     7,041      5.47      5.48
You should be aware that PROC TABULATE is well-suited for creating tables such as this. Many programmers, including this author, find PROC TABULATE easier to use when creating more complicated two-way tables.
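For comparison, a PROC TABULATE step along the following lines should produce a similar table of means. Treat it as a rough sketch rather than an exact equivalent: the TABLE statement layout and the formats are choices made here for illustration, and the headings that PROC TABULATE generates differ from those in the PROC REPORT output.

```sas
* Illustrative sketch of a similar table built with PROC TABULATE;
title "Average Blood Counts by Age Group";
proc tabulate data=learn.blood;
   class Gender BloodType AgeGroup;
   var WBC RBC;
   * Rows: blood type within gender. Columns: mean WBC and mean RBC by age group;
   table Gender*BloodType,
         AgeGroup*WBC*mean*f=comma8. AgeGroup*RBC*mean*f=8.2;
run;
```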
This chapter is only the “tip of the iceberg” as far as PROC REPORT is concerned. If you would like to learn more, there is an excellent SAS Press book¹ and SAS manual² on PROC REPORT.
15.19 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Use PROC REPORT to create a report, as shown here:

Note: The data set is Blood, and the variables to be listed are Subject, WBC, and RBC. All three variables are numeric (be careful). There is a line after the heading.

First 5 Observations from Blood Data Set

Subject  White Blood  Red Blood
 Number        Cells      Cells
-------------------------------
      1        7,710       7.40
      2        6,560       4.70
      3        5,690       7.53
      4        6,680       6.85
      5            .       7.72
¹ See Michele M. Burlew, SAS Guide to Report Writing: Examples, Second Edition (Cary, NC: SAS Institute Inc., 2005), for more information on PROC REPORT.
² See SAS Institute Inc., Base SAS 9.1 Procedures Guide, Volume 2 (Cary, NC: SAS Institute Inc., 2004), 845, for more information on PROC REPORT.
2. Using the Blood data set, produce a summary report showing the average WBC and RBC count for each value of Gender as well as an overall average. Your report should look like this:

Statistics from BLOOD by Gender

         Average  Average
Gender       WBC      RBC
-------------------------
Female     7,112     5.50
Male       6,988     5.47
         =======  =======
           7,043     5.48
3. Using the Hosp data set, create the report shown here. Age should be computed using the YRDIF function and rounded to the nearest integer:

Demonstrating a COMPUTE Block

         Admission                  Age at
Subject  Date        DOB         Admission
------------------------------------------
      1  03/28/2003  09/15/1926         77
      2  03/28/2003  07/08/1950         53
      3  03/28/2003  12/30/1981         21
      4  03/28/2003  06/11/1942         61
      5  08/03/2003  06/28/1928         75
Chapter 15: Creating Customized Reports 315
4. Using the SAS data set BloodPressure, compute a new variable in your report. This variable (Hypertensive) is defined as Yes for females (Gender=F) if the SBP is greater than 138 or the DBP is greater than 88 and No otherwise. For males (Gender=M), Hypertensive is defined as Yes if the SBP is over 140 or the DBP is over 90 and No otherwise. Your report should look like this:

   Hypertensive Patients

   Gender   SBP   DBP   Hypertensive?
   F        110    62   No
            120    70   No
            138    88   No
            132    76   No
   M        144    90   Yes
            130    80   No
            142    82   No
            150    96   No
5. Using the SAS data set BloodPressure, produce a report showing Gender and Age, along with a new variable called AgeGroup. Values of AgeGroup are <= 50 or > 50 depending on the value of Age. Label this variable “Age Group.” Your report should look like this:

   Patient Age Groups

   Gender   Age   Age Group
   M         23   <= 50
   F         68   > 50
   M         55   > 50
   F         28   <= 50
   M         35   <= 50
   M         45   <= 50
   F         48   <= 50
   F         78   > 50
6. Using the SAS data set BloodPressure, produce a report showing Gender, Age, SBP, and DBP. Order the report in Gender and Age order as shown here:

   Subjects in Gender and Age Order

                  Systolic         Diastolic
   Gender   Age   Blood Pressure   Blood Pressure
   F         28              120               70
             48              138               88
             68              110               62
             78              132               76
   M         23              144               90
             35              142               82
             45              150               96
             55              130               80
7. Using the SAS data set Blood, produce a report showing the mean cholesterol (Chol) for each combination of Gender and blood type (BloodType). Your report should look like this:

   Mean Cholesterol by Gender and Blood Type

   Gender   Blood Type   Mean Cholesterol
   --------------------------------------
   Female   A                       201.4
            AB                      166.5
            B                       208.1
            O                       205.6
   Male     A                       201.5
            AB                      191.2
            B                       211.1
            O                       197.9
8. Using the data set Blood, produce a report like the one here. The numbers in the table are the average WBC and RBC counts for each combination of blood type and gender.

   Average Blood Counts by Gender

                ------Gender------   ------Gender------
   Blood Type   Female       Male    Female       Male
                   WBC        WBC       RBC        RBC
   ----------------------------------------------------
   A             7,218      7,051      5.47       5.46
   AB            7,421      6,893      5.43       5.69
   B             6,716      6,991      5.52       5.42
   O             7,050      6,930      5.53       5.48
9. Using the SAS data set Survey, produce a report showing the ID, Gender, Age, and Salary variables and the average of the five variables Ques1–Ques5. Your report should look like this:

   Report on the Survey Data Set

   ID    Age   Gender     Salary   Average Response
   ------------------------------------------------
   001    23   M         $28,000                1.8
   002    55   F         $76,123                2.6
   003    38   M         $36,500                1.8
   004    67   F        $128,000                3.2
   005    22   M         $23,060                3.0
   006    63   M         $90,000                3.4
   007    45   F         $76,100                3.6
Chapter 16

Summarizing Your Data

16.1   Introduction
16.2   PROC MEANS: Starting from the Beginning
16.3   Adding a BY Statement to PROC MEANS
16.4   Using a CLASS Statement with PROC MEANS
16.5   Applying a Format to a CLASS Variable
16.6   Deciding between a BY Statement and a CLASS Statement
16.7   Creating Summary Data Sets Using PROC MEANS
16.8   Outputting Other Descriptive Statistics with PROC MEANS
16.9   Asking SAS to Name the Variables in the Output Data Set
16.10  Outputting a Summary Data Set: Including a BY Statement
16.11  Outputting a Summary Data Set: Including a CLASS Statement
16.12  Using Two CLASS Variables with PROC MEANS
16.13  Selecting Different Statistics for Each Variable
16.14  Problems
16.1 Introduction

You may have thought of PROC MEANS (or PROC SUMMARY) primarily as a way to generate summary reports, reporting the sums and means of your numeric variables. However, these procedures are much more versatile and can be used to create summary data sets that can then be analyzed with more DATA or PROC steps.

All the examples in this chapter use PROC MEANS rather than PROC SUMMARY, even when all you want is an output data set. The reason for this is that using PROC MEANS with a NOPRINT option is identical to using PROC SUMMARY.
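The equivalence of PROC MEANS with NOPRINT and PROC SUMMARY can be checked with a short sketch. The data set Demo and its five observations are made up for illustration (they are not the book's Blood data), and PROC COMPARE is used here only as one convenient way to confirm that the two output data sets match.

```sas
/* Hypothetical five-row data set standing in for Blood */
data demo;
   input RBC WBC;
datalines;
5.1 7000
4.8 6500
. 7200
6.0 .
5.5 8100
;

/* PROC MEANS needs NOPRINT to suppress its printed report */
proc means data=demo noprint;
   var RBC WBC;
   output out=from_means mean=MeanRBC MeanWBC;
run;

/* PROC SUMMARY produces no printed report by default */
proc summary data=demo;
   var RBC WBC;
   output out=from_summary mean=MeanRBC MeanWBC;
run;

/* Compare the two summary data sets variable by variable */
proc compare base=from_means compare=from_summary;
run;
```

If the two procedures behave as described, the PROC COMPARE report should show no unequal values.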
16.2 PROC MEANS: Starting from the Beginning

You can begin by running PROC MEANS with all the defaults, using the permanent SAS data set Blood. Here it is:

Program 16-1: PROC MEANS with all the defaults

   title "PROC MEANS With All the Defaults";
   proc means data=learn.blood;
   run;

Here is the resulting output:

   PROC MEANS With All the Defaults

   The MEANS Procedure

   Variable  Label            N          Mean       Std Dev       Minimum
   -----------------------------------------------------------------------
   Subject                 1000   500.5000000   288.8194361     1.0000000
   WBC                      908       7042.97       1003.37       4070.00
   RBC                      916     5.4835262     0.9841158     1.7100000
   Chol      Cholesterol    795   201.4352201    49.8867157    17.0000000
   -----------------------------------------------------------------------

   Variable  Label              Maximum
   -------------------------------------
   Subject                      1000.00
   WBC                         10550.00
   RBC                        8.7500000
   Chol      Cholesterol    331.0000000
   -------------------------------------
By default, PROC MEANS produces statistics on all the numeric variables in the input SAS data set. Looking at the output, you see the default statistics produced are N (number of non-missing values), Mean (average), Std Dev (standard deviation), Minimum, and Maximum. The next step is to gain some control over this process. You can control which variables to include in the report by supplying a VAR statement. Statistics are chosen by selecting options in the PROC MEANS statement. Here is a partial list of some of the more commonly used options.
Table 16-1: Partial list of PROC MEANS options

   PROC MEANS Option   Statistic Produced
   -----------------   ---------------------------------------------
   N                   Number of non-missing values
   NMISS               Number of missing values
   MEAN                Mean or average
   SUM                 Sum of the values
   MIN                 Minimum (non-missing) value
   MAX                 Maximum value
   MEDIAN              Median value
   STD                 Standard deviation
   VAR                 Variance
   CLM                 95% confidence interval for the mean
   Q1                  Value of the first quartile (25th percentile)
   Q3                  Value of the third quartile (75th percentile)
   QRANGE              Interquartile range (equal to Q3–Q1)
Besides these statistics, the option MAXDEC=value is especially useful. This value controls the number of places to the right of the decimal point that are printed in the output.
Let’s use some of these options to customize the output. You will compute the number of missing and non-missing values, the mean, median, minimum, and maximum for the variables RBC (red blood cells) and WBC (white blood cells) in the Blood data set. Finally, you will report all statistics to the nearest 10th. Here is the program:
Program 16-2: Adding a VAR statement and requesting specific statistics with PROC MEANS

   title "Selected Statistics Using PROC MEANS";
   proc means data=learn.blood n nmiss mean median min max maxdec=1;
      var RBC WBC;
   run;

The output follows:

   Selected Statistics Using PROC MEANS

   The MEANS Procedure

   Variable     N   N Miss      Mean    Median   Minimum
   ------------------------------------------------------
   RBC        916       84       5.5       5.5       1.7
   WBC        908       92    7043.0    7040.0    4070.0
   ------------------------------------------------------

   Variable    Maximum
   --------------------
   RBC             8.8
   WBC         10550.0
   --------------------
You now have only the statistics you requested on the variables listed in the VAR list. Notice also that all the statistics are reported to one decimal place, due to the MAXDEC=1 option. (Unfortunately, the number of decimal places you choose is applied to all variables. If you need more control over the number of printed decimal places for each variable, you can use PROC TABULATE.)
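As a rough sketch of the PROC TABULATE alternative mentioned above, the following program attaches a different format to each statistic in the TABLE statement. The data set Demo, its values, and the chosen formats (6.2 and COMMA9.1) are illustrative assumptions, not taken from the book's Blood data.

```sas
/* Hypothetical five-row data set standing in for Blood */
data demo;
   input RBC WBC;
datalines;
5.12 7010
4.87 6540
. 7220
6.03 .
5.51 8130
;

title "Different Decimal Places for Each Variable";
proc tabulate data=demo;
   var RBC WBC;
   /* F= sets the format for one table cell:      */
   /* two decimals for RBC, one decimal for WBC   */
   table RBC*mean*f=6.2 WBC*mean*f=comma9.1;
run;
```

Because the format is attached to each variable's statistic separately, the number of decimal places no longer has to be the same for every column.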
16.3 Adding a BY Statement to PROC MEANS

If you want to see descriptive statistics for each level of another variable, you can include a BY statement, listing one or more BY variables. Remember that you have to sort your data set first by the same variable or variables you list on the BY statement. Here are the same statistics displayed by Program 16-2, broken down by gender:

Program 16-3: Adding a BY statement to PROC MEANS

   proc sort data=learn.blood out=blood;
      by Gender;
   run;

   title "Adding a BY Statement to PROC MEANS";
   proc means data=blood n nmiss mean median min max maxdec=1;
      by Gender;
      var RBC WBC;
   run;

You now have your descriptive statistics for males and females separately, as shown next:

   Adding a BY Statement to PROC MEANS

   Gender=Female

   The MEANS Procedure

   Variable     N   N Miss      Mean    Median   Minimum
   ------------------------------------------------------
   RBC        409       31       5.5       5.6       1.7
   WBC        403       37    7112.4    7150.0    4620.0
   ------------------------------------------------------

   Variable    Maximum
   --------------------
   RBC             8.8
   WBC         10260.0
   --------------------

   Gender=Male

   Variable     N   N Miss      Mean    Median   Minimum
   ------------------------------------------------------
   RBC        507       53       5.5       5.5       2.3
   WBC        505       55    6987.5    6930.0    4070.0
   ------------------------------------------------------

   Variable    Maximum
   --------------------
   RBC             8.4
   WBC         10550.0
   --------------------
16.4 Using a CLASS Statement with PROC MEANS

PROC MEANS lets you use a CLASS statement in place of a BY statement. The CLASS statement performs a similar function to the BY statement, with some significant differences. If you are using PROC MEANS to print a report and are not creating a summary output data set, the differences in the printed output between a BY and CLASS statement are basically cosmetic. The main difference, from a programmer’s perspective, is that you do not have to sort your data set before using a CLASS statement. To demonstrate this, run Program 16-3 again, substituting the CLASS statement for the BY statement, as follows:

Program 16-4: Using a CLASS statement with PROC MEANS

   title "Using a CLASS Statement with PROC MEANS";
   proc means data=learn.blood n nmiss mean median min max maxdec=1;
      class Gender;
      var RBC WBC;
   run;

Notice that you are using the permanent SAS data set (that may not be sorted) instead of the temporary sorted data set used in Program 16-3. There are some minor differences in the appearance of the printed output, as shown here:

   Using a CLASS Statement with PROC MEANS

   The MEANS Procedure

   Gender   N Obs   Variable     N   N Miss      Mean    Median
   -------------------------------------------------------------
   Female     440   RBC        409       31       5.5       5.6
                    WBC        403       37    7112.4    7150.0
   Male       560   RBC        507       53       5.5       5.5
                    WBC        505       55    6987.5    6930.0
   -------------------------------------------------------------

   Gender   N Obs   Variable   Minimum   Maximum
   ----------------------------------------------
   Female     440   RBC            1.7       8.8
                    WBC         4620.0   10260.0
   Male       560   RBC            2.3       8.4
                    WBC         4070.0   10550.0
   ----------------------------------------------
16.5 Applying a Format to a CLASS Variable

One very nice feature of using a CLASS statement (besides not having to sort your data) is that SAS uses formatted values of the CLASS variable(s). You can use this to your advantage by adding a FORMAT statement to the procedure and changing how the CLASS variable groups your data, all without having to modify your original data set.

Suppose you want to see some basic statistics (mean and median) for the two blood count variables (RBC and WBC), broken down by subjects with cholesterol levels below 200 versus subjects with cholesterol levels of 200 or above. See how easy this is to do with a CLASS statement and a FORMAT statement:

Program 16-5: Demonstrating the effect of a formatted CLASS variable

   proc format;
      value chol_group low -< 200 = 'Low'
                       200 - high = 'High';
   run;

   proc means data=learn.blood n nmiss mean median maxdec=1;
      title "Using a CLASS Statement with PROC MEANS";
      class Chol;
      format Chol chol_group.;
      var RBC WBC;
   run;

Do you see the tremendous power in this method? You can try different groupings of CLASS variables by supplying a new format. Here is the output:

   Using a CLASS Statement with PROC MEANS

   The MEANS Procedure

   Cholesterol   N Obs   Variable     N   N Miss      Mean    Median
   ------------------------------------------------------------------
   Low             384   RBC        352       32       5.5       5.5
                         WBC        351       33    6938.2    6910.0
   High            411   RBC        376       35       5.5       5.5
                         WBC        374       37    7138.9    7130.0
   ------------------------------------------------------------------
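To illustrate trying a different grouping by supplying a new format, here is a minimal sketch that splits cholesterol into three groups instead of two. The data set Demo, the format name CHOL_THREE, and its cutoffs and labels are all hypothetical and chosen only to show the mechanics; the data set itself is never modified.

```sas
/* Hypothetical six-row data set standing in for Blood */
data demo;
   input Chol RBC;
datalines;
150 5.1
185 4.9
199 5.8
210 5.6
245 5.3
260 6.0
;

/* A new grouping: only the format changes, not the data */
proc format;
   value chol_three low -< 200 = 'Under 200'
                    200 -< 240 = '200 to 239'
                    240 - high = '240 and over';
run;

title "Regrouping a CLASS Variable with a New Format";
proc means data=demo n mean maxdec=1;
   class Chol;
   format Chol chol_three.;   /* CLASS levels follow the formatted values */
   var RBC;
run;
```

Swapping CHOL_THREE for another format in the FORMAT statement is all it takes to regroup the report again.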
16.6 Deciding between a BY Statement and a CLASS Statement

There are a number of considerations in deciding whether to use a BY or a CLASS statement when you want to use PROC MEANS to produce printed output (the differences are more important if you are using PROC MEANS to create summary output data sets). First, if you have a very large data set that is not sorted, you may want to use a CLASS statement. However, if the data set is already in the correct sorted order, a BY statement is more efficient. If you have a large number of CLASS variables and there are many distinct values for each of these variables, you may need considerable computer memory to run the procedure. If you have relatively small data sets, choose the statement that produces the style of printed output that you prefer.
16.7 Creating Summary Data Sets Using PROC MEANS

You can use PROC MEANS (or SUMMARY) to create a new data set that contains summary information such as sums and means. This data set can then be used for further analysis. This first example shows how to compute two means and output them to a SAS data set.

Program 16-6: Creating a summary data set using PROC MEANS

   proc means data=learn.blood noprint;
      var RBC WBC;
      output out = my_summary
             mean = MeanRBC MeanWBC;
   run;

   title "Listing of MY_SUMMARY";
   proc print data=my_summary noobs;
   run;

An OUTPUT statement tells PROC MEANS that you want to create a summary SAS data set. The keyword OUT= is used to name the new data set. In this example, you are going to name your output data set My_Summary. Next, you can specify exactly what statistics you want in this data set. You can output most of the statistics listed in Table 16-1. Following the keyword and an equal sign, you provide a list of the variable names that correspond to these values. In this program, the variable MeanRBC is the mean of all the RBC values, and MeanWBC is the mean of all the WBC values. You can name these variables anything you want; the names are arbitrary. The order of the variables here corresponds to the order of the variables in the VAR list. It is convenient to name these variables in a way that helps you remember what they stand for. Shortly, you will see that SAS can name these variables for you in a very convenient way.

Finally, if you want to create a SAS data set but do not want any printed output from PROC MEANS, use the NOPRINT option. As we mentioned earlier, if you choose to use PROC SUMMARY instead of PROC MEANS, you do not need a NOPRINT option, because it is the default for this procedure. Program 16-6 also includes a PROC PRINT statement so that you can see the contents of the output data set. Here is the listing of the summary data set My_Summary:

   Listing of MY_SUMMARY

   _TYPE_   _FREQ_   MeanRBC   MeanWBC
        0     1000   5.48353   7042.97

This data set has only one observation. It includes the two means plus two additional variables created by PROC MEANS. We will discuss these two variables later when we run the procedure with a CLASS statement.

When you are outputting only a single statistic (such as a mean in the previous example), you can omit the variable list following the keyword MEAN= if you want. If you do, SAS gives these summary variables the same names as the variables listed in the VAR statement. It is strongly recommended that you do not do this. It is easy to make a mistake when you have the same variable name in two data sets, with one representing individual values and the other representing a summary statistic.
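The naming pitfall described above can be seen in a small sketch. The data set Demo and its values are made up for illustration; the only point is that leaving the list after MEAN= empty makes the summary variables reuse the names RBC and WBC.

```sas
/* Hypothetical five-row data set standing in for Blood */
data demo;
   input RBC WBC;
datalines;
5.1 7000
4.8 6500
. 7200
6.0 .
5.5 8100
;

/* No names after MEAN=, so the means are stored in RBC and WBC */
proc means data=demo noprint;
   var RBC WBC;
   output out=same_names mean=;
run;

title "Summary Variables Named RBC and WBC (Not Recommended)";
proc print data=same_names noobs;
run;
```

In the listing, RBC and WBC now hold means rather than individual values, which is exactly the ambiguity the recommendation is meant to avoid.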
16.8 Outputting Other Descriptive Statistics with PROC MEANS

As we mentioned earlier, you can output more than one statistic in your output data set. As an example, let’s output the mean, the number of non-missing observations, the number of missing observations, and the median. Here is the program:

Program 16-7: Outputting more than one statistic with PROC MEANS

   proc means data=learn.blood noprint;
      var RBC WBC;
      output out = many_stats
             mean   = M_RBC M_WBC
             n      = N_RBC N_WBC
             nmiss  = Miss_RBC Miss_WBC
             median = Med_RBC Med_WBC;
   run;

So, you get the idea. You can output as many statistics as you want and name them anything you want. A listing of this data set, using PROC PRINT, is shown here:

   Listing of MANY_STATS

   _TYPE_   _FREQ_     M_RBC     M_WBC   N_RBC   N_WBC
        0     1000   5.48353   7042.97     916     908

   Miss_RBC   Miss_WBC   Med_RBC   Med_WBC
         84         92      5.52      7040

The two variables M_RBC and M_WBC represent the means of RBC and WBC, respectively. The two variables Med_RBC and Med_WBC are the medians of the two variables. N_RBC and N_WBC represent the number of non-missing values; Miss_RBC and Miss_WBC represent the number of missing values for these two variables. As a check, notice that N_RBC plus Miss_RBC equals 1,000, the total number of observations in data set Blood.
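The check described above (non-missing plus missing equals the number of observations) can also be done in a DATA step. This is a sketch on a made-up data set called Demo; the variable names follow the pattern used in Program 16-7, and the messages written to the log are arbitrary.

```sas
/* Hypothetical five-row data set standing in for Blood */
data demo;
   input RBC WBC;
datalines;
5.1 7000
4.8 6500
. 7200
6.0 .
5.5 8100
;

proc means data=demo noprint;
   var RBC WBC;
   output out=counts
          n     = N_RBC N_WBC
          nmiss = Miss_RBC Miss_WBC;
run;

/* Write a note to the log: N + NMISS should equal _FREQ_ */
data _null_;
   set counts;
   if N_RBC + Miss_RBC = _FREQ_ and N_WBC + Miss_WBC = _FREQ_ then
      put "Counts agree with " _FREQ_=;
   else
      put "Mismatch: " N_RBC= Miss_RBC= N_WBC= Miss_WBC= _FREQ_=;
run;
```

A check like this is a cheap way to confirm that the statistics in a summary data set are paired with the variables you intended.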
16.9 Asking SAS to Name the Variables in the Output Data Set

You can ask SAS to create variable names for any of the summary statistics produced by PROC MEANS. The OUTPUT option AUTONAME causes PROC MEANS to create variable names for the selected statistics by using the variable names in the VAR statement and adding an underscore character followed by the name of the statistic. The best way to see this is to look at a program and a listing of the output data set:

Program 16-8: Demonstrating the OUTPUT option AUTONAME

   proc means data=learn.blood noprint;
      var RBC WBC;
      output out = many_stats
             mean   =
             n      =
             nmiss  =
             median = / autoname;
   run;

Because AUTONAME is a statement option, it follows a slash in the OUTPUT statement. Take a look at a listing of this data set to see how SAS named these variables:

   Demonstrating the AUTONAME Output Option

   _TYPE_   _FREQ_   RBC_Mean   WBC_Mean   RBC_N   WBC_N
        0     1000    5.48353    7042.97     916     908

   RBC_NMiss   WBC_NMiss   RBC_Median   WBC_Median
          84          92         5.52         7040

As you can see, the OUTPUT option AUTONAME is quite useful, and it creates consistent and easy-to-understand variable names.
16.10 Outputting a Summary Data Set: Including a BY Statement

Although there are many uses for a summary data set containing summary statistics on an entire data set, there are times when you would like to output summary statistics for each level of one or more classification variables. For example, you might want to see n's and means of RBC and WBC broken down by gender. Remember that to use a BY statement, the data set must be sorted in the same order. Program 16-9 creates a summary data set containing the mean values of RBC and WBC for males and females.

Program 16-9: Adding a BY statement to PROC MEANS

   proc sort data=learn.blood out=blood;
      by Gender;
   run;

   proc means data=blood noprint;
      by Gender;
      var RBC WBC;
      output out = by_gender
             mean = MeanRBC MeanWBC
             n    = N_RBC N_WBC;
   run;

A listing of the output data set follows:

   Listing of BY_GENDER

   Gender   _TYPE_   _FREQ_   MeanRBC   MeanWBC   N_RBC   N_WBC
   Female        0      440   5.49848   7112.43     409     403
   Male          0      560   5.47146   6987.54     507     505

In this data set, _FREQ_ represents the number of observations for each value of gender (it doesn’t matter if there are missing values for RBC or WBC). For example, you see that there are 440 females in the data set and that the means for RBC and WBC are 5.49848 and 7112.43, respectively. The variables N_RBC and N_WBC represent the number of non-missing values that were used to compute the two means.
16.11 Outputting a Summary Data Set: Including a CLASS Statement

As you saw earlier in this chapter, you can use a CLASS statement with PROC MEANS to obtain results similar to those obtained using a BY statement. The difference between the two becomes more apparent when you are creating output data sets. The program here is a repeat of Program 16-9, with a CLASS statement replacing the BY statement (and the PROC SORT omitted):

Program 16-10: Adding a CLASS statement to PROC MEANS

   proc means data=blood noprint;
      class Gender;
      var RBC WBC;
      output out = with_class
             mean = MeanRBC MeanWBC
             n    = N_RBC N_WBC;
   run;

The resulting data set is listed here:

   Listing of WITH_CLASS

   Gender   _TYPE_   _FREQ_   MeanRBC   MeanWBC   N_RBC   N_WBC
                 0     1000   5.48353   7042.97     916     908
   Female        1      440   5.49848   7112.43     409     403
   Male          1      560   5.47146   6987.54     507     505

Notice that you now have three observations instead of just two. The first observation in this data set, with _TYPE_ equal to 0, is the mean for both males and females. Statisticians call this the grand mean. The other two observations with _TYPE_ equal to 1 represent the means for females and males. We will discuss the interpretation of the _TYPE_ variable in the next section, which demonstrates the use of more than one CLASS variable.

If you want only the means broken down by gender and do not want an observation with the grand mean in your output data set, use the NWAY option of PROC MEANS. Using this option makes the output data set using a CLASS statement identical (except for the value of _TYPE_) to an output data set using a BY statement. Thus, you can write Program 16-11:
Program 16-11: Adding the NWAY option to PROC MEANS

   proc means data=blood noprint nway;
      class Gender;
      var RBC WBC;
      output out = with_class
             mean = MeanRBC MeanWBC
             n    = N_RBC N_WBC;
   run;

The resulting data set is as follows:

   Listing of WITH_CLASS With the NWAY Option

   Gender   _TYPE_   _FREQ_   MeanRBC   MeanWBC   N_RBC   N_WBC
   Female        1      440   5.49848   7112.43     409     403
   Male          1      560   5.47146   6987.54     507     505

It is very important to remember to use the NWAY option if you only want to see your descriptive statistics at each level of the CLASS variable.
16.12 Using Two CLASS Variables with PROC MEANS

Things start to get more complicated when you have more than one CLASS variable. Suppose you want to compute the mean and the number of non-missing values for RBC and WBC for each combination of AgeGroup and Gender. You place these two variables in the CLASS statement and create an output data set just as you did earlier with only one CLASS variable. Here are the PROC statements to create such an output data set:

Program 16-12: Using two CLASS variables with PROC MEANS

   proc means data=learn.blood noprint;
      class Gender AgeGroup;
      var RBC WBC;
      output out = summary
             mean =
             n    = / autoname;
   run;

And here is a listing of the output data set:

   Listing of SUMMARY

                  Age
   Obs   Gender   Group   _TYPE_   _FREQ_   RBC_Mean   WBC_Mean   RBC_N   WBC_N
     1                         0     1000    5.48353    7042.97     916     908
     2            Old          1      598    5.45779    7011.56     551     540
     3            Young        1      402    5.52238    7089.08     365     368
     4   Female                2      440    5.49848    7112.43     409     403
     5   Male                  2      560    5.47146    6987.54     507     505
     6   Female   Old          3      258    5.47921    7105.98     242     234
     7   Female   Young        3      182    5.52641    7121.36     167     169
     8   Male     Old          3      340    5.44100    6939.35     309     306
     9   Male     Young        3      220    5.51899    7061.66     198     199
Notice that things are getting more complicated. Even though there are only two levels of Gender and two levels of AgeGroup, there are nine observations in the output data set with four values of _TYPE_. The _TYPE_ = 0 observation is the same as you saw earlier—it represents the grand mean, the mean of all non-missing values in the data set. To understand the other observations in this data set, you have to know either how to count in binary or how to use another PROC MEANS option—CHARTYPE. Let’s take the easy way out and rerun this program with the CHARTYPE procedure option as follows:
Program 16-13: Adding the CHARTYPE procedure option to PROC MEANS

   proc means data=learn.blood noprint chartype;
      class Gender AgeGroup;
      var RBC WBC;
      output out = summary
             mean =
             n    = / autoname;
   run;

A listing of the output now looks like this:

   Listing of SUMMARY

                  Age
   Obs   Gender   Group   _TYPE_   _FREQ_   RBC_Mean   WBC_Mean   RBC_N   WBC_N
     1                    00         1000    5.48353    7042.97     916     908
     2            Old     01          598    5.45779    7011.56     551     540
     3            Young   01          402    5.52238    7089.08     365     368
     4   Female           10          440    5.49848    7112.43     409     403
     5   Male             10          560    5.47146    6987.54     507     505
     6   Female   Old     11          258    5.47921    7105.98     242     234
     7   Female   Young   11          182    5.52641    7121.36     167     169
     8   Male     Old     11          340    5.44100    6939.35     309     306
     9   Male     Young   11          220    5.51899    7061.66     198     199
This may not look any simpler, but it is now easier to explain what _TYPE_ represents. First of all, the option CHARTYPE (this stands for character type) causes the _TYPE_ variable to be a character string of 1’s and 0’s. If you are familiar with counting in binary, you will observe that the character strings under the _TYPE_ column represent the previous _TYPE_ values, except they are now in a binary representation. (It’s OK if this last sentence is meaningless to you—that’s why SAS created the CHARTYPE option in the first place.) If you look at the CLASS statement, you see the two variables Gender and AgeGroup. If you imagine writing the left-most character of _TYPE_ under Gender and the other character of _TYPE_ under AgeGroup, there is a simple rule for interpreting the summary statistic for any value of _TYPE_. Here are the CLASS variables with the values of _TYPE_ written underneath (we will use the mean as an example):
   ------_TYPE_------
   Gender   AgeGroup   Interpretation
   0        0          Mean for all Genders and all AgeGroups
   0        1          Mean for each level of AgeGroup
   1        0          Mean for each level of Gender
   1        1          Mean for each combination of Gender and AgeGroup
Here is the rule: if there is a 1 under the CLASS variable, the statistics in the output data set represent the statistics computed on each level of that variable. For example, if you are computing means, the mean with a value of _TYPE_ equal to 10 is a mean computed at each level of Gender (for all age groups). If you want only the statistics broken down by each level of the CLASS variables, you can use the NWAY option. This selects only the largest value of _TYPE_. (If you use the CHARTYPE option, NWAY selects the value of _TYPE_ containing all 1’s.) You can use the _TYPE_ variable to select the appropriate breakdown. For example, you can use the _TYPE_ values in a WHERE statement in a PROC PRINT statement, like this:
Program 16-14: Using the _TYPE_ variable to select cell means

   title "Statistics Broken Down by Gender";
   proc print data=summary(drop = _freq_) noobs;
      where _TYPE_ = '10';
   run;

   Statistics Broken Down by Gender

            Age
   Gender   Group   _TYPE_   RBC_Mean   WBC_Mean   RBC_N   WBC_N
   Female           10        5.49848    7112.43     409     403
   Male             10        5.47146    6987.54     507     505
Because the values are broken down by Gender, AgeGroup has a missing value in each observation (it was left in the listing for demonstration purposes). You can also use this summary data set to create separate data sets: one for the grand mean, one for means broken down by Gender, one broken down by AgeGroup, and one containing cell means. You can do this in one DATA step, like this:
Program 16-15: Using a DATA step to create separate summary data sets

   data grand(drop = Gender AgeGroup)
        by_gender(drop = AgeGroup)
        by_age(drop = Gender)
        cellmeans;
      set summary;
      drop _type_;
      rename _freq_ = Number;
      if _type_ = '00' then output grand;
      else if _type_ = '01' then output by_age;
      else if _type_ = '10' then output by_gender;
      else if _type_ = '11' then output cellmeans;
   run;
Because you want a different selection of variables in each data set, you can use a KEEP or DROP data set option to make your selection. Because you want to drop _TYPE_ from every data set, it is easier to do this with a DROP statement in the DATA step. A RENAME statement renames _FREQ_ to Number. Again, this change applies to all four data sets. _TYPE_ is a character variable because of the CHARTYPE option, and you can use it to route each observation to the proper output data set.
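The binary reading of _TYPE_ described in this section can also be reproduced without CHARTYPE. The sketch below is an illustration on a made-up data set (Demo, with hypothetical values): it leaves _TYPE_ numeric and uses the BINARYw. format to write each value as a two-character string of 0's and 1's in a new variable, here called TypeBits.

```sas
/* Hypothetical data set with two classification variables */
data demo;
   input Gender $ AgeGroup $ RBC;
datalines;
F Old 5.1
F Young 5.4
M Old 5.0
M Young 5.7
F Old 5.3
M Young 5.5
;

/* No CHARTYPE here, so _TYPE_ is numeric (0, 1, 2, 3) */
proc means data=demo noprint;
   class Gender AgeGroup;
   var RBC;
   output out=summary_num mean=MeanRBC;
run;

/* BINARY2. writes 0, 1, 2, 3 as 00, 01, 10, 11 */
data typed;
   set summary_num;
   length TypeBits $ 2;
   TypeBits = put(_TYPE_, binary2.);
run;

title "Numeric _TYPE_ and Its Binary Representation";
proc print data=typed noobs;
   var Gender AgeGroup _TYPE_ TypeBits MeanRBC;
run;
```

Reading TypeBits from left to right, the first character lines up with Gender and the second with AgeGroup, which matches the rule given earlier for the CHARTYPE values.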
16.13 Selecting Different Statistics for Each Variable

There is an alternative way of selecting which variables and statistics you want in your summary data set, as shown in Program 16-16.

Program 16-16: Selecting different statistics for each variable using PROC MEANS

   proc means data=learn.blood noprint nway;
      class Gender AgeGroup;
      output out = summary(drop = _:)
             mean(RBC WBC)   =
             n(RBC WBC Chol) =
             median(Chol)    = / autoname;
   run;
We have added several new features to this program. Each statistic keyword is followed, in a set of parentheses, by a list of variables for which you want to compute this statistic. For example, this program computes means for RBC and WBC, the number of nonmissing values for all three variables (RBC, WBC, and Chol), and the median of Chol. Although you can name the newly created statistic following the equal sign as before, this program uses the AUTONAME output option to name them.

Finally, you may wonder about the rather strange-looking DROP= data set option in the SUMMARY data set. This DROP= option uses the colon wildcard notation. All variables beginning with an underscore character will be dropped. In general, name: used in any location where you would place a variable list refers to all variables beginning with the letters name. You can think of the colon the same way you use an asterisk as a DOS wildcard (if anyone still remembers DOS). A listing of the Summary data set follows:

   Listing of SUMMARY

            Age                                                     Chol_
   Gender   Group   RBC_Mean   WBC_Mean   RBC_N   WBC_N   Chol_N   Median
   Female   Old      5.47921    7105.98     242     234      208      198
   Female   Young    5.52641    7121.36     167     169      141      211
   Male     Old      5.44100    6939.35     309     306      279      195
   Male     Young    5.51899    7061.66     198     199      167      208
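The colon wildcard works in a VAR statement as well as in a DROP= option. Here is a minimal sketch, assuming a made-up data set Demo with three question variables (the names Ques1 to Ques3 and all values are hypothetical):

```sas
/* Hypothetical survey-style data set */
data demo;
   input ID $ Ques1-Ques3 Age;
datalines;
001 1 2 3 23
002 4 5 2 55
003 3 3 1 38
;

proc means data=demo noprint;
   var Ques:;                      /* every variable starting with Ques */
   output out=ques_means(drop=_:)  /* drops _TYPE_ and _FREQ_           */
          mean= / autoname;
run;

title "Means of All Variables Beginning with Ques";
proc print data=ques_means noobs;
run;
```

Age and ID are left out of the analysis because their names do not begin with Ques, and the two automatic variables disappear from the output because their names begin with an underscore.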
As you saw in the examples in this chapter, PROC MEANS can do a lot more than print summary statistics: it can create summary data sets broken down by one or more CLASS variables. By using options such as CHARTYPE and AUTONAME, you can simplify your programs and let SAS name your summary variables for you.
16.14 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Using the SAS data set College, compute the mean, median, minimum, and maximum and the number of both missing and non-missing values for the variables ClassRank and GPA. Report the statistics to two decimal places.

2. Repeat Problem 1, except compute the desired statistics for each combination of Gender and SchoolSize. Do this twice, once using a BY statement, and once using a CLASS statement.

   Note: The data set College has permanent formats for Gender, SchoolSize, and Scholarship. Either make this format catalog available to SAS (see Chapter 5), run the PROC FORMAT statements that follow, or use the system option NOFMTERR (no format error) that allows you to access SAS data sets that have permanent user-defined formats without causing an error.

   proc format;
      value $yesno  'Y','1' = 'Yes'
                    'N','0' = 'No'
                    ' '     = 'Not Given';
      value $size   'S' = 'Small'
                    'M' = 'Medium'
                    'L' = 'Large'
                    ' ' = 'Missing';
      value $gender 'F' = 'Female'
                    'M' = 'Male'
                    ' ' = 'Not Given';
   run;

3. Using the SAS data set College, report the mean and median GPA and ClassRank broken down by school size (SchoolSize). Do this twice, once using a BY statement, and once using a CLASS statement.

4. Repeat Problem 3 (CLASS statement only), except group small and medium school sizes together. Do this by writing a new format for SchoolSize (values are S, M, and L). Do not use any DATA steps.

5. Using the SAS data set College, report the mean GPA for the following categories of ClassRank: 0–50 = bottom half, 51–74 = 3rd quartile, and 75 to 100 = top quarter. Do this by creating an appropriate format. Do not use a DATA step.

6. Using the SAS data set College, create a summary data set (call it Class_Summary) containing the n, mean, and median of ClassRank and GPA for each value of SchoolSize. Use a CLASS statement and be sure that the summary data set only contains statistics for each level of SchoolSize. Use the AUTONAME option to name the variables in this data set.

7. Using the SAS data set College, create four summary data sets containing the number of non-missing and missing values and the mean, minimum, and maximum for ClassRank and GPA, broken down by Gender and SchoolSize. The first data set (Grand) should contain the statistics for all subjects, the second data set (ByGender) should contain the statistics broken down by Gender, the third data set (BySize) should contain the statistics broken down by SchoolSize, and the fourth data set (Cell) should contain the statistics broken down by Gender and SchoolSize. Do this by using PROC MEANS (with a CLASS statement) and one DATA step. Hint: Use the CHARTYPE procedure option.
Chapter 17
Counting Frequencies

17.1 Introduction
17.2 Counting Frequencies
17.3 Selecting Variables for PROC FREQ
17.4 Using Formats to Label the Output
17.5 Using Formats to Group Values
17.6 Problems Grouping Values with PROC FREQ
17.7 Displaying Missing Values in the Frequency Table
17.8 Changing the Order of Values in PROC FREQ
17.9 Producing Two-Way Tables
17.10 Requesting Multiple Two-Way Tables
17.11 Producing Three-Way Tables
17.12 Problems
17.1 Introduction

PROC FREQ can be used to count frequencies of both character and numeric variables, in one-way, two-way, and three-way tables. In addition, you can use PROC FREQ to create output data sets containing counts and percentages. Finally, if you are statistically inclined, you can use this procedure to compute various statistics such as chi-square, odds ratio, and relative risk.¹
¹ See Ron Cody and Jeffrey K. Smith, Applied Statistics and the SAS Programming Language, 5th ed. (Englewood Cliffs, NJ: Prentice Hall, 2005), which is available from SAS Press, for a discussion of the statistical applications of PROC FREQ.

17.2 Counting Frequencies

Let’s start out by running PROC FREQ with all the defaults, as shown in Program 17-1:

Program 17-1 Counting frequencies: one-way tables using PROC FREQ

title "PROC FREQ with all the Defaults";
proc freq data=learn.survey;
run;

The default action of PROC FREQ is to compute frequencies on every variable in your data set, both character and numeric. Here is a partial listing of the output from Program 17-1:

PROC FREQ with all the Defaults

The FREQ Procedure

                                Cumulative    Cumulative
ID     Frequency     Percent     Frequency       Percent
--------------------------------------------------------
001            1       14.29             1         14.29
002            1       14.29             2         28.57
003            1       14.29             3         42.86
004            1       14.29             4         57.14
005            1       14.29             5         71.43
006            1       14.29             6         85.71
007            1       14.29             7        100.00

                                Cumulative    Cumulative
Gender Frequency     Percent     Frequency       Percent
--------------------------------------------------------
F              3       42.86             3         42.86
M              4       57.14             7        100.00

                                Cumulative    Cumulative
Age    Frequency     Percent     Frequency       Percent
--------------------------------------------------------
22             1       14.29             1         14.29
23             1       14.29             2         28.57
38             1       14.29             3         42.86
45             1       14.29             4         57.14
55             1       14.29             5         71.43
63             1       14.29             6         85.71
67             1       14.29             7        100.00

                                Cumulative    Cumulative
Salary Frequency     Percent     Frequency       Percent
--------------------------------------------------------
23060          1       14.29             1         14.29
28000          1       14.29             2         28.57
36500          1       14.29             3         42.86
76100          1       14.29             4         57.14
76123          1       14.29             5         71.43
90000          1       14.29             6         85.71
128000         1       14.29             7        100.00

                                Cumulative    Cumulative
Ques1  Frequency     Percent     Frequency       Percent
--------------------------------------------------------
1              1       14.29             1         14.29
2              2       28.57             3         42.86
3              1       14.29             4         57.14
4              1       14.29             5         71.43
5              2       28.57             7        100.00
. . .

                                Cumulative    Cumulative
Ques5  Frequency     Percent     Frequency       Percent
--------------------------------------------------------
1              2       28.57             2         28.57
2              1       14.29             3         42.86
3              3       42.86             6         85.71
4              1       14.29             7        100.00
Take a moment to look at the frequencies for Gender. The column labeled Frequency tells you that there are three females and four males in the data set. Expressed as a percentage, this corresponds to 42.86% and 57.14%, respectively. The two columns labeled Cumulative Frequency and Cumulative Percent are a cumulative count of frequencies and percentages. As you can see, PROC FREQ computes counts for each unique value of a variable. For example, if you look at the variable Age, you see how many subjects are 22 years old, 23 years old, and so forth. You would probably prefer to see frequencies based on age groups. We'll get to that in a minute.
17.3 Selecting Variables for PROC FREQ

Because you will rarely want to compute frequencies on every variable in a data set, you need to include a TABLES statement to list the variables for which you want to compute frequencies. You may also want to eliminate the cumulative columns because they are not usually needed. The following program selects Gender and Ques1–Ques3 and eliminates the cumulative statistics as well.

Program 17-2 Adding a TABLES statement to PROC FREQ

title "Adding a TABLES Statement";
proc freq data=learn.survey;
   tables Gender Ques1-Ques3 / nocum;
run;
You use a TABLES statement (TABLE, singular, works as well) to list the variables you want PROC FREQ to include. NOCUM is a TABLES option that tells PROC FREQ not to include the two cumulative statistics columns in the output. Because NOCUM is an option in the TABLES statement, it follows a slash (this is the syntax for all statement options within a procedure). In this program, you obtain counts and percentages for the four variables Gender, Ques1, Ques2, and Ques3. Here is the output:

Adding a TABLES Statement

The FREQ Procedure

Gender Frequency     Percent
----------------------------
F              3       42.86
M              4       57.14

Ques1  Frequency     Percent
----------------------------
1              1       14.29
2              2       28.57
3              1       14.29
4              1       14.29
5              2       28.57

Ques2  Frequency     Percent
----------------------------
2              2       28.57
3              4       57.14
5              1       14.29

Ques3  Frequency     Percent
----------------------------
1              1       14.29
2              3       42.86
3              1       14.29
4              1       14.29
5              1       14.29
By the way, if you don’t want PROC FREQ to compute percentages, you can add the NOPERCENT option in the TABLES statement.
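As a minimal sketch of that option, the following step requests counts only for Gender. It assumes, as the earlier programs in this chapter do, that the libref Learn already points to the library containing Survey. The title text is made up for illustration.

```sas
/* Sketch only: assumes the LEARN libref is already assigned */
title "Counts Only";
proc freq data=learn.survey;
   /* NOCUM drops the cumulative columns and NOPERCENT drops the Percent column */
   tables Gender / nocum nopercent;
run;
```

With both options in effect, the listing should show only the Frequency column for each value of Gender.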
17.4 Using Formats to Label the Output

It is easy enough to realize that F stands for Female and M for Male in the listing for Gender. However, it would improve the appearance of the report if you used a format to label these values. You might also want to supply a format for the Ques variables. Here is Program 17-2 with formats added:

Program 17-3 Adding formats to Program 17-2

proc format;
   value $gender 'F' = 'Female'
                 'M' = 'Male';
   value $likert '1' = 'Strongly disagree'
                 '2' = 'Disagree'
                 '3' = 'No opinion'
                 '4' = 'Agree'
                 '5' = 'Strongly agree';
run;

title "Adding Formats";
proc freq data=learn.survey;
   tables Gender Ques1-Ques3 / nocum;
   format Gender $gender.
          Ques1-Ques3 $likert.;
run;
Adding formats to these variables greatly improves the readability of the output as shown here (partial listing):

Adding Formats

The FREQ Procedure

Gender Frequency     Percent
----------------------------
Female         3       42.86
Male           4       57.14

. . .

Ques3               Frequency     Percent
-----------------------------------------
Strongly disagree           1       14.29
Disagree                    3       42.86
No opinion                  1       14.29
Agree                       1       14.29
Strongly agree              1       14.29
17.5 Using Formats to Group Values

Because PROC FREQ computes frequencies on formatted values, you can use formats to group values together into larger categories. Data set Survey contains an Age variable. Suppose you want to generate frequencies for age, broken down into three age groups. You also want to look at question 5 (Ques5) with the values 1 and 2 combined into a Generally disagree category and values 4 and 5 combined into a Generally agree category. You can use formats to accomplish these tasks, as shown in the next program:

Program 17-4 Using formats to group values

proc format;
   value agegroup low-<30 = 'Less than 30'
                  30-<60  = '30 to 59'
                  60-high = '60 and higher';
   value $agree_disagree '1','2' = 'Generally disagree'
                         '3'     = 'No opinion'
                         '4','5' = 'Generally agree';
run;

title "Using Formats to Create Groups";
proc freq data=learn.survey;
   tables Age Ques5 / nocum nopercent;
   format Age agegroup.
          Ques5 $agree_disagree.;
run;
This is a useful technique because you don’t have to create a new data set. If you want to see frequencies for different groupings, you only need to make a new format. Here is the output from Program 17-4:

Using Formats to Create Groups

The FREQ Procedure

Age             Frequency
-------------------------
Less than 30            2
30 to 59                3
60 and higher           2

Ques5                Frequency
------------------------------
Generally disagree           3
No opinion                   3
Generally agree              1
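To illustrate the point that a different grouping needs only a new format, here is a minimal sketch that splits Age into two groups instead of three. The format name agetwo, its labels, and the cutoff at 40 are hypothetical choices made for this example; the Survey data set and the Learn libref are the ones used throughout the chapter.

```sas
/* Hypothetical format: two age groups instead of three */
proc format;
   value agetwo low-<40 = 'Under 40'
                40-high = '40 and older';
run;

title "Regrouping Age with a Different Format";
proc freq data=learn.survey;
   tables Age / nocum nopercent;
   /* Only the FORMAT statement changes; no new data set is needed */
   format Age agetwo.;
run;
```

Given the seven ages listed in the output of Program 17-1 (22, 23, 38, 45, 55, 63, and 67), you would expect three subjects in the first group and four in the second.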
17.6 Problems Grouping Values with PROC FREQ

A problem can occur when PROC FREQ uses formatted values to create groups. When you use the keyword OTHER as a range when you create a format, all values that do not match a format range are grouped together. PROC FREQ represents each formatted group by the lowest internal value in that group, so if a missing value falls into the OTHER group, the whole group is treated as missing. As an example, you have a data set (Grouping) with the following values:

Obs   X        Obs   X          Obs   X
  1   2          5   4            9   5
  2   2          6   4           10   5
  3   3          7   4           11   5
  4   3          8   missing     12   6

You write the following program:

Program 17-5 Demonstrating a problem in how PROC FREQ groups values

proc format;
   value two low-3 = 'Group 1'
             4-5   = 'Group 2'
             other = 'Other values';
run;

title "Grouping Values (First Try)";
proc freq data=learn.grouping;
   tables X / nocum nopercent;
   format X two.;
run;
Looking at the values in data set Grouping, you would expect to see four values in Group 1, six values in Group 2, and one value (the 6) in the Other values category, with the single missing value reported separately. Here is what you get:

Grouping Values (First Try)

The FREQ Procedure

X             Frequency
-----------------------
Group 1               4
Group 2               6

Frequency Missing = 2
Because the values of missing and 6 both fall into the OTHER category, they are assigned the smallest value of the two (missing), resulting in this output. To fix this problem, all you need to do is assign a separate category for missing values when you create your format, as in the following example:
Program 17-6 Fixing the grouping problem

proc format;
   value two low-3 = 'Group 1'
             4-5   = 'Group 2'
             .     = 'Missing'
             other = 'Other values';
run;
Now, missing values and other values are not placed into the same group. Here is the result:

Grouping Values (Fixing the Problem)

The FREQ Procedure

X             Frequency
-----------------------
Group 1               4
Group 2               6
Other values          1

Frequency Missing = 1
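Program 17-6 shows only the revised format. As a sketch of a complete run, the steps below key in the twelve values of X from the listing earlier in this section as a temporary data set and then rerun PROC FREQ. The WORK data set name Grouping is a stand-in for Learn.Grouping and is an assumption made so that the example is self-contained.

```sas
/* Stand-in for learn.grouping, typed in from the listing of X values */
data grouping;
   input X @@;
datalines;
2 2 3 3 4 4 4 . 5 5 5 6
;

/* Missing values get their own range so OTHER holds only the value 6 */
proc format;
   value two low-3 = 'Group 1'
             4-5   = 'Group 2'
             .     = 'Missing'
             other = 'Other values';
run;

title "Grouping Values (Fixing the Problem)";
proc freq data=grouping;
   tables X / nocum nopercent;
   format X two.;
run;
```

Because the TABLES statement does not include the MISSING option, the missing value is still reported below the table rather than as a row in it.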
17.7 Displaying Missing Values in the Frequency Table

SAS normally tells you the frequency of missing values in a separate listing below the frequency table. You can ask SAS to treat missing values just like any other values and include them in the frequency table by including the TABLES option MISSING. Not only does this option bring the count of missing values up into the main table, it also changes how SAS reports percentages. Without the MISSING option, percentages are computed as the frequency of each category divided by the number of non-missing values; with the MISSING option, SAS computes percentages by dividing the frequencies by the total number of observations (missing and non-missing). To be sure this is clear, let’s run PROC FREQ with and without the MISSING option:

Program 17-7 Demonstrating the effect of the MISSING option of PROC FREQ

title "PROC FREQ Using the MISSING Option";
proc freq data=learn.grouping;
   tables X / missing;
   format X two.;
run;

title "PROC FREQ Without the MISSING Option";
proc freq data=learn.grouping;
   format X two.;
   tables X;
run;
Here is the output:

PROC FREQ Using the MISSING Option

The FREQ Procedure

                                      Cumulative    Cumulative
X            Frequency     Percent     Frequency       Percent
--------------------------------------------------------------
Missing              1        8.33             1          8.33
Group 1              4       33.33             5         41.67
Group 2              6       50.00            11         91.67
Other values         1        8.33            12        100.00

PROC FREQ Without the MISSING Option

The FREQ Procedure

                                      Cumulative    Cumulative
X            Frequency     Percent     Frequency       Percent
--------------------------------------------------------------
Group 1              4       36.36             4         36.36
Group 2              6       54.55            10         90.91
Other values         1        9.09            11        100.00

Frequency Missing = 1
Look at the Percent column for Group 1 in both of these tables. With the MISSING option, the value, 33.33%, is obtained by dividing the frequency for Group 1 (4) by the total number of observations (12). Without the MISSING option, the percent, 36.36%, is obtained by dividing the frequency for Group 1 (4) by the number of non-missing values (11).
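The arithmetic behind those two percentages can be checked with a short DATA step. This is only a sketch: the variable names are made up for illustration, and the counts (4, 11, and 12) are taken from the two tables above.

```sas
data _null_;
   group1_freq = 4;    /* frequency of Group 1 */
   n_nonmiss   = 11;   /* non-missing observations */
   n_total     = 12;   /* all observations, missing included */

   /* With MISSING the denominator is every observation */
   pct_with_missing    = 100 * group1_freq / n_total;
   /* Without MISSING the denominator is the non-missing count */
   pct_without_missing = 100 * group1_freq / n_nonmiss;

   put pct_with_missing= 6.2 pct_without_missing= 6.2;
run;
```

The values written to the SAS log should agree with the 33.33 and 36.36 shown in the Percent column of the two listings.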
17.8 Changing the Order of Values in PROC FREQ

By default, PROC FREQ orders the values based on internal values (even if a variable has a format). You can change the order of the frequency listings with the ORDER= procedure option. Valid values for this option are as follows:

ORDER = internal     (Default) Orders values by their internal value
ORDER = formatted    Orders values by their formatted values
ORDER = freq         Orders values from the most frequent to the least frequent
ORDER = data         Orders values based on their order in the input data set
Note that these options do not affect missing values, which are always listed first. To demonstrate how these options work, we start with a data set (Test) with values of 1, 2, 3, 4, and missing, with frequencies as shown in the table here:

Internal Value    Formatted Value    Frequency
1                 Yellow             2
2                 Blue               3
3                 Red                4
4                 Green              1
Using these values, you create a format that corresponds to the values in the table and then run PROC FREQ like this:
Program 17-8 Demonstrating the ORDER= option of PROC FREQ

proc format;
   value darwin 1 = 'Yellow'
                2 = 'Blue'
                3 = 'Red'
                4 = 'Green'
                . = 'Missing';
run;

data test;
   input Color @@;
datalines;
3 4 1 2 3 3 3 1 2 2
;

title "Default Order (Internal)";
proc freq data=test;
   tables Color / nocum nopercent missing;
   format Color darwin.;
run;
Here is the output:

Default Order (Internal)

The FREQ Procedure

Color     Frequency
-------------------
Yellow            2
Blue              3
Red               4
Green             1
Notice that the values are ordered by the internal values of Color. The next program demonstrates how the other ORDER= options work:
Program 17-9 Demonstrating the ORDER= formatted, data, and freq options

title "ORDER = formatted";
proc freq data=test order=formatted;
   tables Color / nocum nopercent;
   format Color darwin.;
run;

title "ORDER = data";
proc freq data=test order=data;
   tables Color / nocum nopercent;
   format Color darwin.;
run;

title "ORDER = freq";
proc freq data=test order=freq;
   tables Color / nocum nopercent;
   format Color darwin.;
run;
Here is the output:

ORDER = formatted

The FREQ Procedure

Color     Frequency
-------------------
Blue              3
Green             1
Red               4
Yellow            2

ORDER = data

The FREQ Procedure

Color     Frequency
-------------------
Red               4
Green             1
Yellow            2
Blue              3

ORDER = freq

The FREQ Procedure

Color     Frequency
-------------------
Red               4
Blue              3
Yellow            2
Green             1
ORDER=formatted alphabetizes the list based on the formatted values. In order to understand the ORDER=data option, notice that the first four observations in the data set Test are 3, 4, 1, and 2. This order controls the order of the frequencies in the table. Keep in mind that ORDER=data produces tables that may change if you run the program with different data or data in a different order. Finally, the ORDER=freq option lists the values from the most frequent to the least frequent. This last option is especially useful when you have a large number of categories and you want to see which values are most frequent.
17.9 Producing Two-Way Tables

You can easily produce two-way tables (also called crosstab tables) by specifying the row and column variables in a TABLES statement, separated by an asterisk (*). For example, to see a table of gender by blood type in the Blood data set, you would write the following:

Program 17-10 Requesting a two-way table

title "A Two-way Table of Gender by Blood Type";
proc freq data=learn.blood;
   tables Gender * BloodType;
run;
The asterisk between Gender and BloodType tells PROC FREQ that you want a two-way table with Gender forming the rows of the table and BloodType forming the columns.
Here is the table:

A Two-way Table of Gender by Blood Type

The FREQ Procedure

Table of Gender by BloodType

Gender(Gender)     BloodType(Blood Type)

Frequency|
Percent  |
Row Pct  |
Col Pct  |A       |AB      |B       |O       |  Total
---------+--------+--------+--------+--------+
Female   |    178 |     20 |     34 |    208 |    440
         |  17.80 |   2.00 |   3.40 |  20.80 |  44.00
         |  40.45 |   4.55 |   7.73 |  47.27 |
         |  43.20 |  45.45 |  35.42 |  46.43 |
---------+--------+--------+--------+--------+
Male     |    234 |     24 |     62 |    240 |    560
         |  23.40 |   2.40 |   6.20 |  24.00 |  56.00
         |  41.79 |   4.29 |  11.07 |  42.86 |
         |  56.80 |  54.55 |  64.58 |  53.57 |
---------+--------+--------+--------+--------+
Total         412       44       96      448     1000
            41.20     4.40     9.60    44.80   100.00
The upper left-hand corner of the table is the key to the four numbers in each box. For example, looking at females with blood type A, the first number (Frequency = 178) tells you there are 178 females with type A blood. The second number (17.80) is the percentage of all subjects who were females with type A blood. The third number in the box is a row percentage. The 40.45 in this cell tells you that of all the females (which form a row in the table), 40.45% have type A blood. Finally, the fourth number in each cell is a column percentage. In this same cell, 43.20% of those subjects with type A blood are female.
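As a quick check on how the three percentages in the Female, type A cell are derived, the following DATA step repeats the arithmetic using counts read from the table (178 in the cell, 440 females, 412 subjects with type A blood, and 1000 subjects overall). It is a sketch only, and the variable names are invented for the example.

```sas
data _null_;
   cell      = 178;    /* females with type A blood */
   row_total = 440;    /* all females */
   col_total = 412;    /* all subjects with type A blood */
   n         = 1000;   /* all subjects in the table */

   percent = 100 * cell / n;           /* Percent: share of all subjects */
   row_pct = 100 * cell / row_total;   /* Row Pct: share of females */
   col_pct = 100 * cell / col_total;   /* Col Pct: share of type A subjects */

   put percent= 6.2 row_pct= 6.2 col_pct= 6.2;
run;
```

The three values written to the log should match the Percent (17.80), Row Pct (40.45), and Col Pct (43.20) entries printed in that cell of the table.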
17.10 Requesting Multiple Two-Way Tables

You can request multiple two-way tables in several ways. For example, if you want to see one row variable broken down by several column variables, you can use a TABLES statement like this:

tables A * (B C D);

This statement generates three tables: A by B, A by C, and A by D. You can supply a list of variables (in parentheses) for both the row and column variables like this:

tables (A B) * (C D);
This request generates four tables: A by C, A by D, B by C, and B by D. Remember that you can use any one of the short-hand notations that SAS allows for referring to a group of variables. To review, you have the following:

Notation        Result
Ques1-Ques10    Ques1, Ques2, … Ques10
VarA--VarB      All variables from VarA to VarB in the order in which they appear in the SAS data set
Ques:           All variables that begin with 'Ques'
_numeric_       All numeric variables in the SAS data set
_character_     All character variables in the SAS data set
_all_           All variables in the SAS data set
Of course, you couldn’t use _all_ in a two-way table request; it was included in the table for completeness.
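As a concrete sketch of these ideas, the step below combines a parenthesized list with a numbered range to request Gender by each of the first three questions in the Survey data set used earlier in this chapter. The choice of variables and the title are illustrative only, and the Learn libref is assumed to be assigned.

```sas
/* Sketch only: assumes the LEARN libref is already assigned */
title "Gender by Each of Three Questions";
proc freq data=learn.survey;
   /* Requests three two-way tables, one for each question in the list */
   tables Gender * (Ques1-Ques3);
run;
```

This should produce three tables: Gender by Ques1, Gender by Ques2, and Gender by Ques3.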
17.11 Producing Three-Way Tables

You may wonder, “How can I display three dimensions on a piece of paper?” You can do so when you request a three-way table in the form:

tables Page * Row * Column;

SAS uses separate pages for each value of Page and displays a table of Row by Column on each page. Unless you have a relative in the paper business, you should be very cautious when submitting three-way tables because the output can become quite large. As an example of a three-way table, Program 17-11 produces a table of AgeGroup by BloodType for each value of Gender.
Program 17-11 Requesting a three-way table with PROC FREQ

title "Example of a Three-way Table";
proc freq data=learn.blood;
   tables Gender * AgeGroup * BloodType / nocol norow nopercent;
run;
Three table options, NOCOL, NOROW, and NOPERCENT, were added to this program. They eliminate the column percentage, row percentage, and overall percentage figures from the table. Here is the output:

Example of a Three-way Table

The FREQ Procedure

Table 1 of AgeGroup by BloodType
Controlling for Gender=Female

AgeGroup(Age Group)     BloodType(Blood Type)

Frequency|A       |AB      |B       |O       |  Total
---------+--------+--------+--------+--------+
Old      |    110 |     11 |     18 |    119 |    258
---------+--------+--------+--------+--------+
Young    |     68 |      9 |     16 |     89 |    182
---------+--------+--------+--------+--------+
Total         178       20       34      208      440

Table 2 of AgeGroup by BloodType
Controlling for Gender=Male

AgeGroup(Age Group)     BloodType(Blood Type)

Frequency|A       |AB      |B       |O       |  Total
---------+--------+--------+--------+--------+
Old      |    143 |     15 |     41 |    141 |    340
---------+--------+--------+--------+--------+
Young    |     91 |      9 |     21 |     99 |    220
---------+--------+--------+--------+--------+
Total         234       24       62      240      560
For a discussion of how to use PROC FREQ to create a data set of frequencies, please refer to Chapter 24, Section 24.3.
17.12 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Using the SAS data set Blood, generate one-way frequencies for the variables Gender, BloodType, and AgeGroup. Use the appropriate options to omit the cumulative statistics and percentages.

2. Using the SAS data set BloodPressure, generate frequencies for the variable Age. Use a user-defined format to group ages into three categories: 40 and younger, 41 to 60, and 61 and older. Use the appropriate options to omit the cumulative statistics and percentages.

3. Using the data set Blood, produce frequencies for the variable Chol (cholesterol). Use a format to group the frequencies into three groups: low to 200 (normal), 201 and higher (high), and missing. Run PROC FREQ twice, once using the MISSING option, and once without. Compare the percentages in both listings.

4. Using the SAS data set Voter, produce two-way tables for Party by each of the four questions (Ques1–Ques4).

5. Using the SAS data set College, create a two-way table of Scholarship (rows) by ClassRank (columns). Use a user-defined format to group class rank into two groups: 70 and lower, and 71 and higher. (Please see the note in Chapter 16, Problem 2, about the permanent formats used in this data set.)

6. Using the SAS data set College, produce a three-way table of Gender (page) by Scholarship (row) by SchoolSize (column).

7. Using the SAS data set Blood, produce a table of frequencies for BloodType, in frequency order.
Chapter 18
Creating Tabular Reports

18.1 Introduction
18.2 A Simple PROC TABULATE Table
18.3 Describing the Three PROC TABULATE Operators
18.4 Using the Keyword ALL
18.5 Producing Descriptive Statistics
18.6 Combining CLASS and Analysis Variables in a Table
18.7 Customizing Your Table
18.8 Demonstrating a More Complex Table
18.9 Computing Row and Column Percentages
18.10 Displaying Percentages in a Two-Dimensional Table
18.11 Computing Column Percentages
18.12 Computing Percentages on Numeric Variables
18.13 Understanding How Missing Values Affect PROC TABULATE Output
18.14 Problems
18.1 Introduction

PROC TABULATE is an underused, underappreciated procedure that can create a wide variety of tabular reports, displaying frequencies, percentages, and descriptive statistics (such as sums and means) broken down by one or more CLASS variables. There are several excellent books devoted to PROC TABULATE.¹ This chapter introduces you to this procedure and, we hope, piques your interest.

To demonstrate many of the features of PROC TABULATE, we will use a data set called Blood that has information on blood types, genders, age groups, red and white blood cell counts, and cholesterol levels. A listing showing the first 10 observations of this data set follows:

Listing of Blood (first 10 observations)

                     Blood    Age
Subject    Gender    Type     Group     WBC     RBC    Chol
      1    Female    AB       Young    7710    7.40     258
      2    Male      AB       Old      6560    4.70       .
      3    Male      A        Young    5690    7.53     184
      4    Male      B        Old      6680    6.85       .
      5    Male      A        Young       .    7.72     187
      6    Male      A        Old      6140    3.69     142
      7    Female    A        Young    6550    4.78     290
      8    Male      O        Old      5200    4.96     151
      9    Male      O        Young       .    5.66     311
     10    Female    O        Young    7710    5.55       .
¹ See Lauren E. Haworth, PROC TABULATE By Example (Cary, NC: SAS Institute Inc., 1999), for one of the best and most complete books on this topic.

18.2 A Simple PROC TABULATE Table

Although PROC TABULATE can create complex tables, there are only three operators that control a table’s appearance. Let’s start out by running PROC TABULATE with all the defaults and selecting a single CLASS variable, like this:

Program 18-1 PROC TABULATE with all the defaults and a single CLASS variable

title "All Defaults with One CLASS Variable";
proc tabulate data=learn.blood;
   class Gender;
   table Gender;
run;
PROC TABULATE uses a CLASS statement to allow you to specify variables that represent categories (often character variables) where you want to compute frequencies or percentages. In Program 18-1, Gender is selected as the CLASS variable. The TABLE statement specifies the table’s appearance. Any variable listed in a TABLE statement must be listed in a CLASS statement or a VAR statement (to be discussed shortly). Here is the output:

All Defaults with One CLASS Variable

+-------------------------+
|         Gender          |
+------------+------------+
|   Female   |    Male    |
+------------+------------+
|     N      |     N      |
+------------+------------+
|      440.00|      560.00|
+------------+------------+
With only a single CLASS variable listed in the TABLE statement, PROC TABULATE computes frequencies for each value of the selected variable. It also displays these frequencies as columns. You can see here that there are 440 females and 560 males in the data set.
18.3 Describing the Three PROC TABULATE Operators

18.3.1 Concatenation

The first tabulate operator we will discuss is concatenation. That is, you can place information about several variables next to each other in a table. A space between variables in a TABLE statement is used to concatenate table values. If you want to see Gender and BloodType as columns in a table, you can submit the following tabulate statements:

Program 18-2 Demonstrating concatenation with PROC TABULATE

title "Demonstrating Concatenation";
proc tabulate data=learn.blood format=6.;
   class Gender BloodType;
   table Gender BloodType;
run;
The space between Gender and BloodType in the TABLE statement represents concatenation. The procedure option, FORMAT=, was included in this program. If you specify this option, the selected format is used for every cell in the table. Later, you will see how to control the format of every individual cell of a PROC TABULATE listing. Here is the table:

Demonstrating Concatenation

+-------------+---------------------------+
|   Gender    |        Blood Type         |
+------+------+------+------+------+------+
|Female| Male |  A   |  AB  |  B   |  O   |
+------+------+------+------+------+------+
|  N   |  N   |  N   |  N   |  N   |  N   |
+------+------+------+------+------+------+
|   440|   560|   412|    44|    96|   448|
+------+------+------+------+------+------+
Notice that the frequencies for BloodType have been placed next to the frequencies for Gender.
18.3.2 Table Dimensions (Page, Row, and Column)

The comma used in a TABLE statement controls the table’s dimensions. If you do not include any commas in the TABLE statement, all of your table information displays as columns. If you include a single comma, the specification following the comma generates columns in the table; the specification before the comma generates rows in the table. To demonstrate, Program 18-3 shows a request for blood type frequencies to be displayed as columns and gender values to be displayed as rows.

Program 18-3 Demonstrating table dimensions with PROC TABULATE

title "Demonstrating Table Dimensions";
proc tabulate data=learn.blood format=6.;
   class Gender BloodType;
   table Gender, BloodType;
run;

Demonstrating Table Dimensions

+----------------+---------------------------+
|                |        Blood Type         |
|                +------+------+------+------+
|                |  A   |  AB  |  B   |  O   |
|                +------+------+------+------+
|                |  N   |  N   |  N   |  N   |
+----------------+------+------+------+------+
|Gender          |      |      |      |      |
+----------------+      |      |      |      |
|Female          |   178|    20|    34|   208|
+----------------+------+------+------+------+
|Male            |   234|    24|    62|   240|
+----------------+------+------+------+------+
If you use two commas in a table request, the first specification represents pages, the second, rows, and the last, columns.
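A minimal sketch of a two-comma request follows. It assumes the AgeGroup variable in the Blood data set (the same variable used in Program 17-11) and the Learn libref used throughout the chapter; the title is made up for illustration.

```sas
/* Sketch only: assumes the LEARN libref is already assigned */
title "Page, Row, and Column Dimensions";
proc tabulate data=learn.blood format=6.;
   /* Every variable named in the TABLE statement must appear in CLASS or VAR */
   class Gender AgeGroup BloodType;
   /* First specification = pages, second = rows, third = columns */
   table Gender, AgeGroup, BloodType;
run;
```

With this request you would expect one page for each value of Gender, each page holding a table with AgeGroup values as rows and BloodType values as columns.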
18.3.3 Nesting

The third TABLE operator is the asterisk (*). If you specify variable 1 * variable 2, variable 2 is nested within variable 1. To demonstrate, here is a request to nest BloodType within Gender:

Program 18-4 Demonstrating the nesting operator with PROC TABULATE

title "Demonstrating Nesting";
proc tabulate data=learn.blood format=6.;
   class Gender BloodType;
   table Gender * BloodType;
run;
Here is the resulting table:

Demonstrating Nesting

+-------------------------------------------------------+
|                        Gender                         |
+---------------------------+---------------------------+
|          Female           |           Male            |
+---------------------------+---------------------------+
|        Blood Type         |        Blood Type         |
+------+------+------+------+------+------+------+------+
|  A   |  AB  |  B   |  O   |  A   |  AB  |  B   |  O   |
+------+------+------+------+------+------+------+------+
|  N   |  N   |  N   |  N   |  N   |  N   |  N   |  N   |
+------+------+------+------+------+------+------+------+
|   178|    20|    34|   208|   234|    24|    62|   240|
+------+------+------+------+------+------+------+------+
For each value of Gender, you see the frequencies of each of the values of BloodType. Notice that you can add spaces between any of the operators (for example, there are spaces around the asterisk in the previous program) to make your TABLE statements more readable. If you do not place a comma or asterisk between two variable names, they are concatenated in the table.
18.4 Using the Keyword ALL

Unlike many other SAS procedures, there are some reserved keywords used by PROC TABULATE (which also means you have to be careful not to name any of your variables the same as a PROC TABULATE keyword). One such keyword is ALL. When you place ALL after a variable name in a table request, PROC TABULATE includes a column (or a row, if the variable is in the row dimension) representing all levels of the preceding variable. To demonstrate, you can add ALL to both dimensions of the table produced by Program 18-3, as follows:

Program 18-5 Adding the keyword ALL to your table request

title "Adding the Keyword ALL to the TABLE Request";
proc tabulate data=learn.blood format=6.;
   class Gender BloodType;
   table Gender ALL,
         BloodType ALL;
run;
Here is the resulting table:

Adding the Keyword ALL to the TABLE Request

+----------------+---------------------------+------+
|                |        Blood Type         |      |
|                +------+------+------+------+      |
|                |  A   |  AB  |  B   |  O   | All  |
|                +------+------+------+------+------+
|                |  N   |  N   |  N   |  N   |  N   |
+----------------+------+------+------+------+------+
|Gender          |      |      |      |      |      |
+----------------+      |      |      |      |      |
|Female          |   178|    20|    34|   208|   440|
+----------------+------+------+------+------+------+
|Male            |   234|    24|    62|   240|   560|
+----------------+------+------+------+------+------+
|All             |   412|    44|    96|   448|  1000|
+----------------+------+------+------+------+------+
Because there is a space between ALL and the preceding variable, the frequency associated with ALL is concatenated to the appropriate variable.
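The same keyword works in a table with a single dimension. The following sketch, which is not one of the book's numbered programs, concatenates ALL to BloodType so that a total column follows the four blood types. The title is made up for illustration.

```sas
/* Sketch only: assumes the LEARN libref is already assigned */
title "One-Dimensional Table with ALL";
proc tabulate data=learn.blood format=6.;
   class BloodType;
   /* The space concatenates an All column to the BloodType columns */
   table BloodType ALL;
run;
```

Judging from the All row of the table above, the columns should show counts of 412, 44, 96, and 448 for the four blood types, followed by 1000 under All.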
18.5 Producing Descriptive Statistics

PROC TABULATE can also produce descriptive statistics such as sums and means. To specify that you want to generate descriptive statistics rather than frequency counts, you list all of your analysis variables (that is, the ones for which you want statistics) in a VAR statement. Descriptive statistics may appear on any dimension of a table, but you may only define statistics on a single dimension of a table. Program 18-6 computes descriptive statistics for the RBC (red blood cell) and WBC (white blood cell) counts:

Program 18-6 Using PROC TABULATE to produce descriptive statistics

title "Demonstrating Analysis Variables";
proc tabulate data=learn.blood;
   var RBC WBC;
   table RBC WBC;
run;
The resulting output follows:

Demonstrating Analysis Variables

+------------+------------+
|    RBC     |    WBC     |
+------------+------------+
|    Sum     |    Sum     |
+------------+------------+
|     5022.91|  6395020.00|
+------------+------------+
Because the operator between RBC and WBC was a space, the analysis columns are concatenated. The default descriptive statistic is the sum (not a very useful statistic for blood counts). To specify one or more analyses for each variable, you use keywords (sum, mean, min, max, std, and so on) to specify which statistics you want and you associate these statistics with a variable using the nesting (asterisk) operator. For example, to specify the mean rather than a sum for RBC and WBC, you would use the code in Program 18-7.
Program 18-7 Specifying statistics on an analysis variable with PROC TABULATE

title "Specifying Statistics";
proc tabulate data=learn.blood;
   var RBC WBC;
   table RBC*mean WBC*mean;
run;
To save typing, you can also write the TABLE statement like this:

   table (RBC WBC)*mean;
Either way, you now have a table showing the mean (average) blood counts:

Specifying Statistics

+------------+------------+
|    RBC     |    WBC     |
+------------+------------+
|    Mean    |    Mean    |
+------------+------------+
|        5.48|     7042.97|
+------------+------------+
You can specify several statistics for each variable as well. For example, if you would like the mean, minimum, and maximum values for the blood counts, you could write Program 18-8:
Program 18-8 Specifying more than one descriptive statistic with PROC TABULATE

title "Specifying More than One Statistic";
proc tabulate data=learn.blood format=comma9.2;
   var RBC WBC;
   table (RBC WBC)*(mean min max);
run;
Each statistic is now listed for each of the two variables, like this:

Specifying More than One Statistic

+-----------------------------+-----------------------------+
|             RBC             |             WBC             |
+---------+---------+---------+---------+---------+---------+
|  Mean   |   Min   |   Max   |  Mean   |   Min   |   Max   |
+---------+---------+---------+---------+---------+---------+
|     5.48|     1.71|     8.75| 7,042.97| 4,070.00|10,550.00|
+---------+---------+---------+---------+---------+---------+
Program 18-8 uses the PROC TABULATE option FORMAT=COMMA9.2 to improve the readability of the output. You now have the mean, minimum, and maximum values for the RBC and WBC counts, each displayed with the COMMA9.2 format.
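The same pattern extends to other statistic keywords. As a hedged sketch (a variation on Program 18-8, assuming the same learn.blood data set), the following request adds N, NMISS, MEDIAN, and STD:

```sas
title "Requesting N, NMISS, MEAN, MEDIAN, and STD";
proc tabulate data=learn.blood format=comma9.2;
   var RBC WBC;
   /* Each keyword in the second set of parentheses is
      nested under each variable in the first set */
   table (RBC WBC)*(n nmiss mean median std);
run;
```

Because the procedure-level format applies to every cell, N and NMISS would print with two decimal places in this version.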
18.6 Combining CLASS and Analysis Variables in a Table

What if you want to see the mean value for RBC and WBC broken down by one or more of the CLASS variables (such as Gender or BloodType)? You can combine frequencies and descriptive statistics in a single table. Here is where some people complain that PROC TABULATE is very difficult to use. Actually, the problem is usually not with PROC TABULATE; more likely, the user did not decide what the table should look like before starting to write the TABULATE statements. It also takes some practice to arrange a table so that columns are not split between pages (which makes the table very hard to read). The next program is a request for the mean RBC, WBC, and Chol (cholesterol), broken down by Gender and AgeGroup.
Program 18-9 Combining CLASS and analysis variables in a table

title "Combining CLASS and Analysis Variables";
proc tabulate data=learn.blood format=comma11.2;
   class Gender AgeGroup;
   var RBC WBC Chol;
   table (Gender ALL)*(AgeGroup All),
         (RBC WBC Chol)*mean;
run;
This request nests AgeGroup and ALL within each value of Gender and ALL. The means of RBC, WBC, and Chol are displayed as columns of the table, like this:

Combining CLASS and Analysis Variables

+----------------+-----------+-----------+-----------+
|                |    RBC    |    WBC    |Cholesterol|
|                +-----------+-----------+-----------+
|                |   Mean    |   Mean    |   Mean    |
+-------+--------+-----------+-----------+-----------+
|Gender |Age     |           |           |           |
+-------+Group   |           |           |           |
|Female +--------+           |           |           |
|       |Old     |       5.48|   7,105.98|     195.88|
|       +--------+-----------+-----------+-----------+
|       |Young   |       5.53|   7,121.36|     212.27|
|       +--------+-----------+-----------+-----------+
|       |All     |       5.50|   7,112.43|     202.50|
+-------+--------+-----------+-----------+-----------+
|Male   |Age     |           |           |           |
|       |Group   |           |           |           |
|       +--------+           |           |           |
|       |Old     |       5.44|   6,939.35|     199.12|
|       +--------+-----------+-----------+-----------+
|       |Young   |       5.52|   7,061.66|     203.08|
|       +--------+-----------+-----------+-----------+
|       |All     |       5.47|   6,987.54|     200.60|
+-------+--------+-----------+-----------+-----------+
|All    |Age     |           |           |           |
|       |Group   |           |           |           |
|       +--------+           |           |           |
|       |Old     |       5.46|   7,011.56|     197.73|
|       +--------+-----------+-----------+-----------+
|       |Young   |       5.52|   7,089.08|     207.29|
|       +--------+-----------+-----------+-----------+
|       |All     |       5.48|   7,042.97|     201.44|
+-------+--------+-----------+-----------+-----------+
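The statistics do not have to be on the column dimension. As a sketch of an alternative layout (not one of the numbered programs, and assuming the same learn.blood data set), the request below places the analysis variables and their means on the row dimension and Gender on the column dimension:

```sas
title "Statistics on the Row Dimension";
proc tabulate data=learn.blood format=comma11.2;
   class Gender;
   var RBC WBC Chol;
   /* Row dimension: each analysis variable with MEAN nested under it;
      column dimension: the levels of Gender followed by ALL */
   table (RBC WBC Chol)*mean,
         Gender ALL;
run;
```

Deciding which variables form the rows and which form the columns before writing the TABLE statement is usually the easier way to work.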
18.7 Customizing Your Table

PROC TABULATE allows you to customize your table in several ways. You can associate a format with a particular value in the table or rename any of the keywords, such as ALL or MEAN. If you look back at the output produced by Program 18-7, you see that the default format is useful for values of RBC but not for values of WBC (where you would rather not see any decimal values). The program that follows associates a different format with each of these two variables:
Program 18-10 Associating a different format with each variable in a table

title "Specifying Formats";
proc tabulate data=learn.blood;
   var RBC WBC;
   table RBC*mean*f=7.2 WBC*mean*f=comma7.;
run;
Here you see another use for the asterisk. It is used to associate a format with a specific value. You can think of the format being nested within the statistic. Here is the output:

Specifying Formats

+-------+-------+
|  RBC  |  WBC  |
+-------+-------+
| Mean  | Mean  |
+-------+-------+
|   5.48|  7,043|
+-------+-------+
To demonstrate how to rename PROC TABULATE keywords, take a look at Program 18-11.
Program 18-11 Renaming keywords with PROC TABULATE

title "Specifying Formats and Renaming Keywords";
proc tabulate data=learn.blood;
   class Gender;
   var RBC WBC;
   table Gender ALL,
         RBC*(mean*f=9.1 std*f=9.2)
         WBC*(mean*f=comma9. std*f=comma9.1);
   keylabel ALL  = 'Total'
            mean = 'Average'
            std  = 'Standard Deviation';
run;
The KEYLABEL statement allows you to provide a label for any of the keywords used by the procedure. In this program, ALL is replaced with Total, Mean by Average, and Std by Standard Deviation. In addition, different formats are associated with the mean and the standard deviation. This is particularly useful when you are computing statistics like N (the number of non-missing observations) or NMISS (the number of missing values), where you want integer values, along with means or other statistics, where you may want to see digits to the right of the decimal point. Here is the table:

Specifying Formats and Renaming Keywords

+----------------+-------------------+-------------------+
|                |        RBC        |        WBC        |
|                +---------+---------+---------+---------+
|                |         |Standard |         |Standard |
|                | Average |Deviation| Average |Deviation|
+----------------+---------+---------+---------+---------+
|Gender          |         |         |         |         |
+----------------+         |         |         |         |
|Female          |      5.5|     0.98|    7,112|    997.8|
+----------------+---------+---------+---------+---------+
|Male            |      5.5|     0.99|    6,988|  1,005.3|
+----------------+---------+---------+---------+---------+
|Total           |      5.5|     0.98|    7,043|  1,003.4|
+----------------+---------+---------+---------+---------+
While we are on the topic of improving the appearance of your tables, look back at the output table from Program 18-1. The row of Ns across the table is not necessary. You can eliminate them by providing a null label (a blank) within the TABLE statement request, like this:
Program 18-12 Eliminating the N row in a PROC TABULATE table

title "Eliminating the 'N' Row from the Table";
proc tabulate data=learn.blood format=6.;
   class Gender;
   table Gender*n=' ';
run;
The resulting table follows:

Eliminating the 'N' Row from the Table

+-------------+
|   Gender    |
+------+------+
|Female| Male |
+------+------+
|   440|   560|
+------+------+
You could accomplish the same goal by including a KEYLABEL statement and providing a null label for N, like this:

   keylabel n = ' ';
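For completeness, here is a sketch of how Program 18-12 might look with the KEYLABEL approach (an illustrative rewrite, assuming the same learn.blood data set):

```sas
title "Eliminating the 'N' Row with KEYLABEL";
proc tabulate data=learn.blood format=6.;
   class Gender;
   table Gender;
   /* A blank label for the keyword N removes the row of Ns */
   keylabel n = ' ';
run;
```

One design difference is worth keeping in mind: a KEYLABEL statement applies to the keyword wherever it appears in the PROC step, whereas N=' ' in the TABLE statement affects only that one place in the table request.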
Finally, on the topic of customizing your table, the NOSEPS (no separators) procedure option removes the internal horizontal separator lines from the table, shortening the table considerably. Suppose you rerun Program 18-9 with the NOSEPS option (and the narrower COMMA9.2 format):

   proc tabulate data=learn.blood format=comma9.2 noseps;
This results in the shortened table here:

Combining CLASS and Analysis Variables

+----------------+---------+---------+---------+
|                |         |         |Choleste-|
|                |   RBC   |   WBC   |   rol   |
|                +---------+---------+---------+
|                |  Mean   |  Mean   |  Mean   |
+----------------+---------+---------+---------+
|Gender  Age     |         |         |         |
|Female  Group   |         |         |         |
|        Old     |     5.48| 7,105.98|   195.88|
|        Young   |     5.53| 7,121.36|   212.27|
|        All     |     5.50| 7,112.43|   202.50|
|Male    Age     |         |         |         |
|        Group   |         |         |         |
|        Old     |     5.44| 6,939.35|   199.12|
|        Young   |     5.52| 7,061.66|   203.08|
|        All     |     5.47| 6,987.54|   200.60|
|All     Age     |         |         |         |
|        Group   |         |         |         |
|        Old     |     5.46| 7,011.56|   197.73|
|        Young   |     5.52| 7,089.08|   207.29|
|        All     |     5.48| 7,042.97|   201.44|
+----------------+---------+---------+---------+
18.8 Demonstrating a More Complex Table

As a final demonstration of combining CLASS and analysis variables in a table and customizing the table, we present the following program:
Program 18-13 Demonstrating a more complex table

title "Combining CLASS and Analysis Variables";
proc tabulate data=learn.blood format=comma9.2 noseps;
   class Gender AgeGroup;
   var RBC WBC Chol;
   table (Gender=' ' ALL)*(AgeGroup=' ' All),
         RBC*(n*f=3. mean*f=5.1)
         WBC*(n*f=3. mean*f=comma7.)
         Chol*(n*f=4. mean*f=7.1);
   keylabel ALL = 'Total';
run;
Program 18-13 demonstrates several PROC TABULATE features. First, the option NOSEPS eliminates the horizontal lines within the table (to save space). Next, a label is attached to the variables Gender and AgeGroup. This is accomplished by following the variable name with an equal sign, followed by the label. Because a null label is used, the row normally used to display these variable names is eliminated altogether. (Try running this program without this addition to see the difference.) Finally, a separate format is attached to each of the statistics in the table (N and MEAN). Often, you choose a format width so that the column headings or labels are not split between rows, rather than because it is the best width for displaying the values themselves. Here is the nice compact table produced by this program:

Combining CLASS and Analysis Variables

+----------------+---------+-----------+------------+
|                |   RBC   |    WBC    |Cholesterol |
|                +---+-----+---+-------+----+-------+
|                | N |Mean | N | Mean  | N  | Mean  |
+----------------+---+-----+---+-------+----+-------+
|Female  Old     |242|  5.5|234|  7,106| 208|  195.9|
|        Young   |167|  5.5|169|  7,121| 141|  212.3|
|        Total   |409|  5.5|403|  7,112| 349|  202.5|
|Male    Old     |309|  5.4|306|  6,939| 279|  199.1|
|        Young   |198|  5.5|199|  7,062| 167|  203.1|
|        Total   |507|  5.5|505|  6,988| 446|  200.6|
|Total   Old     |551|  5.5|540|  7,012| 487|  197.7|
|        Young   |365|  5.5|368|  7,089| 308|  207.3|
|        Total   |916|  5.5|908|  7,043| 795|  201.4|
+----------------+---+-----+---+-------+----+-------+
18.9 Computing Row and Column Percentages

The statistic PCTN is used with CLASS variables to compute percentages. Program 18-14 demonstrates how this statistic is used. This program computes the numbers and percentages of each of the four blood types in a one-dimensional table.
Program 18-14 Computing percentages in a one-dimensional table

title "Counts and Percentages";
proc tabulate data=learn.blood format=6.;
   class BloodType;
   table BloodType*(n pctn);
run;
The output, while not too pretty, follows:

Counts and Percentages

+-------------------------------------------------------+
|                      Blood Type                       |
+-------------+-------------+-------------+-------------+
|      A      |     AB      |      B      |      O      |
+------+------+------+------+------+------+------+------+
|  N   | PctN |  N   | PctN |  N   | PctN |  N   | PctN |
+------+------+------+------+------+------+------+------+
|   412|    41|    44|     4|    96|    10|   448|    45|
+------+------+------+------+------+------+------+------+
The numbers in the PctN columns add up to 100%. It’s time to improve on this table’s appearance. Let’s add an ALL column, rename N and PctN, and add percent signs to the percentage numbers. Here goes:
Program 18-15 Improving the appearance of output from Program 18-14

proc format;
   picture pctfmt low-high='009.9%';
run;

title "Counts and Percentages";
proc tabulate data=learn.blood;
   class BloodType;
   table (BloodType ALL)*(n*f=5. pctn*f=pctfmt7.1);
   keylabel n    = 'Count'
            pctn = 'Percent';
run;
Several improvements were made in this program. A user-defined picture format was created with PROC FORMAT. This format allows for values up to three digits before the decimal place, with the right-most digit always being printed (even if it is 0). The 0s in a picture format are place holders; the 9s are also place holders, but they print leading 0s if necessary. The percent sign in the picture definition prints as is. Separate formats were requested for N (5.) and for PctN (pctfmt7.1). Finally, a KEYLABEL statement was used to rename N and PctN to Count and Percent, respectively. By the way, the format width of 7 for the percentage values was chosen so the Percent column heading would fit. Here is the table:

Counts and Percentages

+-------------------------------------------------------+-------------+
|                      Blood Type                       |             |
+-------------+-------------+-------------+-------------+             |
|      A      |     AB      |      B      |      O      |     All     |
+-----+-------+-----+-------+-----+-------+-----+-------+-----+-------+
|Count|Percent|Count|Percent|Count|Percent|Count|Percent|Count|Percent|
+-----+-------+-----+-------+-----+-------+-----+-------+-----+-------+
|  412|  41.2%|   44|   4.4%|   96|   9.6%|  448|  44.8%| 1000| 100.0%|
+-----+-------+-----+-------+-----+-------+-----+-------+-----+-------+
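One property of picture formats is easy to overlook: by default they truncate rather than round. The sketch below is an illustration, not part of the numbered programs. The format name pctrnd and the DATA _NULL_ step are assumptions made for this example; ROUND is an option of the PICTURE statement. It writes the same value to the log with and without rounding:

```sas
proc format;
   picture pctfmt low-high='009.9%';
   /* Same picture, but the value is rounded instead of truncated */
   picture pctrnd (round) low-high='009.9%';
run;

data _null_;
   Pct = 100*178/440;   /* about 40.4545 */
   /* Write the value to the log using both formats */
   put Pct= pctfmt7.1 Pct= pctrnd7.1;
run;
```

With these definitions, the first value should display as 40.4% and the second as 40.5%. If you want rounded percentages in a table, the ROUND option is one way to get them.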
18.10 Displaying Percentages in a Two-Dimensional Table

Here is where things get tricky. Let’s make a table of BloodType (rows) by Gender (columns) and request N and PctN, like this:
Program 18-16 Counts and percentages in a two-dimensional table

proc format;
   picture pctfmt low-high='009.9%';
run;

title "Counts and Percentages";
proc tabulate data=learn.blood noseps;
   class Gender BloodType;
   table (BloodType ALL),
         (Gender ALL)*(n*f=5. pctn*f=pctfmt7.1) / RTS=25;
   keylabel ALL  = 'Both Genders'
            n    = 'Count'
            pctn = 'Percent';
run;
The NOSEPS procedure option saves some space and a TABLE option, RTS=, increases the row title space (the space for the width of the row title, including the vertical lines). In the table here, there are 25 spaces from the left-most vertical line to the line preceding the word Female.
Here is the output:

Counts and Percentages

+-----------------------+---------------------------+-------------+
|                       |          Gender           |             |
|                       +-------------+-------------+             |
|                       |   Female    |    Male     |Both Genders |
|                       +-----+-------+-----+-------+-----+-------+
|                       |Count|Percent|Count|Percent|Count|Percent|
+-----------------------+-----+-------+-----+-------+-----+-------+
|Blood Type             |     |       |     |       |     |       |
|A                      |  178|  17.8%|  234|  23.4%|  412|  41.2%|
|AB                     |   20|   2.0%|   24|   2.4%|   44|   4.4%|
|B                      |   34|   3.4%|   62|   6.2%|   96|   9.6%|
|O                      |  208|  20.8%|  240|  24.0%|  448|  44.8%|
|Both Genders           |  440|  44.0%|  560|  56.0%| 1000| 100.0%|
+-----------------------+-----+-------+-----+-------+-----+-------+
The percentage values in this table represent the number of subjects in a cell (females with blood type A, for example) as a percentage of all subjects. Thus, the number of females who have blood type A (178) represents 17.8% of the total number of subjects (1,000). Suppose you want the percentages to represent the distribution of blood types within each gender value. That is, you want the percentages in each column to add to 100%.
18.11 Computing Column Percentages

If you want the table to include column percentages, you need to modify Program 18-16. The original (read: hard) way to do this is to provide a denominator definition. Lauren Haworth’s book (see footnote 2), which was mentioned in the introduction to this chapter, discusses denominator definitions. The easy way to do this is to use the recently added keywords COLPCTN and ROWPCTN, which compute column and row percentages. Program 18-17 uses the COLPCTN keyword to compute column percentages.
2. See Lauren E. Haworth, PROC TABULATE By Example (Cary, NC: SAS Institute Inc., 1999), for a discussion of denominator definitions.
Program 18-17 Using COLPCTN to compute column percentages

title "Percents on Column Dimension";
proc tabulate data=learn.blood noseps;
   class Gender BloodType;
   table (BloodType ALL='All Blood Types'),
         (Gender ALL)*(n*f=5. colpctn*f=pctfmt7.1) / RTS=25;
   keylabel All     = 'Both Genders'
            n       = 'Count'
            colpctn = 'Percent';
run;
Gender (and ALL) form the columns of this table. By substituting COLPCTN for PCTN, the percentages you see in the table are now column percentages. For example, the value of 40.4% for females with blood type A is computed by dividing the number of females with blood type A (178) by the total number of females (440). One other change was made to this program. The KEYLABEL statement assigns the label Both Genders to the keyword ALL. However, you want the label for ALL, associated with BloodType, to read All Blood Types. Another way to assign a label in a PROC TABULATE table is to follow the variable name or keyword with an equal sign, followed by the label in quotation marks. That is why the bottom row of the table below reads All Blood Types.

Percents on Column Dimension

+-----------------------+---------------------------+-------------+
|                       |          Gender           |             |
|                       +-------------+-------------+             |
|                       |   Female    |    Male     |Both Genders |
|                       +-----+-------+-----+-------+-----+-------+
|                       |Count|Percent|Count|Percent|Count|Percent|
+-----------------------+-----+-------+-----+-------+-----+-------+
|Blood Type             |     |       |     |       |     |       |
|A                      |  178|  40.4%|  234|  41.7%|  412|  41.2%|
|AB                     |   20|   4.5%|   24|   4.2%|   44|   4.4%|
|B                      |   34|   7.7%|   62|  11.0%|   96|   9.6%|
|O                      |  208|  47.2%|  240|  42.8%|  448|  44.8%|
|All Blood Types        |  440| 100.0%|  560| 100.0%| 1000| 100.0%|
+-----------------------+-----+-------+-----+-------+-----+-------+
Row percentages are requested in a similar fashion, using the keyword ROWPCTN instead of COLPCTN.
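As a sketch of the row-percentage version (a variation on Program 18-17, assuming the pctfmt picture format from Program 18-16 has already been defined in the session):

```sas
title "Percents on Row Dimension";
proc tabulate data=learn.blood noseps;
   class Gender BloodType;
   /* ROWPCTN divides each cell count by the total for its row */
   table (BloodType ALL='All Blood Types'),
         (Gender ALL)*(n*f=5. rowpctn*f=pctfmt7.1) / RTS=25;
   keylabel ALL     = 'Both Genders'
            n       = 'Count'
            rowpctn = 'Percent';
run;
```

In this layout, the Female and Male percentages in each blood type row should add to roughly 100% (the picture format truncates, so the displayed values can fall a tenth short), and the Both Genders column should show 100.0% in every row.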
18.12 Computing Percentages on Numeric Variables

There is another type of percentage calculation that is available in PROC TABULATE: you can compute the percentage of the SUM statistic that is represented by a column or row value. For example, the data set Sales contains variables Region and TotalSales. If you want to see the contribution to total sales made by each region, you can use the PCTSUM statistic. If you place Region on the row dimension and TotalSales on the column dimension, you can use the PCTSUM statistic to see the regional contribution of sales as a percentage of total sales for all regions. Here is the program:
Program 18-18 Computing percentages on a numeric value

title "Computing Percentages on a Numerical Value";
proc tabulate data=learn.sales;
   class Region;
   var TotalSales;
   table (Region ALL),
         TotalSales*(n*f=6. sum*f=dollar8. pctsum*f=pctfmt7.);
   keylabel ALL    = 'All Regions'
            n      = 'Number of Sales'
            sum    = 'Average'
            pctsum = 'Percent';
   label TotalSales = 'Total Sales';
run;
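If you have CLASS variables on both dimensions of a table, you can use the two keywords COLPCTSUM and ROWPCTSUM in a similar manner to COLPCTN and ROWPCTN, discussed in the previous section. In the table here, the column labeled Percent represents the sum of the variable TotalSales for each region divided by the sum of TotalSales for all regions. Note that the KEYLABEL statement in Program 18-18 labels the SUM statistic Average, so the column with that heading actually contains the sum of TotalSales for each region (the All Regions value is the sum of the four regional values).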
Computing Percentages on a Numerical Value

+----------------+-----------------------+
|                |      Total Sales      |
|                +------+--------+-------+
|                |Number|        |       |
|                |  of  |        |       |
|                |Sales |Average |Percent|
+----------------+------+--------+-------+
|Region          |      |        |       |
+----------------+      |        |       |
|East            |     4| $41,593|  53.9%|
+----------------+------+--------+-------+
|North           |     4| $21,228|  27.5%|
+----------------+------+--------+-------+
|South           |     4| $12,003|  15.5%|
+----------------+------+--------+-------+
|West            |     2|  $2,290|   2.9%|
+----------------+------+--------+-------+
|All Regions     |    14| $77,113| 100.0%|
+----------------+------+--------+-------+
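Because Program 18-18 attaches the label Average to the SUM statistic, you may prefer a version that reports the sum and the mean under separate labels. The following is a hedged sketch of one way to do that. The labels and the DOLLAR9. width are choices made for this illustration, and the pctfmt picture format is assumed to be defined already:

```sas
title "Sum, Mean, and Percent of Total Sales";
proc tabulate data=learn.sales;
   class Region;
   var TotalSales;
   /* SUM and MEAN are requested separately so each can be labeled */
   table (Region ALL),
         TotalSales*(n*f=6. sum*f=dollar9. mean*f=dollar9.
                     pctsum*f=pctfmt7.);
   keylabel ALL    = 'All Regions'
            n      = 'Number of Sales'
            sum    = 'Total'
            mean   = 'Average'
            pctsum = 'Percent';
   label TotalSales = 'Total Sales';
run;
```

The Percent column is unchanged by this: PCTSUM is still each region's sum of TotalSales divided by the sum for all regions.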
18.13 Understanding How Missing Values Affect PROC TABULATE Output

Although this is the last topic in this chapter, it is a very important one.

Note: If you have missing values for one or more variables listed in a CLASS statement, any observation with one or more missing values will be eliminated from the table! This is true even for a variable that is listed in a CLASS statement and is not referenced in a TABLE statement.

To help understand how this works (and later to see how to fix it), look at the small data set (Missing):
Listing of MISSING

Obs    A    B    C

 1     X    Y    Z
 2     X    Y    Y
 3     Z    Z    Z
 4     X    X
 5     Y    Z
 6     X
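If you do not have the Learn library available, a small data set with the same pattern of missing values can be created with a DATA step. This is only a sketch: it writes to the WORK library under the name Missing, and the positions of the blank values are inferred from the listing and the tables in this section.

```sas
data missing;
   /* Column input keeps blank fields as missing character values */
   input A $ 1  B $ 3  C $ 5;
datalines;
X Y Z
X Y Y
Z Z Z
X X
Y Z
X
;

title "Listing of MISSING";
proc print data=missing;
run;
```

To use this copy with the programs that follow, you would substitute data=missing for data=learn.missing.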
Now, see what happens when you request some simple tables:
Program 18-19 Demonstrating the effect of missing values on CLASS variables

title "The Effect of Missing Values on CLASS variables";
proc tabulate data=learn.missing format=4.;
   class A B;
   table A ALL,B ALL;
run;

The Effect of Missing Values on CLASS variables

+----------------+--------------+----+
|                |      B       |    |
|                +----+----+----+    |
|                | X  | Y  | Z  |All |
|                +----+----+----+----+
|                | N  | N  | N  | N  |
+----------------+----+----+----+----+
|A               |    |    |    |    |
+----------------+    |    |    |    |
|X               |   1|   2|   .|   3|
+----------------+----+----+----+----+
|Y               |   .|   .|   1|   1|
+----------------+----+----+----+----+
|Z               |   .|   .|   1|   1|
+----------------+----+----+----+----+
|All             |   1|   2|   2|   5|
+----------------+----+----+----+----+
Notice that although there are six observations in the data set, the total shown in the table is five. Now, look what happens if you include the variable C in the CLASS statement, like this:
Program 18-20 Missing values on a CLASS variable that is not used in the table

title "The Effect of Missing Values on CLASS variables";
proc tabulate data=learn.missing format=4.;
   class A B C;
   table A ALL,B ALL;
run;
Here is the table:

The Effect of Missing Values on CLASS variables

+----------------+---------+----+
|                |    B    |    |
|                +----+----+    |
|                | Y  | Z  |All |
|                +----+----+----+
|                | N  | N  | N  |
+----------------+----+----+----+
|A               |    |    |    |
+----------------+    |    |    |
|X               |   2|   .|   2|
+----------------+----+----+----+
|Z               |   .|   1|   1|
+----------------+----+----+----+
|All             |   2|   1|   3|
+----------------+----+----+----+
Notice that the total number of observations is now three, even though variable C was not used to generate the table. PROC TABULATE chooses the observations to exclude from the table based on any missing values in any of the variables listed in the CLASS statement. Let’s rerun Program 18-19 but include the procedure option MISSING, like this:
Program 18-21 Adding the PROC TABULATE procedure option MISSING

title "The Effect of Missing Values on CLASS variables";
proc tabulate data=learn.missing format=4. missing;
   class A B;
   table A ALL,B ALL;
run;
The MISSING option includes missing values as a category in the table. Take a look:

The Effect of Missing Values on CLASS variables

+----------------+-------------------+----+
|                |         B         |    |
|                +----+----+----+----+    |
|                |    | X  | Y  | Z  |All |
|                +----+----+----+----+----+
|                | N  | N  | N  | N  | N  |
+----------------+----+----+----+----+----+
|A               |    |    |    |    |    |
+----------------+    |    |    |    |    |
|X               |   1|   1|   2|   .|   4|
+----------------+----+----+----+----+----+
|Y               |   .|   .|   .|   1|   1|
+----------------+----+----+----+----+----+
|Z               |   .|   .|   .|   1|   1|
+----------------+----+----+----+----+----+
|All             |   1|   1|   2|   2|   6|
+----------------+----+----+----+----+----+
Notice that variable B now has a column representing missing values and the total number of observations in the table is six. By the way, if you would prefer to see something other than a period representing missing values in your table, the TABLE option MISSTEXT= can be used to substitute any text you want. For example:
Program 18-22 Demonstrating the MISSTEXT= TABLE option

title "Demonstrating the MISSTEXT TABLES Option";
proc tabulate data=learn.missing format=7. missing;
   class A B;
   table A ALL,B ALL / misstext='no data';
run;
The text no data now appears instead of the period in each cell that contains a missing value. Note: The format width increased to 7 to accommodate the text. Here is the listing:

Demonstrating the MISSTEXT TABLES Option

+----------------+-------------------------------+-------+
|                |               B               |       |
|                +-------+-------+-------+-------+       |
|                |       |   X   |   Y   |   Z   |  All  |
|                +-------+-------+-------+-------+-------+
|                |   N   |   N   |   N   |   N   |   N   |
+----------------+-------+-------+-------+-------+-------+
|A               |       |       |       |       |       |
+----------------+       |       |       |       |       |
|X               |      1|      1|      2|no data|      4|
+----------------+-------+-------+-------+-------+-------+
|Y               |no data|no data|no data|      1|      1|
+----------------+-------+-------+-------+-------+-------+
|Z               |no data|no data|no data|      1|      1|
+----------------+-------+-------+-------+-------+-------+
|All             |      1|      1|      2|      2|      6|
+----------------+-------+-------+-------+-------+-------+
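Since a missing value in any CLASS variable silently removes an observation, it can help to count the missing values before you build a table. One possible check, sketched here with PROC FREQ (an approach chosen for illustration, not one prescribed by this chapter):

```sas
title "Checking CLASS Variables for Missing Values";
proc freq data=learn.missing;
   /* The MISSING option treats missing values as a level
      of each variable, so they appear in the counts */
   tables A B C / missing;
run;
```

If any variable shows a missing level, you can then decide whether to drop that variable from the CLASS statement or to add the MISSING option to PROC TABULATE.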
We hope that this introduction to PROC TABULATE gives you the courage to give this useful and powerful procedure a try. This author was TABULATE-phobic for a long time until forced to use the procedure for a project that required nice-looking summary tables. Now, he is a convert and tries to get others to use this procedure more often.
18.14 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

All the problems, except Problem 10, use the SAS data set College.

Note: Data set College has permanent formats for Gender, SchoolSize, and Scholarship. Either make this format catalog available to SAS (see Chapter 5), run the PROC FORMAT statements here, or use the system option NOFMTERR (no format error) that allows you to access SAS data sets that have permanent user-defined formats without causing an error.

proc format;
   value $yesno  'Y','1' = 'Yes'
                 'N','0' = 'No'
                 ' '     = 'Not Given';
   value $size   'S' = 'Small'
                 'M' = 'Medium'
                 'L' = 'Large'
                 ' ' = 'Missing';
   value $gender 'F' = 'Female'
                 'M' = 'Male'
                 ' ' = 'Not Given';
run;
For each problem, you need to write the appropriate PROC TABULATE statements to produce the given table.
1. Produce the following table. Note that the last row in the table represents all subjects.

Demographics from COLLEGE Data Set

+-------------+--------------------+
|             |     SchoolSize     |
|             +------+------+------+
|             |Large |Medium|Small |
+-------------+------+------+------+
|Gender       |      |      |      |
+-------------+      |      |      |
|F            |    10|    23|    26|
+-------------+------+------+------+
|M            |     8|    18|    11|
+-------------+------+------+------+
|Scholarship  |      |      |      |
+-------------+      |      |      |
|No           |    16|    36|    32|
+-------------+------+------+------+
|Yes          |     2|     5|     5|
+-------------+------+------+------+
|All          |    18|    41|    37|
+-------------+------+------+------+
2. Produce the following table. Note that the ALL column has been renamed Total.

Demographics from COLLEGE Data Set

+-------------+-------------+-------------+------+
|             |   Gender    | Scholarship |      |
|             +------+------+------+------+      |
|             |  F   |  M   |  No  | Yes  |Total |
+-------------+------+------+------+------+------+
|SchoolSize   |      |      |      |      |      |
+-------------+      |      |      |      |      |
|Large        |    10|     8|    16|     2|    18|
+-------------+------+------+------+------+------+
|Medium       |    23|    18|    36|     5|    41|
+-------------+------+------+------+------+------+
|Small        |    26|    11|    32|     5|    37|
+-------------+------+------+------+------+------+
|Total        |    59|    37|    84|    12|    96|
+-------------+------+------+------+------+------+
3. Produce the following table. Note that the ALL column has been renamed Total and Gender has been formatted.

Demographics from COLLEGE Data Set

+-----------------------+--------------------+------+
|                       |     SchoolSize     |      |
|                       +------+------+------+      |
|                       |Large |Medium|Small |Total |
+-----------+-----------+------+------+------+------+
|Gender     |Scholarship|      |      |      |      |
+-----------+-----------+      |      |      |      |
|Female     |No         |     9|    19|    22|    50|
|           +-----------+------+------+------+------+
|           |Yes        |     1|     4|     4|     9|
|           +-----------+------+------+------+------+
|           |Total      |    10|    23|    26|    59|
+-----------+-----------+------+------+------+------+
|Male       |Scholarship|      |      |      |      |
|           +-----------+      |      |      |      |
|           |No         |     7|    17|    10|    34|
|           +-----------+------+------+------+------+
|           |Yes        |     1|     1|     1|     3|
|           +-----------+------+------+------+------+
|           |Total      |     8|    18|    11|    37|
+-----------+-----------+------+------+------+------+
|Total      |Scholarship|      |      |      |      |
|           +-----------+      |      |      |      |
|           |No         |    16|    36|    32|    84|
|           +-----------+------+------+------+------+
|           |Yes        |     2|     5|     5|    12|
|           +-----------+------+------+------+------+
|           |Total      |    18|    41|    37|    96|
+-----------+-----------+------+------+------+------+
4. Produce the following table. Note that the keyword ALL has been renamed Total, Gender is formatted, and ClassRank (a continuous numeric variable) has been formatted into two groups (0–70 and 71 and higher).
Demographics from COLLEGE Data Set

+-------------+-----------------------------------------+
|             |                ClassRank                |
|             +--------------------+--------------------+
|             |     Low to 70      |   71 and higher    |
|             +-------------+------+-------------+------+
|             |   Gender    |      |   Gender    |      |
|             +------+------+      +------+------+      |
|             |Female| Male |Total |Female| Male |Total |
+-------------+------+------+------+------+------+------+
|Scholarship  |      |      |      |      |      |      |
+-------------+      |      |      |      |      |      |
|No           |    20|    15|    35|    23|    19|    42|
+-------------+------+------+------+------+------+------+
|Yes          |     4|     .|     4|     5|     2|     7|
+-------------+------+------+------+------+------+------+
|Total        |    24|    15|    39|    28|    21|    49|
+-------------+------+------+------+------+------+------+
5. Produce the following table. Note that the keywords ALL, N, MIN, and MAX have all been renamed.

Descriptive Statistics

+----------------+-------------+------+
|                |   Gender    |      |
|                +------+------+      |
|                |  F   |  M   |Total |
+-------+--------+------+------+------+
|GPA    |Number  |    56|    38|    94|
|       +--------+------+------+------+
|       |Average |   3.5|   3.5|   3.5|
|       +--------+------+------+------+
|       |Minimum |   2.3|   2.4|   2.3|
|       +--------+------+------+------+
|       |Maximum |   4.0|   4.0|   4.0|
+-------+--------+------+------+------+
6. Produce the following table. Note that the keywords ALL, N, and MEAN have all been renamed.

Descriptive Statistics

+-----------------------+-------------+------+
|                       |   Gender    |      |
|                       +------+------+      |
|                       |  F   |  M   |Total |
+-----------+-----------+------+------+------+
|GPA        |Number     |    56|    38|    94|
|           +-----------+------+------+------+
|           |Average    |   3.5|   3.5|   3.5|
+-----------+-----------+------+------+------+
|Class Rank |Number     |    52|    36|    88|
|           +-----------+------+------+------+
|           |Average    |  71.3|  72.3|  71.7|
+-----------+-----------+------+------+------+
7. Produce the following table. Note that the keywords MIN and MAX have been renamed and the two variables ClassRank and SchoolSize now have labels. A procedure option was used to remove the horizontal table lines.

More Descriptive Statistics

+-------------+-----------------------+-----------------------+
|             |          GPA          |      Class Rank       |
|             +-------+-------+-------+-------+-------+-------+
|             |Median |Minimum|Maximum|Median |Minimum|Maximum|
+-------------+-------+-------+-------+-------+-------+-------+
|School Size  |       |       |       |       |       |       |
|Large        |    3.6|    3.0|    4.0|     71|     45|     98|
|Medium       |    3.7|    2.4|    4.0|     71|     42|    100|
|Small        |    3.4|    2.3|    4.0|     79|     41|     99|
|Total        |    3.6|    2.3|    4.0|     73|     41|    100|
+-------------+-------+-------+-------+-------+-------+-------+
8. Produce the following table. Note that the keyword ROWPCTN has been renamed as Percent and Gender has been formatted. A procedure option was used to remove the horizontal table lines.

Demonstrating Row Percents
+----------------+---------------+-------+
|                |  Scholarship  |       |
|                +-------+-------+       |
|                |  No   |  Yes  |  All  |
|                +-------+-------+-------+
|                |Percent|Percent|Percent|
+----------------+-------+-------+-------+
|Gender          |       |       |       |
|Female          |     83|     17|    100|
|Male            |     93|      8|    100|
|All             |     87|     13|    100|
+----------------+-------+-------+-------+
9. Produce the following table. Note that the ALL column has been renamed Total, COLPCTN has been renamed Percent, Gender has been formatted, and the order of the columns is Yes followed by No. Hint: Think of using the procedure option ORDER=data and figure out a way to arrange the data set so that the Yes values come before the No values. A procedure option was used to remove the horizontal table lines.

Demonstrating Column Percents
+----------------+---------------+-------+
|                |  Scholarship  |       |
|                +-------+-------+       |
|                |  Yes  |  No   | Total |
|                +-------+-------+-------+
|                |Percent|Percent|Percent|
+----------------+-------+-------+-------+
|Gender          |       |       |       |
|Female          |     77|     57|     60|
|Male            |     23|     43|     40|
|Total           |    100|    100|    100|
+----------------+-------+-------+-------+
10. Using the SAS data set Sales, produce the table shown here. The variable TotalSales has been labeled as Total Sales and the horizontal table lines have been removed. The percentages shown in the table represent percentages of the total quantities sold and of the total sales.

Demonstrating the PCTSUM Statistic
+----------------+--------+--------+
|                |        | Total  |
|                |Quantity| Sales  |
|                +--------+--------+
|                |Percent |Percent |
|                |of Total|of Total|
+----------------+--------+--------+
|Region          |        |        |
|East            |      35|      54|
|North           |       3|      28|
|South           |      10|      16|
|West            |      52|       3|
|All             |     100|     100|
+----------------+--------+--------+
Chapter 19
Introducing the Output Delivery System

19.1 Introduction
19.2 Sending SAS Output to an HTML File
19.3 Creating a Table of Contents
19.4 Selecting a Different HTML Style
19.5 Choosing Other ODS Destinations
19.6 Selecting or Excluding Portions of SAS Output
19.7 Sending Output to a SAS Data Set
19.8 Problems
19.1 Introduction One of the most important changes in SAS, starting with Version 7, was the introduction of the Output Delivery System, abbreviated as ODS. Prior to ODS, each SAS procedure produced its own unique output, usually featuring a non-proportional font. This output was difficult to incorporate into other documents, such as Microsoft Office Word or Microsoft Office PowerPoint. The Output Delivery System changed all that. Now each SAS procedure creates output objects that can be sent to such destinations as HTML, RTF, PDF, and SAS data sets. There is now a separation between the results produced by each procedure and the delivery of this information. Not only can you send your SAS output to all of these formats, you can, if you are brave enough, customize the output’s fonts, colors, size, and layout. Finally, you can now capture virtually every piece of SAS output to a SAS data set for further processing. For a comprehensive discussion of the Output Delivery System, take a look at Lauren Haworth’s book¹ and Sunil Gupta’s book².
19.2 Sending SAS Output to an HTML File One of the more popular destinations for SAS output is an HTML file. The obvious reason is that these files can be used as Web pages (not to mention that they look pretty slick). As an example, suppose you want to send the output of PROC PRINT and PROC MEANS to an HTML file. It takes only two SAS statements to do this, as shown next:
Program 19-1 Sending SAS output to an HTML file

ods html file='c:\books\learning\sample.html';

title "Listing of TEST_SCORES";
proc print data=learn.test_scores;
   title2 "Sample of HTML Output - all defaults";
   id ID;
   var Name Score1-Score3;
run;

title "Descriptive Statistics";
proc means data=learn.test_scores n mean min max;
   var Score1-Score3;
run;

ods html close;
1 See Lauren E. Haworth, Output Delivery System, The Basics (Cary, NC: SAS Institute Inc., 2001).
2 See Sunil K. Gupta, Quick Results with the Output Delivery System (Cary, NC: SAS Institute Inc., 2003).
All that is required is to place an ODS HTML FILE statement before the procedures whose output you want to capture and an ODS HTML CLOSE statement at the end. You specify the file destination by placing the filename in quotes or by using a FILENAME statement and using a fileref (without quotes) in place of the filename. The HTML extension (or HTM for some environments) is not needed by SAS, but it allows the operating system to recognize that the file contains HTML tags and to open it with the appropriate browser. The HTML output is shown here:
Note: The actual HTML file is in color. When you ran Program 19-1, you also produced, by default, a listing file, that is, the normal output that SAS produces and that appears in an Output window on PC and UNIX platforms. If you do not want listing-style output, place the following ODS statement before the ODS HTML FILE statement:

ods listing close;
If you want to reinstate the listing output, submit the following statement: ods listing;
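As a minimal sketch of the options just described, the following variation on Program 19-1 uses a fileref in place of a quoted filename and turns the listing output off while the HTML file is being written. The fileref name HTMLOUT is an arbitrary choice made for this illustration; the path is the one used in Program 19-1.

```sas
/* HTMLOUT is a hypothetical fileref name chosen for this sketch */
filename htmlout 'c:\books\learning\sample.html';

ods listing close;        /* turn off the listing destination   */
ods html file=htmlout;    /* the fileref is used without quotes */

title "Listing of TEST_SCORES";
proc print data=learn.test_scores;
   id ID;
   var Name Score1-Score3;
run;

ods html close;           /* finish writing the HTML file       */
ods listing;              /* reinstate the listing output       */
```

Placing the ODS LISTING statement at the end returns the session to its usual behavior for any procedures that follow.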
19.3 Creating a Table of Contents If your HTML output is large, you may elect to produce a table of contents along with the normal HTML output. Remember that HTML is primarily intended to be viewed with a Web browser, not printed. A table of contents embeds links to each separate part of the output, allowing users to click on a link and go directly to different parts of the output. To create a table of contents, you need to specify three output files:
A main file that contains the procedure output
A table of contents file
A frame file to display the other two
Here is an example, using the same procedures as Program 19-1:
Program 19-2 Creating a table of contents for HTML output

ods html body = 'body_sample.html'
         contents = 'contents_sample.html'
         frame = 'frame_sample.html'
         path = 'c:\books\learning' (url=none);

title "Using ODS to Create a Table of Contents";
proc print data=learn.test_scores;
   id ID;
   var Name Score1-Score3;
run;

title "Descriptive Statistics";
proc means data=learn.test_scores n mean min max;
   var Score1-Score3;
run;

ods html close;
The three keywords are BODY=, CONTENTS=, and FRAME=. Note: The PATH= option was included to indicate that all three files are to be located in the c:\books\learning folder. The (URL=none) suboption keeps the path itself out of the links that the files use to refer to one another. You can name these three files anything you like—the names do not need to include the words body, contents, or frame as in this example. If you are creating a Web page from these files, you need to provide a link to the FRAME file. A display (in black and white) of the output is shown next.
There are many ways to customize this output. Please refer to either of the books on ODS listed in the introduction for more details.
19.4 Selecting a Different HTML Style Although you can customize every aspect of the HTML output, it takes time and effort. There are a number of built-in styles that change the appearance of the output without any work on your part. One way to see a list of styles, if you are in a windowing environment, is to select Tools → Options → Preferences and then select the Results tab. You will see a long list of built-in styles that you can choose from. For example, if you want cleaner-looking output, you could choose the FancyPrinter style, like this:
Program 19-3 Choosing a style for HTML output

ods html file = 'c:\books\learning\styles.html'
         style=FancyPrinter;

title "Listing of TEST_SCORES";
proc print data=learn.test_scores;
   id ID;
   var Name Score1-Score3;
run;

ods html close;
The resulting output follows:
Another style choice for very clean-looking output is JOURNAL. This style does not contain any shading and, as the name implies, is suitable for printing journal-style tables.
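If you are not working in a windowing environment, one alternative way to see the available style names is the LIST STYLES statement of PROC TEMPLATE. The sketch below lists the styles and then reruns the PROC PRINT from Program 19-3 with the JOURNAL style. The output file name journal.html is an assumption made for this illustration.

```sas
/* Print the names of the style templates available in this session */
proc template;
   list styles;
run;

/* journal.html is a hypothetical file name used only in this sketch */
ods html file = 'c:\books\learning\journal.html'
         style=Journal;

title "Listing of TEST_SCORES";
proc print data=learn.test_scores;
   id ID;
   var Name Score1-Score3;
run;

ods html close;
```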
19.5 Choosing Other ODS Destinations You can create RTF (rich text format) or PDF (portable document format, readable with Adobe Reader) files in the same manner as HTML output. Just substitute the keyword RTF or PDF for HTML in the statements in Program 19-1 and change the file extension to RTF or PDF, respectively. Besides providing a better appearance, using RTF or PDF files allows users of software other than SAS to see well-formatted SAS output without having SAS fonts on their machines. RTF is a somewhat universal format that can be incorporated into a Microsoft Office Word or Corel WordPerfect document directly. Have you ever seen what SAS output looks like when it is displayed in a font other than a SAS font? Take a look at output from PROC FREQ, which is displayed in Word with a Courier font (it looks even worse when printed with a proportional font such as Arial or Times Roman):
Sample PROC FREQ Output

The FREQ Procedure
Table of Gender by AgeGroup

Gender(Gender)     AgeGroup(Age Group)

Frequency‚
Row Pct  ‚
Col Pct  ‚Old     ‚Young   ‚  Total
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Female   ‚    258 ‚    182 ‚    440
         ‚  58.64 ‚  41.36 ‚
         ‚  43.14 ‚  45.27 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Male     ‚    340 ‚    220 ‚    560
         ‚  60.71 ‚  39.29 ‚
         ‚  56.86 ‚  54.73 ‚
ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ
Total         598      402     1000
This output is a strong argument for using RTF or PDF output destinations.
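The substitution described at the start of this section can be sketched as follows. Both destinations are opened at once here so that a single PROC MEANS step writes to both files. The file names sample.rtf and sample.pdf are assumptions made for this illustration.

```sas
/* sample.rtf and sample.pdf are hypothetical file names */
ods rtf file='c:\books\learning\sample.rtf';
ods pdf file='c:\books\learning\sample.pdf';

title "Descriptive Statistics";
proc means data=learn.test_scores n mean min max;
   var Score1-Score3;
run;

/* Close each destination to finish writing its file */
ods rtf close;
ods pdf close;
```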
19.6 Selecting or Excluding Portions of SAS Output You can use an ODS SELECT or ODS EXCLUDE statement before a SAS procedure to control which portions of the output you want. Suppose you want to use PROC UNIVARIATE to list the five highest and five lowest values of a variable. This is a normal part of the output from PROC UNIVARIATE, but you get several pages of additional output as well. Program 19-4 uses an ODS SELECT statement to restrict the output.
Program 19-4 Using an ODS SELECT statement to restrict PROC UNIVARIATE output

ods select extremeobs;
title "Extreme Values of RBC";
proc Univariate data=learn.blood;
   id Subject;
   var RBC;
run;
When you run this program, the only output is as follows:

Extreme Values of RBC

The UNIVARIATE Procedure
Variable: RBC

Extreme Observations

---------Lowest---------    ---------Highest--------
Value   Subject   Obs       Value   Subject   Obs
 1.71       525   525        7.99       565   565
 2.33       440   440        8.12       984   984
 2.55       113   113        8.26       288   288
 2.92       293   293        8.43       726   726
 3.13       635   635        8.75       135   135
This seems simple enough, but how do you determine the name of the output objects (in this case, EXTREMEOBS)? One way is to run the procedure sandwiched between ODS TRACE ON and ODS TRACE OFF statements, like this:
Program 19-5 Using the ODS TRACE statement to identify output objects

ods trace on;
title "Extreme Values of RBC";
proc Univariate data=learn.blood;
   id Subject;
   var RBC;
run;
ods trace off;
When you run the procedure, the names of the output objects appear in the SAS log as shown next (only selected portions are shown):

252  ods trace on;
253  proc Univariate data=learn.blood;
254     title "Extreme Values of RBC";
255     id Subject;
256     var RBC;
257  run;

Output Added:
Name:       Moments
Label:      Moments
Template:   base.univariate.Moments
Path:       Univariate.RBC.Moments

Output Added:
Name:       BasicMeasures
Label:      Basic Measures of Location and Variability
Template:   base.univariate.Measures
Path:       Univariate.RBC.BasicMeasures

. . .

Name:       ExtremeObs
Label:      Extreme Observations
Template:   base.univariate.ExtObs
Path:       Univariate.RBC.ExtremeObs

NOTE: PROCEDURE UNIVARIATE used (Total process time):
      real time           0.46 seconds
      cpu time            0.02 seconds

258  ods trace off;
One way to figure out the correspondence between output objects and portions of the output is to look through the output listing and the list of objects in the SAS log and make an educated guess. It’s usually quite obvious which object goes with each portion of output. If you use TRACE ON/LISTING, the information on each output object is placed in the Output window, along with the listing. This is yet another way to know which output object names go with each portion of the output. For a more systematic approach, look at the Results window of SAS Display Manager and notice the labels in the list. For example, the following display results from running Program 19-5:
Clicking on any of these labels takes you to the appropriate portion of the output. You can then look in the list of output objects in the SAS log and determine the object names. You can also select objects by using the labels instead of the object names. For example, you can use the following: ods select "Extreme Observations";
in place of this statement: ods select extremeobs;
If you use the label, be sure to place it in single or double quotes. You can include a list of output objects following the ODS SELECT statement (separated by spaces). If you want to select most of the output objects from a particular procedure, it might be easier to use an ODS EXCLUDE statement to exclude the ones you don't want.
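The variations described in this section can be sketched as follows, using the object names that appear in the log from Program 19-5. This is only an illustration of the statement forms; which objects you select or exclude depends on what you need.

```sas
/* Write the trace information to the Output window, next to each object */
ods trace on / listing;
proc univariate data=learn.blood;
   var RBC;
run;
ods trace off;

/* Select two output objects by name; the names are separated by spaces */
ods select moments extremeobs;
proc univariate data=learn.blood;
   id Subject;
   var RBC;
run;

/* Keep everything except the two objects named here */
ods exclude moments basicmeasures;
proc univariate data=learn.blood;
   id Subject;
   var RBC;
run;
```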
Remember that ODS SELECT or EXCLUDE statements only operate on the procedure that follows. If you want the selections to remain for other procedures, you can use the PERSIST option, like this: ods select extremeobs(persist);
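A short sketch of the PERSIST option follows. The selection made once stays in effect for both PROC UNIVARIATE steps, and a final ODS SELECT ALL statement returns ODS to selecting all output objects. Splitting the analysis into two steps is done here only to show the persistence.

```sas
ods select extremeobs(persist);

/* Both steps produce only the Extreme Observations table */
proc univariate data=learn.blood;
   var RBC;
run;

proc univariate data=learn.blood;
   var WBC;
run;

/* Clear the persistent selection so later procedures show all output */
ods select all;
```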
19.7 Sending Output to a SAS Data Set One of the possible output destinations is a SAS data set. This feature of ODS allows you to capture just about any value computed by a procedure and use it in further calculations or a customized report. Prior to the development of ODS, PROC PRINTTO was used to capture SAS output from a procedure and use it as input to a DATA step. Many procedures already provide you with the ability to capture information in a SAS data set with an OUT= option, usually in an OUTPUT statement. Using ODS to capture output is more general—it lets you capture any value you want. Suppose you want to run a statistical procedure called a t-test. This statistical test produces a t-value and a p-value (the significance of the result). These two values are not available in an output data set from PROC TTEST, but you can use ODS to place these values into a SAS data set. If you are not familiar with PROC TTEST, it should still be clear how to capture SAS procedure output into SAS data sets. Here is a program to capture a portion of the t-test output to a data set:
Program 19-6 Using ODS to send procedure output to a SAS data set

ods listing close;
ods output ttests=t_test_data;
proc ttest data=learn.blood;
   class Gender;
   var RBC WBC Chol;
run;
ods listing;

title "Listing of T_TEST_DATA";
proc print data=t_test_data;
run;
This program first closes the listing destination so that output is not sent to the Output window.
Note: Do not use a NOPRINT option, even on procedures that allow it. If you do, the values will not be available for ODS to send to an alternate destination.

Next, the ODS OUTPUT statement is used to send the output to a data set. The keyword to the left of the equal sign is the name of an output object produced by PROC TTEST. Remember that you can determine these names by running the procedure with the ODS TRACE ON statement (which sends the names of the output objects to the SAS log) or by checking the Results window to see the labels of the output objects. In this case, the output object TTESTS contains, among other things, the t- and p-values that you want. Following the equal sign is the name of the data set you want to create. You then run the procedure in the usual way. The first time you do this, you should run a PROC PRINT to determine the structure of the output data set. Don't forget to turn the LISTING destination back on before you do this. (If you forget to turn the LISTING output back on, you may scratch your head, wondering why you are not getting any output—this author certainly did.) Here is the output from PROC PRINT:

Listing of T_TEST_DATA

Obs  Variable  Method         Variances  tValue   DF   Probt
 1   RBC       Pooled         Equal        0.41  914  0.6797
 2   RBC       Satterthwaite  Unequal      0.41  875  0.6796
 3   WBC       Pooled         Equal        1.87  906  0.0624
 4   WBC       Satterthwaite  Unequal      1.87  865  0.0622
 5   Chol      Pooled         Equal        0.53  793  0.5953
 6   Chol      Satterthwaite  Unequal      0.54  771  0.5918
The structure of these output data sets can be complicated. In this instance, you know that PROC TTEST produces two t-values, one under the assumption of equal variance and the other under the assumption of unequal variance. What can you do with this data set? Sending computed values to a data set enables you to perform additional analyses on them. Also, you might want to present the output from a SAS procedure differently than the way SAS presents it. As an example, suppose you want to see a simple report showing the t- and p-values from your t-test, rather than the more complicated output from PROC TTEST. Once you know the structure of the output data set created by ODS, you can proceed like this:
Program 19-7 Using an output data set to create a simplified report

title "T-Test Results – Using Equal Variance Method";
proc report data=t_test_data nowd headline;
   where Variances = "Equal";
   columns Variable tValue ProbT;
   define Variable / width=8;
   define tValue / "T-Value" width=7 format=7.2;
   define ProbT / "P-Value" width=7 format=7.5;
run;
Inspection of data set T_Test_Data shows that there are separate observations for the two variance assumptions. You can use a WHERE statement to select statistics using the equal variance assumption. The result of the PROC REPORT is a simple listing showing the variable name along with the t- and p-values. Here it is:

T-Test Results – Using Equal Variance Method

Variable  T-Value  P-Value
--------------------------
RBC          0.41  0.67972
WBC          1.87  0.06237
Chol         0.53  0.59529
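Because the captured values are now ordinary variables, they can also feed a DATA step. The following sketch keeps the equal-variance rows of T_TEST_DATA and adds a flag for small p-values. The data set name T_TEST_FLAGS, the variable name Significant, and the .05 cutoff are all assumptions made for this illustration, not part of the original programs.

```sas
/* T_TEST_FLAGS and Significant are hypothetical names for this sketch */
data t_test_flags;
   set t_test_data;
   where Variances = "Equal";
   /* 1 if the p-value is present and below .05 (an assumed cutoff), else 0 */
   Significant = (not missing(Probt) and Probt lt .05);
run;

title "Equal-Variance T-Tests with a Significance Flag";
proc print data=t_test_flags noobs;
   var Variable tValue DF Probt Significant;
run;
```

The check for a missing value is included because SAS treats a missing numeric value as smaller than any number, so a bare comparison would flag it.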
This chapter has just scratched the surface of the capabilities of the Output Delivery System. However, by routing your output to destinations such as HTML or PDF files or by using ODS SELECT statements, you can create output for Web pages or company reports that looks much more impressive than a standard listing report.
19.8 Problems Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.
1. Run the following program, sending the output to an HTML file. Issue the appropriate commands to prevent SAS from creating a listing file. See Chapter 16, Problem 2, for the note about creating formats for this data set.

title "Sending Output to an HTML File";
proc print data=learn.college(obs=8) noobs;
run;
proc means data=learn.college n mean maxdec=2;
   var GPA ClassRank;
run;

2. Run the same two procedures shown in Problem 1, except create a contents file, a body file, and a frame file.
3. Run the same two procedures as shown in Problem 1, except use the JOURNAL (or FANCYPRINTER) style instead of the default style.
4. Send the results of a PROC PRINT on the data set Survey to an RTF file.
5. Run PROC UNIVARIATE on the variables Age and Salary from the Survey data set. Use the TRACE ON/TRACE OFF statements to display the names of the output objects created by this procedure. Once you have done this, run PROC UNIVARIATE again, selecting only the output object that shows Quantiles.
6. Run the same PROC UNIVARIATE as in Problem 5. Issue two ODS statements: one to select the MOMENTS output object and the other to send this output to a SAS data set. Run PROC PRINT to see a listing of this data set.
Chapter 20
Generating High-Quality Graphics

20.1 Introduction
20.2 Some Basic Concepts
20.3 Producing Simple Bar Charts Using PROC GCHART
20.4 Creating Pie Charts
20.5 Creating Bar Charts for a Continuous Variable
20.6 Creating Charts with Values Representing Categories
20.7 Creating Bar Charts Representing Sums
20.8 Creating Bar Charts Representing Means
20.9 Adding Another Variable to the Chart
20.10 Producing Scatter Plots
20.11 Connecting Points
20.12 Connecting Points with a Smooth Line
20.13 Problems
20.1 Introduction This chapter introduces some basic concepts behind SAS/GRAPH software. Here you will see how to create simple bar charts, scatter plots, and line graphs. There are many excellent books and manuals available from SAS Press. Among these, I recommend books by Thomas Miron,¹ Art Carpenter and Charles Shipp,² and Mike Zdeb.³
20.2 Some Basic Concepts First of all, in order to run all the procedures described in this chapter, you need to have SAS/GRAPH installed on your computer. Without it, you are relegated to using the older style “line printer” type plots using such procedures as PROC CHART and PROC PLOT. These procedures are fine if you want a quick look at your data in graphical form; however, they are not suitable for presentations. The appearance of graphs and charts is controlled by graphics options and global statements such as SYMBOL and PATTERN. Options that are set using these statements act somewhat like titles created using TITLE statements. That is, they remain in effect until you change them and they are additive. For example, if you already have dots selected as your plotting symbols, with the color set to black, and you change the color to red, your plots display with red dots. Because you may forget that certain graphics options have been previously set, it is a good idea to include the RESET=all graphics option before you begin a new chart or plot. This option resets all of the graphics options to their default values. (It also eliminates all TITLE and FOOTNOTE statements, even those used in non-graphics situations.)
1 See Thomas Miron, The How-To Book for SAS/GRAPH Software (Cary, NC: SAS Institute Inc., 1995).
2 See Arthur L. Carpenter and Charles E. Shipp, Quick Results with SAS/GRAPH Software (Cary, NC: SAS Institute Inc., 1995).
3 See Mike Zdeb, Maps Made Easy Using SAS (Cary, NC: SAS Institute Inc., 2002), if you are interested in using SAS to produce maps.
Here is an example of some basic graphics options you may want to set using the GOPTIONS statement (you can pronounce this as “G-Options” or “Gop-tions”):

goptions reset=all
         ftext='arial'
         htext=1.0
         ftitle='arial/bo'
         htitle=1.5
         colors=(black);
Besides the RESET=all option, the other options in this statement set the font for text to Arial (a sans-serif font that looks nice), with the text height set to 1.0 units (you can play with this number and try out different sizes). Titles are printed in Arial bold (that’s what the /bo is all about) and the title heights are set to 1.5 units. Finally, the color is set to black. The appearance of the output from various SAS/GRAPH procedures can be influenced by additional statements such as SYMBOL (defines plotting symbols and line styles), PATTERN (defines styles for bar graphs), and AXIS (defines horizontal and vertical axes). In each example, the appropriate SAS/GRAPH procedure is preceded by one or more of these statements. Let’s get started.
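The additive behavior of global statements described earlier in this section can be sketched with two SYMBOL statements. The second statement changes only the color; the plotting symbol chosen in the first remains in effect. The Clinic data set and its variables SBP and DBP are borrowed from the scatter plot examples later in this chapter.

```sas
goptions reset=all;              /* start from the default graphics settings */

symbol1 value=dot color=black;   /* black dots */
symbol1 color=red;               /* VALUE=dot still applies; only the color changes */

/* The TITLE statement follows RESET=all, which would otherwise clear it */
title "Scatter Plot of SBP by DBP";
proc gplot data=learn.clinic;
   plot SBP * DBP;
run;
quit;
```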
20.3 Producing Simple Bar Charts Using PROC GCHART If you already know how to use PROC CHART, you are well on your way to using PROC GCHART. As a matter of fact, if you submit the same PROC CHART statements and use PROC GCHART instead, you will produce charts—just not very pretty ones. The short program here produces a bar chart showing the distribution of blood types from the Blood data set.
Program 20-1 Producing a simple bar chart using PROC GCHART

title "Distribution of Blood Types";
pattern value=empty;
proc gchart data=learn.blood;
   vbar BloodType;
run;
quit;
A PATTERN statement requests that the bars in your vertical bar chart consist of an outline only (the default is to fill in the bar). The VBAR statement requests a vertical bar chart for the variable BloodType. Alternatives to VBAR are as follows:

HBAR     Horizontal bar chart
VBAR3D   Three-dimensional vertical bar chart
HBAR3D   Three-dimensional horizontal bar chart
PIE      Pie chart
PIE3D    Three-dimensional pie chart
DONUT    Donut chart
STAR     Star chart
Try substituting these keywords for VBAR in Program 20-1 to see what each one produces. Following is the chart produced by this program. Note: An additional GOPTIONS statement using the VSIZE=4 (vertical size) option was used to shorten the chart.
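As one instance of this substitution, the sketch below repeats Program 20-1 with HBAR in place of VBAR. It also includes the GOPTIONS statement with the VSIZE=4 option mentioned in the note above; placing it ahead of the TITLE statement is a choice made for this sketch.

```sas
goptions vsize=4;        /* shorten the chart, as described in the note */

title "Distribution of Blood Types";
pattern value=empty;     /* outline-only bars, as in Program 20-1 */
proc gchart data=learn.blood;
   hbar BloodType;       /* horizontal bars instead of vertical ones */
run;
quit;
```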
20.4 Creating Pie Charts By simply replacing the keyword VBAR with PIE, you can create a simple pie chart with the size of each slice proportional to the frequency of the displayed values. Here is Program 20-1 with the keyword PIE substituted for VBAR and the resulting chart:
Program 20-2 Creating a simple pie chart

title "Creating a Pie Chart";
pattern value=pempty;
proc gchart data=learn.blood;
   pie BloodType / noheading;
run;
quit;
The option NOHEADING removes the default heading (in this case, FREQUENCY of Blood Type) that is normally added when you create a pie chart.
20.5 Creating Bar Charts for a Continuous Variable You can use PROC GCHART to create bar charts for continuous variables. SAS automatically computes midpoints for each bar. In most cases, you will want to override this action and supply your own midpoints. To demonstrate how this works, Program 20-3 creates a bar chart showing the distribution of white blood cell (WBC) counts in the Blood data set.
Program 20-3 Creating a bar chart for a continuous variable

pattern value=empty;
proc gchart data=learn.blood;
   vbar WBC;
run;
Here is the output:
You can improve the appearance of this chart by specifying your own midpoints and adding a format, as follows:
Program 20-4 Selecting your own midpoints for the chart

pattern value=L2;
title "Distribution of WBC's";
proc gchart data=learn.blood;
   vbar WBC / midpoints=4000 to 11000 by 1000;
   format WBC comma6.;
run;
The midpoints option specifies that the horizontal axis runs from 4,000 to 11,000, with a bar every 1,000 units. A FORMAT statement specifies a comma format for WBC. Also, the PATTERN statement was changed to select diagonal lines (starting at the top left) with a density of 2 (higher numbers are darker). Here is the chart:
20.6 Creating Charts with Values Representing Categories As you saw in the previous section, SAS wants to place continuous values into groups before generating a frequency bar chart. There are times when you want to treat the values as discrete categories rather than continuous values. For example, you may have a numeric variable representing the day of the week or the month of the year. In this example, you want to create a bar chart showing the frequencies by day of the week for visits to a hospital. The data set Hosp contains a variable called AdmitDate. This variable is a SAS date representing the date each patient was admitted to the hospital. In order to plot a frequency bar chart, you first need to compute the day of the week corresponding to each admission date. The following program does this and it creates a bar chart:
Program 20-5 Demonstrating the need for the DISCRETE option of PROC GCHART

data day_of_week;
   set learn.hosp;
   Day = weekday(AdmitDate);
run;

title "Visits by Day of Week";
pattern value=R1;
proc gchart data=day_of_week;
   vbar Day;
run;
In the DATA step, you use the WEEKDAY function to compute the day of the week from the SAS date. PROC GCHART is then used to create a bar chart. A different pattern (R1) was also used so that you can see how the different patterns look. Here is the result:
Notice that SAS, seeing a numeric variable listed in the VBAR statement, attempts to group the values. To prevent this from happening, use the VBAR option DISCRETE. This tells SAS to treat the values as if they were categories. Here is the revised program and the resulting output:
Program 20-6 Demonstrating the DISCRETE option of PROC GCHART

title "Visits by Day of Week";
pattern value=R1;
proc gchart data=day_of_week;
   vbar Day / discrete;
run;
That certainly looks a lot better!
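The bars in this chart are labeled 1 through 7, because the WEEKDAY function returns 1 for Sunday through 7 for Saturday. One possible refinement, sketched below, is to attach a user-defined format so that the axis shows day abbreviations instead. The format name DAYFMT and the three-letter labels are assumptions made for this illustration.

```sas
/* DAYFMT is a hypothetical format name; WEEKDAY returns 1=Sunday ... 7=Saturday */
proc format;
   value dayfmt 1='Sun' 2='Mon' 3='Tue' 4='Wed'
                5='Thu' 6='Fri' 7='Sat';
run;

title "Visits by Day of Week";
pattern value=R1;
proc gchart data=day_of_week;
   vbar Day / discrete;
   format Day dayfmt.;   /* display day abbreviations instead of 1-7 */
run;
quit;
```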
20.7 Creating Bar Charts Representing Sums The same GCHART procedure can be used to create bar charts where the height of the bars represents some statistic, means (or sums) for example, for each value of a classification variable. To demonstrate this, let’s create a bar chart showing the sum of the variable TotalSales (from the Sales data set) for each region of the country.
Program 20-7 Creating a bar chart where the height of the bars represents sums

title "Total Sales by Region";
pattern1 value=L1;
axis1 order=('North' 'South' 'East' 'West');
proc gchart data=learn.sales;
   vbar Region / sumvar=TotalSales
                 type=sum
                 maxis=axis1;
   format TotalSales dollar8.;
run;
quit;
For this example, we jumped right in and added some bells and whistles to improve the chart’s appearance. A VBAR statement indicates you want each value of Region on the horizontal axis. Next, the two options, SUMVAR= and TYPE=, tell the procedure that you want the sum of the variable TotalSales to represent the height of each bar. Note: The default statistic is SUM, but it is a good idea to always specify the statistic you want. To control the order of the values on the horizontal axis, you use an AXIS statement. The option ORDER= allows you to specify the order of the Region values. You associate this AXIS statement with the horizontal (midpoint) axis by specifying MAXIS=axis1 as an option in the VBAR statement.
Here is the chart:
20.8 Creating Bar Charts Representing Means If you change the option TYPE=sum to TYPE=mean, the heights of the bars represent the means of the variable you name in the SUMVAR= option. We start with a simple chart showing the mean cholesterol levels for males and females, using data from the Blood data set. We then show you how to modify the program so that this same information is also broken down by blood type. But first, the simple chart:
Program 20-8 Creating a bar chart where the height of the bars represents means

title "Average Cholesterol by Gender";
pattern1 value=L1;
proc gchart data=learn.blood;
   vbar Gender / sumvar=Chol
                 type=mean;
run;
quit;
Here is the resulting output:
20.9 Adding Another Variable to the Chart There are two ways to add another variable to the chart. The first way is to use a GROUP= option in the VBAR statement. To see the cholesterol means by gender for each blood type, proceed as follows:
Program 20-9 Adding another variable to the chart

title "Average Cholesterol by Gender";
pattern1 value=L1;
proc gchart data=learn.blood;
   vbar Gender / sumvar=Chol
                 type=mean
                 group=BloodType;
run;
quit;
Here is the chart:
The other way to introduce another variable into the chart is to use the SUBGROUP= option. This option uses different patterns within each bar to represent a value of another variable. Because it doesn’t make any sense to show two different means within a single bar, we will change the program to display frequency information on blood type and gender. Here is the program:
Program 20-10 Demonstrating the SUBGROUP= option
title "Average Cholesterol by Gender";
pattern1 value=L1;
pattern2 value=R3;
proc gchart data=learn.blood;
   vbar BloodType / subgroup=Gender;
run;
quit;
Here you are requesting that each gender be represented by a different pattern within each bar representing blood type. Two PATTERN statements are used to specify the patterns you want to represent females and males in the chart. Here is the chart:
20.10 Producing Scatter Plots To display an X-Y plot, also known as a scatter plot, which shows the relationship between two variables, you can use PROC GPLOT. By adding SYMBOL statements, you can select a plotting symbol, connect the points by straight or curvy lines, or even plot a regression line. To demonstrate this procedure, we start with a simple plot of systolic blood pressure (SBP) by diastolic blood pressure (DBP) using data from the Clinic data set. We start with all the defaults in place:
Program 20-11 Creating a simple scatter plot using all the defaults
title "Scatter Plot of SBP by DBP";
proc gplot data=learn.clinic;
   plot SBP * DBP;
run;
If you are familiar with PROC PLOT, you already know how to use PROC GPLOT. Just add a “G” to the procedure name! In the previous program, the variable in the PLOT statement before the asterisk is plotted on the y-axis and the variable in the PLOT statement after the asterisk is plotted on the x-axis. Here is the resulting scatter plot:
To change the plotting symbol, use a SYMBOL statement, like this:
Program 20-12 Changing the plotting symbol and controlling the axis ranges
title "Scatter Plot of SBP by DBP";
symbol value=dot;
proc gplot data=learn.clinic;
   plot SBP * DBP / haxis=70 to 120 by 5
                    vaxis=100 to 220 by 10;
run;
The VALUE= option specifies that you want to use dots (this author’s favorite) as your plotting symbol. The two options HAXIS and VAXIS control the ranges on the horizontal and vertical axes, respectively.
Here is the result:
20.11 Connecting Points You can add an interpolation option to the SYMBOL statement to request that the points be joined or that a straight or curved line be drawn. Here are some examples:
Program 20-13 Joining the points with straight lines (first attempt)
title "Scatter Plot of SBP by DBP";
title2 h=1.2 "Interpolation Methods";
symbol value=dot interpol=join width=2;
proc gplot data=learn.clinic;
   plot SBP * DBP;
run;
The option INTERPOL= (or just I=) has quite a few options that control the appearance of your plot. The JOIN option connects the points with straight lines. The WIDTH= option controls the width of the line; larger values yield thicker lines. A second TITLE statement is also used, with the height set to 1.2 units. Note: Height units depend on the output device. You can use actual units, such as H=1.2cm, which will be the same size, independent of the output device. Here is the result:
Well, that was interesting! If you don’t specify how you want to join the points, SAS joins the points in the order they appear in the input data set. So, you might want to sort your data set first, like this:
Program 20-14 Using the JOIN option on a sorted data set
proc sort data=learn.clinic out=clinic;
   by DBP;
run;
title "Scatter Plot of SBP by DBP";
title2 h=1.2 "Interpolation Methods";
symbol value=dot interpol=join width=2;
proc gplot data=clinic;
   plot SBP * DBP;
run;
The result looks a bit better:
20.12 Connecting Points with a Smooth Line Another interpolation option is INTERPOL=sm (stands for smooth). To save you the trouble of sorting your data set, you can request INTERPOL=sms. The “s” added to the SM option stands for sort. When you use this option, SAS connects the points in the proper sorted order. You can also follow the SM with a number from 0 to 99 to control the smoothness of the curve. Low values cause the line to wiggle a lot and attempt to touch each of the points—high values give you smoother lines, with values near 99 giving you an almost straight line. The default value is 0. The program here demonstrates the smooth option:
Program 20-15 Drawing a smooth line through your data points
title "Scatter Plot of SBP by DBP";
title2 h=1.2 "Interpolation Methods";
symbol value=dot interpol=sms line=1 width=2;
proc gplot data=learn.clinic;
   plot SBP * DBP;
run;
The LINE= option allows you to select line types (1=a solid line, the default value). Here is the plot:
These examples touch only the surface of what can be done using SAS/GRAPH. You may require additional resources (as mentioned in the Introduction to this chapter) to produce the charts and plots you want. It is also useful to find someone who is really good with SAS/GRAPH and take him or her to lunch every once in a while—maybe even dinner!
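One option mentioned in Section 20.10 but not demonstrated above is the regression line. The following is a minimal sketch, assuming the same Clinic data set used in the earlier programs; it requests a linear regression line with INTERPOL=rl. The second title is an illustrative choice, not taken from the original programs.

```sas
title "Scatter Plot of SBP by DBP";
title2 h=1.2 "Linear Regression Line";   /* illustrative title text */
* INTERPOL=rl asks for a fitted linear regression line
  instead of joining the individual points;
symbol value=dot interpol=rl width=2;
proc gplot data=learn.clinic;
   plot SBP * DBP;
run;
quit;
```

Because the line is fitted rather than drawn from point to point, there is no need to sort the data set first.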
20.13 Problems
Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.
1. Using the SAS data set Bicycles, produce two vertical bar charts showing frequencies for Country and Model. Use the PATTERN option VALUE=empty.
2. Repeat Problem 1, except produce a pie chart instead of a bar chart.
3. Using the SAS data set Bicycles, produce a vertical bar chart showing the distribution of Total Sales (TotalSales). Use midpoints of 0 to 12,000, with intervals of 2,000.
4. Again, using the Bicycles data set, show the distribution of units sold (Units) for each value of Model. Your chart should look like this:
5. Using the SAS data set Bicycles, produce a vertical bar chart showing a frequency distribution of Country. Within each bar, show the distribution of Model. Your chart should look like this:
6. Using the SAS data set Bicycles, show the sum of total sales (TotalSales) for each Country. Your chart should look like this:
7. Using the SAS data set College, produce a vertical bar chart where the mean GPA is shown for each value of school size (SchoolSize). Remember to include a FMTSEARCH option, use the system option NOFMTERR, or write a format of your own. Your chart should look like this:
8. Using the SAS data set Fitness, produce a scatter plot showing the time to run a mile (TimeMile) on the y-axis and the resting pulse (RestPulse) on the x-axis. Use the dot as the plotting symbol.
9. Using the SAS data set Stocks, produce a scatter plot of Price versus Date. Use the dot as the plotting symbol and connect the dots with a smooth line.
Part 4
Advanced Topics
Chapter 21 Using Advanced INPUT Techniques
Chapter 22 Using Advanced Features of User-Defined Formats and Informats
Chapter 23 Restructuring SAS Data Sets
Chapter 24 Working with Multiple Observations per Subject
Chapter 25 Introducing the SAS Macro Language
Chapter 26 Introducing the Structured Query Language
Chapter 21
Using Advanced INPUT Techniques
21.1 Introduction
21.2 Handling Missing Values at the End of a Line
21.3 Reading Short Data Lines
21.4 Reading External Files with Lines Longer Than 256 Characters
21.5 Detecting the End of the File
21.6 Reading a Portion of a Raw Data File
21.7 Reading Data from Multiple Files
21.8 Reading Data from Multiple Files Using a FILENAME Statement
21.9 Reading External Filenames from a Data File
21.10 Reading Multiple Lines of Data to Form One Observation
21.11 Reading Data Conditionally (the Single Trailing @ Sign)
21.12 More Examples of the Single Trailing @ Sign
21.13 Creating Multiple Observations from One Line of Input
21.14 Using Variable and Informat Lists
21.15 Using Relative Column Pointers to Read a Complex Data Structure Efficiently
21.16 Problems
21.1 Introduction Chapter 3 covered the basics of reading raw data. This chapter discusses more advanced input topics. For an excellent reference discussing the INFILE options MISSOVER, PAD, and TRUNCOVER, see Randall Cates’ paper on the subject (footnote 1).
21.2 Handling Missing Values at the End of a Line Suppose you have a raw data file with three numbers on most of the lines (representing x, y, and z). Some of the lines contain fewer than three numbers. As an example, look at the raw data file here:
File c:\books\learning\missing.txt
1 2 3
4 5
6 7 8
9 10 11
What happens if you run a program like Program 21-1?
Program 21-1 Missing values at the end of a line with list input
data missing;
   infile 'c:\books\learning\missing.txt';
   input x y z;
run;
1 See Randall Cates, MPH, “Missover, Truncover, and PAD, Oh My!! Or Making Sense of the INFILE and INPUT Statements,” Paper 9-26, which is available at http://www2.sas.com/proceedings/sugi26/p009-26.pdf.
Here is the output:
Listing of MISSING
Obs    x     y     z
  1    1     2     3
  2    4     5     6
  3    9    10    11
You should notice immediately that something is wrong: There were four lines of raw data but only three observations in the data set. Not only that, SAS read the value of 6 on the third line of data as the value for z in the second observation. The remaining two values on the third data line (7 and 8) disappeared. What happened? The first observation is straightforward (x=1, y=2, and z=3). What happens when SAS reaches Line 2? There are values for x and y but none for z. So, SAS goes to the next line and sets z equal to 6. You are now at the bottom of the DATA step. Because there are more lines of data to be read, control returns back to the top of the DATA step and SAS moves the pointer to the next line of data, ignoring the two other values (7 and 8) on that line, and reads the last three values, 9, 10, and 11, for x, y, and z, respectively. Before you see how to fix this problem, take a look at the SAS log:
42   data missing;
43      infile 'c:\books\learning\missing.txt';
44      input x y z;
45   run;
NOTE: The infile 'c:\books\learning\missing.txt' is:
      File Name=c:\books\learning\missing.txt,
      RECFM=V,LRECL=256
NOTE: 4 records were read from the infile 'c:\books\learning\missing.txt'.
      The minimum record length was 3.
      The maximum record length was 7.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.MISSING has 3 observations and 3 variables.
The note that SAS went to a new line is important! It tells you exactly what happened and that you need to modify your program. When you have list input and you have some lines of data with fewer values than you need, you can use the INFILE option MISSOVER to fix things. MISSOVER tells SAS to set all the remaining variables to missing if you have more variables than there are data values in one line of raw data. Here is the program with the MISSOVER option added:
data missing;
   infile 'c:\books\learning\missing.txt' missover;
   input x y z;
run;
The resulting data set appears next:
Listing of MISSING
Obs    x     y     z
  1    1     2     3
  2    4     5     .
  3    6     7     8
  4    9    10    11
The example shows how important it is to look at the SAS log, even if your program runs and produces output.
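If you want to try the MISSOVER option without creating an external file, the sketch below places the same four data lines after a DATALINES statement. It assumes only the values shown in Missing.txt; the PROC PRINT step and its title are added here for convenience.

```sas
data missing;
   * INFILE DATALINES lets you apply INFILE options
     to in-stream data;
   infile datalines missover;
   input x y z;
datalines;
1 2 3
4 5
6 7 8
9 10 11
;
title "Listing of MISSING";
proc print data=missing;
run;
```

With MISSOVER in place, the second observation should have a missing value for z, and the data set should contain four observations.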
21.3 Reading Short Data Lines Take a look at the following file:
File c:\books\learning\short.txt
         1         2         3
123456789012345678901234567890
------------------------------
001Josuha Tyson    100 97 95
002Helen Ames       87 85
003ShouEn Lu        98 98 92
004Pam Mann        100100 99
These lines of data represent a subject number (Columns 1–3), a name (Columns 4–19), and three quiz scores (20–22, 23–25, and 26–28). It’s important to know that Line 2 contains a short record and it is not padded with blanks (that is, the person who entered the data pressed the carriage return after typing the 85). Let’s see what happens if you read this file with either column or formatted input on a Windows or UNIX platform (on mainframe systems, flat files like this are typically padded with blanks). Here is a program to do that:
Program 21-2 Reading a raw data file with short records
data short;
   infile 'c:\books\learning\short.txt';
   input Subject $ 1-3
         Name    $ 4-19
         Quiz1   20-22
         Quiz2   23-25
         Quiz3   26-28;
run;
Now look at the SAS log and a listing from PROC PRINT:
56   data short;
57      infile 'c:\books\learning\short.txt';
58      input Subject $ 1-3
59            Name    $ 4-19
60            Quiz1   20-22
61            Quiz2   23-25
62            Quiz3   26-28;
63   run;
NOTE: The infile 'c:\books\learning\short.txt' is:
      File Name=c:\books\learning\short.txt,
      RECFM=V,LRECL=256
NOTE: 4 records were read from the infile 'c:\books\learning\short.txt'.
      The minimum record length was 25.
      The maximum record length was 28.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.SHORT has 3 observations and 5 variables.
Listing of SHORT
Obs  Subject  Name           Quiz1  Quiz2  Quiz3
  1  001      Josuha Tyson     100     97     95
  2  002      Helen Ames        87     85      3
  3  004      Pam Mann         100    100     99
Again, you see that something went wrong, very wrong. Four records were read, but the data set has only three observations. Worse yet, Subject 002 has a Quiz3 score of 3. Even though the INPUT statement instructed SAS to read a value of Quiz3 in Columns 26–28, it went to the next line when it was unable to read those columns in the second record. It then read the subject number in Line 3 as the quiz score. SAS then moved the pointer to the next record and read data for subject 4 correctly. One way to prevent this from happening is to use the PAD option in the INFILE statement. This option pads each of the input records with blanks, out to the end of the logical record. On Windows and UNIX platforms, the default logical record length (abbreviated LRECL) is 256. You will see later how to change this length.
Program 21-3 Demonstrating the INFILE PAD option
data short;
   infile 'c:\books\learning\short.txt' pad;
   input Subject $ 1-3
         Name    $ 4-19
         Quiz1   20-22
         Quiz2   23-25
         Quiz3   26-28;
run;
After you run this program, you can see by looking at the listing of the data set here that the short record no longer causes a problem:
Listing of SHORT
Obs  Subject  Name           Quiz1  Quiz2  Quiz3
  1  001      Josuha Tyson     100     97     95
  2  002      Helen Ames        87     85      .
  3  003      ShouEn Lu         98     98     92
  4  004      Pam Mann         100    100     99
You now have the correct values in the data set. It is a good idea to use the PAD option whenever you are reading files using column or formatted input. The option TRUNCOVER may be used in place of PAD. It has the effect of the PAD and MISSOVER options combined and is a good overall choice for the INFILE option when you are reading variable length records.
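As a sketch of the TRUNCOVER alternative just described, here is Program 21-3 with TRUNCOVER substituted for PAD. It assumes the same Short.txt file and column layout; nothing else is changed.

```sas
data short;
   * TRUNCOVER: when a record ends before the requested
     columns, the affected variable is set from whatever
     is available (here, missing) and SAS does not flow
     over to the next record;
   infile 'c:\books\learning\short.txt' truncover;
   input Subject $ 1-3
         Name    $ 4-19
         Quiz1   20-22
         Quiz2   23-25
         Quiz3   26-28;
run;
```

The resulting data set should match the listing produced with the PAD option: four observations, with a missing Quiz3 value for Subject 002.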
21.4 Reading External Files with Lines Longer Than 256 Characters Because the default logical record length on Windows and UNIX systems is 256 bytes, you need to use the INFILE option LRECL to specify record lengths greater than 256. Suppose you are given a raw data file (Long.txt) where each line contains 3,000 bytes. In order to read this file, you need to specify the record length, like this:
infile 'long.txt' lrecl=3000;
If you use the PAD option in combination with the LRECL option, all the records are padded to whatever you specify for the LRECL.
21.5 Detecting the End of the File There are times when you want to read records from a file and perform certain calculations when you have reached the end of the file. One way to do this is with the END= option in the INFILE statement. Here is an example:
You have a file, Month.txt, with monthly sales totals. Here is a listing of the file:
File c:\books\learning\month.txt
Jan 2000
Feb 3000
Mar 2500
Apr 2600
May 1200
Jun 2300
Jul 1000
Aug 2300
Sep 1500
Oct 1900
Sep 2600
Oct 3400
Nov 4000
Dec 1200
You want to read this file and print out the total sales for the year. Here’s how:
Program 21-4 Demonstrating the END= option in the INFILE statement
data _null_;
   file print;
   infile 'c:\books\learning\month.txt' end=Last;
   input @1 Month $3.
         @5 MonthTotal 4.;
   YearTotal + MonthTotal;
   if last then
      put "Total for the year is" YearTotal dollar8.;
run;
If you need a review of DATA _NULL_, please refer to Chapter 4, Section 10. The variable name Last in the END= option is a temporary variable that has a value of false (0) until the last record is being read from the external file—it then has a value of true (1). You can use this logical variable to control the action of your program. In Program 21-4, you are reading monthly values and totaling them in the sum statement. When you have read the last record in the external file, the variable Last becomes true and the PUT statement executes, writing out the message “Total for the year is” followed by the yearly total. (Because of the FILE PRINT statement, this message is printed to your output device.)
END= is also a SET option and it is used to determine when you are reading the last observation in a SAS data set.
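To illustrate END= as a SET option, here is a minimal sketch that counts the observations in the Blood data set used in Chapter 20 and reports the count once the last observation has been read. The variable name Count and the message text are illustrative choices.

```sas
data _null_;
   file print;
   * Last is 0 until the final observation is read, then 1;
   set learn.blood end=Last;
   * Sum statement: Count is retained and starts at 0;
   Count + 1;
   if Last then
      put "Number of observations read: " Count;
run;
```

The logic mirrors Program 21-4: accumulate on every iteration of the DATA step, and write the result only when the END= variable becomes true.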
21.6 Reading a Portion of a Raw Data File Two INFILE options, OBS= and FIRSTOBS=, allow you to read a portion of a raw data file. If you specify OBS=n as an INFILE option, SAS stops reading from the file after it has read the nth line of data. If you include a FIRSTOBS=m INFILE option, SAS starts reading at Line m. When used together, you can read any number of contiguous lines of data. As an example, suppose you wanted to read only the first three months of data from the Month.txt file mentioned in Section 21.5. You could use the OBS= INFILE option like this:
Program 21-5 Demonstrating the OBS= INFILE option to read the first three lines of data
data readthree;
   infile 'c:\books\learning\month.txt' obs=3;
   input @1 Month $3.
         @5 MonthTotal 4.;
run;
Data set ReadThree consists of observations for January, February, and March only. Using the OBS= INFILE option is particularly useful if you have a very large raw data file and you want to test your program by reading only a few lines of data. If you don’t want to start reading from the first line of a file, you can use the FIRSTOBS= INFILE option to tell SAS where to begin reading data. This is useful if you have one or more header records in a file that you want to skip or if you want to read data from the middle or end of the file. As an example, if you want to read Lines 5 through 7 from the Month.txt file, you could use Program 21-6:
Program 21-6 Using the OBS= and FIRSTOBS= INFILE options together
data read5to7;
   infile 'c:\books\learning\month.txt' firstobs=5 obs=7;
   input @1 Month $3.
         @5 MonthTotal 4.;
run;
This program reads Lines 5, 6, and 7 from the file Month.txt and the resulting SAS data set contains observations for months May, June, and July.
21.7 Reading Data from Multiple Files SAS has several ways to read from multiple raw data files. If the names are similar, you may choose to use wildcards in the INFILE or FILENAME statement. If not, there are other methods you can use. You have four files called Quarter1.txt, Quarter2.txt, Quarter3.txt, and Quarter4.txt. You want to read data from all four files, so you write the following INFILE statement:
infile 'c:\books\learning\quarter?.txt';
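A complete DATA step built around this INFILE statement might look like the following sketch. The layout of the quarterly files is not described here, so the INPUT statement assumes the same layout as Month.txt; the data set name Quarters and the variables CurrentFile and SourceFile are hypothetical. The FILENAME= INFILE option is used to record which file each line came from.

```sas
data quarters;
   * Give the FILENAME= variable a length before it is used;
   length CurrentFile $ 200;
   infile 'c:\books\learning\quarter?.txt' filename=CurrentFile;
   * Assumed layout, borrowed from Month.txt;
   input @1 Month $3.
         @5 MonthTotal 4.;
   * The FILENAME= variable is not written to the data set,
     so copy it to an ordinary variable;
   SourceFile = CurrentFile;
run;
```

Keeping the source filename in the data set makes it easier to trace a questionable value back to the file it came from.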
You are free to use the DOS wildcard characters question mark (?) (which substitutes for any single character) or the asterisk (*) (0 or more characters) in your INFILE or FILENAME statements. Another way to read multiple files is to execute the INFILE statement conditionally. You can use the END= option in the INFILE statement to detect when you are at the end of one file and use that information to tell SAS to go to another file. To help make this clear, suppose you have two files: Alpha.txt and Beta.txt. You want to read all the records from Alpha.txt followed by all the records from Beta.txt. One way to do this is by executing Program 21-7:
Program 21-7 Using the END= option to read data from multiple files
data combined;
   if finished = 0 then infile 'alpha.txt' end=finished;
   else infile 'beta.txt';
   input . . .;
   . . .
run;
21.8 Reading Data from Multiple Files Using a FILENAME Statement Another way to read raw data from multiple files is to enter the filenames in a FILENAME statement. To demonstrate this, here is an alternative to Program 21-7:
filename bilbo ('alpha.txt' 'beta.txt');
data combined;
   infile bilbo;
   input . . .;
   . . .
run;
You may enter as many filenames as necessary when using this method of reading data from multiple files.
21.9 Reading External Filenames from a Data File For applications where it would be tedious to list all the external filenames in a FILENAME statement, you can have SAS read the names of the external files from a file containing the filenames. Here is an example:
Program 21-8 Reading external filenames from an external file
data readmany;
   infile 'c:\books\learning\filenames.txt';
   input ExternalNames $40.;
   infile dummy filevar=ExternalNames end=Last;
   do until (last);
      input . . .;
      output;
   end;
run;
Here the data file Filenames.txt contains the names of the files you want to read. The FILEVAR option in the INFILE statement uses the file whose name is stored in the variable ExternalNames. You need to include a dummy fileref (you can call it anything— SAS ignores it) to preserve proper syntax of the INFILE statement. The DO UNTIL loop reads data from the first file until the last record is read (and the variable Last becomes true). Then the program returns to the top of the DATA step, a new filename is read from the Filenames.txt file, and processing continues. A variation on Program 21-8 is to place the names of the external files following a DATALINES statement, as shown in the following program:
Program 21-9 Reading external filenames using a DATALINES statement
data readmany;
   input ExternalNames $40.;
   infile dummy filevar=ExternalNames end=Last;
   do until (last);
      input . . .;
      output;
   end;
datalines;
c:\books\learning\data1.txt
c:\books\learning\moredata.txt
c:\books\learning\fred.txt
;
Placing the filenames in the program may be more convenient than placing them in an external file.
21.10 Reading Multiple Lines of Data to Form One Observation Back in the “old days” when many flat files had a maximum length of 80 bytes, it was necessary to place data for one subject (observation) on multiple lines. You may even face this problem today.
As an example, you have a raw data file (Health.txt) with two lines (records) per subject and the following data layout:
File c:\books\learning\health.txt
Variable  Description               Line #  Starting Column  Length  Type
Subj      Subject Number            1       1                3       Char
DOB       Date of Birth             1       4                10      mm/dd/yyyy
Weight    Weight in Lbs.            1       14               3       Num
HR        Heart Rate                2       4                3       Num
SBP       Systolic Blood Pressure   2       7                3       Num
DBP       Diastolic Blood Pressure  2       10               3       Num
File c:\books\learning\health.txt
00112/25/1944210
    80160100
00205/11/1966102
    88122 76
00308/03/2000 66
    90102 62
Here is a program to read this data file and create a temporary SAS data set called Health:
Program 21-10 Reading multiple lines of data to create one observation
data health;
   infile 'c:\books\learning\health.txt' pad;
   input #1 @1  Subj   $3.
            @4  DOB    mmddyy10.
            @14 Weight 3.
         #2 @4  HR     3.
            @7  SBP    3.
            @10 DBP    3.;
   format DOB mmddyy10.;
run;
The salient feature of this program is the line pointer, a pound sign ( # ) in the INPUT statement. The # pointer tells SAS which line to read when you have multiple lines of data per observation. A listing of the SAS data from PROC PRINT is as follows:
Listing of HEALTH
Obs  Subj  DOB         Weight  HR  SBP  DBP
  1  001   12/25/1944     210  80  160  100
  2  002   05/11/1966     102  88  122   76
  3  003   08/03/2000      66  90  102   62
An alternative way to read this data file is to use a relative line pointer, a forward slash (/), to tell SAS to skip to the next line of input. Rewriting Program 21-10 using this method is shown next:
Program 21-11 Using an alternate method of reading multiple lines of data to form one SAS observation
data health;
   infile 'c:\books\learning\health.txt' pad;
   input @1  Subj   $3.
         @4  DOB    mmddyy10.
         @14 Weight 3.
       / @4  HR     3.
         @7  SBP    3.
         @10 DBP    3.;
   format DOB mmddyy10.;
run;
Between these two methods, this author prefers the # line pointer over the slash notation because it makes programs that are easier to read.
21.11 Reading Data Conditionally (the Single Trailing @ Sign) Some data files are more complex than the ones you have seen so far. For example, you may have a mixture of record types in a single file. In order to read values from a file of this type, you need to first read part of the line to determine how to read the remainder of the line. SAS provides you with the single trailing at sign (@) to solve this problem. Here is an example: You are given a file from a survey taken over two years. Unfortunately, in the second year, a question was added in the middle of the survey. So, the data layout for Year 1 and Year 2 is different. Worse yet, data from both years are completely mixed up in a single file. First, look at the data layout for each of the two years:
Data Layout for Year 2005 Survey
Variable  Description    Starting Column  Length  Type
Number    Survey Number  1                3       Num
Q1        Question 1     4                1       Char
Q2        Question 2     5                1       Char
Q3        Question 3     6                1       Char
Q4        Question 4     7                1       Char
Year      Survey Year    9                4       Char
Data Layout for Year 2006 Survey
Variable  Description    Starting Column  Length  Type
Number    Survey Number  1                3       Num
Q1        Question 1     4                1       Char
Q2        Question 2     5                1       Char
Q2B       Question 2B    6                1       Char
Q3        Question 3     7                1       Char
Q4        Question 4     8                1       Char
Year      Survey Year    9                4       Char
Some sample lines of data follow:
File c:\books\learning\survey56.txt
001ABED 2005
002AABCD2006
005AADD 2005
007BBCDE2006
009ABABA2006
010DEEB 2005
Here is a program that does NOT work:
Program 21-12 Incorrect attempt to read a file of mixed record types
data survey;
   infile 'c:\books\learning\survey56.txt' pad;
   input @9 year $4.;
   if year = '2005' then
      input @1 Number 3.
            @4 Q1     $1.
            @5 Q2     $1.
            @6 Q3     $1.
            @7 Q4     $1.;
   else if year = '2006' then
      input @1 Number 3.
            @4 Q1     $1.
            @5 Q2     $1.
            @6 Q2B    $1.
            @7 Q3     $1.
            @8 Q4     $1.;
run;
Here is the problem: after SAS processes the first INPUT statement, the line pointer moves to the next line in the file so that subsequent data values are read from the next line. To solve this problem, use a trailing @ at the end of the first INPUT statement. This is an instruction to “hold the line” for another INPUT statement in the same DATA step. By “holding the line,” we mean to leave the pointer at the present position and not to advance to the next record. The single trailing @ holds the line until another INPUT statement (without a trailing @) is encountered further down in the DATA step, or the end of the DATA step is reached.
Here is the program, rewritten correctly:
Program 21-13 Using a trailing @ to read a file with mixed record types
data survey;
   infile 'c:\books\learning\survey56.txt' pad;
   input @9 Year $4. @;
   if Year = '2005' then
      input @1 Number 3.
            @4 Q1     $1.
            @5 Q2     $1.
            @6 Q3     $1.
            @7 Q4     $1.;
   else if Year = '2006' then
      input @1 Number 3.
            @4 Q1     $1.
            @5 Q2     $1.
            @6 Q2B    $1.
            @7 Q3     $1.
            @8 Q4     $1.;
run;
That little @ sign makes all the difference! Now, after SAS reads a value for Year, it executes the appropriate INPUT statement and the remaining data values are read from the same line of data.
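To experiment with the trailing @ without an external file, you can place the six sample lines after a DATALINES statement, as in the sketch below. The informats (3. and $1.) are taken from the lengths and types in the two layout tables; the PROC PRINT step and its title are added here for convenience.

```sas
data survey;
   * Read the year first and hold the line;
   input @9 Year $4. @;
   if Year = '2005' then
      input @1 Number 3. @4 Q1 $1. @5 Q2 $1.
            @6 Q3 $1.    @7 Q4 $1.;
   else if Year = '2006' then
      input @1 Number 3. @4 Q1 $1. @5 Q2 $1.
            @6 Q2B $1.   @7 Q3 $1. @8 Q4 $1.;
datalines;
001ABED 2005
002AABCD2006
005AADD 2005
007BBCDE2006
009ABABA2006
010DEEB 2005
;
title "Listing of SURVEY";
proc print data=survey;
run;
```

The data set should contain six observations, with Q2B blank for the 2005 records, because that question did not exist in the first year.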
21.12 More Examples of the Single Trailing @ Sign It might be useful to show some more examples of the single trailing @. Suppose you have a raw data file containing data on males and females. Suppose further that you only want to read data on females. One way would be to read all the variables of interest, including Gender, and test if Gender is equal to F (Female). If so, keep the observation; if not, delete it. This is very inefficient. It is better to read the Gender value, check it, and then only read the remaining values if Gender is equal to F. Let’s use the file Bank.txt (see the following description) for an example of how this works. Here is the file description:
File c:\books\learning\bank.txt
Variable  Description           Starting Column  Ending Column  Data Type
Subj      Subject Number        1                3              Character
DOB       Date of Birth         4                13             Character
Gender    Gender                14               14             Character
Balance   Bank Account Balance  15               21             Numeric
An efficient program to read data from this file, keeping only data on females, would be as follows:
Program 21-14 Another example of a trailing @ sign
data females;
   infile 'c:\books\learning\bank.txt' pad;
   input @14 Gender $1. @;
   if Gender ne 'F' then delete;
   input @1  Subj    $3.
         @4  DOB     mmddyy10.
         @15 Balance 7.;
run;
The first INPUT statement reads a single column to determine the value of Gender. The trailing @ prevents SAS from going to a new line. Next, if the value is not F, the DELETE statement instructs SAS not to output an observation to the SAS data set and to return control to the top of the DATA step.
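The same logic can also be expressed with a subsetting IF statement in place of IF ... THEN DELETE. The following sketch assumes the same Bank.txt file and layout as Program 21-14; only the test is written differently.

```sas
data females;
   infile 'c:\books\learning\bank.txt' pad;
   input @14 Gender $1. @;
   * Subsetting IF: when the condition is false, SAS stops
     processing this iteration, writes no observation, and
     returns to the top of the DATA step;
   if Gender = 'F';
   input @1  Subj    $3.
         @4  DOB     mmddyy10.
         @15 Balance 7.;
run;
```

Which form you use is a matter of taste; the subsetting IF states the condition for keeping a record rather than the condition for discarding it.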
21.13 Creating Multiple Observations from One Line of Input There are times when you want to place data for several subjects on a single line (or someone gave you a file like this). For example, suppose you have some X,Y pairs and want to create a SAS data set. One way is to place one pair of points on each input line, like this:
Program 21-15 Creating one observation from one line of data
data pairs;
   input X Y;
datalines;
1 2
3 4
5 7
8 9
11 14
13 18
21 27
30 40
;
This is fine, but it could also be written like this:
Program 21-16 Creating several observations from one line of data
data pairs;
   input X Y @@;
datalines;
1 2  3 4  5 7  8 9  11 14  13 18  21 27
30 40
;
The double @ signs prevent SAS from moving to a new line, even when you reach the bottom of the DATA step (unless there is another INPUT statement in the DATA step without a trailing @@). In other words, the double trailing @ signs allow SAS to read a stream of data, eating through all the values on a line and then going to the next line. Use this with great care. If you are missing a data value in the file, everything gets out of whack. Note: Without the double trailing @ signs in Program 21-16, there would be only two observations (X=1, Y=2 and X=30, Y=40).
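One way to guard against the missing-value problem is to hold the place of a missing number with a period, so the stream of values stays in step. The sketch below uses made-up data values to show the idea.

```sas
data pairs;
   input X Y @@;
* The period on the first line marks a missing Y value
  (data values here are made up for illustration);
datalines;
1 2  3 .  5 7
8 9
;
```

This should produce four observations, with Y missing in the second one. Without the period, the value 5 would be read as the Y that belongs with X=3, and every pair after it would be shifted.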
21.14 Using Variable and Informat Lists You can supply a single informat to a list of variables and save some typing. To do this, place a list of variables in parentheses and follow this list with a list of one or more informats. For example, you could write the following line to read three numeric values (each with an informat of 2.) and five character values (each with an informat of $1.):
input @1 (X Y Z Char1-Char5)(3*2. 5*$1.);
Notice that each of the informats in this statement is preceded by a multiplier, written as a number followed by an asterisk. In this example, there are as many informats as there are variables. If there are more variables than there are informats, SAS cycles back to the beginning of the informat list to obtain informats for the remaining variables. You can take advantage of this feature to read a long list of variables that all share the same informat. For example, to read 50 one-digit character values, starting in Column 10, you could use the following code:
input @10 (Ques1-Ques50) ($1.);
This saves quite a bit of typing. You can also use an informat list with list input, like this:
input (Ques1-Ques50) (: $1.);
Don’t forget the colon before the informat. The colon says to read a value according to the appropriate informat but to stop reading when a delimiter is encountered.
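To see the informat list cycle in practice, here is a minimal sketch that reads five one-character answers with a single $1. informat. The data set name Quiz and the two data lines are invented for illustration; they do not come from any file used in this book.

```sas
/* Sketch only: the data set name and data lines are made up. */
data quiz;
   input @1 ID $3.
         @4 (Ques1-Ques5) ($1.);   /* $1. is reused for each of the five variables */
datalines;
001ABCDE
002BBADC
;

title "Listing of QUIZ";
proc print data=quiz noobs;
run;
```

Because the informat list holds only one informat, SAS returns to the start of that list for each remaining variable, so Ques1 through Ques5 each receive one column of data.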
21.15 Using Relative Column Pointers to Read a Complex Data Structure Efficiently
Besides the absolute column pointer (@), SAS also has a relative column pointer, a plus sign (+). You can use this to move the pointer left or right of its current position. Let's look at an example. You have seven readings of the variables Time and Temperature arranged in pairs: Time values are in seconds and occupy 2 bytes; Temperature values are in degrees Celsius and occupy 4 bytes. The data pairs start in Column 5. Here are a few sample lines of data:
File c:\books\learning\tempdata.txt
         1         2         3         4         5
12345678901234567890123456789012345678901234567890
--------------------------------------------------
    1020.11120.21321.61525.01728.72031.42533.8
    1021.71322.81528.01728.92129.92430.42833.7
The tedious way to read this file is like this:
input @5  Time1 2. @7  T1 4.
      @11 Time2 2. @13 T2 4.
      @17 Time3 2. @19 T3 4.
      @23 Time4 2. @25 T4 4.
      @29 Time5 2. @31 T5 4.
      @35 Time6 2. @37 T6 4.
      @41 Time7 2. @43 T7 4.;
If you use a relative column pointer, along with a variable and informat list, this simplifies to the following:
input @5 (Time1-Time7)(2. +4)
      @7 (T1-T7)(4. +2);
This INPUT statement moves the column pointer to Column 5. Then SAS reads a value for Time using a 2. informat and skips four spaces. Because there are still variables in the variable list, this informat (read two spaces and skip four) repeats for the remaining values of Time. Next, the pointer moves back to Column 7. Here you read a value of Temperature using a 4. informat and skip two spaces. This informat is repeated until all the values of Temperature have been read. If you use either of these INPUT statements in a DATA step, the resulting data sets are identical (except for the order of the variables). As you saw in this chapter and in Chapter 3, the INPUT statement is one of the most powerful statements in SAS and allows you to read almost any data file.
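As a quick check of the relative-pointer version, the sketch below places the two sample lines in-stream instead of reading tempdata.txt. The data set name Temperature and the use of DATALINES are assumptions made so that the example is self-contained; the four leading blanks on each data line keep the first pair in Column 5.

```sas
/* Sketch only: in-stream copy of the two sample lines shown earlier. */
data temperature;
   input @5 (Time1-Time7)(2. +4)   /* read 2 columns, skip 4, repeat */
         @7 (T1-T7)(4. +2);        /* read 4 columns, skip 2, repeat */
datalines;
    1020.11120.21321.61525.01728.72031.42533.8
    1021.71322.81528.01728.92129.92430.42833.7
;

title "Listing of TEMPERATURE";
proc print data=temperature noobs;
run;
```

For the first line, Time1 is read from Columns 5-6 and T1 from Columns 7-10, so the listing should begin with Time1=10 and T1=20.1.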
21.16 Problems
Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. The raw data file Scores.txt contains space-delimited data with up to three scores per line. Write a DATA step to read values Score1–Score3 as numeric variables from this file. Be careful because not all lines of data contain three scores. Your resulting data set should have three observations.

2. Repeat Problem 1, except read the text file Scores_Comma.csv. This file is similar to Scores.txt, except that it is a comma-separated values (CSV) file. Assume that two commas in a row indicate that there is a missing value.

3. The raw data file Scores_Columns.txt has three scores per line. Values start in Column 1 and each score occupies two columns. Thus, Score1 is in Columns 1–2, Score2 in Columns 3–4, and Score3 in Columns 5–6. Using column input, read this data file and create a SAS data set.

4. Repeat Problem 3 using formatted input.

5. You want to read observations from the SAS data set Bicycles. Using a DATA _NULL_ step, sum the value of units sold (Units) and the sum of Sales (TotalSales) for all the observations. Call the former TotalUnits and the latter Sum_of_Sales. Use a SET option to test when the last observation from Bicycles is being read and, using a PUT statement, list the values of the two variables TotalUnits and Sum_of_Sales. Your report should look like this:

Summary Report from BICYCLES Data Set
---------------------------------------
Total Units Sold is 48,055
Sales Total is $87,088

6. Write a SAS DATA step to read Lines 2 through 5 (inclusive) from the Month.txt raw data file. This file contains a three-character month value starting in Column 1 and a month total (four digits) starting in Column 5. Use PROC PRINT to verify that you have observations for February through May.
7. You want to read raw data from two files, File_A.txt and File_B.txt. You want to skip the first line of each file (it contains header information). The remaining lines contain space-delimited values for three variables X, Y, and Z. Use the END= INFILE option to test when you are finished reading from the first file and switch input to the second file. Use PROC PRINT to list the contents of the resulting SAS data set.

8. Two files, Xyz1.txt and Xyz2.txt, each contain three values on each line, separated by spaces. Write a SAS DATA step to read values for X, Y, and Z from both of these files. Use an INFILE statement using a wildcard character.

9. Repeat Problem 8 using a FILENAME statement and listing the two files by name.

10. The raw data file Mixed_Recs.txt contains two types of records. Records with a 1 in Column 16 contain sales records with Date (in mmddyy10. form) starting in Column 1 and Amount in Columns 11–15 (in standard numeric form). Records with a 2 in Column 16 are inventory records and they contain two values, a part number (character, 5 bytes) starting in Column 1 and a quantity. These two values are separated by a space. A listing of this file is shown below:

File Mixed_Recs.txt
10/21/2005  1001
11/15/2005  2001
A13688 250     2
B11112 300     2
01/03/2005 50001
A88778 19      2

Write a DATA step to read this file and create two SAS data sets, one called Sales and the other Inventory. The two data sets should look like this:

Listing of SALES
Obs  Date        Amount
1    10/21/2005     100
2    11/15/2005     200
3    01/03/2005    5000

Listing of INVENTORY
Obs  Part Number  Quantity
1    A13688            250
2    B11112            300
3    A88778             19
11. Each record in the text file Three_Per_Line.txt contains three sets of vital signs: heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP). Values start in Column 1 and each value takes three columns. Write an INPUT statement that uses a format list and relative column pointers so that you can read the three HRs, the three SBPs, and the three DBPs together. Call the three HR values HR1, HR2, and HR3; the three SBP values SBP1, SBP2, and SBP3; and the three DBP values DBP1, DBP2, and DBP3. Some sample data lines are shown here:

068120 80 72130 80 69122 78
072180110 76178102 70178100
054118 70 56118 72 50114 78
Chapter 22
Using Advanced Features of User-Defined Formats and Informats

22.1 Introduction
22.2 Using Formats to Recode Variables
22.3 Using Formats with a PUT Function to Create New Variables
22.4 Creating User-Defined Informats
22.5 Reading Character and Numeric Data in One Step
22.6 Using Formats (and Informats) to Perform Table Lookup
22.7 Using a SAS Data Set to Create a Format
22.8 Updating and Maintaining Your Formats
22.9 Using Formats within Formats
22.10 Using Multilabel Formats
22.11 Using the INPUTN Function to Perform a More Complicated Table Lookup
22.12 Problems
22.1 Introduction
User-defined formats can do much more than make output from SAS procedures more readable. You will see how to use formats to create new variables and to perform table lookups. You can even create your own informats to alter data values as they are being read into a SAS data set. Finally, certain SAS procedures support multilabel formats, that is, the ability to have a single value correspond to more than one format range.

22.2 Using Formats to Recode Variables
Many SAS procedures use formatted values of variables in their processing. For example, PROC FREQ uses formatted values when it computes frequencies; PROC MEANS uses formatted values of a CLASS variable in its calculations. As an example, suppose you want to compute frequencies on age groups (say, 20-year intervals), but your SAS data set contains the variable Age. Rather than create a new variable representing age groups, you can simply write a format and use PROC FREQ, like this:
Program 22-1 Using a format to recode a variable
proc format;
   value agefmt  0 - <20  = '< 20'
                20 - <40  = '20 to 39'
                40 - <60  = '40 to 59'
                60 - high = '60+';
run;

title "Using a Format to Recode a Variable";
proc freq data=learn.survey;
   tables Age / nocum nopercent;
   format Age agefmt.;
run;
When you run this program, the frequencies are computed for the age groups, not Age, as you can see in this listing:

Using a Format to Recode a Variable
The FREQ Procedure

Age         Frequency
---------------------
20 to 39            3
40 to 59            2
60+                 2
This same technique works for CLASS variables in PROC MEANS, PROC SUMMARY, PROC TABULATE, PROC ANOVA, PROC GLM, and several other procedures.
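As a sketch of the same idea with a CLASS variable, the following PROC MEANS step summarizes Salary by formatted age group. It assumes that the AGEFMT format from Program 22-1 has already been created in the current session; the choice of statistics and the MAXDEC= value are illustrative only.

```sas
/* Sketch only: assumes the AGEFMT format from Program 22-1 exists. */
title "Salary by Age Group";
proc means data=learn.survey n mean maxdec=0;
   class Age;              /* groups are formed from the formatted values of Age */
   var Salary;
   format Age agefmt.;
run;
```

Removing the FORMAT statement would produce one row for each distinct value of Age instead of one row per age group.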
22.3 Using Formats with a PUT Function to Create New Variables
There are times when you want to create a new variable consisting of formatted values. Program 22-2 demonstrates how this is done.

Program 22-2 Using a format and a PUT function to create a new variable
proc format;
   value agefmt  0 - <20  = '< 20'
                20 - <40  = '20 to 39'
                40 - <60  = '40 to 59'
                60 - high = '60+';
run;

data survey;
   set learn.survey;
   AgeGroup = put(Age,agefmt.);
run;
The key to this program is the PUT function. This function takes the value of its first argument, formats this value with the format listed as the second argument, and returns the formatted value. Remember that the PUT function always returns a character value. In this example, AgeGroup is a character variable with a length of 8 bytes (the length of the longest formatted value). Here is a listing of data set Survey:

Listing of SURVEY
ID   Gender  Age  Salary  Ques1  Ques2  Ques3  Ques4  Ques5  AgeGroup
001  M        23   28000      1      2      1      2      3  20 to 39
002  F        55   76123      4      5      2      1      1  40 to 59
003  M        38   36500      2      2      2      2      1  20 to 39
004  F        67  128000      5      3      2      2      4  60+
005  M        22   23060      3      3      3      4      2  20 to 39
006  M        63   90000      2      3      5      4      3  60+
007  F        45   76100      5      3      4      3      3  40 to 59
22.4 Creating User-Defined Informats
Although you are probably familiar with using PROC FORMAT to create user-written formats, you may not have used this procedure to create your own informats. You can use user-written informats to alter values as they are read from a raw data file or with the INPUT function to perform a table lookup. Let’s look at an example. Your raw data consists of IDs and letter grades (A+, A, A-, and so on). You want to convert these letter grades into numbers according to a predefined table. Program 22-3 creates an informat called CONVERT and uses that informat to read the letter grades.
Program 22-3 Demonstrating a user-written informat
proc format;
   invalue convert 'A+' = 100
                   'A'  = 96
                   'A-' = 92
                   'B+' = 88
                   'B'  = 84
                   'B-' = 80
                   'C+' = 76
                   'C'  = 72
                   'F'  = 65;
run;

data grades;
   input @1 ID $3.
         @4 Grade convert2.;
datalines;
001A-
002B+
003F
004C+
005A
;
You use an INVALUE statement to create an informat. The rules are similar to the ones you use in creating formats. One tiny difference is that informat names can only be 31 characters in length (including the dollar sign ($) if it will be used to read character data). For those with curious minds, SAS uses 1 byte in the format catalog (the @ sign) to differentiate between formats and informats. (If you look at the entries in your catalog with the FMTLIB option or PROC CATALOG, you will see the @ signs added to the informat names.)
Following the keyword INVALUE, you type the name of the informat you want to create. Just as with formats, you must start the name with a dollar sign if it is to be used to create character data. In the previous program, even though you are reading character values (for example, A+, A, A-), the resulting variable (Grade) is a numeric variable. If you used the name $CONVERT instead of CONVERT in this program, the variable Grade would be character. Next, you specify the ranges, equal signs, and labels. Again, you use the same rules as for formats.
You use your informat just as you would a built-in SAS informat. Notice that you can add a width to it to specify how many columns of data to read. Here is a listing of data set Grades:

Listing of GRADES
ID   Grade
001     92
002     88
003     65
004     76
005     96
It is important to remember that the original character data values are not part of this data set. If you wanted both the original letter grade and its corresponding numeric value, you could read the letter grade as a character variable and use an INPUT function (with the CONVERT informat) to create the numeric grade. There are some useful options that you can use when you create your informats. UPCASE and JUST are two such options. UPCASE, as the name implies, converts the data values to uppercase before checking on the informat ranges. JUST left-aligns character values. Let’s use these two options to enhance Program 22-3.
Program 22-4 Demonstrating informat options UPCASE and JUST
proc format;
   invalue convert(upcase just)
                   'A+' = 100
                   'A'  = 96
                   'A-' = 92
                   'B+' = 88
                   'B'  = 84
                   'B-' = 80
                   'C+' = 76
                   'C'  = 72
                   'F'  = 65
                   other = .;
run;

data grades;
   input @1 ID $3.
         @4 Grade convert2.;
datalines;
001A-
002b+
003F
004c+
005 A
006X
;

title "Listing of GRADES";
proc print data=grades noobs;
run;
Notice that the raw data values being read contain upper- and lowercase values and the value for Student 005 contains a leading blank. We also added a student with an invalid grade (X). To prevent error messages in the SAS log, the informat category OTHER was added so that invalid values are converted to numeric missing values. The following listing shows that all the data values were read correctly:

Listing of GRADES
ID   Grade
001     92
002     88
003     65
004     76
005     96
006      .
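Earlier in this section it was noted that you can keep both the original letter grade and its numeric equivalent by reading the letter as a character variable and converting it with an INPUT function. Here is a minimal sketch of that idea. It assumes that the CONVERT informat from Program 22-3 or Program 22-4 already exists; the data set name Grades_Both and the variable name Letter are hypothetical names chosen for this example.

```sas
/* Sketch only: assumes the CONVERT informat has already been created. */
data grades_both;
   input @1 ID $3.
         @4 Letter $2.;                 /* keep the original letter grade */
   Grade = input(Letter, convert2.);    /* numeric grade from the informat */
datalines;
001A-
002B+
003F
004C+
005A
;

title "Listing of GRADES_BOTH";
proc print data=grades_both noobs;
run;
```

The resulting data set holds three variables: ID, the character variable Letter, and the numeric variable Grade.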
22.5 Reading Character and Numeric Data in One Step
There is a little-known feature in user-defined informats: the ability to create an informat that reads both character and numeric data and creates a numeric variable. The examples that follow show ways that you can take advantage of this feature.
For this first example, you have temperature readings on hospital patients. Because many of the patients have a normal temperature (98.6 degrees Fahrenheit), it is convenient to record normal temperatures by entering an N as a data value. Some sample lines of data are as follows:

101 N 97.3 n N 104.5
Notice that some of the Ns are in uppercase and others in lowercase. Before we look at the elegant enhanced informat solution, let’s look at a traditional approach:
Program 22-5 A traditional approach to reading a combination of character and numeric data
data temperatures;
   input Dummy $ @@;
   if upcase(Dummy) = 'N' then Temp = 98.6;
   else Temp = input(Dummy,8.);
   drop Dummy;
datalines;
101 N 97.3 n N 104.5
;
Each data value is read as character data. A check is made to see if this value is equal to an upper- or lowercase N. If so, Temp is set to 98.6—if not, the INPUT function performs the character-to-numeric conversion. This works fine, but compulsive programmers (who, me?) are always looking for a more elegant solution. Program 22-6 uses an enhanced numeric informat:
Program 22-6 Using an enhanced numeric informat to read a combination of character and numeric data
proc format;
   invalue readtemp(upcase)
                    96 - 106 = _same_
                    'N'      = 98.6
                    other    = .;
run;

data temperatures;
   input Temp : readtemp5. @@;
datalines;
101 N 97.3 n N 67 104.5
;
The UPCASE option converts any character to uppercase. The keyword _SAME_ in this informat leaves any numeric values in the range 96 to 106 unchanged. Values of N are converted to the numeric value of 98.6 and any values that are not in the range of 96 to 106 or equal to N are set to a numeric missing value. The technique of using the keyword _SAME_ for valid values and setting other values to missing is one way to screen unwanted extreme values as they are being read. A listing of data set Temperatures is shown next:

Listing of Data Set TEMPERATURES
 Temp
101.0
 98.6
 97.3
 98.6
 98.6
    .
104.5
The next program reads a combination of letter and number grades. The grades A–F are converted to number grades as follows: A=95, B=85, C=75, and F=65. Number grades are not changed. Here is the program:
Program 22-7 Another example of an enhanced numeric informat
proc format;
   invalue readgrade(upcase)
                     'A'   = 95
                     'B'   = 85
                     'C'   = 75
                     'F'   = 65
                     other = _same_;
run;

data school;
   input Grade : readgrade3. @@;
datalines;
97 99 A C 72 f b
;
Here the values A through F are converted to the appropriate numeric value and all other values are left unchanged. Here is a listing of the data set:

Listing of SCHOOL
Grade
   97
   99
   95
   75
   72
   65
   85
22.6 Using Formats (and Informats) to Perform Table Lookup
Table lookup is a process where one or more data values are used to retrieve a value for another variable. For example, given an item number from a catalog, you might want to retrieve a product name and a price. SAS provides you with a variety of ways to perform a table lookup, such as a data set merge, a key value from an index, an array, a format, or an informat. Because SAS formats are stored in memory, they are fast and efficient. Program 22-8 uses an item number (ItemNumber) to look up a product name and its price. Here is the program:
Program 22-8 Using formats and informats to perform a table lookup
proc format;
   value namelookup
            122   = 'Salt'
            188   = 'Sugar'
            101   = 'Cereal'
            755   = 'Eggs'
            other = ' ';
   invalue pricelookup
            'Salt'   = 3.76
            'Sugar'  = 4.99
            'Cereal' = 5.97
            'Eggs'   = 2.65
            other    = .;
run;

data grocery;
   input ItemNumber @@;
   Name = put(ItemNumber,namelookup.);
   Price = input(Name,pricelookup.);
datalines;
101 755 122 188 999 755
;
A user-defined format is used, along with a PUT function, to retrieve a product name from the item number. An informat, relating product names and prices, is used with an INPUT function to retrieve the price. An informat was used for the price lookup because the INPUT function can return a numeric value. Remember that the PUT function always returns a character value. (If you used a format, you would have to perform a character-to-numeric conversion to obtain a numeric value for the price.) Both the format and informat use the keyword OTHER to assign a missing value to an invalid item code. A listing of data set Grocery follows:

Listing of GROCERY
Item Number  Name    Price
        101  Cereal   5.97
        755  Eggs     2.65
        122  Salt     3.76
        188  Sugar    4.99
        999              .
        755  Eggs     2.65
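For comparison, here is a sketch of the format-only alternative mentioned in the parenthetical remark above: a format returns the price as text, and an INPUT function then performs the character-to-numeric conversion. The format name PRICEFMT and the data set name Grocery_Alt are hypothetical; the prices are the ones paired with each item number in Program 22-8.

```sas
/* Sketch only: PRICEFMT and GROCERY_ALT are illustrative names. */
proc format;
   value pricefmt
            101   = '5.97'
            755   = '2.65'
            122   = '3.76'
            188   = '4.99'
            other = ' ';
run;

data grocery_alt;
   input ItemNumber @@;
   /* PUT returns character text; INPUT converts that text to a number */
   Price = input(put(ItemNumber,pricefmt.), 8.);
datalines;
101 755 122 188 999 755
;
```

The extra INPUT call is the cost of using a format here, which is why Program 22-8 uses an informat for the price lookup instead.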
22.7 Using a SAS Data Set to Create a Format
The example in the last section used only a small number of item codes. If you have a large number of values, writing a format statement by hand is tedious. Luckily, SAS allows you to use a specially constructed SAS data set to create a SAS format. The PROC FORMAT option CNTLIN= (control in) allows you to name the SAS data set (or data view) that you want to use.
Control input data sets require a minimum of three variables, named FMTNAME, START, and LABEL. FMTNAME is a character variable that specifies the format or informat name; START is either a single value to be formatted or, if END is also used, the beginning of a range. You should include a $ as the first character of FMTNAME if you are creating a character format. Optionally, you can include a variable called TYPE that is either a C for character formats or an N for numeric formats. If you have data values and format labels already stored in a SAS data set, you can use that data set as input to a DATA step (with variables renamed as needed) to create your control data set. This sounds complicated, but an example should make it clear. You have a permanent SAS data set (Codes) that contains ICD-9 code values (International Classification of Diseases, Version 9) and descriptions. You want to create a format that provides labels for each of the ICD-9 codes. So that you can try this yourself, the program here creates a permanent SAS data set called Codes.
Program 22-9 Creating a test data set that will be used to make a CNTLIN data set
data learn.codes;
   input ICD9 : $5. Description & $21.;
datalines;
020 Plague
022 Anthrax
390 Rheumatic fever
410 Myocardial infarction
493 Asthma
540 Appendicitis
;
You want to use this data set to create a control input data set, as follows:
Program 22-10 Creating a CNTLIN data set from an existing SAS data set
data control;
   set learn.codes(rename=
                   (ICD9 = Start
                    Description = Label));
   retain Fmtname '$ICDFMT'
          Type 'C';
run;

title "Demonstrating an Input Control Data Set";
proc format cntlin=control fmtlib;
run;
The RENAME= data set option renames ICD9 to Start and Description to Label. A RETAIN statement sets FMTNAME to $ICDFMT and TYPE equal to C. Using a RETAIN statement is more efficient than an assignment statement, since these values are set at compile time; an assignment statement executes for each iteration of the DATA step. The CNTLIN= option names this data set and the FMTLIB option generates a table showing the ranges and format labels. Here are both a listing of the Control data set and the output from PROC FORMAT:

Listing of CONTROL
Start  Label                  Fmtname  Type
020    Plague                 $ICDFMT  C
022    Anthrax                $ICDFMT  C
390    Rheumatic fever        $ICDFMT  C
410    Myocardial infarction  $ICDFMT  C
493    Asthma                 $ICDFMT  C
540    Appendicitis           $ICDFMT  C
Demonstrating an Input Control Data Set

FORMAT NAME: $ICDFMT  LENGTH: 21  NUMBER OF VALUES: 6
MIN LENGTH: 1  MAX LENGTH: 40  DEFAULT LENGTH 21  FUZZ: 0

START     | END       | LABEL (VER. V7|V8 21NOV2005:10:14:05)
----------+-----------+---------------------------------------
020       | 020       | Plague
022       | 022       | Anthrax
390       | 390       | Rheumatic fever
410       | 410       | Myocardial infarction
493       | 493       | Asthma
540       | 540       | Appendicitis
Instead of creating a SAS data set to be used with CNTLIN, you could elect to create a data view instead. A data view can be thought of as a virtual data set. Depending on your application, using a data view may be more efficient than using a SAS data set. The only change to Program 22-10 would be to replace the DATA statement with the following:
data control / view = control;
Program 22-11 shows how you might use this format:
Program 22-11 Using the CNTLIN= created data set
data disease;
   input ICD9 : $5. @@;
datalines;
020 410 500 493
;

title "Listing of DISEASE";
proc print data=disease noobs;
   format ICD9 $ICDFMT.;
run;
Here is the output:

Listing of DISEASE
ICD9
Plague
Myocardial infarction
500
Asthma
Notice that there is no format for the ICD-9 code of 500. If you were writing your own PROC FORMAT statements, you could use the keyword OTHER to provide a label for non-matching codes. If you want to accomplish this using a CNTLIN data set, you need a way to assign a value to OTHER. CNTLIN data sets use the variable HLO (stands for High, Low, Other) to indicate that you want to use one of these keywords and not an explicit range value. If you want to assign the label Not Found for all ICD-9 codes not in the list, you proceed as follows:
Program 22-12 Adding an OTHER category to your format
data control / view=control;
   set learn.codes(rename=
                   (ICD9 = Start
                    Description = Label))
       end = last;
   retain Fmtname '$ICDFMT'
          Type 'C';
   if last then do;
      Start = ' ';
      Hlo = 'o';
      Label = 'Not Found';
   end;
run;
Program 22-12 also demonstrates how to use a data view instead of a SAS data set. You want to add one observation at the bottom of your Control data set with the value of HLO equal to o (OTHER) and LABEL equal to Not Found. You can do this by using the END= data set option to test for the end of the input data set. The variable Last is true when the last observation in data set Codes has been read. HLO and LABEL are both assigned a value and a single observation is added to the end of the data set. Here is a listing of data set Control:
Listing of CONTROL
Start  Label                  Fmtname  Type  Hlo
020    Plague                 $ICDFMT  C
022    Anthrax                $ICDFMT  C
390    Rheumatic fever        $ICDFMT  C
410    Myocardial infarction  $ICDFMT  C
493    Asthma                 $ICDFMT  C
       Not Found              $ICDFMT  C     o
The last observation in this data set contains the instructions for setting OTHER equal to Not Found. Although we decided to set Start to a missing value in this program, it was not actually necessary because the keyword OTHER, not an actual range, is being used. When this new format is used in Program 22-11, the ICD-9 code of 500 is now labeled as Not Found. Here are both the output from PROC FORMAT showing that the OTHER category was added to the format and a listing of the Disease data set:

FORMAT NAME: $ICDFMT  LENGTH: 21  NUMBER OF VALUES: 6
MIN LENGTH: 1  MAX LENGTH: 40  DEFAULT LENGTH 21  FUZZ: 0

START     | END       | LABEL (VER. V7|V8 21NOV2005:10:43:16)
----------+-----------+---------------------------------------
020       | 020       | Plague
022       | 022       | Anthrax
390       | 390       | Rheumatic fever
410       | 410       | Myocardial infarction
493       | 493       | Asthma
**OTHER** | **OTHER** | Not Found

Listing of DISEASE
ICD9
Plague
Myocardial infarction
Not Found
Asthma
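Note that the listings above contain no entry for code 540 (Appendicitis), even though it is the last observation in data set Codes. In Program 22-12, the assignments in the DO group execute before the implicit output at the bottom of the DATA step, so they change the last observation instead of adding a new one. A sketch of one way to keep all six codes, using explicit OUTPUT statements in the same style as Program 22-13, might look like this (the LENGTH statement for Hlo is an addition made for clarity and is not part of the original program):

```sas
/* Sketch only: a variation on Program 22-12 that keeps code 540. */
data control / view=control;
   set learn.codes(rename=
                   (ICD9 = Start
                    Description = Label))
       end = last;
   retain Fmtname '$ICDFMT'
          Type 'C';
   length Hlo $ 1;
   output;                    /* write each code read from Codes */
   if last then do;
      Start = ' ';
      Hlo = 'o';
      Label = 'Not Found';
      output;                 /* write one extra observation for OTHER */
   end;
run;
```

With this variation, the format created from the Control view should contain seven values: the six ICD-9 codes plus the OTHER category.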
22.8 Updating and Maintaining Your Formats
Suppose you want to add some ICD-9 codes to your existing $ICDFMT format. The easiest way to accomplish this is to first use the CNTLOUT= option of PROC FORMAT to create a data set containing all the formatting information. You can then use this data set in a DATA step to add new codes or modify existing ones. As an example, suppose you want to add two new ICD-9 codes to your $ICDFMT format, 427.5 (cardiac arrest) and 466 (bronchitis). The following program does the trick:
Program 22-13 Updating an existing format using a CNTLOUT= data set option
proc format cntlout=control_out;
   select $ICDFMT;
run;

data new_control;
   length Label $ 21;
   set control_out end=Last;
   output;
   if Last then do;
      Hlo = ' ';
      Start = '427.5';
      End = Start;
      Label = 'Cardiac Arrest';
      output;
      Start = '466';
      End = Start;
      Label = 'Bronchitis';
      output;
   end;
run;

proc format cntlin=new_control;
   select $ICDFMT;
run;
You first run PROC FORMAT with the CNTLOUT= option, creating an output data set containing all of the information necessary to re-create the format. A SELECT statement allows you to choose which format you want to use to create this data set. Here is a listing of this Control_Out data set:
Listing of CONTROL_OUT

FMTNAME  START      END        LABEL                  MIN  MAX  DEFAULT  LENGTH  FUZZ
ICDFMT   020        020        Plague                   1   40       21      21     0
ICDFMT   022        022        Anthrax                  1   40       21      21     0
ICDFMT   390        390        Rheumatic fever          1   40       21      21     0
ICDFMT   410        410        Myocardial infarction    1   40       21      21     0
ICDFMT   493        493        Asthma                   1   40       21      21     0
ICDFMT   **OTHER**  **OTHER**  Not Found                1   40       21      21     0

PREFIX  MULT  FILL  NOEDIT  TYPE  SEXCL  EEXCL  HLO  DECSEP  DIG3SEP  DATATYPE  LANGUAGE
           0             0  C     N      N
           0             0  C     N      N
           0             0  C     N      N
           0             0  C     N      N
           0             0  C     N      N
           0             0  C     N      N      O
Notice that there are additional variables that PROC FORMAT uses when creating a format. You do not need to concern yourself with these variables—SAS uses them as needed. The next step is to add observations to the end of this data set that contain the new codes and their labels. You can do this in a DATA step, by first reading all of the existing format information and then adding the new ones. (Alternatively, you could create a data set of the new codes and use PROC APPEND to add these observations to the end of the existing Control data set.) A LENGTH statement ensures that the storage length for Label is sufficient for any new labels you want to create. Program 22-13 uses the data set option END= to determine when you have read the last observation from data set Control_Out. At this point, the statements in the DO group execute, outputting two observations to the end of the new data set. It is important to set the variable HLO (High, Low, Other) to missing and to set the variable End equal to the Start values. Because these variables exist in the Control_Out data set and are read with a SET statement, they are automatically retained. Therefore, if you do not assign values to these variables, they retain the values they had in the last observation in data set Control_Out.
Finally, you use the CNTLIN= option to re-create the $ICDFMT format. Instead of the FMTLIB option, a SELECT statement is used to identify which format you want to list. When you use a SELECT statement with PROC FORMAT, it is not necessary to include the FMTLIB option as well. Here is the output from PROC FORMAT, showing that the format now contains the two additional codes:

Listing of CONTROL_OUT

FORMAT NAME: $ICDFMT  LENGTH: 21  NUMBER OF VALUES: 8
MIN LENGTH: 1  MAX LENGTH: 40  DEFAULT LENGTH 21  FUZZ: 0

START     | END       | LABEL (VER. V7|V8 22NOV2005:10:23:49)
----------+-----------+---------------------------------------
020       | 020       | Plague
022       | 022       | Anthrax
390       | 390       | Rheumatic fever
410       | 410       | Myocardial infarction
427.5     | 427.5     | Cardiac Arrest
466       | 466       | Bronchitis
493       | 493       | Asthma
**OTHER** | **OTHER** | Not Found
22.9 Using Formats within Formats
When you define format or informat labels, you can also include the name of a SAS or user-written format or informat, rather than a text string, in place of a label. Here is an example. You want to display dates from July 15, 2005, to December 31, 2006, using the MMDDYY10. format. Dates before July 15, 2005, should be formatted as Not Open and dates after December 31, 2006, should be formatted as Too Late. You can use nested formats as follows to accomplish this task:
Program 22-14 Demonstrating nested formats
proc format;
   value registration
            low - <'15Jul2005'd       = 'Not Open'
            '15Jul2005'd - '31Dec2006'd = [mmddyy10.]
            '01Jan2007'd - high       = 'Too Late';
run;
The format MMDDYY10. is placed in square brackets where you normally place a format label. Program 22-15 uses this format to process registration dates and produce a list of dates and names.
Program 22-15 Using the nested format in a DATA step
data conference;
   input @1  Name $10.
         @11 Date mmddyy10.;
   format Date registration.;
datalines;
Smith     10/21/2005
Jones     06/13/2005
Harris    01/03/2007
Arnold    09/12/2005
;
A listing of data set Conference is shown here:

Listing of CONFERENCE
Name    Date
Smith   10/21/2005
Jones   Not Open
Harris  Too Late
Arnold  09/12/2005
The next example uses nested user-written informats. In this example, you are given data on benzene exposure. Some of the values are actual benzene levels in parts per million, while other values are the year a worker was exposed. For each of the years from 1946 to 1952, benzene levels were tabulated. For years outside this range, the actual benzene levels were reported. Some sample data, consisting of either a year or a benzene level, is shown next:
File c:\books\learning\benzene.txt
001 90
002 1950
003 217
004 1952
005 177
All of the benzene levels are much less than any of the year values, so there is no confusion whether to read a value as a year or a benzene level. Creating a data set of IDs and benzene levels is greatly simplified by nesting informats, like this:

proc format;
   invalue yearexp
            1946 = 250
            1947 = 244
            1948 = 240
            1949 = 200
            1950 = 188
            1951 = 150
            1952 = 100;
   invalue exp
            low - <1946  = [7.1]
            1946 - 1952  = [yearexp.]
            1952< - high = [7.1];
run;

data benzene;
   infile 'c:\books\learning\benzene.txt';
   input ID Exposure : exp4.;
run;
The first informat, YEAREXP, substitutes a benzene level for each of the years from 1946 to 1952. The informat EXP uses a SAS 7.1 informat for values less than 1946 or greater than 1952. Otherwise, the YEAREXP informat is used. This problem could have also been solved using formats instead of informats. To use formats, you would need to read a raw data value and use a PUT function to obtain a formatted value. Because the result of a PUT function is a character value, you would also need to use an INPUT function to obtain a benzene level as a numeric value. Creating informats makes the process much simpler.
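The format-based alternative described above can be sketched as follows. This is only an illustration, not the book's program: the format name YEAREXPF, the variable Raw, and the data set name benzene_fmt are hypothetical, and values outside 1946 to 1952 are simply kept as read by list input rather than passed through the 7.1 informat.

```sas
proc format;
   /* Labels are character strings, so each level is quoted */
   value yearexpf 1946 = '250'
                  1947 = '244'
                  1948 = '240'
                  1949 = '200'
                  1950 = '188'
                  1951 = '150'
                  1952 = '100';
run;

data benzene_fmt;
   infile 'c:\books\learning\benzene.txt';
   input ID Raw;
   if 1946 le Raw le 1952 then
      /* PUT returns a character label; INPUT converts it back to a number */
      Exposure = input(put(Raw, yearexpf.), 8.);
   else Exposure = Raw;
   drop Raw;
run;
```

The extra PUT and INPUT calls, plus the explicit range test, are the overhead that the nested informat removes.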
22.10 Multilabel Formats
Under normal circumstances, you get an error message if any of your format ranges overlap. However, you can create a format with overlapping ranges if you use the MULTILABEL option in the VALUE statement. Certain procedures that use multilabels can then use the MULTILABEL format (MLF) to produce tables showing all of the format ranges. Here is an example. You want to see the variable Age (from the Survey data set) broken down two ways: one, in 20-year intervals, and the other, by a split at 50 years old. You first create a MULTILABEL format like this:
Program 22-16 Creating a MULTILABEL format
proc format;
   value agegroup (multilabel)
      0 - <20   = '0 to <20'
      20 - <40  = '20 to <40'
      40 - <60  = '40 to <60'
      60 - <80  = '60 to <80'
      80 - high = '80 +'
      0 - <50   = 'Less than 50'
      50 - high = '> or = to 50';
run;
Without the MULTILABEL option, PROC FORMAT issues an error message about overlapping ranges and fails to create the format. You can use this format in PROC MEANS, PROC SUMMARY, and PROC TABULATE. Here is a PROC MEANS example:
Program 22-17 Using a MULTILABEL format with PROC MEANS
title "Demonstrating a Multilabel Format";
title2 "PROC MEANS Example";
proc means data=learn.survey;
   class Age / MLF;
   var Salary;
   format Age agegroup.;
run;
It is important to use the MLF option in the CLASS statement if you want to use this feature of the format. Here is the output:
Demonstrating a Multilabel Format
PROC MEANS Example
The MEANS Procedure
Analysis Variable : Salary
                  N
Age             Obs    N         Mean       Std Dev       Minimum       Maximum
--------------------------------------------------------------------------------
20 to <40         3    3     29186.67       6798.13      23060.00      36500.00
40 to <60         2    2     76111.50    16.2634560      76100.00      76123.00
60 to <80         2    2    109000.00      26870.06      90000.00     128000.00
> or = to 50      3    3     98041.00      26857.01      76123.00     128000.00
Less than 50      4    4     40915.00      24104.46      23060.00      76100.00
--------------------------------------------------------------------------------
Inspection of this output shows the average Salary broken down by Age in two ways: in 20-year intervals, and by a split at 50 years old. You can use MULTILABEL formats with PROC TABULATE to create interesting tables. Here is a program to compute frequencies in a table of Age by Gender:
title "Demonstrating a Multilabel Format";
title2 "PROC TABULATE Example";
proc tabulate data=learn.survey;
   class Age Gender / MLF;
   table Age , Gender;
   format Age agegroup.;
run;
quit;
This produces the following output:
Demonstrating a Multilabel Format
PROC TABULATE Example
+------------------+-------------------------+
|                  |         Gender          |
|                  +------------+------------+
|                  |     F      |     M      |
|                  +------------+------------+
|                  |     N      |     N      |
+------------------+------------+------------+
|Age               |            |            |
+------------------+            |            |
|20 to <40         |           .|        3.00|
+------------------+------------+------------+
|40 to <60         |        2.00|           .|
+------------------+------------+------------+
|60 to <80         |        1.00|        1.00|
+------------------+------------+------------+
|> or = to 50      |        2.00|        1.00|
+------------------+------------+------------+
|Less than 50      |        1.00|        3.00|
+------------------+------------+------------+
Notice that certain age groups are not shown in the table. For example, there were no subjects in the 0 to <20 age group. To see all of the possible format ranges in your table, you can use the PRELOADFMT option in the CLASS statement. Preloaded formats allow PROC TABULATE to list categories for which there are no data values. Here is the PROC TABULATE code with the PRELOADFMT option and two additional options to control the printing of missing values. (They are explained following the program.)
Program 22-18 Using the PRELOADFMT, PRINTMISS, and MISSTEXT options with PROC TABULATE
title "Demonstrating a Multilabel Format";
title2 "PROC TABULATE Example";
proc tabulate data=learn.survey;
   class Age Gender / MLF preloadfmt;
   table Age , Gender / printmiss misstext=' ';
   format Age agegroup.;
run;
quit;
The PRELOADFMT option forces the procedure to list all of the format ranges. This option has no effect without the PRINTMISS option in the TABLE statement. PRINTMISS tells the procedure to print every category of a CLASS variable each time headings for that variable are printed, even when a category has no data. Finally, the MISSTEXT= option specifies the text to print in table cells that contain missing values (the default is a period). Here you choose a blank to represent missing values. Here is the output:
Demonstrating a Multilabel Format
PROC TABULATE Example
+------------------+-------------------------+
|                  |         Gender          |
|                  +------------+------------+
|                  |     F      |     M      |
|                  +------------+------------+
|                  |     N      |     N      |
+------------------+------------+------------+
|Age               |            |            |
+------------------+            |            |
|0 to <20          |            |            |
+------------------+------------+------------+
|20 to <40         |            |        3.00|
+------------------+------------+------------+
|40 to <60         |        2.00|            |
+------------------+------------+------------+
|60 to <80         |        1.00|        1.00|
+------------------+------------+------------+
|80 +              |            |            |
+------------------+------------+------------+
|> or = to 50      |        2.00|        1.00|
+------------------+------------+------------+
|Less than 50      |        1.00|        3.00|
+------------------+------------+------------+
Now all the Age ranges appear in the table and the missing values print as blanks.
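Section 22.10 mentions that PROC SUMMARY also accepts multilabel formats, though the chapter shows only PROC MEANS and PROC TABULATE. A minimal sketch of the PROC SUMMARY form follows. It assumes the AGEGROUP format from Program 22-16 has been created; the output data set name age_summary and the statistic names N_Salary and Mean_Salary are hypothetical.

```sas
proc summary data=learn.survey nway;
   class Age / mlf;            /* MLF is still required in the CLASS statement */
   var Salary;
   format Age agegroup.;
   /* Write one observation per format label to a data set */
   output out=age_summary n=N_Salary mean=Mean_Salary;
run;

proc print data=age_summary noobs;
run;
```

Writing the summary to a data set is convenient when the overlapping groupings feed a later DATA step or report rather than printed output.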
22.11 Using the INPUTN Function to Perform a More Complicated Table Lookup
This final example of the chapter performs a two-way table lookup using a function that allows you to compute an informat name in a DATA step. This section is a bit complicated, so you may want to either skip it or refer to SAS OnlineDoc. (See SAS OnlineDoc at http://support.sas.com/documentation/onlinedoc/index.html for more information on the INPUTN function and control input data sets.)
The problem: you have a table of years (1944 to 1949) and job codes (A through E). Each combination of year and job code has a benzene exposure (in parts per million) associated with it. The following table lists these values:
              Job Code
Year     A     B     C     D     E
1944   220   180   210   110    90
1945   202   170   208   100    85
1946   150   110   150    60    50
1947   105    56    88    40    30
1948    60    30    40    20    10
1949    45    22    22    10     8
One way to compute an exposure, given a year and job code, is to create an informat for each year, assigning each of the job codes a label containing the exposure value. An informat is used instead of a format because you can use an INPUT (or, as you will see, an INPUTN) function with the user-defined informats to obtain a numeric value directly. If you did this “by hand,” you would proceed like this:
Program 22-19 Partial program showing how to create several informats
proc format;
   invalue exp1944fmt (upcase)
      'A' = 220
      'B' = 180
      'C' = 210
      'D' = 110
      'E' = 90
      other = .;
   invalue exp1945fmt (upcase)
      'A' = 202
      'B' = 170
      'C' = 208
      'D' = 100
      'E' = 85
      other = .;
   invalue exp1946fmt (upcase)
      'A' = 150
   . . .
run;
It takes far less work to create an input control data set (CNTLIN) to create the required informats. You proceed as follows:
Program 22-20 Creating several informats with a single CNTLIN data set
data exposure;
   retain Type 'I' Hlo 'U';
   do Year = 1944 to 1949;
      Fmtname = cats('exp',Year,'fmt');
      do Start = 'A','B','C','D','E';
         End = Start;
         input Label : $3. @;
         output;
      end;
   end;
   drop Year;
datalines;
220 180 210 110 90
202 170 208 100 85
150 110 150 60 50
105 56 88 40 30
60 30 40 20 10
45 22 22 10 8
;
title "Creating the Exposure Format";
proc format cntlin=exposure fmtlib;
run;
Type=I tells PROC FORMAT that you want to create a numeric informat. The value of U for the HLO variable tells the procedure to use the UPCASE option in the INVALUE statement. You want informat names of the following form:
expyyyyfmt
Here, yyyy is a year value from 1944 to 1949. The CATS function concatenates (joins) each of the arguments after first stripping off any leading or trailing blanks. This function also allows numeric arguments and performs the numeric-to-character conversion as well. Here are the first few observations in the Exposure data set:
First 10 Observations of EXPOSURE
Type  Hlo  Fmtname     Start  End  Label
I     U    exp1944fmt  A      A    220
I     U    exp1944fmt  B      B    180
I     U    exp1944fmt  C      C    210
I     U    exp1944fmt  D      D    110
I     U    exp1944fmt  E      E    90
I     U    exp1945fmt  A      A    202
I     U    exp1945fmt  B      B    170
I     U    exp1945fmt  C      C    208
I     U    exp1945fmt  D      D    100
I     U    exp1945fmt  E      E    85
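The CATS call that builds the informat names in Program 22-20 can be tried on its own. The following is a small sketch, not part of the book's program; it writes each year and the name built from it to the SAS log, and the names should follow the expyyyyfmt pattern seen in the listing above.

```sas
data _null_;
   length Fmtname $ 32;
   do Year = 1944 to 1946;
      /* CATS strips blanks and converts the numeric Year to character */
      Fmtname = cats('exp', Year, 'fmt');
      put Year= Fmtname=;
   end;
run;
```

Ending the name with the letters fmt matters because a format or informat name cannot end in a digit.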
By the way, if you want to see selected informats using the SELECT statement of PROC FORMAT, you need to include an @ sign as the first character in the informat name. For example, to list the contents of EXP1944FMT and EXP1945FMT, you would use Program 22-21:
Program 22-21 Using a SELECT statement to display the contents of two informats
proc format;
   select @exp1944fmt @exp1945fmt;
run;
When you use a SELECT statement with PROC FORMAT, it is not necessary to include the FMTLIB option as well. Returning to the table lookup: now that you have an informat for each of the six years, all you have to do is use the year and job code to create an informat name and use that name with the INPUTN function to retrieve the exposure. Here is the program:
Program 22-22 Using user-defined informats to perform a table lookup using the INPUTN function
data read_exp;
   input Worker $ Year JobCode $;
   Exposure = inputn(JobCode,cats('exp',Year,'fmt8.'));
datalines;
001 1944 B
002 1948 E
003 1947 C
005 1945 A
006 1948 d
;
You cannot use the INPUT function in this program because the informat name must be a constant. The INPUTN function allows you to create the informat name in the DATA step. By the way, the ‘N’ in the name INPUTN stands for numeric. You need to tell SAS ahead of time if the informat you are going to create is going to be character or numeric (it can’t tell if the name starts with a $ until it is created). There are two corresponding functions, PUTN and PUTC, that allow you to create numeric or character format names in the DATA step. When this program runs, the informat names are created with the CATS function. Each year value creates a different informat name to be used in the INPUTN function. Here is a listing of the resulting data set:
Listing of READ_EXP
Worker  Year  JobCode  Exposure
001     1944  B        180
002     1948  E        10
003     1947  C        88
005     1945  A        202
006     1948  d        20
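PUTN, mentioned above as the format counterpart of INPUTN, works the same way in the other direction: the format name is a character value computed at run time. Here is a minimal sketch under assumed names (Date, Fmt, and Formatted are hypothetical variables); it applies two standard SAS date formats to the same value and writes the results to the log.

```sas
data _null_;
   length Fmt $ 12;
   Date = '21Oct2005'd;
   do Fmt = 'mmddyy10.', 'date9.';
      /* The second argument is a character expression naming a numeric format */
      Formatted = putn(Date, Fmt);
      put Fmt= Formatted=;
   end;
run;
```

As with INPUTN, the format name could equally be built with CATS from values in the data.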
The last few sections of this chapter may seem, at first glance, to be a “bit off the deep end.” However, if you try experimenting and writing a few programs of your own using these ideas, you will see how powerful these tools can be.
22.12 Problems
Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.
1. You are given a SAS data set (BloodPressure) with variables SBP and DBP (systolic blood pressure and diastolic blood pressure). You want to see frequencies for SBP and DBP grouped as follows: SBP pressures below 140 are to be called Normal and pressures of 140 and above are to be called High SBP. For DBP, pressures below 90 are to be called Normal and pressures of 90 and above are to be called High DBP. Create an appropriate format and use PROC FREQ to produce these frequencies. Do not use a DATA step. Use the appropriate options to omit the cumulative frequencies and percentages from the PROC FREQ output.
2. Given the SAS data set Grades, produce frequencies of Grade (a numeric variable) grouped as follows:
   Less than 65 = F
   65 to less than 75 = C
   75 to less than 85 = B
   85 to 100 = A
   Create a format and use PROC FREQ to list these frequencies. Do not use a DATA step.
3. Using the same definitions for high and low SBP and DBP from Problem 1, create a new, temporary SAS data set (BloodPressure) with two new variables (SBPGroup and DBPGroup).
4. Using the same definitions for letter grades from Problem 2, create a new, temporary SAS data set (Grades) with a new variable (LetterGrade) created by using a PUT function with a user-defined format.
5. Using the same definitions for letter grades from Problem 2, create a new, temporary SAS data set (Grades) with a new variable LetterGrade (with values of A, B, C, and F). Do this by creating an informat to read the raw scores from the text file NumGrades.txt.
6. You want to read the following line of data:
   A 6.7 X b c a 10.9 11.6 C
   The numbers represent blood lead levels and the letters A, B, and C represent default values from three labs (A=4.5, B=5.5, and C=3.5). An X represents a missing value. Use an enhanced numeric informat to read this line of data with the letters (notice that the letters are in mixed case) converted to the appropriate values.
7. Use the SAS data set DxCodes to create a control data set to be used to create a character format (DXCODES). This data set contains variables Dx (character diagnosis codes) and Description. The Description variable contains the labels you want to use for each of the Dx codes. Use a SELECT statement to display this format.
8. Repeat Problem 7, except add the necessary statements in your DATA step to assign a label of Not Found to any value that does not match a format range (equivalent to OTHER=Not Found).
9. Starting with the permanent SAS data set Gym, create a listing using PROC PRINT that labels all dates from January 1, 1990 to December 31, 2004 as Too Early and dates after January 1, 2007 as Too Late. Leave all other dates unchanged (and listed using the MMDDYY10. format). Hint: Create a format that uses the embedded format MMDDYY10. for dates in the appropriate range and labels the early and late dates as defined.
Chapter 23
Restructuring SAS Data Sets
23.1 Introduction
23.2 Converting a Data Set with One Observation per Subject to a Data Set with Several Observations per Subject: Using a DATA Step
23.3 Converting a Data Set with Several Observations per Subject to a Data Set with One Observation per Subject: Using a DATA Step
23.4 Converting a Data Set with One Observation per Subject to a Data Set with Several Observations per Subject: Using PROC TRANSPOSE
23.5 Converting a Data Set with Several Observations per Subject to a Data Set with One Observation per Subject: Using PROC TRANSPOSE
23.6 Problems
23.1 Introduction
The term restructuring, also called transposing, means to take a data set with one observation per subject and convert it to a data set with many observations per subject, or vice versa. Why would you want or need to do this? There are some operations that are most easily performed when all the information per subject (or other unit of analysis) is contained in a single observation. Other operations are more convenient when there are several observations per subject. Several of the statistical procedures require data to be stored one way or the other, depending on what type of analysis is required. This chapter demonstrates how to restructure data sets using DATA step approaches and PROC TRANSPOSE.
23.2 Converting a Data Set with One Observation per Subject to a Data Set with Several Observations per Subject: Using a DATA Step
This first example uses a small data set (Oneper) where each subject has from one to three diagnosis codes (Dx1–Dx3). Here is a listing of this data set:
Listing of ONEPER
Obs  Subj  Dx1  Dx2  Dx3
1    001   450  430  410
2    002   250  240  .
3    003   410  250  500
4    004   240  .    .
Note: See Ron Cody, Longitudinal Data and SAS: A Programmer's Guide (Cary, NC: SAS Institute Inc., 2001), for more information on restructuring data sets.
Notice that some subjects have three diagnosis codes, some two, and one (Subject 004) only one. How do you obtain a frequency distribution for diagnosis codes? For example, Subject 001 had code 410 as the third diagnosis and Subject 003 has this same code listed as Dx1. It would be easier to compute frequencies on the diagnosis codes if the data set were structured like this:
Listing of MANYPER
Obs  Subj  Visit  Diagnosis
1    001   1      450
2    001   2      430
3    001   3      410
4    002   1      250
5    002   2      240
6    003   1      410
7    003   2      250
8    003   3      500
9    004   1      240
Data set Manyper has from one to three observations per subject. Here is a program to make this conversion:
Program 23-1 Creating a data set with several observations per subject from a data set with one observation per subject
data learn.manyper;
   set learn.oneper;
   array Dx{3};
   do Visit = 1 to 3;
      if missing(Dx{Visit}) then leave;
      Diagnosis = Dx{Visit};
      output;
   end;
   keep Subj Diagnosis Visit;
run;
Although you don’t have to use arrays to solve this problem, using an array does make the program more compact. The DX array has three elements: Dx1, Dx2, and Dx3 (remember that if you leave off the variable list, the variable names default to the array name with the digits 1 to n appended to the end).
Let’s take the time to describe in detail how this program works (feel free to skip this section if this program seems intuitively clear to you). The DO loop starts with Visit set equal to 1. The MISSING function tests whether the array element Dx{1} is missing. For Subject 001, none of the diagnosis codes are missing, so the IF statement is never true for this subject. A new variable, Diagnosis, is set equal to the value of the array element Dx{1}, which is the same as the variable Dx1, which is equal to 450. At this point, the program data vector (PDV) contains the following:
Subj  Dx1  Dx2  Dx3  Visit  Diagnosis
001   450  430  410  1      450
Because of the KEEP statement, only the variables Subj, Visit, and Diagnosis are written out to data set Manyper at the bottom of the DO loop. During the next iteration of the DO loop, Visit is equal to 2, Diagnosis is equal to 430, and these values are written out to the second observation in data set Manyper. Finally, Visit is set to 3, Diagnosis is set to 410, and the third observation is written to the output data set. The DATA step has now reached the bottom. Because the end of the file on the Oneper data set has not been reached, a new observation from data set Oneper is brought into the PDV and the process continues. When the DO loop reaches the third visit for Subject 002, Dx{3}, which is equal to Dx3, is a missing value. The MISSING function therefore returns a value of true and the LEAVE statement executes. A LEAVE statement branches to the first statement following the end of the DO loop. This prevents observations from being written to the new data set with Diagnosis equal to a missing value. (It is also assumed that if there are any missing Dx codes, they come after the non-missing codes.)
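With the data in the one-observation-per-visit structure, the frequency distribution that motivated this section takes a single procedure step. The following is a minimal sketch, assuming Program 23-1 has already created learn.manyper; the title text is made up for the example.

```sas
title "Frequencies of Diagnosis Codes";
proc freq data=learn.manyper;
   /* NOCUM omits the cumulative counts and percentages */
   tables Diagnosis / nocum;
run;
```

Each code is counted no matter which of Dx1, Dx2, or Dx3 originally held it, which is the point of the restructuring.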
23.3 Converting a Data Set with Several Observations per Subject to a Data Set with One Observation per Subject: Using a DATA Step
What if you want to go the other way, creating a data set with one observation per subject from one with several observations per subject? This process is a bit more complicated, and you need to take a few special precautions.
As an example, suppose you started with data set Manyper and wanted to create a data set that looked like Oneper. Here is one way to do it:
Program 23-2 Creating a data set with one observation per subject from a data set with several observations per subject
proc sort data=learn.manyper out=manyper;
   by Subj Visit;
run;
data oneper;
   set manyper;
   by Subj Visit;
   array Dx{3};
   retain Dx1-Dx3;
   if first.Subj then call missing(of Dx1-Dx3);
   Dx{Visit} = Diagnosis;
   if last.Subj then output;
   keep Subj Dx1-Dx3;
run;
You first sort the input data set by Subj and Visit. (The original data set was already in this order, but the sort was included to make the program more general.) Next, you set up an array to hold the three Dx values and retain these three variables. You need to retain these three variables because they do not come from a SAS data set and are, by default, set equal to a missing value for each iteration of the DATA step. The RETAIN statement prevents this from happening. Next, when you start processing the first visit for each subject, you set the three values of Dx to missing. If you don’t do this, a subject with fewer than three visits may wind up with a diagnosis from the previous subject. The CALL MISSING routine can set any number of numeric and/or character values to missing at one time. As with many of the SAS functions and CALL routines, if you use a variable list in the form Var1–Varn, you need to precede the variable list with the keyword OF. Next, you assign the value of Diagnosis to the appropriate Dx variable (Dx1 if Visit=1, Dx2 if Visit=2, and Dx3 if Visit=3). Finally, if you are processing the last visit for a subject, you output a single observation keeping the variables Subj and Dx1–Dx3. A listing of data set Oneper is identical to the one shown earlier in this chapter.
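One way to check the claim that the rebuilt data set matches the original is PROC COMPARE. This is a sketch rather than part of the book's program; it assumes learn.oneper still holds the original data and that Program 23-2 has created the temporary data set oneper.

```sas
proc compare base=learn.oneper compare=work.oneper;
   /* With no ID statement, observations are matched by position */
run;
```

The comparison report lists any differences in variables, observations, or values between the two data sets.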
23.4 Converting a Data Set with One Observation per Subject to a Data Set with Several Observations per Subject: Using PROC TRANSPOSE
PROC TRANSPOSE can also be used to restructure SAS data sets. Sometimes, PROC TRANSPOSE provides a quick and simple solution; at other times, the PROC TRANSPOSE solution can be quite complicated. In general, a DATA step solution gives you more control over the restructuring process. As you will see in this first PROC TRANSPOSE example, you need to add some data set options (see Program 23-4) to achieve the same results as the original DATA step solution. The following program attempts to solve the same problem as in Section 23.2. You start out with a program that restructures the Oneper data set, but it does not completely solve the problem. Here is the code:
Program 23-3 Using PROC TRANSPOSE to convert a data set with one observation per subject into a data set with several observations per subject (first attempt)
***Note: data set already in Subject order;
proc transpose data=learn.oneper out=manyper;
   by Subj;
   var Dx1-Dx3;
run;
PROC TRANSPOSE takes an input data set and outputs a data set where the original rows become columns and the original columns become rows. This program includes a BY Subj statement that performs the operation separately for each value of Subj. The result is the data set listed next.
Listing of MANYPER
Obs  Subj  _NAME_  COL1
1    001   Dx1     450
2    001   Dx2     430
3    001   Dx3     410
4    002   Dx1     250
5    002   Dx2     240
6    002   Dx3     .
7    003   Dx1     410
8    003   Dx2     250
9    003   Dx3     500
10   004   Dx1     240
11   004   Dx2     .
12   004   Dx3     .
This is almost what you want. All that is needed is to rename the variable COL1 to Diagnosis, eliminate the _NAME_ variable, and remove the observations with missing Dx values. All these goals are accomplished by using some data set options on the output data set, as follows:
Program 23-4 Using PROC TRANSPOSE to convert a data set with one observation per subject into a data set with several observations per subject
proc transpose data=learn.oneper
               out=t_manyper(rename=(col1=Diagnosis)
                             drop=_name_
                             where=(Diagnosis is not null));
   by Subj;
   var Dx1-Dx3;
run;
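The RENAME= option renames COL1 to Diagnosis, while the DROP= option eliminates the _NAME_ variable from the data set. Finally, a WHERE= data set option removes observations where Diagnosis is missing. The result contains the same Subj and Diagnosis values as the output from Program 23-1, but it has no Visit variable because the _NAME_ variable that identified each Dx column was dropped.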
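The output of Program 23-4 carries no Visit variable, because the column name stored in _NAME_ is dropped. If you also want Visit, one possibility is to keep _NAME_ through the transpose and finish the job in a short DATA step. The sketch below is an assumption-laden illustration, not the book's method: the data set names t_raw and t_manyper2 are hypothetical, and it relies on the variable names being Dx1, Dx2, and Dx3 so that the visit number is the third character of _NAME_.

```sas
proc transpose data=learn.oneper out=t_raw;
   by Subj;
   var Dx1-Dx3;
run;

data t_manyper2;
   set t_raw(rename=(col1=Diagnosis));
   /* Keep only the non-missing diagnosis codes */
   if not missing(Diagnosis);
   /* Recover the visit number from the digit in Dx1, Dx2, Dx3 */
   Visit = input(substr(_name_, 3), 1.);
   drop _name_;
run;
```

This trades the compactness of the data set options for the extra control a DATA step gives you.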
23.5 Converting a Data Set with Several Observations per Subject to a Data Set with One Observation per Subject: Using PROC TRANSPOSE
The last example in this chapter shows how to convert a data set with several observations per subject into a data set with one observation per subject using PROC TRANSPOSE.
Program 23-5 Using PROC TRANSPOSE to convert a SAS data set with several observations per subject into one with one observation per subject
proc transpose data=learn.manyper prefix=Dx
               out=oneper(drop=_NAME_);
   by Subj;
   id Visit;
   var Diagnosis;
run;
The PREFIX= procedure option and an ID statement create an output data set identical to the one produced by Program 23-2.
Listing of Data Set ONEPER
Subj  Dx1  Dx2  Dx3
001   450  430  410
002   250  240  .
003   410  250  500
004   240  .    .
The PREFIX= option combines the prefix value (Dx) with the values of Visit (1, 2, and 3) to create the three variables Dx1, Dx2, and Dx3. PROC TRANSPOSE knew to use the values of Visit to create these variable names because it was identified as an ID variable in the procedure.
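Program 23-5 depends on learn.manyper already being in Subj order, because the BY statement requires sorted (or indexed) input. A more general sketch adds a sort first, in the spirit of Program 23-2. The data set names manyper_sorted and oneper_t are hypothetical.

```sas
proc sort data=learn.manyper out=manyper_sorted;
   by Subj Visit;
run;

proc transpose data=manyper_sorted prefix=Dx
               out=oneper_t(drop=_name_);
   by Subj;
   id Visit;        /* values 1, 2, 3 become Dx1, Dx2, Dx3 */
   var Diagnosis;
run;
```

Sorting by Visit within Subj also makes it more likely that the new variables are created in the order Dx1, Dx2, Dx3.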
23.6 Problems
Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.
1. A listing of data set Wide, containing the variables Subj, X1–X5, and Y1–Y5, is shown here:
Listing of WIDE
Obs  Subj  X1  X2  X3  X4  X5  Y1  Y2  Y3  Y4  Y5
1    001    8   5   6   5   4  10  20  30  40  50
2    002    7   5   6   4   5  11  33  29  34  56
3    003    2   2   4   5   6  22  38  21  20  34
Using a DATA step, create a temporary SAS data set (Long) using data set Wide as input. This data set should contain Subj, Time, X, and Y, with five observations per subject. The first 10 observations from data set Long should look like this:
Listing of LONG
Obs  Subj  Time  X   Y
1    001   1     8  10
2    001   2     5  20
3    001   3     6  30
4    001   4     5  40
5    001   5     4  50
6    002   1     7  11
7    002   2     5  33
8    002   3     6  29
9    002   4     4  34
10   002   5     5  56
2. Using the SAS data set Narrow (shown here), create a new, temporary SAS data set (Stretch) where the five scores for each subject are contained in a single observation, with the variable names S1–S5. S1 is the Score at Time 1, S2 is the Score at Time 2, etc. Do this using a DATA step.
Listing of NARROW (first 10 observations)
Obs  Subj  Time  Score
1    001   1     7
2    001   2     6
3    001   3     5
4    001   4     5
5    001   5     4
6    002   1     8
7    002   2     7
8    002   3     6
9    002   4     6
10   002   5     6
Data set Stretch should look like this:
Listing of STRETCH
Obs  Subj  S1  S2  S3  S4  S5
1    001    7   6   5   5   4
2    002    8   7   6   6   6
3    003    8   7   6   6   5
3. Repeat Problem 1 using PROC TRANSPOSE. Do this only for the variables X1–X5. Your resulting data set should look like this:
Listing of LONG
Obs  Subj  X
1    001   8
2    001   5
3    001   6
4    001   5
5    001   4
6    002   7
7    002   5
8    002   6
9    002   4
10   002   5
11   003   2
12   003   2
13   003   4
14   003   5
15   003   6
4. Repeat Problem 2 using PROC TRANSPOSE.
Chapter 24
Working with Multiple Observations per Subject
24.1 Introduction
24.2 Identifying the First or Last Observation in a Group
24.3 Counting the Number of Visits Using PROC FREQ
24.4 Counting the Number of Visits Using PROC MEANS
24.5 Computing Differences between Observations
24.6 Computing Differences between the First and Last Observation in a BY Group Using the LAG Function
24.7 Computing Differences between the First and Last Observation in a BY Group Using a RETAIN Statement
24.8 Using a Retained Variable to “Remember” a Previous Value
24.9 Problems
24.1 Introduction
You will encounter many data sets with multiple observations per subject (or other groupings such as transactions on a single day or observations grouped in other ways). These data structures are sometimes referred to as longitudinal data. Because SAS processes one observation at a time, you need special techniques to perform calculations across observations. For example, if you have a data set representing patient visits to a clinic, you might want to compare patient values from one visit to the next or from the first visit to the last visit. You may just want to retrieve the data from the last visit for each patient. The good news is that SAS has all the tools you need to accomplish these tasks. This chapter shows you how to use these tools.
24.2 Identifying the First or Last Observation in a Group
For most of the examples in this chapter, you will use a data set (Clinic) consisting of medical data on patients who visit a clinic. Here is a listing of this data set:
Listing of CLINIC
ID   VisitDate   Dx              HR  SBP  DBP
101  10/21/2005  GI Problems     68  120   80
101  02/25/2006  Cold            68  122   84
255  09/01/2005  Routine Visit   76  188  100
255  12/18/2005  Routine Visit   74  180   95
255  02/01/2006  Heart Problems  79  210  110
255  04/01/2006  Heart Problems  72  180   88
303  10/10/2006  Routine Visit   72  138   84
409  09/01/2005  Injury          88  142   92
409  10/02/2005  Routine Visit   72  136   90
409  12/15/2006  Routine Visit   68  130   84
712  04/06/2006  Infection       58  118   70
712  04/15/2006  Infection       56  118   72
Note: See Ron Cody, Longitudinal Data and SAS: A Programmer's Guide (Cary, NC: SAS Institute Inc., 2001), for a more in-depth discussion of programming techniques for longitudinal data.
Variables in this data set are patient ID (ID), visit date (VisitDate), diagnosis (Dx), heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP). Notice that some patients have several visits, but one patient, Number 303, has a single visit. One of the important tasks when dealing with data sets like this is to identify when you are processing the first or the last observation for each patient (or the variable you want to use to identify a group). Program 24-1 shows how to do this:
Program 24-1 Creating FIRST. and LAST. variables
proc sort data=learn.clinic out=clinic;
   by ID VisitDate;
run;
data last;
   set clinic;
   by ID;
   put ID= VisitDate= First.ID= Last.ID=;
   if last.ID;
run;
The key to this program is the BY statement following the SET statement. Note: The data set must be sorted by the same variables. This BY statement creates two temporary SAS variables, First.ID and Last.ID. The PUT statement in Program 24-1 was added so you can see the value of these two temporary variables. Here is the section of the SAS log showing these data values. Note: Blank lines were added to separate the visits for each patient.
ID=101 VisitDate=10/21/2005 FIRST.ID=1 LAST.ID=0
ID=101 VisitDate=02/25/2006 FIRST.ID=0 LAST.ID=1

ID=255 VisitDate=09/01/2005 FIRST.ID=1 LAST.ID=0
ID=255 VisitDate=12/18/2005 FIRST.ID=0 LAST.ID=0
ID=255 VisitDate=02/01/2006 FIRST.ID=0 LAST.ID=0
ID=255 VisitDate=04/01/2006 FIRST.ID=0 LAST.ID=1

ID=303 VisitDate=10/10/2006 FIRST.ID=1 LAST.ID=1

ID=409 VisitDate=09/01/2005 FIRST.ID=1 LAST.ID=0
ID=409 VisitDate=10/02/2005 FIRST.ID=0 LAST.ID=0
ID=409 VisitDate=12/15/2006 FIRST.ID=0 LAST.ID=1

ID=712 VisitDate=04/06/2006 FIRST.ID=1 LAST.ID=0
ID=712 VisitDate=04/15/2006 FIRST.ID=0 LAST.ID=1
It should be pretty clear what these two variables represent: First.ID is true (1) for the first visit for each patient and false (0) otherwise. Last.ID is true for the last visit for each patient and false otherwise. For Patient Number 303, both First.ID and Last.ID are true (it is both the first and the last visit for this patient). To create a data set consisting of the last visit for each patient, the program uses a subsetting IF statement to check when the value of Last.ID is equal to 1. Here is a listing of data set Last:
Listing of LAST
ID   VisitDate   Dx              HR  SBP  DBP
101  02/25/2006  Cold            68  122   84
255  04/01/2006  Heart Problems  72  180   88
303  10/10/2006  Routine Visit   72  138   84
409  12/15/2006  Routine Visit   68  130   84
712  04/15/2006  Infection       56  118   72
A common data processing requirement is to perform operations such as setting counters equal to 0 when you are processing the first observation in a BY group and outputting counts when you are processing the last observation in a BY group. Program 24-2 uses a DATA step to count the number of visits for each patient. (We will demonstrate later how to do this using either PROC FREQ or PROC MEANS.)
Program 24-2 Counting the number of visits per patient using a DATA step

data count;
   set clinic;
   by ID;
   *Initialize counter at first visit;
   if first.ID then N_visits = 0;
   *Increment the visit counter;
   N_visits + 1;
   *Output an observation at the last visit;
   if last.ID then output;
run;
Each time you encounter a new ID, you set the variable N_visits equal to 0. You then increment this variable by one for each iteration of the DATA step, using a sum statement. Finally, an observation is written out to data set Count when you are processing the last visit for each patient. Here is a listing of data set Count:

Listing of COUNT

ID    VisitDate     Dx                HR    SBP    DBP    N_visits
101   02/25/2006    Cold              68    122     84       2
255   04/01/2006    Heart Problems    72    180     88       4
303   10/10/2006    Routine Visit     72    138     84       1
409   12/15/2006    Routine Visit     68    130     84       3
712   04/15/2006    Infection         56    118     72       2
Notice that the values for VisitDate, Dx, HR, SBP, and DBP are values from the last visit for each patient.
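If the Learn library is not at hand, the counting logic of Program 24-2 can be tried on a small stand-in data set. The following is only a sketch: the data set names Visits and Count_check and the use of DATALINES are assumptions made for illustration, and only ID and VisitDate (copied from the log shown earlier) are included.

```sas
/* Hypothetical stand-in for the sorted Clinic data set (ID and VisitDate only) */
data visits;
   input ID VisitDate : mmddyy10.;
   format VisitDate mmddyy10.;
datalines;
101 10/21/2005
101 02/25/2006
255 09/01/2005
255 12/18/2005
255 02/01/2006
255 04/01/2006
303 10/10/2006
409 09/01/2005
409 10/02/2005
409 12/15/2006
712 04/06/2006
712 04/15/2006
;

/* Same counting logic as Program 24-2, applied to the stand-in data */
data count_check;
   set visits;
   by ID;
   if first.ID then N_visits = 0;
   N_visits + 1;
   if last.ID then output;
run;

proc print data=count_check noobs;
run;
```

Because the stand-in data are entered in ID order, no PROC SORT is needed here. The N_visits values should agree with those in the listing of data set Count.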
24.3 Counting the Number of Visits Using PROC FREQ

Instead of using a DATA step, you can use either PROC FREQ or PROC MEANS to compute the number of visits per patient. Here is the PROC FREQ solution:
Program 24-3 Using PROC FREQ to count the number of observations in a BY group

proc freq data=learn.clinic noprint;
   tables ID / out=counts;
run;
The NOPRINT option is included because you want PROC FREQ to create a data set and you do not need the listing output. In this program, you name the data set Counts. PROC FREQ uses the variable names Count and Percent for the variables representing the frequencies and percentages for each value of the variable listed in the TABLES statement. This is clear if you look at the listing of data set Counts created by Program 24-3.
Listing of COUNTS

ID     COUNT    PERCENT
101      2      16.6667
255      4      33.3333
303      1       8.3333
409      3      25.0000
712      2      16.6667
You can use the RENAME= and DROP= or KEEP= data set options to control what variables are in this data set and what they are called. For example, the next program renames Count to N_Visits and drops the Percent variable. In addition, a MERGE statement adds the variable N_Visits to each observation in the original Clinic data set. Here is the program:
Program 24-4 Using the RENAME= and DROP= data set options to control the output data set

proc freq data=clinic noprint;
   tables ID / out=counts (rename=(count = N_Visits)
                           drop=percent);
run;

data clinic;
   merge learn.clinic counts;
   by ID;
run;
The RENAME= data set option renames the variable Count (the name chosen by SAS) to N_Visits. The DROP= data set option tells SAS that you don’t want the variable Percent in the new data set. You don’t need to sort data set Counts because it will be in ID order. The default ordering for PROC FREQ is to order values by their internal (unformatted) values. Data set Clinic is already in ID order. Here is the result:
Listing of CLINIC

ID    VisitDate     Dx                HR    SBP    DBP    N_Visits
101   10/21/2005    GI Problems       68    120     80       2
101   02/25/2006    Cold              68    122     84       2
255   09/01/2005    Routine Visit     76    188    100       4
255   12/18/2005    Routine Visit     74    180     95       4
255   02/01/2006    Heart Problems    79    210    110       4
255   04/01/2006    Heart Problems    72    180     88       4
303   10/10/2006    Routine Visit     72    138     84       1
409   09/01/2005    Injury            88    142     92       3
409   10/02/2005    Routine Visit     72    136     90       3
409   12/15/2006    Routine Visit     68    130     84       3
712   04/06/2006    Infection         58    118     70       2
712   04/15/2006    Infection         56    118     72       2
You now have the same visit counts produced by Program 24-2, this time attached to every observation in the Clinic data set. You may prefer the DATA step approach or using PROC FREQ to do the counting for you.
24.4 Counting the Number of Visits Using PROC MEANS

You can accomplish the same result by using PROC MEANS instead of PROC FREQ to do the counting. Here is such a program:
Program 24-5 Using PROC MEANS to count the number of observations in a BY group

proc means data=learn.clinic nway noprint;
   class ID;
   output out=counts(rename=(_freq_ = N_Visits)
                     drop= _type_);
run;
There are several important things to notice about this program. First, you add the NWAY and NOPRINT procedure options. NWAY tells SAS that you only want the statistics computed for each level of the CLASS variable and you do not want the grand mean (refer to Chapter 16 for more details on this). The NOPRINT option eliminates the output listing from PROC MEANS.

Next, you specify ID as a CLASS variable. This computes statistics for each value of ID. You then supply an OUTPUT statement, telling SAS that you want to create a summary data set called Counts. Notice that there is no VAR statement and no output statistics were requested. This is OK. The key here is to use the variable _FREQ_ that is added to any output data set created by PROC MEANS. It represents the number of observations (missing or non-missing) for each value of the CLASS variable. Therefore, in this program, _FREQ_ represents the number of observations for each patient.

The RENAME= data set option renames _FREQ_ to N_Visits. Because you don’t need the variable _TYPE_ (part of the output data set created by PROC MEANS), it is dropped with the DROP= data set option. The resulting data set is identical to the one produced by PROC FREQ in Program 24-4. This may seem more complicated than just using PROC FREQ, but you may be more used to creating output data sets using PROC MEANS than PROC FREQ. Use whichever method you feel most comfortable with.
24.5 Computing Differences between Observations

Suppose you want to see changes in heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP) from visit to visit. The LAG function provides an easy way to compare values from the present observation to ones in a previous observation. Here is a program to output changes from one visit to the next:
Program 24-6 Computing differences between observations

data difference;
   set clinic;
   by ID;
   *Delete patients with only one visit;
   if first.ID and last.ID then delete;
   Diff_HR = HR - lag(HR);
   Diff_SBP = SBP - lag(SBP);
   Diff_DBP = DBP - lag(DBP);
   if not first.ID then output;
run;
There are a few important points to notice about this program. First, you delete patients with only one visit (when both First.ID and Last.ID are true). Next, you use the LAG function to obtain values of HR, SBP, and DBP from the previous visit. Let’s take a moment and follow the DATA step logic.

The first patient (ID=101) has more than one visit, so the first IF statement is not true. The values of the three Diff variables are all missing during the first iteration of the DATA step. First.ID is true, so NOT First.ID is false, and an observation is not written out to the data set. During the second iteration of the DATA step, the three Diff variables represent the difference of the current values from the values from the last visit. The last IF statement is true, and an observation is written out to data set Difference.

The next observation is the first visit for Patient Number 255. Again, this patient has more than one visit, so the first IF statement is not true. The three Diff variables represent the difference between the three values for Patient Number 255 and the last visit for Patient Number 101! This seems wrong, but hold on, this will work out OK since you are not going to output an observation this time (the last IF statement is not true). You can think of this iteration of the DATA step as “priming the pump” so that the LAG function returns the correct values on the next iteration of the DATA step. You might be tempted to execute the three DIFF statements conditionally. Do not do this. You need to execute the LAG function for every iteration of the DATA step in this program so you get the correct difference values.

The next observation from the Clinic data set is the second visit for Patient Number 255. Now the three DIFF statements correctly compute the difference between the current values for Patient Number 255 and the values from the previous visit. Because the last IF statement is now true, this observation is written out to the Difference data set.
You may think this explanation is too long and tedious—and you may be right. However, it is really important that you understand the logic of this program if you are to use the LAG function correctly. By the way, for you compulsive programmers (not me!), you can use the DIF function to compute the three differences directly. Refer to Chapter 11 for details on this function.
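As a sketch of the DIF alternative just mentioned, Program 24-6 could be rewritten as shown next. DIF(x) returns x minus LAG(x), so the same caution applies: the function has to execute on every iteration that reaches it. The output data set name Difference_dif is an assumption chosen here to avoid overwriting Difference, and the sorted Clinic data set from Program 24-1 is assumed to exist.

```sas
/* Sketch: Program 24-6 using DIF, where DIF(x) = x - LAG(x) */
data difference_dif;
   set clinic;
   by ID;
   *Delete patients with only one visit;
   if first.ID and last.ID then delete;
   *Execute DIF unconditionally, just as with LAG;
   Diff_HR  = dif(HR);
   Diff_SBP = dif(SBP);
   Diff_DBP = dif(DBP);
   if not first.ID then output;
run;
```

The three Diff variables should match those produced by Program 24-6, since each DIF call replaces one subtraction that used LAG.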
Here is a listing of data set Difference:

Listing of DIFFERENCE

ID    VisitDate     Dx                HR    SBP    DBP    Diff_HR    Diff_SBP    Diff_DBP
101   02/25/2006    Cold              68    122     84        0          2           4
255   12/18/2005    Routine Visit     74    180     95       -2         -8          -5
255   02/01/2006    Heart Problems    79    210    110        5         30          15
255   04/01/2006    Heart Problems    72    180     88       -7        -30         -22
409   10/02/2005    Routine Visit     72    136     90      -16         -6          -2
409   12/15/2006    Routine Visit     68    130     84       -4         -6          -6
712   04/15/2006    Infection         56    118     72       -2          0           2
The values of HR, SBP, and DBP are those for the current visit. The three Diff values represent these values minus the values from the previous visit.
24.6 Computing Differences between the First and Last Observation in a BY Group Using the LAG Function

This section demonstrates how to compute the difference between the first and last visit for each patient. Remember how you were cautioned not to execute the LAG function conditionally? The program that follows does just that, and it works:
Program 24-7 Computing differences between the first and last observation in a BY group

data first_last;
   set clinic;
   by ID;
   *Delete patients with only one visit;
   if first.ID and last.ID then delete;
   if first.ID or last.ID then do;
      Diff_HR = HR - lag(HR);
      Diff_SBP = SBP - lag(SBP);
      Diff_DBP = DBP - lag(DBP);
   end;
   if last.ID then output;
run;
The first time the LAG function executes is when the first visit for each patient is being processed. The next time the LAG function executes is during the last visit. Therefore, the differences are the values in the last visit minus the values from the first visit (the last time the LAG function executed). Notice also that an observation is only written out during the last visit for each patient. Here is a listing of this data set:

Listing of FIRST_LAST

ID    VisitDate     Dx                HR    SBP    DBP    Diff_HR    Diff_SBP    Diff_DBP
101   02/25/2006    Cold              68    122     84        0          2           4
255   04/01/2006    Heart Problems    72    180     88       -4         -8         -12
409   12/15/2006    Routine Visit     68    130     84      -20        -12          -8
712   04/15/2006    Infection         56    118     72       -2          0           2
24.7 Computing Differences between the First and Last Observation in a BY Group Using a RETAIN Statement

Because SAS has so many useful tools, there are often several ways to solve the same problem. This section describes how to accomplish the same task as the previous section, using retained variables instead of the LAG function. Using retained variables is one of the best ways to “remember” values from previous observations. Remember that variables that do not come from SAS data sets are, by default, set to a missing value at the start of each iteration of the DATA step. A RETAIN statement allows you to tell SAS not to do this. Armed with this knowledge, take a look at the following program:
Program 24-8 Demonstrating the use of retained variables

data first_last;
   set clinic;
   by ID;
   *Delete patients with only one visit;
   if first.ID and last.ID then delete;
   retain First_HR First_SBP First_DBP;
   if first.ID then do;
      First_HR = HR;
      First_SBP = SBP;
      First_DBP = DBP;
   end;
   if last.ID then do;
      Diff_HR = HR - First_HR;
      Diff_SBP = SBP - First_SBP;
      Diff_DBP = DBP - First_DBP;
      output;
   end;
   drop First_: ;
run;
You start out the same as Program 24-7. Next, you use a RETAIN statement so that the three variables, First_HR, First_SBP, and First_DBP, are not set back to missing when the DATA step iterates. (The RETAIN statement operates at compile time and is not an executable statement.) Next, when you are processing the first visit for each patient, you set the three retained variables equal to the values of HR, SBP, and DBP, respectively. These values remain in the program data vector (PDV) and are not replaced until the first visit for a new patient is processed. When you reach the last visit for each patient, you compute the three difference values and output them. One final comment on this program involves the DROP statement. The value First_: refers to all variables that begin with the characters FIRST_. The resulting data set is identical to the one created in Program 24-7.
24.8 Using a Retained Variable to “Remember” a Previous Value

This section describes another problem where using a retained variable greatly simplifies the code. For each patient in your Clinic data set, you want to check if the patient had a systolic blood pressure (SBP) over 140 during any of the visits. Here is the program:
Program 24-9 Using a retained variable to “remember” a previous value

data hypertension;
   set learn.clinic;
   by ID;
   retain HighBP;
   if first.ID then HighBP = 'No ';
   if SBP gt 140 then HighBP = 'Yes';
   if last.ID then output;
run;
This program uses the retained variable HighBP to “remember” if a patient ever had a systolic blood pressure over 140. As each new patient is processed, HighBP is set to No. Then, if any SBP is greater than 140, HighBP is set to Yes. Because this value is retained, it remains equal to Yes even if the SBP is less than 140 on all subsequent visits. When SAS reaches the last visit for each patient, an observation is written to the Hypertension data set. Here is a listing of Hypertension:

Listing of HYPERTENSION

ID    VisitDate     Dx                HR    SBP    DBP    HighBP
101   02/25/2006    Cold              68    122     84    No
255   04/01/2006    Heart Problems    72    180     88    Yes
303   10/10/2006    Routine Visit     72    138     84    No
409   12/15/2006    Routine Visit     68    130     84    Yes
712   04/15/2006    Infection         56    118     72    No
Even though SAS operates one observation at a time, you can use the techniques in this chapter to perform computations between observations in a SAS data set.
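The same retained-variable pattern can carry a numeric summary from visit to visit. The sketch below is not from the chapter's programs: the data set name Max_sbp and the variable Max_SBP are hypothetical, and the sorted Clinic data set from Program 24-1 is assumed. It keeps the highest systolic blood pressure seen so far for each patient.

```sas
/* Sketch: a retained variable that tracks the largest SBP per patient */
data max_sbp;
   set clinic;
   by ID;
   retain Max_SBP;
   *Reset the running maximum at the first visit;
   if first.ID then Max_SBP = .;
   *MAX ignores missing values, so the first SBP replaces the missing value;
   Max_SBP = max(Max_SBP, SBP);
   if last.ID then output;
   keep ID Max_SBP;
run;
```

Resetting at First.ID plays the same role as setting HighBP to No in Program 24-9: without it, a value retained from the previous patient would leak into the next BY group.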
24.9 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Data set DailyPrices contains a stock symbol and from 1 to 5 observations showing the price for each date. Here is a listing of this data set:

Listing of DAILYPRICES

Symbol    Date          Price
CSCO      01/01/2007    19.75
CSCO      01/02/2007    20.00
CSCO      01/03/2007    20.50
CSCO      01/04/2007    21.00
IBM       01/01/2007    76.00
IBM       01/02/2007    78.00
IBM       01/03/2007    75.00
IBM       01/04/2007    79.00
IBM       01/05/2007    81.00
LU        01/01/2007     2.55
LU        01/02/2007     2.53
AVID      01/01/2007    41.25
BAC       01/01/2007    51.00
BAC       01/02/2007    51.00
BAC       01/03/2007    51.20
BAC       01/04/2007    49.90
BAC       01/05/2007    52.10
Create a listing showing the stock symbol, the date, and the price from the most recent date for each stock.
2. Create a temporary SAS data set containing the average price for each stock and the number of values used to create this average. In addition, this data set should contain the minimum and maximum price for each stock. A listing of this data set should look like this:

Listing of SUMMARY

Symbol    Price_Mean    Price_N    Price_Min    Price_Max
AVID        41.2500        1         41.25        41.25
BAC         51.0400        5         49.90        52.10
CSCO        20.3125        4         19.75        21.00
IBM         77.8000        5         75.00        81.00
LU           2.5400        2          2.53         2.55
3. Create a temporary SAS data set with two variables, N_Days and Symbol, by writing a DATA step and using the SAS data set DailyPrices as input. N_Days represents the number of observations for each stock.

4. Repeat Problem 3 using PROC FREQ to count the number of observations for each unique stock symbol.

5. Using the SAS data set DailyPrices, compute the difference between the price on the last day minus the price on the first day. Omit any stocks that have only one observation. Do this using a DATA step and retained variables.

6. Repeat Problem 5 using the LAG or DIF function.

7. Using the SAS data set DailyPrices, compute day-to-day differences for all stocks that have more than one observation.
Chapter 25

Introducing the SAS Macro Language

25.1 Introduction
25.2 Macro Variables: What Are They?
25.3 Some Built-In Macro Variables
25.4 Assigning Values to Macro Variables with a %LET Statement
25.5 Demonstrating a Simple Macro
25.6 A Word about Tokens
25.7 Another Example of Using a Macro Variable as a Prefix
25.8 Using a Macro Variable to Transfer a Value between DATA Steps
25.9 Problems
25.1 Introduction

Although the SAS macro language is usually thought of as an advanced topic, there are aspects of this language that are useful and easy to use, even for the beginning or intermediate SAS programmer. This chapter gives you the tools to use macro variables and write simple macros. Macros are particularly useful if you want to make your SAS programs more flexible and allow them to be used in different situations without having to rewrite entire programs. Macro variables also provide a useful method for passing information from one DATA step to another. You can even select which procedures within a large program to run, depending on the day of the week or the values in your data.

If you want to learn more about the SAS macro language, I highly recommend two books. The first book, by Michele Burlew, might be more appropriate to beginning programmers.1 The second book, by Art Carpenter, covers more advanced topics.2
25.2 Macro Variables: What Are They?

You may have seen programs with funny-looking variable names such as &VAR or &SYSDATE in them. Perhaps you have seen statements such as %LET or other statements containing percent signs. When you submit a SAS program for processing, before SAS starts to compile and then execute your program, it first checks for the existence of ampersands (&) or percent signs (%) in the program and calls in the macro processor. In a sense, the macro processor works like the find-and-replace feature of Microsoft Office Word: it looks for all the macro variables (names starting with ampersands) and replaces them with text values. There are several ways to assign values to macro variables and you will see several described in this chapter.
1 See Michele M. Burlew, SAS Macro Programming Made Easy, Second Edition (Cary, NC: SAS Institute Inc., 2006).
2 See Art Carpenter, Carpenter’s Complete Guide to the SAS Macro Programming Language, Second Edition (Cary, NC: SAS Institute Inc., 2004).
25.3 Some Built-In Macro Variables

There are several built-in macro variables available to you as soon as you begin a SAS session. Two useful ones are &SYSDATE9 and &SYSTIME. As you may guess, the former stores the date you started your SAS session (in DATE9. format) and the latter stores the time the session started. As an example, you could use these automatic macro variables to include a date and time in a TITLE statement, like this:
Program 25-1 Using an automatic macro variable to include a date and time in a title

title "The Date is &sysdate9 - the Time is &systime";
proc print data=learn.test_scores noobs;
run;
If you started your SAS session at 10:20 on August 19, 2006, the macro processor would first resolve the macro variables (substitute text for each macro variable) before running the program. The listing would look like this:

The Date is 19AUG2006 - the Time is 10:20

ID    Name          Score1    Score2    Score3
1     Milton          90        95        98
2     Washington      78        77        75
3     Smith           88        91        92
This is a good time to tell you that SAS resolves a macro variable inside a quoted string only when the string is enclosed in double quotation marks. If you used single quotation marks in Program 25-1, the title would read:

The Date is &sysdate9 - the Time is &systime
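The difference between the two kinds of quotation marks is easy to see side by side. The following sketch reuses the Test_scores data set from Program 25-1; the title text itself is made up for illustration.

```sas
/* Double quotation marks: the macro variable is resolved */
title1 "Session started on &sysdate9";
/* Single quotation marks: the text is printed exactly as typed */
title2 'Session started on &sysdate9';
proc print data=learn.test_scores noobs;
run;
```

The first title line should show the session date, while the second should show the literal characters &sysdate9.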
25.4 Assigning Values to Macro Variables with a %LET Statement

You can assign a value to a macro variable with a %LET statement. You usually place %LET statements in open code, that is, not inside a DATA or PROC step. Here is a simple example:
Program 25-2 Assigning a value to a macro variable with a %LET statement

%let var_list = RBC WBC Chol;

title "Using a Macro Variable List";
proc means data=learn.blood n mean min max maxdec=1;
   var &var_list;
run;
Every place in your program where you use the macro variable &VAR_LIST, SAS substitutes the text RBC WBC Chol. This is a simple but useful trick when you need to repeat lists of variables. Here is another example of using a %LET statement. Suppose you want to generate a data set containing random integers. You want the program to be flexible so that you can choose how many random numbers to generate each time you run the program. Here is the program:
Program 25-3 Another example of using a %LET statement

%let n = 3;

data generate;
   do Subj = 1 to &n;
      x = int(100*ranuni(0) + 1);
      output;
   end;
run;

title "Data Set with &n Random Numbers";
proc print data=generate noobs;
run;
Notice that the %LET statement comes before the DATA step. When this program runs, each occurrence of &n is replaced with a 3. While it would be easy to simply use the SAS Program Editor to change a 3 to some other value each time you run this program, imagine that this value was used many times in a longer program. By simply editing the single %LET statement, you can change the value of n everywhere in the program.
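When you change a %LET statement, it can help to confirm what value the macro variable holds before the rest of the program runs. One way to do this, sketched here as a suggestion rather than as part of Program 25-3, is the %PUT statement, which writes text to the SAS log.

```sas
%let n = 3;
/* Write the current value of one macro variable to the SAS log */
%put The value of n is &n;
/* List every user-defined macro variable and its value */
%put _user_;
```

Like %LET, the %PUT statement can be placed in open code, so it does not need to be inside a DATA or PROC step.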
25.5 Demonstrating a Simple Macro

So far, you have seen macro variables used in a program. Now it’s time to see an actual SAS macro. Macros can be entire SAS programs or just pieces of SAS code. You start a macro with a %MACRO statement and end it with a %MEND (macro end) statement. Here is an example.

You want to make Program 25-3 more general. Not only do you want to change the number of random numbers you generate, you want to change the starting and ending values of these numbers as well. The macro that follows does just that:
Program 25-4 Writing a simple macro

%macro gen(n,Start,End);
   data generate;
      do Subj = 1 to &n;
         x = int((&End - &Start + 1)*ranuni(0) + &Start);
         output;
      end;
   run;

   proc print data=generate noobs;
      title "Randomly Generated Data Set with &n Obs";
      title2 "Values are integers from &Start to &End";
   run;
%mend gen;
The macro name is GEN and there are three positional arguments: n, Start, and End. To generate four random integers from 1 to 100, you would call the macro like this:

%gen(4,1,100)
As you can see, the macro calling statement consists of a percent sign followed by the macro name. In this calling statement, you specify the values that you want to substitute for n, Start, and End in parentheses following the macro name. If you submit this macro with the option MPRINT turned on, you see the actual SAS code that was generated. Here is a listing of the SAS log (with the MPRINT option in effect):
MPRINT(GEN):   data generate;
MPRINT(GEN):   do Subj = 1 to 4;
MPRINT(GEN):   x = int((100 - 1 + 1)*ranuni(0) + 1);
MPRINT(GEN):   output;
MPRINT(GEN):   end;
MPRINT(GEN):   run;

NOTE: The data set WORK.GENERATE has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

MPRINT(GEN):   proc print data=generate noobs;
MPRINT(GEN):   title "Randomly Generated Data Set with 4 Obs";
MPRINT(GEN):   title2 "Values are integers from 1 to 100";
MPRINT(GEN):   run;
Notice that each macro variable was replaced by the value specified in the calling sequence. Here is the output:

Randomly Generated Data Set with 4 Obs
Values are integers from 1 to 100

Subj     x
  1     62
  2     57
  3     13
  4     56
When you call the macro, there is no semicolon at the end of the statement. If you look at the SAS code that was generated (in the previous SAS log), you can see that each SAS statement that was generated ends in a semicolon. Remember that SAS macros may generate pieces of SAS code, sometimes even parts of other SAS statements. Therefore, following the macro call with a semicolon could result in an error.
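To reproduce a log like the one shown earlier, the MPRINT system option has to be turned on before the macro is called. The lines below are a minimal sketch that assumes the GEN macro from Program 25-4 has already been submitted in the current session.

```sas
options mprint;      /* show the SAS code generated by macros in the log */
%gen(4,1,100)
options nomprint;    /* turn the option off again */
```

Note that the macro call itself is written without a semicolon, while the two OPTIONS statements are ordinary SAS statements and end with one.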
Because it may not always be obvious what the parameters in a macro represent, it is an excellent idea to document them when you write the macro. Here is Program 25-4 with some documentation included:
Program 25-5 Program 25-4 with documentation added

%macro gen(n,     /* number of random numbers */
           Start, /* Starting value */
           End    /* Ending value */);
   /********************************************
   Example: To generate 4 random numbers from
            1 to 100 use: %gen(4,1,100)
   *********************************************/
   data generate;
      do Subj = 1 to &n;
         x = int((&End - &Start + 1)*ranuni(0) + &Start);
         output;
      end;
   run;

   proc print data=generate noobs;
      title "Randomly Generated Data Set with &n Obs";
      title2 "Values are integers from &Start to &End";
   run;
%mend gen;
25.6 A Word about Tokens

The SAS macro processor must figure out where your macro variables start and end. In the previous macro, each time a macro variable was used, it was followed by a character that was not valid in a variable name. Suppose you submitted the following SAS code:
Program 25-6 Demonstrating a problem with resolving a macro variable

%let prefix = abc;

data &prefix123;
   x = 3;
run;
You are hoping to generate a data set called Abc123. However, take a look at the SAS log:

192  %let prefix = abc;
193
WARNING: Apparent symbolic reference PREFIX123 not resolved.
194  data &prefix123;
          22
          --------
          202
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string, /, ;, _DATA_, _LAST_, _NULL_.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
195     x = 3;
196  run;
This message tells you that the macro processor cannot resolve (find) a macro variable with the name PREFIX123. SAS assumes that the macro variable name is PREFIX123 since this is a valid macro variable name. To tell SAS that the macro variable name is &PREFIX, you use a period to indicate where the macro variable name ends, like this:
Program 25-7 Program 25-6 corrected

%let prefix = abc;

data &prefix.123;
   x = 3;
run;
The period tells the macro processor that &PREFIX is the macro variable name, not &PREFIX123. Notice that the SAS log no longer shows an error:

197  %let prefix = abc;
198
199  data &prefix.123;
200     x = 3;
201  run;

NOTE: The data set WORK.ABC123 has 1 observations and 1 variables.
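A quick way to experiment with where a macro variable name ends is to write the resolved text to the log with %PUT instead of creating data sets. The three references below are illustrative; only the first mirrors Program 25-7.

```sas
%let prefix = abc;
/* The period ends the macro variable name and is then removed */
%put &prefix.123;
/* An underscore is valid in a name, so a period is needed here as well */
%put &prefix._new;
/* A blank cannot be part of a name, so no period is needed */
%put &prefix 123;
```

These statements should write abc123, abc_new, and abc 123 to the log. Without the period, the first two references would be read as &PREFIX123 and &PREFIX_NEW and would fail to resolve.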
25.7 Another Example of Using a Macro Variable as a Prefix

Suppose you want to use a macro variable to specify a libref (the first part of a two-part data set name). You might be tempted to write a program like this:
Program 25-8 Using a macro variable as a prefix (incorrect version)

%let libref = learn;

proc print data=&libref.survey;
   title "Listing of SURVEY";
run;
Here’s what happens when you try to run this program:

182  %let libref = learn;
183
184  proc print data=&libref.survey;
ERROR: File WORK.LEARNSURVEY.DATA does not exist.
185     title "Listing of SURVEY";
186  run;
What happened? The period between &libref and the name Survey was eaten up by the macro processor. It told the macro processor that the name of the macro variable was &LIBREF. With the period removed, SAS was looking for a data set called Work.LearnSurvey. The solution (as strange as it looks) is to use two periods: one to tell the macro processor where the macro variable name ends and the other to separate the libref from the data set name. The corrected program is as follows:
Program 25-9 Using a macro variable as a prefix (corrected version)

%let libref = learn;

proc print data=&libref..survey;
   title "Listing of SURVEY";
run;
Here is the SAS log after running this program:

209  %let libref = learn;
210
211  proc print data=&libref..survey;
212     title "Listing of SURVEY";
213  run;

NOTE: There were 7 observations read from the data set LEARN.SURVEY.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.04 seconds
      cpu time            0.01 seconds
You can see that SAS resolved the macro variable and the result was a data set called Learn.Survey.
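The two-period rule also applies when both parts of the data set name come from macro variables. In the sketch below, the second macro variable, dsn, is a hypothetical addition that does not appear in Program 25-9.

```sas
%let libref = learn;
%let dsn = survey;     /* hypothetical second macro variable */
/* First period ends &libref; second period separates libref and data set name */
%put &libref..&dsn;
proc print data=&libref..&dsn;
   title "Listing of &libref..&dsn";
run;
```

The %PUT statement should write learn.survey to the log, which is a convenient check before running the procedure itself.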
25.8 Using a Macro Variable to Transfer a Value between DATA Steps

Macro variables defined outside of a macro are, by default, global. That is, they maintain their value during a SAS session. This makes them a very useful tool for transferring values between DATA steps. As an example, suppose you ran PROC MEANS to compute the means of the variables RBC and WBC from the Blood data set and wanted to compare each individual value of RBC and WBC against the mean values. One way to do this is to include a conditional SET statement in your DATA step, like this:

if _n_ = 1 then set means;

Here, Means is the one-observation data set produced by PROC MEANS. Another way, which we describe next, uses macro variables to hold the mean values. Here is the program:
Program 25-10 Using macro variables to transfer values from one DATA step to another

proc means data=learn.blood noprint;
   var RBC WBC;
   output out=means mean= M_RBC M_WBC;
run;

data _null_;
   set means;
   call symput('AveRBC',M_RBC);
   call symput('AveWBC',M_WBC);
run;

data new;
   set learn.blood(obs=5 keep=Subject RBC WBC);
   Per_RBC = RBC / &AveRBC;
   Per_WBC = WBC / &AveWBC;
   format Per_RBC Per_WBC percent8.;
run;
PROC MEANS creates a data set (called Means in this example) consisting of one observation that holds the two variables of interest, M_RBC and M_WBC, which represent the mean values of RBC and WBC, respectively. The DATA _NULL_ step uses a CALL routine called SYMPUT. This CALL routine assigns the value of a DATA step variable to a macro variable. You cannot use a %LET statement here because you don’t know the value of M_RBC or M_WBC. The first argument of CALL SYMPUT is the name of the macro variable you want to create. The second argument is the value you want to assign to the macro variable. When this DATA step executes, the two macro variables (AveRBC and AveWBC) are assigned the two mean values.

The values of these two macro variables are not available in the same DATA step, so a final DATA step is needed. In the final DATA step, the two variables Per_RBC and Per_WBC are computed by dividing the value in the current observation by the mean value. You may wonder why this statement doesn’t multiply the resulting value by 100 to create a percentage. The SAS format PERCENT not only adds a percent sign, it multiplies by 100 as well.
Here is a listing of data set New:

RBC and WBC as a Percent of the Mean

Subject    WBC    RBC   Per_RBC   Per_WBC
      1   7710   7.40      135%      109%
      2   6560   4.70       86%       93%
      3   5690   7.53      137%       81%
      4   6680   6.85      125%       95%
      5      .   7.72      141%         .
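The conditional SET alternative mentioned at the start of this section can be sketched as follows. This is a minimal illustration, not the book's program: the Blood data set here is a five-row stand-in typed from the listing above (an assumption, so the means and percentages will differ from those computed on the full learn.blood data set).

```sas
/* Toy stand-in for learn.blood: five rows copied from the listing */
data blood;
   input Subject WBC RBC;
datalines;
1 7710 7.40
2 6560 4.70
3 5690 7.53
4 6680 6.85
5 . 7.72
;

proc means data=blood noprint;
   var RBC WBC;
   output out=means mean=M_RBC M_WBC;
run;

data new;
   /* Read the one-row Means data set on the first iteration only.  */
   /* Variables that come from a SET statement are retained, so     */
   /* M_RBC and M_WBC stay available for every later observation.   */
   if _n_ = 1 then set means(keep=M_RBC M_WBC);
   set blood;
   Per_RBC = RBC / M_RBC;
   Per_WBC = WBC / M_WBC;
   format Per_RBC Per_WBC percent8.;
   drop M_RBC M_WBC;
run;

proc print data=new noobs;
run;
```

With this approach no macro variables are needed, at the cost of carrying (and then dropping) the two mean variables in the DATA step.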
Hopefully, this very brief introduction to SAS macros has shown you that the SAS macro language is not as scary as you might have thought and has given you the courage to try using them in your own programs.
25.9 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book's companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Use PROC PRINT to list the first five observations from data set Stocks. Use the system macro variables &SYSDAY (returns the day of the week), &SYSDATE, and &SYSTIME to customize the title (see the following example, which was run on a Friday):

Listing produced on Friday, 01SEP2006 at 09:47

date          Price
01/01/2006      $34
01/02/2006      $35
01/03/2006      $39
01/04/2006      $30
01/05/2006      $35
2. Rewrite the following program using %LET to assign the starting and ending values of the DO loop to macro variables. Be sure these values also print in the PROC PRINT output.

data sqrt_table;
   do n = 1 to 5;
      Sqrt_n = sqrt(n);
      output;
   end;
run;

title "Square Root Table from 1 to 5";
proc print data=sqrt_table noobs;
run;

3. Create a macro (call it PRINT_N) to produce a listing of the first n observations from a selected data set. Use the following program as a guide, replacing the data set name and the number of observations to print as calling arguments to the macro.

title "Listing of the first 5 Observations from "
      "Data set STOCKS";
proc print data=learn.stocks(obs=5) noobs;
run;
Use the macro to list the first four observations from SAS data set Bicycles. Your listing should look like this:

Listing of the first 4 Observations from Data set learn.bicycles

Country   Model           Manuf        Units   UnitCost   TotalSales
USA       Road Bike       Trek          5000     $2,200      $11,000
USA       Road Bike       Cannondale    2000     $2,100       $4,200
USA       Mountain Bike   Trek          6000     $1,200       $7,200
USA       Mountain Bike   Cannondale    4000     $2,700      $10,800
4. Turn the following program into a macro (call it STATS), making it more general. Use as calling arguments the input data set name (Dsn), the CLASS variables (Class), and variables listed in the VAR statement (Vars):

title "Statistics from data set learn.bicycles";
proc means data=bicycles n mean min max maxdec=1;
   class Country;
   var Units TotalSales;
run;

Test your macro by calling it like this:

%STATS(learn.bicycles, Country, Units TotalSales)
5. List three variables (TimeMile, RestPulse, and MaxPulse) from the data set Fitness. In addition to these three variables, compute three new variables (call them P_TimeMile, P_RestPulse, and P_MaxPulse) that represent these three values as a percentage of the mean for all subjects in the data set. Do this by first computing the three means using PROC MEANS. Next, in a DATA _NULL_ step, use CALL SYMPUT to create three macro variables representing the means of these three values. Finally, in a DATA step, use these three macro variables in assignment statements to compute the percentage values.
Chapter 26

Introducing the Structured Query Language

26.1 Introduction
26.2 Some Basics
26.3 Joining Two Tables (Merge)
26.4 Left, Right, and Full Joins
26.5 Concatenating Data Sets
26.6 Using Summary Functions
26.7 Demonstrating an ORDER Clause
26.8 An Example of Fuzzy Matching
26.9 Problems
26.1 Introduction

PROC SQL (structured query language, usually pronounced sequel) offers an alternative to the DATA step for querying and combining SAS data sets. PROC SQL performs some tasks better and more easily than the DATA step. Other tasks may be easier or more efficient using a DATA step. You may also be more familiar with SQL than with the DATA step, or vice versa. The best advice is to learn both and use the tool that works best for you in each situation. This chapter touches only on the basics of PROC SQL. There are some excellent books published by SAS that you will probably want to own. One book is part of the SAS OnlineDoc collection,[1] and the other books are by Kirk Lafler[2] and Katherine Prairie.[3] I also recommend a particularly useful paper by Christianna Williams.[4]
26.2 Some Basics Programmers familiar with SAS syntax may have some difficulty getting started with PROC SQL. For example, you use commas, not spaces, to separate variable and data set names in PROC SQL. You may also find yourself putting semicolons where they don’t belong. Let’s start out with some simple examples of querying a data set and creating a SAS data set by subsetting observations from a larger SAS data set. Before we show you a program, a few words on terminology are in order. The following table lists SAS terms and the corresponding SQL terms:
SAS Term       SQL Equivalent
Data set       Table
Observation    Row
Variable       Column

[1] See SAS Institute Inc., SAS 9.1 SQL Procedure User's Guide (Cary, NC: SAS Institute Inc., 2003).
[2] See Kirk Paul Lafler, PROC SQL: Beyond the Basics Using SAS (Cary, NC: SAS Institute Inc., 2004).
[3] See Katherine Prairie, The Essential PROC SQL Handbook for SAS Users (Cary, NC: SAS Institute Inc., 2005).
[4] See Christianna S. Williams, "PROC SQL for DATA Step Die-hards," which is available at http://www.nesug.info/Proceedings/nesug05/how/how7.pdf.
For the first few examples, we will be working with a SAS data set called Health. Here is a listing:

Listing of HEALTH

Subj   Height   Weight
001        68      155
003        74      250
004        63      110
005        60       95
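If you do not have the book's LEARN library at hand, a small DATA step such as the following sketch can recreate the listing above as a temporary data set. Two assumptions are made here: the data set is written to the WORK library (as health) rather than to learn.health, and Subj is read as a character variable so that its leading zeros are kept.

```sas
/* Temporary stand-in for learn.health, typed from the listing above */
data health;
   input Subj $ Height Weight;   /* Subj assumed to be character */
datalines;
001 68 155
003 74 250
004 63 110
005 60 95
;

/* Quick check: list every row and column */
proc sql;
   select *
   from health;
quit;
```

To run the programs in this chapter against this copy, replace learn.health with health in the FROM clause.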
If you want to print a subset of this data set, selecting all subjects with heights over 66 inches, you could submit the following query:
Program 26-1 Demonstrating a simple query from a single data set

title "Subjects from HEALTH with Height > 65";
proc sql;
   select Subj, Height, Weight
   from learn.health
   where Height gt 66;
quit;
This query starts with a SELECT keyword where you list the variables you want. Notice that the variables in this list are separated by commas (spaces do not work). The keyword FROM names the data set you want to read. Finally, a WHERE clause describes the particular subset you want. SELECT, FROM, and WHERE form a single query, which you end with a single semicolon. In this example, you are not creating an output data set, so, by default, the result of this query is sent as a listing in the Output window (or whatever output location you have specified). Finally, the query ends with a QUIT statement. You do not need a RUN statement because PROC SQL executes as soon as a complete query has been specified. If you don’t include a QUIT statement, PROC SQL remains in memory for another query.
Here is the output that resulted from this program:

Subjects from HEALTH with Height > 65

Subj   Height   Weight
----------------------
001        68      155
003        74      250
If you want to select all the variables from a data set, you can use an asterisk (*), like this:
Program 26-2 Using an asterisk to select all the variables in a data set

proc sql;
   select *
   from learn.health
   where Height gt 66;
quit;

If you want the result of the query to be stored in a SAS data set, include a CREATE keyword, like this:

Program 26-3 Using PROC SQL to create a SAS data set

proc sql;
   create table height65 as
   select *
   from learn.health
   where Height gt 66;
quit;
26.3 Joining Two Tables (Merge)

Two tables, Health (used in the last example) and Demographic, are used to demonstrate various ways to perform joins using PROC SQL. Here are listings of these two tables:

Table Health

Subj   Height   Weight
001        68      155
003        74      250
004        63      110
005        60       95

Table Demographic

Subj   DOB          Gender   Name
001    10/15/1960   M        Friedman
002    08/01/1955   M        Stern
003    12/25/1988   F        McGoldrick
005    05/28/1949   F        Chien
You can select variables from two tables by listing all the variables of interest in the SELECT clause and listing the two data sets in the FROM clause. If a variable has the same name in both data sets, you need a way to distinguish which data set to use. Here’s how it’s done:
Program 26-4 Joining two tables (Cartesian product)

title "Demonstrating a Cartesian Product";
proc sql;
   select health.Subj,
          demographic.Subj,
          Height,
          Weight,
          Name,
          Gender
   from learn.health,
        learn.demographic;
quit;
Because the column Subj is in both tables, you prefix the variable name with the table name. You will see in a minute that you can simplify this a bit. The result from this query is called a Cartesian product and it represents all possible combinations of rows from the first table with rows from the second table. The listing shows two columns, both with the heading of Subj.
Here is a listing of the result:

Demonstrating a Cartesian Product

Subj   Subj   Height   Weight   Name         Gender
---------------------------------------------------
001    001        68      155   Friedman     M
001    002        68      155   Stern        M
001    003        68      155   McGoldrick   F
001    005        68      155   Chien        F
003    001        74      250   Friedman     M
003    002        74      250   Stern        M
003    003        74      250   McGoldrick   F
003    005        74      250   Chien        F
004    001        63      110   Friedman     M
004    002        63      110   Stern        M
004    003        63      110   McGoldrick   F
004    005        63      110   Chien        F
005    001        60       95   Friedman     M
005    002        60       95   Stern        M
005    003        60       95   McGoldrick   F
005    005        60       95   Chien        F
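The size of a Cartesian product is easy to check with a count. The sketch below recreates the two tables as temporary data sets (an assumption: WORK copies named health and demographic, with Subj read as character and Name given a length of 12 so that McGoldrick is not truncated) and counts the rows that the query above produces.

```sas
/* Temporary stand-ins for learn.health and learn.demographic */
data health;
   input Subj $ Height Weight;
datalines;
001 68 155
003 74 250
004 63 110
005 60 95
;

data demographic;
   input Subj $ DOB : mmddyy10. Gender $ Name : $12.;
   format DOB mmddyy10.;
datalines;
001 10/15/1960 M Friedman
002 08/01/1955 M Stern
003 12/25/1988 F McGoldrick
005 05/28/1949 F Chien
;

/* No WHERE or ON clause, so every row of Health is paired */
/* with every row of Demographic: 4 x 4 = 16 rows          */
proc sql;
   select count(*) as N_Rows
   from health, demographic;
quit;
```

PROC SQL typically writes a note to the log when a query involves a Cartesian product, which is a useful reminder when the missing WHERE clause was unintentional.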
If you use this same query to create a SAS data set, there will only be one variable called Subj. If you would like to keep both values of Subj from each data set, you can rename the columns, like this:
Program 26-5 Renaming the two Subj variables

title "Renaming the Two Subj Variables";
proc sql;
   select health.Subj as Health_Subj,
          demographic.Subj as Demog_Subj,
          Height,
          Weight,
          Name,
          Gender
   from learn.health,
        learn.demographic;
quit;
Running this query results in the following:

Renaming the Two Subj Variables (Partial Listing)

Health_   Demog_
Subj      Subj     Height   Weight   Name         Gender
--------------------------------------------------------
001       001          68      155   Friedman     M
001       002          68      155   Stern        M
001       003          68      155   McGoldrick   F
001       005          68      155   Chien        F
003       001          74      250   Friedman     M
003       002          74      250   Stern        M
003       003          74      250   McGoldrick   F
003       005          74      250   Chien        F
Notice that the two subject columns are renamed. A Cartesian product is especially useful when you want to perform matches between names in two tables that are similar (sometimes called a fuzzy merge). As you can see, the number of rows in this table is the number of rows in the first table times the number of rows in the second table. In practice, you will want to add a WHERE clause to restrict which rows to select. In the program that follows, we add a WHERE clause to select only those rows where the subject number is the same in the two tables. Besides adding a WHERE clause, the next program also shows how to distinguish between two columns both with a heading of Subj. Finally, this next program uses a simpler method of naming the two Subj variables in the SELECT clause. Here it is:
Program 26-6 Using aliases to simplify naming variables

proc sql;
   select h.Subj as Subj_Health,
          d.Subj as Subj_Demog,
          Height,
          Weight,
          Name,
          Gender
   from learn.health as h,
        learn.demographic as d
   where h.Subj eq d.Subj;
quit;
First take a look at the FROM clause. To make it easier to name variables with the same name from different tables, you create an alias for each of the tables (h and d in this program). You can use these aliases as prefixes in the SELECT clause (h.Subj and d.Subj). In this program, a WHERE clause was added, restricting the result to rows where the subject number is the same in both tables. Here is the result:

Demonstrating an Inner Join (Merge)

Subj_    Subj_
Health   Demog    Height   Weight   Name         Gender
-------------------------------------------------------
001      001          68      155   Friedman     M
003      003          74      250   McGoldrick   F
005      005          60       95   Chien        F
Only subjects who are in both tables are listed here. In SQL terminology, this is called an inner join. It is equivalent to a merge in a DATA step where each of the two data sets contributes to the merge. Just so this is clear, here is the same result using a DATA step (well, almost: in PROC SQL, you get two Subj variables):

proc sort data=learn.health out=health;
   by Subj;
run;

proc sort data=learn.demographic out=demographic;
   by Subj;
run;

data inner;
   merge health(in=in1) demographic(in=in2);
   by Subj;
   if in1 and in2;
run;

title "Performing an Inner Join Using a DATA Step";
proc print data=inner;
   id Subj;
run;
Isn’t it nice that you don’t have to sort the data sets first when you use SQL?
26.4 Left, Right, and Full Joins An alternative to Program 26-6 is to separate the two table names with the term INNER JOIN, like this:
Program 26-7 Performing an inner join

title "Demonstrating an Inner Join (Merge)";
proc sql;
   select h.Subj as Subj_Health,
          d.Subj as Subj_Demog,
          Height,
          Weight,
          Name,
          Gender
   from learn.health as h inner join
        learn.demographic as d
   on h.Subj eq d.Subj;
quit;
One of the rules of SQL is that when you use the keyword JOIN to join two tables, you use an ON clause instead of a WHERE clause. (You may further subset the result with a WHERE clause.) If you write your inner join this way, it is easy to replace the term INNER JOIN with one of the following: LEFT JOIN, RIGHT JOIN, or FULL JOIN. A left join includes all the rows from the first (left) table and those rows from the second table where there is a corresponding value in the first table. A right join includes all rows from the second (right) table and only matching rows from the first table. A full join includes all rows from both tables (equivalent to a merge in a DATA step). The following program demonstrates these three joins:
Program 26-8 Demonstrating a left, right, and full join

proc sql;
   title "Left Join";
   select h.Subj as Subj_Health,
          d.Subj as Subj_Demog,
          Height,
          Gender
   from learn.health as h left join
        learn.demographic as d
   on h.Subj eq d.Subj;

   title "Right Join";
   select h.Subj as Subj_Health,
          d.Subj as Subj_Demog,
          Height,
          Gender
   from learn.health as h right join
        learn.demographic as d
   on h.Subj eq d.Subj;

   title "Full Join";
   select h.Subj as Subj_Health,
          d.Subj as Subj_Demog,
          Height,
          Gender
   from learn.health as h full join
        learn.demographic as d
   on h.Subj eq d.Subj;
quit;
The results are as follows:

Left Join

Subj_Health   Subj_Demog   Height   Gender
------------------------------------------
001           001              68   M
003           003              74   F
004                            63
005           005              60   F

Right Join

Subj_Health   Subj_Demog   Height   Gender
------------------------------------------
001           001              68   M
              002               .   M
003           003              74   F
005           005              60   F

Full Join

Subj_Health   Subj_Demog   Height   Gender
------------------------------------------
001           001              68   M
              002               .   M
003           003              74   F
004                            63
005           005              60   F
Inspection of this output should help make clear the distinctions among the different types of joins.
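In a full join, each of the two Subj columns can be missing, whereas a DATA step merge produces a single BY variable. One common way to get a single subject column from PROC SQL is the COALESCE function, which returns the first nonmissing argument. The sketch below is an illustration under assumptions: it uses temporary copies of the two tables (Demographic trimmed to the two columns needed) and a hypothetical output column name, Subject.

```sas
/* Temporary stand-ins for the two tables (trimmed for brevity) */
data health;
   input Subj $ Height Weight;
datalines;
001 68 155
003 74 250
004 63 110
005 60 95
;

data demographic;
   input Subj $ Gender $;
datalines;
001 M
002 M
003 F
005 F
;

proc sql;
   title "Full Join with a Single Subject Column";
   /* COALESCE picks h.Subj when present, otherwise d.Subj */
   select coalesce(h.Subj, d.Subj) as Subject,
          Height,
          Gender
   from health as h full join
        demographic as d
   on h.Subj eq d.Subj
   order by Subject;
quit;
```

The result should contain one row per subject found in either table, with missing values for the columns that a table could not supply.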
26.5 Concatenating Data Sets

In a DATA step, you concatenate two data sets by naming them in a single SET statement. In PROC SQL, you use a UNION operator to concatenate selections from two tables. Unlike the DATA step, there are different "flavors" of UNION operators, along with the related EXCEPT and INTERSECT operators. Here is a summary.

Operator                  Description
Union                     Matches by column position (not column name) and drops duplicates
Union All                 Matches by column position but does not drop duplicates
Union Corresponding       Matches by column name and drops duplicates
Union All Corresponding   Matches by column name and does not drop duplicates
Except                    Keeps the unique rows of the first table that are not found in the second (matches by column position unless CORRESPONDING is added)
Intersect                 Keeps the unique rows that are found in both tables (matches by column position unless CORRESPONDING is added)
The UNION ALL CORRESPONDING operator is equivalent to naming two data sets in a SET statement in a DATA step. It is very important to realize that a UNION operator without the keyword CORRESPONDING results in the two data sets being concatenated by column position, not column name. This is illustrated in the examples here. The table New_Members was created to illustrate various types of unions. Here is the listing:

Listing of NEW_MEMBERS

Subj   Gender   Name         DOB
010    F        Ostermeier   03/05/1977
013    M        Brown        06/07/1999
For reference, here is the listing of data set Demographic:

Subj   DOB          Gender   Name
001    10/15/1960   M        Friedman
002    08/01/1955   M        Stern
003    12/25/1988   F        McGoldrick
005    05/28/1949   F        Chien
Suppose you want to add these new members to the Demographic data set and call the new data set Demog_Complete. Here’s how to do it using PROC SQL. (Notice that the columns are not in the same order as the Demographic data set.)
Program 26-9 Concatenating two tables

proc sql;
   create table demog_complete as
   select *
   from learn.demographic union all corresponding
   select *
   from learn.new_members;
quit;
The resulting table contains all the rows from Demographic followed by all the rows from New_Members.

Listing of DEMOG_COMPLETE

Subj   DOB          Gender   Name
001    10/15/1960   M        Friedman
002    08/01/1955   M        Stern
003    12/25/1988   F        McGoldrick
005    05/28/1949   F        Chien
010    03/05/1977   F        Ostermeier
013    06/07/1999   M        Brown
If you leave out the keyword CORRESPONDING, here is the result (SAS log):

3    ***Concatenating rows from two tables;
4    proc sql;
5       create table demog_complete as
6       select *
7       from learn.demographic union all
8       select *
9       from learn.new_members
10   quit;
ERROR: Column 2 from the first contributor of UNION ALL is not the same type as its counterpart from the second.
ERROR: Column 4 from the first contributor of UNION ALL is not the same type as its counterpart from the second.
If, by chance, the data types match column by column in the two data sets, SQL will perform the union. To understand this, here is another data set, New_Members_Order, where the order of the columns is changed:

Listing of NEW_MEMBERS_ORDER

Name         DOB          Gender   Subj
Ostermeier   03/05/1977   F        010
Brown        06/07/1999   M        013
Each of the four columns of New_Members_Order has the same data type (character or numeric) as the four columns of Demographic. So, if you omit the CORRESPONDING keyword when performing a union of these two data sets, you have the following result:

Listing of DEMOG_COMPLETE Without the CORRESPONDING Keyword

Subj         DOB          Gender   Name
001          10/15/1960   M        Friedman
002          08/01/1955   M        Stern
003          12/25/1988   F        McGoldrick
005          05/28/1949   F        Chien
Ostermeier   03/05/1977   F        010
Brown        06/07/1999   M        013
You can now see why you need to choose the correct UNION operator when concatenating two data sets.
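Another way to guard against a positional mismatch is to list the columns explicitly, in the same order, in both SELECT clauses instead of using an asterisk. The following sketch assumes temporary copies of the two tables (named demographic and new_members_order in the WORK library) typed from the listings above.

```sas
/* Temporary stand-ins; note the different column orders */
data demographic;
   input Subj $ DOB : mmddyy10. Gender $ Name : $12.;
   format DOB mmddyy10.;
datalines;
001 10/15/1960 M Friedman
002 08/01/1955 M Stern
003 12/25/1988 F McGoldrick
005 05/28/1949 F Chien
;

data new_members_order;
   input Name : $12. DOB : mmddyy10. Gender $ Subj $;
   format DOB mmddyy10.;
datalines;
Ostermeier 03/05/1977 F 010
Brown 06/07/1999 M 013
;

proc sql;
   create table demog_complete as
   /* Same column list in both queries, so matching by */
   /* position lines up Subj with Subj, DOB with DOB,  */
   /* and so on                                        */
   select Subj, DOB, Gender, Name
   from demographic
   union all
   select Subj, DOB, Gender, Name
   from new_members_order;
quit;

title "Listing of DEMOG_COMPLETE";
proc print data=demog_complete noobs;
run;
```

Spelling out the columns is more typing than adding CORRESPONDING, but it makes the intended alignment visible in the query itself.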
26.6 Using Summary Functions You can use functions such as MEAN and SUM to create new variables that represent means or sums of other variables. You can also create new variables within the query. The following program shows one of the strengths of PROC SQL, which is the ability to add a summary variable to an existing table. Suppose you want to express each person’s height in the Health data set as a percentage of the mean height of all the subjects. Using a DATA step, you would first use PROC MEANS to create a data set containing the mean height. You would then combine this with the original data set and perform the calculation. PROC SQL makes this task much easier. Let’s take a look.
Program 26-10 Using a summary function in PROC SQL

proc sql;
   select Subj,
          Height,
          Weight,
          mean(Height) as Ave_Height,
          100*Height/calculated Ave_Height as Percent_Height
   from learn.health;
quit;
The mean height is computed using the MEAN function. This value is also given the variable name Ave_Height. When you use this variable in a calculation, you need to precede it with the keyword CALCULATED, so that PROC SQL doesn't look for the variable in one of the input data sets. Here is the result:

Using a Summary Function

                                         Percent_
Subj   Height   Weight   Ave_Height      Height
-------------------------------------------------
001        68      155        66.25    102.6415
003        74      250        66.25    111.6981
004        63      110        66.25    95.09434
005        60       95        66.25    90.56604
Notice how much easier this is using PROC SQL compared to a DATA step.
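As a variation, the summary function can be used directly inside the expression, and a column format can take care of the percent display (recall from Chapter 25 that the PERCENT format multiplies by 100). This is a sketch under assumptions: it uses a temporary copy of the Health table and a PERCENT8.1 format chosen only for illustration.

```sas
/* Temporary stand-in for learn.health */
data health;
   input Subj $ Height Weight;
datalines;
001 68 155
003 74 250
004 63 110
005 60 95
;

proc sql;
   title "Height as a Percent of the Mean";
   /* mean(Height) is computed over all rows and merged */
   /* back onto each row, so no CALCULATED keyword is   */
   /* needed in this form                               */
   select Subj,
          Height,
          Height / mean(Height) as Percent_Height format=percent8.1
   from health;
quit;
```

Because the query mixes a summary value with row-level values, the SAS log normally carries a note saying that summary statistics were remerged with the original data.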
26.7 Demonstrating an ORDER Clause

PROC SQL can also sort the result of a query if you include an ORDER BY clause. For example, if you want the subjects in the Health table in height order, use the following:

Program 26-11 Demonstrating an ORDER clause

proc sql;
   title "Listing in Height Order";
   select Subj, Height, Weight
   from learn.health
   order by Height;
quit;
The result (not shown) is a listing of the variables Subj, Height, and Weight in order of increasing Height.
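To reverse the order, add the DESC keyword after the column name. You can also list more than one column, separated by commas. A minimal sketch, assuming a temporary copy of the Health table:

```sas
/* Temporary stand-in for learn.health */
data health;
   input Subj $ Height Weight;
datalines;
001 68 155
003 74 250
004 63 110
005 60 95
;

proc sql;
   title "Listing in Decreasing Height Order";
   select Subj, Height, Weight
   from health
   /* Tallest first; ties on Height are broken by Subj */
   order by Height desc, Subj;
quit;
```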
26.8 An Example of Fuzzy Matching

One of the strengths of PROC SQL is its ability to create a Cartesian product. As mentioned earlier in this chapter, a Cartesian product is a pairing of every row in one table with every row in another table. Here are two tables: Demographic (used in many of the other examples in this chapter) and Insurance.

Table Demographic

Subj   DOB          Gender   Name
001    10/15/1960   M        Friedman
002    08/01/1955   M        Stern
003    12/25/1988   F        McGoldrick
005    05/28/1949   F        Chien

Table Insurance

Name      Type
Fridman   F
Goldman   P
Chein     F
Stern     P
You want to join (merge) these tables by Name, allowing for slight misspelling of the names. Here is an SQL query that does just that:
Program 26-12 Using PROC SQL to perform a fuzzy match

proc sql;
   title "Example of a Fuzzy Match";
   select Subj,
          h.Name as health_name,
          i.Name as insurance_name
   from learn.demographic as h,
        learn.insurance as i
   where spedis(h.Name,i.Name) le 25;
quit;
The SPEDIS (spelling distance) function allows for misspellings (see Chapter 12). The WHERE clause operates on every combination of names from the two tables and selects those names that are within a spelling distance of 25. In practice, you would want to compare other variables such as Gender and DOB between two files to increase the likelihood that a valid match is being made. Take a look at the listing that follows to see which names were matched by this program:

Example of a Fuzzy Match

Subj   health_name   insurance_name
-----------------------------------
001    Friedman      Fridman
005    Chien         Chein
002    Stern         Stern
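When choosing a cutoff such as 25, it can help to look at the spelling distance for every pairing before filtering. The sketch below is an illustration, not part of the book's program: it uses temporary copies of the two tables (Demographic trimmed to Subj and Name) and a hypothetical column name, Distance.

```sas
/* Temporary stand-ins for learn.demographic and learn.insurance */
data demographic;
   input Subj $ Name : $12.;
datalines;
001 Friedman
002 Stern
003 McGoldrick
005 Chien
;

data insurance;
   input Name : $12. Type $;
datalines;
Fridman F
Goldman P
Chein F
Stern P
;

proc sql;
   title "Spelling Distance for Every Pair of Names";
   /* Cartesian product: every Demographic name is compared */
   /* with every Insurance name                              */
   select d.Subj,
          d.Name as demog_name,
          i.Name as insurance_name,
          spedis(d.Name, i.Name) as Distance
   from demographic as d,
        insurance as i
   order by Distance;
quit;
```

Exact matches have a distance of 0, so they sort to the top; scanning down the list shows where plausible matches end and unrelated names begin.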
We have only touched the surface of what you can do with SQL. Hopefully, this introduction to SQL will encourage you to learn more.
26.9 Problems

Solutions to odd-numbered problems are located at the back of this book and on the CD that accompanies this book. Solutions to all problems are available to professors. If you are a professor, visit the book's companion Web site at http://support.sas.com/cody for information about how to obtain the solutions to all problems.

1. Use PROC SQL to list all the observations from data set Inventory where Price is greater than 20.

2. Repeat Problem 1, except use PROC SQL to create a new, temporary SAS data set (Price20) containing all observations from Inventory where the Price is greater than 20.

3. Use PROC SQL to create a new, temporary, SAS data set (N_Sales) containing the observations from Sales where Region has a value of North. Include only the variables Name and TotalSales in the new data set.

4. Data sets Inventory and Purchase are shown here:

Listing of INVENTORY

Model     Price
M567     $23.50
S888     $12.99
L776    $159.98
X999     $29.95
M123      $4.59
S776      $1.99

Listing of PURCHASE

Cust Number   Model   Quantity
101           L776           1
102           M123          10
103           X999           2
103           M567           1
Use PROC SQL to list all purchased items showing the Cust Number, Model, Quantity, Price, and a new variable, Cost, equal to the Price times the Quantity. 5. Data sets Left and Right are shown here. Use PROC SQL to create a new, temporary SAS data set (Both) containing Subj, Height, Weight, and Salary. Do this three ways: first, include only those subjects who are in both data sets, second, include all subjects from both data sets, and third, include only those subjects who are in data set Left.
Listing of LEFT

Obs   Subj   Height   Weight
  1   001        68      155
  2   002        75      220
  3   003        65       99
  4   005        79      266
  5   006        70      190
  6   009        61      122

Listing of RIGHT

Obs   Subj     Salary
  1   001     $46,000
  2   003     $67,900
  3   004     $28,200
  4   005     $98,202
  5   006     $88,000
  6   007     $57,200
6. Write the necessary PROC SQL statements to accomplish the same goal as the program here:

data allproducts;
   set learn.inventory learn.newproducts;
run;

7. Write the necessary PROC SQL statements to accomplish the same goal as the program here:

data third;
   set learn.first learn.second;
run;
Be careful! The order of the variables is not the same in both data sets. Also, some subject numbers are in both data sets.

8. Use PROC SQL to list the values of RBC (red blood cells) and WBC (white blood cells) from the Blood data set. Include two new variables in this list: Percent_RBC and Percent_WBC. These variables are the values of RBC and WBC expressed as a percentage of the mean value for all subjects. The first few observations in this list should look like this:

 RBC    WBC   Percent_RBC   Percent_WBC
---------------------------------------
 7.4   7710      134.9497      109.4708
 4.7   6560      85.71127      93.14248
7.53   5690      137.3204      80.78974
6.85   6680      124.9196       94.8463
7.72      .      140.7853             .
3.69   6140      67.29247      87.17909
9. In a similar manner to Problem 8, use the Blood data set to create a new, temporary SAS data set (Percentages) containing the variables Subject, RBC, WBC, MeanRBC, MeanWBC, Percent_RBC, and Percent_WBC.

10. Run the program here to create two temporary SAS data sets, XXX and YYY.

data xxx;
   input NameX : $15. PhoneX : $13.;
datalines;
Friedman (908)848-2323
Chien (212)777-1324
;
data yyy;
   input NameY : $15. PhoneY : $13.;
datalines;
Chen (212)888-1325
Chambliss (830)257-8362
Saffer (740)470-5887
;
Then write the PROC SQL statements to perform a fuzzy match between the names in each data set. List the observations where the names are within a spelling distance (the SPEDIS function) of 25. The result should be only one observation, as follows:

Listing of FUZZY

Obs   NameX   PhoneX          NameY   PhoneY
  1   Chien   (212)777-1324   Chen    (212)888-1325
Solutions to Odd-Numbered Problems
*---------------------------------------------------------*
| This SAS file contains the solutions to all the odd-    |
| numbered problems in the text: Learning SAS by Example: |
| A Programmer's Guide.                                   |
*---------------------------------------------------------*;

***You need to modify any libname and infile statements so
   that they point to the appropriate folder on your computer;

***The simplest way to convert all the libname and infile
   statements in this file is to find the string:
   c:\books\learning
   and replace it with the folder where you placed your SAS
   data sets and text files. If you are storing your SAS data
   sets and text files in separate places, you will need to
   search separately for libname and infile statements and
   make changes appropriately;
Chapter 1 Solutions

libname learn 'c:\books\learning';
options fmtsearch=(learn);

*1-1;
/* Invalid variable names are:
   Wt-Kg       (contains a dash)
   76Trombones (starts with a number)
*/

*1-3;
/* Number of variables is 5
   Number of observations is 10
*/

*1-5;
/* Default length for numerics is 8 */

Chapter 2 Solutions

*2-1;
*---------------------------------------------------*
| Program name: stocks.sas in c:\books\learning     |
| Purpose: Read in raw data on stock prices and     |
|          compute values                           |
| Programmer: Ron Cody                              |
| Date: June 23, 2006                               |
*---------------------------------------------------*;
*a;
data portfolio;
   infile 'c:\books\learning\stocks.txt';
   input Symbol $ Price Number;
   Value = Number*Price;
run;

title "Listing of Portfolio";
proc print data=portfolio noobs;
run;

*b;
title "Means and Sums of Portfolio Variables";
proc means data=portfolio n mean sum maxdec=0;
   var Price Number;
run;
*2-3; /* EMF = 1.45*V + (R/E)*v**3 - 125; */
Chapter 3 Solutions

*3-1;
*a - c;
data scores;
   infile 'c:\books\learning\scores.txt';
   input Gender : $1. English History Math Science;
   Average = (English + History + Math + Science) / 4;
run;

title "Listing of SCORES";
proc print data=scores noobs;
run;

*3-3;
data company;
   infile 'c:\books\learning\company.txt' dsd dlm='$';
   input LastName $ EmpNo $ Salary;
   format Salary dollar10.; /* optional statement */
run;

title "Listing of COMPANY";
proc print data=company noobs;
run;

*3-5;
data testdata;
   input X Y;
   Z = 100 + 50*X + 2*X**2 - 25*Y + Y**2;
datalines;
1 2
3 5
5 9
9 11
;
title "Listing of TESTDATA";
proc print data=testdata noobs;
run;

*3-7;
data cache;
   infile 'c:\books\learning\geocaching.txt' pad;
   ***Note: PAD not necessary but a good idea
      See Chapter 21 for a discussion of this;
   input GeoName $ 1-20
         LongDeg 21-22
         LongMin 23-28
         LatDeg  29-30
         LatMin  31-36;
run;

title "Listing of CACHE";
proc print data=cache noobs;
run;

*3-9;
data cache;
   infile 'c:\books\learning\geocaching.txt' pad;
   input @1  GeoName $20.
         @21 LongDeg 2.
         @23 LongMin 6.
         @29 LatDeg  2.
         @31 LatMin  6.;
run;

title "Listing of CACHE";
proc print data=cache noobs;
run;

*3-11;
data employ;
   infile 'c:\books\learning\employee.csv' dsd missover;
   ***Note: missover is not needed but a good idea.
      truncover will also work
      See Chapter 21 for an explanation of missover
      and truncover infile options;
   informat ID $3. Name $20. Depart $8.
            DateHire mmddyy10. Salary dollar8.;
   input ID Name Depart DateHire Salary;
   format DateHire date9.;
run;

title "Listing of EMPLOY";
proc print data=employ noobs;
run;
Chapter 4 Solutions

*4-1;
libname learn 'c:\books\learning';
data learn.perm;
   input ID : $3. Gender : $1. DOB : mmddyy10.
         Height Weight;
   label DOB = 'Date of Birth'
         Height = 'Height in inches'
         Weight = 'Weight in pounds';
   format DOB date9.;
datalines;
001 M 10/21/1946 68 150
002 F 5/26/1950 63 122
003 M 5/11/1981 72 175
004 M 7/4/1983 70 128
005 F 12/25/2005 30 40
;
title "Contents of data set PERM";
proc contents data=learn.perm varnum;
run;

*4-3;
libname perm 'c:\books\learning';
data perm.survey2007;
   input Age Gender $ (Ques1-Ques5)($1.);
datalines;
23 M 15243
30 F 11123
42 M 23555
48 F 55541
55 F 42232
62 F 33333
68 M 44122
;
***Opening up a new session, you need to reissue
   a libname statement;
libname perm 'c:\books\learning';
title "Computing Average Age";
proc means data=perm.survey2007;
   var Age;
run;
Chapter 5 Solutions

*5-1;
proc format;
   value agegrp 0 - 30    = '0 to 30'
                31 - 50   = '31 to 50'
                51 - 70   = '51 to 70'
                71 - high = '71 and older';
   value $party 'D' = 'Democrat'
                'R' = 'Republican';
   value $likert '1' = 'Strongly Disagree'
                 '2' = 'Disagree'
                 '3' = 'No Opinion'
                 '4' = 'Agree'
                 '5' = 'Strongly Agree';
run;

data voter;
   input Age Party : $1. (Ques1-Ques4)($1. + 1);
   label Ques1 = 'The president is doing a good job'
         Ques2 = 'Congress is doing a good job'
         Ques3 = 'Taxes are too high'
         Ques4 = 'Government should cut spending';
   format Age agegrp.
          Party $party.
          Ques1-Ques4 $likert.;
datalines;
23 D 1 1 2 2
45 R 5 5 4 1
67 D 2 4 3 3
39 R 4 4 4 4
19 D 2 1 2 1
75 D 3 3 2 3
57 R 4 3 4 4
;
title "Listing of Voter";
proc print data=voter;
   ***Add the option LABEL if you want to use the
      labels as column headings;
run;

title "Frequencies on the Four Questions";
proc freq data=voter;
   tables Ques1-Ques4;
run;
*5-3;
data colors;
   input Color : $1. @@;
datalines;
R R B G Y Y . . B G R B G Y P O O V V B
;
proc format;
   value $color 'R','B','G' = 'Group 1'
                'Y','O'     = 'Group 2'
                ' '         = 'Not Given'
                Other       = 'Group 3';
run;

title "Color Frequencies (Grouped)";
proc freq data=colors;
   tables color / nocum missing;
   *The MISSING option places the frequency of missing values
    in the body of the table and causes the percentages to be
    computed on the number of observations, missing or
    non-missing;
   format color $color.;
run;

*5-5;
libname learn 'c:\books\learning';
options fmtsearch=(learn);
proc format library=learn fmtlib;
   value yesno   1='Yes' 2='No';
   value $yesno  'Y'='Yes' 'N'='No';
   value $gender 'M'='Male' 'F'='Female';
   value age20yr low-20  = '1'
                 21-40   = '2'
                 41-60   = '3'
                 61-80   = '4'
                 81-high = '5';
run;
Chapter 6 Solutions *6-1; /* Select File --> Import Data Choose Excel and select Drugtest.xls. */
*6-3; libname readit 'c:\books\learning\soccer.xls'; title "Using the Excel Engine to read data"; proc print data=readit.'soccer$'n noobs; run;
Chapter 7 Solutions *7-1; data school; input Age Quiz : $1. Midterm Final; if Age = 12 then Grade = 6; else if Age = 13 then Grade = 9; if Quiz = 'A' then QuizGrade = 95; else if Quiz = 'B' then QuizGrade = 85; else if Quiz = 'C' then QuizGrade = 75; else if Quiz = 'D' then QuizGrade = 70; else if Quiz = 'F' then QuizGrade = 65; CourseGrade = .2*QuizGrade + .3*Midterm + .5*Final; datalines; 12 A 92 95 12 B 88 88 13 C 78 75 13 A 92 93 12 F 55 62 13 B 88 82 ; title "Listing of SCHOOL"; proc print data=school noobs; run; *7-3; title "Selected Employees from SALES"; proc print data=learn.sales; where EmpID = '9888' or EmpID = '0177'; run; proc print data=learn.sales; where EmpID in ('9888' '0177'); run; *7-5; data blood; set learn.blood;
length CholGroup $ 6; select; when (missing(Chol)) CholGroup = ' '; when (Chol le 110) CholGroup = 'Low'; when (Chol le 140) CholGroup = 'Medium'; otherwise CholGroup = 'High'; end; run; title "Listing of BLOOD"; proc print data=blood noobs; run; *7-7; title "Selected Observations from BICYCLES"; proc print data=learn.bicycles noobs; where Model eq "Road Bike" and UnitCost gt 2500 or Model eq "Hybrid" and UnitCost gt 660; *Note: parentheses are not needed since the AND operation is performed before OR. You may include them if you wish; run;
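The note in solution 7-7 about operator precedence can be checked with a short DATA step. The sketch below is not part of the original solutions: the data set name PRECEDENCE, the variables NoParens, WithParens, and Same, and the five data lines are all made up for illustration. Only the two conditions are taken from the solution above. Because AND is evaluated before OR, the two expressions should agree, so Same should equal 1 in every row.

```sas
*Sketch only: the data values below are hypothetical;
data precedence;
   infile datalines dsd;
   input Model : $10. UnitCost;
   /* A logical expression evaluates to 1 (true) or 0 (false) */
   NoParens   = (Model eq "Road Bike" and UnitCost gt 2500 or
                 Model eq "Hybrid" and UnitCost gt 660);
   WithParens = ((Model eq "Road Bike" and UnitCost gt 2500) or
                 (Model eq "Hybrid" and UnitCost gt 660));
   /* Same is 1 when both forms give the same result */
   Same = (NoParens eq WithParens);
datalines;
Road Bike,3000
Road Bike,2000
Hybrid,700
Hybrid,600
Mountain,5000
;
title "AND Is Evaluated before OR";
proc print data=precedence noobs;
run;
```

Adding the parentheses anyway is a reasonable choice when a WHERE condition mixes AND and OR, since it spares the reader from recalling the precedence rule.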
Chapter 8 Solutions *8-1; data vitals; input ID : $3. Age Pulse SBP DBP; label SBP = "Systolic Blood Pressure" DBP = "Diastolic Blood Pressure"; datalines; 001 23 68 120 80 002 55 72 188 96 003 78 82 200 100 004 18 58 110 70 005 43 52 120 82 006 37 74 150 98 007 . 82 140 100 ; ***Note: this program assumes there are no missing values for Pulse or SBP;
data newvitals; set vitals; if Age lt 50 and not missing(Age) then do; if Pulse lt 70 then PulseGroup = 'Low '; else PulseGroup = 'High'; if SBP lt 140 then SBPGroup = 'Low '; else SBPGroup = 'High'; end; else if Age ge 50 then do; if Pulse lt 74 then PulseGroup = 'Low'; else PulseGroup = 'High'; if SBP lt 140 then SBPGroup = 'Low'; else SBPGroup = 'High'; end; run; title "Listing of NEWVITALS"; proc print data=newvitals noobs; run; *8-3; data test; input Score1-Score3; Subj + 1; datalines; 90 88 92 75 76 88 88 82 91 72 68 70 ; title "Listing of TEST"; proc print data=test noobs; run; *8-5; data logs; do N = 1 to 20; LogN = log(N); output; end; run; title "Listing of LOGS"; proc print data=logs noobs; run; *8-7; data plotit; do x = 0 to 10 by .1;
y = 3*x**2 - 5*x + 10; output; end; run; goptions reset=all ftext='arial' htext=1.0 ftitle='arial/bo' htitle=1.5 colors=(black); symbol v=none i=sm; title "Problem 7"; proc gplot data=plotit; plot y * x; run; quit; *8-9; data temperatures; do Day = 'Mon','Tues','Wed','Thu','Fri','Sat','Sun'; input Temp @; output; end; datalines; 70 72 74 76 77 78 85 ; title "Listing of TEMPERATURES"; proc print data=temperatures noobs; run; *8-11; data temperature; length City $ 7; do City = 'Dallas','Houston'; do Hour = 1 to 24; input Temp @; output; end; end; datalines; 80 81 82 83 84 84 87 88 89 89 91 93 93 95 96 97 99 95 92 90 88 86 84 80 78 76 77 78 80 81 82 82 86 88 90 92 92 93 96 94 92 90 88 84 82 78 76 74 ;
title "Temperatures in Dallas and Houston"; proc print data=temperature; run; *8-13; data money; do Year = 1 to 999 until (Amount ge 30000); Amount + 1000; do Quarter = 1 to 4; Amount + Amount*(.0425/4); output; end; end; format Amount dollar10.; run; title "Listing of MONEY"; proc print data=money; run;
Chapter 9 Solutions *9-1; data dates; input @1 Subj $3. @4 DOB mmddyy10. @14 Visit date9.; Age = yrdif(DOB,Visit,'Actual'); format DOB Visit date9.; datalines; 00110/21/195011Nov2006 00201/02/195525May2005 00312/25/200525Dec2006 ; title "Listing of DATES"; proc print data=dates noobs; run; *9-3; options yearcutoff=1910; data year1910_2006; input @1 Date mmddyy8.; format Date date9.; datalines; 01/01/11 02/23/05
03/15/15
05/09/06
;
options yearcutoff=1920; /* Good idea to set yearcutoff back to the default after you change it */
title "Listing of YEAR1910_2006";
proc print data=year1910_2006 noobs;
run;

*9-5;
data freq;
   set learn.hosp(keep=AdmitDate);
   Day = weekday(AdmitDate);
   Month = month(AdmitDate);
   Year = year(AdmitDate);
run;

proc format;
   value days 1='Sun' 2='Mon' 3='Tue' 4='Wed' 5='Thu' 6='Fri' 7='Sat';
   value months 1='Jan' 2='Feb' 3='Mar' 4='Apr' 5='May' 6='Jun'
                7='Jul' 8='Aug' 9='Sep' 10='Oct' 11='Nov' 12='Dec';
run;
title "Frequencies for Hospital Admissions";
proc freq data=freq;
   tables Day Month Year / nocum nopercent;
   format Day days. Month months.;
run;

*9-7;
title "Admissions before July 15, 2002";
proc print data=learn.hosp;
   where AdmitDate le '15Jul2002'd and AdmitDate is not missing;
run;

*9-9;
data dates;
   input Day Month Year;
   if missing(Day) then Date = mdy(Month,15,Year);
   else Date = mdy(Month,Day,Year);
   format Date mmddyy10.;
datalines;
25 12 2005
. 5 2002
12 8 2006
;
title "Listing of DATES"; proc print data=dates noobs; run; *9-11; data intervals; set learn.medical; Quarters = intck('qtr','01Jan2006'd,VisitDate); run; title "Listing of INTERVALS"; proc print data=intervals noobs; run; *9-13; data return; set learn.medical(keep=Patno VisitDate); Return = intnx('month',VisitDate,6,'sameday'); format VisitDate Return worddate.; run; title "Return Visits for Medical Patients"; proc print data=return noobs; run;
Chapter 10 Solutions *10-1; data subset_a; set learn.blood; where Gender eq 'Female' and BloodType='AB'; Combined = .001*WBC + RBC; run; title "Listing of SUBSET_A"; proc print data=subset_a noobs; run; data subset_b; set learn.blood; Combined = .001*WBC + RBC; if Gender eq 'Female' and BloodType='AB' and Combined ge 14; run;
title "Listing of SUBSET_B"; proc print data=subset_b noobs; run; *10-3; data lowmale lowfemale; set learn.blood; where Chol lt 100 and Chol is not missing; /* alternative statement where Chol lt 100 and not missing(Chol); */ if Gender = 'Female' then output lowfemale; else if Gender = 'Male' then output lowmale; run; title "Listing of LOWMALE"; proc print data=lowmale noobs; run; title "Listing of LOWFEMALE"; proc print data=lowfemale noobs; run; *10-5; title "Listing of INVENTORY"; proc print data=learn.inventory noobs; run; title "Listing of NEWPRODUCTS"; proc print data=learn.newproducts noobs; run; data updated; set learn.inventory learn.newproducts; run; proc sort data=updated; by Model; run; title "Listing of updated"; proc print data=updated; run;
*10-7; proc means data=learn.gym noprint; var fee; output out=Meanfee(drop=_type_ _freq_) Mean=Avefee; run; data percent; set learn.gym; if _n_ = 1 then set Meanfee; FeePercent = round(100*fee / Avefee); drop Avefee; run; title "Listing of PERCENT"; proc print data=PERCENT; run; *10-9; proc sort data=learn.inventory out=inventory; by Model; run; proc sort data=learn.purchase out=purchase; by Model; run; data pur_price; merge inventory purchase(in=InPurchase); by Model; if InPurchase; TotalPrice = Quantity*Price; format TotalPrice dollar8.2; run; title "Listing of PUR_PRICE"; proc print data=pur_price noobs; run; *10-11; options mergenoby=nowarn; data try1; merge learn.inventory learn.purchase; run; title "Listing of TRY1"; proc print data=try1; run;
options mergenoby=warn;
data try2;
   merge learn.inventory learn.purchase;
run;
title "Listing of TRY2";
proc print data=try2;
run;

options mergenoby=error;
data try3;
   merge learn.inventory learn.purchase;
run;
title "Listing of TRY3";
proc print data=try3;
run;

*10-13;
/* Solution where the numeric identifier is converted to a character value */
proc sort data=learn.demographic out=demographic;
   by ID;
run;

data survey2;
   set learn.survey2(rename=(ID = NumID));
   ID = put(NumID,z3.);
   drop NumID;
run;

proc sort data=survey2;
   by ID;
run;

data combine;
   merge demographic survey2;
   by ID;
run;
title "Listing of COMBINE";
proc print data=combine noobs;
run;

/* Solution where the character identifier is converted to a numeric value */
data demographic;
   set learn.demographic(rename=(ID = CharID));
ID = input(CharID,3.); drop CharID; run; proc sort data=demographic; by ID; run; proc sort data=learn.survey2 out=survey2; by ID; run; data combine; merge demographic survey2; by ID; run; title "Listing of COMBINE"; proc print data=combine noobs; run;
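The two solutions to problem 10-13 rely on PUT to turn a numeric value into a character string and on INPUT to go the other way. The short DATA _NULL_ step below is a sketch of that round trip, not part of the original solution. The variable names NumID, CharID, and BackToNum and the value 7 are assumptions chosen for illustration; the Z3. format and 3. informat are the ones used in the solutions above.

```sas
*Sketch only: the value 7 is a made-up identifier;
data _null_;
   NumID = 7;
   CharID = put(NumID,z3.);        /* numeric to character, with leading zeros: '007' */
   BackToNum = input(CharID,3.);   /* character to numeric: 7 */
   put NumID= CharID= BackToNum=;  /* writes the three values to the SAS log */
run;
```

Either direction works for the merge, provided the BY variable ends up with the same type in both data sets. The Z3. format matters in the first solution because the character IDs presumably carry leading zeros.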
Chapter 11 Solutions *11-1; data health; set learn.health; BMI = (Weight/2.2) / (Height*.0254)**2; BMIRound = round(BMI); BMIRound_tenth = round(BMI,.1); BMIGroup = round(BMI,5); BMITrunc = int(BMI); run; title "Listing of HEALTH"; proc print data=health noobs; run; *11-3; data miss_blood; set learn.blood; if missing(WBC) then call missing(Gender,RBC, Chol); run;
title "Listing of MISS_BLOOD"; proc print data=miss_blood noobs; run; *11-5; data psychscore; set learn.psych; ScoreAve = mean(largest(1,of Score1-Score5), largest(2,of Score1-Score5), largest(3,of Score1-Score5)); if n(of Ques1-Ques10) ge 7 then QuesAve = mean(of Ques1-Ques10); Composit = ScoreAve + 10*QuesAve; keep ID ScoreAve QuesAve Composit; run; title "Listing of PSYCHSCORE"; proc print data=psychscore noobs; run; *11-7; data _null_; x = 10; y = 20; z = -30; AbsZ = abs(z); ExpX = round(exp(x),.001); Circumference = round(2*constant('pi')*y,.001); put _all_; run; *11-9; data fake; do Subj = 1 to 100; if ranuni(12345) le .4 then Gender = 'Female'; else Gender = 'Male'; Age = int(ranuni(12345)*50 + 10); output; end; run; title "Frequencies"; proc freq data=fake; tables Gender / nocum; run; title "First 10 Observations of FAKE"; proc print data=fake(obs=10); run; *11-11; data convert; set learn.char_num(rename=
      (Age = Char_Age
       Weight = Char_Weight
       Zip = Num_Zip
       SS = Num_ss));
   Age = input(Char_Age,8.);
   Weight = input(Char_Weight,8.);
   SS = put(Num_SS,ssn11.);
   Zip = put(Num_Zip,z5.);
   drop Char_: Num_:;
run;
title "Listing of CONVERT";
proc print data=convert noobs;
run;

*11-13;
data smooth;
   set learn.stocks;
   Price1 = lag(Price);
   Price2 = lag2(Price);
   Average = mean(Price, Price1, Price2);
run;

goptions reset=all colors=(black) ftext=swiss htitle=1.5;
symbol1 v=dot line=1 i=smooth;
symbol2 v=square line=2 i=smooth;
title "Plot of Price and Moving Average";
proc gplot data=smooth;
   plot Price*Date Average*Date / overlay;
run;
quit;
Chapter 12 Solutions *12-1; *One way to test the storage lengths is to use the LENGTHC function that returns storage lengths compared to the LENGTH function that returns the length of a character string, not counting trailing blanks; data storage; length A $ 4 B $ 4; Name = 'Goldstein'; AandB = A || B;
   Cat = cats(A,B);
   if Name = 'Smith' then Match = 'No';
   else Match = 'Yes';
   Substring = substr(Name,5,2);
   L_A = lengthc(A);
   L_B = lengthc(B);
   L_Name = lengthc(Name);
   L_AandB = lengthc(AandB);
   L_Cat = lengthc(Cat);
   L_Match = lengthc(Match);
   L_Substring = lengthc(Substring);
run;
title "Lengths of Character Variables";
proc print data=storage noobs;
   var L_:; *All variables starting with L_;
run;
/*
Variable     Storage Length
A                  4
B                  4
Name               9
AandB              8
Cat              200
Match              2
Substring          9
*/

*12-3;
data names_and_more;
   set learn.names_and_more;
   Name = compbl(Name);
   Phone = compress(Phone,,'kd');
run;
title "Listing of Data Set LEARN.NAMES_AND_MORE";
proc print data=names_and_more noobs;
run;

*12-5;
data convert;
   set learn.names_and_more(keep=Mixed);
   Integer = input(scan(Mixed,1,' /'),8.);
   Numerator = input(scan(Mixed,2,' /'),8.);
   Denominator = input(scan(Mixed,3,' /'),8.);
   if missing(Numerator) then Price = Integer;
   else Price = Integer + Numerator / Denominator;
   drop Numerator Denominator Integer;
run;
title "Listing of CONVERT"; proc print data=convert noobs; run; *12-7; *Using one of the CAT functions; data concat; set learn.study(keep=Group Subgroup); length Combined $ 3; Combined = catx('-',Group,Subgroup); run; title "Listing of CONCAT"; proc print data=concat noobs; run; *Without using CAT functions; data concat; set learn.study(keep=Group Subgroup); length Combined $ 3; Combined = trim(Group) || '-' || put(Subgroup,1.); run; title "Listing of CONCAT"; proc print data=concat noobs; run; *12-9; data spirited; set learn.sales; where find(Customer,'spirit','i'); run; title "Listing of SPIRITED"; proc print data=spirited noobs; run; *12-11; title "Subjects from ERRORS with Digits in the Name"; proc print data=learn.errors noobs; where anydigit(Name); var Subj Name; run; *12-13; data exact within25; set learn.social; if SS1 eq SS2 then output exact; else if spedis(SS1,SS2) le 25 and
not missing(SS1) and not missing(SS2) then output within25; run; title "Listing of EXACT"; proc print data=exact noobs; run; title "Listing of WITHIN25"; proc print data=within25 noobs; run; *12-15; data numbers; set learn.names_and_more(keep=phone); length AreaCode $ 3; AreaCode = substr(Phone,2,3); run; title "Listing of NUMBERS"; proc print data=numbers; run; *12-17; data personal; set learn.personal(drop=Food1-Food8); substr(SS,1,7) = '******'; substr(AcctNum,5,1) = '-'; run; title "Listing of PERSONAL (with masked values)"; proc print data=personal noobs; run;
Chapter 13 Solutions *13-1; data survey1; set learn.survey1; array Ques{5} $ Q1-Q5; do i = 1 to 5; Ques{i} = translate(Ques{i},'54321','12345'); end; drop i; run;
title "List of SURVEY1 (rescaled)"; proc print data=survey1; run; *13-3; data nonines; set learn.nines; array nums{*} _numeric_; do i = 1 to dim(nums); if nums{i} = 999 then call missing(nums{i}); end; drop i; run; title "Listing of NONINES"; proc print data=nonines; run; *13-5; data passing; array pass_score{5} _temporary_ (65,70,60,62,68); array Score{5}; input ID : $3. Score1-Score5; NumberPassed = 0; do Test = 1 to 5; NumberPassed + (Score{Test} ge pass_score{Test}); end; drop Test; datalines; 001 90 88 92 95 90 002 64 64 77 72 71 003 68 69 80 75 70 004 88 77 66 77 67 ; title "Listing of PASSING"; proc print data=passing; id ID; run;
Chapter 14 Solutions *14-1; title "First 10 Observations in BLOOD"; proc print data=learn.blood(obs=10) label; id Subject; var WBC RBC Chol; label WBC = 'White Blood Cells' RBC = 'Red Blood Cells' Chol = 'Cholesterol'; run; *14-3; title "Selected Patients from HOSP Data Set"; title2 "Admitted in September of 2004"; title3 "Older than 83 years of age"; title4 "--------------------------------------"; proc print data=learn.hosp n='Number of Patients = ' label double; where Year(AdmitDate) eq 2004 and Month(AdmitDate) eq 9 and yrdif(DOB,AdmitDate,'Actual') ge 83; id Subject; var DOB AdmitDate DischrDate; label AdmitDate = 'Admission Date' DischrDate = 'Discharge Date' DOB = 'Date of Birth'; run;
Chapter 15 Solutions *15-1; title "First 5 Observations from Blood Data Set"; proc report data=learn.blood(obs=5) nowd headline; column Subject WBC RBC; define Subject / display "Subject Number" width=7; define WBC / "White Blood Cells" width=6 format=comma6.0; define RBC / "Red Blood Cells" width=5 format=5.2; run; quit;
*15-3; title "Demonstrating a Compute Block"; proc report data=learn.hosp(obs=5) nowd headline; column Subject AdmitDate DOB Age; define AdmitDate / display "Admission Date" width=10; define DOB / display; define Subject / display width=7; define Age / computed "Age at Admission" ; compute Age; Age = round(yrdif(DOB,AdmitDate,'Actual')); endcomp; run; quit; *15-5; title "Patient Age Groups"; proc report data=learn.bloodpressure nowd; column Gender Age AgeGroup; define Gender / width=6; define Age / display width=5; define AgeGroup / computed "Age Group"; compute AgeGroup / character length=5; if Age gt 50 then AgeGroup = '> 50'; else if not missing(Age) then AgeGroup = '<= 50'; endcomp; run; quit; *15-7; title "Mean Cholesterol by Gender and Blood Type"; proc report data=learn.blood nowd headline; column Gender BloodType Chol; define Gender / group width=6; define BloodType / group "Blood Type" width=5; define Chol / analysis mean "Mean Cholesterol" width=11 format=5.1; run; quit; *15-9; title "Report on the Survey Data Set"; proc report data=learn.survey nowd headline; column ID Age Gender Salary Ques1-ques5 AveResponse; define ID / display width=4; define Age / display width=5; define Gender / display width=6; define Salary / display width=10 format=dollar10.; define Ques1 / display noprint;
define Ques2 / display noprint; define Ques3 / display noprint; define Ques4 / display noprint; define Ques5 / display noprint; *Note: This solution will cause an automatic character to numeric conversion; compute AveResponse; AveResponse = mean(of Ques1-Ques5); endcomp; /********************************************** To avoid the automatic conversion, substitute the code below for the compute block: compute AveResponse; Q1 = input(Ques1,1.); Q2 = input(Ques2,1.); Q3 = input(Ques3,1.); Q4 = input(Ques4,1.); Q5 = input(Ques5,1.); AveResponse = mean(of Q1-Q5); endcomp; ************************************************/ define AveResponse / computed "Average Response" width=8 format=3.1; run; quit;
Chapter 16 Solutions *16-1; options fmtsearch=(learn); ***This is where the file formats.sas7bcat was placed; title "Statistics on the College Data Set"; proc means data=learn.college n nmiss mean median min max maxdec=2;
var ClassRank GPA; run; *16-3; proc sort data=learn.college out=college; by SchoolSize; run; title "Statistics on the College Data Set - Using BY"; title2 "Broken down by School Size"; proc means data=college n mean median min max maxdec=2; by SchoolSize; var ClassRank GPA;run; title "Statistics on the College Data Set - Using CLASS"; title2 "Broken down by School Size"; proc means data=learn.college n mean median min max maxdec=2; class SchoolSize; var ClassRank GPA; run; *16-5; proc format; value rank 0-50 = 'Bottom Half' 51-74 = 'Third Quartile' 75-100 = 'Top Quarter'; run; title "Statistics on the College Data Set"; title2 "Broken down by School Size"; proc means data=learn.college n mean maxdec=2; class ClassRank; var GPA; format ClassRank rank.; run;
*16-7; proc means data=learn.college noprint chartype; class Gender SchoolSize; var ClassRank GPA; output out=summary n= mean= median= min= max= / autoname; run; data grand(drop=Gender SchoolSize) bygender(drop=SchoolSize) bysize(drop=Gender) cell; drop _freq_; set summary; if _type_ = '00' then output grand; else if _type_ = '10' then output bygender; else if _type_ = '01' then output bysize; else if _type_ = '11' then output cell; run; title "Listing of GRAND"; proc print data=grand noobs; run; title "Listing of BYGENDER"; proc print data=bygender noobs; run; title "Listing of BYSIZE"; proc print data=bysize noobs; run; title "Listing of CELL"; proc print data=cell noobs; run;
Chapter 17 Solutions *17-1; title "One-way Frequencies from BLOOD Data Set"; proc freq data=learn.blood; tables Gender BloodType AgeGroup / nocum nopercent; run;
*17-3; proc format; value cholgrp low-200 = 'Normal' 201-high = 'High' . = 'Missing'; run; title "Demonstrating the MISSING Option"; title2 "Without MISSING Option"; proc freq data=learn.blood; tables Chol / nocum; format Chol cholgrp.; run; title "Demonstrating the MISSING Option"; title2 "With MISSING Option"; proc freq data=learn.blood; tables Chol / nocum missing; format Chol cholgrp.; run; *17-5; proc format; value rank low-70 = 'Low to 70' 71-high = '71 and higher'; run; title "Scholarship by Class Rank"; proc freq data=learn.college; tables Scholarship*ClassRank; format ClassRank rank.; run; *17-7; title "Blood Types in Decreasing Frequency Order"; proc freq data=learn.blood order=freq; tables BloodType / nocum nopercent; run;
Chapter 18 Solutions *18-1; options fmtsearch=(learn); title "Demographics from COLLEGE Data Set"; proc tabulate data=learn.college format=6.; class Gender Scholarship SchoolSize; tables Gender Scholarship all, SchoolSize / rts=15;
keylabel n=' '; run; *18-3; proc format; value $gender 'F' = 'Female' 'M' = 'Male'; run; title "Demographics from COLLEGE Data Set"; proc tabulate data=learn.college format=6.; class Gender Scholarship SchoolSize; tables (Gender all)*(Scholarship all), SchoolSize all / rts=25; keylabel n=' ' all = 'Total'; format Gender $gender.; run; *18-5; title "Descriptive Statistics"; proc tabulate data=learn.college format=6.1; class Gender; var GPA; tables GPA*(n*f=4. mean min max), Gender all; keylabel n = 'Number' all = 'Total' mean = 'Average' min = 'Minimum' max = 'Maximum'; run; *18-7; title "More Descriptive Statistics"; proc tabulate data=learn.college format=7.1 noseps; class Gender SchoolSize; var GPA ClassRank; tables SchoolSize all, GPA*(median min max) ClassRank*(median*f=7. min*f=7. max*f=7.)/ rts=15; keylabel all = 'Total' median = 'Median' min = 'Minimum' max = 'Maximum'; label ClassRank = 'Class Rank' SchoolSize = 'School Size'; run;
*18-9; title "Demonstrating Column Percents"; proc format; value $gender 'F' = 'Female' 'M' = 'Male'; run; proc sort data=learn.college out=college; by descending Scholarship; run; proc tabulate data=college format=7. order=data noseps; class Gender Scholarship; tables (Gender all), (Scholarship all)*colpctn; keylabel colpctn = 'Percent' all = 'Total'; format Gender $gender.; run;
Chapter 19 Solutions *19-1; options fmtsearch=(learn); ods listing close; ods html file = 'c:\books\learning\prob19_1.html'; title "Sending Output to an HTML File"; proc print data=learn.college(obs=8) noobs; run; proc means data=learn.college n mean maxdec=2; var GPA ClassRank; run; ods html close; ods listing; *19-3; ods listing close; ods html file='prob19_3.html' style=journal;
title "Sending Output to an HTML File"; proc print data=learn.college(obs=8) noobs; run; proc means data=learn.college n mean maxdec=2; var GPA ClassRank; run; ods html close; ods listing; *19-5; ods trace on; proc univariate data=learn.survey; var Age Salary; run; ods trace off; ods select quantiles; proc univariate data=learn.survey; var Age Salary; run;
Chapter 20 Solutions *20-1; goptions reset=all vsize = 4in ftext='arial' htext=1.0 ftitle='arial/bo' htitle=1.5 colors=(black); title "Distribution of Country and Model"; pattern value=empty; proc gchart data=learn.bicycles; vbar Country Model; run; quit; *20-3; title "Distribution of Sales"; pattern value=solid;
proc gchart data=learn.bicycles; vbar TotalSales / midpoints = 0 to 12000 by 2000; run; quit; *20-5; title "Distribution of Sales by Model"; pattern value=solid; proc gchart data=learn.bicycles; vbar Country / subgroup = Model; run; quit; *20-7; options fmtsearch=(learn); /* the learn library c:\books\learning is where the formats for COLLEGE are kept. */ pattern value = empty; title "Average GPA by Size of School"; proc gchart data=learn.college; vbar SchoolSize / sumvar = GPA type = mean; run; quit; *20-9; title "Stock Prices by Time"; symbol value=dot i=sm; proc gplot data=learn.stocks; plot Price * Date; run; quit;
Chapter 21 Solutions *21-1; data prob21_1; infile 'c:\books\learning\test_scores.txt' missover; /* or truncover */ input Score1-Score3; run;
title "Listing of PROB21_1"; proc print data=prob21_1 noobs; run; *21-3; data prob21_3; infile 'c:\books\learning\scores_column.txt' pad; input Score1 1-2 Score2 3-4 Score3 5-6; run; title "Listing of PROB21_3"; proc print data=prob21_3 noobs; run; *21-5; title "Summary Report from BICYCLES Data Set"; data prob21_5; set learn.bicycles end=lastrec; TotalUnits + units; Sum_of_Sales + TotalSales; file print; if lastrec then put "---------------------------------------"/ "Total Units Sold is " TotalUnits comma10. / "Sales Total is " Sum_of_Sales dollar10.0; run; *21-7; data prob21_7; infile 'c:\books\learning\file_A.txt' firstobs=2 end=last_of_a; if last_of_a then infile 'c:\books\learning\file_B.txt' firstobs=2; input x y z; run; title "Listing of PROB21_7"; proc print data=prob21_7; run; *21-9; data prob21_9; filename xyzfiles ('c:\books\learning\xyz1.txt' 'c:\books\learning\xyz2.txt'); infile xyzfiles; input x y z; run;
title "Listing of PROB21_9"; proc print data=prob21_9; run; *21-11; data prob21_11; infile 'c:\books\learning\three_per_line.txt'; input @1 (HR1-HR3)(3. +6) @4 (SBP1-SBP3)(3. +6) @7 (DBP1-DBP3)(3. +6); run; title "Listing of PROB21_11"; proc print data=prob21_11; run;
Chapter 22 Solutions *22-1; proc format; value high_sbp low - <140 = 'Normal' 140 - high = 'High SBP'; value high_dbp low - <90 = 'Normal' 90 - high ='High DBP'; run; title "Frequencies on SBP and DBP"; proc freq data=learn.bloodpressure; tables SBP DBP / nocum nopercent; format SBP high_sbp. DBP high_dbp.; run; *22-3; proc format; value high_sbp low - <140 = 'Normal' 140 - high = 'High SBP'; value high_dbp low - <90 = 'Normal' 90 - high ='High DBP'; run; data bloodpressure; set learn.bloodpressure; SBPGroup = put(SBP,high_sbp.); DBPGroup = put(DBP,high_dbp.); run;
title "Listing of BLOODPRESSURE";
proc print data=bloodpressure noobs;
run;

*22-5;
proc format;
   invalue $convert 0 - 65 = 'F'
                    66 - 75 = 'C'
                    76 - 85 = 'B'
                    86 - high = 'A'
                    other = ' ';
run;

data lettergrades;
   infile 'c:\books\learning\numgrades.txt';
   input ID $ LetterGrade $convert. @@;
run;
title "Listing of LETTERGRADES";
proc print data=lettergrades noobs;
run;

*22-7;
data control;
   set learn.dxcodes(rename=(Dx = Start Description = Label));
   retain fmtname '$dxcodes' Type 'C';
run;

proc format cntlin=control;
   select $dxcodes;
run;

*22-9;
proc format;
   value muggle '01jan1990'd - '31dec2004'd = 'Too Early'
                '01jan2005'd - '31dec2006'd = [mmddyy10.]
                '01jan2007'd - high = 'Too Late';
run;
title "Listing of GYM";
proc print data=learn.gym noobs;
   format Date muggle.;
run;
Chapter 23 Solutions

*23-1;
data long;
   set learn.wide;
   array X_array[5] X1-X5;
   array Y_array[5] Y1-Y5;
   do Time = 1 to 5;
      X = X_array[Time];
      Y = Y_array[Time];
      output;
   end;
   keep Subj Time X Y;
run;
title "Listing of LONG";
proc print data=long;
run;

*23-3;
proc transpose data=learn.wide
               out=long(rename=(col1=X) drop=_name_);
   by Subj;
   var X1-X5;
run;
title "Listing of LONG";
proc print data=long;
run;
Chapter 24 Solutions *24-1; proc sort data=learn.dailyprices out=dailyprices; by Symbol Date; run; data lastprice; set dailyprices; by Symbol; if last.Symbol; run;
title "Listing of LASTPRICE";
proc print data=lastprice noobs;
run;

*24-3;
proc sort data=learn.dailyprices out=dailyprices;
   by Symbol Date;
run;

data countit;
   set dailyprices;
   by Symbol;
   if first.Symbol then N_Days = 0;
   N_Days + 1;
   if last.Symbol;
   keep Symbol N_Days;
run;
title "Listing of COUNTIT";
proc print data=countit noobs;
run;

*24-5;
proc sort data=learn.dailyprices out=dailyprices;
   by Symbol Date;
run;

data first_last;
   set dailyprices;
   by Symbol;
   retain FirstPrice;
   if first.Symbol and last.Symbol then delete;
   if first.Symbol then FirstPrice = Price;
   if last.Symbol then do;
      Diff = Price - FirstPrice;
      output;
   end;
   keep Symbol Price Diff;
run;
title "Listing of FIRST_LAST";
proc print data=first_last noobs;
run;

*24-7;
proc sort data=learn.dailyprices out=dailyprices;
   by Symbol Date;
run;

data first_last;
   set dailyprices;
   by Symbol;
   if first.Symbol and last.Symbol then delete;
Diff = dif(Price); if not first.Symbol then output; keep Symbol Price Diff; run; title "Listing of FIRST_LAST"; proc print data=first_last noobs; run;
Chapter 25 Solutions *25-1; title "Listing produced on &sysday, &sysdate9 at &systime"; proc print data=learn.stocks(obs=5) noobs; run; *25-3; %macro print_n(dsn, /* data set name */ nobs /* number of observations to list */); title "Listing of the first &nobs Observations from " "Data set &dsn"; proc print data=&dsn(obs=&nobs) noobs; run; %mend print_n; %print_n(learn.bicycles, 4) *25-5; proc means data=learn.fitness nway noprint; var TimeMile RestPulse MaxPulse; output out=summary mean= / autoname; run; data _null_; set summary; call symput('GrandTime',TimeMile_mean); call symput('GrandRest',RestPulse_mean); call symput('GrandMax',MaxPulse_mean); run; data compute_percents; set learn.fitness; P_TimeMile = round(100*TimeMile/&GrandTime);
   P_RestPulse = round(100*RestPulse/&GrandRest);
   P_MaxPulse = round(100*MaxPulse/&GrandMax);
run;
title "Fitness Stats as a Percent of Mean";
proc print data=compute_percents noobs;
run;
Chapter 26 Solutions *26-1; title "Observations from INVENTORY where Price > 20"; proc sql; select * from learn.inventory where Price gt 20; quit; *26-3; proc sql; create table n_sales as select Name, TotalSales from learn.sales where Region eq 'North'; quit; title "Listing of N_SALES"; proc print data=n_sales noobs; run; *26-5; ***Part 1; proc sql; create table both as select l.Subj as LeftSubj, Height, Weight, r.Subj as RightSubj, Salary from learn.left as l inner join learn.right as r on left.Subj = right.Subj; quit;
title "Listing of BOTH"; proc print data=both noobs; run; /* alternate code proc sql; create table both as select l.Subj as LeftSubj, Height, Weight, r.Subj as RightSubj, Salary from learn.left as l, learn.right as r where left.Subj = right.Subj; quit; title "Listing of BOTH"; proc print data=both noobs; run; */ ***Part 2; proc sql; create table both as select l.Subj as LeftSubj, Height, Weight, r.Subj as RightSubj, Salary from learn.left as l full join learn.right as r on left.Subj = right.Subj; quit; title "Listing of BOTH"; proc print data=both noobs; run; ***Part 3; proc sql; create table both as select l.Subj as LeftSubj, Height, Weight, r.Subj as RightSubj, Salary from learn.left as l left join
learn.right as r on left.Subj = right.Subj; quit; title "Listing of LEFT"; proc print data=both noobs; run; *26-7; proc sql; create table third as select * from learn.first union all corresponding Select * from learn.second; quit; title "Listing of THIRD"; proc print data=third; run; *26-9; proc sql; create table percentages as select Subject, RBC, WBC, mean(RBC) as MeanRBC, mean(WBC) as MeanWBC, 100*RBC / calculated MeanRBC as Percent_RBC, 100*WBC / calculated MeanWBC as Percent_WBC from learn.blood(obs=10); quit; title "Listing of PERCENTAGES"; proc print data=percentages; run;
Index A ABS function 197–198 absolute column pointer 456–457 ACROSS option, DEFINE statement (REPORT) creating ACROSS variable 310 displaying statistics 311–313 modifying column label 311 addition in assignment statements 19–20 addresses, standardizing 236–238 AFTER option, RBREAK statement (REPORT) 303 alignment parameter 156 _ALL_ keyword 59, 369, 374–375 ampersand (&) 46 ANALYSIS option, DEFINE statement (REPORT) 292–295, 312 analysis variables DEFINE statement (REPORT) 292–295, 312 TABULATE procedure and 372–373, 377–378 AND operator 109–111 ANOVA procedure 463 ANY functions 225–226 ANYALNUM function 225 ANYALPHA function 225 ANYDIGIT function 225 ANYPUNCT function 225 ANYSPACE function 225 APPEND procedure 478 arithmetic operators 19–20 array reference 245 ARRAY statement asterisk (*) in 247–248 changing array bounds 250–251 converting character values to lowercase 248–249 creating variables 249–250 missing character values in 247–248 missing numeric values in 245–246
  table lookups 254–255
  temporary arrays 251–252
arrays
  CALL routines and 246
  changing bounds 250–251
  converting character values to lowercase 248–249
  creating variables 249–250
  defined 244
  loading initial values from raw data 253
  missing character values in 247–248
  missing numeric values in 244–246
  multidimensional 254–257
  table lookup and 254–257
  temporary 251–257
ASCII coding method 35, 230
assignment statements 19–20, 23
  defined 19
  RETAIN statement and 473
asterisk (*)
  as wildcard 338, 446, 538
  associating formats 374
  in ARRAY statement 247–248
  in assignment statements 19
  in comment statements 19
  TABLE statement (TABULATE) and 368
  two-way tables and 356
at sign (@)
  absolute column pointer 456–457
  column pointers and 40
  double trailing (@@) 197, 454–455
  format catalog and 465
  informats and 488
  INPUT statement and 197
  single trailing 130, 451–454
automatic macro variables 523
AUTONAME option, OUTPUT statement (MEANS) 329–330, 337–338
AXIS statement, GCHART procedure 413, 421
  ORDER= option 421
B

bar charts
  adding variables to 423–425
  creating for continuous variables 416–418
  producing 413–415
  representing means 422–423
  representing sums 420–422
  with values representing categories 418–420
BEFORE option, RBREAK statement (REPORT) 303
BEGINNING alignment, INTNX function 156
BETWEEN AND operator 113
blanks
  concatenation operator and 366
  converting multiple 214–215
  dividing strings into words 230
  IN operator and 267
  missing character values and 192
  raw data separated by 30–31
  removing trailing/leading 217–218, 233–234
  searching for 225
  TABULATE operators and 366, 368
BODY= keyword 400
Boolean operators 107, 109–112
BREAK statement, REPORT procedure 303–306
  SKIP option 305
  SUMMARIZE option 305
  SUPPRESS option 306
Burlew, Michelle 522
BY groups
  computing differences between first/last observations 514–516
  counters and 508–511
BY statement
  adding subtotals/totals to listings 274–276
  CLASS statement and 327
  easier to read listings 277–278
  MEANS procedure and 323–324, 327, 330–331
  merging data sets 171–173, 181–182
  merging data sets with different data types 179–181
  merging data sets with different names 177–178
  omitting in merges 172–173
  outputting summary data sets 330–332
  SET statement and 167–168, 507–508
BY SUBJECT statement 498
C

CALL MISSING routine 193, 246, 497
CALL routines 193
  arrays and 246
  restructuring data sets with DATA step 497
CALL SYMPUT routine 531
CARDS statement 36
Carpenter, Art 522
Cartesian product 539–542, 551–552
cases
  changing 213–214
  SAS and 7
  searching for 225–226
CAT function 215–217
CATALOG procedure 465
Cates, Randall 438
CATS function 215–217, 487, 489
CATX function 215–216
CENTER option, DEFINE statement (REPORT) 295
character classes
  ANY functions and 225–226
  defined 219
  NOT functions and 226–227
character functions
  ANY functions 225–226
  changing character case 213–214
  comparing strings 232–234
  concatenating strings 215–217
  data cleaning with 227–228
  determining value lengths 212–213
  dividing strings into words 230–232
  extracting parts of strings 228–230
  fuzzy matching with 234–235
  NOT functions 226–227
  removing characters from strings 214–215, 218–220
  removing trailing/leading blanks 217–218
  searching for character classes 225–226
  searching for characters 220–223
  searching for words in strings 223–225
  substituting characters/words 235–238
_CHARACTER_ keyword 247–249
character values
  changing case of 213–214
  character-to-numeric conversions 180, 201–202, 229, 256, 468–469
  comparing 232–234
  converting to lowercase 248–249
  determining length of 212–213
  fuzzy matching for 234–235
  IN operator and 267
  missing values in 192–193
  numeric-to-character conversions 202
  PUT function and 202
  reading in one step 467–470
  removing from strings 214–215, 218–220
  removing trailing/leading blanks 217–218
  replacing missing values for arrays 247–248
  searching for 220–223
  setting as missing 193
  substituting 235–238
character variables
  categories and 365
  COMPUTE blocks and 308–309
  computing frequencies of 342
  defined 8
  detail reports for 292
  DO loops and 130–131
  dollar sign ($) and 13
  extracting parts of strings 228–230
  formats with 74
  INPUT function and 201
  logical comparison operators and 107
  replacing missing values for arrays 247–248
character-to-numeric conversions 229, 468–469
CHART procedure 412–413
CHARTYPE option, MEANS procedure 334–337
CLASS statement
  BY statement and 327
  complex tables 377–378
  MEANS procedure and 324–325, 327, 333–337
  missing values in TABULATE procedure 385–389
  MLF option 483
  outputting summary data sets 331–333
  PRELOADFMT option 484
  TABULATE procedure and 365
class variables
  analysis variables and 372–373
  applying formats to 325–326
  computing percentages on 384
  counting number of visits and 511–512
  formats and 462–463
  missing values and 386–388
  multiple 333–337
  NWAY option, MEANS procedure and 511
  PCTN statistic and 379–380
  TABULATE procedure and 365
CLM statistic 321
CNTLIN= option, FORMAT procedure 471–476, 479, 487
CNTLOUT= option, FORMAT procedure 477
colon (:)
  as delimiter 35
  as modifier 233
  as wildcard 202, 337
  informats and 44, 456
  logical comparison operators and 107
color, setting 413
COLPCTN keyword 382–383
COLPCTSUM keyword 384
column headings
  labeling 273–274
  modifying labels for ACROSS variable 311
  renaming with SQL procedure 540–541
column indices 254
column input 37–39
column pointers 40, 456–457
COLUMN statement, REPORT procedure
  adding 291
  changing order of variables in 297–298
  computing character variables 309
  computing new variables 308
  controlling order of variables 300–301
  creating ACROSS variable 310
  displaying statistics with ACROSS variable 312
  grouping variables and 296–297
  ordering reports with nonprinting variables 307
columns
  computing percentages 379–380, 382–385
  crosstab tables and 356–357
  displaying percentages in 381–382
  fixed 37–43
  TABLE statement (TABULATE) and 367
  variables and 18, 31, 536
  wrapping lines of text 294–296
comma (,)
  changing values appearances 265–266
  column input and 37
  comma informat 180
  formatting bar charts 418
  in CSV files 33
  in multidimensional arrays 256
  IN operator and 267
  in TABLE statement (TABULATE) 367
comma informat 180
comma11. informat 180
comment statements 19–21
COMPARE function 232–234
COMPBL function 214–215
compile stage 22–23
COMPRESS function
  removing characters from strings 214, 218–220
  removing dashes with 180
  searching for characters 221–222
COMPUTE blocks
  computing character variables 308–309
  creating 308
  selecting variables for reports 291
COMPUTE statement, REPORT procedure 308–309
COMPUTED option, DEFINE statement (REPORT) 308
concatenating
  data sets 165, 168, 546–549
  strings 215–217, 366
concatenation operator 215–217, 366
conditional processing
  See also IF statement
  See also WHERE statement
  Boolean operators 107, 109–112
  combining detail/summary data 168–169
  DO UNTIL statement 131–134, 448
  DO WHILE statement 131–135
  ELSE IF statement 102–105
  IN operator 107
  reading data conditionally 451–453
  restructuring data sets with DATA step 496
  SELECT statement 108–109
  subsetting IF statement 105–107
  substituting for missing date values 151–152
CONSTANT function 198–199
constants
  computing 198–199
  date 147–148
  hexadecimal 35
CONTAINS operator 113–114
CONTENTS= keyword 400
CONTENTS procedure
  _ALL_ keyword 59
  conversion process and 98
  documenting data sets with 80–81
  examining data sets with 56–58
  listing data sets with 59
  NODS option 59
  VARNUM option 58, 149
CONTINUE statement 135–136
continuous variables
  creating bar charts for 416–418
  with values representing categories 418–420
converting
  character values to lowercase 248–249
  characters to numbers 180, 201–202, 229, 256, 468–469
  data sets into CSV files 96–98
  data sets into spreadsheets 93–95
  Fahrenheit to Celsius 250
  missing numeric values 244–246
  multiple blanks 214–215
  numbers to characters 202
  spreadsheets into CSV files 87–88
  spreadsheets with Import Wizard 88–92
  with XLS engines 95–96
Corel WordPerfect 402
counters
  arrays and 246
  BY groups and 508
  FREQ procedure and 509–511
  in DATA step 253
  setting 508
  sum statement and 120, 124
CREATE clause (SQL) 538
crosstab tables 356–358
CSV files
  converting data sets into 96–98
  converting spreadsheets into 87–88
  embedded delimiters in 46
  informats and 44
  reading data values 33
CTRL+C key combination 134
curly brackets { } 245, 254
current date 148–149
customized reports
  applying ORDER usage to variables 300–301
  changing order of variables in 297–298
  changing row order in 299–300
  comparing detail/summary reports 291–293
  COMPUTE blocks in 308–309
  computing new variables for 307–308
  creating ACROSS variable 310
  displaying statistics with ACROSS variable 311–313
  FLOW option, REPORT procedure 294–296
  grouping variables 296–297
  modifying labels for ACROSS variable 311
  multi-column 301–302
  ordering with nonprinting variables 306–307
  producing breaks in 303–306
  producing summary reports 293–294
  REPORT procedure and 288–290
  selecting variables for 291
D

dash (-) 180
data cleaning
  NOT functions for 226–227
  VERIFY function 227–228
  with character functions 227–228
DATA _NULL_ reporting 68, 444
DATA= option, SURVEYSELECT procedure 200
data portion (data sets)
  defined 56
  viewing 63–64
  viewing with SAS VIEWTABLE window 64–65
data sets 8
  See also merging data sets
  See also permanent data sets
  See also summary data sets
  accessing with user-defined formats 82
  adding observations to 164–167
  combining detail/summary data 168–170
  concatenating 165, 168, 546–549
  controlling observations in 173–175
  converting spreadsheets to 88–92
  converting via ODS 96–98
  creating formats 471–476
  creating spreadsheets from 93–95
  descriptor portion of 22, 56–58, 60–63, 73
  documenting 80
  interleaving 167–168
  JOIN option, SYMBOL statement 429
  naming conventions 7
  naming variables in output 329–330
  output 329–330, 408–409
  permanent attributes for 80–81
  restructuring using DATA step 494–497
  restructuring using TRANSPOSE procedure 497–500
  SAS processing 24
  sending output to 407–409
  subsetting 112, 162–164
  tables and 536
  updating master files 183–184
  virtual 474
  WHERE statement and 112
DATA step
  combining detail/summary data 169
  counters in 253, 508
  creating labels in 72–73
  creating summary data sets 336–337
  data sets as input to 65–66
  defined 6
  end of file and 176–177
  FORMAT statement in 43, 79–80
  INPUTN function in 485–490
  LABEL statement in 73, 79–80
  labeling column headings 273
  %LET statement and 524
  nested formats in 480–481
  _NULL_ keyword and 67–68
  restructuring data sets using 494–497
  SAS processing 22–24
  semi-colon (;) and 36
  SET statement and 177
  SQL procedure and 536, 549
  subsetting data steps 163–164
  transferring values between 530–532
data structures, reading 456–457
data summaries
  See summarizing data
data types 8, 179–181
data view 474
DATALINES statement 36–37, 448
date constant 147–148
DATE function 149
date interval functions 152–157
date9. format 43, 145, 523
dates
  automatic macro variables and 523
  computing current 148–149
  computing years between 146–147
  creating from day values 150–151
  creating from month values 150–151
  creating from year values 150–151
  extracting day of month from 149–150
  extracting day of week from 149–150
  extracting year from 149–150
  INPUT function and 201
  interval functions for 152–157
  reading values from raw data 143–145
  storing 142
  substituting missing values for 151–152
day of month
  extracting 149–150
  substituting for missing values 151–152
day of week 149–150, 419
debugging 68
DEFINE statement, REPORT procedure
  ACROSS option 310, 311–313
  ANALYSIS option 292–295, 312
  CENTER option 295
  COMPUTED option 308
  creating ACROSS variable 310
  DISPLAY option 292–293, 295
  displaying statistics with ACROSS variable 311–313
  FLOW option 294–296
  GROUP option 293–294, 296–297, 303–305
  LEFT option 295
  MEAN option 293–294, 312
  modifying column label for ACROSS variable 311
  NOPRINT option 307–308
  ORDER= option 299–301, 303–305
  ordering reports with nonprinting variables 307
  RIGHT option 295
DELETE statement 120, 454
DELIMITER= option, INFILE statement 35
delimiters
  blanks as 30–32
  commas as 33
  defined 23–24
  dividing strings into words 230
  DLM= option for 34–35
  embedded in list input 46
DESCENDING option
  ORDER option, DEFINE statement (REPORT) 300–301
  SORT procedure 270–271
descriptive statistics
  outputting with MEANS procedure 328–329
  TABULATE procedure and 370–372
descriptive statistics functions 194–196
descriptor portion (data sets) 22
  examining 56–58
  labels in 73
  viewing with SAS Explorer 60–63
detail reports 291–293
DIF function 204, 207, 513
digits, searching for 225
DIM function 248
DISCRETE option, VBAR statement (GCHART) 419–420
Display Manager 9, 406
DISPLAY option, DEFINE statement (REPORT) 292–293, 295
displaying data 262–263
  adding number of observations to listings 279
  adding subtotals/totals to listings 274–277
  adding titles/footnotes to listings 268–270
  changing listing appearance 263–265
  changing listing order 270–272
  changing values appearances 265–266
  controlling observation appearance in listings 266–267
  double-spacing listings 280
  easier to read listings 277–278
  labeling column headings 273–274
  listing specified number of observations 281–283
  sorting by multiple variables 272–273
division in assignment statements 19–20
DLM= option, INFILE statement 34–35, 37
DO statement
  arrays in 246
  converting character values to lowercase 249
  DO groups and 119
  iterative looping 125–129
  iterative processing and 118–120
  multidimensional arrays and 256
  other forms 129–131
DO UNTIL statement 131–134, 448
DO WHILE statement 131–135
documenting data sets 80
DOL option, RBREAK statement (REPORT) 303
dollar sign ($)
  changing values appearances 265–266
  column input and 37
  formats and 74
  informats and 180, 465
  variable names and 13, 31
dollar11.2 format 43, 75
DONUT statement 414
DOUBLE option, PRINT procedure 280
double trailing at sign (@@) 197, 454–455
double-spacing listings 280
DROP= data set option
  counting number of visits and 510, 512
  DROP statement and 163
  variable selection and 337
DROP= option, TRANSPOSE procedure 499
DROP statement
  DROP= data set option and 163
  dropping variables from data sets 337
  retained variables and 516
  shortening 202
DSD= option, INFILE statement
  CSV files and 33, 88
  DATALINES statement and 37
  DLM= option and 35
DUL option, RBREAK statement (REPORT) 303
E

e (mathematical constant) 198–199
EBCDIC coding method 35, 230
ELSE IF statement 102–105
embedded delimiters 46
END alignment, INTNX function 156
END= data set option 475, 478
end of file
  DATA step and 176–177
  detecting 443–445
end of line 438–440
END= option
  INFILE statement 443–446
  SET statement 445
END statement
  DO groups and 119
  iterative DO loop and 126, 134
  LEAVE statement and 135
ENDCOMP statement, REPORT procedure 308–309
engines
  conversion process and 54
  reading spreadsheets with 95–96
Enterprise Guide 9
EQ operator 103
equal sign (=)
  formats and 74, 78
  in labels 72
  WHERE statement operator and 113
equations, graphing 128–129
Excel spreadsheets
  converting into CSV files 87–88
  converting with Import Wizard 88–92
  converting with ODS 96–98
  creating from data sets 93–95
  reading with engines 95–96
EXCEPT operator 546
EXCLUDE statement, FORMAT procedure 84
execution stage 22–24
EXP function 197–198
Explorer
  conversion process and 98
  documenting data sets with 80
  viewing data sets with 60–63
exponentiation in assignment statements 19–20
EXPORT statement 95
Export Wizard 93–95
external files
  alternative methods for 34
  PUT statement and 202
  reading 447–448
  reading long 443
F

Fahrenheit-to-Celsius conversion 250
FancyPrinter style 401
FILE PRINT statement 444
FILENAME statement
  reading external files 447–448
  reading from multiple files 446–447
  sending output to HTML files 399
  specifying external files 34
filerefs 34, 448
FILEVAR option, INFILE statement 448
FIND function 221–224
FINDC function 223
FINDW function 223–225
FIRSTOBS= data set option 92, 282
FIRSTOBS= option, INFILE statement 445–446
fixed columns
  column input 37–39
  formatted input 39–43
FLOW option, DEFINE statement (REPORT) 294–296
FMTLIB option, FORMAT procedure
  format definitions and 82, 84
  listing formats 473, 479
  SELECT statement and 488
  viewing catalog entries 465
FMTSEARCH= system option 80, 82
folders 59
fonts, setting 413
FOOTNOTE statement
  displaying data with 268, 270
  RESET=all graphics option and 412
footnotes 268–270
format catalog 465
format definitions 82–84
format library 79
FORMAT= option, TABULATE procedure 366
FORMAT procedure
  CNTLIN= option 471–476, 479, 487
  CNTLOUT= option 477
  creating numeric informats 487
  enhancing output with 73–74
  EXCLUDE statement 84
  FMTLIB option 82, 84, 465, 473, 479, 488
  INVALUE statement 465–466, 469
  LIBRARY= option 79
  SELECT statement 84, 488
  storing formats 79
  user-defined formats 380, 464
  VALUE statement 74, 78, 482–485
FORMAT statement
  applying formats to class variables 325–326
  associating formats with 41–42, 73
  changing values appearances 265–266
  for bar charts 418
  formatting date values 144
  in DATA step 43, 79–80
  labeling output 346–347
  TABLES statement and 77
formats 41–43
  applying to class variables 325–326
  associating with asterisk (*) 374
  creating 73, 471–476
  DATA _NULL_ reporting and 68
  enhancing output with 73–76
  for date values 144–145
  formats within 479–481
  in DATA step 43
  labeling output with 346–347
  listing for variables 80
  maintaining 477–479
  multi-label 482–485
  problems when grouping values 349–350
  PUT function and 463–464
  recoding variables with 462–463
  regrouping values using 76–77
  specifying ranges 78–79
  storing 79
  table lookup and 470–471
  to group values 347–348
  updating 477–479
  user-defined 79–82, 380, 464
formatted input 39–43
forward slash (/)
  as relative line pointer 450
  in assignment statements 19
  statement options and 330
four-digit years 145
FRAME= keyword 400
FREQ procedure 13, 342–344
  See also TABLES statement, FREQ procedure
  changing order of values in 353–356
  counting number of visits with 509–511
  displaying missing values 351–352
  formats and 74, 77, 462
  grouping values through formats 347–348
  labeling output with formats 346–347
  listing observations per quarter 154
  multiple two-way tables 358
  NOPRINT option 509
  ORDER= option 353–356
  output example 402–403
  problems when grouping values 349–350
  producing three-way tables 358–360
  producing two-way tables 356–357
  sample SAS program 16
  selecting variables for 345–346
_FREQ_ variable 331, 337
frequencies
  See FREQ procedure
FROM clause (SQL) 537, 539–542
full joins 543–545
fuzzy matching/merge 234–235, 541, 551–552
G

GCHART procedure
  adding variables to charts 423–425
  AXIS statement 413, 421
  bar charts representing means 422–423
  bar charts representing sums 420–422
  charts with values representing categories 418–420
  creating bar charts for continuous variables 416–418
  creating pie charts 415–416
  PIE statement 414–415
  producing bar charts 413–415
  VBAR statement 414, 417–424
GE operator 103
GLM procedure 463
global statements 6, 13
GOPTIONS statement 413–414
  VSIZE= option 414
GPLOT procedure
  See also SYMBOL statement, GPLOT procedure
  example of 129, 154
  PLOT statement 426
  producing scatter plots 425–427
grand mean 332
graphics 412–413
  adding variables to charts 423–425
  bar charts representing means 422–423
  bar charts representing sums 420–422
  charts with values representing categories 418–420
  connecting points 427–429
  connecting points with smooth line 429–430
  creating bar charts for continuous variables 416–418
  creating pie charts 415–416
  producing bar charts 413–415
  producing scatter plots 425–427
graphing equations 128–129
Gregorian calendar 41
GROUP option, DEFINE statement (REPORT) 293–294, 296–297, 303–305
GROUP= option, VBAR statement (GCHART) 423
grouping
  values through formats 347–348
  values with FREQ procedure 349–350
  variables 296–297
GT operator 103
H

Haworth, Lauren 382
HAXIS option, PLOT statement (GPLOT) 426
HBAR statement 414
HBAR3D statement 414
HEADING= option, PRINT procedure 282
HEADLINE option, REPORT procedure 297
hexadecimal constants 35
HTML files
  creating table of contents 400–401
  selecting different styles 401–402
  sending output to 398–399
hyphen (-) 180
I

ICD-9 codes 472–477
ID statement
  controlling listing appearance 264–265
  easier to read listings 277–278
  variables in 75
IF statement
  See also subsetting IF statement
  arrays and 246
  computing differences between observations 513
  conditional processing 102–105
  DO groups and 119–120
  example 67
  in procedures 112
  missing character values in arrays 248
  MISSING function 104
  restructuring data sets with DATA step 496
  substituting for missing date values 151–152
IMPORT procedure 91
Import Wizard 88–92
IN= data set option
  checking missing values 175–176
  controlling observations with 173–175
IN operator
  conditional processing 107
  controlling observation appearance in listings 267
  listed 103
INDEX function 222
index variables 128–130
INDEXW function 223
INFILE statement
  DELIMITER= option 35
  DLM= option 34–35, 37
  DSD= option 33, 35, 37, 88
  END= option 443–446
  filerefs in 34
  FILEVAR option 448
  FIRSTOBS= option 445–446
  LRECL option 443
  MISSOVER option 443
  OBS= option 445–446
  options in DATALINES statement 37
  PAD option 143, 442–443
  PUT statement and 68
  reading external filenames 447–448
  reading long external files 443
  reading raw data with 12, 31
  SAS processing 22
  TRUNCOVER option 143, 443
infinite loops 134–135
informat lists 455–456
informat modifiers 44, 46
INFORMAT statement 45
informats 40–41
  at sign (@) and 488
  colon (:) and 44, 456
  creating numeric 488–489
  defined 40
  INFORMAT statement 45
  INPUT function and 201–202
  INPUTN function and 485–490
  reading data in one step 467–470
  reading date values from raw data 143
  table lookup and 470–471
  user-defined 464–467
  variable lists and 455–456
  with list input 43–44
inner joins 543–545
input buffer 22
INPUT function 201–202
  character-to-numeric conversion 180, 201–202, 229, 256, 468–469
  nested formats and 481
  table lookups and 471, 486, 488–489
  user-defined informats and 464, 466
INPUT statement
  ampersand (&) modifier in 46
  at sign (@) in 197
  CSV files and 88
  informat lists and 455–456
  INFORMAT statement and 45
  informats and 43–44
  INPUT function and 202
  missing values at end of line 438–440
  multiple lines of data for observations 448–450
  multiple observations from line of input 454–455
  reading data conditionally 451–453
  reading raw data with 12, 31
  reading short data lines 440–443
  relative column pointers 456–457
  SAS processing 22, 24
  single trailing at sign (@) 130
  trailing at sign (@) and 454
  variable lists and 455–456
INPUTN function 485–490
INT function 190–191
INTCK function 152–155
interleaving data sets 167–168
INTERPOL= option, SYMBOL statement 427–429
INTERSECTION operator 546
INTNX function 152–153, 155–157
INVALUE statement, FORMAT procedure 465–466, 469
  JUST option 466
  UPCASE option 466, 469
IS MISSING operator 113
IS NULL operator 113
iterative DO loop 125–129
  other forms 129–131
iterative processing
  See looping
J

JOIN option, SYMBOL statement 428–429
JOURNAL style 402
JUST option, INVALUE statement (FORMAT) 466
K

KEEP= data set option 163, 337, 510
KEEP statement 496
KEYLABEL statement, TABULATE procedure 375–376, 380, 383
keywords, renaming 375
L

LABEL option, PRINT procedure 273, 279
LABEL statement 72
  adding number of observations to listings 279
  in DATA step 73, 79–80
  labeling column headings 273–274
labels
  adding to variables 71–73
  defining format 479
  for column headings 273–274
  formats for 346–347
  listing for variables 80
  modifying for ACROSS variable 311
  multi-label formats 482–485
Lafler, Kirk 536
LAG function 204–207, 512–515
LARGEST function 195
LE operator 103
leading blanks, removing 217–218
LEAVE statement 135–136, 496
LEFT function 217–218
left joins 543–545
LEFT option, DEFINE statement (REPORT) 295
LENGTH function 212–213
LENGTH statement
  CONSTANT function and 199
  dividing strings into words 231
  extracting parts of strings 229
  index variables and 130
  maintaining formats 478
  SET statement and 167
LENGTHC function 213
LENGTHN function 212–213
less than sign (<) 78
%LET statement 524–525
LIBNAME statement 54–55, 58
libraries 59
LIBRARY= option, FORMAT procedure 79
librefs
  defined 54–55
  macro variables specifying 529–530
  storing formats 79
LIKE operator 113–114
LINE= option, SYMBOL statement 429
line pointers 450
LINESIZE= system option 282
list input
  blanks in 30–31
  defined 23
  INFORMAT statement with 45
  informats with 43–44
  missing values at end of line 438–440
  specifying missing values 32
  with embedded delimiters 46
LISTING destination 408
listings
  See also reports
  adding number of observations to 279
  adding subtotals/totals to 274–277
  changing appearance of 263–265
  changing order of 270–272
  CONTENTS procedure and 59
  controlling observation appearance in 266–267
  double-spacing 280
  easier to read 277–278
  formats and 74
  OBS= option 281–283
  ODS statement and 399
  PRINT procedure and 63–64
LOG function 197–198
Log window 15–16
LOG10 function 197–198
logical comparison operators
  Boolean logic 107, 109–112
  conditional processing and 107
  listed 103
longitudinal data 506
looping
  arrays and 246
  CONTINUE statement 135–136
  converting character values to lowercase 249
  DO groups and 118–120
  DO UNTIL statement 131–134
  DO WHILE statement 131–135
  infinite 134–135
  iterative DO loop 125–131
  LEAVE statement 135–136
  multidimensional arrays and 256
  restructuring data sets with DATA step 496
  sum statement and 120–125
LOWCASE function 214, 235
lowercase
  converting character values to 248–249
  LOWCASE function 214, 235
LRECL option, INFILE statement 443
LT operator 103
M

%MACRO statement 525
macro variables
  as prefixes 529–530
  assigning values with %LET statement 524–525
  automatic 523
  built-in 523
  defined 522
  tokens and 527–529
  transferring values between DATA steps 530–532
macros 525–527
many-to-many merge 182
master files, updating 183–184
mathematical functions 197–199
MAX function 195
MAX statistic 321
MAXDEC statistic 321
MAXIS= option, VBAR statement (GCHART) 421
MDY function 150–152
MEAN function 194, 549–550
MEAN option, DEFINE statement (REPORT) 293–294, 312
MEAN statistic 321, 327–328, 378
means, bar charts representing 422–423
MEANS procedure 14, 320–322
  applying formats to class variables 325–326
  BY statement and 323–324, 327, 330–331
  CHARTYPE option 334–337
  CLASS statement with 324–325, 327, 333–337
  combining detail/summary data 169
  counting number of visits with 509, 511–512
  creating summary data sets 327–328
  formats and 74, 462–463
  labels example 72–73
  macro variables transferring values 530–532
  multilabel formats 482–483
  multiple class variables with 333–337
  NOPRINT option 327–328, 511–512
  NWAY option 332–333, 336, 511
  OUTPUT statement 327, 329–330, 337–338
  outputting descriptive statistics with 328–329
  outputting summary data sets 330–333
  sample SAS program 16
  selecting statistics for variables 337–338
  sending output to HTML files 398
  SQL procedure and 549
  statistic options listed 321
  VAR statement and 322, 329–330
MEDIAN statistic 321
%MEND statement 525
MERGE statement 171, 182, 510
MERGENOBY system option 173
merging data sets 170–172
  controlling observations 173–175
  many-to-many 182
  omitting BY statement 172–173
  one-to-many 181
  one-to-one 181
  with different data types 179–181
  with different names 177–178
merging tables 539–545
METHOD= option, SURVEYSELECT procedure 200
Microsoft Office Word 402
MIDDLE alignment, INTNX function 156
MIDPOINTS= option, VBAR statement (GCHART) 417–418
MIN function 195–196
MIN statistic 321
MISSING function
  IF statement and 104
  numeric functions and 192–193
  substituting for missing date values 152
  testing for missing values 496
  true value for 120
MISSING option
  TABLES statement (FREQ) 351–352
  TABULATE procedure 387–388
MISSING routine 193, 246, 497
missing values
  adding observations to data sets 166
  at end of line 438–440
  checking with IN= data set option 175–176
  conditional processing and 103
  DATA step and 66
  FREQ procedure and 351–352
  grouping problem with 349–350
  in numeric functions 192–193
  on class variables 386
  printing 485
  replacing for character variables 247–248
  replacing for numeric variables 244–246
  setting 193
  specifying with list input 32
  substituting for dates 151–152
  sum statement and 123
  table lookups and 471
  TABULATE procedure and 385–389
  testing for 496
MISSOVER option, INFILE statement 443
MISSPRINT= option, TABLES statement (FREQ) 388
MISSTEXT= option, TABLE statement (TABULATE) 389, 485
MLF option, CLASS statement 483
mmddyy10. format 145, 480
mmddyy10. informat 40–42, 479
modifiers
  COMPARE function and 233
  COMPRESS function and 219–220
  defined 219
  informat 44, 46
MONTH function 149
months
  creating dates from 150–151
  date interval functions 152–157
  extracting from dates 149–150
MPRINT system option 525
multi-column reports 301–302
multidimensional arrays 254–257
MULTILABEL option, VALUE statement (FORMAT) 482–485
multi-level sorts 272–273
multiplication in assignment statements 19–20
N

N function 194–195
N= option, PRINT procedure 279
N statistic 321, 375, 378
NA value 247–248
names 7–8
naming conventions
  data sets 7
  librefs 55
  variables 7
NE operator 103
negation in assignment statements 19–20
nested formats 479–481
nesting operator 368
NMISS function 195
NMISS statistic 321, 375
NOCENTER system option 16, 263
NOCOL option, TABLES statement (FREQ) 359
NOCUM option, TABLES statement (FREQ) 345
NODS option, CONTENTS procedure 59
NOHEADING option, PIE statement (GCHART) 415
NOOBS option, PRINT procedure 97, 265
NOPERCENT option, TABLES statement (FREQ) 346, 359
NOPRINT option
  DEFINE statement (REPORT) 307–308
  FREQ procedure 509
  MEANS procedure 327–328, 511–512
  procedures and 408
NOROW option, TABLES statement (FREQ) 359
NOSEPS option, TABULATE procedure 376, 381
NOT functions 226–227
NOT operator 109–111
NOTALNUM function 227
NOTALPHA function 227
NOTDIGIT function 226–227
NOWD option, REPORT procedure 289–290
_NULL_ keyword 67–68
numeric functions
  computing constants with 198–199
  computing sums with 196–197
  descriptive statistics functions 194–196
  generating random numbers 199–201
  mathematical functions 197–198
  missing values in 192–193
  return values from observations 204–207
  rounding numeric values 190–191
  setting missing values 193
  special functions 201–203
  truncating numeric values 190–191
numeric values
  character-to-numeric conversions 180, 201–202, 229, 256, 468–469
  IN operator and 267
  missing values in 192–193
  numeric-to-character conversions 202
  reading in one step 467–470
  replacing missing values for arrays 244–246
  rounding 190–191
  truncating 190–191
numeric variables
  computing frequencies of 342
  computing percentages on 384–385
  computing statistics on 14, 321
  defined 8
  informats and 467–470
  logical comparison operators and 107
  replacing missing values for arrays 244–246
  summary reports for 292
NWAY option, MEANS procedure 332–333, 336, 511
O

OBS= data set option 92, 281–283
OBS= option, INFILE statement 445–446
observations
  adding to data sets 164–167
  adding to listings 279
  checking missing values for 175–176
  combining detail/summary data 168–170
  computing differences between 512–514
  computing differences between first/last 514–516
  computing sums within 196–197
  controlling appearance in listings 266–267
  controlling in merged data sets 173–175
  counting number of visits 509–512
  detail reports about 291
  functions returning values from 204–207
  identifying first/last in groups 506–509
  listing per quarter 154
  listing specified number of 281–283
  multiple 454–455
  reading multiple lines from 448–450
  restructuring data sets using DATA step 494–497
  restructuring data sets using TRANSPOSE procedure 497–500
  retained variables and 515–517
  table rows and 18, 31, 536
ODS (Output Delivery System)
  choosing destinations 402–403
  converting data sets into spreadsheets 96–98
  creating table of contents 400–401
  procedures and 397–398
  selecting different HTML styles 401–402
  selecting/excluding output 403–407
  sending output to data sets 407–409
  sending output to HTML files 398–399
ODS CLOSE statement 97
ODS CSV statement 97
ODS EXCLUDE statement 403, 406–407
ODS HTML CLOSE statement 399
ODS HTML FILE statement 399
ODS HTML statement 400
ODS OUTPUT statement 408
ODS SELECT statement 403–407, 409
  PERSIST option 407
ODS statement 399
ODS TRACE statement 404–406, 408
OL option, RBREAK statement (REPORT) 303
ON clause (SQL) 543–545
one-to-many merge 181
one-to-one merge 181
operators
  arithmetic 19–20
  asterisk (*) as 368
  Boolean 107, 109–112
  comma as 367
  concatenation 215–217, 366
  for TABULATE procedure 366–368
  in WHERE statement 113–114
  logical comparison 103, 107
  UNION 546–549
OR operator 107, 109–112
ORDER clause (SQL) 551
ORDER= option
  AXIS statement (GCHART) 421
DEFINE statement (REPORT) 299–301, 303–305 FREQ procedure 353–356 OTHER keyword 349–350, 471, 475 OTHERWISE statement 108–109 OUT= option OUTPUT statement, MEANS procedure 327 procedures and 407 SORT procedure 271 SURVEYSELECT procedure 200 output See also ODS (Output Delivery System) choosing destinations 402–403 for summary data sets 330–333 formats in 73–76 labeling with formats 346–347 missing values in TABULATE procedure 385–389 selecting/excluding portions of 403–407 sending to data sets 407–409 sending to HTML files 398–399 output data sets creating simplified reports with 409 determining structure of 408 naming variables in 329–330 Output Delivery System See ODS output objects 398, 404–406 OUTPUT statement counting number of visits and 512 iterative DO loop 126, 128–129 SAS processing 24 subsetting data sets 164 OUTPUT statement, MEANS procedure 327, 329–330, 337–338 AUTONAME option 329–330, 337–338 OUT= option 327 Output window 15, 67–68
P PAD option, INFILE statement 143, 442–443 PAGEBY statement 276
PANELS= option, REPORT procedure 301–302 parentheses ( ) ARRAY statement and 245 Boolean operators and 110 in assignment statements 20 logical comparison operators and 107 variable lists in 455 PATH statement 400 PATTERN statement 412–414, 425 PCTN statistic 379–380, 383 PCTSUM statistic 384 PDF output destination 402–403 PDV (Program Data Vector) 22–24 adding observations to data sets 166–167 combining detail/summary data 169 merging data sets with different names 178 missing character values in arrays 247 RETAIN statement and 516 subsetting data sets 163 PERCENT format 169, 531 percent sign (%) 114, 522 percentages computing 379–380, 382–383 computing on numeric variables 384–385 in two-dimensional tables 381–382 period (.) list input and 32 macro processor and 529–530 missing values and 192, 388–389 permanent data sets and 54–55 permanent data sets as input to DATA step 65–66 examining with CONTENTS procedure 56–58 LIBNAME statement and 54–55 listing with CONTENTS procedure 59 _NULL_ keyword and 67–68 reason for creating 55 user-defined formats with 79–82 viewing with PRINT procedure 63–64 viewing with SAS Explorer 60–63
permanent data sets (continued) viewing with SAS VIEWTABLE window 64–65 PERSIST option, ODS SELECT statement 407 pi (mathematical constant) 198–199 pie charts 415–416 PIE statement, GCHART procedure 414–415 NOHEADING option 415 PIE3D statement 414 PLOT procedure 412 PLOT statement, GPLOT procedure 426 HAXIS option 426 VAXIS option 426 plus sign (+) 123, 456–457 Prairie, Katherine 536 PREFIX= option, TRANSPOSE procedure 500 PRELOADFMT option, CLASS statement 484 PRINT procedure adding 31 adding number of observations to listings 279 adding subtotals/totals to listings 274–277 adding titles/footnotes to listings 268–270 changing listing appearance 263–265 changing listing order 270–272 changing values appearances 265–266 controlling observation appearance in listings 266–267 customized reports and 288 displaying data with 262–263 DOUBLE option 280 double-spacing listings 280 easier to read listings 277–278 FORMAT statement 41–42, 75, 265–266 HEADING= option 282 ID statement and 75 LABEL option 273, 279 labeling column headings 273–274 listing specified number of observations 281–283 N= option 279 NOOBS option 97, 265 output data sets and 408
REPORT procedure and 283, 288, 290 sending output to HTML files 398 SORT procedure and 299 sorting by multiple variables 272–273 viewing data sets with 63–64 WHERE statement and 336 PRINTMISS option, TABLE statement (TABULATE) 484–485 PRINTTO procedure 407 PROC steps creating labels in 72–73 defined 6 %LET statement and 524 SAS processing 24 procedures FORMAT statement and 43 IF statement in 112 NOPRINT option in 408 ODS and 397–398 OUT= option in 407 Program Data Vector See PDV programs, SAS See SAS programs PROPCASE function 214–215 punctuation dividing strings into words 230 searching for 225 PUT function 201–203 creating variables with 463–464 formats and 463–464 merging data sets 180 nested formats and 481 table lookups and 471 PUT statement controlling observations example 174 end of file and 444 in DATA step 67–68 PUT function and 202 PUTC function 489 PUTN function 489 p-values 407–409
Q Q1 statistic 321 Q3 statistic 321 QRANGE statistic 321 quarters, date interval functions 152–157 queries Cartesian product 539 demonstrating 537–538 question mark (?) 247–248, 446 QUIT statement 537 quotation marks (") in TITLE statement 57–58 macro variables and 523 missing character values and 192 XLS engine and 96
R random numbers, generating 199–201, 524 RANUNI function 199–201 raw data loading initial values from arrays 253 reading 11–18 reading column input 37–39 reading date values from 143–145 reading formatted input 39–43 reading from multiple files 446 reading from multiple files with FILENAME statement 447 reading portion of 445–446 reading short data lines 441–442 separated by blanks 30–31 separated by commas 33 RBREAK statement, REPORT procedure 303–306 AFTER option 303 BEFORE option 303 DOL option 303 DUL option 303 OL option 303 SUMMARIZE option 303 UL option 303 reading character data in one step 467–470
complex data structures 456–457 data conditionally 451–453 date values from raw data 143–145 external files 447–448 from multiple files 446 from multiple files with FILENAME statement 447 long external files 443 multiple lines of data for observations 448–450 numeric data in one step 467–470 portion of raw data file 445–446 raw data 11–18 raw data separated by blanks 30–31 raw data separated by commas 33 raw data with column input 37–39 raw data with formatted input 39–43 short data lines 440–443 spreadsheets with engines 95–96 relative column pointers 456–457 relative line pointers 450 RENAME= data set option counting number of visits and 510, 512 renaming variables 177–178, 473 SET statement and 202 RENAME= option, TRANSPOSE procedure 499 RENAME statement 337 REPORT procedure 289–290 See also COLUMN statement, REPORT procedure See also DEFINE statement, REPORT procedure applying ORDER usage to variables 300–301 BREAK statement 303–306 changing row order 299–300 comparing detail/summary reports 291–293 COMPUTE blocks in 308–309 COMPUTE statement 308–309 computing new variables 307–308 creating ACROSS variable 310
REPORT procedure (continued) displaying statistics with ACROSS variable 311–313 ENDCOMP statement 308–309 grouping variables 296–297 HEADLINE option 297 modifying column label for ACROSS variable 311 multi-column reports 301–302 NOWD option 289–290 ordering reports with nonprinting variables 306–307 PANELS= option 301–302 PRINT procedure and 283, 288, 290 producing report breaks 303–306 producing summary reports 293–294 RBREAK statement 303–306 selecting variables for report 291 SPLIT= option 294–296 reports See also customized reports See also displaying data See also listings BY statement vs. CLASS statement 324 DATA _NULL_ reporting 68 detail reports 291–293 multi-column 301–302 output data sets and 409 producing 11–18 summary reports 291–294, 303 RESET=all graphics option 412–413 restructuring data sets with DATA step 494–497 with TRANSPOSE procedure 497–500 RETAIN statement assignment statement and 473 computing differences between first/last observations 515–516 default missing values and 497 setting initial values with 121–122 retained variables 515–517 return values from observations 204–207 right joins (SQL) 543–545
RIGHT option, DEFINE statement (REPORT) 295 ROUND function 147, 190–191, 201 rounding numeric values 190–191 row indices 254 ROWPCTN keyword 382–383 ROWPCTSUM keyword 384 rows changing report order 299–300 computing percentages 379–380, 384–385 crosstab tables and 356–357 displaying percentages in 381–382 observations and 18, 31, 536 RTF output destination 402–403 RTS= option, TABLE statement (TABULATE) 381 RUN statement need for 13 SAS processing 24 semicolon (;) and 36
S _SAME_ keyword 469 SAMEDAY alignment, INTNX function 156 SAMPSIZE= option, SURVEYSELECT procedure 200 SAS getting data into 4 inner workings of 22–24 overview 3–4 SAS/ACCESS 88 SAS Display Manager 9, 406 SAS Enterprise Guide 9 SAS Explorer conversion process and 98 documenting data sets with 80 viewing data sets with 60–63 SAS/GRAPH See graphics SAS library 59 SAS macros 525–527 SAS names 7–8 SAS programs
debugging 68 enhancing 18–20 interrupting 134 producing reports 11–18 reading raw data 11–18 sample 4–7 submitting 14 writing data lines in 36 SAS sessions 58 SAS/STAT 200 SCAN function 230–232, 306 scatter plots 425–427 searching for blanks 225 for cases 225–226 for character classes 225–226 for character values 220–222 for characters 220–223 for digits 225 for punctuation 225 for words in strings 223–225 seed numbers 199 SEED= option, SURVEYSELECT procedure 200 SELECT clause (SQL) 537, 539–542 SELECT statement conditional processing and 108–109 FORMAT procedure 84, 488 LEAVE statement and 135 maintaining formats 477, 479 semicolon (;) comment statements and 21 DATA step and 36 RUN statement and 36 SAS programs and 6 sessions 58 SET statement 66 adding observations to data sets 164–167 arrays and 246 BY statement and 167–168, 507–508 combining detail/summary data 168–169 concatenating data sets 546 DATA step and 177
END= option 445 macro variables transferring values 530–531 missing character values in arrays 247–248 subsetting data sets 163 single trailing at sign (@) 130, 451–454 SKIP option, BREAK statement (REPORT) 305 slash See forward slash SMALLEST function 196 Social Security numbers 180 SORT procedure changing listing order 270–271 DESCENDING option 270–271 OUT= option 271 PRINT procedure and 299 sort flag and 168 sorting by multiple variables 272–273 sorting multiple variables 272–273 spaces See blanks special functions 201–203 SPEDIS function 234–235, 552 SPLIT= option, REPORT procedure 294–296 SPSS 244, 246 SQL procedure 536–538 concatenating data sets 546–549 FROM clause 537, 539–542 full joins 543–545 fuzzy matching 551–552 joining tables 539–542 left joins 543–545 ON clause 543–545 ORDER clause 551 right joins 543–545 SELECT clause 537, 539–542 summary functions 549–550 UNION operator 546–549 WHERE clause 537, 541–542, 552 SQRT function 127, 197–198 square brackets [ ] 245, 480 ssn11. format 180
standardizing addresses 236–238 STAR statement 414 statements basic rules 6 imbedding comments in 20–21 statistics bar charts representing 420–422 computing 14 computing row/column percentages 379–380 descriptive 328–329, 370–372 descriptive statistics functions 194–196 displaying with ACROSS variables 311–313 grand mean 332 in summary reports 293–294 naming variables in output data sets 329–330 options with MEANS procedure 321 outputting summary data sets 330–331 outputting with MEANS procedure 328–329 RBREAK statement and 303 selecting for variables 337–338, 371 summary reports and 291 t-tests 407 underscore (_) and 329, 337–338 STD statistic 321 storing dates 142 formats 79 strings comparing 232–234 concatenating 215–217 dividing into words 230–232 extracting parts of 228–230 removing characters from 214–215, 218–220 searching for words in 223–225 STRIP function 217–218 SUBGROUP= option, VBAR statement (GCHART) 424 subscripts 245
subsetting data sets 112, 162–164 subsetting IF statement 105–107 controlling observations with 174 LENGTHN function and 213 subsetting data sets and 162 SUBSTR function 228–230 subtotals adding to listings 274–277 producing in reports 303–306 subtraction in assignment statements 19–20 SUM function 195–197, 549–550 sum statement 120–125 adding subtotals/totals to listings 274, 276 iterative DO loop and 125 SUM statistic 321, 384, 421 SUMMARIZE option BREAK statement (REPORT) 305 RBREAK statement (REPORT) 303 summarizing data applying formats to class variables 325–326 BY statement with MEANS procedure 323–324, 327 CLASS statement with MEANS procedure 324–325, 327 creating summary data sets 327–328 multiple class variables when 333–337 naming variables in output data sets 329–330 outputting descriptive statistics 328–329 outputting summary data sets 330–333 selecting statistics for variables 337–338 with MEANS procedure 320–322 summary data sets creating in DATA step 336–337 creating with MEANS procedure 327–328 outputting in BY statement 330–331 outputting in CLASS statement 331–333 selecting statistics for variables 337–338 SUMMARY procedure creating summary data sets 327–328 formats and 463 multilabel formats 482
selecting statistics for variables 337–338 summary reports BREAK statement and 303 comparing with detail reports 291–293 producing 293–294 sums, bar charts representing 420–422 SUMVAR= option, VBAR statement (GCHART) 421–422 SUPPRESS option, BREAK statement (REPORT) 306 SURVEYSELECT procedure 200 DATA= option 200 METHOD= option 200 OUT= option 200 SAMPSIZE= option 200 SEED= option 200 swap and drop technique 202, 221 SYMBOL statement 412–413 connecting points 427–429 connecting points with smooth line 429–430 INTERPOL= option 427–429 JOIN option 428–429 LINE= option 429 producing scatter plots 425–426 VALUE= option 426 WIDTH= option 428 SYMPUT routine 531 &SYSDATE9 macro variable 523 &SYSTIME macro variable 523
T tab character 35 table lookup formats and 470–471 informats and 470–471 INPUTN function 485–490 multidimensional arrays for 254–257 table of contents 400–401 TABLE statement, TABULATE procedure 365 asterisk (*) in 368 comma in 367 concatenation operator and 366
descriptive statistics and 370–372 missing values and 385 MISSTEXT= option 389, 485 PRINTMISS option 484–485 RTS= option 381 tables See also columns See also rows See also TABULATE procedure combining class/analysis variables in 372–373 complex 377–378 controlling dimensions of 368 creating 312 crosstab 356–358 customizing 374–377 data sets and 536 joining 539–545 merging 539–545 observations and 18, 31 three-way 358–360 two-dimensional 381–382 two-way 356–358 variables and 18, 31 TABLES statement, FREQ procedure 13 counting number of visits 509 formats and 77 MISSING option 351–352 MISSPRINT= option 388 multiple two-way tables 358 NOCOL option 359 NOCUM option 345 NOPERCENT option 346, 359 NOROW option 359 producing two-way tables 356–357 selecting variables and 345–346 tabular reports See TABULATE procedure TABULATE procedure 364–365 See also TABLE statement, TABULATE procedure ALL keyword 369
TABULATE procedure (continued) analysis variables and 372–373, 377–378 CLASS statement and 365 class variables and 365 combining class/analysis variables 372–373 complex tables 377–378 computing percentages on numeric variables 384–385 computing row/column percentages 379–380, 382–383 controlling decimal places with 322 creating tables 312 customizing tables 374–377 FORMAT= option 366 formats and 463 KEYLABEL statement 375–376, 380, 383 MISSING option 387–388 missing values and 385–389 multi-label formats 482–484 NOSEPS option 376, 381 operators for 366–368 percentages in two-dimensional tables 381–382 producing descriptive statistics 370–372 temporary arrays 251–252 loading initial values into 253 table lookups with 254–257 _TEMPORARY_ keyword 252, 256 text wrapping 294–296 three-way tables 358–360 TITLE statement 13 automatic macro variables in 523 connecting points and 428 displaying data with 268–270 quotation marks in 57–58 RESET=all graphics option and 412 sample SAS program 16 titles adding to listings 268–270 font settings in 413 TODAY function 148–149
tokens 527–529 totals adding to listings 274–277 producing in reports 303–306 trailing blanks, removing 217–218, 233–234 transaction files 183–184 TRANSLATE function 235–237, 256 TRANSPOSE procedure 497–500 DROP= option 499 PREFIX= option 500 RENAME= option 499 TRANWRD function 235–238 TRIM function NOT functions and 227 removing trailing blanks 217–218, 233–234 truncating numeric values 190–191 TRUNCOVER option, INFILE statement 143, 443 TTEST procedure 407–408 t-tests 407–408 t-values 407–409 two-digit years 145 two-dimensional tables 381–382 two-way tables 356–358 TYPE= option, VBAR statement (GCHART) 421–422 _TYPE_ variable 332–337
U UL option, RBREAK statement (REPORT) 303 underscore (_) as wildcard 114 conversion process and 92, 96 naming conventions and 7 statistics and 329, 337–338 UNION ALL CORRESPONDING operator 546 UNION ALL operator 546 UNION CORRESPONDING operator 546 UNION operator (SQL) 546–549 UNIVARIATE procedure 403–405 UPCASE function 214, 235
UPCASE option, INVALUE statement (FORMAT) 466, 469 UPDATE statement 183–184 uppercase 214, 235 user-defined formats 79–82, 380, 464 user-defined informats 464–467
V VALUE= option, SYMBOL statement (GPLOT) 426 VALUE statement, FORMAT procedure 74, 78, 482–485 MULTILABEL option 482–485 VAR statement 14 changing listing appearance with 263–265 descriptive statistics and 370 double dash in 149 ID statement and 75 MEANS procedure and 322, 328–330 TABLE statement (TABULATE) and 365 VAR statistic 321 variable lists 149, 455–456 variable names array references and 245 defined 13 in INPUT statement 31 informats and 43–44 variables See also character variables See also class variables See also macro variables See also numeric variables adding labels to 71–73 adding to bar charts 423–425 analysis variables 292–295, 312, 372–373, 377–378 applying ORDER usage to 300–301 changing order in COLUMN statement 297–298 computing frequencies of 342–344 computing with REPORT procedure 307–308
continuous 416–420 controlling decimal places 322 controlling listing appearance 263–265 creating 249–250, 463–464 defining usage for 296 _FREQ_ 331, 337 grouping 296–297 in ID statements 75 listing formats 80 listing labels 80 missing values in TABULATE procedure 385–389 naming conventions 7 naming in output data sets 329–330 nonprinting 306–307 recoding with formats 462–463 retained 515–517 selecting for FREQ procedure 345–346 selecting for reports 291 selecting statistics for 337–338, 371 setting initial values for 121–122 sorting by multiple 272–273 sum statement and 123 swap and drop technique 202 table columns and 18, 31, 536 _TYPE_ 332–337 types of 8 VAR statement and 149 WHERE statement and 162 VARNUM option, CONTENTS procedure 58, 149 VAXIS option, PLOT statement (GPLOT) 426 VBAR statement, GCHART procedure 414 DISCRETE option 419–420 GROUP= option 423 MAXIS= option 421 MIDPOINTS= option 417–418 SUBGROUP= option 424 SUMVAR= option 421–422 TYPE= option 421–422 VBAR3D statement 414 VERIFY function 227–228 VIEWTABLE Window 64–65
virtual data sets 474 visits, counting number of 509–512 VSIZE=4 option, GOPTIONS statement 414
W $w. informat 40 w.d informat 40 WEEKDAY function 149, 419 WHEN statement 108–109 WHERE clause (SQL) 537 fuzzy matching 552 joining tables 541–542 WHERE= data set option 499 WHERE statement controlling observation appearance in listings 266–267 subsetting data sets 112, 162 _TYPE_ variable in 336 useful operators 113–114 WIDTH= option, SYMBOL statement 428 wildcards asterisk (*) as 338, 446, 538 colon (:) as 202, 337 for WHERE statement operators 114 question mark as 446 Williams, Christianna 536 words dividing strings into 230–232 searching for in strings 223–225 substituting 235–238 wrapping lines of text 294–296
X XLS engine 95–96 X-Y plots 425–427
Y YEAR function 149 YEARCUTOFF system option 145, 150 years computing between dates 146–147 creating dates from 150–151 date interval functions 152–157
extracting from dates 149–150 four-digit 145 two-digit 145 YRDIF function 146–147
Z Zdeb, Mike 180
Symbols & (ampersand) 46 * (asterisk) See asterisk (*) @ (at sign) See at sign (@) @@ (double trailing @ sign) 197, 454–455 : (colon) See colon (:) , (comma) See comma (,) { } (curly brackets) 245, 254 $ (dollar sign) See dollar sign ($) = (equal sign) See equal sign (=) / (forward slash) See forward slash (/) - (hyphen) 180 < (less than sign) 78 ( ) (parentheses) See parentheses ( ) % (percent sign) 114, 522 . (period) See period (.) + (plus sign) 123, 456–457 ? (question mark) 247–248, 446 " (quotation marks) See quotation marks (") ; (semicolon) See semicolon (;) [ ] (square brackets) 245, 480
Books Available from SAS Press
Advanced Log-Linear Models Using SAS® by Daniel Zelterman
Cody’s Data Cleaning Techniques Using SAS® Software by Ron Cody
Analysis of Clinical Trials Using SAS®: A Practical Guide by Alex Dmitrienko, Geert Molenberghs, Walter Offen, and Christy Chuang-Stein
Common Statistical Methods for Clinical Research with SAS ® Examples, Second Edition by Glenn A. Walker
Annotate: Simply the Basics by Art Carpenter
The Complete Guide to SAS® Indexes by Michael A. Raithel
Applied Multivariate Statistics with SAS® Software, Second Edition by Ravindra Khattree and Dayanand N. Naik
Data Management and Reporting Made Easy with SAS® Learning Edition 2.0 by Sunil K. Gupta
Applied Statistics and the SAS® Programming Language, Fifth Edition by Ronald P. Cody and Jeffrey K. Smith
Data Preparation for Analytics Using SAS® by Gerhard Svolba
Debugging SAS ® Programs: A Handbook of Tools and Techniques by Michele M. Burlew
An Array of Challenges — Test Your SAS ® Skills by Robert Virgile
Carpenter’s Complete Guide to the SAS® Macro Language, Second Edition by Art Carpenter
Decision Trees for Business Intelligence and Data Mining: Using SAS® Enterprise Miner™ by Barry de Ville
The Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith
Efficiency: Improving the Performance of Your SAS® Applications by Robert Virgile
Categorical Data Analysis Using the SAS® System, Second Edition by Maura E. Stokes, Charles S. Davis, and Gary G. Koch
The Essential Guide to SAS ® Dates and Times by Derek P. Morgan
Fixed Effects Regression Methods for Longitudinal Data Using SAS® by Paul D. Allison
support.sas.com/pubs
Genetic Analysis of Complex Traits Using SAS ® by Arnold M. Saxton
A Handbook of Statistical Analyses Using SAS®, Second Edition by B.S. Everitt and G. Der
Health Care Data and SAS® by Marge Scerbo, Craig Dickstein, and Alan Wilson
The How-To Book for SAS/GRAPH® Software by Thomas Miron
In the Know... SAS® Tips and Techniques From Around the Globe, Second Edition by Phil Mason
Instant ODS: Style Templates for the Output Delivery System by Bernadette Johnson
Integrating Results through Meta-Analytic Review Using SAS® Software by Morgan C. Wang and Brad J. Bushman
Introduction to Data Mining Using SAS® Enterprise Miner™ by Patricia B. Cerrito
Learning SAS® by Example: A Programmer’s Guide by Ron Cody
Learning SAS® in the Computer Lab, Second Edition by Rebecca J. Elliott
The Little SAS® Book: A Primer by Lora D. Delwiche and Susan J. Slaughter
The Little SAS® Book: A Primer, Second Edition by Lora D. Delwiche and Susan J. Slaughter (updated to include SAS 7 features)
The Little SAS® Book: A Primer, Third Edition by Lora D. Delwiche and Susan J. Slaughter (updated to include SAS 9.1 features)
The Little SAS® Book for Enterprise Guide® 3.0 by Susan J. Slaughter and Lora D. Delwiche
The Little SAS® Book for Enterprise Guide® 4.1 by Susan J. Slaughter and Lora D. Delwiche
Logistic Regression Using the SAS® System: Theory and Application by Paul D. Allison
Longitudinal Data and SAS®: A Programmer’s Guide by Ron Cody
Maps Made Easy Using SAS® by Mike Zdeb
Models for Discrete Data by Daniel Zelterman
Multiple Comparisons and Multiple Tests Using SAS® Text and Workbook Set (books in this set also sold separately) by Peter H. Westfall, Randall D. Tobias, Dror Rom, Russell D. Wolfinger, and Yosef Hochberg
Multiple-Plot Displays: Simplified with Macros by Perry Watts
Multivariate Data Reduction and Discrimination with SAS ® Software by Ravindra Khattree and Dayanand N. Naik
Output Delivery System: The Basics by Lauren E. Haworth
Painless Windows: A Handbook for SAS ® Users, Third Edition by Jodie Gilmore (updated to include SAS 8 and SAS 9.1 features)
Pharmaceutical Statistics Using SAS®: A Practical Guide Edited by Alex Dmitrienko, Christy Chuang-Stein, and Ralph D’Agostino
SAS ® for Mixed Models, Second Edition by Ramon C. Littell, George A. Milliken, Walter W. Stroup, Russell D. Wolfinger, and Oliver Schabenberger
The Power of PROC FORMAT by Jonas V. Bilenas
SAS® for Monte Carlo Studies: A Guide for Quantitative Researchers by Xitao Fan, Ákos Felsővályi, Stephen A. Sivo, and Sean C. Keenan
PROC SQL: Beyond the Basics Using SAS® by Kirk Paul Lafler
SAS® Functions by Example by Ron Cody
PROC TABULATE by Example by Lauren E. Haworth
SAS® Guide to Report Writing, Second Edition by Michele M. Burlew
Professional SAS® Programmer’s Pocket Reference, Fifth Edition by Rick Aster
SAS® Macro Programming Made Easy, Second Edition by Michele M. Burlew
Professional SAS® Programming Shortcuts, Second Edition by Rick Aster
SAS® Programming by Example by Ron Cody and Ray Pass
Quick Results with SAS/GRAPH® Software by Arthur L. Carpenter and Charles E. Shipp
Quick Results with the Output Delivery System by Sunil Gupta
Reading External Data Files Using SAS®: Examples Handbook by Michele M. Burlew
Regression and ANOVA: An Integrated Approach Using SAS ® Software by Keith E. Muller and Bethel A. Fetterman
SAS® for Forecasting Time Series, Second Edition by John C. Brocklebank and David A. Dickey
SAS® for Linear Models, Fourth Edition by Ramon C. Littell, Walter W. Stroup, and Rudolf Freund
SAS ® Programming for Researchers and Social Scientists, Second Edition by Paul E. Spector
SAS ® Programming in the Pharmaceutical Industry by Jack Shostak
SAS® Survival Analysis Techniques for Medical Research, Second Edition by Alan B. Cantor
SAS ® System for Elementary Statistical Analysis, Second Edition by Sandra D. Schlotzhauer and Ramon C. Littell
SAS ® System for Regression, Third Edition by Rudolf J. Freund and Ramon C. Littell
SAS ® System for Statistical Graphics, First Edition by Michael Friendly
The SAS® Workbook and Solutions Set (books in this set also sold separately) by Ron Cody
Visualizing Categorical Data by Michael Friendly
Selecting Statistical Techniques for Social Science Data: A Guide for SAS® Users by Frank M. Andrews, Laura Klem, Patrick M. O’Malley, Willard L. Rodgers, Kathleen B. Welch, and Terrence N. Davidson
Statistical Quality Control Using the SAS® System by Dennis W. King
Web Development with SAS® by Example, Second Edition by Frederick E. Pratter
Your Guide to Survey Research Using the SAS® System by Archer Gravely
A Step-by-Step Approach to Using the SAS ® System for Factor Analysis and Structural Equation Modeling by Larry Hatcher
A Step-by-Step Approach to Using SAS® for Univariate and Multivariate Statistics, Second Edition by Norm O’Rourke, Larry Hatcher, and Edward J. Stepanski
Step-by-Step Basic Statistics Using SAS®: Student Guide and Exercises (books in this set also sold separately) by Larry Hatcher
Survival Analysis Using SAS®: A Practical Guide by Paul D. Allison
Tuning SAS® Applications in the OS/390 and z/OS Environments, Second Edition by Michael A. Raithel
Univariate and Multivariate General Linear Models: Theory and Applications Using SAS® Software by Neil H. Timm and Tammy A. Mieczkowski
Using SAS® in Financial Research by Ekkehart Boehmer, John Paul Broussard, and Juha-Pekka Kallunki
Using the SAS® Windowing Environment: A Quick Tutorial by Larry Hatcher
JMP® Books
JMP® for Basic Univariate and Multivariate Statistics: A Step-by-Step Guide by Ann Lehman, Norm O’Rourke, Larry Hatcher, and Edward J. Stepanski
JMP® Start Statistics, Third Edition by John Sall, Ann Lehman, and Lee Creighton
Regression Using JMP® by Rudolf J. Freund, Ramon C. Littell, and Lee Creighton