Cody’s Data Cleaning Techniques Using SAS Software ®
Ron Cody
The correct bibliographic citation for this manual is as follows: Cody, Ron. 1999. Cody’s Data Cleaning Techniques Using SAS® Software. Cary, NC: SAS Institute Inc.

Cody’s Data Cleaning Techniques Using SAS® Software
Copyright © 1999, SAS Institute Inc., Cary, NC, USA
ISBN 1-58025-600-7

All rights reserved. Produced in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.

1st printing, December 1999
2nd printing, February 2002
3rd printing, August 2003

Note that text corrections may have been made at each printing.

SAS Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hardcopy books, visit the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Table of Contents

List of Programs
Introduction
Acknowledgments

1  Checking Values of Character Variables
   Introduction
   Using PROC FREQ to List Values
   Description of the File PATIENTS.TXT
   Using a DATA Step to Check for Invalid Values
   Using PROC PRINT with a WHERE Statement to List Invalid Values
   Using Formats to Check for Invalid Values
   Using Informats to Check for Invalid Values

2  Checking Values of Numeric Variables
   Introduction
   Using PROC MEANS, PROC TABULATE, and PROC UNIVARIATE to Look for Outliers
   Using PROC PRINT with a WHERE Statement to List Invalid Data Values
   Using a DATA Step to Check for Invalid Values
   Creating a Macro for Range Checking
   Using Formats to Check for Invalid Values
   Using Informats to Check for Invalid Values
   Using PROC UNIVARIATE to Look for Highest and Lowest Values by Percentage
   Using PROC RANK to Look for Highest and Lowest Values by Percentage
   Extending PROC RANK to Look for Highest and Lowest "n" Values
   Finding Another Way to Determine Highest and Lowest Values
   Checking a Range Using an Algorithm Based on Standard Deviation
   Macros Based on the Two Methods of Outlier Detection
   Demonstrating the Difference between the Two Methods
   Checking a Range Based on the Interquartile Range
   Checking Ranges for Several Variables

3  Checking for Missing Values
   Introduction
   Inspecting the SAS Log
   Using PROC MEANS and PROC FREQ to Count Missing Values
   Using DATA Step Approaches to Identify and Count Missing Values
   Using PROC TABULATE to Count Missing and Nonmissing Values for Numeric Variables
   Using PROC TABULATE to Count Missing and Nonmissing Values for Character Variables
   Creating a General-Purpose Macro to Count Missing and Nonmissing Values for Both Numeric and Character Variables
   Searching for a Specific Numeric Value

4  Working with Dates
   Introduction
   Checking Ranges for Dates (Using a DATA Step)
   Checking Ranges for Dates (Using PROC PRINT)
   Checking for Invalid Dates
   Working with Dates in Nonstandard Form
   Creating a SAS Date When the Day of the Month Is Missing
   Suspending Error Checking for Known Invalid Dates

5  Looking for Duplicates and "n" Observations per Subject
   Introduction
   Eliminating Duplicates by Using PROC SORT
   Detecting Duplicates by Using DATA Step Approaches
   Using PROC FREQ to Detect Duplicate ID's
   Selecting Patients with Duplicate Observations by Using a Macro List and SQL
   Identifying Subjects with "n" Observations Each (DATA Step Approach)
   Identifying Subjects with "n" Observations Each (Using PROC FREQ)

6  Working with Multiple Files
   Introduction
   Checking for an ID in Each of Two Files
   Checking for an ID in Each of "n" Files
   A Simple Macro to Check ID's in Multiple Files
   A More Complicated Multi-File Macro for ID Checking
   More Complicated Multi-File Rules
   Checking That the Dates Are in the Proper Order

7  Double Entry and Verification (PROC COMPARE)
   Introduction
   Conducting a Simple Comparison of Two Data Sets without an ID Variable
   Using PROC COMPARE with an ID Variable
   Using PROC COMPARE with Two Data Sets That Have an Unequal Number of Observations
   Comparing Two Data Sets When Some Variables Are Not in Both Data Sets

8  Some SQL Solutions to Data Cleaning
   Introduction
   A Quick Review of PROC SQL
   Checking for Invalid Character Values
   Checking for Outliers
   Checking a Range Using an Algorithm Based on the Standard Deviation
   Checking for Missing Values
   Range Checking for Dates
   Checking for Duplicates
   Identifying Subjects with "n" Observations Each
   Checking for an ID in Each of Two Files
   More Complicated Multi-File Rules

9  Using Validation Data Sets
   Introduction
   A Simple Example of a Validation Data Set
   Making the Program More Flexible and Converting It to a Macro
   Validating Character Data
   Converting Program … into a General-Purpose Macro
   Extending the Validation Macro to Include Valid Character Ranges
   Combining Numeric and Character Validity Checks in a Single Macro with a Single Validation Data Set
   Introducing SAS Integrity Constraints (Versions 7 and Later)

Appendix  Listing of Raw Data Files and SAS Programs
   Description of the Raw Data File PATIENTS.TXT
   Layout for the Data File PATIENTS.TXT
   Listing of Raw Data File PATIENTS.TXT
   Program to Create the SAS Data Set PATIENTS
   Listing of Raw Data File PATIENTS2.TXT
   Program to Create the SAS Data Set PATIENTS2
   Program to Create the SAS Data Set AE (Adverse Events)
   Program to Create the SAS Data Set LAB_TEST

Index
List of Programs

1  Checking Values of Character Variables
   Writing a Program to Create the Data Set PATIENTS
   Using PROC FREQ to List All the Unique Values for Character Variables
   Using a DATA _NULL_ Step to Detect Invalid Character Data
   Using PROC PRINT to List Invalid Character Values
   Using PROC PRINT to List Invalid Character Data for Several Variables
   Using a User-Defined Format and PROC FREQ to List Invalid Data Values
   Using a User-Defined Format and a DATA Step to List Invalid Data Values
   Using a User-Defined Informat to Set Invalid Data Values to Missing
   Using a User-Defined Informat with the INPUT Function

2  Checking Values of Numeric Variables
   Using PROC MEANS to Detect Invalid and Missing Values
   Using PROC TABULATE to Display Descriptive Data
   Using PROC UNIVARIATE to Look for Outliers
   Adding an ID Statement to PROC UNIVARIATE
   Using a WHERE Statement with PROC PRINT to List Out-of-Range Data
   Using a DATA _NULL_ Step to List Out-of-Range Data Values
   Writing a Macro to List Out-of-Range Data Values
   Detecting Out-of-Range Values Using User-Defined Formats
   Modifying the Previous Program to Detect Invalid (Character) Data Values
   Using User-Defined Informats to Detect Out-of-Range Data Values
   Modifying the Previous Program to Detect Invalid (Character) Data Values
   Using PROC UNIVARIATE to Print the Top and Bottom "n" Percent of Data Values
   Creating a Macro to List the Highest and Lowest "n" Percent of the Data Using PROC UNIVARIATE
   Creating a Macro to List the Highest and Lowest "n" Percent of the Data Using PROC RANK
   Creating a Macro to List the Top and Bottom "n" Data Values Using PROC RANK
   Determining the Number of Nonmissing Observations in a Data Set
   Listing the Highest and Lowest "n" Values Using PROC SORT
   Creating a Macro to List the "n" Highest and Lowest Data Values Using PROC SORT
   Detecting Outliers Based on the Standard Deviation
   Detecting Outliers Based on a Trimmed Mean
   Creating a Macro to Detect Outliers Based on a Standard Deviation
   Creating a Macro to Detect Outliers Based on a Trimmed Mean
   Detecting Outliers Based on the Interquartile Range
   Writing a Program to Summarize Data Errors on Several Variables

3  Checking for Missing Values
   Counting Missing and Nonmissing Values for Numeric and Character Variables
   Writing a Simple DATA Step to List Missing Data Values and an ID Variable
   Attempting to Locate a Missing or Invalid Patient ID by Listing the Two Previous ID's
   Using PROC PRINT to List Data for Missing or Invalid Patient ID's
   Listing and Counting Missing Values for Selected Variables
   Listing the Number of Nonmissing and Missing Values and the Minimum and Maximum Values for All Numeric Variables
   Using PROC TABULATE to Count Missing and Nonmissing Values for Character Variables
   Writing a Macro to Count the Number of Missing and Nonmissing Observations for All Numeric and Character Variables in a Data Set
   Identifying All Numeric Variables Equal to a Fixed Value (Such as …)
   Creating a Macro Version of Program …
   Identifying Variables with Specified Numeric Values and Counting the Number of Times the Value Appears

4  Working with Dates
   Checking That a Date Is within a Specified Interval (DATA Step Approach)
   Checking That a Date Is within a Specified Interval (Using PROC PRINT and a WHERE Statement)
   Reading Dates with the MMDDYY Informat
   Listing Missing and Invalid Dates by Reading the Date Twice, Once with a Date Informat and the Second as Character Data
   Listing Missing and Invalid Dates by Reading the Date as a Character Variable and Converting to a SAS Date with the INPUT Function
   Removing the Missing Values from the Invalid Date Listing
   Demonstrating the MDY Function to Read Dates in Nonstandard Form
   Removing Missing Values from the Error Listing
   Creating a SAS Date When the Day of the Month Is Missing
   Substituting the …th of the Month When the Date of the Month Is Missing
   Suspending Error Checking for Known Invalid Dates by Using the ?? Informat Modifier
   Demonstrating the ?? Informat Modifier with the INPUT Function

5  Looking for Duplicates and "n" Observations per Subject
   Demonstrating the NODUPKEY Option of PROC SORT
   Demonstrating the NODUP Option of PROC SORT
   Demonstrating a "Feature" of the NODUP Option
   Identifying Duplicate ID's
   Creating the SAS Data Set PATIENTS2, a Data Set Containing Multiple Visits for Each Patient
   Identifying Patient ID's with Duplicate Visit Dates
   Using PROC FREQ and an Output Data Set to Identify Duplicate ID's
   Producing a List of Duplicate Patient Numbers by Using PROC FREQ
   Using PROC SQL to Create a List of Duplicates
   Using PROC SQL to Create a List of Duplicates in Sorted Order
   Using a DATA Step to List All ID's for Patients Who Do Not Have Exactly Two Observations
   Using PROC FREQ to List All ID's for Patients Who Do Not Have Exactly Two Observations

6  Working with Multiple Files
   Creating Two Test Data Sets for Chapter 6 Examples
   Identifying ID's Not in Each of Two Data Sets
   Creating a Third Data Set for Testing Purposes
   Checking for an ID in Each of Three Data Sets (Long Way)
   Creating a Macro to Check for an ID in Each of "n" Files (Simple Way)
   Writing a More General Macro to Handle Any Number of Data Sets
   Verifying That Patients with an Adverse Event of "X" in Data Set AE Have an Entry in Data Set LAB_TEST
   Adding the Condition That the Lab Test Must Follow the Adverse Event

7  Double Entry and Verification (PROC COMPARE)
   Creating Data Sets ONE and TWO from Two Raw Data Files
   Running PROC COMPARE
   Using PROC COMPARE to Compare Two Data Records
   Using PROC COMPARE with an ID Variable
   Running PROC COMPARE on Two Data Sets of Different Length
   Creating Two Test Data Sets, DEMOG and OLDDEMOG
   Comparing Two Data Sets That Contain Different Variables
   Adding a VAR Statement to PROC COMPARE

8  Some SQL Solutions to Data Cleaning
   Demonstrating a Simple SQL Query
   Using SQL to Look for Invalid Character Values
   Using SQL to List Invalid Character Data: Missing Values Not Flagged as Errors
   Using SQL to Check for Out-of-Range Numeric Values
   Using SQL to Check for Out-of-Range Values Based on the Standard Deviation
   Converting Program … into a Macro
   Using SQL to List Missing Values
   Using SQL to Perform Range Checks on Dates
   Using SQL to List Duplicate Patient Numbers
   Using SQL to List Patients Who Do Not Have Two Visits
   Creating Two Data Sets for Testing Purposes
   Using SQL to Look for ID's That Are Not in Each of Two Files
   Using SQL to Demonstrate More Complicated Multi-File Rules
   Example of LEFT, RIGHT, and FULL Joins

9  Using Validation Data Sets
   Creating a Simple Validation Data Set
   Restructuring the PATIENTS Data Set and Producing an Exceptions Report
   Creating a New Validation Data Set That Contains Missing Value Instructions
   Validating a Data Set with a Macro That Contains Missing Value Instructions
   Creating a Test Data Set for Character Validation
   Creating a Validation Data Set (C_VALID) for Character Variables
   Writing the Program to Validate Character Variables
   Writing a Macro to Check for Invalid Character Values
   Writing a Macro to Check for Discrete Character Values and Character Ranges
   Creating a Macro to Validate Both Numeric and Character Data, Including Character Ranges, with a Single Validation Data File
Introduction

What is data cleaning? In this book, we define data cleaning to include:

• Making sure that the raw data were accurately entered into a computer-readable file.
• Checking that character variables contain only valid values.
• Checking that numeric values are within predetermined ranges.
• Checking if there are missing values for variables where complete data is necessary.
• Checking for and eliminating duplicate data entries.
• Checking for uniqueness of certain values, such as patient ID’s.
• Checking for invalid date values.
• Checking that an ID number is present in each of "n" files.
• Verifying that more complex multi-file rules have been followed. For example, if an adverse event of type X occurs in one data set, you expect an observation with the same ID number in another data set. In addition, the date of this observation must be after the adverse event and before the end of the trial.

This book provides many programming examples to accomplish the tasks listed above. In many cases, a given problem is solved in several ways. For example, numeric outliers are detected in a DATA step by using formats and informats, by using SAS procedures, and by SQL queries, which are presented together in Chapter 8. Throughout the book, there are useful macros that you may want to add to your collection of data cleaning tools. However, even if you are not experienced with SAS macros, most of the macros that are presented are first presented in non-macro form, so you should still be able to understand the programming concepts that are presented.

But there is another purpose for this book. It provides instruction on intermediate and advanced SAS programming techniques. One of the reasons for providing multiple solutions to data cleaning problems is to demonstrate specific features of SAS programming. The more complex programs and macros in this book are described in detail.

It is impossible to provide an example of every data cleaning task. Indeed, some studies require custom programming. For those cases, the tools that are developed in this book can be the jumping-off point for more complex programs.

Many applications that require accurate data entry use customized, and sometimes very expensive, data entry and verification programs. A chapter on PROC COMPARE shows how SAS software can be used in a double-entry data verification process.

Chapter 9 describes the use of validation data sets. In a step-by-step process, programs and macros are developed that can read all of the rules for character and numeric variables from a raw data file (called a validation data file) and produce a validation data set and an exception report. The use of integrity constraints, new with Version 7 SAS software, is also discussed.

Although all of the programs in this book were tested by using either Version 7 or Version 8 SAS software, most of the programs should run under Release 6.12, perhaps with some minor changes (such as shortening variable names). However, the integrity constraints discussed in Chapter 9 require using Version 7 or later.

I have enjoyed writing this book. Writing any book is a learning experience and this book is no exception. I hope that most of the egregious errors have been eliminated. If any remain, I take full responsibility for them. Every program in the text has been run against sample data. However, as experience will tell, no program is foolproof.
Acknowledgments

Well, here I am writing the acknowledgments. That means the book is almost finished and I can relax, read a book, ride my bike, or play the piano. Some other people can relax too, and some still have some work to do. My reviewers, Mike Zdeb, Kevin Hobbs, Cynthia Zender, Kent Reeve, George Berg, John Laing, and Grant Cooper, finished their work some time ago. I am grateful for their careful reviews and especially for finding some very subtle errors and some not-so-subtle errors.

The editing and production staff are still working. Judy Whatley, although she is listed as the acquisitions editor, has played a much larger role, overseeing the book throughout the production process. Judy, it's been a real pleasure working with you. Thanks so much for your upbeat attitude, even through major floods in North Carolina. Other SAS Institute folks who played an important role in bringing this book to print are copyeditor Josephine Pope, production specialist Mary Rios, marketing analyst Patricia Urquhart, cartoonist Mike Pezzoni, and cover designer Cate Parrish. However, since I haven't seen the cover yet, I may want to withhold my appreciation to Cate. (Just kidding.)

Ron Cody
Winter
1  Checking Values of Character Variables

   Introduction
   Using PROC FREQ to List Values
   Description of the File PATIENTS.TXT
   Using a DATA Step to Check for Invalid Values
   Using PROC PRINT with a WHERE Statement to List Invalid Values
   Using Formats to Check for Invalid Values
   Using Informats to Check for Invalid Values

Introduction

There are some basic operations that need to be routinely performed when dealing with character data values.

Using PROC FREQ to List Values

This section demonstrates how to use PROC FREQ to check for invalid values of a character variable. In order to test the programs you develop, use the raw data file PATIENTS.TXT, listed in the Appendix.
Description of the Raw Data File PATIENTS.TXT
The raw data file PATIENTS.TXT contains both character and numeric variables from a typical clinical trial. A number of data errors were included in the file so that you can test the data cleaning programs that are developed in this text. The programs in this book assume that the file PATIENTS.TXT is located in a directory (folder) called C:\CLEANING. This is the directory that is used throughout this text as the location for data files, SAS data sets, SAS programs, and SAS macros.

Variable                                 Starting          Variable
Name      Description                    Column    Length  Type        Valid Values
--------------------------------------------------------------------------------------------
PATNO     Patient Number                 1         3       Character   Numerals only
GENDER    Gender                         4         1       Character   'M' or 'F'
VISIT     Visit Date                     5         10      MMDDYY10.   Any valid date
HR        Heart Rate                     15        3       Numeric     Between … and …
SBP       Systolic Blood Pressure        18        3       Numeric     Between … and …
DBP       Diastolic Blood Pressure       21        3       Numeric     Between … and …
DX        Diagnosis Code                 24        3       Character   1- to 3-digit numeral
AE        Adverse Event                  27        1       Character   '0' or '1'

There are several character variables that should have a limited number of valid values. For this exercise, you expect values of GENDER to be 'F' or 'M', values of DX to be the numerals 1 through 999, and values of AE (adverse events) to be '0' or '1'. A very simple approach to identifying invalid character values in this file is to use PROC FREQ to list all the unique values of these variables. Of course, once invalid values are identified using this technique, other means will have to be employed to locate specific records (or patient numbers) corresponding to the invalid values.
Use the program PATIENTS.SAS (shown next) to create the SAS data set PATIENTS from the raw data file PATIENTS.TXT (which can be downloaded from the SAS Web site or found listed in the Appendix). This program is followed with the appropriate PROC FREQ statements to list the unique values (and their frequencies) for the variables GENDER, DX, and AE.

Program: Writing a Program to Create the Data Set PATIENTS

   *----------------------------------------------------------*
   |PROGRAM NAME: PATIENTS.SAS IN C:\CLEANING                 |
   |PURPOSE: TO CREATE A SAS DATA SET CALLED PATIENTS         |
   *----------------------------------------------------------*;
   LIBNAME CLEAN "C:\CLEANING";

   DATA CLEAN.PATIENTS;
      INFILE "C:\CLEANING\PATIENTS.TXT" PAD; /* Pad short records with blanks */
      INPUT @1  PATNO  $3.
            @4  GENDER $1.
            @5  VISIT  MMDDYY10.
            @15 HR     3.
            @18 SBP    3.
            @21 DBP    3.
            @24 DX     $3.
            @27 AE     $1.;

      LABEL PATNO  = "Patient Number"
            GENDER = "Gender"
            VISIT  = "Visit Date"
            HR     = "Heart Rate"
            SBP    = "Systolic Blood Pressure"
            DBP    = "Diastolic Blood Pressure"
            DX     = "Diagnosis Code"
            AE     = "Adverse Event?";
      FORMAT VISIT MMDDYY10.;
   RUN;
The DATA step is straightforward. Notice the PAD option in the INFILE statement. This will seem foreign to most mainframe users and is probably no longer necessary on other platforms. The PAD option pads all records (adds blanks to the end of short records) to the default logical record length or a length specified by another INFILE option, LRECL. It prevents the possibility of skipping to the next record (line of data) when a short line is encountered.

Next, you want to use PROC FREQ to list all the unique values for your character variables. To simplify the output from PROC FREQ, use the NOCUM (no cumulative statistics) and NOPERCENT (no percentages) TABLES options because you only want frequency counts for each of the unique character values. (Note: sometimes the percent and cumulative statistics can be useful; the choice is yours.) The PROC statements are shown in the following program.

Program: Using PROC FREQ to List All the Unique Values for Character Variables

   PROC FREQ DATA=CLEAN.PATIENTS;
      TITLE "Frequency Counts for Selected Character Variables";
      TABLES GENDER DX AE / NOCUM NOPERCENT;
   RUN;
Here is the output from running this program:

   Frequency Counts for Selected Character Variables

   The FREQ Procedure

          Gender
   GENDER    Frequency
   -------------------
   2                 1
   F                12
   M                14
   X                 1
   f                 2

   Frequency Missing = 1

      Diagnosis Code
   DX    Frequency
   ---------------
   1             7
   2             2
   3             3
   4             3
   5             3
   6             1
   7             2
   X             2

   Frequency Missing = 8

      Adverse Event?
   AE    Frequency
   ---------------
   0            19
   1            10
   A             1

   Frequency Missing = 1
Let's focus in on the frequency listing for the variable GENDER. If valid values for GENDER are 'F', 'M', and missing, this output would point out several data errors. The values '2' and 'X' both occur once. Depending on the situation, the lowercase value 'f' may or may not be considered an error. If lowercase values were entered into the file by mistake, but the value (aside from the case) was correct, you could change all lowercase values to uppercase with the UPCASE function. (More on that later.) The invalid DX code of 'X' and the adverse event of 'A' are also easily identified. At this point, it is necessary to run additional programs to identify the location of these errors. Running PROC FREQ is still a useful first step in identifying errors of these types, and it is also useful as a last step, after the data have been cleaned, to ensure that all the errors have been identified and corrected.

Using a DATA Step to Check for Invalid Values

Program: Using a DATA _NULL_ Step to Detect Invalid Character Data

   DATA _NULL_;
      INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
      FILE PRINT; ***Send output to the Output window;
      TITLE "Listing of Invalid Patient Numbers and Data Values";
      ***Note: We will only input those variables of interest;
      INPUT @1  PATNO  $3.
            @4  GENDER $1.
            @24 DX     $3.
            @27 AE     $1.;
      ***Check GENDER;
      IF GENDER NOT IN ('F' 'M' ' ') THEN PUT PATNO= GENDER=;
      ***Check DX;
      IF VERIFY(DX,' 0123456789') NE 0 THEN PUT PATNO= DX=;
      ***Check AE;
      IF AE NOT IN ('0' '1' ' ') THEN PUT PATNO= AE=;
   RUN;
Before discussing the output, let's spend a moment looking over the program. First, notice the use of the DATA _NULL_ statement. Because the only purpose of this program is to identify invalid data values, there is no need to create a SAS data set. The FILE PRINT statement causes the results of any subsequent PUT statements to be sent to the Output window (or output device). Without this statement, the results of the PUT statements would be sent to the SAS Log. GENDER and AE are checked by using the IN operator. The statement

   IF X IN ('A' 'B' 'C') THEN . . .;

is equivalent to

   IF X = 'A' OR X = 'B' OR X = 'C' THEN . . .;

That is, if X is equal to any of the values in the list following the IN operator, the expression is evaluated as true.

Another possibility is

   IF GENDER NE 'F' AND GENDER NE 'M' AND GENDER NE ' ' THEN PUT PATNO= GENDER=;

While all of these statements checking for GENDER and AE produce the same result, the IN operator is probably the easiest to write, especially if there are a large number of possible values to check. Always be sure to consider whether you want to identify missing values as invalid or not. In the statements above, you are allowing missing values as valid codes. If you want to flag missing values as errors, do not include a missing value in the list of valid codes.
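As a minimal sketch of that last point, the step below drops the blank from the IN list so that a missing GENDER is reported along with miscoded values. The in-stream records are hypothetical test lines made up for illustration; they are not taken from PATIENTS.TXT.

```sas
DATA _NULL_;
   FILE PRINT;
   ***Hypothetical test records: PATNO in columns 1-3, GENDER in column 4;
   INPUT @1 PATNO  $3.
         @4 GENDER $1.;
   ***No blank in the list, so a missing GENDER is flagged as an error;
   IF GENDER NOT IN ('F' 'M') THEN PUT PATNO= GENDER=;
DATALINES;
001M
002F
003
004X
;
RUN;
```

With these four records, patients 003 (missing GENDER) and 004 (GENDER of X) should be listed, while 001 and 002 pass the check.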
If you want to allow lowercase M's and F's as valid values, you can add the single line

   GENDER = UPCASE(GENDER);

immediately before the line that checks for invalid gender codes. As you can probably guess, the UPCASE function changes all lowercase letters to uppercase letters.

A statement similar to the gender-checking statement is used to test the adverse events.

There are so many valid values for DX (any numeral from 1 to 999) that the approach you used for GENDER and AE would be inefficient (and wear you out typing) if you used it to check for invalid DX codes. The VERIFY function is one of the many possible ways you can check to see if there is a value other than the numerals 0 to 9 or blank as a DX value. The VERIFY function has the following form:

   VERIFY(character_variable,verify_string)

where the verify_string is either a character variable or a series of character values placed in single or double quotes. The VERIFY function returns the first position in the character_variable that contains a character that is not in the verify_string. If the character_variable does not contain any invalid values, the VERIFY function returns a 0. To make this clearer, let's look at the following statement that uses the VERIFY function to check for invalid GENDER values:

   IF VERIFY(GENDER,'FM ') NE 0 THEN PUT PATNO= GENDER=;

Notice that you included a blank in the verify_string so that missing values will be considered valid. If GENDER has a value other than an 'F', 'M', or missing, the VERIFY function returns the position of the invalid character in the string. But, because the length of GENDER is 1, any invalid value for GENDER returns a 1.
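To see the position that VERIFY returns for a longer variable, a small test step such as the following can help. This is only an illustrative sketch: the variable names CODE and POSITION and the test values are made up, and the PUT output goes to the SAS Log because no FILE PRINT statement is used.

```sas
DATA _NULL_;
   LENGTH CODE $ 3;
   ***Hypothetical three-character codes chosen for illustration;
   DO CODE = '123', '12X', 'X12', '1 2', '   ';
      ***Position of the first character that is not a blank or a digit;
      POSITION = VERIFY(CODE,' 0123456789');
      PUT CODE= POSITION=;
   END;
RUN;
```

The values '12X' and 'X12' should return 3 and 1, the positions of the X. The other three values should return 0, including '1 2', because the blank is part of the verify_string.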
Although the function

   VERIFY(DX,' 0123456789')

returns a 0 if there are no invalid characters in the DX code, it should be pointed out that DX codes with embedded blanks will not be identified as invalid with this statement. If you want to ensure that only the character representations of the numbers 1 to 999 are considered valid, the following statements can be used:

   X_DX = INPUT(DX,3.);
   IF X_DX EQ . AND DX NE ' ' THEN PUT PATNO= DX=;

The first line above creates a numeric variable (X_DX) from the character DX value. The INPUT function can be thought of in a similar manner to an INPUT statement. It says to pretend you are reading a variable (DX) from a data file according to the informat (3.), except you are actually reading the value from a character variable. The result of this process is to be assigned to the variable X_DX. In other words, the INPUT function performs a character-to-numeric conversion. If there is an invalid DX code (containing a letter or embedded blank, for example), the INPUT function sends an error message to the SAS Log and returns a missing value. In the second line, you test if the numeric equivalent of the DX code is missing and the original DX is not missing, putting out an error message when this condition is true. (Note: because the original character value was three bytes, you don't have to test if X_DX is greater than 999, because this is the largest number you can write with three digits.) Any invalid DX code will then cause the error message to be printed.

For really compulsive programmers (like the author), there is one final problem with the above approach. A value containing a decimal point would not result in an error message because the number still falls between 1 and 999. There are several ways around this problem. One way is to use the TRANSLATE function to substitute an invalid character for the decimal point before you perform the character-to-numeric conversion:

   X_DX = INPUT(TRANSLATE(DX,'X','.'),3.);

The TRANSLATE function above will convert periods (or decimal points) to X's. If DX originally contained a decimal point, the value of X_DX would be a missing value. In general, the syntax of the TRANSLATE function is

   TRANSLATE(char_variable,to_string,from_string)

where each character in the from_string is translated to the corresponding character in the to_string. For example, to translate the numerals 1 through 5 to the letters A through E for a variable called SCORE, you would write

   NEW_VAR = TRANSLATE(SCORE,'ABCDE','12345');

Another interesting approach is to test to see if the value of X_DX is not an integer. The MOD function is an effective way to do this. If any number modulus 1 is not 0 (the remainder after you divide the number by 1), the number is not an integer. The SAS code using this method is

   X_DX = INPUT(DX,3.);
   IF (X_DX EQ . OR MOD(X_DX,1) NE 0) AND DX NE ' ' THEN PUT PATNO= DX=;
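The combined missing-or-noninteger test can be tried on a handful of values before it is run against the real file. The sketch below uses made-up DX values and writes its messages to the SAS Log; expect the log to also contain invalid-argument and missing-value notes for the values that fail the conversion.

```sas
DATA _NULL_;
   LENGTH DX $ 3;
   ***Hypothetical DX values chosen for illustration;
   DO DX = '12', '1.5', '1 2', 'X', ' ';
      X_DX = INPUT(DX,3.);
      ***Flag values that fail conversion or are not integers, but allow missing;
      IF (X_DX EQ . OR MOD(X_DX,1) NE 0) AND DX NE ' ' THEN
         PUT 'Invalid:  ' DX=;
      ELSE PUT 'Accepted: ' DX=;
   END;
RUN;
```

Here '12' and the blank value should be accepted, while '1.5' (not an integer), '1 2' (embedded blank), and 'X' should be reported as invalid.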
Here is another point. If you want to avoid filling up your SAS Log with error messages resulting from invalid arguments to the INPUT function, you can use the double question mark (??) modifier before the informat to tell the program to ignore these errors and not to report the errors to the SAS Log. The INPUT function would then look like this:

   X_DX = INPUT(DX,?? 3.);

The ?? informat modifier can also be used with the INPUT statement.

Here is the output from running the DATA _NULL_ program shown earlier in this section:

   Listing of Invalid Patient Numbers and Data Values

   PATNO=002 DX=X
   PATNO=003 GENDER=X
   PATNO=004 AE=A
   PATNO=010 GENDER=f
   PATNO=013 GENDER=2
   PATNO=002 DX=X
   PATNO=023 GENDER=f

Note that patient 002 appears twice in this output. This occurs because there is a duplicate observation for patient 002 (in addition to several other purposely included errors), so that the data set can be used for examples later in this book, such as the detection of duplicate ID's and duplicate observations.
Suppose you want to check for valid patient numbers (PATNO) in a similar manner. However, you want to flag missing values as errors (every patient must have a valid ID). The following statements

   ID = INPUT(TRANSLATE(PATNO,'X','.'),?? 3.);
   IF ID LT 1 THEN PUT "Invalid ID for PATNO=" PATNO;

will work in the same way as your check for invalid DX codes, except that missing values will now be listed as errors. This works because a SAS numeric missing value compares as less than any number, so a missing ID satisfies the test ID LT 1.

Using PROC PRINT with a WHERE Statement to List Invalid Values

There are several alternative ways to identify the ID's containing invalid data. As with most of the topics in this book, you will see several ways of accomplishing the same task. Why? One reason is that some techniques are better suited to an application. Another reason is to teach some additional SAS programming techniques. Finally, under different circumstances, some techniques may be more efficient than others.

One very easy alternative way to list the subjects with invalid data is to use PROC PRINT followed by a WHERE statement. Just as you used an IF statement in a DATA step in the previous section, you can use a WHERE statement in a similar manner with PROC PRINT and avoid having to write a DATA step altogether. For example, to list the ID's with invalid GENDER values, you could write a program like the one shown next.

Program: Using PROC PRINT to List Invalid Character Values

   PROC PRINT DATA=CLEAN.PATIENTS;
      TITLE "LISTING OF INVALID GENDER VALUES";
      WHERE GENDER NOT IN ('M' 'F' ' ');
      ID PATNO;
      VAR GENDER;
   RUN;

It's easy to forget that WHERE statements can be used within SAS procedures. SAS programmers who have been at it for a long time (like the author) often write a short DATA step first and use PUT statements or create a temporary SAS data set and follow it with a PROC PRINT. The program above is both shorter and more efficient than a DATA step followed by a PROC PRINT. DATA _NULL_ steps, however, tend to be fairly efficient and are a reasonable alternative, as well as the more flexible approach.
The output from the PROC PRINT program follows.

   LISTING OF INVALID GENDER VALUES

   PATNO    GENDER
   003      X
   010      f
   013      2
   023      f
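For contrast, the older two-step habit that the WHERE statement replaces (a DATA step that creates a temporary data set, followed by PROC PRINT) might look like the sketch below. It assumes the CLEAN libref and the PATIENTS data set created earlier; the temporary data set name INVALID is hypothetical.

```sas
***Step 1: copy the observations with invalid GENDER values to a temporary data set;
DATA INVALID;
   SET CLEAN.PATIENTS;
   IF GENDER NOT IN ('M' 'F' ' ');   ***Subsetting IF keeps only the invalid rows;
   KEEP PATNO GENDER;
RUN;

***Step 2: print the temporary data set;
PROC PRINT DATA=INVALID;
   TITLE "LISTING OF INVALID GENDER VALUES";
   ID PATNO;
   VAR GENDER;
RUN;
```

Both versions should produce the same listing; the single PROC PRINT with a WHERE statement simply avoids the extra pass through the data and the extra data set.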
This program can be extended to list invalid values for all the character variables.

Program: Using PROC PRINT to List Invalid Character Data for Several Variables

   PROC PRINT DATA=CLEAN.PATIENTS;
      TITLE "LISTING OF INVALID CHARACTER VALUES";
      WHERE GENDER NOT IN ('M' 'F' ' ') OR
            VERIFY(DX,' 0123456789') NE 0 OR
            AE NOT IN ('0' '1' ' ');
      ID PATNO;
      VAR GENDER DX AE;
   RUN;
The resulting output is shown next.

LISTING OF INVALID CHARACTER VALUES

PATNO
GENDER
DX
AE
002 003 004 010 013 002 023
F X F f 2 F f
X 3 5 1 1 X
0 1 A 0 0 0
Notice that this output is not as informative as the one produced by the DATA _NULL_ step shown earlier in this chapter. It lists all the patient numbers, genders, DX codes, and adverse events, even when only one of the variables has an error. So there is a trade-off: the simpler program produces slightly less desirable output. We could get philosophical and extend this concept to life in general, but that's for some other book.
Using Formats to Check for Invalid Values
Another way to check for invalid values of a character variable from raw data is to use user-defined formats. There are several possibilities here. One, you can create a format that leaves all valid character values as is and formats all invalid values to a single error code. Let's start out with a program that simply assigns formats to the character variables and uses PROC FREQ to list the number of valid and invalid codes. Following that, you will extend the program by using a DATA step to identify which ID's have invalid values. The following program uses formats to convert all invalid data values to a single value.

Program: Using a User-Defined Format and PROC FREQ to List Invalid Data Values

PROC FORMAT;
   VALUE $GENDER 'F','M' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
   VALUE $DX '001' - '999' = 'Valid'   /* See important note below */
             ' '           = 'Missing'
             OTHER         = 'Miscoded';
   VALUE $AE '0','1' = 'Valid'
             ' '     = 'Missing'
             OTHER   = 'Miscoded';
RUN;

PROC FREQ DATA=CLEAN.PATIENTS;
   TITLE "Using Formats to Identify Invalid Values";
   FORMAT GENDER $GENDER.
          DX     $DX.
          AE     $AE.;
   TABLES GENDER DX AE / NOCUM NOPERCENT MISSING;
RUN;
For the variables GENDER and AE, which have specific valid values, you list each of the valid values in the range to the left of the equal sign in the VALUE statement. Format each of these values with the value 'Valid'. For the DX format, you specify a range of values on the left side of the equal sign.

Important Note: It should be pointed out here that the range '001' - '999' will behave differently on Windows and UNIX platforms compared to MVS and CMS platforms. ... 'Miscoded' values.

The TABLES option MISSING causes the missing values to be listed in the body of the PROC FREQ output. Here is the output from PROC FREQ.

Using Formats to Identify Invalid Values

The FREQ Procedure

       Gender
GENDER      Frequency
---------------------
Missing             1
Miscoded            4
Valid              26

    Diagnosis Code
DX          Frequency
---------------------
Missing             8
Valid              21
Miscoded            2

    Adverse Event?
AE          Frequency
---------------------
Missing             1
Valid              29
Miscoded            1
This output isn't particularly useful. It doesn't tell you which observations (patient numbers) contain missing or invalid values. Let's modify the program by adding a DATA step, so that ID's with invalid character values are listed.

Program: Using a User-Defined Format and a DATA Step to List Invalid Data Values

PROC FORMAT;
   VALUE $GENDER 'F','M' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
   VALUE $DX '001' - '999' = 'Valid'
             ' '           = 'Missing'
             OTHER         = 'Miscoded';
   VALUE $AE '0','1' = 'Valid'
             ' '     = 'Missing'
             OTHER   = 'Miscoded';
RUN;

DATA _NULL_;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   FILE PRINT; ***Send output to the Output window;
   TITLE "Listing of Invalid Patient Numbers and Data Values";
   ***Note: We will only input those variables of interest;
   INPUT @1  PATNO  $3.
         @4  GENDER $1.
         @24 DX     $3.
         @27 AE     $1.;

   IF PUT(GENDER,$GENDER.) = 'Miscoded' THEN PUT PATNO= GENDER=;
   IF PUT(DX,$DX.) = 'Miscoded' THEN PUT PATNO= DX=;
   IF PUT(AE,$AE.) = 'Miscoded' THEN PUT PATNO= AE=;
RUN;
The heart of this program is the PUT function. To review, the PUT function is similar to the INPUT function. It takes the following form:

   character_variable = PUT(variable,format)

where character_variable is a character variable that contains the value of the variable listed as the first argument to the function, formatted by the format listed as the second argument to the function. The result of a PUT function is always a character variable, and the function is frequently used to perform numeric-to-character conversions. In this program, the first argument of the PUT function is a character variable, and the result of the PUT function for any invalid data values would be the value 'Miscoded'. Here is the output from the program.

Listing of Invalid Patient Numbers and Data Values

PATNO=002 DX=X
PATNO=003 GENDER=X
PATNO=004 AE=A
PATNO=010 GENDER=f
PATNO=013 GENDER=2
PATNO=002 DX=X
PATNO=023 GENDER=f
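To see the PUT function on its own, here is a minimal sketch that applies the $GENDER format to a few hand-typed values and also performs a simple numeric-to-character conversion. The values in the DO list and the variables STATUS, AGE, and AGE_CHAR are made up for illustration; they are not taken from PATIENTS.TXT.

```sas
PROC FORMAT;
   VALUE $GENDER 'F','M' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
RUN;

DATA _NULL_;
   LENGTH GENDER $ 1 STATUS $ 8 AGE_CHAR $ 3;
   /* Hypothetical test values, for illustration only */
   DO GENDER = 'M', 'f', ' ', 'X';
      /* PUT returns the character value produced by the format */
      STATUS = PUT(GENDER,$GENDER.);
      PUT GENDER= STATUS=;
   END;
   /* Numeric-to-character conversion with a SAS-supplied format */
   AGE = 42;
   AGE_CHAR = PUT(AGE,3.);
   PUT AGE= AGE_CHAR=;
RUN;
```

In the SAS Log you should see STATUS=Valid for M, STATUS=Missing for the blank, and STATUS=Miscoded for both f and X, because the format lists only uppercase F and M as valid.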
Using Informats to Check for Invalid Values
PROC FORMAT is also used to create informats. Remember that formats are used to control how variables look in output or how they are classified by such procedures as PROC FREQ. Informats modify the value of variables as they are read from the raw data, or they can be used with an INPUT function to create new variables in the DATA step. User-defined informats are created in much the same way as user-defined formats. Instead of a VALUE statement that creates formats, an INVALUE statement is used to create informats. The only difference between the two is that informat names can only be seven characters in length. (Note: For those curious readers, the reason is that informats and formats are both stored in the same catalog and an "@" is placed before informats to distinguish them from formats.) The following is a program that changes invalid values for GENDER and AE to missing values by using a user-defined informat.

Program: Using a User-Defined Informat to Set Invalid Data Values to Missing

*----------------------------------------------------------------*
| PROGRAM NAME: INFORM1.SAS IN C:\CLEANING                       |
| PURPOSE: TO CREATE A SAS DATA SET CALLED PATIENTS2             |
|          AND SET ANY INVALID VALUES FOR GENDER AND AE TO       |
|          MISSING, USING A USER-DEFINED INFORMAT                |
*----------------------------------------------------------------*;
LIBNAME CLEAN "C:\CLEANING";

PROC FORMAT;
   INVALUE $GEN 'F','M' = _SAME_
                OTHER   = ' ';
   INVALUE $AE  '0','1' = _SAME_
                OTHER   = ' ';
RUN;

DATA CLEAN.PATIENTS2;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @1  PATNO  $3.
         @4  GENDER $GEN1.
         @27 AE     $AE1.;
   LABEL PATNO  = "Patient Number"
         GENDER = "Gender"
         DX     = "Diagnosis Code"
         AE     = "Adverse Event?";
RUN;

PROC PRINT DATA=CLEAN.PATIENTS2;
   TITLE "Listing of Data Set PATIENTS2";
   VAR PATNO GENDER AE;
RUN;
Notice the INVALUE statements in the PROC FORMAT above. The key word _SAME_ is a SAS reserved value that does what its name implies: it leaves any of the values listed in the range specification unchanged. The key word OTHER in the subsequent line refers to any values not matching one of the previous ranges. Notice also that the informats in the INPUT statement use the user-defined informat name followed by the number of columns to be read, the same method that is used with predefined SAS informats. Output from the PROC PRINT is shown next.

Listing of Data Set PATIENTS2
Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
PATNO
GENDER
AE
001 002 003 004 XX5 006 007
M F
0 0 1
008 009 010 011 012 013 014 002 003 015 017 019 123 321 020 022 023 024 025 027 028 029 006
F M M M F M M M M F M F F M M F F M F M F F M F
0 1 0 0 0 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0
Notice that invalid values for GENDER and AE are now missing values, including the two lowercase 'f's (patient numbers 010 and 023). Let's add one more feature to this program. By using the keyword UPCASE in the informat specification, you can automatically convert the values being read to uppercase before the ranges are checked. Here are the PROC FORMAT statements rewritten to use this option.

PROC FORMAT;
   INVALUE $GEN (UPCASE) 'F'   = 'F'
                         'M'   = 'M'
                         OTHER = ' ';
   INVALUE $AE '0','1' = _SAME_
               OTHER   = ' ';
RUN;

The UPCASE option is placed in parentheses following the informat name. Notice some other changes as well.
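As a small illustration of the UPCASE option, the sketch below reads the same column twice: once with the standard $1. informat and once with the $GEN informat defined above. The three data lines and the variable name RAW are hypothetical and are only there to show the effect.

```sas
PROC FORMAT;
   INVALUE $GEN (UPCASE) 'F'   = 'F'
                         'M'   = 'M'
                         OTHER = ' ';
RUN;

DATA _NULL_;
   /* Read column 1 twice: unchanged (RAW) and through $GEN (GENDER) */
   INPUT @1 RAW    $1.
         @1 GENDER $GEN1.;
   PUT RAW= GENDER=;
DATALINES;
f
M
x
;
```

The log should show GENDER=F for the raw value f, GENDER=M for M, and a blank GENDER for x, since x does not match either range after it is converted to uppercase.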
Program: Using a User-Defined Informat with the INPUT Function

PROC FORMAT;
   INVALUE $GENDER 'F','M' = _SAME_
                   OTHER   = 'ERROR';
   INVALUE $AE     '0','1' = _SAME_
                   OTHER   = 'ERROR';
RUN;

DATA _NULL_;
   FILE PRINT;
   SET CLEAN.PATIENTS;
   IF INPUT (GENDER,$GENDER.) = 'ERROR' THEN
      PUT @1 "Error for Gender for Patient:" PATNO" Value is " GENDER;
   IF INPUT (AE,$AE.) = 'ERROR' THEN
      PUT @1 "Error for AE for Patient:" PATNO" Value is " AE;
RUN;

The advantage of this program over the previous informat program, which set invalid values to missing, is that the original values of the variables are not lost.
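If you would rather keep the result of the check in a data set than write it to a report, one possible variation is to store a 0/1 flag next to the untouched original value. This is only a sketch: the informat name $AECHK, the data set WORK.AE_FLAGS, the variable AE_ERROR, and the three data lines are all made up for illustration.

```sas
PROC FORMAT;
   INVALUE $AECHK '0','1' = _SAME_
                  OTHER   = 'ERROR';
RUN;

DATA WORK.AE_FLAGS;
   LENGTH PATNO $ 3 AE $ 1;
   INPUT PATNO $ AE $;
   /* The comparison yields 1 (true) for an invalid code, 0 otherwise. */
   /* AE itself is never changed.                                      */
   AE_ERROR = (INPUT(AE,$AECHK.) = 'ERROR');
DATALINES;
001 0
004 A
008 1
;

PROC PRINT DATA=WORK.AE_FLAGS;
   TITLE "Original AE Values with an Error Flag";
RUN;
```

Only the observation with AE equal to A should have AE_ERROR set to 1.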
Chapter 2
Checking Values of Numeric Variables

Introduction
Using PROC MEANS, PROC TABULATE, and PROC UNIVARIATE to Look for Outliers
Using PROC PRINT with a WHERE Statement to List Invalid Data Values
Using a DATA Step to Check for Invalid Values
Creating a Macro for Range Checking
Using Formats to Check for Invalid Values
Using Informats to Check for Invalid Values
Using PROC UNIVARIATE to Look for Highest and Lowest Values by Percentage
Using PROC RANK to Look for Highest and Lowest Values by Percentage
Extending PROC RANK to Look for Highest and Lowest "n" Values
Finding Another Way to Determine Highest and Lowest Values
Checking a Range Using an Algorithm Based on Standard Deviation
Macros Based on the Two Methods of Outlier Detection
Demonstrating the Difference between the Two Methods
Checking a Range Based on the Interquartile Range
Checking Ranges for Several Variables
Introduction
The techniques for checking for invalid numeric data are quite different from the techniques that you saw in the last chapter for checking character data. Although there are usually many different values a numeric variable can take on, there are several techniques that you can use to help identify data errors. One simple technique is to examine some of the largest and smallest data values for each numeric variable. If you see values such as 12 or 1200 for a systolic blood pressure (usually between 80 and 200 in healthy adults), you can be quite certain that an error was made, either in entering the data values or on the original data collection form.
There are also some internal consistency methods that can be used to identify possible data errors. If you see that most of the data values fall within a certain range of values, then any values that fall far enough outside that range may be data errors. This chapter develops programs based on these ideas.

Using PROC MEANS, PROC TABULATE, and PROC UNIVARIATE to Look for Outliers

One of the simplest ways to check for invalid numeric values is to run either PROC MEANS or PROC UNIVARIATE. By default, PROC MEANS lists the minimum and maximum values, along with the n, mean, and standard deviation. PROC UNIVARIATE is somewhat more useful in detecting invalid values, because it provides you with a listing of the five highest and five lowest values, along with graphical output (stem-and-leaf plots and box plots). Let's first look at how you can use PROC MEANS for very simple checking of numeric variables. The program below checks the three numeric variables, heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP), in the PATIENTS data set.

Program 2-1
Using PROC MEANS to Detect Invalid and Missing Values

LIBNAME CLEAN "C:\CLEANING";

PROC MEANS DATA=CLEAN.PATIENTS N NMISS MIN MAX MAXDEC=3;
   TITLE "Checking Numeric Variables in the PATIENTS Data Set";
   VAR HR SBP DBP;
RUN;
Let’s choose the options N, NMISS, MIN, MAX, and MAXDEC=3 for this procedure. The N and NMISS options report the number of nonmissing and missing observations for each variable, respectively. The MIN and MAX options list the smallest and largest nonmissing values for each variable. The MAXDEC=3 option is used so that the minimum and maximum values will be printed to three decimal places. Because HR, SBP, and DBP are supposed to be integers, you might have thought to set the MAXDEC option to 0. However, you might want to catch any data errors where a decimal point was entered by mistake.
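To make the point about MAXDEC concrete, here is a small sketch that runs PROC MEANS twice on the same made-up data, once with MAXDEC=3 and once with MAXDEC=0. The data set WORK.HR_DEMO and its five values (including a deliberately mistyped 7.4 and one missing value) are hypothetical.

```sas
/* Hypothetical heart rates: 7.4 stands in for a mistyped 74 */
DATA WORK.HR_DEMO;
   INPUT HR @@;
DATALINES;
68 72 7.4 . 88
;

PROC MEANS DATA=WORK.HR_DEMO N NMISS MIN MAX MAXDEC=3;
   TITLE "MAXDEC=3 Keeps the Stray Decimal Visible";
   VAR HR;
RUN;

PROC MEANS DATA=WORK.HR_DEMO N NMISS MIN MAX MAXDEC=0;
   TITLE "MAXDEC=0 Rounds the Displayed Minimum";
   VAR HR;
RUN;
```

With MAXDEC=3 the minimum is displayed as 7.400, which is a strong hint that a decimal point was entered by mistake. With MAXDEC=0 the same value is displayed as 7, and that hint is lost.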
Here is the output from Program 2-1.

Checking Numeric Variables in the PATIENTS Data Set

The MEANS Procedure

                                           N
Variable  Label                       N  Miss     Minimum     Maximum
----------------------------------------------------------------------
HR        Heart Rate                 28     3      10.000     900.000
SBP       Systolic Blood Pressure    27     4      20.000     400.000
DBP       Diastolic Blood Pressure   28     3       8.000     200.000
----------------------------------------------------------------------
This output is not particularly useful. It does show the number of nonmissing and missing observations along with the highest and lowest values. Inspection of the minimum and maximum values for all three variables shows that there are probably some data errors in the PATIENTS data set. If you want a slightly prettier output, you can use PROC TABULATE to accomplish the same task. For an excellent reference on PROC TABULATE, let me suggest a book written by Lauren E. Haworth, called PROC TABULATE by Example, published by SAS Institute, Cary, NC, as part of their Books by Users series. Here is the equivalent PROC TABULATE program, followed by the output. (Assume that the libref CLEAN has been previously defined in this program and in any future programs where it is not included in the program.) Program 2-2
Using PROC TABULATE to Display Descriptive Data
PROC TABULATE DATA=CLEAN.PATIENTS FORMAT=7.3;   /* ➊ */
   TITLE "Statistics for Numeric Variables";
   VAR HR SBP DBP;   /* ➋ */
   TABLES HR SBP DBP,
          N*F=7.0 NMISS*F=7.0 MEAN MIN MAX / RTSPACE=18;   /* ➌ */
   KEYLABEL N     = 'Number'   /* ➍ */
            NMISS = 'Missing'
            MEAN  = 'Mean'
            MIN   = 'Lowest'
            MAX   = 'Highest';
RUN;
The FORMAT option ➊ tells the procedure to use the numeric format 7.3 (a field width of 7 with 3 places to the right of the decimal point) for all the output, unless otherwise specified. The analysis variables HR, SBP, and DBP are listed in a VAR statement ➋. Let’s place these variables on the row dimension and the statistics along the column dimension. The TABLE option RTSPACE=18 ➌ allows for 18 spaces for all row labels, including the spaces for the lines forming the table. In addition, the format 7.0 is to be used for N and NMISS in the table. Finally, the KEYLABEL statement ➍ replaces the keywords for the selected statistics with more meaningful labels. Below is the output from PROC TABULATE.
Statistics for Numeric Variables

                           Number  Missing     Mean   Lowest  Highest
Heart Rate                     28        3  107.393   10.000  900.000
Systolic Blood Pressure        27        4  144.519   20.000  400.000
Diastolic Blood Pressure       28        3   88.071    8.000  200.000
A more useful procedure might be PROC UNIVARIATE. Running this procedure for your numeric variables yields much more information. Program 2-3
Using PROC UNIVARIATE to Look for Outliers
PROC UNIVARIATE DATA=CLEAN.PATIENTS PLOT;
   TITLE "Using PROC UNIVARIATE to Look for Outliers";
   VAR HR SBP DBP;
RUN;

The procedure option PLOT provides you with several graphical displays of the data: a stem-and-leaf plot, a box plot, and a normal probability plot. Output from this procedure is shown next. (Note: To save some space, the PROC UNIVARIATE output for the variable SBP has been omitted.)
Using PROC UNIVARIATE to Look for Outliers

The UNIVARIATE Procedure
Variable: HR (Heart Rate)

                         Moments
N                         28    Sum Weights               28
Mean              107.392857    Sum Observations        3007
Std Deviation     161.086436    Variance          25948.8399
Skewness          4.73965876    Kurtosis          23.7861582
Uncorrected SS       1023549    Corrected SS      700618.679
Coeff Variation   149.997347    Std Error Mean     30.442475

             Basic Statistical Measures
    Location                  Variability
Mean     107.3929     Std Deviation         161.08644
Median    74.0000     Variance                  25949
Mode      68.0000     Range                 890.00000
                      Interquartile Range    27.00000

           Tests for Location: Mu0=0
Test           -Statistic-     -----p Value------
Student's t    t  3.527731     Pr > |t|    0.0015
Sign           M        14     Pr >= |M|   <.0001
Signed Rank    S       203     Pr >= |S|   <.0001

Quantiles (Definition 5)
Quantile       Estimate
100% Max            900
99%                 900
95%                 210
90%                 208
75% Q3               87
50% Median           74
25% Q1               60
10%                  22
5%                   22
1%                   10
0% Min               10

       Extreme Observations
----Lowest----      ----Highest---
Value      Obs      Value      Obs
   10       23         90        8
   22       25        101        4
   22       15        208       19
   48       24        210        9
   58       20        900       22

            Missing Values
                      -----Percent Of-----
Missing                            Missing
  Value     Count     All Obs          Obs
      .         3        9.68       100.00

Stem Leaf                        #  Boxplot
   9 0                           1     *
   8
   7
   6
   5
   4
   3
   2 11                          2     *
   1 0                           1     +
   0 122566667777777888889999   24  +--0--+
     ----+----+----+----+----
 Multiply Stem.Leaf by 10**+2

Normal Probability Plot
[Plot not reproduced: HR on the vertical axis (about 50 to 950) against normal quantiles on the horizontal axis (-2 to +2).]
Using PROC UNIVARIATE to Look for Outliers

The UNIVARIATE Procedure
Variable: DBP (Diastolic Blood Pressure)

                         Moments
N                         28    Sum Weights               28
Mean              88.0714286    Sum Observations        2466
Std Deviation     37.2915724    Variance          1390.66138
Skewness          1.06190956    Kurtosis          3.67139184
Uncorrected SS        254732    Corrected SS      37547.8571
Coeff Variation    42.342418    Std Error Mean    7.04744476

             Basic Statistical Measures
    Location                  Variability
Mean     88.07143     Std Deviation          37.29157
Median   81.00000     Variance                   1391
Mode     78.00000     Range                 192.00000
                      Interquartile Range    26.00000

NOTE: The mode displayed is the smallest of 2 modes with a count of 3.

           Tests for Location: Mu0=0
Test           -Statistic-     -----p Value------
Student's t    t  12.49693     Pr > |t|    <.0001
Sign           M        14     Pr >= |M|   <.0001
Signed Rank    S       203     Pr >= |S|   <.0001

Quantiles (Definition 5)
Quantile       Estimate
100% Max            200
99%                 200
95%                 180
90%                 120
75% Q3              100
50% Median           81
25% Q1               74
10%                  64
5%                   20
1%                    8
0% Min                8

       Extreme Observations
----Lowest----      ----Highest---
Value      Obs      Value      Obs
    8       23        106       28
   20       12        120        4
   64       14        120       11
   68       27        180       10
   68        6        200       22

            Missing Values
                      -----Percent Of-----
Missing                            Missing
  Value     Count     All Obs          Obs
      .         3        9.68       100.00

Stem Leaf          #  Boxplot
  20 0             1     *
  18 0             1     *
  16
  14
  12 00            2     |
  10 0026          4  +-----+
   8 000244800     9  *--+--*
   6 488044888     9  +-----+
   4
   2 0             1     0
   0 8             1     0
     ----+----+----+----+
 Multiply Stem.Leaf by 10**+1

Normal Probability Plot
[Plot not reproduced: DBP on the vertical axis (about 10 to 210) against normal quantiles on the horizontal axis (-2 to +2).]
You certainly get lots of information from PROC UNIVARIATE, perhaps too much information. Starting off, you see some descriptive univariate statistics (hence the procedure name) for each of the variables listed in the VAR statement. Most of these statistics are not very useful in the data checking operation. The number of nonmissing observations (N), the number of observations not equal to zero (Num ^= 0), and the number of observations greater than zero (Num > 0) are probably the only items that are of interest to you at this time. One of the most important sections of the PROC UNIVARIATE output, for data checking purposes, is the section labeled “Extremes.” Here you see the five highest and five lowest values for each of your variables. For example, for the variable HR (heart rate), there are three possible data errors under the column label “Lowest” (10, 22, and 22) and three possible data errors under the column label “Highest” (208, 210, and 900). Obviously, having knowledge of reasonable values for each of your variables is essential if this information is to be of any use. Next to the listing of the highest and lowest values is the observation number containing this value. What would be more useful would be the patient or subject number you assigned to each patient. This is easily accomplished by adding an ID statement to PROC UNIVARIATE. You list the name of your identifying variable following the keyword ID. The values of this ID variable are then used in addition to the OBS column. (Note: In versions of SAS software prior to Version 7, the ID column replaced the OBS column; in versions after Version 7, both the ID column and the OBS column are displayed.) If you are running Version 7 or later of SAS software, you can include an ODS (output delivery system) statement to limit the PROC UNIVARIATE output to just the table of extreme values. 
Here are the PROC UNIVARIATE statements with an ID statement added, as well as the ODS statement to limit the output to the five highest and five lowest data values (the "Extremes").
Program 2-4
Adding an ID Statement to PROC UNIVARIATE
/******************************************************************\
| The ODS statement is valid for V7 and later.                     |
| Note that the name EXTREMEOBS may change in future SAS releases. |
| Use ODS TRACE ON; before the PROC and ODS TRACE OFF; after       |
| the PROC to obtain a list of output object names (found in       |
| the SAS Log).                                                    |
\******************************************************************/
ODS SELECT EXTREMEOBS;
PROC UNIVARIATE DATA=CLEAN.PATIENTS;
   TITLE "Using PROC UNIVARIATE to Look for Outliers";
   ID PATNO;
   VAR HR SBP DBP;
RUN;

The section of output showing the "Extremes" for the variable heart rate (HR) follows:

Using PROC UNIVARIATE to Look for Outliers

The UNIVARIATE Procedure
Variable: HR (Heart Rate)

              Extreme Observations
--------Lowest--------      --------Highest-------
Value   PATNO      Obs      Value   PATNO      Obs
   10   020         23         90                8
   22   023         25        101   004          4
   22   014         15        208   017         19
   48   022         24        210   008          9
   58   019         20        900   321         22

Missing
  Value     Count     % Count/Nobs
      .         3            10.00
Before the addition of the ID statement, we only had columns labeled VALUE and OBS. With the ID statement there is a new column labeled PATNO that contains the values of your ID variable (PATNO), making it easier to locate the original patient data and check for errors. (Note: The column, which contains the values of the ID variable, that here is labeled PATNO will be labeled ID in releases of SAS software prior to Version 7).
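The comment in Program 2-4 mentions ODS TRACE as the way to find output object names. A minimal sketch of that step is shown here, using a tiny made-up data set (WORK.HR_DEMO and its four observations are hypothetical) so that it can be run without the PATIENTS data.

```sas
/* Hypothetical data, for illustration only */
DATA WORK.HR_DEMO;
   INPUT PATNO $ HR;
DATALINES;
001 88
002 84
014 22
321 900
;

ODS TRACE ON;    /* start writing output object names to the SAS Log */
PROC UNIVARIATE DATA=WORK.HR_DEMO;
   ID PATNO;
   VAR HR;
RUN;
ODS TRACE OFF;   /* stop tracing */
```

Each output object written by the procedure gets an entry in the SAS Log that includes a Name: line. That name is what you place on the ODS SELECT statement; as the program comment warns, the names may differ between SAS releases, so it is worth checking the log rather than assuming them.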
The middle section of the PROC UNIVARIATE output for each variable contains a stem-and-leaf plot and a box plot. These two data visualizations come from an area of statistics known as exploratory data analysis (EDA). (For an excellent reference see Exploratory Data Analysis, by John Tukey, Reading, Massachusetts: Addison-Wesley.) Let's focus on the plots for the variable DBP (diastolic blood pressure). The stem-and-leaf plot can be thought of as a sideways histogram. For this variable, the diastolic blood pressures are grouped in 20-point intervals. For example, the stem labeled "8" represents the values from 80 to 99. Instead of simply placing X's or some other symbol to represent the bar in this sideways histogram, the next digit in the value is used. Thus, you see that there were three values of 80, one 82, two 84's, one 88, and two 90's. You can ignore these values and just think of the stem-and-leaf plot as a histogram, or you might be interested in the additional information that the leaf values give you. A quick examination of this plot shows that there were some abnormally low and high diastolic blood pressure values. This useful information complements the "Extremes" information. The "Extremes" section lists only the five highest and five lowest values; the stem-and-leaf plot shows all the values in your data set.

To the right of the stem-and-leaf plot is a box plot. This plot shows the mean (+ sign), the median (the dashed line between the two asterisks), and the 25th and 75th percentiles (the bottom and top of the box, respectively). The distance between the top and bottom of the box is called the interquartile range and can also be found earlier in the output, labeled "Interquartile Range." Extending from both the top and bottom of the box are whiskers and outliers. The whiskers represent data values within one-and-a-half interquartile ranges above or below the box. (Note: The EDA people call the top and bottom of the box the hinges.)

Any data values more than one-and-a-half but less than three interquartile ranges above or below the box (hinges) are represented by 0's. Two data values for DBP (8 and 20) fit this description in the DBP box plot. Finally, any data values more than three interquartile ranges above or below the top and bottom hinges are represented by asterisks. For your DBP variable, two data points, 180 and 200, fit this description.

The final graph, called a normal probability plot, is of interest to statisticians and helps determine deviations from a theoretical distribution called a normal or Gaussian distribution. The information displayed in the normal probability plot may not be useful for your data cleaning task because you are looking for data errors and are not particularly interested in whether the data are normally distributed.
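The box plot rules can be checked with a little arithmetic. The sketch below takes the DBP quartiles from the Quantiles table (Q1 = 74, Q3 = 100) and works out the one-and-a-half and three interquartile range cutoffs, then classifies the four extreme DBP values. The variable names (INNER_LOW, OUTER_HIGH, SYMBOL, and so on) are invented for this illustration and are not part of the PROC UNIVARIATE output.

```sas
DATA _NULL_;
   LENGTH SYMBOL $ 1;
   Q1 = 74;                       /* 25th percentile of DBP */
   Q3 = 100;                      /* 75th percentile of DBP */
   IQR = Q3 - Q1;                 /* 26 */
   INNER_LOW  = Q1 - 1.5*IQR;     /* 35  */
   INNER_HIGH = Q3 + 1.5*IQR;     /* 139 */
   OUTER_LOW  = Q1 - 3*IQR;       /* -4  */
   OUTER_HIGH = Q3 + 3*IQR;       /* 178 */
   PUT IQR= INNER_LOW= INNER_HIGH= OUTER_LOW= OUTER_HIGH=;

   DO DBP = 8, 20, 180, 200;
      IF DBP LT OUTER_LOW OR DBP GT OUTER_HIGH THEN SYMBOL = '*';
      ELSE IF DBP LT INNER_LOW OR DBP GT INNER_HIGH THEN SYMBOL = '0';
      ELSE SYMBOL = ' ';
      PUT DBP= SYMBOL=;
   END;
RUN;
```

The values 8 and 20 fall between -4 and 35, so they are plotted as 0's; 180 and 200 are above 178, so they are plotted as asterisks, in agreement with the box plot.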
Using PROC PRINT with a WHERE Statement to List Invalid Data Values
While PROC MEANS and PROC UNIVARIATE can be useful as a first step in data cleaning for numeric variables, they can produce large volumes of output and may not give you all the information you want, and certainly not in a concise form. One way to check each numeric variable for invalid values is to use PROC PRINT, followed by the appropriate WHERE statement. Suppose you want to check all the data for any patient having a heart rate outside the range of 40 to 100, a systolic blood pressure outside the range of 80 to 200, and a diastolic blood pressure outside the range of 60 to 120. For this example, missing values are not treated as invalid. The PROC PRINT step in Program 2-5 reports all patients with out-of-range values for heart rate, systolic blood pressure, or diastolic blood pressure. Program 2-5
Using a WHERE Statement with PROC PRINT to List Out-of-Range Data

PROC PRINT DATA=CLEAN.PATIENTS;
   WHERE (HR NOT BETWEEN 40 AND 100 AND HR IS NOT MISSING)     OR
         (SBP NOT BETWEEN 80 AND 200 AND SBP IS NOT MISSING)   OR
         (DBP NOT BETWEEN 60 AND 120 AND DBP IS NOT MISSING);
   TITLE "Out-of-Range Values for Numeric Variables";
   ID PATNO;
   VAR HR SBP DBP;
RUN;
You don’t need the parentheses in the WHERE statements because the AND operator is evaluated before the OR operator. However, because this author can never seem to remember the order of operation of Boolean operators, the parentheses were included for clarity. Extra parentheses do no harm.
The resulting output is shown next.

Out-of-Range Values for Numeric Variables

PATNO     HR    SBP    DBP
004      101    200    120
008      210      .      .
009       86    240    180
010        .     40    120
011       68    300     20
014       22    130     90
017      208      .     84
321      900    400    200
020       10     20      8
023       22     34     78
A disadvantage of this listing is that the entire observation is printed whenever one or more of the variables is outside the specified range, so in-range values appear alongside the errors. To obtain a more precise listing that shows only the data values outside the normal range, you can use a DATA step as described in the next section.

Using a DATA Step to Check for Invalid Values
A simple DATA _NULL_ step can also be used to produce a report on out-of-range values. The approach here is the same as the one described in the section of Chapter 1 that begins on page 6. Program 2-6
Using a DATA _NULL_ Step to List Out-of-Range Data Values
DATA _NULL_;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   FILE PRINT; ***Send output to the Output window;
   TITLE "Listing of Patient Numbers and Invalid Data Values";
   ***Note: We will only input those variables of interest;
   INPUT @1  PATNO $3.
         @15 HR     3.
         @18 SBP    3.
         @21 DBP    3.;
   ***Check HR;
   IF (HR LT 40 AND HR NE .) OR HR GT 100 THEN PUT PATNO= HR=;
   ***Check SBP;
   IF (SBP LT 80 AND SBP NE .) OR SBP GT 200 THEN PUT PATNO= SBP=;
   ***Check DBP;
   IF (DBP LT 60 AND DBP NE .) OR DBP GT 120 THEN PUT PATNO= DBP=;
RUN;

Here is the output from Program 2-6.

Listing of Patient Numbers and Invalid Data Values

PATNO=004 HR=101
PATNO=008 HR=210
PATNO=009 SBP=240
PATNO=009 DBP=180
PATNO=010 SBP=40
PATNO=011 SBP=300
PATNO=011 DBP=20
PATNO=014 HR=22
PATNO=017 HR=208
PATNO=321 HR=900
PATNO=321 SBP=400
PATNO=321 DBP=200
PATNO=020 HR=10
PATNO=020 SBP=20
PATNO=020 DBP=8
PATNO=023 HR=22
PATNO=023 SBP=34
Notice that a statement such as "IF HR LT 40" includes missing values because missing values are interpreted by SAS programs as the smallest possible value. Therefore, the following statement

   IF HR LT 40 OR HR GT 100 THEN PUT PATNO= HR=;

will produce a listing that includes missing heart rates as well as out-of-range values (which may be what you want).

Creating a Macro for Range Checking
Because range checking is such a common data cleaning task, it makes some sense to automate the procedure somewhat. Looking at Program 2-6, you see that the range checking lines all look similar except for the variable names and the low and high cutoff values. Even if you are not a macro expert, the following program should not be too difficult to understand. (For an excellent review of macro programming, I recommend two books in SAS Institute’s BBU series: SAS® Macro Programming Made Easy, by Michele M. Burlew, and Carpenter's Complete Guide to the SAS® Macro Language, by Art Carpenter, both published by SAS Institute, Cary, NC.)
Program 2-7
Writing a Macro to List Out-of-Range Data Values

/*---------------------------------------------------------------*
| Program Name: RANGE.SAS in C:\CLEANING                        |
| Purpose: Macro that takes lower and upper limits for a        |
|          numeric variable and an ID variable to print out     |
|          an exception report to the Output window.            |
| Arguments: DSN   - Data set name                              |
|            VAR   - Numeric variable to test                   |
|            LOW   - Lowest valid value                         |
|            HIGH  - Highest valid value                        |
|            IDVAR - ID variable to print in the exception      |
|                    report                                     |
| Example: %RANGE(CLEAN.PATIENTS,HR,40,100,PATNO)               |
*---------------------------------------------------------------*/
%MACRO RANGE(DSN,VAR,LOW,HIGH,IDVAR);
   TITLE "Listing of Invalid Patient Numbers and Data Values";
   DATA _NULL_;
      SET &DSN(KEEP=&IDVAR &VAR);   /* ➊ */
      FILE PRINT;
      IF (&VAR LT &LOW AND &VAR NE .) OR &VAR GT &HIGH THEN
         PUT "&IDVAR:" &IDVAR @18 "Variable:&VAR"
             @38 "Value:" &VAR
             @50 "out-of-range";
   RUN;
%MEND RANGE;
First, a brief explanation. A macro program is a piece of SAS code where parts of the code are substituted with variable information by the macro processor before the code is processed in the usual way by the SAS compiler. The macro in Program 2-7 is named RANGE, and it begins with a %MACRO statement and ends with a %MEND (macro end) statement. The first line of the macro contains the macro name, followed by a list of arguments. When the macro is called, the macro processor replaces each of these arguments with the values you specify. Then, in the macro program, every macro variable (that is, every variable name preceded by an ampersand (&)) is replaced by the assigned value. For example, if you want to use this macro to look for out-of-range values for heart rate in the PATIENTS data set, you would call the macro like this %RANGE(CLEAN.PATIENTS,HR,40,100,PATNO)
36
®
Cody’s Data Cleaning Techniques Using SAS Software
The macro processor will substitute these calling arguments for the &variables in the macro program. For example, line ➊ will become:

    SET CLEAN.PATIENTS(KEEP=PATNO HR);

&DSN was replaced by CLEAN.PATIENTS, &IDVAR was replaced by PATNO, and &VAR was replaced by HR. To be sure this concept is clear (and to help you understand how the macro processor works), you can call the macro with the MPRINT option turned on. This option lists the macro-generated code in the SAS Log. Here is the section of the SAS Log containing the macro-generated statements when the macro is called with the above arguments:

    MPRINT(RANGE):   TITLE "Listing of Invalid Patient Numbers and Data Values";
    MPRINT(RANGE):   DATA _NULL_;
    MPRINT(RANGE):   SET CLEAN.PATIENTS(KEEP=PATNO HR);
    MPRINT(RANGE):   FILE PRINT;
    MPRINT(RANGE):   IF (HR LT 40 AND HR NE .) OR HR GT 100 THEN PUT "PATNO:" PATNO @18 "Variable:HR" @38 "Value:" HR @50 "out-of-range";
    MPRINT(RANGE):   RUN;
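The text above mentions turning MPRINT on but does not show the statement that does it. The following is a minimal sketch, assuming a throwaway data set named TOY and a one-step macro named SHOW (both hypothetical, not part of the CLEANING files). It sets the system option, calls the macro, and turns the option off again.

```sas
/* Hypothetical toy data set, used only for this illustration */
DATA TOY;
   INPUT ID $ X;
DATALINES;
A 1
B 2
;

/* Hypothetical macro: prints one variable from a data set */
%MACRO SHOW(DSN,VAR);
   PROC PRINT DATA=&DSN;
      VAR &VAR;
   RUN;
%MEND SHOW;

OPTIONS MPRINT;     /* list macro-generated statements in the log */
%SHOW(TOY,X)
OPTIONS NOMPRINT;   /* turn the listing off again */
```

With the option on, the log should contain one MPRINT(SHOW): line for each statement the macro generates, with &DSN and &VAR already replaced by TOY and X.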
By the way, the missing semicolon at the end of the line where the macro is called is not a mistake; you don't need it. The reason is that the macro code contains a semicolon after the last RUN statement, so an extra semicolon is unnecessary. If you like, you may put one in anyway. As pointed out by Mike Zdeb, one of my reviewers, if you include the unnecessary semicolon, you can change the line to a comment by placing an asterisk at the beginning. The results from running the macro with the above calling arguments are listed next:

    Listing of Invalid Patient Numbers and Data Values

    PATNO:004        Variable:HR         Value:101   out-of-range
    PATNO:008        Variable:HR         Value:210   out-of-range
    PATNO:014        Variable:HR         Value:22    out-of-range
    PATNO:017        Variable:HR         Value:208   out-of-range
    PATNO:321        Variable:HR         Value:900   out-of-range
    PATNO:020        Variable:HR         Value:10    out-of-range
    PATNO:023        Variable:HR         Value:22    out-of-range
While this saves on programming time, it is not as efficient as a program that checks all the numeric variables in one DATA step. However, sometimes it is reasonable to sacrifice computer time for human time.
Using Formats to Check for Invalid Values
Just as you did with character values in Chapter 1, you can use user-defined formats to check for out-of-range data values. Program 2-8 uses formats to find invalid data values, based on the same ranges used in Program 2-5 in this chapter.
Program 2-8 Detecting Out-of-Range Values Using User-Defined Formats

    PROC FORMAT;
       VALUE HR_CK  40-100, . = 'OK';
       VALUE SBP_CK 80-200, . = 'OK';
       VALUE DBP_CK 60-120, . = 'OK';
    RUN;

    DATA _NULL_;
       INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
       FILE PRINT; ***Send output to the Output window;
       TITLE "Listing of Invalid Patient Numbers and Data Values";
       ***Note: We will only input those variables of interest;
       INPUT @1  PATNO $3.
             @15 HR     3.
             @18 SBP    3.
             @21 DBP    3.;
       IF PUT(HR,HR_CK.)   NE 'OK' THEN PUT PATNO= HR=;
       IF PUT(SBP,SBP_CK.) NE 'OK' THEN PUT PATNO= SBP=;
       IF PUT(DBP,DBP_CK.) NE 'OK' THEN PUT PATNO= DBP=;
    RUN;
This is a fairly simple and efficient program. The user-defined formats HR_CK., SBP_CK., and DBP_CK. all assign the formatted value 'OK' for any data value in the acceptable range. In the DATA step, the result of the PUT function is the value of the first argument (the variable to be tested) formatted by the format specified as the second calling argument of the function. For example, any value of heart rate between 40 and 100 (or missing) falls into the format range 'OK'. A value of 22 for heart rate does not fall within the range of 40 to 100 or missing and the formatted value 'OK' is not assigned. In that case, the PUT function for heart rate does not return the value 'OK' and the IF statement condition is true. The appropriate PUT statement is then executed and the invalid value is printed to the print file. Output from this program is shown next:

    Listing of Invalid Patient Numbers and Data Values

    PATNO=004 HR=101
    PATNO=008 HR=210
    PATNO=009 SBP=240
    PATNO=009 DBP=180
    PATNO=010 SBP=40
    PATNO=011 SBP=300
    PATNO=011 DBP=20
    PATNO=014 HR=22
    PATNO=017 HR=208
    PATNO=321 HR=900
    PATNO=321 SBP=400
    PATNO=321 DBP=200
    PATNO=020 HR=10
    PATNO=020 SBP=20
    PATNO=020 DBP=8
    PATNO=023 HR=22
    PATNO=023 SBP=34
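To see the PUT-function test in isolation, here is a small sketch that needs no input file. The four heart rates in the DO list are made-up test values (assumptions, not taken from PATIENTS.TXT); the format is the HR_CK format from Program 2-8.

```sas
PROC FORMAT;
   VALUE HR_CK 40-100, . = 'OK';
RUN;

DATA _NULL_;
   /* Illustrative values: too low, valid, missing, too high */
   DO HR = 22, 72, ., 101;
      IF PUT(HR,HR_CK.) NE 'OK' THEN PUT "Flagged:  " HR=;
      ELSE PUT "Accepted: " HR=;
   END;
RUN;
```

The messages go to the SAS Log because there is no FILE PRINT statement. Only 22 and 101 should be flagged; 72 and the missing value both format to 'OK'.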
Notice that patient number 27, who had a value of 'NA' for heart rate, did not appear in this listing. Why not? Well, the INPUT statement generates a missing value in its attempt to read a character value with a numeric informat. Because missing values are not treated as errors in this example, no error listing is produced for patient number 27. If you would like to include invalid character values (such as NA) as errors, you can use the internal _ERROR_ variable to check if such a value was processed by the INPUT statement. Unfortunately, the program cannot tell which variable for patient number 27 contained the invalid value. It is certainly possible to distinguish invalid character values in numeric fields from true missing values. One possible approach is to use an enhanced numeric informat. Another is to read all of the numeric variables as character data, test the values, and then convert to numeric for range checking. In the section that follows, the program demonstrates how a user-defined enhanced numeric informat can be used. A simple "work-around" for Program 2-8 is to test for any character values that were converted to missing values by using the internal variable _ERROR_, which gets set to 1 any time the input processor detects such an error. A modified version of Program 2-8, shown below, will print a notification that one or more variables for a patient had an invalid character value.
Program 2-9 Modifying the Previous Program to Detect Invalid (Character) Data Values

    DATA _NULL_;
       INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
       FILE PRINT; ***Send output to the Output window;
       TITLE "Listing of Invalid Patient Numbers and Data Values";
       ***Note: We will only input those variables of interest;
       INPUT @1  PATNO $3.
             @15 HR     3.
             @18 SBP    3.
             @21 DBP    3.;
       IF PUT(HR,HR_CK.)   NE 'OK' OR _ERROR_ GT 0 THEN PUT PATNO= HR=;
       IF PUT(SBP,SBP_CK.) NE 'OK' OR _ERROR_ GT 0 THEN PUT PATNO= SBP=;
       IF PUT(DBP,DBP_CK.) NE 'OK' OR _ERROR_ GT 0 THEN PUT PATNO= DBP=;
       IF _ERROR_ GT 0 THEN
          PUT PATNO= "had one or more invalid character values";
       ***Set the Error flag back to 0;
       _ERROR_ = 0;
    RUN;
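The _ERROR_ test can also be tried without the raw data file. The sketch below is an illustration only: it reads two made-up records with list input from DATALINES rather than the column positions used in Program 2-9, and the second record has NA where a number is expected.

```sas
DATA _NULL_;
   INPUT PATNO $ HR;
   /* _ERROR_ is set to 1 when the INPUT statement meets invalid data */
   IF _ERROR_ GT 0 THEN
      PUT PATNO= "had one or more invalid character values";
   _ERROR_ = 0;   /* set the flag back to 0, as in Program 2-9 */
DATALINES;
001 72
027 NA
;
```

Only the second record should produce the message, which is written to the SAS Log here because no FILE PRINT statement is used.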
Using Informats to Check for Invalid Values
You can accomplish the same result as the previous program by using user-defined informats. Remember that informats are used to replace values as the raw data is being read in or as the second argument in an INPUT function. Following is a program very similar to Program 2-9; however, this one uses informats.
Program 2-10 Using User-Defined Informats to Detect Out-of-Range Data Values

    PROC FORMAT;
       INVALUE HR_CK  40-100, . = 9999;
       INVALUE SBP_CK 80-200, . = 9999;
       INVALUE DBP_CK 60-120, . = 9999;
    RUN;

    DATA _NULL_;
       INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
       FILE PRINT; ***Send output to the Output window;
       TITLE "Listing of Invalid Patient Numbers and Data Values";
       ***Note: We will only input those variables of interest;
       INPUT @1  PATNO $3.
             @15 HR    HR_CK3.
             @18 SBP   SBP_CK3.
             @21 DBP   DBP_CK3.;
       IF HR  NE 9999 THEN PUT PATNO= HR=;
       IF SBP NE 9999 THEN PUT PATNO= SBP=;
       IF DBP NE 9999 THEN PUT PATNO= DBP=;
    RUN;
PROC FORMAT is used to create three informats (note the use of INVALUE statements instead of the usual VALUE statements). For the informat HR_CK, any numeric value in the range 40 to 100 or missing is assigned a value of 9999. Note that you cannot assign a character value here because the result of a numeric informat must be numeric. In this example, using the value of 9999 is a good choice because 9999 can never be a valid value for any of the variables (they are stored in three columns in the input file).
Running Program 2-10 results in the following output:

    Listing of Invalid Patient Numbers and Data Values

    PATNO=004 HR=101
    PATNO=008 HR=210
    PATNO=009 SBP=240
    PATNO=009 DBP=180
    PATNO=010 SBP=40
    PATNO=011 SBP=300
    PATNO=011 DBP=20
    PATNO=014 HR=22
    PATNO=017 HR=208
    PATNO=321 HR=900
    PATNO=321 SBP=400
    PATNO=321 DBP=200
    PATNO=020 HR=10
    PATNO=020 SBP=20
    PATNO=020 DBP=8
    PATNO=023 HR=22
    PATNO=023 SBP=34
    PATNO=027 HR=.
If you look carefully at the output from this program and the earlier program that used user-defined formats, you will notice that patient number 027 is listed here with a missing heart rate but not shown in the earlier listing. What’s going on? Inspection of the raw data shows a value of 'NA' for heart rate for patient number 27. When you used a format, the original data value of 'NA' was converted to a numeric missing value by the SAS processor (with a resulting message being written to the Log). The result of the PUT function was therefore 'OK' and the value was not flagged as invalid. When an informat was used, the value 'NA' was not in the valid range so that the value of 9999 was not assigned to heart rate and the value was flagged as invalid. If you would like to go the "extra mile," you can use an enhanced numeric informat to assign a value to any alphabetic value. With this technique, you can distinguish invalid character values from true missing values. Program 2-11 demonstrates this.
Program 2-11 Modifying the Previous Program to Detect Invalid (Character) Data Values

    PROC FORMAT;
       INVALUE HR_CK (UPCASE)
          40 - 100, . = 9999
          'A' - 'Z'   = 8888;
       INVALUE SBP_CK (UPCASE)
          80 - 200, . = 9999
          'A' - 'Z'   = 8888;
       INVALUE DBP_CK (UPCASE)
          60 - 120, . = 9999
          'A' - 'Z'   = 8888;
    RUN;

    DATA _NULL_;
       INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
       FILE PRINT; ***Send output to the Output window;
       TITLE "Listing of Invalid Patient Numbers and Data Values";
       ***Note: We will only input those variables of interest;
       INPUT @1  PATNO $3.
             @15 HR    HR_CK3.
             @18 SBP   SBP_CK3.
             @21 DBP   DBP_CK3.;
       IF HR = 8888 THEN PUT PATNO= "Invalid character value for HR";
       ELSE IF HR NE 9999 THEN PUT PATNO= HR=;
       IF SBP = 8888 THEN PUT PATNO= "Invalid character value for SBP";
       ELSE IF SBP NE 9999 THEN PUT PATNO= SBP=;
       IF DBP = 8888 THEN PUT PATNO= "Invalid character value for DBP";
       ELSE IF DBP NE 9999 THEN PUT PATNO= DBP=;
    RUN;
The UPCASE option converts any character values to uppercase before it is determined if the value fits into one of the specified ranges. Notice that the ranges for the three informats contain both numeric ranges and character ranges. This feature, called an enhanced numeric informat, is very powerful and allows programs to read a combination of numeric and character data with a single informat (see SAS Technical Report P-222, Changes and Enhancements to Base SAS Software, Release 6.07). Notice in the next output that patient number 27 is reported to have invalid character data for heart rate.
    Listing of Invalid Patient Numbers and Data Values

    PATNO=004 HR=101
    PATNO=008 HR=210
    PATNO=009 SBP=240
    PATNO=009 DBP=180
    PATNO=010 SBP=40
    PATNO=011 SBP=300
    PATNO=011 DBP=20
    PATNO=014 HR=22
    PATNO=017 HR=208
    PATNO=321 HR=900
    PATNO=321 SBP=400
    PATNO=321 DBP=200
    PATNO=020 HR=10
    PATNO=020 SBP=20
    PATNO=020 DBP=8
    PATNO=023 HR=22
    PATNO=023 SBP=34
    PATNO=027 Invalid character value for HR
Using PROC UNIVARIATE to Look for Highest and Lowest Values by Percentage
Let’s return to the problem of locating the "n" highest and "n" lowest values for each of several numeric variables in the data set. Remember that earlier in this chapter, you used PROC UNIVARIATE to list the five highest and five lowest values for your three numeric variables. First of all, this procedure prints lots of other statistics that you don't need (or want), unless you use the output delivery system to limit the output. If you are running a version of SAS software prior to Version 7 or you want to control the number of high and low values to list, you can write a custom program to give you exactly what you want. The approach is to have PROC UNIVARIATE output a data set containing the cutoff values on the lower and upper range of interest. The first program described lists the bottom and top "n" percent of the values. Next, the program is turned into a macro so that it is easier to use. Program 2-12 uses PROC UNIVARIATE to print out the bottom and top "n" percent of the data values.
Program 2-12 Using PROC UNIVARIATE to Print the Top and Bottom "n" Percent of Data Values

    ***Solution using PROC UNIVARIATE and Percentiles;
    LIBNAME CLEAN "C:\CLEANING";

    ***The two macro variables that follow define the lower and
       upper percentile cut points;

    ***Change the value in the line below to the percentile
       cut-off you want;
    %LET LOW_PER=20;                                            ➊

    ***Compute the upper cut-off value;
    %LET UP_PER= %EVAL(100 - &LOW_PER);                         ➋

    ***Choose a variable to operate on;
    %LET VAR = HR;                                              ➌

    PROC UNIVARIATE DATA=CLEAN.PATIENTS NOPRINT;                ➍
       VAR &VAR;
       ID PATNO;
       OUTPUT OUT=TMP PCTLPTS=&LOW_PER &UP_PER PCTLPRE = L_;    ➎
    RUN;

    DATA HILO;
       SET CLEAN.PATIENTS(KEEP=PATNO &VAR);                     ➏
       ***Bring in upper and lower cutoffs for variable;
       IF _N_ = 1 THEN SET TMP;                                 ➐
       IF &VAR LE L_&LOW_PER THEN DO;
          RANGE = 'LOW ';
          OUTPUT;
       END;
       ELSE IF &VAR GE L_&UP_PER THEN DO;
          RANGE = 'HIGH';
          OUTPUT;
       END;
    RUN;

    PROC SORT DATA=HILO(WHERE=(&VAR NE .));                     ➑
       BY DESCENDING RANGE &VAR;
    RUN;

    PROC PRINT DATA=HILO;
       TITLE "High and Low Values for Variables";
       ID PATNO;
       VAR RANGE &VAR;
    RUN;
Let's go through this program step by step. To make the program somewhat general, it uses several macro variables. Line ➊ assigns the lower percentile to a macro variable (LOW_PER) using a %LET statement. Line ➋ computes the upper percentile cutoff (UP_PER) by subtracting the lower percentile cutoff from 100. (Note: The %EVAL function is needed here to perform the integer arithmetic. If the value of LOW_PER was 20, the value of &UP_PER, without the %EVAL function, would be the text string "100 - 20" instead of 80.) If you look at line ➎, you see the two macro variables LOW_PER and UP_PER preceded by an ampersand (&). As discussed earlier, before the SAS processor runs any SAS program, it runs the macro processor, which processes all the macro statements and substitutes the assigned values of the macro variables. In this program, after the macro processor does its job, line ➎ reads:

    OUTPUT OUT=TMP PCTLPTS=20 80 PCTLPRE = L_;
That is, the two macro variables, &LOW_PER and &UP_PER are replaced by the values assigned by the %LET statements, 20 and 80 respectively. In line ➌, a macro variable (VAR) is assigned the value of one of the numeric variables to be checked (HR). To run this program on another numeric variable, SBP for example, you only have to change the variable name in line ➌. PROC UNIVARIATE can be used to create an output data set containing information that is normally printed out by the procedure. Because you only want the output data set and not the listing from the procedure, use the NOPRINT option as shown in line ➍. As you did before, you are supplying PROC UNIVARIATE with an ID statement so that the ID variable (PATNO in this case) will be included in the output data set. Line ➎ defines the name of the output data set and specifies the information you want it to include. The keyword OUT= names your data set (TMP) and PCTLPTS= instructs the program to create two variables; one to hold the value of the VAR variable at the 20th percentile and the other for the 80th percentile. In order for this procedure to create the variable names for these two variables, the keyword PCTLPRE= (percentile prefix) is used. Because you set the prefix to L_, the procedure creates two variables, L_20 and L_80.
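The note about %EVAL is easy to check in the SAS Log. This is a minimal sketch; the macro variable name AS_TEXT is made up for the comparison.

```sas
%LET LOW_PER = 20;
%LET AS_TEXT = 100 - &LOW_PER;          /* no %EVAL: stored as text   */
%LET UP_PER  = %EVAL(100 - &LOW_PER);   /* %EVAL: integer arithmetic  */
%PUT AS_TEXT is &AS_TEXT;
%PUT UP_PER is &UP_PER;
```

The first %PUT should write "AS_TEXT is 100 - 20" and the second "UP_PER is 80". The second value is the one needed to build the variable name L_80.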
The cut points you choose are combined with your choice of prefix to create these two variable names. The data set TMP contains only one observation and three variables, PATNO (because of the ID statement), L_20, and L_80. The value of L_20 is 58 and the value of L_80 is 88, the 20th and 80th percentile cutoffs, respectively. The remainder of the program is easier to follow. You want to add the two values of L_20 and L_80 to every observation in the original PATIENTS data set. Let's do this with a "trick." The SET statement in line ➏ brings in an observation from the PATIENTS data set, keeping only the variables PATNO and HR (because the macro variable &VAR was set to HR). Line ➐ is executed only on the first iteration of this DATA step (when _N_ is equal to 1). Because all variables brought in with a SET statement are automatically retained, the values for L_20 and L_80 are added to every observation in the data set HILO. Finally, for each observation coming in from the PATIENTS data set, the value of HR is compared to the lower and upper cutoff points defined by L_20 and L_80. If the value of HR is at or below the value of L_20, RANGE is set to the value 'LOW' and the observation is added to the data set HILO. Likewise, if the value of HR is at or above the value of L_80, RANGE is set to 'HIGH' and the observation is added to the data set HILO. Before you print out the contents of the data set HILO, you sort it first ➑ so that the low values and high values are grouped, and within these groups, the values are sorted from lowest to highest. The keyword DESCENDING is used in the first level sort so that the LOW values are listed before the HIGH values ('H' comes before 'L' in a normal ascending alphabetical sort). Within each of these two groups, the data values are listed from low to high. It would probably be nicer for the HIGH values to be listed from highest to lowest, but it would not be worth the effort. The final listing from this program is shown next.

    High and Low Values for Variables

    PATNO    RANGE     HR

    020      LOW       10
    014      LOW       22
    023      LOW       22
    022      LOW       48
    003      LOW       58
    019      LOW       58
    001      HIGH      88
    007      HIGH      88
             HIGH      90
    004      HIGH     101
    017      HIGH     208
    008      HIGH     210
    321      HIGH     900
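The "trick" of attaching a one-observation data set to every observation of another can be tried on toy data. In this sketch the data sets CUTS, VALUES, and COMBINED are hypothetical; CUTS stands in for TMP and simply hard-codes the two cutoffs reported above.

```sas
DATA CUTS;           /* one observation, playing the role of TMP */
   L_20 = 58;
   L_80 = 88;
RUN;

DATA VALUES;         /* three made-up heart rates */
   INPUT HR @@;
DATALINES;
10 72 90
;

DATA COMBINED;
   SET VALUES;
   /* Read CUTS once; its variables are retained afterward */
   IF _N_ = 1 THEN SET CUTS;
RUN;

PROC PRINT DATA=COMBINED;
RUN;
```

All three observations in COMBINED should carry L_20=58 and L_80=88, even though the second SET statement executes only once. Reading CUTS unconditionally would instead stop the DATA step after the first observation, because CUTS has only one.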
To turn the above program into a macro is actually quite straightforward. The macro version is shown in Program 2-13.
Program 2-13 Creating a Macro to List the Highest and Lowest "n" Percent of the Data Using PROC UNIVARIATE

    *---------------------------------------------------------------*
    | Program Name: HILOWPER.SAS in C:\CLEANING                     |
    | Purpose: To list the n percent highest and lowest values for  |
    |          a selected variable.                                 |
    | Arguments: DSN     - Data set name                            |
    |            VAR     - Numeric variable to test                 |
    |            PERCENT - Upper and Lower percentile cutoff        |
    |            IDVAR   - ID variable to print in the report       |
    | Example: %HILOWPER(CLEAN.PATIENTS,SBP,20,PATNO)               |
    *---------------------------------------------------------------*;
    %MACRO HILOWPER(DSN,VAR,PERCENT,IDVAR);
       ***Compute upper percentile cutoff;
       %LET UP_PER = %EVAL(100 - &PERCENT);

       PROC UNIVARIATE DATA=&DSN NOPRINT;
          VAR &VAR;
          ID &IDVAR;
          OUTPUT OUT=TMP PCTLPTS=&PERCENT &UP_PER PCTLPRE = L_;
       RUN;

       DATA HILO;
          SET &DSN(KEEP=&IDVAR &VAR);
          IF _N_ = 1 THEN SET TMP;
          IF &VAR LE L_&PERCENT THEN DO;
             RANGE = 'LOW ';
             OUTPUT;
          END;
          ELSE IF &VAR GE L_&UP_PER THEN DO;
             RANGE = 'HIGH';
             OUTPUT;
          END;
       RUN;

       PROC SORT DATA=HILO(WHERE=(&VAR NE .));
          BY DESCENDING RANGE &VAR;
       RUN;

       PROC PRINT DATA=HILO;
          TITLE "Low and High Values for Variables";
          ID &IDVAR;
          VAR RANGE &VAR;
       RUN;

       PROC DATASETS LIBRARY=WORK NOLIST;
          DELETE TMP;
          DELETE HILO;
       RUN;
       QUIT;
    %MEND HILOWPER;
The only change, besides the four macro variables, is the addition of PROC DATASETS to delete the two temporary data sets TMP and HILO. To demonstrate this macro, the three lines below call the macro to list the highest and lowest 20% of the values for heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP) in the data set PATIENTS.

    %HILOWPER(CLEAN.PATIENTS,HR,20,PATNO)
    %HILOWPER(CLEAN.PATIENTS,SBP,20,PATNO)
    %HILOWPER(CLEAN.PATIENTS,DBP,20,PATNO)
Using PROC RANK to Look for Highest and Lowest Values by Percentage
There is a simpler and more efficient way to list the highest and lowest "n" percent of the data values, that is, by using PROC RANK. The reason that the previous, more complicated program was shown is that it produces a slightly more accurate listing than the program shown in this section. PROC RANK is designed to produce a new variable (or replace the values of an existing variable) with values equal to the ranks of another variable. For example, if the variable X has values of 7, 3, 2, and 8, the equivalent ranks would be 3, 2, 1, and 4, respectively. However, PROC RANK has a very useful option (GROUPS=) that allows you to group your data values. For example, if you set GROUPS=4, the new variable that usually holds the rank values will now have values of 0, 1, 2, and 3, with those observations in group 0 being in the bottom quartile and observations in group 3 being in the top quartile. So, if you want to print out the top 20% of your data values, you set the GROUPS option to 5, each group representing 20% of your data values. The bottom 20% corresponds to the ranked variable having a value of 0, and the top 20% corresponds to the ranked variable having a value of 4. (Yes, it is odd that without the GROUPS= option ranks go from 1 to n and with the GROUPS= option, the groups go from 0 to n - 1.) Now let's see how to apply this idea to a program that will list the top and bottom "n" percent of your data values. Because you have already seen a program and macro that lists highest and lowest "n" percent of your data values, only the macro version is shown here.
Program 2-14 Creating a Macro to List the Highest and Lowest "n" Percent of the Data Using PROC RANK

    *----------------------------------------------------------------*
    | Macro Name: HI_LOW_P                                           |
    | Purpose: To list the upper and lower n% of values              |
    | Arguments: DSN     - Data set name (one- or two-level)         |
    |            VAR     - Variable to test                          |
    |            PERCENT - Upper and lower n%                        |
    |            IDVAR   - ID variable                               |
    | Example: %HI_LOW_P(CLEAN.PATIENTS,SBP,20,PATNO)                |
    *----------------------------------------------------------------*;
    %MACRO HI_LOW_P(DSN,VAR,PERCENT,IDVAR);
       ***Compute number of groups for PROC RANK;
       %LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);        ➊

       ***Value of the highest GROUP from PROC RANK, equal to the
          number of groups - 1;
       %LET TOP = %EVAL(&GRP - 1);                        ➋

       PROC FORMAT;                                       ➌
          VALUE RNK 0='Low' &TOP='High';
       RUN;

       PROC RANK DATA=&DSN OUT=NEW GROUPS=&GRP;           ➍
          VAR &VAR;
          RANKS RANGE;
       RUN;

       ***Sort and keep top and bottom n%;
       PROC SORT DATA=NEW (WHERE=(RANGE IN (0,&TOP)));    ➎
          BY &VAR;
       RUN;

       ***Produce the report;
       PROC PRINT DATA=NEW;                               ➏
          TITLE "Upper and Lower &PERCENT.% Values for %UPCASE(&VAR)";
          ID &IDVAR;
          VAR RANGE &VAR;
          FORMAT RANGE RNK.;
       RUN;

       PROC DATASETS LIBRARY=WORK NOLIST;                 ➐
          DELETE NEW;
       RUN;
       QUIT;
    %MEND HI_LOW_P;
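Before reading the explanation of the macro, it may help to see PROC RANK on the four values of X mentioned earlier (7, 3, 2, and 8). This is a small sketch; the data set names TOY, RANKED, and GROUPED and the variable names X_RANK and X_GROUP are made up for the illustration.

```sas
DATA TOY;
   INPUT X @@;
DATALINES;
7 3 2 8
;

/* Ordinary ranks: 1 for the smallest value up to n for the largest */
PROC RANK DATA=TOY OUT=RANKED;
   VAR X;
   RANKS X_RANK;
RUN;

/* Four groups: group numbers run from 0 to 3 */
PROC RANK DATA=TOY OUT=GROUPED GROUPS=4;
   VAR X;
   RANKS X_GROUP;
RUN;

PROC PRINT DATA=RANKED;
RUN;

PROC PRINT DATA=GROUPED;
RUN;
```

For X values 7, 3, 2, and 8, X_RANK should be 3, 2, 1, and 4, and X_GROUP should be 2, 1, 0, and 3. The smallest value lands in group 0 and the largest in group 3, which is what the WHERE=(RANGE IN (0,&TOP)) subset in the macro relies on.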
First, you need to compute the approximate number of groups that will correspond to the percentage you want. Line ➊ uses the %SYSEVALF function to do this computation. This function, unlike its companion %EVAL, allows noninteger arithmetic and also provides various conversions (CEIL, FLOOR, INTEGER, or BOOLEAN) for the results. The floor conversion was chosen because you would rather have the program list too many values (i.e., a smaller value for the GROUPS= option) than too few. For example, if you want the top and bottom 8% of your data values, the value of GRP would be FLOOR(100/8) = 12 and the value for TOP would be 11. It is this rounding that may produce a slightly less accurate report than the program that uses PROC UNIVARIATE. The RNK format assigns the labels 'Low' and 'High' to the lowest and highest values of the ranked variable. The key to the whole program is PROC RANK ➍, which uses the GROUPS= option to divide the data values into groups. The sort ➎ accomplishes two things: 1) It subsets the data set with the WHERE data set option, keeping only the top and bottom groups, and 2) it puts the data values in order from the smallest to the largest. All that is left to do is to print the report ➏ and delete the temporary data set ➐. Issue the following statement to see a list of the top and bottom 10% of your values for SBP (systolic blood pressure):

    %HI_LOW_P(CLEAN.PATIENTS,SBP,10,PATNO)
This produces the following output:

    Upper and Lower 10% Values for SBP

    PATNO    RANGE    SBP

    020      Low       20
    023      Low       34
    011      High     300
    321      High     400
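The group arithmetic behind this report can be checked on its own. The sketch below repeats the 8% example from the discussion of %SYSEVALF, using the same macro variable names as the macro.

```sas
%LET PERCENT = 8;
%LET GRP = %SYSEVALF(100 / &PERCENT,FLOOR);   /* 12.5 rounded down */
%LET TOP = %EVAL(&GRP - 1);                   /* highest group number */
%PUT PERCENT=&PERCENT GRP=&GRP TOP=&TOP;
```

The log line should read PERCENT=8 GRP=12 TOP=11, so the report for 8% really lists the bottom and top twelfths of the data (about 8.3% each).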
Extending PROC RANK to Look for Highest and Lowest "n" Values
Instead of listing the highest and lowest "n" percent of the data values, you might want to select the cutoffs based on the actual number of values, not the percent. This is slightly harder because you have to determine the number of observations in the data set and to compute the percentage cutoffs, given the number of values you want. To save some time (and space) only the macro version of this program is presented. It is followed by the explanation.
Program 2-15 Creating a Macro to List the Top and Bottom "n" Data Values Using PROC RANK

    *----------------------------------------------------------------*
    | Macro Name: HI_LOW_N                                           |
    | Purpose: To list N highest and lowest values (approximately)   |
    | Arguments: DSN   - Data set name (one- or two-level)           |
    |            VAR   - Variable to test                            |
    |            N     - Number of highest and lowest values         |
    |            IDVAR - ID variable                                 |
    | Example: %HI_LOW_N (CLEAN.PATIENTS,SBP,10,PATNO)               |
    *----------------------------------------------------------------*;
    %MACRO HI_LOW_N(DSN,VAR,N,IDVAR);
       ***Find the number of observations in data set;
       %LET DSID = %SYSFUNC(OPEN(&DSN));                  ➊
       %LET N_OBS = %SYSFUNC(ATTRN(&DSID,NOBS));
       %LET RETURN = %SYSFUNC(CLOSE(&DSID));

       ***Compute number of groups, from N and N_OBS;
       %LET GRP = %SYSEVALF(&N_OBS / &N,FLOOR);           ➋

       ***Continue as in the macro based on percents;
       %LET TOP = %EVAL(&GRP - 1);

       PROC FORMAT;
          VALUE RNK 0='Low' &TOP='High';
       RUN;

       PROC RANK DATA=&DSN OUT=NEW GROUPS=&GRP;
          VAR &VAR;
          RANKS RANGE;
       RUN;

       ***Sort and keep top and bottom n%;
       PROC SORT DATA=NEW (WHERE=(RANGE IN (0,&TOP)));
          BY &VAR;
       RUN;

       ***Produce the report;
       PROC PRINT DATA=NEW;
          TITLE "Approximate Highest and Lowest &N Values for %UPCASE(&VAR)";
          ID &IDVAR;
          VAR RANGE &VAR;
          FORMAT RANGE RNK.;
       RUN;

       PROC DATASETS LIBRARY=WORK NOLIST;
          DELETE NEW;
       RUN;
       QUIT;
    %MEND HI_LOW_N;
This macro is very similar to the macro in the previous section, except that the number of groups computed is based on the number of observations in the data set. If there are too many missing values for the variable of interest, the number of nonmissing values for that variable can be used instead of the number of observations in the entire data set. This number can be determined by using PROC MEANS to output the number of nonmissing values to a data set. This macro uses %SYSFUNC ➊ to open the data set, determine the number of observations, and close the data set (don't forget to close data sets!). (%SYSFUNC executes SAS language functions and returns the results to the macro facility.) The calculation of the number of groups is accomplished in line ➋. The FLOOR conversion is used so that we err on the side of too few groups (more data values listed) rather than too many. The remainder of the program is identical to Program 2-14. To demonstrate this macro, let's list the 10 highest and 10 lowest systolic blood pressure readings in the PATIENTS data set by using the following statement:

    %HI_LOW_N (CLEAN.PATIENTS,SBP,10,PATNO)

The resulting output is shown next.

    Approximate Highest and Lowest 10 Values for SBP

    PATNO    RANGE    SBP

    020      Low       20
    023      Low       34
    010      Low       40
    006      Low      102
    025      Low      102
    013      Low      108
    003      Low      112
    022      Low      114
    019      Low      118
    028      High     150
    027      High     166
    003      High     190
             High     190
    004      High     200
    009      High     240
    011      High     300
    321      High     400
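The three %SYSFUNC lines at ➊ can be run outside the macro to confirm what they return. In this sketch TOY is a hypothetical seven-observation data set; OPEN, ATTRN, and CLOSE are the same functions the macro uses.

```sas
DATA TOY;
   DO I = 1 TO 7;
      OUTPUT;
   END;
RUN;

%LET DSID   = %SYSFUNC(OPEN(TOY));            /* data set identifier      */
%LET N_OBS  = %SYSFUNC(ATTRN(&DSID,NOBS));    /* number of observations   */
%LET RETURN = %SYSFUNC(CLOSE(&DSID));         /* return code, 0 if closed */
%PUT TOY has &N_OBS observations;
```

The log should show "TOY has 7 observations". Note that the count covers every observation in the data set, including those with a missing value for the variable being ranked.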
Notice that there are nine low values and eight high values. This discrepancy from the number we selected will be less in larger data sets, but it still may not be exact because of rounding errors and the way that groups are selected when the number of data values is not an exact multiple of the number of groups. Also, PROC RANK assigns ranks only to nonmissing values. Program 2-17 and Program 2-18 both provide listings of exactly "n" highest and lowest values, but at the expense of processing time. If you have large numbers of missing values for variables in your data set, there is an alternative and more CPU-intensive method to determine the number of nonmissing values instead of the number of observations in the data set, as shown in Program 2-16. Just substitute the following lines for the three %SYSFUNC lines in the previous macro.
Program 2-16 Determining the Number of Nonmissing Observations in a Data Set

    ***Find the number of nonmissing observations in data set;
    PROC MEANS DATA=&DSN NOPRINT;
       VAR &VAR;
       OUTPUT OUT=TMP N=NONMISS;
    RUN;

    DATA _NULL_;
       SET TMP;
       ***Assign the value of NONMISS to the macro variable N_OBS;
       CALL SYMPUT("N_OBS",NONMISS);
    RUN;
If you use this code, add the data set TMP to the list of data sets to delete at the end of the macro. If you use this method of determining the number of nonmissing observations in your data set, the same macro call that was used in Program 2-15 produces output with 13 low values and 14 high values. (The computation of the percentiles gives a larger value if the denominator is the number of observations with nonmissing values rather than the total number of observations.)
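One detail of Program 2-16 is worth seeing in the log: CALL SYMPUT is handed a numeric variable, so SAS converts it to text automatically and the stored value carries leading blanks. This does no harm inside %SYSEVALF. The sketch below uses a made-up count and made-up macro variable names; CALL SYMPUTX is shown only as an alternative on the assumption that you are running SAS 9 or later, where that routine is available. It is not part of the original macro.

```sas
DATA _NULL_;
   NONMISS = 27;   /* illustrative count, not taken from PATIENTS */
   CALL SYMPUT("N_RAW",NONMISS);     /* automatic numeric-to-character */
   CALL SYMPUTX("N_TRIM",NONMISS);   /* SAS 9+: blanks stripped        */
RUN;

%PUT [&N_RAW] [&N_TRIM];
```

The brackets make the padding visible: the first value should appear right-aligned with leading blanks, the second as just 27.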
Finding Another Way to Determine Highest and Lowest Values
There is usually more than one way to solve any SAS problem. Here is another approach to listing the 10 highest and 10 lowest values for a variable in a SAS data set. The advantage of this program is that it always gives you exactly 10 high and 10 low values. The program is presented first, followed by a macro version of it.
Program 2-17 Listing the Highest and Lowest "n" Values Using PROC SORT

    LIBNAME CLEAN "C:\CLEANING";

    %LET VAR = HR;          ***Assign values to two macro variables;
    %LET IDVAR = PATNO;

    PROC SORT DATA=CLEAN.PATIENTS(KEEP=&IDVAR &VAR          ➊
                                  WHERE=(&VAR NE .))
              OUT=TMP;
       BY &VAR;
    RUN;

    DATA _NULL_;
       TITLE "Ten Highest and Ten Lowest Values for &VAR";
       SET TMP NOBS=NUM_OBS;                                 ➋
       HIGH = NUM_OBS - 9;                                   ➌
       FILE PRINT;
       IF _N_ LE 10 THEN DO;                                 ➍
          IF _N_ = 1 THEN PUT / "Ten Lowest Values" ;
          PUT "&IDVAR = " &IDVAR @15 &VAR;
       END;
       IF _N_ GE HIGH THEN DO;                               ➎
          IF _N_ = HIGH THEN PUT / "Ten Highest Values" ;
          PUT "&IDVAR = " &IDVAR @15 &VAR;
       END;
    RUN;
This is a simpler program than the one that uses PROC RANK. One drawback, however, is that the data set needs to be sorted each time the program is run for a different variable. This may be OK for a relatively small data set but inappropriate (and inefficient) for a large one. In this program, only the variable name and the ID variable are assigned to macro variables. To make the program as efficient as possible, a KEEP= data set option is used with PROC SORT ➊. In addition, only the nonmissing observations are placed in the sorted temporary data set TMP (because of the WHERE= data set option). The data set TMP will contain only the ID variable and the variable to be checked, in order, from lowest to highest. Therefore, the first 10 observations in this data set are the 10 lowest, nonmissing values for the variable to be checked. Use the NOBS= option in the SET statement (line ➋) to obtain the number of observations in the data set TMP. Because this data set only contains nonmissing values, the 10 highest values for your variable start with observation NUM_OBS - 9. This program uses a DATA _NULL_ and PUT statements to provide the listing of high and low values. As an alternative, you could create a temporary data set and use PROC PRINT to provide the listing. One final note: this program does not check if there are fewer than 20 nonmissing observations for the variable to be checked. That would probably be overkill. If you had that few observations, you wouldn’t really need a program at all, just a PROC PRINT! 
Running Program 2-17 on the PATIENTS data set for the heart rate variable (HR) produces the following:

    Ten Highest and Ten Lowest Values for HR

    Ten Lowest Values
    PATNO = 020   10
    PATNO = 014   22
    PATNO = 023   22
    PATNO = 022   48
    PATNO = 003   58
    PATNO = 019   58
    PATNO = 012   60
    PATNO = 123   60
    PATNO = 028   66
    PATNO = 003   68

    Ten Highest Values
    PATNO = 002   84
    PATNO = 002   84
    PATNO = 009   86
    PATNO = 001   88
    PATNO = 007   88
    PATNO =       90
    PATNO = 004   101
    PATNO = 017   208
    PATNO = 008   210
    PATNO = 321   900
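The NOBS= arithmetic is easier to follow on a data set small enough to check by eye. The following sketch is a scaled-down variant of Program 2-17 that lists the two lowest and two highest values of eight made-up numbers; the data set TOY and the choice of two are assumptions for the illustration.

```sas
DATA TOY;
   INPUT X @@;
DATALINES;
3 1 4 1 5 9 2 6
;

PROC SORT DATA=TOY;
   BY X;
RUN;

DATA _NULL_;
   SET TOY NOBS=NUM_OBS;   /* NUM_OBS is 8 for this data set */
   HIGH = NUM_OBS - 1;     /* first of the two highest: observation 7 */
   IF _N_ LE 2 THEN PUT "Low:  " X;
   IF _N_ GE HIGH THEN PUT "High: " X;
RUN;
```

The log should list 1 and 1 as the low values and 6 and 9 as the high values. In general, to list the k highest values you start at observation NUM_OBS - (k - 1), which is where the 9 in Program 2-17 comes from.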
A macro version of this program is straightforward (see Program 2-18). To make it more general, the data set name is added as a macro variable as well.

Program 2-18 Creating a Macro to List the "n" Highest and Lowest Data Values Using PROC SORT

*-------------------------------------------------------------------*
| Program Name: TEN.SAS in C:\CLEANING                              |
| Purpose: To list the 10 highest and lowest data values for        |
|          a variable in a SAS data set using DATA step processing  |
| Arguments: DSN   - Data set name                                  |
|            VAR   - Numeric variable to be checked                 |
|            IDVAR - ID variable name                               |
|                                                                   |
| Example: %TEN(CLEAN.PATIENTS,HR,PATNO)                            |
*-------------------------------------------------------------------*;
%MACRO TEN(DSN,VAR,IDVAR);
   PROC SORT DATA=&DSN(KEEP=&IDVAR &VAR WHERE=(&VAR NE .)) OUT=TMP;
      BY &VAR;
   RUN;

   DATA _NULL_;
      TITLE "Ten Highest and Ten Lowest Values for %UPCASE(&VAR)";
      SET TMP NOBS=NUM_OBS;
      HIGH = NUM_OBS - 9;
      FILE PRINT;
      IF _N_ LE 10 THEN DO;
         IF _N_ = 1 THEN PUT / "Ten Lowest Values" ;
         PUT "&IDVAR = " &IDVAR @15 "&VAR = " &VAR;
      END;
      IF _N_ GE HIGH THEN DO;
         IF _N_ = HIGH THEN PUT / "Ten Highest Values" ;
         PUT "&IDVAR = " &IDVAR @15 "&VAR = " &VAR;
      END;
   RUN;
%MEND TEN;
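The PROC PRINT alternative mentioned earlier, together with a guard for small data sets, could be sketched as follows. This is only an illustration: the macro name HILOW, the keyword parameter N=, the data set HILOW, and the variable WHICH are hypothetical names that do not appear in the original programs.

```sas
%MACRO HILOW(DSN,VAR,IDVAR,N=10);
   /* Sort the nonmissing values, exactly as the TEN macro does */
   PROC SORT DATA=&DSN(KEEP=&IDVAR &VAR WHERE=(&VAR NE .)) OUT=TMP;
      BY &VAR;
   RUN;

   /* Keep the N lowest and N highest observations. The ELSE    */
   /* keeps an observation from being listed twice when TMP has */
   /* fewer than 2*N observations.                              */
   DATA HILOW;
      SET TMP NOBS=NUM_OBS;
      LENGTH WHICH $ 7;
      IF _N_ LE &N THEN DO;
         WHICH = 'LOWEST';
         OUTPUT;
      END;
      ELSE IF _N_ GT NUM_OBS - &N THEN DO;
         WHICH = 'HIGHEST';
         OUTPUT;
      END;
   RUN;

   PROC PRINT DATA=HILOW;
      TITLE "&N Highest and &N Lowest Values for %UPCASE(&VAR)";
      BY WHICH NOTSORTED;
      ID &IDVAR;
      VAR &VAR;
   RUN;
%MEND HILOW;

%HILOW(CLEAN.PATIENTS,HR,PATNO,N=5)
```

Writing the selected observations to a data set instead of using PUT statements also makes it possible to reuse the listing in later steps.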
Checking a Range Using an Algorithm Based on Standard Deviation
One way of deciding what constitutes reasonable cutoffs for low and high data values is to use an algorithm based on the data values themselves. For example, you could decide to flag all values more than two standard deviations from the mean. However, if you had some severe data errors, the standard deviation could be so badly inflated that obviously incorrect data values might lie within two standard deviations. A possible workaround for this would be to compute the standard deviation after removing some of the highest and lowest values. For example, you could compute a standard deviation of the middle 50% of your data and use this to decide on outliers. Another popular alternative is to use an algorithm based on the interquartile range (the difference between the 25th percentile and the 75th percentile). Some programs and macros based on these ideas are presented in the next two sections.

Let’s first see how you could identify data values more than two standard deviations from the mean. You can use PROC MEANS to compute the standard deviations and a short DATA step to select the outliers, as shown in Program 2-19.

Program 2-19 Detecting Outliers Based on the Standard Deviation

LIBNAME CLEAN "C:\CLEANING";

***Output means and standard deviations to a data set;
PROC MEANS DATA=CLEAN.PATIENTS NOPRINT;
   VAR HR SBP DBP;
   OUTPUT OUT=MEANS(DROP=_TYPE_ _FREQ_)
          MEAN=M_HR M_SBP M_DBP
          STD=S_HR S_SBP S_DBP;
RUN;
DATA _NULL_;
   FILE PRINT;
   TITLE "Statistics for Numeric Variables";
   *** The number of standard deviations to list;
   %LET N_SD = 2;
   SET CLEAN.PATIENTS;
   ***Bring in the means and standard deviations;
   IF _N_ = 1 THEN SET MEANS;
   ARRAY RAW[3] HR SBP DBP;
   ARRAY MEAN[3] M_HR M_SBP M_DBP;
   ARRAY STD[3] S_HR S_SBP S_DBP;
   DO I = 1 TO DIM(RAW);
      IF RAW[I] LT MEAN[I] - &N_SD*STD[I] AND RAW[I] NE .
         OR RAW[I] GT MEAN[I] + &N_SD*STD[I] THEN PUT PATNO= RAW[I]=;
   END;
RUN;
The PROC MEANS step computes the mean and standard deviation for each of the numeric variables in your data set. To make the program more flexible, the number of standard deviations above or below the mean that you would like to report is assigned to a macro variable (N_SD). To compare each of the raw data values against the limits defined by the mean and standard deviation, you need to combine the values in the single-observation data set created by PROC MEANS with the original data set. You use the same trick you used earlier, that is, you execute a SET statement only once, when _N_ is equal to one. Because all the variables brought into the program data vector (PDV) with a SET statement are retained, these summary values will be available in each observation in the PATIENTS data set. Finally, to save some typing, three arrays were created to hold the original raw variables, the means, and the standard deviations, respectively. The IF statement at the bottom of this DATA step prints out the ID variable and the raw data value for any value above or below the designated cutoff.
The results of running this program on the PATIENTS data set with N_SD set to two follow:

Statistics for Numeric Variables
PATNO=009 DBP=180
PATNO=011 SBP=300
PATNO=321 HR=900
PATNO=321 SBP=400
PATNO=321 DBP=200
PATNO=020 DBP=8
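The one-time SET trick used in Program 2-19 is easier to see in isolation. The sketch below is a minimal illustration with a made-up three-observation data set; SCORES, STATS, DEV, M_X, and DIFF are hypothetical names, not part of the PATIENTS programs.

```sas
DATA SCORES;
   INPUT ID $ X;
DATALINES;
A 10
B 20
C 60
;

/* One-observation data set holding the mean of X */
PROC MEANS DATA=SCORES NOPRINT;
   VAR X;
   OUTPUT OUT=STATS(DROP=_TYPE_ _FREQ_) MEAN=M_X;
RUN;

DATA DEV;
   SET SCORES;
   /* Read STATS only on the first iteration. M_X is retained, */
   /* so it is still in the PDV for every later observation.   */
   IF _N_ = 1 THEN SET STATS;
   DIFF = X - M_X;
RUN;

PROC PRINT DATA=DEV;
RUN;
```

Because variables read with SET are retained, M_X (30 for these values) is available on all three observations, so DIFF comes out as -20, -10, and 30. Without the IF _N_ = 1 condition, the second SET would hit the end of STATS on the second iteration and stop the DATA step early.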
How would you go about computing cutoffs based on the middle 50% of your data? Calculating a mean and standard deviation on the middle 50% of the data (called trimmed statistics by robust statisticians, and I know some statisticians who are very robust!) is easy if you first use PROC RANK (with a GROUPS= option) to identify quartiles, and then use this information in a subsequent PROC MEANS step to compute the mean and standard deviation of the middle 50% of your data. Your decision on how many of these trimmed standard deviation units should be used to define outliers is somewhat of a trial-and-error process. Obviously (well, maybe not that obviously), the standard deviation computed on the middle 50% of your data will be smaller than the standard deviation computed from all of your data if you have outliers. The difference between the two will be even larger if there are some dramatic outliers in your data. (This will be demonstrated later in this section.) As an approximation, if your data are normally distributed, the trimmed standard deviation is approximately 2.65 times smaller than the untrimmed value. So, if your original cutoff was plus or minus two standard deviations, you might choose 5.25 or 5.3 trimmed standard deviations as your cutoff. What follows is a program that computes trimmed statistics and uses them to identify outliers.

Program 2-20 Detecting Outliers Based on a Trimmed Mean

PROC RANK DATA=CLEAN.PATIENTS OUT=TMP GROUPS=4;
   VAR HR;
   RANKS R_HR;
RUN;

PROC MEANS DATA=TMP NOPRINT;
   WHERE R_HR IN (1,2); ***The middle 50%;
   VAR HR;
   OUTPUT OUT=MEANS(DROP=_TYPE_ _FREQ_) MEAN=M_HR STD=S_HR;
RUN;
DATA _NULL_;
   TITLE "Outliers Based on Trimmed Standard Deviation";
   FILE PRINT;
   %LET N_SD = 5.25;
   ***The value of 5.25 computed from the trimmed mean is
      approximately equivalent to the 2 standard deviations
      you used before, computed from all the data. Set this
      value approximately 2.65 times larger than the number
      of standard deviations you would compute from untrimmed
      data;
   SET CLEAN.PATIENTS;
   IF _N_ = 1 THEN SET MEANS;
   IF HR LT M_HR - &N_SD*S_HR AND HR NE .
      OR HR GT M_HR + &N_SD*S_HR THEN PUT PATNO= HR=;
RUN;
There is one slight complication here, compared to the earlier nontrimmed version of the program. The middle 50% of the observations can be different for each of the numeric variables you want to test. So, if you want to run the program for several variables, it would be convenient to assign to a macro variable the name of the numeric variable that will be tested. This is done next, but first, a brief explanation of the program.

PROC RANK is used with the GROUPS= option to create a new variable (R_HR), which will have values of 0, 1, 2, or 3, depending on the quartile in which the value lies. Because you want both the original value for HR and the rank value, use a RANKS statement, which allows you to give a new name to the variable that will hold the rank of the variable listed in the VAR statement. All that is left to do is to run PROC MEANS as you did before, except that a WHERE statement selects the middle 50% of the data values. What follows is the same as Program 2-19, except that arrays are not needed because you can only process one variable at a time. Finally, here is the output from Program 2-20.

Outliers Based on Trimmed Standard Deviation
PATNO=008 HR=210
PATNO=014 HR=22
PATNO=017 HR=208
PATNO=321 HR=900
PATNO=020 HR=10
PATNO=023 HR=22
Notice that the method based on a nontrimmed standard deviation reported only one HR as an outlier (PATNO=321, HR=900) while the method based on a trimmed mean identified six values. The reason? The heart rate value of 900 inflated the nontrimmed standard deviation so much that none of the other values fell outside two standard deviations.
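The factor of roughly 2.65 can be checked with a short calculation. The sketch below assumes the data follow a normal distribution and treats the middle 50% as a standard normal variable Z truncated at its quartiles, plus or minus a, with a approximately 0.6745. The symbols a, \varphi (standard normal density), and \Phi (standard normal distribution function) are notation introduced here for illustration and do not come from the programs above.

```latex
\operatorname{Var}(Z \mid -a \le Z \le a)
  = 1 - \frac{2a\,\varphi(a)}{2\Phi(a) - 1}
  \approx 1 - \frac{2(0.6745)(0.3178)}{0.5}
  \approx 0.1426,
\qquad
\sigma_{\text{trim}} \approx \sqrt{0.1426}\,\sigma \approx 0.378\,\sigma,
\qquad
\frac{\sigma}{\sigma_{\text{trim}}} \approx 2.65 .
```

Under these assumptions, a cutoff of two untrimmed standard deviations corresponds to about 2 x 2.65 = 5.3 trimmed standard deviations, close to the 5.25 used in Program 2-20. For small samples or data that are far from normal, the ratio can differ noticeably.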
Macros Based on the Two Methods of Outlier Detection
It is straightforward to turn each of the above programs into more general purpose macros. First, here is a macro that detects outliers based on a standard deviation computed from all the data.

Program 2-21 Creating a Macro to Detect Outliers Based on a Standard Deviation

*-------------------------------------------------------------------*
| Program Name: SD_ALL.SAS in C:\CLEANING                           |
| Purpose: To identify outliers based on n standard deviations      |
|          from the mean.                                           |
| Arguments: DSN   - Data set name                                  |
|            VAR   - Numeric variable to be checked                 |
|            IDVAR - ID variable name                               |
|            N_SD  - The number of standard deviation units for     |
|                    declaring an outlier                           |
|                                                                   |
| Example: %SD_ALL(CLEAN.PATIENTS,HR,PATNO,2)                       |
*-------------------------------------------------------------------*;
%MACRO SD_ALL(DSN,VAR,IDVAR,N_SD);
   TITLE1 "Outliers for Variable &VAR Data Set &DSN";
   TITLE2 "Based on &N_SD Standard Deviations";
   PROC MEANS DATA=&DSN NOPRINT;
      VAR &VAR;
      OUTPUT OUT=MEANS(DROP=_TYPE_ _FREQ_) MEAN=M STD=S;
   RUN;

   DATA _NULL_;
      FILE PRINT;
      SET &DSN;
      IF _N_ = 1 THEN SET MEANS;
      IF &VAR LT M - &N_SD*S AND &VAR NE .
         OR &VAR GT M + &N_SD*S THEN PUT &IDVAR= &VAR=;
   RUN;
   PROC DATASETS LIBRARY=WORK NOLIST;
      DELETE MEANS;
   RUN;
   QUIT;
%MEND SD_ALL;
Next, a macro is shown that detects outliers based on computing the mean and standard deviation from the middle 50% of the data values. It can be easily modified to use more or less data by adjusting the GROUPS= option in PROC RANK and modifying the WHERE statement in the PROC MEANS step as appropriate.

Program 2-22 Creating a Macro to Detect Outliers Based on a Trimmed Mean

*-------------------------------------------------------------------*
| Program Name: SD_TRIM.SAS in C:\CLEANING                          |
| Purpose: To identify outliers based on n standard deviations      |
|          from the mean, computed from the middle 50% of the data. |
| Arguments: DSN   - Data set name                                  |
|            VAR   - Numeric variable to be checked                 |
|            IDVAR - ID variable name                               |
|            N_SD  - The number of standard deviation units you     |
|                    would specify if the data values were not      |
|                    trimmed.                                       |
|                                                                   |
| EXAMPLE: %SD_TRIM(CLEAN.PATIENTS,HR,PATNO,2)                      |
*-------------------------------------------------------------------*;
%MACRO SD_TRIM(DSN,VAR,IDVAR,N_SD);
   TITLE1 "Outliers for Variable &VAR Data Set &DSN";
   TITLE2 "Based on &N_SD Standard Deviations Estimated from Trimmed (50%) Data";
   PROC RANK DATA=&DSN OUT=TMP GROUPS=4;
      VAR &VAR;
      RANKS R;
   RUN;

   PROC MEANS DATA=TMP NOPRINT;
      WHERE R IN (1,2); ***The middle 50%;
      VAR &VAR;
      OUTPUT OUT=MEANS(DROP=_TYPE_ _FREQ_) MEAN=M STD=S;
   RUN;
   DATA _NULL_;
      FILE PRINT;
      SET &DSN;
      IF _N_ = 1 THEN SET MEANS;
      IF &VAR LT M - &N_SD*S*2.65 AND &VAR NE .
         OR &VAR GT M + &N_SD*S*2.65 THEN PUT &IDVAR= &VAR=;   /* ➊ */
   RUN;

   PROC DATASETS LIBRARY=WORK NOLIST;
      DELETE MEANS;
      DELETE TMP;
   RUN;
   QUIT;
%MEND SD_TRIM;
Notice that in line ➊ of the above macro, the value 2.65 is the estimated factor by which you need to inflate the trimmed standard deviation to estimate the untrimmed standard deviation for normally distributed data.

Demonstrating the Difference between the Two Methods
To show the difference between these two methods of outlier detection, look at the following small data set:

DATA TRIM;
   INPUT X @@;
   PATNO + 1;
DATALINES;
1.02 1.06 1.23 2.00 1.09 1.15 1.23 1.33 1.99 1.11
1.45 156 4.88 2.11 1.54 1.64 1.73 1.19 1.21 1.29
;
First, a brief comment on the program. Ordinarily, SAS goes to a new line for each INPUT statement in a DATA step and for each iteration of the DATA step. The double at sign (@@) prevents this from happening and allows you to place data for multiple observations on a single line. This is a convenient way to save some space. Next, remember that the SAS statement PATNO + 1;
does several things. First, it initializes PATNO at 0. Second, PATNO is automatically retained, and third, each time the statement executes, PATNO is incremented by 1.
There are 20 values of X in the data set TRIM. One of the values, 156, is a data error and should have been 1.56. This type of error is not all that uncommon. The other suspicious value is 4.88. This may or may not be a data error. You would probably want your data checking program to flag the 4.88 value so that it could be checked. Using the following two lines to run both macros,

%SD_ALL(TRIM,X,PATNO,2)
%SD_TRIM(TRIM,X,PATNO,2)
the resulting output is

Outliers for Variable X Data Set TRIM
Based on Two Standard Deviations
PATNO=12 X=156

Outliers for Variable X Data Set TRIM
Based on Two Standard Deviations Estimated from Trimmed (50%) Data
PATNO=12 X=156
PATNO=13 X=4.88
The program based on the standard deviation of all the values, with a cutoff of two standard deviations, only lists the value of 156. Why didn’t the program identify 4.88 as an outlier? Because of the 156, the mean and standard deviation, using all 20 of the data values, were 9.31 and 34.54, respectively. The single value being approximately 100 times larger than the other values grossly inflated the mean and standard deviation. The value of 4.88 is less than one standard deviation from the sample mean. The mean and standard deviation of the trimmed data set are 1.38 and .195, respectively. Using these values, the value of 4.88 is easily identified as a possible outlier.

Checking a Range Based on the Interquartile Range
Yet another way to look for outliers is a method devised by advocates of exploratory data analysis (EDA). This is a robust method, much like the previously described method based on a trimmed mean. It uses the interquartile range (the distance from the 25th percentile to the 75th percentile) and defines an outlier as a value that lies more than a chosen multiple of the interquartile range above the upper hinge or below the lower hinge. For those not familiar with EDA terminology, the lower hinge is the value corresponding to the 25th percentile (the
value below which 25% of the data values lie). The upper hinge is the value corresponding to the 75th percentile. For example, you may want to examine any data values more than two interquartile ranges above the upper hinge or below the lower hinge. This is an attractive method because it is independent of the distribution of the data values. An easy way to determine the interquartile range and the upper and lower hinges is to use PROC UNIVARIATE to output these quantities. Presented next is a macro, which is similar to the one in the previous section, but this one uses the number of interquartile ranges instead of an estimate of the standard deviation.

Program 2-23 Detecting Outliers Based on the Interquartile Range

*-------------------------------------------------------------------*
| Program Name: INTER_Q.SAS in C:\CLEANING                          |
| Purpose: To identify outliers based on n interquartile ranges     |
| Arguments: DSN   - Data set name                                  |
|            VAR   - Numeric variable to be checked                 |
|            IDVAR - ID variable name                               |
|            N_IQR - The number of interquartile ranges above or    |
|                    below the upper and lower hinge (75th and      |
|                    25th percentile points) to declare a value     |
|                    an outlier.                                    |
|                                                                   |
| Example: %INTER_Q(CLEAN.PATIENTS,HR,PATNO,2)                      |
*-------------------------------------------------------------------*;
%MACRO INTER_Q(DSN,VAR,IDVAR,N_IQR);
   PROC UNIVARIATE DATA=&DSN NOPRINT;   /* ➊ */
      VAR &VAR;
      OUTPUT OUT=TMP Q3=UPPER Q1=LOWER QRANGE=IQR;
   RUN;
   DATA _NULL_;
      TITLE "Outliers Based on &N_IQR Interquartile Ranges";
      FILE PRINT;
      SET &DSN;
      IF _N_ = 1 THEN SET TMP;
      IF &VAR LT LOWER - &N_IQR*IQR AND &VAR NE .
         OR &VAR GT UPPER + &N_IQR*IQR THEN PUT &IDVAR= &VAR=;   /* ➋ */
   RUN;
   PROC DATASETS LIBRARY=WORK NOLIST;
      DELETE TMP;
   RUN;
   QUIT;
%MEND INTER_Q;
Use PROC UNIVARIATE to output the values of the 25th and 75th percentiles to a data set ➊. In the DATA _NULL_ step that follows, any values more than "n" interquartile ranges (the macro variable N_IQR) below the lower hinge or above the upper hinge are flagged as errors and reported ➋. To demonstrate this macro, the calling sequence below checks for outliers more than two interquartile ranges above or below the upper or lower hinge, respectively. The calling statement is

%INTER_Q(CLEAN.PATIENTS,HR,PATNO,2)

with the resulting output shown next.

Outliers Based on Two Interquartile Ranges
PATNO=008 HR=210
PATNO=017 HR=208
PATNO=321 HR=900
The same macro, used on the data set TRIM (created earlier in this chapter) with the number of interquartile ranges set at two, results in the output shown next.

Outliers Based on Two Interquartile Ranges
PATNO=12 X=156
PATNO=13 X=4.88
Notice that both the values 4.88 and 156 are identified in this method.
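To see why both values are caught, the hinges and cutoffs for TRIM can be printed directly. The following is a minimal sketch rather than part of the INTER_Q macro; the data set name HINGES and the variables LOW_CUT and HIGH_CUT are hypothetical, and the DATA TRIM step is repeated only so the example is self-contained.

```sas
DATA TRIM;
   INPUT X @@;
   PATNO + 1;
DATALINES;
1.02 1.06 1.23 2.00 1.09 1.15 1.23 1.33 1.99 1.11
1.45 156 4.88 2.11 1.54 1.64 1.73 1.19 1.21 1.29
;

/* Output the hinges and the interquartile range for X */
PROC UNIVARIATE DATA=TRIM NOPRINT;
   VAR X;
   OUTPUT OUT=HINGES Q1=LOWER Q3=UPPER QRANGE=IQR;
RUN;

DATA _NULL_;
   SET HINGES;
   /* Cutoffs two interquartile ranges beyond each hinge */
   LOW_CUT  = LOWER - 2*IQR;
   HIGH_CUT = UPPER + 2*IQR;
   FILE PRINT;
   PUT LOWER= UPPER= IQR= LOW_CUT= HIGH_CUT=;
RUN;
```

With the default percentile definition, the hinges should come out near 1.17 and 1.86, which puts the upper cutoff at about 3.24. Both 4.88 and 156 lie above it, which agrees with the macro output, and unlike the untrimmed standard deviation, the cutoff is barely affected by the value 156.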
Checking Ranges for Several Variables
In this final section of this chapter, the range checking macro developed earlier in this chapter is expanded to do two things: one, make the macro more flexible so that it can treat missing values as either valid or invalid; and two, allow the macro to be called multiple times with different numeric variables and produce one consolidated report when finished. The macro is listed first, followed by a step-by-step explanation.

Program 2-24 Writing a Program to Summarize Data Errors on Several Variables

/*--------------------------------------------------------------*
| PROGRAM NAME: ERRORSN.SAS IN C:\CLEANING                      |
| PURPOSE: Accumulates errors for numeric variables in a SAS    |
|          data set for later reporting.                        |
|          This macro can be called several times with a        |
|          different variable each time. The resulting errors   |
|          are accumulated in a temporary SAS data set called   |
|          ERRORS.                                              |
| ARGUMENTS: DSN   - SAS data set name (assigned with a %LET)   |
|            IDVAR - ID variable (assigned with a %LET)         |
|                                                               |
|            VAR   - The variable name to test                  |
|            LOW   - Lowest valid value                         |
|            HIGH  - Highest valid value                        |
|            M     - Missing value flag. If=1 count missing     |
|                    values as invalid, =0, missing values OK   |
|                                                               |
| EXAMPLE: %LET DSN = CLEAN.PATIENTS;                           |
|          %LET IDVAR = PATNO;                                  |
|          %ERRORSN(HR,40,100,1)                                |
|          %ERRORSN(SBP,80,200,0)                               |
|          %ERRORSN(DBP,60,120,0)                               |
|          Test the numeric variables HR, SBP, and DBP in the   |
|          data set CLEAN.PATIENTS for data outside the ranges  |
|          40 to 100, 80 to 200, and 60 to 120, respectively.   |
|          The ID variable is PATNO and missing values are to   |
|          be flagged as invalid for HR but not for SBP or DBP. |
*--------------------------------------------------------------*/
LIBNAME CLEAN "C:\CLEANING";
%LET DSN=CLEAN.PATIENTS;   ***Define Data set name and;   /* ➊ */
%LET IDVAR=PATNO;          ***ID variable;

%MACRO ERRORSN(VAR,LOW,HIGH,M);   /* ➋ */
   DATA TMP;
      SET &DSN(KEEP=&IDVAR &VAR);   /* ➌ */
      LENGTH REASON $ 10 VARIABLE $ 8;   /* ➍ */
      VARIABLE = "&VAR";
      VALUE = &VAR;   /* ➎ */
      IF &VAR LT &LOW AND &VAR NE . THEN DO;   /* ➏ */
         REASON='LOW';
         OUTPUT;
      END;
      ELSE IF &VAR EQ . AND &M THEN DO;   /* ➐ */
         REASON='MISSING';
         OUTPUT;
      END;
      ELSE IF &VAR GT &HIGH THEN DO;   /* ➑ */
         REASON='HIGH';
         OUTPUT;
      END;
      DROP &VAR;
   RUN;

   PROC APPEND BASE=ERRORS DATA=TMP;   /* ➒ */
   RUN;
   TITLE "Listing Of Errors In Data Set &DSN";
%MEND ERRORSN;

***Error Reporting Macro - to be run after ERRORSN has been called
   as many times as desired for each numeric variable to be tested;

%MACRO E_REPORT;   /* ➓ */
   PROC SORT DATA=ERRORS;   /* 11 */
      BY &IDVAR;
   RUN;

   PROC PRINT DATA=ERRORS;
      TITLE "Error Report for Data Set &DSN";
      ID &IDVAR;
      VAR VARIABLE VALUE REASON;
   RUN;

   PROC DATASETS LIBRARY=WORK NOLIST;   /* 12 */
      DELETE ERRORS;
      DELETE TMP;
   RUN;
   QUIT;
%MEND E_REPORT;
To avoid having to enter the data set name and the ID variable each time this macro is called, the two macro variables DSN and IDVAR are assigned with %LET statements ➊. Calling arguments to the macro ➋ are the name of the numeric variable to be tested, the lower and upper valid values for this variable, and a variable to determine if missing values are to be listed in the error report or not. To keep the macro somewhat efficient, only the variable in question and the ID variable are added to the TMP data set because of the KEEP= data set option in line ➌. The variables REASON and VARIABLE ➍ hold values for why the observation was selected and the name of the variable being tested. Because the name of the numeric variable to be tested changes each time the macro is called, a variable called VALUE ➎ is assigned the value of the numeric variable. The range checking is accomplished in lines ➏ and ➑. Line ➐ reports missing values as invalid if the macro variable M is set to 1, otherwise missing values are not treated as errors. Finally, each error found is added to the temporary data set ERRORS by using PROC APPEND ➒. This is the most efficient method of adding observations to an existing SAS data set. Each time the ERRORSN macro is called, all the invalid observations will be added to the ERRORS data set. The second macro, E_REPORT ➓, is a macro that should be called once after the ERRORSN macro has been called for each of the desired numeric variable range checks. The E_REPORT macro is simple. It sorts the ERRORS data set by the ID variable, so that all errors for a particular ID will be grouped together 11 . Finally, as you have done in the past, use PROC DATASETS 12 to clean up the WORK data sets that were created.
To demonstrate how these two macros work, the ERRORSN macro is called three times, for the variables heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP), respectively. For the HR variable, you want missing values to appear in the error report; for the other two variables, you do not want missing values listed as errors. Here is the calling sequence:

***Calling the ERRORSN macro;
LIBNAME CLEAN "C:\CLEANING";
%LET DSN = CLEAN.PATIENTS;   ***Set two macro variables;
%LET IDVAR = PATNO;
%ERRORSN(HR,40,100,1)
%ERRORSN(SBP,80,200,0)
%ERRORSN(DBP,60,120,0)
***Generate the report;
%E_REPORT
And finally, the report that is produced:

Error Report for Data Set CLEAN.PATIENTS

PATNO   VARIABLE   VALUE   REASON
004     HR           101   HIGH
008     HR           210   HIGH
009     SBP          240   HIGH
009     DBP          180   HIGH
010     HR             .   MISSING
010     SBP           40   LOW
011     SBP          300   HIGH
011     DBP           20   LOW
014     HR            22   LOW
017     HR           208   HIGH
020     HR            10   LOW
020     SBP           20   LOW
020     DBP            8   LOW
023     HR            22   LOW
023     SBP           34   LOW
027     HR             .   MISSING
029     HR             .   MISSING
321     HR           900   HIGH
321     SBP          400   HIGH
321     DBP          200   HIGH
The clear advantage of this technique is that it provides a report that lists all the out-of-range or missing value errors for each patient, all in one place.
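Because PROC APPEND keeps adding to ERRORS, a data set left over from an interrupted earlier run (one in which E_REPORT never got to delete it) would carry old errors into the new report. One possible guard is a small start-up macro, sketched below. The name E_INIT is hypothetical and not part of the original programs, and the sketch assumes ERRORS lives in the WORK library, as it does above.

```sas
%MACRO E_INIT;
   /* Delete WORK.ERRORS only if it already exists, so that a     */
   /* fresh series of ERRORSN calls starts from an empty report.  */
   %IF %SYSFUNC(EXIST(WORK.ERRORS)) %THEN %DO;
      PROC DATASETS LIBRARY=WORK NOLIST;
         DELETE ERRORS;
      RUN;
      QUIT;
   %END;
%MEND E_INIT;

%E_INIT
```

Calling this macro once, before the first call to ERRORSN, would make the report reflect only the current series of range checks.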
3
Checking for Missing Values

Introduction
Inspecting the SAS Log
Using PROC MEANS and PROC FREQ to Count Missing Values
Using DATA Step Approaches to Identify and Count Missing Values
Using PROC TABULATE to Count Missing and Nonmissing Values for Numeric Variables
Using PROC TABULATE to Count Missing and Nonmissing Values for Character Variables
Creating a General Purpose Macro to Count Missing and Nonmissing Values for Both Numeric and Character Variables
Searching for a Specific Numeric Value
Introduction
Many data sets contain missing values. There are several ways in which missing values can enter a SAS data set. First of all, the raw data value may be missing, either intentionally or accidentally. Next, an invalid value can cause a missing value to be created. For example, reading a character value with a numeric informat will generate a missing value. Invalid dates are another common cause of SAS-generated missing values. Finally, many operations, such as assignment statements, can create missing values. This chapter investigates ways to detect and count missing values for numeric and character variables.

Inspecting the SAS Log
It is vitally important to carefully inspect the SAS Log, especially when creating a SAS data set for the first time. A log filled with messages about invalid data values is a clue that something may be wrong, either with the data or the program. If you know that a numeric field contains invalid character values, you may choose to read those data values with a character informat and to perform a character-to-numeric conversion using the INPUT function yourself. This will keep the SAS
Log cleaner and make it easier to spot unexpected errors. Let’s look at portions of the SAS Log that were generated when the PATIENTS data set was created.

1    LIBNAME CLEAN "C:\CLEANING";
NOTE: Libref CLEAN was successfully assigned as follows:
      Engine:        V7
      Physical Name: C:\CLEANING
2
3    DATA CLEAN.PATIENTS;
4       INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
5       INPUT @1  PATNO  $3.
6             @4  GENDER $1.
7             @5  VISIT  MMDDYY10.
8             @15 HR     3.
9             @18 SBP    3.
10            @21 DBP    3.
11            @24 DX     $3.
12            @27 AE     $1.;
13
14      LABEL PATNO  = "Patient Number"
15            GENDER = "Gender"
16            VISIT  = "Visit Date"
17            HR     = "Heart Rate"
18            SBP    = "Systolic Blood Pressure"
19            DBP    = "Diastolic Blood Pressure"
20            DX     = "Diagnosis Code"
21            AE     = "Adverse Event?";
22
23      FORMAT VISIT MMDDYY10.;
24
25   RUN;

NOTE: The infile "C:\CLEANING\PATIENTS.TXT" is:
      File Name=C:\CLEANING\PATIENTS.TXT,
      RECFM=V,LRECL=256

NOTE: Invalid data for VISIT in line 7 5-14.
RULE: ---+---1---+---2---+---3---+---4---+---5---+---6---+---7---+---8
7     007M08/32/1998 88148102 0
RULE: ---+---1---+---2---+---3---+---4---+---5---+---6---+---7---+---8
92
183
PATNO=007 GENDER=M VISIT=. HR=88 SBP=148 DBP=102 DX= AE=0 _ERROR_=1 _N_=7
NOTE: Invalid data for VISIT in line 12 5-14.
12    011M13/13/1998 68300 20 41
92
183
PATNO=011 GENDER=M VISIT=. HR=68 SBP=300 DBP=20 DX=4 AE=1 _ERROR_=1 _N_=12
NOTE: Invalid data for VISIT in line 21 5-14.
21    123M15/12/1999 60 10
92
183
PATNO=123 GENDER=M VISIT=. HR=60 SBP=. DBP=. DX=1 AE=0 _ERROR_=1 _N_=21
NOTE: Invalid data for VISIT in line 23 5-14.
23    020F99/99/9999 10 20 8 0
92
183
PATNO=020 GENDER=F VISIT=. HR=10 SBP=20 DBP=8 DX= AE=0 _ERROR_=1 _N_=23
NOTE: Invalid data for VISIT in line 28 5-14.
NOTE: Invalid data for HR in line 28 15-17.
28    027FNOTAVAIL NA 166106 70
92
183
PATNO=027 GENDER=F VISIT=. HR=. SBP=166 DBP=106 DX=7 AE=0 _ERROR_=1 _N_=28
NOTE: 31 records were read from the infile "C:\CLEANING\PATIENTS.TXT".
      The minimum record length was 26.
      The maximum record length was 27.
NOTE: The data set CLEAN.PATIENTS has 31 observations and 8 variables.
NOTE: DATA statement used:
      real time           0.50 seconds
The first invalid data message is generated by an invalid date. This will be discussed in more detail in the chapter "Working with Dates." For now, realize that a numeric missing value (remember that dates are stored as numeric values) will be generated as a result of this invalid date. Several more invalid date messages follow. A missing value for heart rate (HR) was generated for patient number 027 because of the character value "NA" (not available or not applicable) that was entered. Before going any further, the invalid dates need to be checked and a decision needs to be made concerning the "NA" value for heart rate.
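The do-it-yourself conversion described in this section might look like the following sketch. It reads heart rate as a character value and converts it with the INPUT function; the ?? modifier suppresses the invalid-data notes in the log. The data set name HR_CHECK, the variable C_HR, and the three in-line records are made up for illustration.

```sas
DATA HR_CHECK;
   INPUT PATNO $ 1-3 C_HR $ 5-7;
   /* Convert the character value. The ?? modifier keeps the log */
   /* quiet and leaves _ERROR_ at 0 when the value is not a      */
   /* valid number.                                              */
   HR = INPUT(C_HR, ?? 3.);
   /* Report nonblank values that failed to convert */
   IF HR = . AND C_HR NE ' ' THEN
      PUT "Nonnumeric HR for ID " PATNO C_HR=;
   DROP C_HR;
DATALINES;
001  88
027  NA
029
;
```

Only the record for ID 027 should be reported in the log. The record for ID 029 has a blank heart rate, so it yields an ordinary missing value without a message. This separates values that are truly missing from values that were entered but are not numeric.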
Using PROC MEANS and PROC FREQ to Count Missing Values
There are several procedures that will count missing values for you. It may be normal to have missing values for certain variables in your data set. There may also be variables for which no missing values are permitted, such as a patient ID. An easy way to count missing values is by using PROC MEANS; for character variables, PROC FREQ will provide this information. Program 3-1 is a simple program that can be used to check the number of numeric and character missing values in the PATIENTS data set.

Program 3-1 Counting Missing and Nonmissing Values for Numeric and Character Variables
LIBNAME CLEAN "C:\CLEANING";

TITLE "Missing Value Check for the PATIENTS Data Set";
PROC MEANS DATA=CLEAN.PATIENTS N NMISS;
RUN;

PROC FORMAT;
   VALUE $MISSCNT ' '   = 'MISSING'
                  OTHER = 'NONMISSING';
RUN;

PROC FREQ DATA=CLEAN.PATIENTS;
   TABLES _CHARACTER_ / NOCUM MISSING;
   FORMAT _CHARACTER_ $MISSCNT.;
RUN;
The check for numeric missing values is straightforward. By using the N and NMISS options with PROC MEANS, you get a count of the nonmissing and missing values for all your numeric variables (the default if no VAR statement is included).
Notice also that it is necessary to use the SAS keyword _CHARACTER_ in the TABLES statement (or to provide a list of character variables). PROC FREQ can produce frequency tables for numeric as well as character variables. Finally, the TABLES option MISSING includes the missing values in the body of the frequency listing. Examination of the listing from these two procedures is a good first step in your investigation of missing values. The output from Program 3-1 is shown next.

Missing Value Check for the PATIENTS Data Set

The MEANS Procedure

                                         N
Variable  Label                       N  Miss
----------------------------------------------
VISIT     Visit Date                 24     7
HR        Heart Rate                 28     3
SBP       Systolic Blood Pressure    27     4
DBP       Diastolic Blood Pressure   28     3
----------------------------------------------

Missing Value Check for the PATIENTS data set

The FREQ Procedure

Patient Number
PATNO        Frequency   Percent
--------------------------------
MISSING              1      3.23
NONMISSING          30     96.77

Gender
GENDER       Frequency   Percent
--------------------------------
MISSING              1      3.23
NONMISSING          30     96.77

Diagnosis Code
DX           Frequency   Percent
--------------------------------
MISSING              8     25.81
NONMISSING          23     74.19

Adverse Event?
AE           Frequency   Percent
--------------------------------
MISSING              1      3.23
NONMISSING          30     96.77
Using DATA Step Approaches to Identify and Count Missing Values
Counting missing values is not usually enough. If you have variables for which missing values are not allowed, you need to locate the observations so that the original data values can be checked and the errors corrected. A simple DATA step with a PUT statement is one approach. Program 3-2 checks for any missing visit dates, heart rates (HR), or adverse events (AE).

Program 3-2 Writing a Simple DATA Step to List Missing Data Values and an ID Variable
DATA _NULL_;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   FILE PRINT; ***Send output to the Output window;
   TITLE "Listing of Missing Values";
   ***Note: We will only input those variables of interest;
   INPUT @1  PATNO $3.
         @5  VISIT MMDDYY10.
         @15 HR    3.
         @27 AE    $1.;
   IF VISIT = . THEN PUT "Missing or invalid visit date for ID " PATNO;
   IF HR = . THEN PUT "Missing or invalid HR for ID " PATNO;
   IF AE = ' ' THEN PUT "Missing or invalid AE for ID " PATNO;
RUN;
Output from running Program 3-2 is shown next.

Listing of Missing Values
Missing or invalid visit date for ID 007
Missing or invalid HR for ID 010
Missing or invalid visit date for ID 011
Missing or invalid AE for ID 013
Missing or invalid visit date for ID 015
Missing or invalid visit date for ID 123
Missing or invalid visit date for ID 321
Missing or invalid visit date for ID 020
Missing or invalid visit date for ID 027
Missing or invalid HR for ID 027
Missing or invalid HR for ID 029
What do you do about missing patient numbers? Obviously, you can’t list which patient number is missing because you don’t have that information. One possibility is to report the patient number or numbers preceding the missing number (in the original order of data entry). If you sort the data set first, all the missing values will float to the top and you will not have a clue as to which patients they belong to. Here is a program that prints out the two previous patient ID’s when a missing ID is found.

Program 3-3 Attempting to Locate a Missing or Invalid Patient ID by Listing the Two Previous ID’s
DATA _NULL_;
   SET CLEAN.PATIENTS;
   ***Be sure to run this on the unsorted data set;
   FILE PRINT;
   TITLE "Listing of Missing Patient Numbers";
   PREV_ID = LAG(PATNO);
   PREV2_ID = LAG2(PATNO);
   IF PATNO = ' ' THEN PUT "Missing Patient ID. Two previous ID's are:"
      PREV2_ID "and " PREV_ID / @5 "Missing Record is number " _N_;
   ELSE IF INPUT(PATNO,?? 3.) = . THEN PUT "Invalid Patient ID:" PATNO +(-1)
      ". Two previous ID's are:" PREV2_ID "and " PREV_ID /
      @5 "Missing Record is number " _N_;
RUN;
Although there are several solutions to listing the patient numbers from the preceding observations, the LAG function serves the purpose here. Remember to execute the LAG and LAG2 functions for every observation. Then, when a missing patient number is encountered, the two lagged variables will be the ID’s from the previous two observations. The assumption in this program is that there are not more than three missing patient numbers in a row. If that is a possibility, you could list more than two previous patient ID’s or include patient ID’s following the missing one as well. Notice that we added the observation number to the output by printing the value of the internal SAS variable _N_. This provides one additional clue in finding the missing patient number.
Here is the output from this program:

Listing of Missing Patient Numbers

Invalid Patient ID:XX5. Two previous ID's are:003 and 004
    Missing Record is number 5
Missing Patient ID. Two previous ID's are:006 and 007
    Missing Record is number 8
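The advice to execute LAG for every observation matters because LAG returns the value stored the last time that particular LAG call ran, which is not necessarily the value from the previous observation. The following is a minimal sketch of that behavior. The data set IDS, its values, and the variable names GOOD_PREV and BAD_PREV are made up for illustration and are not part of the PATIENTS data.

```sas
***Hypothetical test data. A single period is read as a missing ID;
DATA IDS;
   INPUT ID $3.;
DATALINES;
001
002
.
004
;

DATA _NULL_;
   SET IDS;
   FILE PRINT;
   ***This LAG runs on every observation and returns the previous ID;
   GOOD_PREV = LAG(ID);
   ***This LAG runs only when ID is missing and returns whatever
      was stored the last time this statement ran;
   IF ID = ' ' THEN BAD_PREV = LAG(ID);
   IF ID = ' ' THEN PUT "Record " _N_ GOOD_PREV= BAD_PREV=;
RUN;
```

For record 3, GOOD_PREV holds 002, while BAD_PREV is blank because its LAG call had never run before. This is why the lagged variables in the program above are assigned outside the IF statements.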
Another approach is to list the values of all the variables for any missing or invalid patient ID. This may give a clue to the identity of the missing ID. Using PROC PRINT with a WHERE statement makes this an easy task, as demonstrated by the following SAS code.

Program: Using PROC PRINT to List Data for Missing or Invalid Patient ID’s
PROC PRINT DATA=CLEAN.PATIENTS;
   TITLE "Data Listing for Patients with Missing or Invalid ID's";
   WHERE PATNO = ' ' OR INPUT(PATNO,3.) = .;
RUN;
Here is the corresponding output:

Data Listing for Patients with Missing or Invalid ID's

Obs    PATNO    GENDER         VISIT     HR    SBP    DBP    DX    AE
  5    XX5        M       05/07/1998     68    120     80    1     0
  8               M       11/11/1998     90    190    100          0
Before leaving this section on DATA step detection of missing values, let’s modify the earlier program (which listed missing dates, heart rates, and adverse events) to count the number of each missing variable as well.

Program: Listing and Counting Missing Values for Selected Variables
DATA _NULL_;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD END=LAST;
   FILE PRINT; ***Send output to the Output window;
   TITLE "Listing of Missing Values";
   ***Note: We will only input those variables of interest;
   INPUT @1  PATNO $3.
         @5  VISIT MMDDYY10.
         @15 HR    3.
         @27 AE    $1.;
   IF VISIT = . THEN DO;
      PUT "Missing or invalid visit date for ID " PATNO;
      N_VISIT + 1;
   END;
   IF HR = . THEN DO;
      PUT "Missing or invalid HR for ID " PATNO;
      N_HR + 1;
   END;
   IF AE = ' ' THEN DO;
      PUT "Missing or invalid AE for ID " PATNO;
      N_AE + 1;
   END;
   IF LAST THEN
      PUT // "Summary of missing values" /
          25*'-' /
          "Number of missing dates = " N_VISIT /
          "Number of missing HR's = " N_HR /
          "Number of missing adverse events = " N_AE;
RUN;
Each time a missing value is located, the respective missing counter is incremented by 1. Because you only want to see the totals once, after all the data lines have been read, use the END= option in the INFILE statement to create the logical variable LAST. LAST will be true when the last record is being read from the raw data file PATIENTS.TXT. So, in addition to the earlier listing, you have the additional lines of output shown next.
Summary of missing values
-------------------------
Number of missing dates = 7
Number of missing HR's = 3
Number of missing adverse events = 1
Using PROC TABULATE to Count Missing and Nonmissing Values for Numeric Variables
Instead of using PROC MEANS to count numeric missing values, you might want a more attractive table from PROC TABULATE. This, however, adds significant complications. Before writing the general purpose program, let’s first see how PROC TABULATE can provide information on your numeric variables. The following program lists the number of nonmissing and missing values and the minimum and maximum values for three numeric variables in the PATIENTS data set.

Program: Listing the Number of Nonmissing and Missing Values and the Minimum and Maximum Values for All Numeric Variables
PROC TABULATE DATA=CLEAN.PATIENTS FORMAT=8.;
   TITLE "Missing Values, Low and High Values for Numeric Variables";
   VAR HR SBP DBP;
   TABLE HR SBP DBP,
         N NMISS MIN MAX / RTSPACE=26;
   KEYLABEL N     = 'Number'
            NMISS = 'Number Missing'
            MIN   = 'Lowest Value'
            MAX   = 'Highest Value';
RUN;
The option FORMAT=8. provides a default format for all the printed numbers. Eight was chosen because it allows the column headings to fit nicely. The option RTSPACE (row title space) leaves enough room for the variable labels. Finally, the KEYLABEL statement lets you choose more convenient names for the generated statistics. The output from this program follows. Remember that the labels for the three numeric variables are stored with the permanent data set.
Missing Values, Low and High Values for Numeric Variables

                              Number    Number    Lowest    Highest
                                        Missing   Value     Value
Heart Rate                        28         3        10        900
Systolic Blood Pressure           27         4        20        400
Diastolic Blood Pressure          28         3         8        200
Using PROC TABULATE to Count Missing and Nonmissing Values for Character Variables
A similar program can provide the number of missing and nonmissing values for the character variables in the data set. There are some differences in how PROC TABULATE can be used to count missing and nonmissing values for character variables.

Program: Using PROC TABULATE to Count Missing and Nonmissing Values for Character Variables
PROC FORMAT;
   VALUE $MISSCH ' '   = 'Missing'
                 OTHER = 'Nonmissing';
RUN;

PROC TABULATE DATA=CLEAN.PATIENTS MISSING FORMAT=8.;
   CLASS PATNO DX AE;
   TABLE PATNO DX AE,
         N / RTSPACE=26;
   FORMAT PATNO DX AE $MISSCH.;
   KEYLABEL N = 'Number';
RUN;
The only statistic you can request when there are no analysis variables listed is N, the number of observations in each of the formatted categories. Because the character format $MISSCH has only two levels, representing missing and nonmissing values, respectively, you trick the procedure into giving you what you want. The output from this program follows.

Missing Values, Low and High Values for Character Variables

                              Number
Patient Number
   Missing                         1
   Nonmissing                     30
Diagnosis Code
   Missing                         8
   Nonmissing                     23
Adverse Event?
   Missing                         1
   Nonmissing                     30
Creating a General Purpose Macro to Count Missing and Nonmissing Values for Both Numeric and Character Variables
When you omit the VAR statement or use _NUMERIC_ in place of the variable list with PROC MEANS, all numeric variables are listed. With PROC FREQ, you can use _CHARACTER_ in place of a variable list in the TABLES statement. However, when you use PROC TABULATE, you need to provide a list of variable names in the CLASS or VAR statement.

Program: Writing a Macro to Count the Number of Missing and Nonmissing Observations for All Numeric and Character Variables in a Data Set
*----------------------------------------------------------------*
| Program Name: AUTOMISS.SAS in C:\CLEANING                      |
| Purpose: Macro to list the number of missing and nonmissing    |
|          variables in a SAS data set                           |
| Arguments: DSNAME = SAS data set name (one- or two-level)      |
| Example: %AUTOMISS(CLEAN.PATIENTS)                             |
*----------------------------------------------------------------*;
%MACRO AUTOMISS(DSNAME);
   %***One-level data set name;
   %IF %INDEX(&DSNAME,.) = 0 %THEN %DO;   /* (1) */
      %LET LIB = WORK;
      %LET DSN = %UPCASE(&DSNAME);
   %END;
   %***Two-level data set name;
   %ELSE %DO;   /* (2) */
      %LET LIB = %UPCASE(%SCAN(&DSNAME,1,"."));
      %LET DSN = %UPCASE(%SCAN(&DSNAME,2,"."));
   %END;
   %*Note: it is important for the libname and data set name to be in uppercase;

   %* Initialize macro variables to null;
   %LET NVARLIST=;
   %LET CVARLIST=;

   TITLE1 "Number of Missing and Nonmissing Values from &DSNAME";

   %* Get list of numeric variables;
   PROC SQL NOPRINT;
      SELECT NAME INTO :NVARLIST SEPARATED BY " "   /* (3) */
      FROM DICTIONARY.COLUMNS   /* (4) */
      WHERE LIBNAME = "&LIB" AND MEMNAME = "&DSN" AND TYPE = "num";   /* (5) */

      %* Get list of character variables;
      SELECT NAME INTO :CVARLIST SEPARATED BY " "
      FROM DICTIONARY.COLUMNS
      WHERE LIBNAME = "&LIB" AND MEMNAME = "&DSN" AND TYPE = "char";   /* (6) */
   QUIT;

   PROC FORMAT;   /* (7) */
      VALUE $MISSCH " "   = "Missing"
                    OTHER = "Nonmissing";
   RUN;

   PROC TABULATE DATA=&LIB..&DSN MISSING FORMAT=8.;   /* (8) */
      %* If there are any numeric variables, do the following;
      %IF &NVARLIST NE %THEN %DO;
         VAR &NVARLIST;
         TITLE2 "for Numeric Variables";
         TABLE &NVARLIST,
               N NMISS MIN MAX / RTSPACE=26;
      %END;
      %* If there are any character variables, do the following;
      %IF &CVARLIST NE %THEN %DO;
         CLASS &CVARLIST;
         TITLE2 "for Character Variables";
         TABLE &CVARLIST,
               N / RTSPACE=26;
         FORMAT &CVARLIST $MISSCH.;
      %END;
      KEYLABEL N     = "Number"
               NMISS = "Number Missing"
               MIN   = "Lowest Value"
               MAX   = "Highest Value";
   RUN;
%MEND AUTOMISS;
The macro starts with a test to see if the data set name (the calling argument) is a one- or two-level name. The macro function %INDEX (1) returns a 0 if there is no period in the data set name. In this case, the macro variable LIB is set equal to WORK and the macro variable DSN is set equal to the data set name. If the data set name contains a period, the %INDEX function returns a number greater than 0, and the two macro variables LIB and DSN are set equal to the libname and data set name, respectively (2). It is important that the libname and data set names be in uppercase, so the %UPCASE function is used. The SELECT statement (3) is used to place the variable name (NAME) in a macro variable called NVARLIST. This is accomplished by the terms INTO :NVARLIST. The list of names produced is separated by spaces, as indicated in the SELECT statement. The keyword DICTIONARY.COLUMNS (4) returns the list of variables. Finally, to obtain the numeric and character lists separately, TYPE="num" or TYPE="char" is added to the WHERE statements (5) (6). When the two SELECT statements are processed, the macro variable NVARLIST will contain a list of all the numeric variables, separated by spaces, and the macro variable CVARLIST will contain a list of all the character variables in the data set specified by the calling arguments.

The $MISSCH format (7) is the same as the one used in the earlier character-variable program. Finally, the PROC TABULATE statements (8) are identical to the statements in the previous programs where the variable lists were hard coded, except the macro variables are substituted for the explicit list of variables.

Calling this macro with the data set name CLEAN.PATIENTS as the argument (%AUTOMISS(CLEAN.PATIENTS)) produced the following output:

Number of Missing and Nonmissing Values from CLEAN.PATIENTS
for Numeric Variables

                              Number    Number    Lowest    Highest
                                        Missing   Value     Value
Visit Date                        24         7     13966      14560
Heart Rate                        28         3        10        900
Systolic Blood Pressure           27         4        20        400
Diastolic Blood Pressure          28         3         8        200

Number of Missing and Nonmissing Values from CLEAN.PATIENTS
for Character Variables

                              Number
Patient Number
   Missing                         1
   Nonmissing                     30
Gender
   Missing                         1
   Nonmissing                     30
Diagnosis Code
   Missing                         8
   Nonmissing                     23
Adverse Event?
   Missing                         1
   Nonmissing                     30
The PROC MEANS and PROC FREQ methods for listing missing values are certainly easier than the complicated program you were shown before.

Specific values such as 999 are sometimes used to denote missing values. For example, numeric values that are left blank in Dbase files are stored as zeros. If 0 is a valid value for some of your variables, this can lead to problems. So, values such as 999 are sometimes used instead of zeros or blanks. The next program searches a SAS data set for all numeric variables set to a specific value and produces a report which shows the variable name and the observation where the specific value was found.

The trick in this program is the relatively unknown routine VNAME (see the SAS online help for more details). A call to this routine returns the variable name of an array element. The first program searches a SAS data set for a specific value (999). The program is then generalized by making the data set name and the specific value calling arguments in a macro. Here is the first program:

Program: Identifying All Numeric Variables Equal to a Fixed Value (Such as 999)
*----------------------------------------------------------------*
| Program Name: FIND_X.SAS in C:\CLEANING                        |
| Purpose: Identifies any specified value for all numeric vars   |
*----------------------------------------------------------------*;
***Create test data set;
DATA TEST;
   INPUT X Y A $ X1-X3 Z $;
DATALINES;
1 2 X 3 4 5 Y
2 999 Y 999 1 999 J
999 999 R 999 999 999 X
1 2 3 4 5 6 7
;

***Program to detect the specified values;
DATA _NULL_;
   SET TEST;
   FILE PRINT;
   ARRAY NUMS[*] _NUMERIC_;   /* (1) */
   LENGTH VARNAME $ 8;
   DO __I = 1 TO DIM(NUMS);   /* (2) */
      IF NUMS[__I] = 999 THEN DO;
         CALL VNAME(NUMS[__I],VARNAME);   /* (3) */
         PUT "Value of 999 found for variable " VARNAME
             "in observation " _N_;
      END;
   END;
   DROP __I;
RUN;
Key to this program is the use of _NUMERIC_ in the ARRAY statement (1). Because this ARRAY statement follows the SET statement, the array NUMS will contain all the numeric variables in the data set TEST. The next step is to examine each of the elements in the NUMS array, determine if a value of 999 is found, and then determine the variable name associated with that array element. The DO loop (2) uses the index variable __I in the hopes that there will not be any variables in the data set to be tested with that name. Now for the trick: as you search for values of 999 for each of the numeric variables, you can use the CALL VNAME routine (3) (see the SAS online help) to return the variable name that corresponds to the array element. In this program, the variable name is stored in the variable VARNAME (the second argument) by the VNAME routine. All that is left to do is write out the variable names and observation numbers. Next is the macro version of the same program, followed by the output.

Program: Creating a Macro Version of the Previous Program
*----------------------------------------------------------------*
| Macro Name: FIND_X.SAS in C:\CLEANING                          |
| Purpose: Identifies any specified value for all numeric vars   |
| Calling Arguments: DSN SAS Data Set Name                       |
|                    NUM Numeric value to search for             |
| Example: To find variable values of 999 in data set TEST, use  |
|          %FIND_X(TEST,999)                                     |
*----------------------------------------------------------------*;
%MACRO FIND_X(DSN,NUM);
   TITLE "Variables with &NUM as Missing Values";
   DATA _NULL_;
      SET &DSN;
      FILE PRINT;
      LENGTH VARNAME $ 8; ***Or LENGTH 32 for V7 and Later;
      ARRAY NUMS[*] _NUMERIC_;
      DO __I = 1 TO DIM(NUMS);
         IF NUMS[__I] = &NUM THEN DO;
            CALL VNAME(NUMS[__I],VARNAME);
            PUT "Value of &NUM found for variable " VARNAME
                "in observation " _N_;
         END;
      END;
      DROP __I;
   RUN;
%MEND FIND_X;
The resulting output is shown next.

Variables with 999 as Missing Values

Value of 999 found for variable Y in observation 2
Value of 999 found for variable X1 in observation 2
Value of 999 found for variable X3 in observation 2
Value of 999 found for variable X in observation 3
Value of 999 found for variable Y in observation 3
Value of 999 found for variable X1 in observation 3
Value of 999 found for variable X2 in observation 3
Value of 999 found for variable X3 in observation 3
If you would prefer just to see a summary of variables that have the value of a number such as 999 for one or more observations, you can modify the first program to create a data set and use PROC FREQ to count the number of times a specified value is detected, as shown next.

Program: Identifying Variables with Specified Numeric Values and Counting the Number of Times the Value Appears
DATA NUM_999;
   SET TEST;
   FILE PRINT;
   ARRAY NUMS[*] _NUMERIC_;
   LENGTH VARNAME $ 8;
   DO __I = 1 TO DIM(NUMS);
      IF NUMS[__I] = 999 THEN DO;
         CALL VNAME(NUMS[__I],VARNAME);
         OUTPUT;
      END;
   END;
   KEEP VARNAME;
RUN;

PROC FREQ DATA=NUM_999;
   TABLES VARNAME / NOCUM NOPERCENT;
RUN;
Each time a numeric variable in the array NUMS is equal to a value of 999, a call to VNAME places the variable name into the variable VARNAME. An observation is then written to the data set NUM_999. Because you want to count the number of times the specific numeric value (such as 999) occurred, use PROC FREQ to print out the frequencies. Running this program with the value of 999 on the data set TEST generated the following output:

Variables with 999 as Missing Values

The FREQ Procedure

VARNAME    Frequency
--------------------
X                  1
X1                 2
X2                 1
X3                 2
Y                  2
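The same lookup can also be written with the VNAME function, which returns the name as a value instead of filling a second argument. The sketch below assumes a SAS release that provides the VNAME function and that the TEST data set from the earlier program exists. The data set name NUM_999B is made up for this illustration.

```sas
DATA NUM_999B;
   SET TEST;
   ARRAY NUMS[*] _NUMERIC_;
   ***Set the length first, otherwise VARNAME gets a default length of 200;
   LENGTH VARNAME $ 32;
   DO __I = 1 TO DIM(NUMS);
      IF NUMS[__I] = 999 THEN DO;
         ***The VNAME function returns the name of the array element;
         VARNAME = VNAME(NUMS[__I]);
         OUTPUT;
      END;
   END;
   KEEP VARNAME;
RUN;

PROC FREQ DATA=NUM_999B;
   TABLES VARNAME / NOCUM NOPERCENT;
RUN;
```

Both forms should produce the same frequency table. The function form is mainly a matter of taste when the name is needed inside an expression.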
4
Working with Dates
Introduction
Checking Ranges for Dates (Using a DATA Step)
Checking Ranges for Dates (Using PROC PRINT)
Checking for Invalid Dates
Working with Dates in Nonstandard Form
Creating a SAS Date When the Day of the Month Is Missing
Suspending Error Checking for Known Invalid Dates
Introduction
SAS dates seem mysterious to many people, but by understanding how SAS dates are stored, you will see that they are really quite simple. SAS dates are stored in numeric variables and represent the number of days from a fixed point in time, January 1, 1960. The confusion develops in the many ways that SAS software can read and write dates. Typically, dates are read as MM/DD/YYYY or some similar form. There are informats to read almost any conceivable date notation. Regardless of how a date is read, the informat performs the conversion to a SAS date, and it is stored just like any other numeric value. If you print out a date value without a SAS date format, it will appear as a number (the number of days from January 1, 1960) rather than a date in one of the standard forms. When date information is not in a standard form, you can read the month, day, and year information as separate variables and use the MDY (month-day-year) function to create a SAS date. Let’s look at some ways to perform data cleaning and validation with dates.
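To make the storage rule concrete, the following is a minimal sketch that writes the same two date values to the SAS Log, first without a format and then with one. The variable names DAY0 and VISIT and the chosen dates are assumptions for illustration only.

```sas
DATA _NULL_;
   ***Day 0 on the SAS calendar, written as a date constant;
   DAY0 = '01JAN1960'D;
   ***A date built from month, day, and year values;
   VISIT = MDY(10,21,1998);
   ***Without a format the values print as day counts;
   PUT DAY0= VISIT=;
   ***With a date format the same values print as calendar dates;
   PUT DAY0= MMDDYY10. VISIT= DATE9.;
RUN;
```

The first PUT statement writes DAY0=0 VISIT=14173, and the second writes DAY0=01/01/1960 VISIT=21OCT1998. The stored values are identical in both cases; only the format applied at print time differs.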
Checking Ranges for Dates (Using a DATA Step)
Suppose you want to determine if the visit dates in the PATIENTS data set are between June 1, 1998 and October 15, 1999.

Program: Checking That a Date Is within a Specified Interval (DATA Step Approach)
LIBNAME CLEAN "C:\CLEANING";

DATA _NULL_;
   TITLE "Dates before June 1, 1998 or after October 15, 1999";
   FILE PRINT;
   SET CLEAN.PATIENTS(KEEP=VISIT PATNO);
   IF (VISIT LT '01JUN1998'D AND VISIT NE .) OR
      VISIT GT '15OCT1999'D THEN PUT PATNO= VISIT= MMDDYY10.;
RUN;
The key to this program is the use of the date constants (also called date literals) in the IF statement. If you want SAS to turn a date into a SAS date (the number of days from January 1, 1960), the dates must be written in this fashion. Date constants are written as a two-digit day, a three-character month name, and a two- or four-digit year, placed in single or double quotes, and followed by a lowercase or uppercase 'D'. The resulting listing is:

PATNO=XX5 VISIT=05/07/1998
PATNO=010 VISIT=10/19/1999
PATNO=003 VISIT=11/12/1999
PATNO=028 VISIT=03/28/1998
PATNO=029 VISIT=05/15/1998
Checking Ranges for Dates (Using PROC PRINT)
Program: Checking That a Date Is within a Specified Interval Using PROC PRINT and a WHERE Statement
PROC PRINT DATA=CLEAN.PATIENTS;
   TITLE "Dates before June 1, 1998 or after October 15, 1999";
   WHERE VISIT NOT BETWEEN '01JUN1998'D AND '15OCT1999'D AND
         VISIT NE .;
   ID PATNO;
   VAR VISIT;
   FORMAT VISIT DATE9.;
RUN;
Output from this procedure contains the identical information as the previous DATA step approach. For variety, let’s choose the DATE9. date format.

Dates before June 1, 1998 or after October 15, 1999

PATNO        VISIT
 XX5     07MAY1998
 010     19OCT1999
 003     12NOV1999
 028     28MAR1998
 029     15MAY1998
Checking for Invalid Dates
Some of the dates in the PATIENTS data set are missing and some are invalid dates which were converted to missing values during the input process. If you want to distinguish between the two, you must work from the raw data, not the SAS data set. If you attempt to read an invalid date with a SAS date informat, an error message will appear in the SAS Log. This is one clue that you have errors in your date values. The next program reads the raw data file PATIENTS.TXT. The resulting SAS Log follows.
Program: Reading Dates with the MMDDYY10. Informat
DATA DATES;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @5 VISIT MMDDYY10.;
   FORMAT VISIT MMDDYY10.;
RUN;
The SAS Log that results from running this program is shown next.

1    LIBNAME CLEAN "C:\CLEANING";
NOTE: Libref CLEAN was successfully assigned as follows:
      Engine:        V6
      Physical Name: C:\CLEANING
2    DATA DATES;
3       INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
4       INPUT @5 VISIT MMDDYY10.;
5       FORMAT VISIT MMDDYY10.;
6    RUN;

NOTE: The infile "C:\CLEANING\PATIENTS.TXT" is:
      File Name=C:\CLEANING\PATIENTS.TXT,
      RECFM=V,LRECL=256

NOTE: Invalid data for VISIT in line 7 5-14.
RULE: ---+---1---+---2---+---3---+---4---+---5---+---6---+---7---+---8
7     007M08/32/1998 88148102 0 87 173
VISIT=. _ERROR_=1 _N_=7
NOTE: Invalid data for VISIT in line 12 5-14.
12    011M13/13/1998 68300 20 41 87 173
VISIT=. _ERROR_=1 _N_=12
NOTE: Invalid data for VISIT in line 21 5-14.
21    123M15/12/1999 60 10 87 173
VISIT=. _ERROR_=1 _N_=21
NOTE: Invalid data for VISIT in line 23 5-14.
23    020F99/99/9999 10 20 8 0 87 173
VISIT=. _ERROR_=1 _N_=23
NOTE: Invalid data for VISIT in line 28 5-14.
28    027FNOTAVAIL NA 166106 70 87 173
VISIT=. _ERROR_=1 _N_=28
NOTE: 31 records were read from the infile "C:\CLEANING\PATIENTS.TXT".
      The minimum record length was 26.
      The maximum record length was 27.
NOTE: The data set WORK.DATES has 31 observations and 1 variables.
NOTE: DATA statement used:
      real time 2.14 seconds
There are several reasons why these dates caused error reports in the Log. In some cases, the month was greater than 12, possibly caused by reading a date that was actually in day-month-year form rather than month-day-year form. Other dates, such as 99/99/9999, were an attempt to indicate that no date information was available. Obviously, the data value of NOTAVAIL (not available) caused an error. Remember that once the errors exceed a default number, they will no longer be reported in the SAS Log. This number can be adjusted by setting the system option ERRORS=. If you have no missing date values in your data, any missing date value must have been generated by an invalid date.
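As a small illustration of the ERRORS= system option mentioned above, the sketch below raises the limit before rereading the raw file. The value 100 is an arbitrary choice for this example, and the default limit at your site may differ (20 is a common default).

```sas
***Raise the number of observations for which complete
   error messages are written to the SAS Log;
OPTIONS ERRORS=100;

DATA DATES;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @5 VISIT MMDDYY10.;
   FORMAT VISIT MMDDYY10.;
RUN;
```

With a larger limit, a file that contains many invalid dates produces a longer Log, so this setting is most useful when you want every bad record listed during a one-time check.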
Program: Listing Missing and Invalid Dates by Reading the Date Twice, Once with a Date Informat and the Second as Character Data
DATA _NULL_;
   FILE PRINT;
   TITLE "Listing of Missing and Invalid Dates";
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @1 PATNO  $3.
         @5 VISIT  MMDDYY10.
         @5 V_DATE $CHAR10.;
   FORMAT VISIT MMDDYY10.;
   IF VISIT = . THEN PUT PATNO= V_DATE=;
RUN;
Here you read the date twice, first with the SAS date informat MMDDYY10. and then with the character informat $CHAR10. This way, even though the SAS System substitutes a missing value for VISIT, the variable V_DATE will contain the actual characters that were entered in the date field.
An alternative is to read the original date only once, as a character string, and use the INPUT function to create the SAS date. Remember that the INPUT function "reads" the value of the first argument, using the informat listed as the second argument. The alternative program is shown next.

Program: Listing Missing and Invalid Dates by Reading the Date as a Character Variable and Converting to a SAS Date with the INPUT Function
DATA _NULL_;
   FILE PRINT;
   TITLE "Listing of Missing and Invalid Dates";
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @1 PATNO  $3.
         @5 V_DATE $CHAR10.;
   VISIT = INPUT(V_DATE,MMDDYY10.);
   FORMAT VISIT MMDDYY10.;
   IF VISIT = . THEN PUT PATNO= V_DATE=;
RUN;
Running either of the two preceding programs results in the following output:

Listing of Missing and Invalid Dates

PATNO=007 V_DATE=08/32/1998
PATNO=011 V_DATE=13/13/1998
PATNO=015 V_DATE=
PATNO=123 V_DATE=15/12/1999
PATNO=321 V_DATE=
PATNO=020 V_DATE=99/99/9999
PATNO=027 V_DATE=NOTAVAIL
If you want to ignore real missing values, you only need to make a slight change, as shown next.

Program: Removing the Missing Values from the Invalid Date Listing
DATA _NULL_;
   FILE PRINT;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @1 PATNO  $3.
         @5 V_DATE $CHAR10.;
   VISIT = INPUT(V_DATE,MMDDYY10.);
   FORMAT VISIT MMDDYY10.;
   IF VISIT = . AND V_DATE NE ' ' THEN PUT PATNO= V_DATE=;   /* (1) */
RUN;
Because of line (1), only nonmissing invalid dates will be printed.

Working with Dates in Nonstandard Form
Although SAS software can read dates in almost every conceivable form, there may be times when you have date information for which there is no SAS informat. Suppose you have a month value (a number from 1 to 12) in columns 6-7, a day of the month value (a number from 1 to 31) in columns 13-14, and a four-digit year value in columns 20-23. How can you create a SAS date from these three variables? The MDY (month-day-year) function comes to the rescue. Just enter the three variable names for the month, day, and year as arguments to this function, and it will return a SAS date. The following program demonstrates how this works.

Program: Demonstrating the MDY Function to Read Dates in Nonstandard Form
***Sample program to read nonstandard dates;
DATA NONSTAND;
   INPUT PATNO $ 1-3 MONTH 6-7 DAY 13-14 YEAR 20-23;
   DATE = MDY(MONTH,DAY,YEAR);
   FORMAT DATE MMDDYY10.;
DATALINES;
001  05     23     1998
006  11     01     1998
123  14     03     1998
137  10            1946
;

PROC PRINT DATA=NONSTAND;
   TITLE "Listing of Data Set NONSTAND";
   ID PATNO;
RUN;
Notice that an invalid MONTH value (observation three) and a missing DAY value (observation four) were included intentionally. The listing of the data set NONSTAND follows:

Listing of Data Set NONSTAND

PATNO    MONTH    DAY    YEAR          DATE
 001        5      23    1998    05/23/1998
 006       11       1    1998    11/01/1998
 123       14       3    1998             .
 137       10       .    1946             .
In the two cases where a date could not be computed, a missing value was generated. Inspection of the SAS Log also shows that the MDY function had an invalid value and a missing value.

Program: Removing Missing Values from the Error Listing
DATA _NULL_;
   FILE PRINT;
   TITLE "Invalid Date Values";
   INPUT PATNO $ 1-3 MONTH 6-7 DAY 13-14 YEAR 20-23;
   DATE = MDY(MONTH,DAY,YEAR);
   C_DATE = PUT(MONTH,Z2.) || '/' ||
            PUT(DAY,Z2.) || '/' ||
            PUT(YEAR,4.);
   ***Note: the Z2. Format includes leading zeros;
   FORMAT DATE MMDDYY10.;
   IF C_DATE NE ' ' AND DATE = . THEN PUT PATNO= C_DATE=;
DATALINES;
001  05     23     1998
006  11     01     1998
123  14     03     1998
137  10            1946
;
In this program, C_DATE is a character representation of the date. The concatenation operator (||) is used to piece together the month, day, and year values and the two slashes. If the MDY function produces a missing value and the value of C_DATE is nonmissing, there must have been an invalid date. The output that follows demonstrates that this program works as advertised.

Invalid Date Values

PATNO=123 C_DATE=14/ 3/1998
PATNO=137 C_DATE=10/ ./1946
Creating a SAS Date When the Day of the Month Is Missing
Some of your date values may be missing the day of the month, but you would still like to create a SAS date by using either the 1st or the 15th of the month as the day. There are two possibilities here. One method is to use the MONYY informat that reads dates in the form of a three-character month name and a two- or four-digit year. If your dates are in this form, SAS will create a SAS date using the first of the month as the day value. The other method of creating a SAS date from only month and year values is to use the MDY function, substituting a value such as 15 for the day argument. An example is shown next.

Program: Creating a SAS Date When the Day of the Month Is Missing
DATA NO_DAY;
   INPUT @1  DATE1 MONYY7.
         @8  MONTH 2.
         @10 YEAR  4.;
   DATE2 = MDY(MONTH,15,YEAR);
   FORMAT DATE1 DATE2 MMDDYY10.;
DATALINES;
JAN98  011998
OCT1998101998
;

PROC PRINT DATA=NO_DAY;
   TITLE "Listing of Data Set NO_DAY";
RUN;
DATE1 is a SAS date created by the MONYY SAS informat; DATE2 is created by the MDY function, using the 15th of the month as the missing day value. Output from PROC PRINT is shown next.

Listing of Data Set NO_DAY

Obs         DATE1    MONTH    YEAR         DATE2
  1    01/01/1998        1    1998    01/15/1998
  2    10/01/1998       10    1998    10/15/1998
Let’s extend this idea a bit further. Suppose most of your dates have month, day, and year values but, for any date where the only piece missing is the day of the month, you want to substitute the 15th of the month. The following program will accomplish this goal.

Program: Substituting the 15th of the Month When the Date of the Month Is Missing
DATA MISS_DAY;
   INPUT @1 PATNO $3.
         @4 MONTH 2.
         @6 DAY   2.
         @8 YEAR  4.;
   IF DAY NE . THEN DATE = MDY(MONTH,DAY,YEAR);
   ELSE DATE = MDY(MONTH,15,YEAR);
   FORMAT DATE MMDDYY10.;
DATALINES;
00110211998
00205  1998
00344  1998
;

PROC PRINT DATA=MISS_DAY;
   TITLE "Listing of Data Set MISS_DAY";
RUN;
If the day value is not missing, the MDY function uses all three values of month, day, and year to compute a SAS date. If the day value is missing, the 15th of the month is used. As before, if there is an invalid date (such as for patient 003), a missing date value is generated. Here are the three observations created by this program:

Listing of Data Set MISS_DAY

OBS    PATNO    MONTH    DAY    YEAR          DATE
  1     001       10      21    1998    10/21/1998
  2     002        5       .    1998    05/15/1998
  3     003       44       .    1998             .
Suspending Error Checking for Known Invalid Dates
As you saw earlier, invalid date values can fill your SAS Log with lots of errors. There are times when you know that invalid date values were used to represent missing dates or other specific values. If you would like to prevent the automatic listing of date errors in the SAS Log, you can use the double question mark (??) modifier in your INPUT statement or with the INPUT function. This modifier prevents the NOTES and data listings from being printed in the SAS Log and also keeps the SAS internal variable _ERROR_ at 0. The following program uses the ?? modifier in the INPUT statement to prevent error messages from printing in the SAS Log.

Program: Suspending Error Checking for Known Invalid Dates by Using the ?? Informat Modifier
DATA DATES;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @5 VISIT ?? MMDDYY10.;
   FORMAT VISIT MMDDYY10.;
RUN;
When this program is run, there will be no error messages in the SAS Log caused by invalid dates. Only turn off SAS error checking when you plan to detect errors in other ways or you already know all about your invalid dates.

The next program shows an example of using the ?? informat modifier with the INPUT function. It is identical to the earlier program that read the date as a character variable and converted it with the INPUT function, with the addition of the ?? modifier to keep the SAS Log free of error messages.

Program: Demonstrating the ?? Informat Modifier with the INPUT Function
DATA _NULL_;
   FILE PRINT;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @1 PATNO  $3.
         @5 V_DATE $CHAR10.;
   VISIT = INPUT(V_DATE,?? MMDDYY10.);
   FORMAT VISIT MMDDYY10.;
   IF VISIT = . THEN PUT PATNO= V_DATE=;
RUN;
Remember that you can use the ?? modifier before the informat argument of the INPUT function, as well as the more traditional use with the INPUT statement. If you have a lot of known date errors, the overriding of the error messages will also improve program efficiency. Remember that this also sets the value of _ERROR_ to 0.
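The effect of the ?? modifier on _ERROR_ can be seen directly with a small test. The following is a minimal sketch with two made-up date strings; the variable names RAW, D_PLAIN, D_QUIET, E_PLAIN, and E_QUIET are assumptions for illustration.

```sas
DATA _NULL_;
   INPUT @1 RAW $CHAR10.;
   ***No modifier. An invalid value writes a note to the Log
      and sets _ERROR_ to 1;
   D_PLAIN = INPUT(RAW, MMDDYY10.);
   E_PLAIN = _ERROR_;
   ***Reset the flag so the second conversion is tested on its own;
   _ERROR_ = 0;
   ***Double question mark. No note is written and _ERROR_ stays at 0;
   D_QUIET = INPUT(RAW, ?? MMDDYY10.);
   E_QUIET = _ERROR_;
   PUT RAW= E_PLAIN= E_QUIET=;
DATALINES;
10/21/1998
13/13/1998
;
```

For the valid date, both flags are 0. For 13/13/1998, E_PLAIN is 1 and E_QUIET is 0, even though both conversions return a missing value.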
5
Looking for Duplicates and “n” Observations per Subject
Introduction
Eliminating Duplicates by Using PROC SORT
Detecting Duplicates by Using DATA Step Approaches
Using PROC FREQ to Detect Duplicate ID’s
Selecting Patients with Duplicate Observations by Using a Macro List and SQL
Identifying Subjects with “n” Observations Each (DATA Step Approach)
Identifying Subjects with “n” Observations Each (Using PROC FREQ)
Introduction
Besides checking for invalid data values in a data set, it may be necessary to check for either duplicate ID’s or duplicate observations. Duplicate observations are easy to fix; just eliminate the duplicates (although you may want to find out how the duplicates got there). Duplicate ID’s with different data values present another problem. One possible cause of this is that the same ID was used for more than one person. Another possibility is that different data values were entered more than once for the same person. There are several ways to detect and eliminate duplicates in a SAS data set. This chapter explores some of them.

Eliminating Duplicates by Using PROC SORT
Suppose you have a data set where each patient is supposed to be represented by a single observation. To demonstrate what happens when you have multiple observations with the same ID, some duplicates in the PATIENTS data set were included on purpose. Observations with duplicate ID numbers are shown next.
OBS 2 3 4 5 7 8
PATNO
GENDER
002 002 003 003 006 006
F F X M F
VISIT
HR
SBP
DBP
DX
AE
11/13/1998 11/13/1998 10/21/1998 11/12/1999 06/15/1999 07/07/1999
84 84 68 58 72 82
120 120 190 112 102 148
78 78 100 74 68 84
X X 3
0 0 1 0 1 0
6 1
Notice that patient number 002 is a true duplicate observation. For patient numbers 003 and 006, the duplicate ID’s contain different values.

Two very useful options of PROC SORT are NODUPKEY and NODUP. The NODUPKEY option automatically eliminates multiple observations where the BY variables have the same value. For example, to automatically eliminate multiple patient ID’s (PATNO) in the PATIENTS data set (which you probably would not want to do; this is for illustration only), you could use PROC SORT with the NODUPKEY option, as shown next.

Program: Demonstrating the NODUPKEY Option of PROC SORT
PROC SORT DATA=CLEAN.PATIENTS OUT=SINGLE NODUPKEY;
   BY PATNO;
RUN;

PROC PRINT DATA=SINGLE;
   TITLE "Data Set SINGLE - Duplicated ID's Removed from PATIENTS";
   ID PATNO;
RUN;
Notice that two options, OUT= and NODUPKEY, are used here. The OUT= option is used to create the new data set SINGLE, leaving the original data set PATIENTS unchanged. Shown next is a listing of the SINGLE data set.
Data Set SINGLE - Duplicated ID’s Removed from PATIENTS PATNO 001 002 003 004 006 007 008 009 010 011 012 013 014 015 017 019 020 022 023 024 025 027 028 029 123 321 XX5
GENDER M M F X F M F M f M M 2 M F F M F M f F M F F M M F M
VISIT
HR
SBP
DBP
11/11/1998 11/11/1998 11/13/1998 10/21/1998 01/01/1999 06/15/1999 . 08/08/1998 09/25/1999 10/19/1999 . 10/12/1998 08/23/1999 02/02/1999 . 04/05/1999 06/07/1999 . 10/10/1999 12/31/1998 11/09/1998 01/01/1999 . 03/28/1998 05/15/1998 . . 05/07/1998
90 88 84 68 101 72 88 210 86 . 68 60 74 22 82 208 58 10 48 22 76 74 . 66 . 60 900 68
190 140 120 190 200 102 148 . 240 40 300 122 108 130 148 . 118 20 114 34 120 102 166 150 . . 400 120
100 80 78 100 120 68 102 . 180 120 20 74 64 90 88 84 70 8 82 78 80 68 106 90 . . 200 80
DX 1 X 3 5 6 7 4 1 4
AE 0 0 0 1 A 1 0 0 1 0 1 0
1 3 2 2 1 5 7 3 4 1 5 1
1 1 0 0 0 1 0 0 1 0 0 1 0 1 0
The NODUPKEY option eliminated the second observation for each of the three duplicate ID's. The only indication that duplicates were removed is in the NOTE in the SAS Log, which is shown next.

PROC SORT DATA=CLEAN.PATIENTS OUT=SINGLE NODUPKEY;
   BY PATNO;
RUN;

NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.SINGLE has 28 observations and 8 variables.

This method of looking for duplicate ID's is really only useful if the SAS Log shows that no duplicates were removed. If the SAS Log shows duplicate key values were deleted, you need to see which ID's had duplicate data and the nature of the data.
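One way to see which observations NODUPKEY removed is to ask PROC SORT to save them. The sketch below assumes a release of SAS that supports the DUPOUT= option of PROC SORT (SAS 9 or later, which is newer than the release this book was written for); the data set name REMOVED is a hypothetical name chosen for illustration.

```sas
/* Sketch only: DUPOUT= requires SAS 9 or later.              */
/* REMOVED is a hypothetical name for the deleted duplicates. */
PROC SORT DATA=CLEAN.PATIENTS OUT=SINGLE DUPOUT=REMOVED NODUPKEY;
   BY PATNO;
RUN;

/* List the observations that NODUPKEY deleted */
PROC PRINT DATA=REMOVED;
   TITLE "Observations Deleted by NODUPKEY";
   ID PATNO;
RUN;
```

The REMOVED data set holds only the second and later observations for each duplicated key, so you still need one of the approaches described later in this chapter to see all the observations for a duplicated ID side by side.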
If you use the NODUPKEY option with more than one BY variable, only those observations with identical values on each of the BY variables will be deleted. For example, if you sort by patient number (PATNO) and visit date (VISIT), only the duplicate for patient number 002 will be deleted when you use the NODUPKEY option, because the two observations for patient number 002 are the only ones with the same patient number and visit date.

The option NODUP also deletes duplicates, but only for two observations where all the variables have identical values. The following program demonstrates this option.

Program: Demonstrating the NODUP Option of PROC SORT

PROC SORT DATA=CLEAN.PATIENTS OUT=SINGLE NODUP;
   BY _ALL_;
RUN;

Listing the data set SINGLE, which is created by this procedure, shows that only the second observation for patient number 002 was deleted. The use of _ALL_ as the BY variable in this program is necessary because of the strange way that the NODUP option looks for duplicates. PROC SORT compares an observation to the most recently written observation when deciding if an observation is a duplicate, so you have to sort by all the variables to make it work. This piece of information is due to one of my very astute reviewers, Mike Zdeb, who demonstrated that simply sorting by a single variable and using the NODUP option could result in a data set that still had duplicate observations. Because this is such an important point, the next program illustrates how the NODUP option can leave duplicate observations in a data set when you do not use the keyword _ALL_ in the BY statement.
Program: Demonstrating a Feature of the NODUP Option

DATA MULTIPLE;
   INPUT PATNO $ X Y;
DATALINES;
001 1 2
006 1 2
009 1 2
001 3 4
001 1 2
009 1 2
001 1 2
;
PROC SORT DATA=MULTIPLE OUT=SINGLE NODUP;
   BY PATNO;
RUN;

PROC PRINT DATA=SINGLE;
   TITLE "Listing of Data Set SINGLE";
RUN;

When PROC SORT sorts the observations by the variable PATNO, the four observations with PATNO equal to 001 wind up as the first four observations in the sorted data set. The first four observations in the sorted data set are:

001  1  2
001  3  4
001  1  2
001  1  2

Now, when the NODUP option does its thing, successive duplicate observations are deleted, leaving the first three observations for PATNO=001 in the data set SINGLE.

PATNO  X  Y
001    1  2
001    3  4
001    1  2
006    1  2
009    1  2

This is an important point that you must keep in mind whenever you want to eliminate duplicate observations by using the NODUP option of PROC SORT.
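For comparison, here is a minimal sketch of the same sort with _ALL_ in the BY statement. It reuses the MULTIPLE data set created above; the output name SINGLE_ALL is a hypothetical name used only for this illustration.

```sas
/* Sorting by every variable places identical observations   */
/* next to each other, so NODUP can detect all of them.      */
PROC SORT DATA=MULTIPLE OUT=SINGLE_ALL NODUP;
   BY _ALL_;
RUN;

PROC PRINT DATA=SINGLE_ALL;
   TITLE "Listing of Data Set SINGLE_ALL";
RUN;
```

Because the three identical observations for PATNO 001 (X=1, Y=2) are now adjacent, only one of them is kept, and SINGLE_ALL should contain four observations: 001 1 2, 001 3 4, 006 1 2, and 009 1 2.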
Detecting Duplicates by Using DATA Step Approaches
Let's explore the ways that will allow you to detect duplicate ID's and duplicate observations in a data set. One very good way to approach this problem is to use the temporary SAS variables FIRST. and LAST. To see how this works, look at the following program, which prints out all observations that have duplicate patient numbers.

Program: Identifying Duplicate ID's

PROC SORT DATA=CLEAN.PATIENTS OUT=TMP;          /* (1) */
   BY PATNO;
RUN;

DATA DUP;
   SET TMP;
   BY PATNO;                                    /* (2) */
   IF FIRST.PATNO AND LAST.PATNO THEN DELETE;   /* (3) */
RUN;

PROC PRINT DATA=DUP;
   TITLE "Listing of Duplicates from Data Set CLEAN.PATIENTS";
   ID PATNO;
RUN;

It's first necessary to sort the data set by the ID variable (1). In the above program, the original data set was left intact and a new data set (TMP) was created for the sorted observations. After you have a sorted data set, a short DATA step will remove patients that have a single observation, leaving a data set of duplicates. The key here is the BY statement (2) following the SET statement. When a SET statement is followed by a BY statement, the temporary SAS variables FIRST.by_variable_name and LAST.by_variable_name are created. In this example, there is only one BY variable (PATNO), so the two temporary SAS variables FIRST.PATNO and LAST.PATNO are created. If an observation is the first one in a BY group (in this case, the first occurrence of a patient number), FIRST.PATNO will be true (equal to one). If an observation is the last one in a BY group, LAST.PATNO will be true. Obviously, if FIRST.PATNO and LAST.PATNO are both true, there is only one observation for that patient number (3).
Therefore, the data set DUP contains only observations where there is more than one observation for each patient number. Output from this program is shown next.

Listing of Duplicates from Data Set CLEAN.PATIENTS

PATNO  GENDER  VISIT       HR   SBP  DBP  DX  AE
002    F       11/13/1998   84  120   78  X   0
002    F       11/13/1998   84  120   78  X   0
003    X       10/21/1998   68  190  100  3   1
003    M       11/12/1999   58  112   74      0
006            06/15/1999   72  102   68  6   1
006    F       07/07/1999   82  148   84  1   0
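To watch FIRST.PATNO and LAST.PATNO change from one observation to the next, you can write them to the SAS Log. The following is a small sketch using a made-up three-observation data set (DEMO is a hypothetical name, not one of the book's data sets).

```sas
/* DEMO is a hypothetical test data set, already in PATNO order */
DATA DEMO;
   INPUT PATNO $ X;
DATALINES;
001 1
002 2
002 3
;

DATA _NULL_;
   SET DEMO;
   BY PATNO;
   /* Write the BY variable and both temporary variables to the log */
   PUT PATNO= FIRST.PATNO= LAST.PATNO=;
RUN;
```

In the log, the single observation for patient 001 should show both variables equal to 1, while each observation for patient 002 has only one of the two set to 1. That is exactly the condition the DELETE statement in the program above relies on.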
Next, you want to consider the case where there are two or more observations for each patient and each observation is supposed to have a different visit date (VISIT). The data set PATIENTS2 was created to demonstrate this situation. For this data set, two observations with the same patient ID and visit date would constitute an error. Refer to the Appendix for a listing of the raw data file PATIENTS2.TXT, from which the PATIENTS2 data set was created.

If you would like to create the PATIENTS2 data set for test purposes, run the short DATA step shown in the following program.

Program: Creating the SAS Data Set PATIENTS2 (a Data Set Containing Multiple Visits for Each Patient)

LIBNAME CLEAN "C:\CLEANING";

DATA CLEAN.PATIENTS2;
   INFILE "C:\CLEANING\PATIENTS2.TXT" PAD;
   INPUT @1  PATNO $3.
         @4  VISIT MMDDYY10.
         @14 HR    3.
         @17 SBP   3.
         @20 DBP   3.;
   FORMAT VISIT MMDDYY10.;
RUN;
A listing of the resulting data set follows.

Listing of Data Set PATIENTS2

OBS  PATNO  VISIT       HR   SBP  DBP
  1  001    06/12/1998   80  130   80
  2  001    06/15/1998   78  128   78
  3  002    01/01/1999   48  102   66
  4  002    01/10/1999   70  112   82
  5  002    02/09/1999   74  118   78
  6  003    10/21/1998   68  120   70
  7  004    03/12/1998   70  102   66
  8  004    03/13/1998   70  106   68
  9  005    04/14/1998   72  118   74
 10  005    04/14/1998   74  120   80
 11  006    11/11/1998  100  180  110
 12  007    09/01/1998   68  138  100
 13  007    10/01/1998   68  140   98
Notice that there are from one to three observations for each patient. Also, notice that patient 005 has two observations with the same VISIT date and different data values. To detect this situation, use the variables FIRST. and LAST. as before, except with two BY variables instead of one, as shown in the following program.

Program: Identifying Patient ID's with Duplicate Visit Dates

PROC SORT DATA=CLEAN.PATIENTS2 OUT=TMP;         /* (1) */
   BY PATNO VISIT;
RUN;

DATA DUP;
   SET TMP;
   BY PATNO VISIT;                              /* (2) */
   IF FIRST.VISIT AND LAST.VISIT THEN DELETE;   /* (3) */
RUN;

PROC PRINT DATA=DUP;
   TITLE "Listing of Duplicates from Data Set CLEAN.PATIENTS2";
   ID PATNO;
RUN;
Listing of Duplicates from Data Set CLEAN.PATIENTS2

PATNO  VISIT       HR   SBP  DBP
005    04/14/1998   72  118   74
005    04/14/1998   74  120   80
Using PROC FREQ to Detect Duplicate ID's
Another way to find duplicates uses PROC FREQ to count the number of observations for each value of the patient ID variable (PATNO). Use the patient ID variable and the OUT= option in the TABLES statement to create a SAS data set that contains the value of PATNO and the frequency count (PROC FREQ uses the variable name COUNT to hold the frequency information). After you have this information, you can use it to select the original duplicate observations from your data set. To demonstrate how this works, the following program identifies duplicate patient numbers from the PATIENTS data set.

Program: Using PROC FREQ and an Output Data Set to Identify Duplicate ID's

PROC FREQ DATA=CLEAN.PATIENTS NOPRINT;            /* (1) */
   TABLES PATNO / OUT=DUP_NO(KEEP=PATNO COUNT
                             WHERE=(COUNT GT 1)); /* (2) */
RUN;

PROC SORT DATA=CLEAN.PATIENTS OUT=TMP;
   BY PATNO;
RUN;

PROC SORT DATA=DUP_NO;
   BY PATNO;
RUN;

DATA DUP;
   MERGE TMP DUP_NO(IN=YES_DUP DROP=COUNT);       /* (3) */
   BY PATNO;
   IF YES_DUP;                                    /* (4) */
RUN;

PROC PRINT DATA=DUP;
   TITLE "Listing of Data Set DUP";
RUN;

PROC FREQ (1) uses the NOPRINT option because you only want the output data set, not the actual PROC FREQ listing. The OUT= option in the TABLES statement (2) creates a SAS data set called DUP_NO, which contains the variables PATNO and COUNT. The WHERE= data set option restricts this data set to those observations where COUNT is greater than one (the duplicates).

Next, sort both the original data set PATIENTS (to the temporary data set TMP) and the DUP_NO data set by PATNO. The final DATA step merges the two data sets. The key to the entire program is the IN= option in this MERGE statement (3). The DUP_NO data set only contains patient numbers where the value of COUNT is greater than one. The logical variable YES_DUP, created by this IN= data set option, is true whenever the DUP_NO data set is making a contribution to the observation being formed. Thus, because of line (4), only the duplicates will be placed in the DUP data set, as shown in the next listing.

Listing of Data Set DUP

OBS  PATNO  GENDER  VISIT       HR   SBP  DBP  DX  AE
  1  002    F       11/13/1998   84  120   78  X   0
  2  002    F       11/13/1998   84  120   78  X   0
  3  003    X       10/21/1998   68  190  100  3   1
  4  003    M       11/12/1999   58  112   74      0
  5  006            06/15/1999   72  102   68  6   1
  6  006    F       07/07/1999   82  148   84  1   0
If all you need to do is identify the patient ID's with more than one observation, you can avoid the MERGE step because the output data set from PROC FREQ contains the variable PATNO as well as the frequency (COUNT). So, the much simpler program is shown next.

Program: Producing a List of Duplicate Patient Numbers by Using PROC FREQ

PROC FREQ DATA=CLEAN.PATIENTS NOPRINT;
   TABLES PATNO / OUT=DUP_NO(KEEP=PATNO COUNT
                             WHERE=(COUNT GT 1));
RUN;

DATA _NULL_;
   TITLE "Patients with Duplicate Observations";
   FILE PRINT;
   SET DUP_NO;
   PUT "Patient number " PATNO "has " COUNT "observation(s).";
RUN;

This program is considerably more efficient than the program requiring sorts and merging.
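The same PROC FREQ idea can be extended to the two-variable case discussed earlier (duplicate visit dates within a patient). The sketch below is not from the original text; it assumes the CLEAN.PATIENTS2 data set created above, and DUP_VISIT is a hypothetical output data set name.

```sas
/* Count observations for each PATNO and VISIT combination and */
/* keep only the combinations that occur more than once.       */
PROC FREQ DATA=CLEAN.PATIENTS2 NOPRINT;
   TABLES PATNO*VISIT / OUT=DUP_VISIT(KEEP=PATNO VISIT COUNT
                                      WHERE=(COUNT GT 1));
RUN;

PROC PRINT DATA=DUP_VISIT;
   TITLE "Patient and Visit Date Combinations Occurring More Than Once";
   ID PATNO;
RUN;
```

With the PATIENTS2 listing shown earlier, the only combination expected in DUP_VISIT is patient 005 with the visit date 04/14/1998.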
Selecting Patients with Duplicate Observations by Using a Macro List and SQL
Another quick, easy, and efficient way to select observations with duplicate ID's is to create a macro variable that contains all the patient ID's in the duplicate data set (DUP_NO). Using a short SQL step, you can create a list of patient numbers, separated by spaces or commas (both will work) and placed in quotes, that can subsequently be used as the argument of an IN operator. The following program demonstrates this.

Program: Using PROC SQL to Create a List of Duplicates

PROC SQL NOPRINT;
   SELECT QUOTE(PATNO)                     /* (1) */
      INTO :DUP_LIST SEPARATED BY " "      /* (2) */
      FROM DUP_NO;
QUIT;

PROC PRINT DATA=CLEAN.PATIENTS;
   WHERE PATNO IN (&DUP_LIST);             /* (3) */
   TITLE "Duplicates Selected Using SQL and a Macro Variable";
RUN;

The SELECT statement uses the QUOTE function, which places the patient numbers in quotes (1). Line (2) assigns the list of patient numbers to a macro variable (DUP_LIST) and separates each of the quoted values with a space. Finally, following a PROC PRINT statement, you supply a WHERE statement (3) that selects only those patient numbers that are contained in the list of duplicate patient numbers. Notice that this method does not require a sort or a DATA step. Inspection of the output from this program shows that the observations are in the original order of the observations in the PATIENTS data set.

Duplicates Selected Using SQL and a Macro Variable

Obs  PATNO  GENDER  VISIT       HR   SBP  DBP  DX  AE
  2  002    F       11/13/1998   84  120   78  X   0
  3  003    X       10/21/1998   68  190  100  3   1
  6  006            06/15/1999   72  102   68  6   1
 16  002    F       11/13/1998   84  120   78  X   0
 17  003    M       11/12/1999   58  112   74      0
 31  006    F       07/07/1999   82  148   84  1   0
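If you are unsure what the macro variable holds, you can write it to the SAS Log before using it. This short sketch assumes the DUP_NO data set created earlier by PROC FREQ; it also uses a comma as the separator to illustrate the remark that commas work as well as spaces.

```sas
PROC SQL NOPRINT;
   /* Same query as above, but with a comma between the quoted values */
   SELECT QUOTE(PATNO)
      INTO :DUP_LIST SEPARATED BY ","
      FROM DUP_NO;
QUIT;

/* Display the resolved list in the SAS Log */
%PUT DUP_LIST = &DUP_LIST;
```

The %PUT statement only displays the list; the WHERE statement with the IN operator is used exactly as before.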
If you want this list in sorted order, you can modify the program as shown next.

Program: Using PROC SQL to Create a List of Duplicates in Sorted Order

PROC SQL NOPRINT;
   SELECT QUOTE(PATNO)
      INTO :DUP_LIST SEPARATED BY " "
      FROM DUP_NO;
QUIT;

PROC SORT DATA=CLEAN.PATIENTS OUT=TMP;
   WHERE PATNO IN (&DUP_LIST);
   BY PATNO;
RUN;

PROC PRINT DATA=TMP;
   TITLE "Duplicates Selected Using SQL and a Macro Variable";
RUN;

An SQL solution to this problem is presented in another chapter of this book.

Identifying Subjects with "n" Observations Each (DATA Step Approach)

Besides identifying duplicates, you may need to verify that there are "n" observations per subject in a raw data file or in a SAS data set. For example, if each patient in a clinical trial was seen twice, you might want to verify that there are two observations for each patient ID in the file or data set.
Inspection of the PATIENTS2 listing shown earlier reveals that patient 002 has three observations; patients 001, 004, 005, and 007 have two observations; and patients 003 and 006 have only one observation.

The following program lists all the patient ID's who do not have exactly two observations each.

Program: Using a DATA Step to List All ID's for Patients Who Do Not Have Exactly Two Observations

PROC SORT DATA=CLEAN.PATIENTS2(KEEP=PATNO) OUT=TMP;
   BY PATNO;
RUN;

DATA _NULL_;
   TITLE "Patient ID's for Patients with Other than Two Observations";
   FILE PRINT;
   SET TMP;
   BY PATNO;                                  /* (1) */
   IF FIRST.PATNO THEN N = 1;                 /* (2) */
   ELSE N + 1;                                /* (3) */
   IF LAST.PATNO AND N NE 2 THEN PUT
      "Patient number " PATNO "has " N "observation(s).";   /* (4) */
RUN;

The first step is to sort the data set by patient ID (PATNO), because you will use a BY PATNO statement in the DATA _NULL_ step (1). In this example, you create a new data set to hold the sorted observations. Within each BY group, the counter N is set to 1 for the first observation (2) and incremented for each later one (3). At the last observation for a patient, a message is written if N is not equal to 2 (4).
Patient ID's for Patients with Other than Two Observations

Patient number 002 has 3 observation(s).
Patient number 003 has 1 observation(s).
Patient number 006 has 1 observation(s).

What if a patient with two observations really only had one visit, but the single observation was duplicated by mistake?

The output data set from PROC FREQ (DUP_NO) contains the variables PATNO and the frequency (COUNT), and there is one observation for each patient who did not have exactly two visits. All that is left to do is to search each observation in the DATA _NULL_ step and print an error message. Output from the PROC FREQ program is shown next.

Patient ID's for Patients with Other than Two Observations

Patient number 002 has 3 observation(s).
Patient number 003 has 1 observation(s).
Patient number 006 has 1 observation(s).

It is usually easier to let a PROC do the work, as in this example, rather than doing all the work yourself with a DATA step. An SQL solution to this problem is given in another chapter of this book.
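The PROC FREQ program described above is not reproduced in this section. The following is a sketch of what such a program could look like, written to match the description (an output data set DUP_NO restricted to patients whose COUNT is not 2, followed by a DATA _NULL_ step); it should be read as an assumption consistent with the text rather than as the author's exact code.

```sas
/* Keep only patients whose number of observations is not 2 */
PROC FREQ DATA=CLEAN.PATIENTS2 NOPRINT;
   TABLES PATNO / OUT=DUP_NO(KEEP=PATNO COUNT
                             WHERE=(COUNT NE 2));
RUN;

/* Write one message for each patient in DUP_NO */
DATA _NULL_;
   TITLE "Patient ID's for Patients with Other than Two Observations";
   FILE PRINT;
   SET DUP_NO;
   PUT "Patient number " PATNO "has " COUNT "observation(s).";
RUN;
```

To check for a different number of observations per subject, only the constant in the WHERE= data set option needs to change.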
6
Working with Multiple Files

Introduction
Checking for an ID in Each of Two Files
Checking for an ID in Each of "n" Files
A Simple Macro to Check ID's in Multiple Files
A More Complicated Multi-File Macro for ID Checking
More Complicated Multi-File Rules
Checking That the Dates Are in the Proper Order
Introduction
This chapter addresses data validation techniques where multiple files or data sets are involved.

One requirement of a large project may be that a particular ID value exists in each of several SAS data sets. Let's start out by demonstrating how you can easily check that an ID is in each of two files. This will be generalized later to include an arbitrary number of files.

The technique demonstrated in this section is to merge the two files in question, using the ID variable as a BY variable. The key to the program is the IN= data set option, which sets a logical variable to true or false, depending on whether or not the data set provides values to the current observation being created. An example will make this clearer. The following program contains the SAS statements to create two SAS data sets for testing purposes.

Program: Creating Two Test Data Sets for Chapter 6 Examples

DATA ONE;
   INPUT PATNO X Y;
DATALINES;
1 69 79
2 56 .
3 66 99
5 98 87
12 13 14
;

DATA TWO;
   INPUT PATNO Z;
DATALINES;
1 56
3 67
4 88
5 98
13 99
;
Notice that ID's 2 and 12 are in data set ONE but not in data set TWO; ID's 4 and 13 are in data set TWO but not in data set ONE. The following program gives detailed information on the unmatched ID's.

Program: Identifying ID's Not in Each of Two Data Sets

PROC SORT DATA=ONE;                            /* (1) */
   BY PATNO;
RUN;

PROC SORT DATA=TWO;                            /* (2) */
   BY PATNO;
RUN;

DATA _NULL_;
   FILE PRINT;
   TITLE "Listing of Missing ID's";
   MERGE ONE(IN=INONE)
         TWO(IN=INTWO) END=LAST;               /* (3) */
   BY PATNO;                                   /* (4) */

   IF NOT INONE THEN DO;                       /* (5) */
      PUT "ID " PATNO "is not in data set ONE";
      N + 1;
   END;
   ELSE IF NOT INTWO THEN DO;                  /* (6) */
      PUT "ID " PATNO "is not in data set TWO";
      N + 1;
   END;

   IF LAST AND N EQ 0 THEN
      PUT "All ID's Match in Both Files";      /* (7) */
RUN;

Before you can merge the two data sets, they must first be sorted by the BY variable (1) (2). The MERGE statement (3) is the key to this program. Each of the data set names is followed by the data set option IN=logical_variable. In addition, the END=variable_name option creates a logical variable that is set to true when the last observation from all data sets has been processed. Using a MERGE statement would be useless in this application without the BY statement (4). One final note: although this program runs correctly even if there are multiple observations with the same BY variable in both data sets, you would be wise to check for unexpected duplicates as described in the previous chapter, or to use the NODUPKEY option of PROC SORT on one of the data sets.

Let's "play computer" to see how this program works. Both data sets contain an observation for PATNO=1. Therefore, the two logical variables INONE and INTWO are both true, neither of the IF conditions (5) (6) is true, and no message is printed to the output file. The next value of PATNO is 2, from data set ONE. Because this value is not in data set TWO, the values of INONE and INTWO are true and false, respectively. Therefore, the condition in statement (6) is true, and the appropriate message is printed to the output file. When you reach PATNO=4, which exists in data set TWO but not in data set ONE, the condition in statement (5) is true and its associated message is printed out. Note that any time a value of PATNO is missing from one of the files, the variable N is incremented. When you reach the last observation in the files being merged, the logical variable LAST is true. If there are no ID errors, N will still be 0 and the condition in statement (7) will be true. When you run this program, the following output is obtained.

Listing of Missing ID's

ID 2 is not in data set TWO
ID 4 is not in data set ONE
ID 12 is not in data set TWO
ID 13 is not in data set ONE
All the ID errors are correctly displayed. If you want to extend this program to more than two data sets, the program could become long and tedious. A macro approach can be used to accomplish the ID checking task with an arbitrary number of data sets. The next section demonstrates such a program.

Checking for an ID in Each of "n" Files

Data set THREE is added to the mix to demonstrate how to approach this problem when there are more than two data sets. First, run the following program to create the new data set THREE.

Program: Creating a Third Data Set for Testing Purposes

DATA THREE;
   INPUT PATNO GENDER $ @@;
DATALINES;
1 M 2 F 3 M 5 F 6 M 12 M 13 M
;
Before developing a macro, let's look at the following program, which is a rather simple but slightly tedious program to accomplish the ID checks.

Program: Checking for an ID in Each of Three Data Sets (Long Way)

PROC SORT DATA=ONE(KEEP=PATNO) OUT=TMP1;
   BY PATNO;
RUN;

PROC SORT DATA=TWO(KEEP=PATNO) OUT=TMP2;
   BY PATNO;
RUN;

PROC SORT DATA=THREE(KEEP=PATNO) OUT=TMP3;
   BY PATNO;
RUN;

DATA _NULL_;
   FILE PRINT;
   TITLE "Listing of Missing ID's and Data Set Names";
   MERGE TMP1(IN=IN1)
         TMP2(IN=IN2)
         TMP3(IN=IN3) END=LAST;
   BY PATNO;

   IF NOT IN1 THEN DO;
      PUT "ID " PATNO "missing from data set ONE";
      N + 1;
   END;
   IF NOT IN2 THEN DO;
      PUT "ID " PATNO "missing from data set TWO";
      N + 1;
   END;
   IF NOT IN3 THEN DO;
      PUT "ID " PATNO "missing from data set THREE";
      N + 1;
   END;

   IF LAST AND N EQ 0 THEN
      PUT "All ID's Match in All Files";
RUN;

This program can be extended to accommodate any number of data sets. The output is shown next.

Listing of Missing ID's and Data Set Names

ID 2 missing from data set TWO
ID 4 missing from data set ONE
ID 4 missing from data set THREE
ID 6 missing from data set ONE
ID 6 missing from data set TWO
ID 12 missing from data set TWO
ID 13 missing from data set ONE
A pattern begins to emerge. Notice that the sorts and IF statements follow a pattern that can be automated by writing a macro. Because the programming can become complicated, two macros will be written to accomplish this task. The first is easier to understand but not as elegant; the second is more complicated but slightly more flexible and easier to run.
A Simple Macro to Check ID's in Multiple Files
The first macro has as its calling arguments the ID variable name and the data set names. In this program, up to 10 data sets are allowed. This can easily be increased if necessary. The more complicated macro developed later in this chapter does not have this limitation. The simple macro program is shown next. An explanation follows.

Program: Creating a Macro to Check for an ID in Each of "n" Files (Simple Way)

*-----------------------------------------------------------------*
| Program Name: ID_SIMP.SAS in C:\CLEANING                        |
| Purpose: Simple version of the macro to test if an ID exists in |
|          each of up to 10 data sets                             |
| Arguments: ID   - Name of the ID variable                       |
|            DSNn - Name of the nth data set                      |
| Example: %ID_SIMP(PATNO,ONE,TWO,THREE)                          |
*-----------------------------------------------------------------*;
%MACRO ID_SIMP(ID,DSN1,DSN2,DSN3,DSN4,DSN5,
               DSN6,DSN7,DSN8,DSN9,DSN10);
   TITLE "Report of ID's Not in All Data Sets";

   ***Sorting section;
   %DO I = 1 %TO 10;                                    /* (1) */
      %IF &&DSN&I NE %THEN %DO; /* If non-null argument */  /* (2) */
         %LET N_DATA = &I;                              /* (3) */
         PROC SORT DATA = &&DSN&I(KEEP=&ID) OUT=TMP&I;  /* (4) */
            BY &ID;
         RUN;
      %END;
      %ELSE %LET I = 10; /* Stop the loop when DSNn is null */  /* (5) */
   %END;

   ***Create MERGE statements;
   DATA _NULL_;
      FILE PRINT;
      MERGE
      %DO I = 1 %TO &N_DATA;                            /* (6) */
         &&DSN&I(IN=IN&I)
      %END;
      END=LAST;                                         /* (7) */
      ***End MERGE statement;

      BY &ID;

      ***Error reporting section;
      %DO I = 1 %TO &N_DATA;                            /* (8) */
         IF NOT IN&I THEN DO;
            PUT "ID " &ID "Missing from data set &&DSN&I";
            N + 1;
         END;
      %END;

      IF LAST AND N EQ 0 THEN DO;                       /* (9) */
         PUT "All ID's Match in All Files";
         STOP;
      END;
   RUN;
%MEND ID_SIMP;

To run this macro for three data sets (ONE, TWO, and THREE) with a common ID variable (PATNO), you would write:

%ID_SIMP(PATNO,ONE,TWO,THREE)

The program starts with a %MACRO statement with calling arguments for the name of the common ID variable and up to 10 data set names. Next, you want to sort all the data sets by the ID variable. The %DO loop (1) accomplishes this task. When I is equal to 1, the macro variable &&DSN&I (4) first resolves to &DSN1, which is the macro variable representing the first data set to be sorted. The same task is performed for each of the "n" data sets. Each time the loop iterates, the value of N_DATA is set equal to the loop index (3). When all the data sets have been processed, the value of &&DSN&I will be null and the %IF condition (2) will be false. This causes the loop counter to be set to 10, which stops the loop (5).

The next section of the program generates statements similar to those following the sorts in the previous program. After the word MERGE, a %DO loop (6) writes the data set names, each followed by a matching IN= option. For example, if the first data set is called ONE, the first iteration of the %DO loop writes the text ONE(IN=IN1). After the last data set name and IN= option, the text END=LAST finishes the MERGE statement (7).

If any of the IN= logical variables are false (that data set did not have a contribution to the MERGE), an error report is generated (8). Finally, if all the data sets have been merged (LAST is true) and N still equals 0 (no missing ID's), the IF condition in line (9) is true and the program reports that all ID's match.

To help clarify this program, look at the code generated by the macro (using the MPRINT system option) when it is used to check the three data sets ONE, TWO, and THREE.

SAS Log Showing the Macro-Generated Program (Using the MPRINT Option)

MPRINT(ID_SIMP):   TITLE "Report of ID's Not in All Data Sets";
MPRINT(ID_SIMP):   ***Sorting section;
MPRINT(ID_SIMP):   PROC SORT DATA = one(KEEP=patno) OUT=TMP1;
MPRINT(ID_SIMP):   BY patno;
MPRINT(ID_SIMP):   RUN;
MPRINT(ID_SIMP):   PROC SORT DATA = two(KEEP=patno) OUT=TMP2;
MPRINT(ID_SIMP):   BY patno;
MPRINT(ID_SIMP):   RUN;
MPRINT(ID_SIMP):   PROC SORT DATA = three(KEEP=patno) OUT=TMP3;
MPRINT(ID_SIMP):   BY patno;
MPRINT(ID_SIMP):   RUN;
MPRINT(ID_SIMP):   ***Create MERGE statements;
MPRINT(ID_SIMP):   DATA _NULL_;
MPRINT(ID_SIMP):   FILE PRINT;
MPRINT(ID_SIMP):   MERGE one(IN=IN1) two(IN=IN2) three(IN=IN3) END=LAST;
MPRINT(ID_SIMP):   ***End MERGE statement;
MPRINT(ID_SIMP):   BY patno;
MPRINT(ID_SIMP):   ***Error reporting section;
MPRINT(ID_SIMP):   IF NOT IN1 THEN DO;
MPRINT(ID_SIMP):   PUT "ID " patno "Missing from data set one";
MPRINT(ID_SIMP):   N + 1;
MPRINT(ID_SIMP):   END;
MPRINT(ID_SIMP):   IF NOT IN2 THEN DO;
MPRINT(ID_SIMP):   PUT "ID " patno "Missing from data set two";
MPRINT(ID_SIMP):   N + 1;
MPRINT(ID_SIMP):   END;
MPRINT(ID_SIMP):   IF NOT IN3 THEN DO;
MPRINT(ID_SIMP):   PUT "ID " patno "Missing from data set three";
MPRINT(ID_SIMP):   N + 1;
MPRINT(ID_SIMP):   END;
MPRINT(ID_SIMP):   IF LAST AND N EQ 0 THEN DO;
MPRINT(ID_SIMP):   PUT "All ID's Match in All Files";
MPRINT(ID_SIMP):   STOP;
MPRINT(ID_SIMP):   END;
MPRINT(ID_SIMP):   RUN;

Notice that the macro-generated code is basically the same as the non-macro program that preceded it. Finally, here is the output from the macro.

Report of ID's Not in All Data Sets

ID 2 Missing from data set two
ID 4 Missing from data set one
ID 4 Missing from data set three
ID 6 Missing from data set one
ID 12 Missing from data set two
ID 13 Missing from data set one
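To produce a log like the one shown above, the MPRINT system option has to be turned on before the macro is called. A minimal sketch, assuming the ID_SIMP macro has already been compiled and the data sets ONE, TWO, and THREE exist:

```sas
/* Turn on the display of macro-generated SAS statements */
OPTIONS MPRINT;

%ID_SIMP(PATNO,ONE,TWO,THREE)

/* Turn the option off again to keep later logs shorter */
OPTIONS NOMPRINT;
```

Each statement the macro generates is then written to the SAS Log on its own line, prefixed with MPRINT(ID_SIMP).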
A More Complicated Multi-File Macro for ID Checking
The above macro can be modified to handle any number of data sets. This is accomplished by calling the macro with one argument representing the ID variable and the other a list of data sets to be checked, separated by spaces.

Program: Writing a More General Macro to Handle Any Number of Data Sets

*----------------------------------------------------------------*
| Program Name: CHECK_ID.SAS in C:\CLEANING                      |
| Purpose: Macro which checks if an ID exists in each of n files |
| Arguments: The name of the ID variable, followed by as many    |
|            data set names as you want, separated by spaces     |
| Example: %CHECK_ID(PATNO,ONE TWO THREE)                        |
*----------------------------------------------------------------*;
%MACRO CHECK_ID(ID,DS_LIST);                            /* (1) */
   %LET STOP = 999; /* Initialize stop at a large value */
   %DO I = 1 %TO &STOP;
      %LET DSN = %SCAN(&DS_LIST,&I); /* Break up list into
                                        data set names */
      %IF &DSN NE %THEN /* If non-null argument */      /* (2) */
      %DO;
         %LET N = &I; /* the number of data sets */
         PROC SORT DATA=&DSN(KEEP=&ID) OUT=TMP&I;
            BY &ID;
         RUN;
      %END;
      %ELSE %LET I = &STOP; /* Set index to max so loop stops */
   %END;

   DATA _NULL_;
      FILE PRINT;
      MERGE
      %DO I = 1 %TO &N;
         TMP&I(IN=IN&I)
      %END;
      END=LAST;
      BY &ID;

      %DO I = 1 %TO &N;
         %LET DSN = %SCAN(&DS_LIST,&I);
         IF NOT IN&I THEN DO;
            PUT "ID " &ID "missing from data set &DSN";
            NN + 1;
         END;
      %END;

      IF LAST AND NN EQ 0 THEN DO;
         PUT "All ID's Match in All Files";
         STOP;
      END;
   RUN;
%MEND CHECK_ID;

This macro is similar to the previous one except for the way the multiple data set names are handled. The second argument in the macro call is a list of data set names, separated by spaces (1). The %SCAN function, which interprets spaces as delimiters (the default), allows you to extract each of the data set names from the list. When the %SCAN function returns a null string (2), the loop ends (by setting the value of the %DO loop counter to the ending value) and the program continues. The remainder of the program is similar to the previous program, where a MERGE statement is constructed using the multiple data set names followed by the IN= data set option.

To invoke this macro to check the three data sets (ONE, TWO, and THREE) with the ID variable PATNO, you would write:

%CHECK_ID(PATNO,ONE TWO THREE)
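The way %SCAN walks through a space-separated list can be seen in isolation with a small test macro. The macro name SHOW_LIST below is hypothetical and is not part of the book's programs; it only writes each name to the SAS Log.

```sas
%MACRO SHOW_LIST(DS_LIST);
   %LET I = 1;
   /* Continue while the Ith word of the list is not null */
   %DO %WHILE(%LENGTH(%SCAN(&DS_LIST,&I)) GT 0);
      %PUT Data set &I is %SCAN(&DS_LIST,&I);
      %LET I = %EVAL(&I + 1);
   %END;
%MEND SHOW_LIST;

%SHOW_LIST(ONE TWO THREE)
```

This sketch uses a %DO %WHILE loop instead of resetting the loop index as CHECK_ID does; both stop at the first null string returned by %SCAN.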
Remember that the multiple data set names are separated by spaces rather than commas, as in the previous macro.

More Complicated Multi-File Rules

Every data collection project will have its own set of rules. The programs in this section are intended to demonstrate techniques rather than be exact models for multiple file data validation rules.

In this first example, you want to be sure an observation has been added to the laboratory test data set (LAB_TEST) if there was an adverse event of 'X' entered for any patient ID in the adverse event data set (AE). The AE and the LAB_TEST data sets can be created by running the programs given in the Appendix.
Here are the listings of the AE and LAB_TEST data sets.

Listing of Data Set AE

PATNO  DATE_AE     A_EVENT
001    11/21/1998  W
001    12/13/1998  Y
003    11/18/1998  X
004    09/18/1998  O
004    09/19/1998  P
011    10/10/1998  X
013    09/25/1998  W
009    12/25/1998  X
022    10/01/1998  W
025    02/09/1999  X

Note that each patient ID (PATNO) may have more than one adverse event.

Listing of Data Set LAB_TEST

PATNO  LAB_DATE      WBC   RBC
001    11/15/1998   9000  5.45
003    11/19/1998   9500  5.44
007    10/21/1998   8200  5.23
004    12/22/1998  11000  5.55
025    01/01/1999   8234  5.02
022    10/10/1998   8000  5.00

Note that each patient has only one observation in this data set.

According to our rule, any patient with an adverse event of 'X' should have one observation in the LAB_TEST data set. Patients 003, 011, 009, and 025 all had an adverse event of 'X'; however, only patients 003 and 025 had an observation in the LAB_TEST data set (although the date of the lab test for patient 025 is earlier than the date of the AE; let's ignore this for now). One approach to locating these two omissions is to merge the AE and LAB_TEST data sets by patient ID, selecting only those patients with an AE of 'X'. Then, using the IN= data set option on the merge, you can locate any missing observations. This is shown in the following program. An explanation follows.
Program: Verifying That Patients with an Adverse Event of 'X' in Data Set AE Have an Entry in Data Set LAB_TEST

PROC SORT DATA=CLEAN.AE OUT=AE_X;
   WHERE A_EVENT = 'X';                     /* (1) */
   BY PATNO;
RUN;

PROC SORT DATA=CLEAN.LAB_TEST(KEEP=PATNO LAB_DATE) OUT=LAB;
   BY PATNO;
RUN;

DATA MISSING;
   MERGE AE_X
         LAB(IN=IN_LAB);                    /* (2) */
   BY PATNO;
   IF NOT IN_LAB;
RUN;

PROC PRINT DATA=MISSING LABEL;
   TITLE "Patients with AE of X Who Are Missing Lab Test Entry";
   ID PATNO;
   VAR DATE_AE A_EVENT;
RUN;

Each of the two data sets is first sorted by patient ID (PATNO). In addition, by using a WHERE statement (1) following the PROC SORT statement, only those observations in the adverse events data set with event 'X' are selected. The key to the program is statement (2), where you use the IN= option to create the temporary logical variable IN_LAB. Because there should be an observation in LAB_TEST for every patient with an adverse event of 'X', any time the logical variable IN_LAB is false, you have located a patient with a missing laboratory test. The output from this program is shown next.

Patients with AE of X Who Are Missing Lab Test Entry

Patient ID  Date of AE  Adverse Event
009         12/25/1998  X
011         10/10/1998  X
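The same rule can also be checked without sorting or merging. The following PROC SQL step is a sketch of an alternative, not the approach used in this chapter; it assumes the CLEAN.AE and CLEAN.LAB_TEST data sets described above.

```sas
PROC SQL;
   TITLE "Patients with AE of X Who Are Missing Lab Test Entry";
   /* Select AE records with event X whose patient ID   */
   /* does not appear anywhere in the LAB_TEST data set */
   SELECT PATNO, DATE_AE, A_EVENT
      FROM CLEAN.AE
      WHERE A_EVENT = 'X' AND
            PATNO NOT IN (SELECT PATNO FROM CLEAN.LAB_TEST);
QUIT;
```

With the listings shown earlier, this query should report the same two patients as the MERGE approach (009 and 011), although the order of the rows may differ.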
Checking That the Dates Are in the Proper Order
It was mentioned earlier that the date of the laboratory test for patient 025 was prior to the date of the adverse event.

Program: Adding the Condition That the Lab Test Must Follow the Adverse Event

TITLE "Patients with AE of X Who Are Missing Lab Test Entry";
TITLE2 "or the Date of the Lab Test Is Earlier Than the AE Date";
TITLE3 "-------------------------------------------------------";

DATA _NULL_;
   FILE PRINT;
   MERGE AE_X(IN=IN_AE)
         LAB(IN=IN_LAB);
   BY PATNO;
   IF NOT IN_LAB THEN PUT
      "No Lab Test for Patient " PATNO "with Adverse Event X";
   ELSE IF IN_AE AND LAB_DATE EQ . THEN PUT                /* (1) */
      "Date of Lab Test Is Missing for Patient "
      PATNO /
      "Date of AE Is " DATE_AE /;
   ELSE IF IN_AE AND LAB_DATE LT DATE_AE THEN PUT          /* (2) */
      "Date of Lab Test Is Earlier Than Date of AE for Patient "
      PATNO /
      " Date of AE Is " DATE_AE " Date of Lab Test Is " LAB_DATE /;
RUN;

One IF statement (1) checks if the laboratory date is missing, and the other IF statement (2) tests if the laboratory date is prior to (less than) the date of the adverse event. In either case, an appropriate message is printed. Output from this program is shown next.
Patients with AE of X Who Are Missing Lab Test Entry
or the Date of the Lab Test Is Earlier Than the AE Date
-------------------------------------------------------
No Lab Test for Patient 009 with Adverse Event X
No Lab Test for Patient 011 with Adverse Event X
Date of Lab Test Is Earlier Than Date of AE for Patient 025
 Date of AE Is 02/09/1999 Date of Lab Test Is 01/01/1999

Programs to verify multi-file rules can become very complicated. However, many of the techniques discussed in this chapter should prove useful.
7
Double Entry and Verification (PROC COMPARE)

Introduction
Conducting a Simple Comparison of Two Data Sets without an ID Variable
Using PROC COMPARE with an ID Variable
Using PROC COMPARE with Two Data Sets That Have an Unequal Number of Observations
Comparing Two Data Sets When Some Variables Are Not in Both Data Sets
Introduction
Many critical data applications require that you have the data entered twice and then compare the resulting files for discrepancies. This is usually referred to as double entry and verification. In the old days, when I was first learning to use computers, most data entry was done using a keypunch (although my boys will tell you that in my day, it was done with a hammer and chisel on stone tablets).

The most common method of double entry and verification was done on a special keypunch machine called a verifier. The original cards were placed in the input hopper, and a keypunch operator (preferably not the one who entered the data originally) rekeyed the information from the data entry form. If the information being typed matched the information already punched on the card, the card was accepted and a punch was placed, usually in a designated column of the card. If the information did not match, a check could be made to see whether the error was on the original card or in the rekeying of the information.

Today, there are several programs that accomplish the same goal by having all the data entered twice and then comparing the resulting data files. Some of these programs are quite sophisticated and also quite expensive. SAS software has a very flexible procedure called PROC COMPARE, which can be used to compare the contents of two SAS data sets.
Conducting a Simple Comparison of Two Data Sets without an ID Variable
The simplest application of PROC COMPARE is presented first: determining if the contents of two SAS data sets are identical. Suppose you have two people enter data from some coding forms, and the two data sets are called FILE_1 and FILE_2. A listing of the two files is shown next.

FILE_1
001M10211946130 80
002F12201950110 70
003M09141956140 90
004F10101960180100
007m10321940184110

FILE_2
001M1021194613080
002F12201950110 70
003M09141956144 90
004F10101960180100
007M10231940184110

Here is the file format:

Variable  Description               Starting Column  Length  Type
PATNO     Patient Number            1                3       Numeric
GENDER    Gender                    4                1       Character
DOB       Date of Birth             5                8       mmddyyyy
SBP       Systolic Blood Pressure   13               3       Numeric
DBP       Diastolic Blood Pressure  16               3       Numeric
The data, without mistakes, should have been:

Correct Data Representation
001M10211946130 80
002F12201950110 70
003M09141956140 90
004F10101960180100
007M10231940184110

A visual inspection of the two original files shows the following discrepancies:

For patient 001, there is a space missing before the 80 at the end of the line in FILE_2.
For patient 003, SBP is 144 instead of 140 in FILE_2.
For patient 007, the gender is entered in lowercase, and the digits are interchanged in the day field of the date in FILE_1.

Let's see how to use PROC COMPARE to detect these differences.

Creating Data Sets ONE and TWO from Two Raw Data Files

DATA ONE;
   INFILE "C:\CLEANING\FILE_1" PAD;
   INPUT @1  PATNO  3.
         @4  GENDER $1.
         @5  DOB    MMDDYY8.
         @13 SBP    3.
         @16 DBP    3.;
   FORMAT DOB MMDDYY10.;
RUN;

DATA TWO;
   INFILE "C:\CLEANING\FILE_2" PAD;
   INPUT @1  PATNO  3.
         @4  GENDER $1.
         @5  DOB    MMDDYY8.
         @13 SBP    3.
         @16 DBP    3.;
   FORMAT DOB MMDDYY10.;
RUN;
Then run PROC COMPARE, as shown in the following program.

Running PROC COMPARE

PROC COMPARE BASE=ONE COMPARE=TWO;
   TITLE "Using PROC COMPARE to Compare Two Data Sets";
RUN;
The procedure options BASE= and COMPARE= identify the two data sets to be compared. In this example, data set ONE was arbitrarily chosen as the base data set. (The option DATA= may be used in place of BASE= because they are equivalent.) Here is the output from PROC COMPARE.

Using PROC COMPARE to Compare Two Data Sets

COMPARE Procedure
Comparison of WORK.ONE with WORK.TWO
(Method=EXACT)

Data Set Summary

Data set    Created            Modified           Nvar   NObs
WORK.ONE    17AUG98:10:30:25   17AUG98:10:30:25      5      5
WORK.TWO    17AUG98:10:30:25   17AUG98:10:30:25      5      5

Variables Summary

Number of Variables in Common: 5.

Observation Summary

Observation      Base  Compare
First Obs           1        1
First Unequal       3        3
Last  Unequal       5        5
Last  Obs           5        5

Number of Observations in Common: 5.
Total Number of Observations Read from WORK.ONE: 5.
Total Number of Observations Read from WORK.TWO: 5.
Number of Observations with Some Compared Variables Unequal: 2.
Number of Observations with All Compared Variables Equal: 3.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 2.
Number of Variables Compared with Some Observations Unequal: 3.
Number of Variables with Missing Value Differences: 1.
Total Number of Values which Compare Unequal: 3.
Maximum Difference: 4.

Variables with Unequal Values

Variable  Type  Len  Ndif  MaxDif  MissDif
GENDER    CHAR    1     1                0
DOB       NUM     8     1       0        1
SBP       NUM     8     1   4.000        0

Value Comparison Results for Variables
__________________________________________________________
           ||  Base Value    Compare Value
       Obs ||  GENDER        GENDER
 ________  ||  _             _
           ||
         5 ||  m             M
__________________________________________________________
__________________________________________________________
           ||       Base    Compare
       Obs ||        DOB        DOB      Diff.     % Diff
 ________  ||  _________  _________  _________  _________
           ||
         5 ||          .   10/23/40          .          .
__________________________________________________________
__________________________________________________________
           ||       Base    Compare
       Obs ||        SBP        SBP      Diff.     % Diff
 ________  ||  _________  _________  _________  _________
           ||
         3 ||   140.0000   144.0000     4.0000     2.8571
__________________________________________________________
Notice that the left-adjusted value of 80 for patient 001 in FILE_2 was not flagged as an error. Why? Because SAS correctly reads left-adjusted numeric values, and the comparison is between the two SAS data sets, not the raw files themselves. Also, the incorrect date (10/32/1940) for patient number 007 in FILE_1 was shown as a missing value in the output. If you inspect the SAS Log, you will see the incorrect date was flagged as an error. When invalid dates are encountered, SAS substitutes a missing value for the date. If you do not want this to happen, you can use a character informat instead of a date informat for data checking purposes.
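As a minimal sketch of that last suggestion, the DATA steps below read the date field with the $CHAR8. character informat, so an invalid date such as 10321940 is kept as text instead of being converted to a missing value. The data set names ONE_C and TWO_C and the variable name DOB_C are hypothetical; the file paths and column positions are those used when data sets ONE and TWO were created.

```sas
DATA ONE_C;
   INFILE "C:\CLEANING\FILE_1" PAD;
   INPUT @1  PATNO  3.
         @4  GENDER $1.
         @5  DOB_C  $CHAR8.    /* date kept as raw text, not a SAS date */
         @13 SBP    3.
         @16 DBP    3.;
RUN;

DATA TWO_C;
   INFILE "C:\CLEANING\FILE_2" PAD;
   INPUT @1  PATNO  3.
         @4  GENDER $1.
         @5  DOB_C  $CHAR8.
         @13 SBP    3.
         @16 DBP    3.;
RUN;

PROC COMPARE BASE=ONE_C COMPARE=TWO_C;
   TITLE "Comparing Dates Read as Character Values";
RUN;
```

Read this way, the two date strings for patient 007 should be reported as unequal character values rather than as a missing value difference.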
If you want to simply emulate a data entry verify process, you can proceed in another way.

Using PROC COMPARE to Compare Two Data Records

DATA ONE;
   INFILE "C:\CLEANING\FILE_1" PAD;
   INPUT STRING $CHAR18.;
RUN;

DATA TWO;
   INFILE "C:\CLEANING\FILE_2" PAD;
   INPUT STRING $CHAR18.;
RUN;

PROC COMPARE BASE=ONE COMPARE=TWO BRIEF;
   TITLE "Treating Each Line as a String";
RUN;

This greatly simplifies the DATA steps and perhaps gives you a result closer to what you really want: an exact comparison of the raw data files. Here is the output:

Treating Each Line as a String

COMPARE Procedure
Comparison of WORK.ONE with WORK.TWO
(Method=EXACT)

NOTE: Values of the following 1 variables compare unequal: STRING

Value Comparison Results for Variables
__________________________________________________________
           ||  Base Value           Compare Value
       Obs ||  STRING               STRING
 ________  ||  __________________   __________________
           ||
         1 ||  001M10211946130 80   001M1021194613080
         3 ||  003M09141956140 90   003M09141956144 90
         5 ||  007m10321940184110   007M10231940184110
__________________________________________________________

Of course, you now have to look over the two lines and determine where the differences are. Which of these methods you use will depend on your goals in the verification process.
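If looking over the lines by eye becomes tedious, a short DATA step can point to the columns that differ. The following is only a sketch, assuming the single-variable data sets ONE and TWO created with STRING $CHAR18.; the names STRING2 and COL are hypothetical.

```sas
DATA _NULL_;
   FILE PRINT;
   /* One-to-one merge (no BY statement): line n of ONE is paired with line n of TWO */
   MERGE ONE TWO(RENAME=(STRING=STRING2));
   DO COL = 1 TO 18;
      /* Compare the two records one character at a time */
      IF SUBSTR(STRING,COL,1) NE SUBSTR(STRING2,COL,1) THEN
         PUT "Line " _N_ "differs in column " COL;
   END;
RUN;
```

Because the merge pairs records purely by position, this check is only meaningful when both files contain the same lines in the same order.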
Using PROC COMPARE with an ID Variable
If you are going to use the first method (described in the first two programs of this chapter), identifying each of the variables in the two data sets, it is much better to specify an ID variable (PATNO in this example) to link the two files.

Using PROC COMPARE with an ID Variable

PROC COMPARE BASE=ONE COMPARE=TWO;
   TITLE "Using PROC COMPARE to Compare Two Data Sets";
   ID PATNO;
RUN;
Using PROC COMPARE to Compare Two Data Sets

COMPARE Procedure
Comparison of WORK.ONE with WORK.TWO
(Method=EXACT)

Data Set Summary

Dataset     Created            Modified           NVar   NObs
WORK.ONE    13AUG98:10:49:13   13AUG98:10:49:13      5      5
WORK.TWO    13AUG98:10:49:13   13AUG98:10:49:13      5      5

Variables Summary

Number of Variables in Common: 5.
Number of ID Variables: 1.

Observation Summary

Observation      Base  Compare  ID
First Obs           1        1  PATNO=1
First Unequal       3        3  PATNO=3
Last  Unequal       5        5  PATNO=7
Last  Obs           5        5  PATNO=7

Number of Observations in Common: 5.
Total Number of Observations Read from WORK.ONE: 5.
Total Number of Observations Read from WORK.TWO: 5.
Number of Observations with Some Compared Variables Unequal: 2.
Number of Observations with All Compared Variables Equal: 3.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 3.
Number of Variables with Missing Value Differences: 1.
Total Number of Values which Compare Unequal: 3.
Maximum Difference: 4.

Variables with Unequal Values

Variable  Type  Len  Ndif  MaxDif  MissDif
GENDER    CHAR    1     1                0
DOB       NUM     8     1       0        1
SBP       NUM     8     1   4.000        0

Value Comparison Results for Variables
_________________________________________________________
          ||  Base Value    Compare Value
    PATNO ||  GENDER        GENDER
 _______  ||  _             _
          ||
        7 ||  m             M
_________________________________________________________
_________________________________________________________
          ||       Base    Compare
    PATNO ||        DOB        DOB      Diff.     % Diff
 _______  ||  _________  _________  _________  _________
          ||
        7 ||          .   10/23/40          .          .
_________________________________________________________
_________________________________________________________
          ||       Base    Compare
    PATNO ||        SBP        SBP      Diff.     % Diff
 _______  ||  _________  _________  _________  _________
          ||
        3 ||   140.0000   144.0000     4.0000     2.8571
_________________________________________________________
This output has the advantage of identifying the incompatible data lines by patient number, making it easier to go back to the data sheets and determine the correct values.
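One practical point when using an ID statement: PROC COMPARE expects both data sets to be sorted by the ID variable (or indexed on it), unless the NOTSORTED option is added to the ID statement. The sample files here are already in patient-number order, so the following sketch is only a precaution for data keyed in arbitrary order.

```sas
/* Put both data sets in PATNO order before comparing by ID */
PROC SORT DATA=ONE;
   BY PATNO;
RUN;

PROC SORT DATA=TWO;
   BY PATNO;
RUN;

PROC COMPARE BASE=ONE COMPARE=TWO;
   TITLE "Using PROC COMPARE to Compare Two Data Sets";
   ID PATNO;
RUN;
```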
Using PROC COMPARE with Two Data Sets That Have an Unequal Number of Observations
The ID statement is especially useful when the two data sets do not contain the same number of observations or when there is a discrepancy between the values of the ID variables. To see how PROC COMPARE treats this problem, look at the two new files (FILE_1B and FILE_2B) shown next. A new patient number (005) has been added to FILE_1 to make FILE_1B, and patient number 004 has been omitted from FILE_2 to make FILE_2B.

FILE_1B
001M10211946130 80
002F12201950110 70
003M09141956140 90
004F10101960180100
005M01041930166 88
007m10321940184110

FILE_2B
001M1021194613080
002F12201950110 70
003M09141956144 90
007M10231940184110

The two SAS data sets (ONE_B and TWO_B) are created as shown in the earlier program that created data sets ONE and TWO.

Running PROC COMPARE on Two Data Sets of Different Length

PROC COMPARE BASE=ONE_B COMPARE=TWO_B;
   TITLE "Comparing Two Data Sets with Different ID Values";
   ID PATNO;
RUN;
Here is the output from this program:

Comparing Two Data Sets with Different ID Values

COMPARE Procedure
Comparison of WORK.ONE_B with WORK.TWO_B
(Method=EXACT)

Data Set Summary

Data set      Created            Modified           NVar   NObs
WORK.ONE_B    13AUG98:11:12:13   13AUG98:11:12:13      5      6
WORK.TWO_B    13AUG98:11:12:14   13AUG98:11:12:14      5      4

Variables Summary

Number of Variables in Common: 5.
Number of ID Variables: 1.

Observation Summary

Observation      Base  Compare  ID
First Obs           1        1  PATNO=1
First Unequal       3        3  PATNO=3
Last  Unequal       6        4  PATNO=7
Last  Obs           6        4  PATNO=7

Number of Observations in Common: 4.
Number of Observations in WORK.ONE_B but not in WORK.TWO_B: 2.
Total Number of Observations Read from WORK.ONE_B: 6.
Total Number of Observations Read from WORK.TWO_B: 4.
Number of Observations with Some Compared Variables Unequal: 2.
Number of Observations with All Compared Variables Equal: 2.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 3.
Number of Variables with Missing Value Differences: 1.
Total Number of Values That Compare Unequal: 3.
Maximum Difference: 4.

Variables with Unequal Values

Variable  Type  Len  Ndif  MaxDif  MissDif
GENDER    CHAR    1     1                0
DOB       NUM     8     1       0        1
SBP       NUM     8     1   4.000        0

Value Comparison Results for Variables
_________________________________________________________
          ||  Base Value    Compare Value
    PATNO ||  GENDER        GENDER
 _______  ||  _             _
          ||
        7 ||  m             M
_________________________________________________________
_________________________________________________________
          ||       Base    Compare
    PATNO ||        DOB        DOB      Diff.     % Diff
 _______  ||  _________  _________  _________  _________
          ||
        7 ||          .   10/23/40          .          .
_________________________________________________________
_________________________________________________________
          ||       Base    Compare
    PATNO ||        SBP        SBP      Diff.     % Diff
 _______  ||  _________  _________  _________  _________
          ||
        3 ||   140.0000   144.0000     4.0000     2.8571
_________________________________________________________

Notice that the information concerning the missing patients is not shown. To see this, add the two procedure options LISTBASE and LISTCOMP to see a list of observations found in one data set but not in the other. Next, you'll see the additional information that you will get when these two options are used.
Comparing Two Data Sets with Different ID Values

COMPARE Procedure
Comparison of WORK.ONE_B with WORK.TWO_B
(Method=EXACT)

Comparison Results for Observations

Observation 4 in WORK.ONE_B not found in WORK.TWO_B: PATNO=4.
Observation 5 in WORK.ONE_B not found in WORK.TWO_B: PATNO=5.
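The procedure call that produces this additional listing is not shown in the text. A plausible version, assuming the same data sets ONE_B and TWO_B and simply adding the two options named above, would be:

```sas
PROC COMPARE BASE=ONE_B COMPARE=TWO_B LISTBASE LISTCOMP;
   TITLE "Comparing Two Data Sets with Different ID Values";
   ID PATNO;   /* observations are matched on patient number */
RUN;
```

LISTBASE reports what is found only in the base data set, and LISTCOMP reports what is found only in the comparison data set.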
Comparing Two Data Sets When Some Variables Are Not in Both Data Sets
PROC COMPARE can also be used to compare selected variables between two data sets. Suppose you have one data set that contains demographic information on each patient in a clinical trial (DEMOG). In addition, you have another file from a previous study that contains some of the same patients and some of the same demographic information (OLDDEMOG). The following program creates these sample data sets.

Creating Two Test Data Sets, DEMOG and OLDDEMOG

***Program to create data sets DEMOG and OLDDEMOG;
DATA DEMOG;
   INPUT @1  PATNO  3.
         @4  GENDER $1.
         @5  DOB    MMDDYY10.
         @15 HEIGHT 2.;
   FORMAT DOB MMDDYY10.;
DATALINES;
001M10/21/194668
003F11/11/105062
004M04/05/193072
006F05/13/196863
;

DATA OLDDEMOG;
   INPUT @1  PATNO  3.
         @4  DOB    MMDDYY8.
         @12 GENDER $1.
         @13 WEIGHT 3.;
   FORMAT DOB MMDDYY10.;
DATALINES;
00110211946M155
00201011950F102
00404051930F101
00511111945M200
00605131966F133
;

Comparing Two Data Sets That Contain Different Variables

PROC COMPARE BASE=OLDDEMOG COMPARE=DEMOG BRIEF;
   TITLE "Comparing Demographic Information between Two Data Sets";
   ID PATNO;
RUN;
Here is the output from this procedure:

Comparing Demographic Information between Two Data Sets

COMPARE Procedure
Comparison of WORK.OLDDEMOG with WORK.DEMOG
(Method=EXACT)

NOTE: Data set WORK.OLDDEMOG contains 2 observations not in WORK.DEMOG.
NOTE: Data set WORK.DEMOG contains 1 observations not in WORK.OLDDEMOG.
NOTE: Values of the following 2 variables compare unequal: DOB GENDER

Value Comparison Results for Variables
_________________________________________________________
          ||       Base    Compare
    PATNO ||        DOB        DOB      Diff.     % Diff
 _______  ||  _________  _________  _________  _________
          ||
        6 ||   05/13/66   05/13/68   731.0000    31.4544
_________________________________________________________
_________________________________________________________
          ||  Base Value    Compare Value
    PATNO ||  GENDER        GENDER
 _______  ||  _             _
          ||
        4 ||  F             M
_________________________________________________________
Suppose you only want to verify that the genders are correct between the two files.

Adding a VAR Statement to PROC COMPARE

PROC COMPARE BASE=OLDDEMOG COMPARE=DEMOG BRIEF;
   TITLE "Comparing Demographic Information between Two Data Sets";
   ID PATNO;
   VAR GENDER;
RUN;

Here is the output, which now only shows a gender comparison between the two files.

Comparing Demographic Information between Two Data Sets

COMPARE Procedure
Comparison of WORK.OLDDEMOG with WORK.DEMOG
(Method=EXACT)

NOTE: Data set WORK.OLDDEMOG contains 2 observations not in WORK.DEMOG.
NOTE: Data set WORK.DEMOG contains 1 observations not in WORK.OLDDEMOG.
NOTE: Values of the following 1 variables compare unequal: GENDER

Value Comparison Results for Variables
_________________________________________________________
          ||  Base Value    Compare Value
    PATNO ||  GENDER        GENDER
 _______  ||  _             _
          ||
        4 ||  F             M
_________________________________________________________
8
Some SQL Solutions to Data Cleaning

Introduction
A Quick Review of PROC SQL
Checking for Invalid Character Values
Checking for Outliers
Checking a Range Using an Algorithm Based on the Standard Deviation
Checking for Missing Values
Range Checking for Dates
Checking for Duplicates
Identifying Subjects with "n" Observations Each
Checking for an ID in Each of Two Files
More Complicated Multi-File Rules
Introduction
It was a hard decision whether to group all the PROC SQL approaches together in one chapter or to include an SQL solution in each of the other chapters. I opted for the former. PROC SQL (Structured Query Language) is an alternative to the traditional DATA step and PROC approaches used in this book up to this point. PROC SQL is sometimes easier to program and more efficient, sometimes less so (sometimes extremely less efficient, as with full joins). In this chapter, many of the data cleaning operations you performed earlier with DATA step and PROC solutions will be revisited. Be careful; some SAS programmers get carried away with PROC SQL and try to solve every problem with it. It is extremely powerful and useful, but it is not always the best solution to every problem. For a more complete discussion of PROC SQL, I recommend the SAS Guide to the SQL Procedure: Usage and Reference, Version 6, published by SAS Institute.
A Quick Review of PROC SQL
PROC SQL can be used to list data to the output device (Output window), to create SAS data sets (also called tables in SQL terminology), to create SAS views, or to create macro variables. For most of your data cleaning operations, you will not be creating SAS data sets or views. By omitting the CREATE statement, the results of an SQL query will be sent to the Output window (unless the NOPRINT option is included). For example, if you have a data set called ONE and you want to list all observations where X is greater than 100, you would write a program like the following one.

Demonstrating a Simple SQL Query

PROC SQL;
   SELECT X
   FROM ONE
   WHERE X GT 100;
QUIT;

The SELECT statement identifies which variables you want to select. This can be a single variable (as in this example), a list of variables (separated by commas, not spaces), or an asterisk (*), which means all the variables in the data set. The FROM statement identifies what data set to read. Finally, the WHERE statement selects only those observations for which the WHERE statement is true.

This is a good time to mention that the SQL statements have to be in a certain order: SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY. Thanks to Cynthia Zender (one of my reviewers), you can remember this order by the saying "San Francisco Where the Grateful Dead Originate."

PROC SQL has one significant advantage over DATA step approaches when combining data sets. With SQL, you can perform a many-to-many merge to produce a Cartesian Product, which contains an observation (row) for every combination of observations in the two data sets. A many-to-many merge in a DATA step does not produce a Cartesian Product, and the result is usually not useful. Many of these operations are demonstrated in the sections that follow.
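To make the clause order and the Cartesian Product concrete, here is a small sketch. It assumes the data set ONE with variable X from the example above; the second data set TWO and its variable Y are hypothetical and stand in for any other table.

```sas
PROC SQL;
   /* Clauses in the required order:
      SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY */
   SELECT X, COUNT(*) AS N
   FROM ONE
   WHERE X GT 100
   GROUP BY X
   HAVING COUNT(*) GT 1
   ORDER BY X;

   /* Cartesian Product: every row of ONE paired with every row of TWO */
   SELECT ONE.X, TWO.Y
   FROM ONE, TWO;
QUIT;
```

The first query lists each value of X above 100 that occurs more than once, along with its count. The second returns one row per combination of observations, so its size is the product of the two row counts, which is worth keeping in mind with large data sets.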
Checking for Invalid Character Values
Let's start with checking for invalid character values. For these examples, let's use the SAS data set PATIENTS (see the Appendix for the program and data file) and look for invalid values for GENDER, DX, and AE. In the first program, missing values are reported as invalid for GENDER and AE. Later, the program is modified so that missing values are not reported.

Using SQL to Look for Invalid Character Values

LIBNAME CLEAN "C:\CLEANING";

***Checking for invalid character data;
PROC SQL;
   TITLE "Checking for Invalid Character Data";
   SELECT PATNO,
          GENDER,
          DX,
          AE
   FROM CLEAN.PATIENTS
   WHERE GENDER NOT IN ('M','F') OR
         VERIFY(DX,'0123456789 ') NE 0 OR
         AE NOT IN ('0','1');
QUIT;

Because there is no CREATE statement, the observations meeting the WHERE clause will be printed to the Output window when the procedure is submitted. The variables listed in the SELECT statement are separated by commas in PROC SQL (not spaces as in a VAR statement in a DATA step). Notice that the SQL solution looks very much like the DATA step solution used in an earlier chapter, which uses the simple way of checking for invalid DX values.
Here is the output from running this program:

Checking for Invalid Character Data

Patient            Diagnosis  Adverse
Number    Gender   Code       Event?
002       F        X          0
003       X        3          1
004       F        5          A
006                6          1
010       f        1          0
013       2        1
002       F        X          0
023       f                   0

If you do not want missing values for GENDER and AE to be identified as invalid, include a missing value (blank) in the list of valid values, as shown in the following program.

Using SQL to List Invalid Character Data: Missing Values Not Flagged as Errors

PROC SQL;
   TITLE "Checking for Invalid Character Data";
   TITLE2 "Missing Values Not Flagged as Errors";
   SELECT PATNO,
          GENDER,
          DX,
          AE
   FROM CLEAN.PATIENTS
   WHERE GENDER NOT IN ('M','F',' ') OR
         VERIFY(DX,'0123456789 ') NE 0 OR
         AE NOT IN ('0','1',' ');
QUIT;
Checking for Outliers
A similar program can be used to check for out-of-range numeric values. The SQL statements in the following program produce a report for heart rate, systolic blood pressure, and diastolic blood pressure readings outside specified ranges. Because missing values are not in the specified ranges, they will be reported as errors by this program.

Using SQL to Check for Out-of-Range Numeric Values

PROC SQL;
   TITLE "Checking for Out-of-Range Numeric Values";
   SELECT PATNO,
          HR,
          SBP,
          DBP
   FROM CLEAN.PATIENTS
   WHERE HR  NOT BETWEEN 40 AND 100 OR
         SBP NOT BETWEEN 80 AND 200 OR
         DBP NOT BETWEEN 60 AND 120;
QUIT;
The WHERE statement can be written many ways, just as with a WHERE statement in a DATA step. The output from these statements is shown next.

Checking for Out-of-Range Numeric Values

Patient   Heart   Systolic Blood   Diastolic Blood
Number    Rate    Pressure         Pressure
004        101    200              120
008        210      .                .
009         86    240              180
010          .     40              120
011         68    300               20
014         22    130               90
017        208      .               84
123         60      .                .
321        900    400              200
020         10     20                8
023         22     34               78
027          .    166              106
029          .      .                .

If you don't want to consider missing values as errors, make this simple modification to the WHERE statement:

   WHERE HR  NOT BETWEEN 40 AND 100 AND HR  IS NOT MISSING OR
         SBP NOT BETWEEN 80 AND 200 AND SBP IS NOT MISSING OR
         DBP NOT BETWEEN 60 AND 120 AND DBP IS NOT MISSING;

The terms IS MISSING or IS NOT MISSING can be used with either character or numeric variables. Also, the term NULL can be used in place of the word MISSING.
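For illustration, here is the complete query with the modified condition, written with NULL in place of MISSING. The parentheses are an addition for readability only: AND is evaluated before OR, so the grouping is the same with or without them.

```sas
PROC SQL;
   TITLE "Checking for Out-of-Range Numeric Values";
   SELECT PATNO,
          HR,
          SBP,
          DBP
   FROM CLEAN.PATIENTS
   /* Each variable is flagged only if it is out of range AND not missing */
   WHERE (HR  NOT BETWEEN 40 AND 100 AND HR  IS NOT NULL) OR
         (SBP NOT BETWEEN 80 AND 200 AND SBP IS NOT NULL) OR
         (DBP NOT BETWEEN 60 AND 120 AND DBP IS NOT NULL);
QUIT;
```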
Checking a Range Using an Algorithm Based on the Standard Deviation
In an earlier chapter, an algorithm to detect outliers based on standard deviation was described. The following program shows an SQL approach using the same algorithm and using the variable systolic blood pressure (SBP).

Using SQL to Check for Out-of-Range Values Based on the Standard Deviation

PROC SQL;
   SELECT PATNO,
          SBP
   FROM CLEAN.PATIENTS
   HAVING SBP NOT BETWEEN MEAN(SBP) - 2 * STD(SBP) AND
                          MEAN(SBP) + 2 * STD(SBP) AND
          SBP IS NOT MISSING;
QUIT;

This program uses two summary functions, MEAN and STD. When these functions are used, a HAVING clause is needed instead of the WHERE clause used earlier. In this example, all values more than two standard deviations away from the mean that are not missing are printed to the Output window. Here is the output after running this program:

Patient   Systolic Blood
Number    Pressure
011       300
321       400
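With the HAVING form, PROC SQL computes the mean and standard deviation and then merges them back with each row of the data (the SAS Log notes this remerging). An alternative that makes the two passes explicit is to compute the cutoffs in subqueries and keep an ordinary WHERE clause. This is a sketch of an equivalent formulation, not the approach used in the text:

```sas
PROC SQL;
   SELECT PATNO,
          SBP
   FROM CLEAN.PATIENTS
   /* Each subquery returns a single value: the lower or upper cutoff */
   WHERE SBP IS NOT MISSING AND
        (SBP LT (SELECT MEAN(SBP) - 2 * STD(SBP) FROM CLEAN.PATIENTS) OR
         SBP GT (SELECT MEAN(SBP) + 2 * STD(SBP) FROM CLEAN.PATIENTS));
QUIT;
```

Both forms should list the same observations; the HAVING version is shorter, while the subquery version may feel more familiar to readers who know SQL from other database systems.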
The SQL procedure can be made more general by turning the program into a macro and making the variable name a macro variable, as shown in the following program.

Converting the Previous Program into a Macro

%MACRO RANGESTD(DSN,VARNAME);
   PROC SQL;
      SELECT PATNO,
             &VARNAME
      FROM &DSN
      HAVING &VARNAME NOT BETWEEN MEAN(&VARNAME) - 2 * STD(&VARNAME) AND
                                  MEAN(&VARNAME) + 2 * STD(&VARNAME) AND
             &VARNAME IS NOT MISSING;
   QUIT;
%MEND RANGESTD;

For example, here is the statement to call this macro to test the variable DBP:

%RANGESTD(CLEAN.PATIENTS,DBP)

Here is the corresponding output:

Patient   Diastolic Blood
Number    Pressure
009       180
321       200
020         8
Checking for Missing Values
It's particularly easy to use PROC SQL to check for missing values. The WHERE clause IS MISSING can be used for both character and numeric variables. The simple query shown in the following program checks the data set for all character and numeric missing values and prints out any observation that contains a missing value for one or more variables.

Using SQL to List Missing Values

PROC SQL;
   SELECT *
   FROM CLEAN.PATIENTS
   WHERE PATNO  IS MISSING OR
         GENDER IS MISSING OR
         VISIT  IS MISSING OR
         HR     IS MISSING OR
         SBP    IS MISSING OR
         DBP    IS MISSING OR
         DX     IS MISSING OR
         AE     IS MISSING;
QUIT;
The SELECT statement uses an asterisk (*) to indicate that all the variables in the data set listed in the FROM statement are selected. Here is the output from this program:

Patient                        Heart   Systolic Blood   Diastolic Blood   Diagnosis   Adverse
Number   Gender   Visit Date   Rate    Pressure         Pressure          Code        Event?
006               06/15/1999     72    102               68               6           1
007      M                 .     88    148              102               7           0
008      F        08/08/1998    210      .                .                           0
010      f        10/19/1999      .     40              120               1           0
011      M                 .     68    300               20               4           1
012      M        10/12/1998     60    122               74                           0
013      2        08/23/1999     74    108               64               1
014      M        02/02/1999     22    130               90                           1
003      M        11/12/1999     58    112               74                           0
015      F                 .     82    148               88               3           1
017      F        04/05/1999    208      .               84               2           0
019      M        06/07/1999     58    118               70                           0
123      M                 .     60      .                .               1           0
321      F                 .    900    400              200               5           1
020      F                 .     10     20                8                           0
023      f        12/31/1998     22     34               78                           0
027      F                 .      .    166              106               7           0
029      M        05/15/1998      .      .                .               4           1
Range Checking for Dates
Using SQL to Perform Range Checks on Dates

PROC SQL;
   TITLE "Dates Before June 1, 1998 or After October 15, 1999";
   SELECT PATNO,
          VISIT
   FROM CLEAN.PATIENTS
   WHERE VISIT NOT BETWEEN '01JUN1998'D AND '15OCT1999'D AND
         VISIT IS NOT MISSING;
QUIT;

Here is the resulting output from this program:

Dates Before June 1, 1998 or After October 15, 1999

Patient
Number    Visit Date
XX5       05/07/1998
010       10/19/1999
003       11/12/1999
028       03/28/1998
029       05/15/1998
Checking for Duplicates
In an earlier chapter, you used PROC SORT with the NODUP and NODUPKEY options to detect duplicates, as well as a DATA step approach using the FIRST. and LAST. temporary variables.

Using SQL to List Duplicate Patient Numbers

PROC SQL;
   TITLE "Duplicate Patient Numbers";
   SELECT PATNO,
          VISIT
   FROM CLEAN.PATIENTS
   GROUP BY PATNO
   HAVING COUNT(PATNO) GT 1;
QUIT;

In this program, you are telling PROC SQL to list any duplicate patient numbers. Note that multiple missing patient numbers will not appear in the listing because the COUNT function returns a frequency count only for nonmissing values. Here are the results of running the program:

Duplicate Patient Numbers

Patient
Number    Visit Date
002       11/13/1998
002       11/13/1998
003       10/21/1998
003       11/12/1999
006       06/15/1999
006       07/07/1999
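If you also want repeated missing patient numbers to show up, one option is to count rows rather than nonmissing values. The following sketch changes only the HAVING clause (and the title, which is a hypothetical label); COUNT(*) counts every row in a group, including the group formed by missing values of PATNO.

```sas
PROC SQL;
   TITLE "Duplicate Patient Numbers (Missing Values Included)";
   SELECT PATNO,
          VISIT
   FROM CLEAN.PATIENTS
   GROUP BY PATNO
   /* COUNT(*) counts rows, so a group of missing PATNO values is counted too */
   HAVING COUNT(*) GT 1;
QUIT;
```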
Identifying Subjects with "n" Observations Each
Using the grouping capability of PROC SQL and the COUNT function, you can list all patients that do not have exactly "n" visits or observations in a data set, just as you did in earlier programs. Here is the program with an explanation following.

Using SQL to List Patients Who Do Not Have Two Visits

TITLE "Listing of Patients Who Do Not Have Two Visits";
PROC SQL;
   SELECT PATNO,
          VISIT
   FROM CLEAN.PATIENTS2
   GROUP BY PATNO
   HAVING COUNT(PATNO) NE 2;
QUIT;

By first grouping the observations by patient number, you can then use the COUNT function, which returns the number of observations in a group. Here is the output from this program:

Listing of Patients Who Do Not Have Two Visits

PATNO     VISIT
002       01/01/1999
002       01/10/1999
002       02/09/1999
003       10/21/1998
006       11/11/1998
Checking for an ID in Each of Two Files
Do you think PROC SQL can check if each patient number is in two files? Why else is there a section heading with that task listed? Of course you can. Now, on to the problem.
The equivalent of a DATA step merge is called a JOIN in SQL terms. Normally, a JOIN lists only those observations that have a matching value for the variables in each of the files. If you want all observations from both files, regardless if they have a corresponding observation in the other file, you perform a FULL JOIN (this is equivalent to a MERGE where no IN= variables are used). So, if you perform a FULL JOIN between two data sets and an ID value is not in both data sets, one of the observations will have a missing value for the ID variable. Let's use the same data sets, ONE and TWO, that were used in an earlier chapter. For convenience, the code to produce these data sets is shown in the following program.

Creating Two Data Sets for Testing Purposes

DATA ONE;
   INPUT PATNO X Y;
DATALINES;
1 69 79
2 56 .
3 66 99
5 98 87
12 13 14
;

DATA TWO;
   INPUT PATNO Z;
DATALINES;
1 56
3 67
4 88
5 98
13 99
;

The next program shows the SQL program.

Using SQL to Look for ID's That Are Not in Each of Two Files

PROC SQL;
   TITLE "Patient Numbers Not in Both Files";
   SELECT ONE.PATNO AS ID_ONE,
          TWO.PATNO AS ID_TWO
   FROM ONE FULL JOIN TWO
   ON ONE.PATNO EQ TWO.PATNO
   WHERE ONE.PATNO IS MISSING OR
         TWO.PATNO IS MISSING;
QUIT;

Because the variable name PATNO is used in both data sets, you can distinguish between them by adding either ONE. or TWO. before the variable name, depending on whether you are referring to the patient number from data set ONE or data set TWO. Also, to make it easier to keep track of these two variables, an alias (created by using the AS statement) for each of these variable names was created (ID_ONE and ID_TWO). The condition for the FULL JOIN is that the ID's match between the two data sets. This is specified in the ON statement.

  ID_ONE    ID_TWO
       2         .
       .         4
      12         .
       .        13
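A variation on this query reports a single patient number per row together with a note on where it was found. This is a sketch only: it assumes the data sets ONE and TWO created above, and the column names PATNO and SOURCE are hypothetical. COALESCE returns the first nonmissing value among its arguments, so it picks up the ID from whichever data set has it.

```sas
PROC SQL;
   TITLE "Patient Numbers Not in Both Files";
   SELECT COALESCE(ONE.PATNO, TWO.PATNO) AS PATNO,
          CASE
             WHEN TWO.PATNO IS MISSING THEN "ONE only"
             ELSE "TWO only"
          END AS SOURCE
   FROM ONE FULL JOIN TWO
   ON ONE.PATNO EQ TWO.PATNO
   /* Keep only the rows where one side of the join found no match */
   WHERE ONE.PATNO IS MISSING OR
         TWO.PATNO IS MISSING;
QUIT;
```

With the sample data, patients 2 and 12 should be labeled as found only in ONE, and patients 4 and 13 as found only in TWO.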
More Complicated Multi-File Rules
Let's start the discussion of more complicated multi-file rules by redoing the example from Chapter 6. To review, there are two files: AE, which recorded adverse events for patients in the study, and LAB_TEST, which contained the laboratory tests for people with various adverse events (see the Appendix for the programs to create these data sets). The goal is to list any patient who had an adverse event of 'X' anywhere in the adverse event data set, who either did not have any entry in the laboratory data set or where the date of the lab test was before the date of the adverse event. The following SQL query will produce the same information as the DATA step solution shown in Chapter 6.
Using SQL to Demonstrate More Complicated Multi-File Rules

PROC SQL;
   TITLE1 "Patients with an AE of X Who Did Not Have a";
   TITLE2 "Labtest or Where the Date of the Test Is Prior";
   TITLE3 "to the Date of the Visit";
   SELECT AE.PATNO AS AE_PATNO LABEL="AE Patient Number",
          A_EVENT,
          DATE_AE,
          LAB_TEST.PATNO AS LABPATNO LABEL="LAB Patient Number",
          LAB_DATE
   FROM CLEAN.AE LEFT JOIN CLEAN.LAB_TEST
   ON AE.PATNO=LAB_TEST.PATNO
   WHERE A_EVENT = 'X' AND
         LAB_DATE LT DATE_AE;
QUIT;

Because the variable PATNO had the same label in both the AE and LAB_TEST data sets, a LABEL column modifier was used to relabel these variables so that they could be distinguished in the output listing. An alias (AE_PATNO and LABPATNO) as well as a label was selected for each of these variables.

To help explain the difference between a LEFT JOIN, a RIGHT JOIN, and a FULL JOIN, let's execute all three with the data sets ONE and TWO, which were described in the previous section. In the following program, the SQL statements execute all three joins.

Example of LEFT, RIGHT, and FULL Joins

PROC SQL;
   TITLE "Left Join";
   SELECT ONE.PATNO AS ONE_ID,
          TWO.PATNO AS TWO_ID
   FROM ONE LEFT JOIN TWO
   ON ONE.PATNO EQ TWO.PATNO;

   TITLE "Right Join";
   SELECT ONE.PATNO AS ONE_ID,
          TWO.PATNO AS TWO_ID
   FROM ONE RIGHT JOIN TWO
   ON ONE.PATNO EQ TWO.PATNO;

   TITLE "Full Join";
   SELECT ONE.PATNO AS ONE_ID,
          TWO.PATNO AS TWO_ID
   FROM ONE FULL JOIN TWO
   ON ONE.PATNO EQ TWO.PATNO;
QUIT;

By inspecting the next three listings, it is very easy to see the difference among these three different JOIN operations.

Left Join
  ONE_ID    TWO_ID
       1         1
       2         .
       3         3
       5         5
      12         .

Right Join
  ONE_ID    TWO_ID
       1         1
       3         3
       .         4
       5         5
       .        13

Full Join
  ONE_ID    TWO_ID
       1         1
       2         .
       3         3
       .         4
       5         5
      12         .
       .        13
Now, back to our example.

Finally, if the lab date (LAB_DATE) is prior to the adverse event or if the lab date is missing, the statement LAB_DATE LT DATE_AE will be true. Notice in the output that follows that the same three patients are listed with this query as were listed in the example in Chapter 6.

Patients with an AE of X Who Did Not Have a
Labtest or Where the Date of the Test Is Prior
to the Date of the Visit

     AE                              LAB
Patient  Adverse                 Patient     Date of
 Number  Event     Date of AE     Number    Lab Test
-------------------------------------------------
009      X         12/25/1998                      .
011      X         10/10/1998                      .
025      X         02/09/1999   025       01/01/1999

PROC SQL provides a very convenient way to conduct many of the data cleaning tasks described throughout this book. There are those of us who still feel more comfortable with DATA step and PROC approaches and others who feel that PROC SQL is the solution to all their problems.
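The multi-file query in this section depends on the fact that SAS treats a missing numeric value as smaller than any nonmissing value, which is why patients with no lab record satisfy LAB_DATE LT DATE_AE. If you prefer to state both conditions outright, the WHERE clause can be written as in the following sketch, which otherwise repeats the query unchanged:

```sas
PROC SQL;
   SELECT AE.PATNO AS AE_PATNO LABEL="AE Patient Number",
          A_EVENT,
          DATE_AE,
          LAB_TEST.PATNO AS LABPATNO LABEL="LAB Patient Number",
          LAB_DATE
   FROM CLEAN.AE LEFT JOIN CLEAN.LAB_TEST
   ON AE.PATNO=LAB_TEST.PATNO
   /* No lab record (missing date) or a lab date earlier than the AE date */
   WHERE A_EVENT = 'X' AND
        (LAB_DATE IS MISSING OR LAB_DATE LT DATE_AE);
QUIT;
```

The explicit form should select the same observations, and it keeps working as intended if the query is later ported to a system where missing values do not compare as low.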
PROC SQL provides a very convenient way to conduct many of the data cleaning tasks described throughout this book. There are those of us who still feel more comfortable with DATA step and PROC approaches and others who feel that PROC SQL is the solution to all their problems.
Chapter 9

Using Validation Data Sets

Introduction
A Simple Example of a Validation Data Set
Making the Program More Flexible and Converting It to a Macro
Validating Character Data
Converting Program 9-7 into a General Purpose Macro
Extending the Validation Macro to Include Valid Character Ranges
Combining Numeric and Character Validity Checks in a Single Macro with a Single Validation Data Set
Introducing SAS Integrity Constraints (Versions 7 and Later)
Introduction
In previous chapters, SAS programs and macros were used to check for invalid data values. The rules were "hard-coded" or entered as calling arguments to SAS macros. In this chapter, you will see how data rules can be entered into a raw data file and turned into a SAS validation data set, with the data checking operation performed by a general purpose cleaning/validation program. This approach allows you to create permanent validation data sets containing rules for frequently used variables and to apply these validation data sets against any data set that needs to be validated. Finally, an introduction to SAS Integrity Constraints is presented. Integrity constraints allow you to store rules concerning your variables in the data set itself. This feature is available starting with Version 7 of SAS software.

A Simple Example of a Validation Data Set
Let's start with a simple validation data set that only handles range checking for numeric variables. The following three rules are used for this example:

   • Valid heart rate values are between 40 and 100.
   • Valid values for systolic blood pressure are between 80 and 200.
   • Valid values for diastolic blood pressure are between 60 and 140.

For this simple example, all missing values are treated as invalid. A later program in this chapter is generalized to treat missing values as either valid or invalid for each numeric variable. For this example, the validation data set contains the variable name, the minimum valid value, and the maximum valid value. Program 9-1 creates a validation data set called VALID.

Program 9-1  Creating a Simple Validation Data Set
DATA VALID;
   INFILE "C:\CLEANING\VALID1.TXT" MISSOVER;
   INPUT VARNAME : $32.                       /* (1) */
         MIN
         MAX;
   VARNAME = UPCASE(VARNAME);
RUN;

where the data file VALID1.TXT contains the following:

HR 40 100
SBP 80 200
DBP 60 140

Notice that the program is allowing variable names up to 32 characters in length (1) (Version 7 and later) and that the UPCASE function is used to make sure all the variable names are in uppercase.

The first step in applying this validation data set against the PATIENTS data set is to restructure the patient data. For each patient ID, you want a separate observation for each numeric variable in the data set.
First 10 Observations in the Restructured Patients Data Set

Patient ID    Variable Name    Value
(PATNO)       (VARNAME)        (VALUE)

001           VISIT            14194
001           HR                  88
001           SBP                140
001           DBP                 80
002           VISIT            14196
002           HR                  84
002           SBP                120
002           DBP                 78
003           VISIT            14173
003           HR                  68
As you may have guessed at this point, you want this particular structure so that it can be merged with the VALID data set, the goal being to add the minimum and maximum cutoffs to each observation in this restructured data set. Program 9-2 will perform this restructuring task.

Note: Even though the task at hand is to validate the three numeric variables HR, SBP, and DBP, Program 9-2 was written to be more general and to include all the numeric variables in the data set.

Program 9-2  Restructuring the PATIENTS Data Set and Producing an Exceptions Report
***Restructure PATIENTS;
DATA PAT;
   SET CLEAN.PATIENTS;
   ***Make room for variable names up to 32 characters;
   LENGTH VARNAME $ 32;                /* (4) */
   ***Array to contain all numeric variables;
   ARRAY NUMS[*] _NUMERIC_;            /* (1) */
   DO I = 1 TO DIM(NUMS);              /* (2) */
      CALL VNAME(NUMS[I],VARNAME);     /* (3) */
      VARNAME = UPCASE(VARNAME);
      VALUE = NUMS[I];                 /* (5) */
      OUTPUT;                          /* (6) */
   END;
   KEEP PATNO VARNAME VALUE;
RUN;
The keyword _NUMERIC_ used in the ARRAY statement (1) creates the NUMS array with all the numeric variables in the data set CLEAN.PATIENTS as its elements. The DIM function (2) returns the number of elements in the array. The key to this program is the CALL VNAME statement (3). This very useful call routine takes as its first argument an element of an array and places the associated variable name in the second argument (VARNAME in this case). It is important to set the length of the character variable that is to hold the variable name earlier in the DATA step (4). A length of 32 was chosen to be compatible with Versions 7 and later of SAS software. For the same reason that the UPCASE function was used with the VALID data set, it is also used here to ensure that both data sets contain the variable names in the same case so they can be merged. Finally, VALUE (5) is assigned the numeric value of the variable. Because the OUTPUT statement (6) is within the DO loop, there will be as many observations per patient as there are numeric variables in the original data set. The next step is to sort both this data set (PAT) and the VALID data set by VARNAME.

Program 9-2  Restructuring the PATIENTS Data Set and Producing an Exceptions Report (Continued)

PROC SORT DATA=PAT;
   BY VARNAME;
RUN;

PROC SORT DATA=VALID;
   BY VARNAME;
RUN;
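The CALL VNAME restructuring step can be tried in isolation on a tiny data set before running it on real data. The sketch below uses a made-up one-row data set called TINY; the names TINY and NAMES are hypothetical and are not part of the PATIENTS example.

```sas
/* Illustrative only: TINY and its values are made up. */
DATA TINY;
   INPUT PATNO $ HR SBP;
DATALINES;
001 88 140
;
DATA NAMES;
   SET TINY;
   LENGTH VARNAME $ 32;           /* declare before CALL VNAME fills it */
   ARRAY NUMS[*] _NUMERIC_;       /* numeric variables defined so far: HR SBP */
   DO I = 1 TO DIM(NUMS);
      CALL VNAME(NUMS[I], VARNAME);
      VALUE = NUMS[I];
      OUTPUT;                     /* one observation per numeric variable */
   END;
   KEEP PATNO VARNAME VALUE;
RUN;
PROC PRINT DATA=NAMES NOOBS;
RUN;
```

Because I and VALUE are first referenced after the ARRAY statement, they are not part of the _NUMERIC_ list, so NAMES should contain two observations for patient 001: one for HR and one for SBP.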
Let's merge the two files.

Program 9-2  Restructuring the PATIENTS Data Set and Producing an Exceptions Report (Continued)

DATA VERIFY;
   MERGE PAT(IN=IN_PAT) VALID(IN=IN_VALID);
   BY VARNAME;
   IF IN_PAT AND IN_VALID AND (VALUE LT MIN OR VALUE GT MAX);   /* (7) */
RUN;
Because you only want information on those variables that are in both data sets, each of the data sets in the MERGE statement is followed by the IN= data set option. The subsetting IF statement (7) accomplishes this. In addition, this subsetting IF statement is used to check if the value of your variable is outside the acceptable range. In this example, missing values, which are logically less than the minimum, are included in the list of invalid values. The program is generalized in the next section. The last step is to sort the merged data set by patient number and variable name so that an exception report in patient number order can be produced.

Program 9-2  Restructuring the PATIENTS Data Set and Producing an Exceptions Report (Continued)

PROC SORT DATA=VERIFY;
   BY PATNO VARNAME;
RUN;

PROC PRINT DATA=VERIFY;
   TITLE "Exceptions Report";
   ID PATNO;
   VAR VARNAME VALUE;
RUN;
The output from the above PROC PRINT is shown next.

Exceptions Report

PATNO    VARNAME    VALUE

 004       HR         101
 008       DBP          .
 008       HR         210
 008       SBP          .
 009       DBP        180
 009       SBP        240
 010       HR           .
 010       SBP         40
 011       DBP         20
 011       SBP        300
 014       HR          22
 017       HR         208
 017       SBP          .
 020       DBP          8
 020       HR          10
 020       SBP         20
 023       HR          22
 023       SBP         34
 027       HR           .
 029       DBP          .
 029       HR           .
 029       SBP          .
 123       DBP          .
 123       SBP          .
 321       DBP        200
 321       HR         900
 321       SBP        400
If you do not want to list missing values as exceptions, you can substitute the following subsetting IF statement in Program 9-2:

IF IN_PAT AND IN_VALID AND
   ((VALUE LT MIN AND VALUE NE . ) OR VALUE GT MAX);
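The difference between the two range checks is easy to see on three made-up values. This is a sketch only; the data set CHECK and the flag names FLAG_ALL and FLAG_NONMISS are hypothetical.

```sas
/* Compare the original range check with the version that */
/* skips missing values. All values are made up.          */
DATA CHECK;
   INPUT VALUE MIN MAX;
   FLAG_ALL     = (VALUE LT MIN OR VALUE GT MAX);
   FLAG_NONMISS = ((VALUE LT MIN AND VALUE NE .) OR VALUE GT MAX);
DATALINES;
101 40 100
.   40 100
88  40 100
;
PROC PRINT DATA=CHECK NOOBS;
RUN;
```

The value 101 is flagged by both checks, 88 by neither, and the missing value only by FLAG_ALL. Note that VALUE NE . excludes only the ordinary missing value; if your data use special missing values such as .A, testing with NOT MISSING(VALUE) instead would exclude those as well.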
Making the Program More Flexible and Converting It to a Macro
Now that you have the general idea under your belt, let's make the program more flexible by treating missing values as valid or invalid for each of the numeric variables. At the same time, let's also turn the program into a macro.

In the file containing the acceptable ranges for each of the variables, you are going to add a flag to indicate if missing values are okay or if they should be treated as data errors. The default will be to treat missing values as errors. To override this default, a new variable (MISS_OK) will be added to the VALID data set. Values of 'Y' indicate that missing values are okay; anything else, including nothing, indicates that you want the default behavior of the program. To demonstrate this new macro, let's create a validation data set with the following new set of rules:

   • Valid heart rate values are between 40 and 100. Missing values are not valid.
   • Valid values for systolic blood pressure are between 80 and 200. Missing values are valid.
   • Valid values for diastolic blood pressure are between 60 and 140. Missing values are not valid.

Following these rules, the new program (Program 9-3) creates the VALID data set.

Program 9-3  Creating a New Validation Data Set That Contains Missing Value Instructions
PROC FORMAT;
   INVALUE $MISS(UPCASE DEFAULT=1)
      'Y'   = 'Y'
      OTHER = 'N';
RUN;

***Create a validation data set from the raw data;
DATA VALID;
   INFILE "C:\CLEANING\VALID2.TXT" MISSOVER;      /* (1) */
   LENGTH VARNAME $ 32 MISS_OK $ 1;
   INPUT VARNAME $ MIN MAX MISS_OK : $MISS.;
   ***Make sure all variable names are in uppercase so they will
      match the variable names in the data set to be checked;
   VARNAME = UPCASE(VARNAME);
RUN;
where the data file VALID2.TXT looks like this:

hr 40 100
sbp 80 200 y
dbp 60 140 n
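To see how these three lines are read without creating an external file, the same INPUT statement can be pointed at in-stream data. This is a sketch under that assumption; the data set name VALID_DEMO is hypothetical, and DATALINES stands in for VALID2.TXT.

```sas
PROC FORMAT;
   INVALUE $MISS (UPCASE DEFAULT=1)
      'Y'   = 'Y'
      OTHER = 'N';
RUN;

/* DATALINES replaces the external file for this demonstration. */
DATA VALID_DEMO;
   INFILE DATALINES MISSOVER;     /* short lines leave MISS_OK missing */
   LENGTH VARNAME $ 32 MISS_OK $ 1;
   INPUT VARNAME $ MIN MAX MISS_OK : $MISS.;
   VARNAME = UPCASE(VARNAME);
DATALINES;
hr 40 100
sbp 80 200 y
dbp 60 140 n
;
PROC PRINT DATA=VALID_DEMO NOOBS;
RUN;
```

In the printed data set, MISS_OK should be 'Y' only for SBP. Whatever value the other two rows carry, it is not 'Y', so missing values for HR and DBP keep the default treatment.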
Notice the MISSOVER option in the INFILE statement (1). This is critical with list-directed (spaces between the values) data. With this option, if the INPUT statement runs out of data before it runs out of variables, it assigns a missing value for any of the remaining variables. Also notice the user-defined informat $MISS. This informat converts all character data to uppercase (by the UPCASE option), and it converts any character other than an uppercase or lowercase 'Y' to the value of 'N'. Thus, both heart rate (HR) and diastolic blood pressure (DBP) will follow the default behavior of treating missing values as invalid, while missing values for systolic blood pressure (SBP) will not be listed as errors.

The macro (Program 9-4) is similar to Program 9-2. It is called with three arguments: the name of the ID variable, the name of the data set to be validated, and the name of the validation data file. The code to create the validation data set from the raw validation data file is also included as part of the macro.

Note: Some variable names in this program and other programs in this chapter are longer than eight characters in length. They need to be shortened if you are using pre-Version 7 SAS software.

Program 9-4  Validating a Data Set with a Macro That Contains Missing Value Instructions
*------------------------------------------------------------------*
| Program Name: VALID_NUM.SAS in C:\CLEANING                       |
| Purpose: Macro that takes an ID variable, a SAS data set to be   |
|          validated, and a validation data file, and prints an    |
|          exception report to the output device.                  |
|          This macro is for numeric variable range checking only. |
| Arguments: ID         - ID variable                              |
|            DATASET    - SAS data set to be validated             |
|            VALID_FILE - Validation data file                     |
|                         Each line of this file contains the name |
|                         of a numeric variable, the minimum value,|
|                         the maximum value, and a missing value   |
|                         indicator ('Y' missing values OK, 'N'    |
|                         missing values not OK), all separated by |
|                         at least one space.                      |
| Example: %VALID_NUM(PATNO,CLEAN.PATIENTS,C:\CLEANING\VALID2.TXT) |
*------------------------------------------------------------------*;
%MACRO VALID_NUM(ID,          /* ID variable                */
                 DATASET,     /* Data set to be validated   */
                 VALID_FILE   /* Validation data file       */
                );
   ***Note: For pre-Version 7, substitute a shorter name for
      several variables;

   ***Create the validation data set;
   PROC FORMAT;
      INVALUE $MISS(UPCASE DEFAULT=1)
         'Y'   = 'Y'
         OTHER = 'N';
   RUN;

   DATA VALID;
      INFILE "&VALID_FILE" MISSOVER;
      LENGTH VARNAME $ 32 MISS_OK $ 1;
      INPUT VARNAME $ MIN MAX MISS_OK : $MISS.;
      VARNAME = UPCASE(VARNAME);
   RUN;

   ***Restructure &DATASET;
   DATA PAT;
      SET &DATASET;
      LENGTH VARNAME $ 32;
      ARRAY NUMS[*] _NUMERIC_;
      N_NUMS = DIM(NUMS);
      DO I = 1 TO N_NUMS;
         CALL VNAME(NUMS[I],VARNAME);
         VARNAME = UPCASE(VARNAME);
         VALUE = NUMS[I];
         OUTPUT;
      END;
      ***Use the ID argument rather than a hard-coded PATNO;
      KEEP &ID VARNAME VALUE;
   RUN;

   PROC SORT DATA=PAT;
      BY VARNAME &ID;
   RUN;

   PROC SORT DATA=VALID;
      BY VARNAME;
   RUN;

   ***Merge the validation data set and the restructured SAS data set;
   DATA VERIFY;
      MERGE PAT(IN=IN_PAT) VALID(IN=IN_VALID);
      BY VARNAME;
      IF (IN_PAT AND IN_VALID) AND
         (VALUE LT MIN OR VALUE GT MAX) AND
         NOT(VALUE = . AND MISS_OK EQ 'Y');
   RUN;

   PROC SORT DATA=VERIFY;
      BY &ID VARNAME;
   RUN;

   ***Reporting section;
   OPTIONS NODATE NONUMBER;
   TITLE;
   DATA _NULL_;
      FILE PRINT HEADER = REPORT_HEAD;
      SET VERIFY;
      BY &ID;
      IF VALUE = . THEN PUT
         @1 &ID @18 VARNAME @39 "Missing";
      ELSE IF VALUE GT . AND VALUE LT MIN THEN PUT
         @1 &ID @18 VARNAME @29 VALUE @39
         "Below Minimum (" MIN +(-1) ")";
      ELSE IF VALUE GT MAX THEN PUT
         @1 &ID @18 VARNAME @29 VALUE @39
         "Above Maximum (" MAX +(-1) ")";
      IF LAST.&ID THEN PUT ;
      RETURN;
      REPORT_HEAD:
         PUT @1 "Exceptions Report for Data Set &DATASET" /
                "Using Validation Data File &VALID_FILE" //
             @1 "Patient ID"
             @18 "Variable"
             @29 "Value"
             @39 "Reason" /
             @1 60*"-";
   RUN;

   ***Cleanup temporary data sets;
   PROC DATASETS LIBRARY=WORK NOLIST;
      DELETE PAT;
      DELETE VERIFY;
   RUN;
   QUIT;
%MEND VALID_NUM;
Notice that the simple PROC PRINT was replaced with a nicer looking exception report by using PUT statements. Executing the macro on the PATIENTS data set by using this statement

%VALID_NUM(PATNO,CLEAN.PATIENTS,C:\CLEANING\VALID2.TXT)

produces the output shown next.
Exceptions Report for Data Set CLEAN.PATIENTS
Using Validation Data File C:\CLEANING\VALID2.TXT

Patient ID       Variable   Value     Reason
------------------------------------------------------------
004              HR         101       Above Maximum (100)

008              DBP                  Missing
008              HR         210       Above Maximum (100)

009              DBP        180       Above Maximum (140)
009              SBP        240       Above Maximum (200)

010              HR                   Missing
010              SBP        40        Below Minimum (80)

011              DBP        20        Below Minimum (60)
011              SBP        300       Above Maximum (200)

014              HR         22        Below Minimum (40)

017              HR         208       Above Maximum (100)

020              DBP        8         Below Minimum (60)
020              HR         10        Below Minimum (40)
020              SBP        20        Below Minimum (80)

023              HR         22        Below Minimum (40)
023              SBP        34        Below Minimum (80)

027              HR                   Missing

029              DBP                  Missing
029              HR                   Missing

123              DBP                  Missing

321              DBP        200       Above Maximum (140)
321              HR         900       Above Maximum (100)
321              SBP        400       Above Maximum (200)
Notice that the missing values for systolic blood pressure (SBP) are not listed as errors because of the value of 'Y' for the missing value indicator (see data file VALID2.TXT).

Validating Character Data

Program 9-5  Creating a Test Data Set for Character Validation
DATA TEST_CHAR;
   ***If there is a short record, set all variables to missing using
      the MISSOVER option in the INFILE statement. DATALINES is a
      special file reference that allows INFILE statement options to
      be used with data following a DATALINES statement, rather than
      an external file;
   INFILE DATALINES MISSOVER;
   LENGTH PATNO $ 3 CODE $ 2 GENDER AE $ 1;
   INPUT PATNO CODE GENDER AE;
DATALINES;
001 A M 0
002 AB F 1
003 BA F .
004 . . .
005 X Y Z
006 AC M 0
;
Data set TEST_CHAR contains the variables PATNO, CODE, GENDER, and AE (adverse event). The listing of this data set is shown next.

Listing of TEST_CHAR

PATNO    CODE    GENDER    AE

 001      A        M       0
 002      AB       F       1
 003      BA       F
 004
 005      X        Y       Z
 006      AC       M       0
Let's include a check for missing values in the program. For this example, assume that valid values for CODE are 'A', 'B', 'AB', and 'C'; valid codes for GENDER are 'F' and 'M'; valid codes for AE are '0' and '1'. Missing values are not valid for GENDER but are valid for CODE and AE.

A three-line data file (called a validation data file) will be used to store these rules. Each line of this file will hold the rules for one variable. Each line begins with a variable name. Following one or more spaces, you enter one or more valid values for this variable. End this list with at least two spaces, followed by a 'Y' (missing values are okay) or an 'N' (missing values are not okay). Actually, you can leave off the 'N' if you want to. Any value other than a 'Y', including a missing value, will mean that missing values are not okay. A file written to conform to the rules in the preceding paragraph is shown next.

Data file "C:\CLEANING\VALID_C.TXT"

CODE A B AB C  Y
GENDER M F  N
AE 0 1  Y

Notice there are at least two spaces between the list of valid character values and the missing value indicator. Program 9-6 creates a validation data set from this validation data file.

Program 9-6  Creating a Validation Data Set (C_VALID) for Character Variables
PROC FORMAT;
   INVALUE $MISS(UPCASE DEFAULT=1)
      'Y'   = 'Y'
      OTHER = 'N';
RUN;

DATA C_VALID;
   LENGTH VARNAME $ 32 VALUES_LIST $ 200 MISS_OK $ 1;
   INFILE "C:\CLEANING\VALID_C.TXT" MISSOVER;
   INPUT VARNAME VALUES_LIST & $200. MISS_OK : $MISS.;   /* (1) */
RUN;
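The INPUT statement in this program relies on the ampersand (&) modifier to read a value that contains single blanks. A minimal sketch of that behavior with in-stream data follows; the data set name AMP_DEMO is hypothetical, and the $MISS informat is left out so that only the effect of the modifier is shown.

```sas
/* The & modifier keeps reading VALUES_LIST until it meets two */
/* consecutive blanks, so single blanks stay inside the value. */
DATA AMP_DEMO;
   INFILE DATALINES MISSOVER;
   LENGTH VARNAME $ 32 VALUES_LIST $ 200 MISS_OK $ 1;
   INPUT VARNAME VALUES_LIST & $200. MISS_OK;
DATALINES;
CODE A B AB C  Y
GENDER M F  N
;
PROC PRINT DATA=AMP_DEMO NOOBS;
RUN;
```

For the first line, VALUES_LIST should hold the whole string A B AB C and MISS_OK the single character Y. If only one blank separated the list from the indicator, the Y would be read as part of VALUES_LIST.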
Let's use the same user-defined informat in this program as was used for the numeric validation data set. The $MISS informat turns lowercase and uppercase Y's into uppercase Y's and all other nonmissing values into uppercase N's. The variable called VALUES_LIST is a list of valid values for each of the variables. Notice the use of the ampersand (&) informat modifier in the INPUT statement (1).

The & modifier changes the default single space delimiter in a list-directed INPUT statement to two or more spaces. It is very important, therefore, to have at least two spaces between the list of valid character values and the missing value indicator (MISS_OK).

The next steps are similar to those in the numeric example. First, restructure the data set to be validated (TEST_CHAR) as before. This time the reserved name _CHARACTER_ is used instead of _NUMERIC_ so that the CHARS array will contain all the character variables in the data set. Next, sort both data sets prior to the merge step, as shown in Program 9-7.

Program 9-7  Writing the Program to Validate Character Variables
***Restructure TEST_CHAR;
DATA PAT;
   SET TEST_CHAR;
   ARRAY CHARS[*] _CHARACTER_;
   LENGTH VARNAME $ 32;
   N_CHARS = DIM(CHARS);
   DO I = 1 TO N_CHARS;
      CALL VNAME(CHARS[I],VARNAME);
      VARNAME = UPCASE(VARNAME);
      VALUE = CHARS[I];
      OUTPUT;
   END;
   KEEP PATNO VARNAME VALUE;
RUN;

PROC SORT DATA=PAT;
   BY VARNAME;
RUN;

PROC SORT DATA=C_VALID;
   BY VARNAME;
RUN;
In this next section of the program, you need to extract each of the valid character values from the string VALUES_LIST.

Program 9-7  Writing the Program to Validate Character Variables (continued)
DATA VERIFY;
   MERGE PAT(IN=IN_PAT) C_VALID(IN=IN_C_VALID);
   BY VARNAME;
   IF (IN_PAT AND IN_C_VALID);                    /* (1) */
   LENGTH TOKEN $ 8;
   ***Obviously bad values;
   IF VERIFY (VALUE,VALUES_LIST) NE 0 OR          /* (2) */
      VALUE = ' ' AND MISS_OK NE 'Y' THEN DO;
      OUTPUT;
      RETURN;
   END;
   FLAG = 0;
   DO I = 1 TO 99;
      TOKEN = SCAN(VALUES_LIST,I," ");            /* (3) */
      IF VALUE = TOKEN THEN FLAG + 1;
      IF TOKEN = ' ' OR FLAG > 0 THEN LEAVE;
   END;
   IF FLAG = 0 THEN OUTPUT;
   DROP I TOKEN;
RUN;

PROC SORT DATA=VERIFY;
   BY PATNO VARNAME;
RUN;

***Reporting section;
OPTIONS NODATE NONUMBER;
TITLE;
DATA _NULL_;
   FILE PRINT HEADER = REPORT_HEAD;
   SET VERIFY;
   BY PATNO;
   IF VALUE = ' ' THEN PUT
      @1 PATNO @18 VARNAME @39 "Missing";
   ELSE PUT
      @1 PATNO @18 VARNAME @29 VALUE @39 "Not Valid";
   IF LAST.PATNO THEN PUT ;
   RETURN;
   REPORT_HEAD:
      PUT @1 "Exceptions Report for Data Set TEST_CHAR" /
             "Using Validation Data Set VALID_C" //
          @1 "Patient ID"
          @18 "Variable"
          @29 "Value"
          @39 "Reason" /
          @1 60*"-";
RUN;
First, merge the two data sets by VARNAME (the name of the variable to test). Also, subset the data set with the same IF statement used before (1). For the sake of efficiency, it's best to do a quick test of the entire list of valid values. If the variable to be tested contains any characters that are not anywhere in the valid list, the VERIFY function will return a nonzero value (2). In the same IF statement, you can check if you have a missing value and the MISS_OK indicator is not a 'Y' (an error). Note that a character value can pass the VERIFY test yet still be invalid. For example, the value 'BA' for patient number 003 returns a 0 from the VERIFY function (it does not contain any invalid characters), but the value of 'BA' is not in the list of valid codes. In the next section of the program, each character value is compared against each of the valid values in the VALUES_LIST.

The SCAN function extracts "words" from a character string. The first argument to the SCAN function (3) is the character string to be parsed (taken apart), and the second argument is a number, which indicates which "word" you want. The last argument is your choice of a delimiter (a blank in this example). Before you enter the DO loop, set a flag equal to 0. If the value of the character variable being tested matches a value in the values list, increment the flag.
As soon as you find a match in the VALUES_LIST, or the SCAN function returns a null string (there are no more "words" in the VALUES_LIST), leave the loop. If you finish looping through the valid character values without incrementing the flag, you know that you have a character value that is not in the valid list, and you need to output an observation to the VERIFY data set.

It's time now to sort (by patient number and variable name) and produce a report. This code is mostly the same as the previous numeric example.

PROC SORT DATA=VERIFY;
   BY PATNO VARNAME;
RUN;

***Reporting section;
OPTIONS NODATE NONUMBER;
TITLE;
DATA _NULL_;
   FILE PRINT HEADER = REPORT_HEAD;
   SET VERIFY;
   BY PATNO;
   IF VALUE = ' ' THEN PUT
      @1 PATNO @18 VARNAME @39 "Missing";
   ELSE PUT
      @1 PATNO @18 VARNAME @29 VALUE @39 "Not Valid";
   IF LAST.PATNO THEN PUT ;
   RETURN;
   REPORT_HEAD:
      PUT @1 "Exceptions Report for Data Set TEST_CHAR" /
             "Using Validation Data Set VALID_C" //
          @1 "Patient ID"
          @18 "Variable"
          @29 "Value"
          @39 "Reason" /
          @1 60*"-";
RUN;
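The two-stage test (a quick VERIFY screen followed by a word-by-word SCAN comparison) can be sketched on a few literal values. The data set VS_DEMO and the variables BAD_CHAR and IN_LIST are hypothetical names introduced only for this illustration.

```sas
/* Stage 1: VERIFY finds characters that appear nowhere in the list. */
/* Stage 2: SCAN compares the value with each word in the list.      */
DATA VS_DEMO;
   LENGTH VALUES_LIST $ 200 VALUE $ 2 TOKEN $ 8;
   VALUES_LIST = "A B AB C";
   DO VALUE = "AB", "BA", "X";
      BAD_CHAR = (VERIFY(VALUE, VALUES_LIST) NE 0);
      FLAG = 0;
      DO I = 1 TO 99;
         TOKEN = SCAN(VALUES_LIST, I, " ");
         IF VALUE = TOKEN THEN FLAG + 1;
         IF TOKEN = ' ' OR FLAG > 0 THEN LEAVE;
      END;
      IN_LIST = (FLAG > 0);
      OUTPUT;
   END;
   KEEP VALUE BAD_CHAR IN_LIST;
RUN;
PROC PRINT DATA=VS_DEMO NOOBS;
RUN;
```

Here 'AB' passes both stages, 'X' is caught by VERIFY alone, and 'BA' passes VERIFY (BAD_CHAR = 0) but is not found by the SCAN loop (IN_LIST = 0), which is why both stages are needed.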
The exception report created by this program looks like this:

Exceptions Report for Data Set TEST_CHAR
Using Validation Data Set VALID_C

Patient ID       Variable   Value     Reason
------------------------------------------------------------
003              CODE       BA        Not Valid

004              GENDER               Missing

005              AE         Z         Not Valid
005              CODE       X         Not Valid
005              GENDER     Y         Not Valid

006              CODE       AC        Not Valid
As before, let's convert this program into a general purpose macro. This macro includes the program to create the validation data set from a raw data file.

Converting Program 9-7 into a General Purpose Macro

The next step is to convert the program to validate character values into a macro. This is a straightforward task. The same calling arguments as the numeric macro in Program 9-4 are used in Program 9-8.

Program 9-8  Writing a Macro to Check for Invalid Character Values
*------------------------------------------------------------------*
| Program Name: VALID_CHAR in C:\CLEANING                          |
| Purpose: This macro takes an ID variable, a SAS data set to      |
|          be validated, and a validation data file and checks     |
|          for invalid character values and prints an exception    |
|          report.                                                 |
| Arguments: ID         - ID variable name                         |
|            DATASET    - SAS data set to be validated             |
|            VALID_FILE - Validation data file                     |
|                         Each line of this file contains the name |
|                         of a character variable, a list of valid |
|                         values separated by spaces, and a 'Y'    |
|                         (missing values okay) or an 'N' (missing |
|                         values not okay) separated from the list |
|                         of valid values by 2 or more spaces.     |
| Example: %VALID_CHAR(PATNO,TEST_CHAR,C:\CLEANING\VALID_C.TXT)    |
*------------------------------------------------------------------*;
%MACRO VALID_CHAR(ID,          /* ID variable                */
                  DATASET,     /* Data set to be validated   */
                  VALID_FILE   /* Validation data file       */
                 );
   ***Note: For pre-Version 7, substitute a shorter name for
      VALID_FILE;

   ***Create validation data set;
   PROC FORMAT;
      INVALUE $MISS(UPCASE DEFAULT=1)
         'Y'   = 'Y'
         OTHER = 'N';
   RUN;

   DATA VALID;
      LENGTH VARNAME $ 32 VALUES_LIST $ 200 MISS_OK $ 1;
      INFILE "&VALID_FILE" MISSOVER;
      INPUT VARNAME VALUES_LIST & $200. MISS_OK : $MISS.;
   RUN;

   ***Restructure &DATASET;
   DATA PAT;
      SET &DATASET;
      ARRAY CHARS[*] _CHARACTER_;
      LENGTH VARNAME $ 32;
      N_CHARS = DIM(CHARS);
      DO I = 1 TO N_CHARS;
         CALL VNAME(CHARS[I],VARNAME);
         VARNAME = UPCASE(VARNAME);
         VALUE = CHARS[I];
         OUTPUT;
      END;
      KEEP &ID VARNAME VALUE;
   RUN;

   PROC SORT DATA=PAT;
      BY VARNAME;
   RUN;

   PROC SORT DATA=VALID;
      BY VARNAME;
   RUN;

   DATA VERIFY;
      MERGE PAT(IN=IN_PAT) VALID(IN=IN_VALID);
      BY VARNAME;
      IF (IN_PAT AND IN_VALID);
      LENGTH TOKEN $ 8;
      ***Obviously bad values;
      IF VERIFY (VALUE,VALUES_LIST) NE 0 OR
         VALUE = ' ' AND MISS_OK NE 'Y' THEN DO;
         OUTPUT;
         RETURN;
      END;
      FLAG = 0;
      DO I = 1 TO 99;
         TOKEN = SCAN(VALUES_LIST,I," ");
         IF VALUE = TOKEN THEN FLAG + 1;
         IF TOKEN = ' ' OR FLAG > 0 THEN LEAVE;
      END;
      IF FLAG = 0 THEN OUTPUT;
      DROP I TOKEN;
   RUN;

   PROC SORT DATA=VERIFY;
      BY &ID VARNAME;
   RUN;

   ***Reporting section;
   OPTIONS NODATE NONUMBER;
   TITLE;
   DATA _NULL_;
      FILE PRINT HEADER = REPORT_HEAD;
      SET VERIFY;
      BY &ID;
      IF VALUE = ' ' THEN PUT
         @1 &ID @18 VARNAME @39 "Missing";
      ELSE PUT
         @1 &ID @18 VARNAME @29 VALUE @39 "Not Valid";
      IF LAST.&ID THEN PUT ;
      RETURN;
      REPORT_HEAD:
         PUT @1 "Exceptions Report for Data Set &DATASET" /
                "Using Validation Data File &VALID_FILE" //
             @1 "Patient ID"
             @18 "Variable"
             @29 "Value"
             @39 "Reason" /
             @1 60*"-";
   RUN;

   PROC DATASETS LIBRARY=WORK;
      DELETE PAT VALID;
   RUN;
   QUIT;
%MEND VALID_CHAR;
To test this macro, execute the following macro call:

%VALID_CHAR(PATNO,TEST_CHAR,C:\CLEANING\VALID_C.TXT)

The resulting listing is identical to the listing from the nonmacro version of this program.

It is possible to extend this program or macro to handle more complicated validation rules. For example, you may want to check for valid character values in a range (for example, AAA–ZZZ). This is demonstrated in the next section.

Extending the Validation Macro to Include Valid Character Ranges
Besides listing discrete character values for each of the character variables to be tested, you might want to indicate a range of possible values, such as 'A' to 'E'. The macro that follows does just that. It allows the user to include discrete character values as well as ranges in the validation data set. For example, to check for the values 'A', 'B', 'AB', 'C', 'DDD' to 'FFF', and 'X' to 'Z' for a variable called CODE, you would enter the line:

CODE A B AB C DDD-FFF X-Z  Y

The program needs to search the string (VALUES_LIST) for dashes and treat the strings directly before and directly after each dash as the beginning and ending values for a range. Luckily for us, SAS has a good selection of character functions. This macro is described (but not in too much detail) after Program 9-9.

Program 9-9  Writing a Macro to Check for Discrete Character Values and Character Ranges
*------------------------------------------------------------------*
| Program Name: RANGE.SAS in C:\CLEANING                           |
| Purpose: This macro takes an ID variable, a SAS data set to be   |
|          validated, and a validation data file and checks for    |
|          discrete character values or character ranges for       |
|          valid data, and prints an exception report.             |
| Arguments: ID         - ID variable name                         |
|            DATASET    - SAS data set to be validated             |
|            VALID_FILE - Validation data file containing variable |
|                         names, discrete valid character values   |
|                         and/or ranges, and a missing value flag. |
|                         'Y' means missing values are OK.         |
| Example: %RANGE(PATNO,TEST_CHAR,C:\CLEANING\VALID_RANGE.TXT)     |
*------------------------------------------------------------------*;
%MACRO RANGE(ID,          /* ID variable                */
             DATASET,     /* Data set to be validated   */
             VALID_FILE   /* Validation data file       */
            );
   PROC FORMAT;
      INVALUE $MISS(UPCASE DEFAULT=1)
         'Y'   = 'Y'
         OTHER = 'N';
   RUN;

   DATA C_VALID;
      LENGTH VARNAME $ 32 VALUES_LIST $ 200 MISS_OK $ 1 WORD $ 17;
      INFILE "&VALID_FILE" MISSOVER;
      INPUT VARNAME VALUES_LIST & $200. MISS_OK : $MISS.;

      ***Separate VALUES_LIST into individual values and ranges;
      ***Array to store up to 10 ranges. The first dimension of the
         array tells which range it is, the second dimension takes on
         the value 1 for the lower range and 2 for the upper range.
         You may want to increase the length for each of the ranges
         to a larger number. ;
      ARRAY RANGES[10,2] $ 8 R1-R20;                     /* (1) */

      ***Compute the number of ranges in the string;
      N_OF_RANGES = LENGTH(VALUES_LIST) -
                    LENGTH(COMPRESS(VALUES_LIST,"-"));   /* (2) */

      ***Break list into "words";
      N_RANGE = 0;
      DO I = 1 TO 200 UNTIL (WORD = " ");                /* (3) */
         WORD = SCAN(VALUES_LIST,I," ");
         IF INDEX(WORD,'-') NE 0 THEN DO;                /* (4) */
            ***Range found, scan again to get lower and upper values;
            N_RANGE + 1;
            RANGES[N_RANGE,1] = SCAN(WORD,1,"-");
            RANGES[N_RANGE,2] = SCAN(WORD,2,"-");
         END;
      END;

      ***When all finished finding ranges, substitute spaces for dashes;
      VALUES_LIST = TRANSLATE(VALUES_LIST," ","-");
      KEEP VALUES_LIST R1-R20 VARNAME N_OF_RANGES MISS_OK;
   RUN;

   ***Restructure &DATASET (macro arguments replace the hard-coded
      TEST_CHAR and PATNO);
   DATA PAT;
      SET &DATASET;
      ARRAY CHARS[*] _CHARACTER_;
      LENGTH VARNAME $ 32;
      N_CHARS = DIM(CHARS);
      DO I = 1 TO N_CHARS;
         CALL VNAME(CHARS[I],VARNAME);
         VARNAME = UPCASE(VARNAME);
         VALUE = CHARS[I];
         OUTPUT;
      END;
      KEEP &ID VARNAME VALUE;
   RUN;

   PROC SORT DATA=PAT;
      BY VARNAME;
   RUN;

   PROC SORT DATA=C_VALID;
      BY VARNAME;
   RUN;

   DATA VERIFY;
      ARRAY RANGES[10,2] $ 8 R1-R20;
      MERGE PAT(IN=IN_PAT) C_VALID(IN=IN_C_VALID);
      BY VARNAME;
      IF (IN_PAT AND IN_C_VALID);
      LENGTH TOKEN $ 8;
      ***Obviously bad values;
      IF (VERIFY (VALUE,VALUES_LIST) NE 0 AND N_OF_RANGES = 0) OR
         VALUE = ' ' AND MISS_OK NE 'Y' THEN DO;
         OUTPUT;
         RETURN;
      END;

      ***Checking for discrete values;
      FLAG = 0;   /* Flag incremented if a discrete match found */
      DO I = 1 TO 99;
         TOKEN = SCAN(VALUES_LIST,I);
         IF VALUE = TOKEN THEN FLAG + 1;
         IF TOKEN = ' ' OR FLAG > 0 THEN LEAVE;
      END;

      ***Checking for ranges;                            /* (5) */
      ***R_FLAG incremented if in one of the ranges;
      R_FLAG = 0;
      ***Lower and upper range values already checked above;
      IF N_OF_RANGES > 0 THEN DO I = 1 TO N_OF_RANGES;
         IF VALUE > RANGES[I,1] AND VALUE < RANGES[I,2] THEN DO;
            R_FLAG + 1;
            LEAVE;
         END;
      END;
      IF FLAG = 0 AND R_FLAG = 0 THEN OUTPUT;
      DROP I TOKEN;
   RUN;

   PROC SORT DATA=VERIFY;
      BY &ID VARNAME;
   RUN;

   ***Reporting section;
   OPTIONS NODATE NONUMBER;
   TITLE;
   DATA _NULL_;
      FILE PRINT HEADER = REPORT_HEAD;
      SET VERIFY;
      BY &ID;
      IF VALUE = ' ' THEN PUT
         @1 &ID @18 VARNAME @39 "Missing";
      ELSE PUT
         @1 &ID @18 VARNAME @29 VALUE @39 "Not Valid";
      IF LAST.&ID THEN PUT ;
      RETURN;
      REPORT_HEAD:
         PUT @1 "Exceptions Report For Data Set &DATASET" /
                "Using Validation Data Set &VALID_FILE" //
             @1 "Patient ID"
             @18 "Variable"
             @29 "Value"
             @39 "Reason" /
             @1 60*"-";
   RUN;

   PROC DATASETS LIBRARY=WORK NOLIST;
      DELETE PAT C_VALID;
   RUN;
   QUIT;
%MEND RANGE;
The first part of this macro creates the validation data set from the raw data file, using the user-defined informat $MISS as before. Using a two-dimensional array, you can hold up to 10 separate ranges (1). The first dimension tells you which range you are working with, and the second dimension tells whether the value is the lower (value of 1) or upper (value of 2) end of the range. A quick and easy way to count the number of ranges is to count the number of dashes in the VALUES_LIST string. This is accomplished by a useful trick: take the length of the original string and subtract the length of the string after you have removed (COMPRESS function) the dashes (2).

For each of the ranges, you have to extract the lower and upper value. This is accomplished by the statements starting with the DO loop (3). First, scan the entire VALUES_LIST and break it into "words." If a word contains a dash (4), you know that you have located a range rather than a discrete character value. Use the SCAN function again, this time using a dash as the delimiter. The portion of the word before the dash is the lower range and is stored in the array element RANGES[N_RANGE,1]. The portion after the dash is the upper range and is stored in the array element RANGES[N_RANGE,2].

Restructuring the data set to be validated is the same as before. This time, a section was added to check for ranges (5). The remainder of the program is identical to the previous macro. To test this macro, call it as follows:

%RANGE(PATNO,TEST_CHAR,C:\CLEANING\VALID_RANGE.TXT)

where the contents of the raw data file VALID_RANGE.TXT are:

CODE A B AB C DDD-FFF X-Z  Y
GENDER M F  N
AE 0 1  Y
The report generated by this macro is shown next.

Exceptions Report For Data Set TEST_CHAR
Using Validation Data Set C:\CLEANING\VALID_RANGE.TXT

Patient ID       Variable   Value     Reason
------------------------------------------------------------
003              CODE       BA        Not Valid

004              GENDER               Missing

005              AE         Z         Not Valid
005              GENDER     Y         Not Valid

006              CODE       AC        Not Valid
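The dash-counting trick and the second SCAN pass can be checked on a literal string before trusting them inside the macro. The following is a minimal sketch; the data set RANGE_DEMO and the variables LOW and HIGH are hypothetical names used only here.

```sas
/* Count the dashes, then split each dashed word into its two ends. */
DATA RANGE_DEMO;
   LENGTH VALUES_LIST $ 200 WORD $ 17 LOW HIGH $ 8;
   VALUES_LIST = "A B AB C DDD-FFF X-Z";
   /* Number of dashes = length with dashes - length without them */
   N_OF_RANGES = LENGTH(VALUES_LIST) -
                 LENGTH(COMPRESS(VALUES_LIST, "-"));
   PUT N_OF_RANGES=;                 /* writes N_OF_RANGES=2 to the log */
   DO I = 1 TO 200 UNTIL (WORD = " ");
      WORD = SCAN(VALUES_LIST, I, " ");
      IF INDEX(WORD, "-") NE 0 THEN DO;
         LOW  = SCAN(WORD, 1, "-");  /* text before the dash */
         HIGH = SCAN(WORD, 2, "-");  /* text after the dash  */
         PUT LOW= HIGH=;
      END;
   END;
RUN;
```

The log should show two ranges, DDD to FFF and X to Z. Keep in mind that the count is only as good as the input: a discrete value that itself contains a dash would be miscounted as a range.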
Combining Numeric and Character Validity Checks in a Single Macro with a Single Validation Data Set
In this section, the two previous macros are combined: one to test for valid numeric values and the other to check for discrete character values or character ranges. The validation data file includes information on both the numeric and character variables, in any order. For the numeric variables, include the variable name (in uppercase or lowercase), the minimum and maximum values, and a 'Y' if missing values are okay (not to be flagged as errors). For the character variables, include the variable name, a list of discrete character values and/or ranges of character values, and a missing value indicator. Be sure to insert at least two spaces between the values list and the 'Y' or 'N'. The discrete character values and ranges can be in any order.

For example, to write a validation file for the following three rules:

   a variable called CODE with acceptable values of A, B, K-P, and FF-HH, where missing values are okay

   a variable called X with a minimum and maximum value of 10 and 20, where missing values are not okay (to be considered as errors)

   a variable called CHOICE with acceptable values of W, B, and A, where missing values are not okay (to be considered as errors)

you enter the following three data lines into your validation file:

CODE A B K-P FF-HH  Y
X 10 20
CHOICE W B A  N
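The two-space rule for character lines comes from the way list input with the ampersand (&) modifier works: single blanks are kept as part of the value, and reading stops at two or more consecutive blanks. The following is a small sketch of that behavior only; the data set name VALID_DEMO is an assumption for illustration, and the sketch reads in-stream lines rather than a validation file.

```sas
data work.valid_demo;
   infile datalines missover;
   length varname $ 32 values_list $ 200 miss_okay $ 1;

   /* The & modifier lets VALUES_LIST contain single blanks; the value
      ends where two or more consecutive blanks are found */
   input varname $ values_list & $200. miss_okay $;
   miss_okay = upcase(miss_okay);
datalines;
CODE A B K-P FF-HH  Y
CHOICE W B A  N
;

proc print data=work.valid_demo noobs;
run;
```

With only one space before the flag, the flag would be read as part of VALUES_LIST, which is why the extra space matters.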
Program: Creating a Macro to Validate Both Numeric and Character Data, Including Character Ranges, with a Single Validation Data File

*--------------------------------------------------------------------*
| Program Name: VALID_ALL.SAS in C:\CLEANING                         |
| Purpose: This macro takes an ID variable, a SAS data set to be     |
|          validated, and a validation data file, and checks for     |
|          discrete character values or character ranges for         |
|          character variables and valid ranges for numeric data     |
|          and prints an exception report.                           |
| Arguments: ID         - ID variable name                           |
|            DATASET    - SAS data set to be validated               |
|            VALID_FILE - Validation data file containing variable   |
|                         names, discrete values and/or ranges for   |
|                         character variables, minimum and maximum   |
|                         values for numeric variables, and a        |
|                         missing value flag 'Y' means missing       |
|                         values are okay.                           |
| Example: %VALID_ALL(PATNO,CLEAN.PATIENTS,C:\CLEANING\VALID_ALL.TXT)|
*--------------------------------------------------------------------*;

%MACRO VALID_ALL(ID,          /* ID variable                */
                 DATASET,     /* Data set to be validated   */
                 VALID_FILE   /* Validation data file       */
                 );

   ***Get a list of variable names and type (numeric or character);
   PROC CONTENTS NOPRINT DATA=&DATASET
                 OUT=NAMETYPE(KEEP=NAME TYPE);          /* ➊ */
   RUN;

   ***Find number of observations in data set NAMETYPE and assign
      to a macro variable;
   %LET DSID = %SYSFUNC(OPEN(NAMETYPE));                /* ➋ */
   %LET NUM_OBS = %SYSFUNC(ATTRN(&DSID,NOBS));
   %LET RC = %SYSFUNC(CLOSE(&DSID));

   ***Place the variable names and types in a single observation,
      using NAMES1-NAMESn and TYPE_VAR1-TYPE_VARn to hold the
      variable names and types, respectively;
   DATA XTYPE;                                          /* ➌ */
      SET NAMETYPE END=LAST;
      NAME = UPCASE(NAME);
      ARRAY NAMES[&NUM_OBS] $ 32;
      ARRAY TYPE_VAR[&NUM_OBS];
      RETAIN NAMES1-NAMES&NUM_OBS TYPE_VAR1-TYPE_VAR&NUM_OBS;
      NAMES[_N_] = NAME;
      TYPE_VAR[_N_] = TYPE;
      IF LAST THEN OUTPUT;
      KEEP NAMES1-NAMES&NUM_OBS TYPE_VAR1-TYPE_VAR&NUM_OBS;
   RUN;

   ***Turn the validation data file into a SAS data set;
   PROC FORMAT;
      INVALUE $MISS(UPCASE DEFAULT=1) 'Y'   = 'Y'
                                      OTHER = 'N';
   RUN;

   ***Need to distinguish lines with numeric ranges from ones with
      character values and ranges. Use the variable TYPE from the
      one observation data set (XTYPE) above;
   DATA VALID;
      ARRAY NAMES[&NUM_OBS] $ 32;
      ARRAY TYPE_VAR[&NUM_OBS];
      LENGTH VARNAME $ 32 VALUES_LIST $ 200 MISS_OKAY $ 1 WORD $ 17;
      INFILE "&VALID_FILE" MISSOVER;
      IF _N_ = 1 THEN SET XTYPE;
      INPUT VARNAME @;                                  /* ➎ */
      VARNAME = UPCASE(VARNAME);

      ***Find VARNAME in NAMES array and determine TYPE;
      DO I = 1 TO &NUM_OBS;
         IF VARNAME = NAMES[I] THEN DO;
            IF TYPE_VAR[I] = 1 THEN DO;
               INPUT MIN MAX MISS_OKAY : $MISS.;
               TYPE = 'N';
            END;
            ELSE IF TYPE_VAR[I] = 2 THEN DO;
               INPUT VALUES_LIST & $200. MISS_OKAY : $MISS.;
               ***Separate VALUES_LIST into individual values and
                  ranges;
               ARRAY RANGES[10,2] $ 8 R1-R20;
               N_OF_RANGES = LENGTH(VALUES_LIST) -
                             LENGTH(COMPRESS(VALUES_LIST,"-"));
               ***Break list into "words";
               N_RANGE = 0;
               DO I = 1 TO 200 UNTIL (WORD = " ");
                  WORD = SCAN(VALUES_LIST,I," ");
                  IF INDEX(WORD,'-') NE 0 THEN DO;
                     ***Range found, scan again to get lower and
                        upper values;
                     N_RANGE + 1;
                     RANGES[N_RANGE,1] = SCAN(WORD,1,"-");
                     RANGES[N_RANGE,2] = SCAN(WORD,2,"-");
                  END;
               END;
               TYPE = 'C';
            END;
            OUTPUT;
            LEAVE;
         END;
      END;
      KEEP VARNAME MIN MAX MISS_OKAY VALUES_LIST TYPE R1-R20
           N_OF_RANGES;
   RUN;

   ***Restructure data set to be validated. Need separate variable
      to hold numeric and character values;
   DATA PAT;
      SET &DATASET;
      ARRAY CHARS[*] _CHARACTER_;
      ARRAY NUMS[*] _NUMERIC_;
      LENGTH VARNAME $ 32;
      N_CHARS = DIM(CHARS);
      N_NUMS = DIM(NUMS);

      DO I = 1 TO N_CHARS;                              /* ➑ */
         CALL VNAME(CHARS[I],VARNAME);
         VARNAME = UPCASE(VARNAME);
         C_VALUE = CHARS[I];
         OUTPUT;
      END;
      DO I = 1 TO N_NUMS;
         CALL VNAME(NUMS[I],VARNAME);
         VARNAME = UPCASE(VARNAME);
         N_VALUE = NUMS[I];
         C_VALUE = " ";
         OUTPUT;
      END;
      KEEP PATNO VARNAME C_VALUE N_VALUE;
   RUN;

   PROC SORT DATA=PAT;
      BY VARNAME;
   RUN;

   PROC SORT DATA=VALID;
      BY VARNAME;
   RUN;

   DATA VERIFY;
      ARRAY RANGES[10,2] $ 8 R1-R20;
      MERGE PAT(IN=IN_PAT) VALID(IN=IN_VALID);
      BY VARNAME;
      IF NOT(IN_PAT AND IN_VALID) THEN DELETE;
      LENGTH TOKEN $ 8;

      ***Character variable section;
      IF TYPE = 'C' THEN DO;
         ***Obviously bad values;
         IF (VERIFY (C_VALUE,VALUES_LIST) NE 0 AND N_OF_RANGES = 0)
            OR C_VALUE = ' ' AND MISS_OKAY = 'N' THEN DO;
            OUTPUT;
            RETURN;
         END;

         ***Checking for discrete values;
         FLAG = 0;
         DO I = 1 TO 99;
            TOKEN = SCAN(VALUES_LIST,I);
            IF C_VALUE = TOKEN THEN FLAG + 1;
            IF TOKEN = ' ' OR FLAG > 0 THEN LEAVE;
         END;

         ***Checking for ranges;
         R_FLAG = 0;
         IF N_OF_RANGES > 0 THEN DO I = 1 TO N_OF_RANGES;
            IF C_VALUE > RANGES[I,1] AND
               C_VALUE < RANGES[I,2] THEN DO;
               R_FLAG + 1;
               LEAVE;
            END;
         END;
         IF FLAG = 0 AND R_FLAG = 0 THEN OUTPUT;
      END;
      ***End of character section;

      ***Numeric variable section;
      IF TYPE = 'N' THEN DO;
         IF (N_VALUE LT MIN OR N_VALUE GT MAX) AND
            NOT(N_VALUE = . AND MISS_OKAY EQ 'Y') THEN OUTPUT;
      END;
      ***End of numeric section;

      DROP VALUES_LIST TOKEN I FLAG;
   RUN;

   PROC SORT DATA=VERIFY;
      BY PATNO VARNAME;
   RUN;

   ***Reporting section;
   OPTIONS NODATE NONUMBER;
   TITLE;

   DATA _NULL_;
      FILE PRINT HEADER = REPORT_HEAD;
      SET VERIFY;
      BY PATNO;

      ***Numeric variables;
      IF TYPE = 'N' THEN DO;
         IF N_VALUE = . THEN PUT
            @1 PATNO
            @18 VARNAME
            @39 "Missing";
         ELSE IF N_VALUE GT . AND N_VALUE LT MIN THEN PUT
            @1 PATNO
            @18 VARNAME
            @29 N_VALUE
            @39 "Below Minimum (" MIN +(-1) ")";
         ELSE IF N_VALUE GT MAX THEN PUT
            @1 PATNO
            @18 VARNAME
            @29 N_VALUE
            @39 "Above Maximum (" MAX +(-1) ")";
         IF LAST.PATNO THEN PUT ;
      END;
      ***End of numeric report;

      ***Character report;
      IF TYPE = 'C' THEN DO;
         IF C_VALUE = ' ' THEN PUT
            @1 PATNO
            @18 VARNAME
            @39 "Missing";
         ELSE PUT
            @1 PATNO
            @18 VARNAME
            @29 C_VALUE
            @39 "Not Valid";
         IF LAST.PATNO THEN PUT ;
      END;
      ***End of character report;
      RETURN;

      REPORT_HEAD:
         PUT @1 "Exceptions Report for Data Set &DATASET" /
             "Using Validation Data File &VALID_FILE" //
             @1 "Patient ID"
             @18 "Variable"
             @29 "Value"
             @39 "Reason" /
             @1 60*"-";
   RUN;

   ***Cleanup temporary data sets;
   PROC DATASETS LIBRARY=WORK NOLIST;
      DELETE PAT;
      DELETE VERIFY;
      DELETE NAMETYPE;
   RUN;
   QUIT;
   RUN;
%MEND VALID_ALL;
At the beginning of this macro, PROC CONTENTS ➊ is used to create an output data set (NAMETYPE), which contains the variable name and type (1 = numeric, 2 = character). Next, the macro function %SYSFUNC is called three times to place the number of observations in the data set NAMETYPE (the output data set from PROC CONTENTS) into the macro variable NUM_OBS ➋. See SAS Macro Programming Made Easy, by Michele Burlew, for more details on %SYSFUNC.

Starting with line ➌, you create a data set that has one observation with variables NAMES1-NAMESn and TYPE_VAR1-TYPE_VARn, where n is the number of observations in the NAMETYPE data set. This data set will be used later to look up a variable name and determine its type.

In order to read the validation data file and create the validation data set, you need to decide whether you are reading a line corresponding to a numeric variable (the variable name, minimum and maximum values, and a missing flag) or one corresponding to a character variable (the variable name, the list of valid character values or ranges, and a missing flag).
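The three %SYSFUNC calls can be tried on their own. This is a minimal sketch under stated assumptions: WORK.DEMO is a hypothetical five-observation data set created only for the illustration, while OPEN, ATTRN, and CLOSE are the same functions the macro uses.

```sas
/* Hypothetical five-observation data set used only for this sketch */
data work.demo;
   do i = 1 to 5;
      output;
   end;
run;

/* Open the data set, ask for its observation count, and close it */
%let dsid    = %sysfunc(open(work.demo));
%let num_obs = %sysfunc(attrn(&dsid,nobs));
%let rc      = %sysfunc(close(&dsid));

%put NOTE: WORK.DEMO has &num_obs observations.;
```

Because the count ends up in a macro variable, it can be used where a constant is required at compile time, for example as an array dimension such as ARRAY NAMES[&NUM_OBS].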
patno 001-999  N
gender M F  Y
hr 40 100 N
sbp 80 200 Y
dbp 60 120 y
dx 001-999  Y
ae 0 1  y
First, as mentioned in an earlier chapter, the character range 001 to 999 may contain values that include letters (for example, a value containing an A), depending on your operating environment. If that is the case, you should run an additional check on the patient numbers, using a short DATA step. The same is true for DX codes. To run the validation program on the data set CLEAN.PATIENTS, using the data file C:\CLEANING\VALID_ALL.TXT, submit the following statement:

%VALID_ALL(PATNO,CLEAN.PATIENTS,C:\CLEANING\VALID_ALL.TXT)
which results in the following error report:

Exceptions Report For Data Set clean.patients
Using Validation Data File c:\cleaning\VALID_ALL.TXT

Patient ID       Variable   Value     Reason
------------------------------------------------------------
                 PATNO                Missing

002              DX         X         Not Valid
002              DX         X         Not Valid

003              GENDER     X         Not Valid

004              AE         A         Not Valid
004              HR         101       Above Maximum (100)

008              HR         210       Above Maximum (100)

009              DBP        180       Above Maximum (120)
009              SBP        240       Above Maximum (200)

010              GENDER     f         Not Valid
010              HR                   Missing
010              SBP        40        Below Minimum (80)

011              DBP        20        Below Minimum (60)
011              SBP        300       Above Maximum (200)

013              GENDER     2         Not Valid

014              HR         22        Below Minimum (40)

017              HR         208       Above Maximum (100)

020              DBP        8         Below Minimum (60)
020              HR         10        Below Minimum (40)
020              SBP        20        Below Minimum (80)

023              GENDER     f         Not Valid
023              HR         22        Below Minimum (40)
023              SBP        34        Below Minimum (80)

027              HR                   Missing

029              HR                   Missing

321              DBP        200       Above Maximum (120)
321              HR         900       Above Maximum (100)
321              SBP        400       Above Maximum (200)

XX5              PATNO      XX5       Not Valid
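The additional check on patient numbers suggested before the report could look something like the following. It is only a sketch: the data sets IDS and BAD_IDS and the sample values are assumptions for illustration, and VERIFY is used because it returns the position of the first character that is not in the list of allowed characters (0 if there is none).

```sas
/* Hypothetical sample of three-character patient numbers */
data work.ids;
   input patno $3.;
datalines;
001
0A1
XX5
123
;

/* Keep only the IDs containing something other than the digits 0-9 */
data work.bad_ids;
   set work.ids;
   if verify(patno, "0123456789") ne 0;
run;

proc print data=work.bad_ids noobs;
run;
```

Note that an ID shorter than three characters would also be selected, because its trailing blank is not a digit; whether that is desirable depends on how the IDs are supposed to be recorded.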
That's about as far as this book will take the concept of validation data sets.

Starting with Version 7 of SAS software, a feature called integrity constraints was implemented. Integrity constraints are rules that are stored with a SAS data set that can restrict data values accepted into the data set when new data is added with PROC APPEND, DATA step MODIFY, and SQL insert, delete, or update. These constraints are preserved when the data set is copied using PROC COPY, CPORT, or CIMPORT, or is sorted with PROC SORT.

Briefly, there are two types of integrity constraints. One type, called general integrity constraints, allows you to restrict data values that are added to a SAS data set.
Integrity constraints can keep a data set "pure" by rejecting any observations that violate one or more of the constraints. This feature probably has more utility in a data warehouse than in a clinical data application. The reason for this is that when you attempt to append data to an existing data set that contains integrity constraints, those observations that are valid are added and those observations that violate one or more of the integrity constraints are rejected. The SAS Log does not give you information about which observations were accepted or rejected; you will have to determine that on your own. This may change with later releases of SAS software.

Integrity constraints can be created with PROC SQL statements or PROC DATASETS. To demonstrate how you could use integrity constraints to prevent data values being added to an existing SAS data set, let's first create a small data set called IC_DEMO by running the following program:

DATA IC_DEMO;
   INPUT PATNO : $3. GENDER : $1. HR SBP DBP;
DATALINES;
001 M 88 140 80
002 F 84 120 78
003 M 58 112 74
004 F . 200 120
007 M 88 148 102
015 F 82 148 88
;

A listing of this data set is shown next:

Listing of IC_DEMO

Obs    PATNO    GENDER    HR    SBP    DBP

 1      001       M       88    140     80
 2      002       F       84    120     78
 3      003       M       58    112     74
 4      004       F        .    200    120
 5      007       M       88    148    102
 6      015       F       82    148     88
The next program will add integrity constraints to the IC_DEMO data set. The constraints are:

   GENDER must be 'F' or 'M'.
   HR (heart rate) must be between 40 and 100. Missing values are allowed.
   SBP (systolic blood pressure) must be between 80 and 200. Missing values are allowed.
   DBP (diastolic blood pressure) must be between 60 and 140. Missing values are not allowed.
   PATNO (patient number) must be unique.

Here are the PROC DATASETS statements:

PROC DATASETS LIBRARY=WORK NOLIST;
   MODIFY IC_DEMO;
   IC CREATE GEN_CHK = CHECK
      (WHERE=(GENDER IN('F','M')));
   IC CREATE HR_CHK = CHECK
      (WHERE=(HR BETWEEN 40 AND 100 OR HR = .));
   IC CREATE SBP_CHK = CHECK
      (WHERE=(SBP BETWEEN 80 AND 200 OR SBP IS NULL));
   IC CREATE DBP_CHK = CHECK
      (WHERE=(DBP BETWEEN 60 AND 140));
   IC CREATE ID_CHK = UNIQUE(PATNO);
QUIT;
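Since integrity constraints can also be created with PROC SQL, here is a hedged sketch of what similar rules might look like using ALTER TABLE with ADD CONSTRAINT. The data set IC_SQL is a hypothetical two-observation copy made for this illustration (so the constraint names do not collide with those on IC_DEMO), and only three of the five rules are shown.

```sas
/* Hypothetical data set used only for this sketch */
data work.ic_sql;
   input patno : $3. gender : $1. hr sbp dbp;
datalines;
001 M 88 140 80
002 F 84 120 78
;

proc sql;
   /* One constraint per ALTER TABLE statement keeps the syntax simple */
   alter table work.ic_sql
      add constraint gen_chk check (gender in ('F','M'));
   alter table work.ic_sql
      add constraint hr_chk check (hr between 40 and 100 or hr is null);
   alter table work.ic_sql
      add constraint id_chk unique (patno);

   /* List the constraints now stored with the data set */
   describe table constraints work.ic_sql;
quit;
```

Which procedure to use is largely a matter of preference; the constraints are stored with the data set either way.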
Running PROC CONTENTS will display the usual data set information as well as the integrity constraints. The PROC CONTENTS statements and the resulting output (edited) are shown next:

PROC CONTENTS DATA=IC_DEMO;
   TITLE "Output from PROC CONTENTS";
RUN;
Output from PROC CONTENTS

The CONTENTS Procedure

Data Set Name: WORK.IC_DEMO                  Observations:          6
Member Type:   DATA                          Variables:             5
Engine:        V7                            Indexes:               1
Created:       14:41 Tuesday, May 4, 1999    Integrity Constraints: 5
Last Modified: 14:42 Tuesday, May 4, 1999    Observation Length:    32
Protection:                                  Deleted Observations:  0
Data Set Type:                               Compressed:            NO
Label:                                       Sorted:                YES

-----Alphabetic List of Variables and Attributes-----

#    Variable    Type    Len    Pos    Label
---------------------------------------------------------------
5    DBP         Num       8     16    Diastolic Blood Pressure
2    GENDER      Char      1     27    Gender
3    HR          Num       8      0    Heart Rate
1    PATNO       Char      3     24    Patient Number
4    SBP         Num       8      8    Systolic Blood Pressure

-----Alphabetic List of Integrity Constraints-----

     Integrity                           WHERE
#    Constraint    Type      Variables   Clause
--------------------------------------------------------------------
1    dbp_chk       Check                 ((DBP>=60 and DBP<=140))
2    gen_chk       Check                 GENDER in ('F', 'M')
3    hr_chk        Check                 (HR=. or (HR>=40 and HR<=100))
4    id_chk        Unique    PATNO
5    sbp_chk       Check                 ((SBP>=80 and SBP<=200)) or
                                         (SBP is null)
Notice that each of the WHERE clauses that created the integrity constraints is listed in the output from PROC CONTENTS.

What happens when you try to append data that violates one or more of the integrity constraints? The short DATA step shown next creates a data set (NEW) in which the second observation (PATNO = 567) violates the integrity constraint for HR. Let's run PROC APPEND next in an attempt to append the new data to the IC_DEMO data set.
DATA NEW;
   INPUT PATNO : $3. GENDER : $1. HR SBP DBP;
DATALINES;
456 M 66 98 72
567 F 150 130 80
;
PROC APPEND BASE=IC_DEMO DATA=NEW;
RUN;
The SAS Log (shown next) shows that one observation had a data value that violated the heart rate integrity constraint (HR_CHK).

38   ;
39   PROC APPEND BASE=IC_DEMO DATA=NEW;
40   RUN;

NOTE: Appending WORK.NEW to WORK.IC_DEMO.
WARNING: Data value(s) do not comply with integrity constraint HR_CHK
         for file IC_DEMO, 1 observations rejected.
NOTE: 1 observations added.
NOTE: The data set WORK.IC_DEMO has 7 observations and 5 variables.
NOTE: PROCEDURE APPEND used:
      real time           0.33 seconds
As mentioned in the beginning of this section, the SAS Log does not tell you which observation was added and which was rejected. It does ensure that only observations meeting the integrity constraints can be appended to the original data set.

Let's test the UNIQUE property of the PATNO variable. The next DATA step creates a three-observation data set (NEW2) in which the first two observations contain patient numbers that are already in the IC_DEMO data set. Let's see what happens when you attempt to append it to the IC_DEMO data set:

DATA NEW2;
   INPUT PATNO : $3. GENDER : $1. HR SBP DBP;
DATALINES;
003 M 66 98 72
015 F 80 130 80
777 F 70 110 70
;
PROC APPEND BASE=IC_DEMO DATA=NEW2;
RUN;
Here is the resulting SAS Log:

47   ;
48   PROC APPEND BASE=IC_DEMO DATA=NEW2;
49   RUN;

NOTE: Appending WORK.NEW2 to WORK.IC_DEMO.
WARNING: Data value(s) do not comply with integrity constraint ID_CHK
         for file IC_DEMO, 2 observations rejected.
NOTE: 1 observations added.
NOTE: The data set WORK.IC_DEMO has 8 observations and 5 variables.
NOTE: PROCEDURE APPEND used:
      real time           0.04 seconds
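Because the log reports only counts, one way to find out which incoming observations would be rejected is to re-apply the CHECK conditions yourself before appending. The sketch below is an assumption-laden illustration, not a feature of integrity constraints: the data set WOULD_REJECT and the variable CONSTRAINT are hypothetical names, the conditions are retyped by hand from the IC CREATE statements, and the UNIQUE constraint on PATNO is not covered (that would require comparing the new IDs with those already in the base data set).

```sas
data work.new;
   input patno : $3. gender : $1. hr sbp dbp;
datalines;
456 M 66 98 72
567 F 150 130 80
;

/* Re-apply each CHECK condition; write one row per violation */
data work.would_reject;
   set work.new;
   length constraint $ 8;
   if gender not in ("F","M") then do;
      constraint = "GEN_CHK"; output;
   end;
   if not (hr = . or 40 <= hr <= 100) then do;
      constraint = "HR_CHK"; output;
   end;
   if not (sbp = . or 80 <= sbp <= 200) then do;
      constraint = "SBP_CHK"; output;
   end;
   if not (60 <= dbp <= 140) then do;
      constraint = "DBP_CHK"; output;
   end;
run;

proc print data=work.would_reject noobs;
run;
```

For the two observations in NEW, only patient 567 is written out, flagged against HR_CHK, which agrees with the single rejection reported in the first log.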
This section gives only a brief glimpse of integrity constraints. Two sources of additional information about integrity constraints are the SAS Online Documentation for Version 7 (or later) and the SAS Web site (www.sas.com/techsup/download/technote/ts….html).
Appendix
Listing of Raw Data Files and SAS Programs
Description of the Raw Data File PATIENTS.TXT
Layout for the Data File PATIENTS.TXT
Listing of Raw Data File PATIENTS.TXT
Program to Create the SAS Data Set PATIENTS
Listing of Raw Data File PATIENTS2.TXT
Program to Create the SAS Data Set PATIENTS2
Program to Create the SAS Data Set AE (Adverse Events)
Program to Create the SAS Data Set LAB_TEST
Description of the Raw Data File PATIENTS.TXT
The raw data file PATIENTS.TXT contains both character and numeric variables from a typical clinical trial. A number of data errors were included in the file so that you can test the data cleaning programs that are developed in this text. The programs in this book assume that the file PATIENTS.TXT is located in a directory (folder) called C:\CLEANING. This is the directory that is used throughout this text as the location for data files, SAS data sets, SAS programs, and SAS macros.
Layout for the Data File PATIENTS.TXT
Variable                              Starting
Name      Description                 Column    Length  Variable Type  Valid Values
------------------------------------------------------------------------------------------
PATNO     Patient Number              1         3       Character      Numerals only
GENDER    Gender                      4         1       Character      'M' or 'F'
VISIT     Visit Date                  5         10      MMDDYY10.      Any valid date
HR        Heart Rate                  15        3       Numeric        Between 40 and 100
SBP       Systolic Blood Pressure     18        3       Numeric        Between 80 and 200
DBP       Diastolic Blood Pressure    21        3       Numeric        Between 60 and 120
DX        Diagnosis Code              24        3       Character      1 to 3 digit numeral
AE        Adverse Event               27        1       Character      '0' or '1'
Listing of Raw Data File PATIENTS.TXT
1234567890123456789012345 (ruler)
001M11/11/1998 88140 80 10
002F11/13/1998 84120 78 X0
003X10/21/1998 68190100 31
004F01/01/1999101200120 5A
XX5M05/07/1998 68120 80 10
006 06/15/1999 72102 68 61
007M08/32/1998 88148102 0
   M11/11/1998 90190100 0
008F08/08/1998210 70
009M09/25/1999 86240180 41
010f10/19/1999 40120 10
011M13/13/1998 68300 20 41
012M10/12/98 60122 74 0
013208/23/1999 74108 64 1
014M02/02/1999 22130 90 1
002F11/13/1998 84120 78 X0
003M11/12/1999 58112 74 0
015F 82148 88 31
017F04/05/1999208 84 20
019M06/07/1999 58118 70 0
123M15/12/1999 60 10
321F 900400200 51
020F99/99/9999 10 20 8 0
022M10/10/1999 48114 82 21
023f12/31/1998 22 34 78 0
024F11/09/199876 120 80 10
025M01/01/1999 74102 68 51
027FNOTAVAIL NA 166106 70
028F03/28/1998 66150 90 30
029M05/15/1998 41
006F07/07/1999 82148 84 10
Program to Create the SAS Data Set PATIENTS

*----------------------------------------------------------------*
|PROGRAM NAME: PATIENTS.SAS IN C:\CLEANING                       |
|PURPOSE: TO CREATE A SAS DATA SET CALLED PATIENTS               |
|DATE: MAY                                                       |
*----------------------------------------------------------------*;
LIBNAME CLEAN "C:\CLEANING";

DATA CLEAN.PATIENTS;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @1  PATNO  $3.
         @4  GENDER $1.
         @5  VISIT  MMDDYY10.
         @15 HR     3.
         @18 SBP    3.
         @21 DBP    3.
         @24 DX     $3.
         @27 AE     $1.;
   LABEL PATNO  = "Patient Number"
         GENDER = "Gender"
         VISIT  = "Visit Date"
         HR     = "Heart Rate"
         SBP    = "Systolic Blood Pressure"
         DBP    = "Diastolic Blood Pressure"
         DX     = "Diagnosis Code"
         AE     = "Adverse Event?";
   FORMAT VISIT MMDDYY10.;
RUN;
Listing of Raw Data File PATIENTS2.TXT
Listing of the File PATIENTS2.TXT

         1         2
1234567890123456789012345 (ruler)
-----------------------------------
00106/12/1998 80130 80
00106/15/1998 78128 78
00201/01/1999 48102 66
00201/10/1999 70112 82
00202/09/1999 74118 78
00310/21/1998 68120 70
00403/12/1998 70102 66
00403/13/1998 70106 68
00504/14/1998 72118 74
00504/14/1998 74120 80
00611/11/1998100180110
00709/01/1998 68138100
00710/01/1998 68140 98
Program to Create the SAS Data Set PATIENTS2
LIBNAME CLEAN "C:\CLEANING";

DATA CLEAN.PATIENTS2;
   INFILE "C:\CLEANING\PATIENTS2.TXT" PAD;
   INPUT @1  PATNO $3.
         @4  VISIT MMDDYY10.
         @14 HR    3.
         @17 SBP   3.
         @20 DBP   3.;
   FORMAT VISIT MMDDYY10.;
RUN;
Program to Create the SAS Data Set AE (Adverse Events)
LIBNAME CLEAN "C:\CLEANING";

DATA CLEAN.AE;
   INPUT @1  PATNO   $3.
         @4  DATE_AE MMDDYY10.
         @14 A_EVENT $1.;
   LABEL PATNO   = 'Patient ID'
         DATE_AE = 'Date of AE'
         A_EVENT = 'Adverse Event';
   FORMAT DATE_AE MMDDYY10.;
DATALINES;
00111/21/1998W
00112/13/1998Y
00311/18/1998X
00409/18/1998O
00409/19/1998P
01110/10/1998X
01309/25/1998W
00912/25/1998X
02210/01/1998W
02502/09/1999X
;
Program to Create the SAS Data Set LAB_TEST
LIBNAME CLEAN "C:\CLEANING";

DATA CLEAN.LAB_TEST;
   INPUT @1  PATNO    $3.
         @4  LAB_DATE DATE9.
         @13 WBC      5.
         @18 RBC      4.;
   LABEL PATNO    = 'Patient ID'
         LAB_DATE = 'Date of Lab Test'
         WBC      = 'White Blood Cell Count'
         RBC      = 'Red Blood Cell Count';
   FORMAT LAB_DATE MMDDYY10.;
DATALINES;
00115NOV1998 90005.45
00319NOV1998 95005.44
00721OCT1998 82005.23
00422DEC1998110005.55
02501JAN1999 82345.02
02210OCT1998 80005.00
;
Call your local SAS office to order these books from
Books by Users Press
Advanced Log-Linear Models Using SAS® by Daniel Zelterman . . . . . . . . . . . . . .Order No. A57496
Annotate: Simply the Basics by Art Carpenter . . . . . . . . . . . . . . . . .Order No. A57320
Cody’s Data Cleaning Techniques Using SAS ® Software by Ron Cody . . . . . . . . . . . . . . . . . . . .Order No. A57198
Common Statistical Methods for Clinical Research with SAS ® Examples, Second Edition by Glenn A. Walker . . . . . . . . . . . . . .Order No. A58086
Applied Multivariate Statistics with SAS® Software, Second Edition
Concepts and Case Studies in Data Management
by Ravindra Khattree and Dayanand N. Naik . . . . . . . . . . . .Order No. A56903
by William S. Calvert and J. Meimei Ma . . . . . . . . . . . . . . . .Order No. A55220
Applied Statistics and the SAS ® Programming Language, Fourth Edition
Debugging SAS ® Programs: A Handbook of Tools and Techniques
by Ronald P. Cody and Jeffrey K. Smith . . . . . . . . . . . . .Order No. A55984
by Michele M. Burlew . . . . . . . . . . . . .Order No. A57743
An Array of Challenges — Test Your SAS ® Skills
Efficiency: Improving the Performance of Your SAS ® Applications
by Robert Virgile . . . . . . . . . . . . . . . .Order No. A55625
Beyond the Obvious with SAS Screen Control Language
by Robert Virgile . . . . . . . . . . . . . . . .Order No. A55960
®
by Don Stanley . . . . . . . . . . . . . . . . . .Order No. A55073
Carpenter’s Complete Guide to the SAS Macro Language ®
by Art Carpenter . . . . . . . . . . . . . . . . .Order No. A56100
The Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith . . . . . . . . . . . . .Order No. A5515
Categorical Data Analysis Using the SAS ® System, Second Edition by Maura E. Stokes, Charles S. Davis, and Gary G. Koch . . . . . . . . . . . . . . . .Order No. A57998
A Handbook of Statistical Analyses Using SAS®, Second Edition by B.S. Everitt and G. Der . . . . . . . . . . . . . . . . . . . . . .Order No. A58679
Health Care Data and the SAS® System by Marge Scerbo, Craig Dickstein, and Alan Wilson . . . . . . . . . . . . . . . . .Order No. A57638
The How-To Book for SAS/GRAPH ® Software by Thomas Miron
. . . . . . . . . . . . . . .Order No. A55203
In the Know... SAS® Tips and Techniques From Around the Globe by Phil Mason . . . . . . . . . . . . . . . . . .Order No. A55513
support.sas.com/pubs
Integrating Results through Meta-Analytic Review Using SAS® Software by Morgan C. Wang and Brad J. Bushman . . . . . . . . . . . .Order No. A55810
Learning SAS ® in the Computer Lab, Second Edition
Output Delivery System: The Basics by Lauren E. Haworth . . . . . . . . . . . . Order No. A58087
Painless Windows: A Handbook for SAS ® Users by Jodie Gilmore . . . . . . . . . . . . . . . . Order No. A55769 (for Windows NT and Windows 95)
by Rebecca J. Elliott . . . . . . . . . . . . .Order No. A57739 The Little SAS ® Book: A Primer by Lora D. Delwiche and Susan J. Slaughter . . . . . . . . . .Order No. A55200 The Little SAS ® Book: A Primer, Second Edition by Lora D. Delwiche and Susan J. Slaughter . . . . . . . . . .Order No. A56649 (updated to include Version 7 features) Logistic Regression Using the SAS® System: Theory and Application by Paul D. Allison . . . . . . . . . . . . . . .Order No. A55770 Longitudinal Data and SAS : A Programmer’s Guide by Ron Cody . . . . . . . . . . . . . . . . . . .Order No. A58176 ®
Painless Windows: A Handbook for SAS ® Users, Second Edition by Jodie Gilmore . . . . . . . . . . . . . . . . Order No. A56647 (updated to include Version 7 features)
PROC TABULATE by Example by Lauren E. Haworth . . . . . . . . . . . . Order No. A56514
Professional SAS ® Programmer’s Pocket Reference, Fourth Edition by Rick Aster . . . . . . . . . . . . . . . . . . . Order No. A58128
Professional SAS ® Programmer’s Pocket Reference, Second Edition by Rick Aster . . . . . . . . . . . . . . . . . . . Order No. A56646
Maps Made Easy Using SAS® by Mike Zdeb . . . . . . . . . . . . . . . . . . .Order No. A57495
Professional SAS ® Programming Shortcuts
Models for Discrete Date by Daniel Zelterman . . . . . . . . . . . . .Order No. A57521
Programming Techniques for Object-Based Statistical Analysis with SAS® Software
Multiple Comparisons and Multiple Tests Using SAS® Text and Workbook Set (books in this set also sold separately) by Peter H. Westfall, Randall D. Tobias, Dror Rom, Russell D. Wolfinger and Yosef Hochberg . . . . . . . . . . . . . Order No. A55770
Multiple-Plot Displays: Simplified with Macros by Perry Watts . . . . . . . . . . . . . . . . . . Order No. A58314
Multivariate Data Reduction and Discrimination with SAS ® Software by Ravindra Khattree, and Dayanand N. Naik . . . . . . . . . . .Order No. A56902
The Next Step: Integrating the Software Life Cycle with SAS ® Programming by Paul Gill . . . . . . . . . . . . . . . . . . . . Order No. A55697
support.sas.com/pubs
by Rick Aster . . . . . . . . . . . . . . . . . . . Order No. A59353
by Tanya Kolosova and Samuel Berestizhevsky. . . . . . . Order No. A55869
Quick Results with SAS/GRAPH ® Software by Arthur L. Carpenter and Charles E. Shipp . . . . . . . . . . . . Order No. A55127
Quick Results with the Output Delivery System by Sunil Gupta . . . . . . . . . . . . . . . . . . .Order No. A58458
Quick Start to Data Analysis with SAS ® by Frank C. Dilorio and Kenneth A. Hardy. . . . . . . . . . . . Order No. A55550
Reading External Data Files Using SAS®: Examples Handbook by Michele M. Burlew . . . . . . . . . . . . Order No. A58369
Regression and ANOVA: An Integrated Approach Using SAS ® Software
SAS ® System for Elementary Statistical Analysis, Second Edition
by Keith E. Muller and Bethel A. Fetterman. . . . . . . . . . Order No. A57559
by Sandra D. Schlotzhauer and Ramon C. Littell . . . . . . . . . . . . .Order No. A55172
Reporting from the Field: SAS ® Software Experts Present Real-World Report-Writing
SAS ® System for Forecasting Time Series, 1986 Edition
Applications . . . . . . . . . . . . . . . . . . .Order No. A55135
by John C. Brocklebank and David A. Dickey . . . . . . . . . . . . .Order No. A5612
SAS ®Applications Programming: A Gentle Introduction by Frank C. Dilorio . . . . . . . . . . . . . .Order No. A56193
SAS® System for Mixed Models by Ramon C. Littell, George A. Milliken, Walter W. Stroup, and Russell D. Wolfinger . . Order No. A55235
SAS® for Forecasting Time Series, Second Edition by John C. Brocklebank and David A. Dickey . . . . . . . . . . . . . Order No. A57275
SAS® System for Regression, Second Edition by Rudolf J. Freund and Ramon C. Littell . . . . . . . . . . . . . Order No. A56141
SAS® for Linear Models, Fourth Edition by Ramon C. Littell, Walter W. Stroup, and Rudolf Freund . . . . . . . . . . . . . . Order No. A56655
SAS ® System for Statistical Graphics, First Edition by Michael Friendly . . . . . . . . . . . . . .Order No. A56143
SAS® for Monte Carlo Studies: A Guide for Quantitative Researchers by Xitao Fan, Ákos Felsővályi, Stephen A. Sivo, and Sean C. Keenan . . . . . . . . . . . . . Order No. A57323
SAS® Macro Programming Made Easy by Michele M. Burlew . . . . . . . . . . . . Order No. A56516
The SAS® Workbook and Solutions Set (books in this set also sold separately) by Ron Cody . . . . . . . . . . . . . . . . . . . Order No. A55594
Selecting Statistical Techniques for Social Science Data: A Guide for SAS® Users by Frank M. Andrews, Laura Klem, Patrick M. O’Malley, Willard L. Rodgers, Kathleen B. Welch, and Terrence N. Davidson . . . . . . . . Order No. A55854
SAS® Programming by Example by Ron Cody and Ray Pass . . . . . . . . . . . . . . . . . . . Order No. A55126
SAS ® Programming for Researchers and Social Scientists, Second Edition by Paul E. Spector . . . . . . . . . . . . . . .Order No. A58784
SAS ® Software Roadmaps: Your Guide to Discovering the SAS ® System by Laurie Burch and SherriJoyce King . . . . . . . . . . . .Order No. A56195
SAS ® Software Solutions: Basic Data Processing by Thomas Miron . . . . . . . . . . . . . . .Order No. A56196
SAS® Survival Analysis Techniques for Medical Research, Second Edition by Alan B. Cantor . . . . . . . . . . . . . . .Order No. A58416
Solutions for Your GUI Applications Development Using SAS/AF ® FRAME Technology by Don Stanley . . . . . . . . . . . . . . . . .Order No. A55811
Statistical Quality Control Using the SAS ® System by Dennis W. King . . . . . . . . . . . . . . .Order No. A55232
A Step-by-Step Approach to Using the SAS ® System for Factor Analysis and Structural Equation Modeling by Larry Hatcher . . . . . . . . . . . . . . . .Order No. A55129
A Step-by-Step Approach to Using the SAS ® System for Univariate and Multivariate Statistics by Larry Hatcher and Edward Stepanski . . . . . . . . . . .Order No. A55072
Step-by-Step Basic Statistics Using SAS®: Student Guide and Exercises (books in this set also sold separately) by Larry Hatcher . . . . . . . . . . . . . . . . . Order No. A57541
Strategic Data Warehousing Principles Using SAS® Software by Peter R. Welbrock . . . . . . . . . . . . Order No. A56278
Survival Analysis Using the SAS® System: A Practical Guide by Paul D. Allison . . . . . . . . . . . . . . . Order No. A55233
Table-Driven Strategies for Rapid SAS® Applications Development by Tanya Kolosova and Samuel Berestizhevsky . . . . . . . Order No. A55198
Tuning SAS® Applications in the MVS Environment by Michael A. Raithel . . . . . . . . . . . . Order No. A55231
Univariate and Multivariate General Linear Models: Theory and Applications Using SAS® Software by Neil H. Timm and Tammy A. Mieczkowski . . . . . . . Order No. A55809
Using SAS® in Financial Research by Ekkehart Boehmer, John Paul Broussard, and Juha-Pekka Kallunki . . . . . . . . . Order No. A57601
Using the SAS® Windowing Environment: A Quick Tutorial by Larry Hatcher . . . . . . . . . . . . . . . . Order No. A57201
Visualizing Categorical Data by Michael Friendly . . . . . . . . . . . . . . Order No. A56571
Working with the SAS® System by Erik W. Tilanus . . . . . . . . . . . . . . . Order No. A55190
Your Guide to Survey Research Using the SAS® System by Archer Gravely . . . . . . . . . . . . . . . Order No. A55688
JMP® Books
Basic Business Statistics: A Casebook by Dean P. Foster, Robert A. Stine, and Richard P. Waterman . . . . . . . . . Order No. A56813
Business Analysis Using Regression: A Casebook by Dean P. Foster, Robert A. Stine, and Richard P. Waterman . . . . . . . . . Order No. A56818
JMP® Start Statistics, Second Edition by John Sall, Ann Lehman, and Lee Creighton . . . . . . . . . . . . . . . Order No. A58166
Regression Using JMP® by Rudolf J. Freund, Ramon C. Littell, and Lee Creighton . . . . . . . . . . . . . . . Order No. A58789