Вы находитесь на странице: 1из 77

Introduction to SAS Essentials

Mastering SAS for Data Analytics

Alan Elliott and Wayne Woodward


1 SAS ESSENTIALS -- Elliott & Woodward
Chapter 6: Advanced Programming Topics
Part I

2 SAS ESSENTIALS -- Elliott & Woodward


LEARNING OBJECTIVES
 To be able to use SAS® Functions
 To be able to TRANSPOSE data sets
 To be able to perform data recoding using SELECf
 To be able to use SAS programming techniques to clean
up a messy data set

3 SAS ESSENTIALS -- Elliott & Woodward


6.1 USING SAS FUNCTIONS
 Sophisticated calculations can be created using SAS
functions. These functions include arithmetic and text
and date manipulation. The format for the use of SAS
functions is

result = function(argl,arg2, etc.);

 Where arg1 etc are arguments (information) you “send”


to the function for it to act upon, and it returns a result.

4 SAS ESSENTIALS -- Elliott & Woodward


Function Arguments
 Functions can require one or more arguments. Some
require no arguments. For example, a few mathematical
functions in SAS are as follows:

ABS(X); * Absolute value;


FACT(X); * Factorial;
INT(X); * Integer portion of a value;
LOG(X); * Natural log;
SQRT(X); * Square root;

5 SAS ESSENTIALS -- Elliott & Woodward


Types of functions
 Function in SAS can be categorized into these types:
 ARITHMETIC
 TRIGNOMETRIC
 DATE AND TIME
 CHARACTER
 TRUNCATION
 SPECIAL USE AND MISCELANNEOUS
 FINANCIAL
 ACCESS PREVIOUS OBSERVATIONS (LAG) F

6 SAS ESSENTIALS -- Elliott & Woodward


Using Functions in a Calculation
 Functions can also be used as a part of a more extensive
calculation. For example,

C =MEASURE+ SQRT(A**2 + B**2 ) ;

 would calculate the square of A and then the square of B,


then add those two numbers, take the square root of that
value, add that number to the value of MEASURE, and
assign the result to the variable named C.

7 SAS ESSENTIALS -- Elliott & Woodward


Functions that take more than one argument
 Some examples of functions that take more the one
argument are:

MAX(xl,x2,x3, ... ) ; * Maximum of a list;


MIN(xl,x2,x3, ... ) ; * Minimum of a list;
SUM(xl,x2,x3, ... ) ; * Sum of a list;
MEDIAN(xl,x2,x3, ... ) ; * Median of a list;
ROUND(value, round); * Round off a value;

 For example
TOTAL=SUM(TIMEl,TIME2,TIME3,TIME4);

8 SAS ESSENTIALS -- Elliott & Woodward


Rounding Values
 The round-off unit in the ROUND function determines how the
rounding will be performed. The default is 1. A round-off value of
0.01 means to round it off to the nearest lOOth, and a round-off
value of 5 means to round it off to the nearest 5. Here are a few
examples:

ROUND (3.1415, .01) *Returns the value 3.14;


ROUND(107,5) * Returns the value 105;
ROUND(3.6234)* Returns the value 4 (rounds to
integer);

9 SAS ESSENTIALS -- Elliott & Woodward


A Function with No Argument
 A function that takes no argument is TODAY ().For
example, the code

NOW=TODAY();

 puts the current date value from the computer's clock


into the variable named NOW.

10 SAS ESSENTIALS -- Elliott & Woodward


Specifying Arguments
 When arguments are a list of values, such as in MAX or MIN,
you can specify the list as variables separated by commas, or
as a range preceded by the word OF. For example, if these
variables have the following values:

X1 = 1; X2 = 2; X3 = 13; X4=10;

 Then

MAX(1,2,3,4,5) * Returns the value 5;


MAX(X1,X2,X3,X4) * Returns the value 13;
MAX(OF X1-X4) * Returns the value 13;

11 SAS ESSENTIALS -- Elliott & Woodward


More about Specifying Arguments
 Note that the designation OF X1-X4 is interpreted by SAS
as all same-named consecutively numbered variables
from X1 to X4 in the MAX () function example above. If
there are missing values in the list, they are ignored.
Other similar functions are illustrated here:

MIN(OF X1-X4) * Returns the value 1;


SUM(OF X1-X4) * Returns the value 26;
MEDIAN(OF X1-X4) * Returns the value;
NMISS(OF X1-X4) * Returns 0 (# missing values);
N(OF X1-X4) * Returns 4 (# non-missing);

12 SAS ESSENTIALS -- Elliott & Woodward


EXAMPLE
Suppose you have this code
DATA NUM;
X1 = 1; X2 = 2; X3 = 13; X4=.;
M1= SUM(X1-X4) ;
M2= SUM(OF X1-X4);
RUN;
PROC PRINT; RUN;

What is the value of M1 and M2?


EXERCISE – Enter this code and run the program. PAUSE and return
when you’ve successfully run the code

13 SAS ESSENTIALS -- Elliott & Woodward


RESULTS
 The answer is shown here. When you use
M1= SUM(X1-X4) ;
A missing value in the list makes the returned value missing,
whereas if you use
M2= SUM(OF X1-X4);
The function uses only non-missing values in its evaluation
of the function.

14 SAS ESSENTIALS -- Elliott & Woodward


6.2 USING PROC TRANSPOSE
 PROC TRANSPOSE allows you to restructure the values in
your data set by transposing (or reorienting) the data.
This is typically performed when your data are not in the
structure required for an analysis.

ROWS Transpose
COLUMNS

This is one example of how PROC


Transpose is used.

15 SAS ESSENTIALS -- Elliott & Woodward


Simplified Syntax for TRANSPOSE
PROC TRANSPOSE DATA=input-data OUT=output-data;
<PREFIX=prefix>; PREFIX specifies a prefix to the
names of variables created in the
<BY <variables>; transposition. The default names
are COL1, COL2, and so on
<ID variable>;
VAR variables;
The BY variable, when specified,
indicates the variable that is used to
The VAR statement form BY groups.
specifies which variables
are to be transposed.

16 SAS ESSENTIALS -- Elliott & Woodward


Hands On Example Page 140
 Open the file DTRANSPOSE1.SAS

DATA SUBJECTS;
INPUT SUB1 $ SUB2 $ SUB3 $ SUB4 $;
DATALINES;
12 21 13 14
13 21 12 14 Notice how the data are not in a
desired form. Use PROC TRANSPOSE
15 31 23 23 to “flip” the data.
15 33 21 32
M F F M
;
RUN;

17 SAS ESSENTIALS -- Elliott & Woodward


Using PROC TRANSPOSE to “flip” data
PROC TRANSPOSE DATA=SUBJECTS
OUT=TRANSPOSED;
VAR SUB1 SUB2 SUB3 SUB4;
RUN; Results: Note that
column names
PROC PRINT DATA=TRANSPOSED; are COL1, COL2
RUN; by default.

Obs _NAME_ COL1 COL2 COL3 COL4 COL5


1 SUB1 12 13 15 15 M
2 SUB2 21 21 31 33 F
3 SUB3 13 12 23 21 F
4 SUB4 14 14 23 32 M

18 SAS ESSENTIALS -- Elliott & Woodward


HANDS ON EXERCISE
 You want column named to be INFO1, INFO2, etc. Do this
with the PREFIX Statement:

EXERCISE – 1. Change the PROC statement to

PROC TRANSPOSE DATA=SUBJECTS OUT=TRANSPOSED


PREFIX=INFO;

Rerun the code, and observe the results… particularly the


column headings. PAUSE and return when you’ve run the new
code.

19 SAS ESSENTIALS -- Elliott & Woodward


RESULTS
 The added PREFIX= statement in
PROC TRANSPOSE DATA=SUBJECTS
OUT=TRANSPOSED PREFIX=INFO;
 Changed the column headings: Note new
column names

20 SAS ESSENTIALS -- Elliott & Woodward


CONTINUE EXERCISE
 2. To make the results better, add the following code (to
RENAME the columns)
This is used to rename the
variables. For example
DATA NEW;SET TRANSPOSED; INFO2 is renamed T2.

RENAME INFO1=T1 INFO2=T2


INFO3=T3 INFO4=T4
INFO5=GENDER _NAME_=SUBJECT;
RUN;
PROC PRINT DATA=NEW;RUN;
PAUSE – enter the new code, run it, and observe the results.
Return when you are finished.
21 SAS ESSENTIALS -- Elliott & Woodward
RESULTS
• Notice how the column names are now much
more informative.

22 SAS ESSENTIALS -- Elliott & Woodward


ANOTHER WAY TO NAME COLUMNS
 Open the file DTRANSPOSE1a.SAS
 Notice the label variable LAB in this version of the code:

DATA SUBJECTS;
INPUT LAB $ SUB1 $ SUB2 $ SUB3 $ SUB4 $;
DATALINES;
BASELINE 12 21 13 14
TIME1 13 21 12 14
TIME2 15 31 23 23
TIME3 15 33 21 32
GENDER M F F M
;

23 SAS ESSENTIALS -- Elliott & Woodward


TRANSPOSE WITH LABELS
 When you have a label variable, use the ID LAB
statement to create column names.

PROC TRANSPOSE DATA=SUBJECTS


OUT=TRANSPOSED;
ID LAB;
VAR SUB1 SUB2 SUB3 SUB4;
RUN;
PROC PRINT DATA=TRANSPOSED;
RUN;

24 SAS ESSENTIALS -- Elliott & Woodward


RESULTS WHEN YOU RUN THIS CODE
 Notice how the column names reflect the LAB variable
used in the ID LAB Statement.
 (You may still want to rename the _NAME_ Column.)

25 SAS ESSENTIALS -- Elliott & Woodward


USING PROC TRANSPOSE WHEN YOU HAVE
MULTIPLE RECORDS PER SUBJECT
 Suppose you have data
that have one or more
observations per
subject, but you want to
analyze the data by
observation (a set of
observations per row).
Use PROC TRANSPOSE
to transpose the data by
a key variable.

Note: Some subjects have more


than one record in the data set.
26 SAS ESSENTIALS -- Elliott & Woodward
COMBINE MULTIPLE RECORDS ONTO ONE LINE
 Use PROC TRANSPOSE to combine multiple records
(identified by a key variable) onto to one line per record

PROC TRANSPOSE
DATA="C:\SASDATA\COMPLICATIONS"
OUT=COMP_OUT The PREFIX allows you to name the
combined variable.
PREFIX=COMP;
The BY variable identifies the key
BY SUBJECT;
variable to expand on.
VAR COMPLICATION;
RUN;
The VAR identifies which variables
to expand.
27 SAS ESSENTIALS -- Elliott & Woodward
HANDS ON EXERCISE P 143 (DTRANSPOSE2.SAS)
 Run the code.

Performs the transpose.

Cleans up the data and limits


output to subjects with 3 or
more complications.

Limits the names of the complications to


10 characters in length for the report.
28 SAS ESSENTIALS -- Elliott & Woodward
You’ve found that subject
2076 has three
RESULTS OF PROC TRANSPOSE complications listed.

Obs SUBJECT COMP1 COMP2 COMP3 COMP4


1 2076 Pneumonia Heart Atta Renal Fail
2 3585 DVT (Lower Pneumonia Renal Fail
3 3630 DVT (Lower Heart Atta Pancreatit Pneumonia
4 4585 Compartmen Pneumonia Skin Break
5 4599 Aspiration Pneumonia Renal Fail
6 4760 Acute Resp Pneumonia Renal Fail
7 4775 Pneumonia DVT (Lower Pneumonia Renal Fail

Note how PROC TRANSPOSE expanded SUBJECT records that had


multiple complications and named them COMP1, COMP2 etc.

29 SAS ESSENTIALS -- Elliott & Woodward


EXERCISE - ANALYZE THE RESULTS
 How many subjects had Renal Failure? Examine this code

CATT concatenates all of the


DATA RENAL;SET COMP_OUT; complications into a single
CCAT=CATT(OF COMP1-COMP7); variable named CCAT.

IF FIND(UPCASE(CCAT),"RENAL") NE 0 then
RENALFAILURE="Yes";
ELSE RENALFAILURE="No";
RUN; Use FIND() to find any instance of
PROC FREQ DATA=RENAL; “RENAL” in CCAT, and create a new
TABLES RENALFAILURE;RUN; variable named RENALFAILURE

Counts the number of


subjects with Renal Failure.

30 SAS ESSENTIALS -- Elliott & Woodward


WHAT THIS CODE IS DOING…
CCAT=CATT(OF COMP1-COMP7);
Concatenates complications for each subject, producing the
records shown here:

Note how multiple


complications are
concatenated.

31 SAS ESSENTIALS -- Elliott & Woodward


WHAT THIS CODE IS DOING…
IF FIND(UPCASE(CCAT),"RENAL") NE 0 then
RENALFAILURE="Yes";
This is the new
ELSE RENALFAILURE="No"; RENALFAILURE variable
created by the IF
statement
RESULTS

Note this is Yes


because RENAL is in
the complications
list.

32 SAS ESSENTIALS -- Elliott & Woodward


HANDS ON EXERCISE

DATA RENAL;SET COMP_OUT;


CCAT=CATT(OF COMP1-COMP7);

IF FIND(UPCASE(CCAT),"RENAL") NE 0 then
RENALFAILURE="Yes";
ELSE RENALFAILURE="No";
RUN;
PROC FREQ DATA=RENAL;
TABLES RENALFAILURE;RUN;

EXERCISE – Enter and run the new code and observe results.
PAUSE – Return once you’ve completed this exercise.

33 SAS ESSENTIALS -- Elliott & Woodward


RESULTS

You discover that 50 of the


528 subjects had Renal Failure
as one of their complications.

34 SAS ESSENTIALS -- Elliott & Woodward


6.3 THE SELECT STATEMENT
 The SELECT statement evaluates the value of a variable
and creates new assignments based on those values.
Syntax (simplified) is as follows:

SELECT <(select - expression)> ;


WHEN- 1 statement;
WHEN- n statement;
< OTHERWISE statement;>

35 SAS ESSENTIALS -- Elliott & Woodward


SELECT Statement Example
 Suppose you want to calculate NEWVAL according to some
specific values of the variable OBSERVED. That is, if
OBSERVED=1, you want to set NEWVAL=AGE+2. If
OBSERVED=2 or 3, you want to set NEWVAL=AGE+10, and so
on. A SELECT statement to perform this recoding would be as
follows: Note that OBSERVED is the
comparison variable – identified as
SELECT (OBSERVED); the SELECT EXPRESSION

WHEN (1) NEWVAL=AGE+2;


WHEN (1) is
WHEN (2,3) NEWVAL=AGE+10;
interpreted
WHEN (4,5,6) NEWVAL=AGE+20 ; as WHEN
OTHERWISE NEWVAL=0; OBSERVED=1

END;
36 SAS ESSENTIALS -- Elliott & Woodward
Without a Specific Select Expression
 Another way to use SELECT is without a specific select-
expression. In this case, the WHEN statements include
conctitional expressions that should be in parentheses.
For example:
Note in this version there is no specified comparison
variables. Comparisons are specified in the WHEN
statements.
SELECT;
WHEN (GP='A') STATUS2=1;
WHEN (GP='B' and SEX=1) STATUS2=2;
WHEN (GP='C ' and SEX=0) STATUS2=3;
OTHERWISE STATUS2=0; Use this version of SELECT
END; when comparisons are more
complex (and not just equal.)

37 SAS ESSENTIALS -- Elliott & Woodward


Do Hands On Example p 145( DSELECT.SAS)
DATA MYDATA;SET "C: \SASDATA\SOMEDATA";
FORMAT ECONOMIC $7.; Note the difference
in using a specified
SELECT(STATUS);
SELECT expression
WHEN (1,2) ECONOMIC="LOW"; (here is it STATUS)
and in the second
WHEN (3) ECONOMIC="MIDDLE"; part of this example,
WHEN (4,5) ECONOMIC="HIGH"; when you do not
have a specified
OTHERWISE ECONOMIC="MISSING"; expression.
END;
PROC PRINT DATA=MYDATA;
RUN;

38 SAS ESSENTIALS -- Elliott & Woodward


6.4 GOING DEEPER: CLEANING A MESSY DATA SET
 Many of the features in the SAS language are helpful in
cleaning up messy data.
 By messy we mean data sets that are not quite ready for
analysis.
 Most data analysts experience problems dealing with files
that contain data that have coding problems and must be
fixed before a proper analysis is possible.
 This section walks you through a case study of a data set
with problems and illustrates how they might be
corrected.

39 SAS ESSENTIALS -- Elliott & Woodward


A TYPICAL MESSY DATA SET – NOTE SOME OF
THE ISSUES (A FEW ARE MARKED)

40 SAS ESSENTIALS -- Elliott & Woodward


Problems with the data set
 A few of the problems you might quickly note include the
following:
 Line 17 is blank.
 There are non-date values in the "DateLeft" column.
 There is a non-number in the "Age" column (>29).
 Values of Gender are mixed upper and lower cases.
 There are multiple answers in columns that should have only
one answer.

 You can correct these issues by using SAS code. One


reason doing this in SAS code is that that it leaves an
audit trail of changes.

41 SAS ESSENTIALS -- Elliott & Woodward


A TYPICAL MESSY DATA SET – NOTE SOME OF
THE ISSUES (A FEW ARE MARKED)

42 SAS ESSENTIALS -- Elliott & Woodward


6.4.1 FIX LABELS, RENAME VARIABLES
 Often, a first step in creating a clean data set is to attach
labels to the variables to make output more “readable.”
Variable Name Type Label
1 Subject Char Subject ID
2 DateArrived Date Date Arrived
3 TimeArrive Time Time Arrived
4 DateLeft Date Date Left
5 TimeLeft Time Time Left Use the LABEL
6 Married Num Married?
statement to
7 Single Num Single?
8 Age Num Age Jan 1, 2014
created these
9 Gender Char Gender labels…
10 Education Num Years of Schooling
11 Race Char Race
12 How_Arrived Char How Arrived at Clinic
13 Top_Reason Num Top Reason for Coming
14 Arrival Num Temperature
15 Satisfaction Num Satisfaction Score

SAS friendly variable named created in Excel.


43 SAS ESSENTIALS -- Elliott & Woodward
HANDS ON EXERCISE P 148
 Open the file MESSY1. SAS
 Add labels to the variables. For example:

LABEL statement

These statements shows


how you might rename a
variable using a more
descriptive name (TEMP
for temperature.)

44 SAS ESSENTIALS -- Elliott & Woodward


DISPLAY THE RESULTS
 The rest of the code in MESSY1.SAS displays the first 10
records so you can verify the changes
 Run the code to see the changes so far.
Note that we’re saving
PROC PRINT LABEL changes in the file
named CLEANED
DATA=MYSASLIB.CLEANED
(firstobs=1 obs=10);
VAR SUBJECT EDUCATION TEMP
TOP_REASON SATISFACTION;
RUN;

45 SAS ESSENTIALS -- Elliott & Woodward


CLEANED DATA SO FAR Note the explanatory
column labels…

46 SAS ESSENTIALS -- Elliott & Woodward


Other housekeeping in the data set
 There are some other housekeeping chores in the data
set to clean up the variable names. For example
The ARRIVAL variable (which has to do with
TEMP=ARRIVAL; temperature) is renamed to avoid confusion
with the HOW_ARRIVED variable.
DROP ARRIVAL;

47 SAS ESSENTIALS -- Elliott & Woodward


EXERCISE
Variable Name Type Label

1 Subject Char Subject ID PAUSE the


2 DateArrived Date Date Arrived tutorial and
3 TimeArrive Time Time Arrived enter the
4 DateLeft Date Date Left remaining
5 TimeLeft Time Time Left labels as
6 Married Num Married?
shown in this
7 Single Num Single?
table. Run
8 Age Num Age Jan 1, 2014
the code…
9 Gender Char Gender
return once
10 Education Num Years of Schooling

11 Race Char Race


you’ve
12 How_Arrived Char How Arrived at Clinic
completed
13 Top_Reason Num Top Reason for Coming
this exercise.
14 Arrival Num Temperature

15 Satisfaction Num Satisfaction Score

48 SAS ESSENTIALS -- Elliott & Woodward


THE COMPLETED LABEL STATEMENT
LABEL
EDUCATION='Years of Schooling'
HOW_ARRIVED='How Arrived at Clinic'
TOP_REASON='Top Reason for Coming'
SATISFACTION='Satisfaction Score'
Subject="Subject ID"
DateArrived="Date Arrived"
TimeArrive="Time Arrived"
DateLeft="Date Left"
Note: We left out the
TimeLeft="Time Left" label for ARRIVAL since
Married="Married?" we replaced that
Single="Single?" variable with TEMP.
Age="Age Jan 1, 2014"
Gender="Gender"
Race="Race"
Satisfaction="Satisfaction Score";
49 SAS ESSENTIALS -- Elliott & Woodward
Fix Case Problems, Allowed Categories, and Delete
Unneeded Lines
 To correct case problems, you can use the UPCASE ( )
or LOWCASE () function to convert data values to all
upper or all lower case.
 A second common fix is to verify that all items in a
categorical variable are allowable. For example, in the
HOW_ ARRIVED variable, only CAR, BUS, or WALK is
acceptable. This statement can fix that problem…

IF HOW_ARRIVED NOT IN ('CAR', 'BUS',


'WALK') THEN HOW_ARRIVED=" ";

(Refer to the IN() function in Appendix B for more info.)

50 SAS ESSENTIALS -- Elliott & Woodward


More fixes…
 A third easy-to-perform check is to delete irrelevant
records. In this data set, if a line does not contain a
Subject ID, we want to eliminate that record. This is done
with an IF statement

IF SUBJECT ="" THEN DELETE;

51 SAS ESSENTIALS -- Elliott & Woodward


HANDS ON EXAMPLE P 150
 Open the file MESSY2.SAS (Note it includes labels from
the previous exercise.)

Fix case problems.

Eliminate bad
entries in
HOW_ARRIVED

Get rid of empty records.


52 SAS ESSENTIALS -- Elliott & Woodward
EXERCISE – MAKE MORE CORRECTIONS
 Make these changes: Use IN() to correct GENDER
IF GENDER NOT IN(‘M’,’F’) THEN GENDER="";
 Fix RACE using the code The only correct race codes are H, C, and
AA

IF RACE="MEX" OR RACE="M" then RACE="H";


IF RACE=“A" then RACE=“AA";
IF RACE=“W" then RACE=“C";
IF RACE="X" OR RACE="NA" then RACE="";

PAUSE – Make these changes to the code. Rerun the


program and verify that the changes have taken place.
Return to the tutorial when you have completed this
exercise.
53 SAS ESSENTIALS -- Elliott & Woodward
RESULTS – FIRST FEW (CORRECTED) RECORDS
OF OUTPUT

Note the missing


RACE entry where
there was
previously an
incorrect entry.

 We’ll continue with this example in the next tutorial…

54 SAS ESSENTIALS -- Elliott & Woodward


Check and Fix Incorrect Categories, Fix duplicated
Variables
 Two troubling variables are MARRIED and SINGLE. The
survey asked respondents their marital status, and the
information was recorded in the data set where 1 means
yes and 0 means no.
 Technically, these two variables should be the opposite of
each other, and you should only require one of them in
the data set. However, if you look at the frequencies of
each using PROC FREQ, you discover that they are not
telling you the same thing.

55 SAS ESSENTIALS -- Elliott & Woodward


Do Hands On Exercise p 151 (DISCOVER1.SAS)
 Notice how the MARRIED
and SINGLE frequencies
do not match…
 You must make some
decision to reconcile this
problem.
 The researcher should
make this decision.

56 SAS ESSENTIALS -- Elliott & Woodward


TOP REASON FOR COMING
 This variable allows subject to select the top reason for
coming to the clinic. The survey dis not intend to allow
multiple answers, but some respondents chose more
than one answer…results in these data problems:

57 SAS ESSENTIALS -- Elliott & Woodward


SOME FIXES APPLIED TO CATEGORICAL
PROBLEMS
 This code is in the file MESSY3.SAS…
Note the decision to use the SINGLE variable.
Thus, MARRIED is no longer needed.
DROP MARRIED;

IF TOP_REASON NE "1" AND


TOP_REASON NE "2" AND
TOP_REASON NE "3" THEN
TOP_REASON=.; Get rid of any TOP_REASON
RUN; values that are not allowed.

To double check this fix, go back to DISCOVER1.SAS (take


MARRIED out of the PROC PRINT) and run it again to make
sure the fix is correct.
58 SAS ESSENTIALS -- Elliott & Woodward
Check and Fix Out-of-Range Numeric Variables
 Do Hands On Exercise p 153 (DISCOVER2.SAS) to discover
unusual minimum and maximum values in numeric
variables. This can be done using the simple code:
PROC MEANS MAXDEC=2 DATA=CLEAN.CLEANED;
RUN;
Note
problems

Also note that AGE does not show up as a numeric


variable… it is current a character variables.
59 SAS ESSENTIALS -- Elliott & Woodward
Correct “out of range” problems
Convert all
temperatures
to Fahrenheit
IF EDUCATION=99 then EDUCATION=.;
IF TEMP LT 45 THEN TEMP=(9/S)*TEMP+32;
IF TEMP=1018 then TEMP=101.8;
IF SATISFACTION= -99 THEN SATISFACTION=.;
* Convert AGE from character to numeric;
AGEN=INPUT(AGE,5.); When you convert a
variable from character to
DROP AGE; numeric, you can’t use
the same variable name,
LABEL AGEN="Age Jan 1, 2014"; so we chose AGEN to be
the numeric version of
the AGE variable.

60 SAS ESSENTIALS -- Elliott & Woodward


RECHECK AND FIX AGEN
 Now that AGEN is numeric, a rerun of DISCOVER2.SAS
reveals a range problem for AGEN

Out of range
values for
AGEN

Exercise. Using MESSY4.SAS, add this code to correct the AGEN


problem:

IF AGEN LT 10 OR AGEN GT 99 then AGEN = .;

PAUSE. Rerun this code, and observe results (in


DISCOVER2.SAS). Return after you’ve completed this exercise.
61 SAS ESSENTIALS -- Elliott & Woodward
RESULTS (FROM DISCOVER2.SAS)
 Note that AGEN values are within an acceptable range.

62 SAS ESSENTIALS -- Elliott & Woodward


ASIDE: CORRECTING DATA, ADMINISTRATIVE
ISSUES
 Statisticians (Data Scientists) often work with data
provided to them from other people such as:
 research data from an experiment or a clinical trial
 data from an on-line survey or extracted from on-line forms
 electronically gathered data from observed behaviors
 etc.
 Before making data corrections, determine who has the
authority, knowledge, and/or responsibility for making
data change decisions.
 Keep track of changes (SAS provides an audit trail) in case
there are questions in the future.

63 SAS ESSENTIALS -- Elliott & Woodward


CORRECT DATE AND TIME VALUES
 The date and time the subject arrived and left the clinic are
needed to calculate how long it took to serve each patient.
However, these values are currently of character type.
 The following example illustrates how to convert the
character variables to SAS date and time values, and how to
combine them in a single “datetime” value.
 The date values are store as character values. For example
2/7/2005
 Use INPUT() to convert them to dates:

DATEARRIVED2=INPUT(TRIM(DATEARRIVED),MMDDYYl0.);

64 SAS ESSENTIALS -- Elliott & Woodward


CONVERTING THE TIME VALUE
 Converting the TIME value is a little more complex. A
TIMEARRIVE value looks like this: Use FIND to locate the blank
between the number and either
 11:18:00 A A or P. Assign it the value I

Use SUBSTR function and the I


I= FIND(TIMEARRIVE," "); location of P or A and extract the
number portion of the time.

TIMEARRIVE=SUBSTR(TIMEARRIVE, l ,I-1);
Also, determine if there is a “P”
in the value – which implies that
it is PM. If P=0 then it is AM.
P=FIND(TIMEARRIVE,"P");

65 SAS ESSENTIALS -- Elliott & Woodward


CONTINUE CONVERSION OF TIME
Convert TIMEARRIVE2 to a
number using INPUT()

TIMEARRIVET=INPUT(TRIM(TIMEARRIVE2),TIME8.);
If the time is after noon, add 12 hours of seconds to the value…
(P>0 means that this time in in the PM (afternoon.))

IF P>0 AND TIMEARRIVET LT 43200 THEN


TIMEARRIVET=TIMEARRIVET+43200;
Convert seconds (ARRIVEDT) value to a DATETIME SAS
variable using the DHMS function, and give it a label.

ARRIVEDT=DHMS(DATEARRIVED2,0,0,TIMEARRIVET);
Label ARRIVEDT="Date & Time Arrived";
66 SAS ESSENTIALS -- Elliott & Woodward
DO HANDS ON EXAMPLE P 156
 Open MESSY5.SAS to do correct datetime variables.

67 SAS ESSENTIALS -- Elliott & Woodward


RUN MESSY5.SAS, OBSERVE RESULTS.

Note the finalized Date & Time Arrived


variable, which is needed to calculate the
difference between when arrived and when
left the clinic.

68 SAS ESSENTIALS -- Elliott & Woodward


CALCULATE DATE & TIME SUBJECT LEFT CLINIC
To complete the calculations, correct values for TIMELEFT
using a similar process (The time subject left the clinic):

And add the label


LABEL LEFTDT ="Date & Time Left";

69 SAS ESSENTIALS -- Elliott & Woodward


CALCULATE HOW LONG STAYED IN CLINIC
 Once ARRIVEDR and LEFTST are calculated, calculate how
long a subject stayed in the clinic using the INTCK()
function with a "MIN" (minutes) argument
STAYMINUTES=INTCK('MIN',ARRIVEDT,LEFTDT);

 Divide by 60 and round it off to get the number of hours


stayed in the clinic:

STAYHOURS=ROUND(STAYMINUTES/60,.1);

As always, check the results…

70 SAS ESSENTIALS -- Elliott & Woodward


STILL PROBLEMS WITH STAY LENGTH

 There are some


STAYHOURS that are
negative or too large.
 Use code to eliminate
“impossible” values. For
example:

IF STAYHOURS<0 or
STAYHOURS>48 then
STAYHOURS=.;

71 SAS ESSENTIALS -- Elliott & Woodward


LOOK FOR DUPLICATE RECORDS
 A final check for this data set is to determine if there are
duplicate records. Typically, this is accomplished by looking for
duplicate IDs.
 A simple way to do this is with PROC FREQ.
 This final example for this section illustrates this process.
 Do Hands On Exercise p 158.
Use PROC FREQ to count the
 DISCOVER3.SAS number of unique SUBJECT IDs

PROC FREQ DATA=MYSASLIB.CLEANED NOPRINT;


TABLES SUBJECT / OUT=FREQCNT;
RUN;
PROC PRINT DATA=FREQCNT;
WHERE COUNT>1; Display results where there are
RUN; more than one SUBJCTS with
the same ID
72 SAS ESSENTIALS -- Elliott & Woodward
CORRECT DUPLICATE RECORDS
 The PROC FREQ identifies one duplicated record. There
are two SUBJECTS with the ID number 26.

 In the example, you discover that SUBJECT 27 was


miscoded as 26, thus you can fix that type with the code:

IF_N_=27 THEN SUBJECT=27;

73 SAS ESSENTIALS -- Elliott & Woodward


CLEANING A DATA SET: SUMMARY
 When you get a new data set, do these checks:
 Visually inspect the data set for obvious problems.
 Rename variables that have strange or unclear names.
 Label your variables to make them more readable.
 Fix case problems.
 Delete unneeded records.
 Use PROC FREQ to discover incorrect categorical values.
 Check for and correct any duplicated variables.
 Use PROC MEANS to check for unusual minimum or
maximums.
 Set those missing value codes.
 Convert variables not in a correct format.
 Search for and reconcile any duplicate records.
74 SAS ESSENTIALS -- Elliott & Woodward
THE COMPLETED MESSY FIX
 The entire SAS code for fixing the MESSY data
set is in the file MESSY_ALL.SAS
 Review this code to see how each fix builds on
one another
 See how this code provides an audit for the
fixes, so you can verify them if needed, or
illustrate how a change was performed to the
data set.
 EXAMINE THIS CODE: Use it as a template for
fixing your own data sets.
75 SAS ESSENTIALS -- Elliott & Woodward
6.5 SUMMARY
 This chapter provides additional information on common
programming topics for the SAS language. The subjects
covered are not exhaustive but were selected because
they are often used for preparing data for analysis. Many
more topics could have been covered, and readers are
encouraged to refer to the SAS documentation for
additional information.
 Continue to Chapter 7: SAS® ADVANCED PROGRAMMING
TOPICS PART 2

76 SAS ESSENTIALS -- Elliott & Woodward


These slides are based on the book:

Introduction to SAS Essentials


Mastering SAS for Data Analytics, 2nd Edition

By Alan C, Elliott and Wayne A. Woodward

Paperback: 512 pages


Publisher: Wiley; 2 edition (August 3, 2015)
Language: English
ISBN-10: 111904216X
ISBN-13: 978-1119042167

These slides are provided for you to use to teach SAS using this book. Feel free to
modify them for your own needs. Please send comments about errors in the slides
(or suggestions for improvements) to acelliott@smu.edu. Thanks.

77 SAS ESSENTIALS -- Elliott & Woodward