Introduction to SAS Essentials

Alan Elliott and Wayne Woodward

Part I

LEARNING OBJECTIVES
 To be able to use SAS® Functions
 To be able to TRANSPOSE data sets
 To be able to perform data recoding using SELECT
 To be able to use SAS programming techniques to clean
up a messy data set

6.1 USING SAS FUNCTIONS
 Sophisticated calculations can be created using SAS
functions. These functions include arithmetic and text
and date manipulation. The format for the use of SAS
functions is

 where arg1, etc. are arguments (information) you “send” to the function for it to act upon; the function then returns a result.
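
As a rough sketch of this pattern, a function call usually sits on the right-hand side of an assignment statement in a DATA step. The data set name DEMO and the variables X, Y, and RESULT1 to RESULT3 are placeholders chosen for illustration; they are not part of the original example.

```sas
DATA DEMO;
   X = -4.7;
   Y = 16;
   * General pattern: variable = FUNCTION(argument1, argument2, ...);
   RESULT1 = ABS(X);      * one argument;
   RESULT2 = SQRT(Y);     * one argument;
   RESULT3 = MAX(X, Y);   * two arguments;
RUN;
PROC PRINT DATA=DEMO; RUN;
```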

Function Arguments
 Functions can require one or more arguments. Some
require no arguments. For example, a few mathematical
functions in SAS are as follows:

ABS(X); * Absolute value;

FACT(X); * Factorial;
INT(X); * Integer portion of a value;
LOG(X); * Natural log;
SQRT(X); * Square root;

Types of functions
 Functions in SAS can be categorized into these types:
 ARITHMETIC
 TRIGONOMETRIC
 DATE AND TIME
 CHARACTER
 TRUNCATION
 SPECIAL USE AND MISCELLANEOUS
 FINANCIAL
 ACCESS PREVIOUS OBSERVATIONS (LAG)

Using Functions in a Calculation
 Functions can also be used as a part of a more extensive
calculation. For example,

C = MEASURE + SQRT(A**2 + B**2);

 would calculate the square of A and then the square of B, then add those two numbers, take the square root of that value, add that number to the value of MEASURE, and assign the result to the variable named C.

Functions that take more than one argument
 Some examples of functions that take more than one
argument are:

MAX(x1,x2,x3, ... ); * Maximum of a list;
MIN(x1,x2,x3, ... ); * Minimum of a list;
SUM(x1,x2,x3, ... ); * Sum of a list;
MEDIAN(x1,x2,x3, ... ); * Median of a list;
ROUND(value, round); * Round off a value;

 For example
TOTAL=SUM(TIME1,TIME2,TIME3,TIME4);

Rounding Values
 The round-off unit in the ROUND function determines how the
rounding will be performed. The default is 1. A round-off value of
0.01 means to round it off to the nearest 100th, and a round-off
value of 5 means to round it off to the nearest 5. Here are a few
examples:

ROUND(3.1415, .01) * Returns the value 3.14;
ROUND(107,5) * Returns the value 105;
ROUND(3.6234) * Returns the value 4 (rounds to integer);

A Function with No Argument
 A function that takes no argument is TODAY(). For
example, the code

NOW=TODAY();

 puts the current date value from the computer's clock into the variable named NOW.

Specifying Arguments
 When arguments are a list of values, such as in MAX or MIN,
you can specify the list as variables separated by commas, or
as a range preceded by the word OF. For example, if these
variables have the following values:

X1 = 1; X2 = 2; X3 = 13; X4=10;

 Then

MAX(1,2,3,4,5) * Returns the value 5;

MAX(X1,X2,X3,X4) * Returns the value 13;
MAX(OF X1-X4) * Returns the value 13;

 Note that the designation OF X1-X4 is interpreted by SAS
as all same-named consecutively numbered variables
from X1 to X4 in the MAX () function example above. If
there are missing values in the list, they are ignored.
Other similar functions are illustrated here:

MIN(OF X1-X4) * Returns the value 1;

SUM(OF X1-X4) * Returns the value 26;
MEDIAN(OF X1-X4) * Returns the value 6;
NMISS(OF X1-X4) * Returns 0 (# missing values);
N(OF X1-X4) * Returns 4 (# non-missing);
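
One way to check these values yourself is to put the function calls in a small DATA step and print the result. This is only a sketch: the data set name CHECK and the result variable names are made up for illustration.

```sas
DATA CHECK;
   X1 = 1; X2 = 2; X3 = 13; X4 = 10;
   * Each function receives the whole list X1 through X4;
   MAXVAL  = MAX(OF X1-X4);
   MINVAL  = MIN(OF X1-X4);
   TOTAL   = SUM(OF X1-X4);
   MISSCT  = NMISS(OF X1-X4);
   NONMISS = N(OF X1-X4);
RUN;
PROC PRINT DATA=CHECK; RUN;
```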

EXAMPLE
Suppose you have this code
DATA NUM;
X1 = 1; X2 = 2; X3 = 13; X4=.;
M1= SUM(X1-X4) ;
M2= SUM(OF X1-X4);
RUN;
PROC PRINT; RUN;

What is the value of M1 and M2?

EXERCISE – Enter this code and run the program. PAUSE and return
when you’ve successfully run the code

RESULTS
 The answer is shown here. When you use
M1= SUM(X1-X4) ;
SAS reads X1-X4 as a subtraction (X1 minus X4), not as a list. Because X4 is missing, the result of the subtraction is missing, so M1 is missing, whereas if you use
M2= SUM(OF X1-X4);
the function uses only the non-missing values in the list X1 through X4, so M2 is 16.

6.2 USING PROC TRANSPOSE
 PROC TRANSPOSE allows you to restructure the values in
your data set by transposing (or reorienting) the data.
This is typically performed when your data are not in the
structure required for an analysis.

[Figure: rows are transposed into columns. This is one example of how PROC TRANSPOSE is used.]

Simplified Syntax for TRANSPOSE
PROC TRANSPOSE DATA=input-data OUT=output-data
   <PREFIX=prefix>;
   <BY variables;>
   <ID variable;>
   VAR variables;

 PREFIX= specifies a prefix for the names of variables created in the transposition. The default names are COL1, COL2, and so on.
 The BY variable, when specified, indicates the variable that is used to form BY groups.
 The VAR statement specifies which variables are to be transposed.

Hands On Example Page 140
 Open the file DTRANSPOSE1.SAS

DATA SUBJECTS;
INPUT SUB1 $ SUB2 $ SUB3 $ SUB4 $;
DATALINES;
12 21 13 14
13 21 12 14
15 31 23 23
15 33 21 32
M F F M
;
RUN;

Notice how the data are not in a desired form. Use PROC TRANSPOSE to “flip” the data.

Using PROC TRANSPOSE to “flip” data
PROC TRANSPOSE DATA=SUBJECTS
OUT=TRANSPOSED;
VAR SUB1 SUB2 SUB3 SUB4;
RUN;
PROC PRINT DATA=TRANSPOSED;
RUN;

Results: Note that column names are COL1, COL2, and so on by default.

Obs _NAME_ COL1 COL2 COL3 COL4 COL5

1 SUB1 12 13 15 15 M
2 SUB2 21 21 31 33 F
3 SUB3 13 12 23 21 F
4 SUB4 14 14 23 32 M

HANDS ON EXERCISE
 You want the columns to be named INFO1, INFO2, etc. Do this
by adding the PREFIX= option to the PROC TRANSPOSE statement:

PREFIX=INFO

Rerun the code, and observe the results, particularly the column headings. PAUSE and return when you’ve run the new code.

RESULTS
 The added PREFIX= option in
PROC TRANSPOSE DATA=SUBJECTS
OUT=TRANSPOSED PREFIX=INFO;
 changed the column headings. Note the new column names.

CONTINUE EXERCISE
 2. To make the results better, add the following code (to
RENAME the columns)

DATA NEW;SET TRANSPOSED;
RENAME INFO1=T1 INFO2=T2
INFO3=T3 INFO4=T4
INFO5=GENDER _NAME_=SUBJECT;
RUN;
PROC PRINT DATA=NEW;RUN;

The RENAME statement is used to rename the variables. For example, INFO2 is renamed T2.

PAUSE – enter the new code, run it, and observe the results.
Return when you are finished.
RESULTS
• Notice how the column names are now much

ANOTHER WAY TO NAME COLUMNS
 Open the file DTRANSPOSE1a.SAS
 Notice the label variable LAB in this version of the code:

DATA SUBJECTS;
INPUT LAB $ SUB1 $ SUB2 $ SUB3 $ SUB4 $;
DATALINES;
BASELINE 12 21 13 14
TIME1 13 21 12 14
TIME2 15 31 23 23
TIME3 15 33 21 32
GENDER M F F M
;

TRANSPOSE WITH LABELS
 When you have a label variable, use the ID LAB
statement to create column names.

PROC TRANSPOSE DATA=SUBJECTS

OUT=TRANSPOSED;
ID LAB;
VAR SUB1 SUB2 SUB3 SUB4;
RUN;
PROC PRINT DATA=TRANSPOSED;
RUN;

RESULTS WHEN YOU RUN THIS CODE
 Notice how the column names reflect the LAB variable
used in the ID LAB Statement.
 (You may still want to rename the _NAME_ Column.)
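
If you do want to rename the _NAME_ column in the same step, one possibility is the RENAME= data set option on the OUT= data set. The sketch below repeats the SUBJECTS data so it can run on its own; choosing SUBJECT as the new name is an assumption carried over from the earlier RENAME example.

```sas
DATA SUBJECTS;
   INPUT LAB $ SUB1 $ SUB2 $ SUB3 $ SUB4 $;
   DATALINES;
BASELINE 12 21 13 14
TIME1 13 21 12 14
TIME2 15 31 23 23
TIME3 15 33 21 32
GENDER M F F M
;
RUN;

* ID LAB names the new columns and RENAME= renames _NAME_ on output;
PROC TRANSPOSE DATA=SUBJECTS
               OUT=TRANSPOSED(RENAME=(_NAME_=SUBJECT));
   ID LAB;
   VAR SUB1 SUB2 SUB3 SUB4;
RUN;
PROC PRINT DATA=TRANSPOSED; RUN;
```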

USING PROC TRANSPOSE WHEN YOU HAVE
MULTIPLE RECORDS PER SUBJECT
 Suppose you have data that have one or more observations per subject, but you want to analyze the data by subject (one row per subject that holds that subject's set of observations). Use PROC TRANSPOSE to transpose the data by a key variable.

Note: Some subjects have more than one record in the data set.

COMBINE MULTIPLE RECORDS ONTO ONE LINE
 Use PROC TRANSPOSE to combine multiple records
(identified by a key variable) onto one line per subject

PROC TRANSPOSE
DATA="C:\SASDATA\COMPLICATIONS"
OUT=COMP_OUT
PREFIX=COMP;
BY SUBJECT;
VAR COMPLICATION;
RUN;

 The PREFIX= option allows you to name the combined variables.
 The BY variable identifies the key variable to expand on.
 The VAR statement identifies which variables to expand.

HANDS ON EXERCISE P 143 (DTRANSPOSE2.SAS)
 Run the code. It cleans up the data and limits output to subjects with 3 or more complications. It also limits the names of the complications to 10 characters in length for the report.

RESULTS OF PROC TRANSPOSE
You’ve found that subject 2076 has three complications listed.

Obs SUBJECT COMP1 COMP2 COMP3 COMP4

1 2076 Pneumonia Heart Atta Renal Fail
2 3585 DVT (Lower Pneumonia Renal Fail
3 3630 DVT (Lower Heart Atta Pancreatit Pneumonia
4 4585 Compartmen Pneumonia Skin Break
5 4599 Aspiration Pneumonia Renal Fail
6 4760 Acute Resp Pneumonia Renal Fail
7 4775 Pneumonia DVT (Lower Pneumonia Renal Fail

Note how PROC TRANSPOSE expanded SUBJECT records that had multiple complications and named them COMP1, COMP2, etc.
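
To see the same idea on a data set small enough to read at a glance, here is a minimal sketch. The data set COMPS, its subject numbers, and its complication values are invented for illustration and are not taken from the COMPLICATIONS file.

```sas
DATA COMPS;
   INPUT SUBJECT 1-4 COMPLICATION $ 6-25;
   DATALINES;
1001 Pneumonia
1001 Renal Failure
1002 DVT
1003 Pneumonia
1003 Heart Attack
1003 Renal Failure
;
RUN;

* BY-group processing expects the data to be sorted by the key variable;
PROC SORT DATA=COMPS; BY SUBJECT; RUN;

PROC TRANSPOSE DATA=COMPS OUT=COMP_WIDE(DROP=_NAME_) PREFIX=COMP;
   BY SUBJECT;
   VAR COMPLICATION;
RUN;
PROC PRINT DATA=COMP_WIDE; RUN;
```

Subjects with fewer complications than the largest group get blank values in the unused COMP columns.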

EXERCISE - ANALYZE THE RESULTS
 How many subjects had Renal Failure? Examine this code

DATA RENAL;SET COMP_OUT;
CCAT=CATT(OF COMP1-COMP7);
IF FIND(UPCASE(CCAT),"RENAL") NE 0 then
RENALFAILURE="Yes";
ELSE RENALFAILURE="No";
RUN;
PROC FREQ DATA=RENAL;
TABLES RENALFAILURE;RUN;

 CATT concatenates all of the complications into a single variable named CCAT.
 Use FIND() to find any instance of “RENAL” in CCAT, and create a new variable named RENALFAILURE.
 PROC FREQ counts the number of subjects with Renal Failure.

WHAT THIS CODE IS DOING…
CCAT=CATT(OF COMP1-COMP7);
Concatenates complications for each subject, producing the
records shown here:

Note how multiple complications are concatenated.

WHAT THIS CODE IS DOING…
IF FIND(UPCASE(CCAT),"RENAL") NE 0 then
RENALFAILURE="Yes";
ELSE RENALFAILURE="No";

RENALFAILURE is the new variable created by the IF statement.

RESULTS
Note this is Yes because RENAL is in the complications list.

HANDS ON EXERCISE

DATA RENAL;SET COMP_OUT;

CCAT=CATT(OF COMP1-COMP7);

IF FIND(UPCASE(CCAT),"RENAL") NE 0 then
RENALFAILURE="Yes";
ELSE RENALFAILURE="No";
RUN;
PROC FREQ DATA=RENAL;
TABLES RENALFAILURE;RUN;

EXERCISE – Enter and run the new code and observe results.
PAUSE – Return once you’ve completed this exercise.

RESULTS

You discover that 50 of the

as one of their complications.

6.3 THE SELECT STATEMENT
 The SELECT statement evaluates the value of a variable
and creates new assignments based on those values.
Syntax (simplified) is as follows:

SELECT <(select-expression)>;
   WHEN-1 statement;
   ...
   WHEN-n statement;
   <OTHERWISE statement;>
END;

SELECT Statement Example
 Suppose you want to calculate NEWVAL according to some
specific values of the variable OBSERVED. That is, if
OBSERVED=1, you want to set NEWVAL=AGE+2. If
OBSERVED=2 or 3, you want to set NEWVAL=AGE+10, and so
on. A SELECT statement to perform this recoding would be as
follows:

SELECT (OBSERVED);
WHEN (1) NEWVAL=AGE+2;
WHEN (2,3) NEWVAL=AGE+10;
WHEN (4,5,6) NEWVAL=AGE+20;
OTHERWISE NEWVAL=0;
END;

 Note that OBSERVED is the comparison variable, identified as the SELECT expression.
 WHEN (1) is interpreted as WHEN OBSERVED=1.
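
To try this recoding, the SELECT group has to sit inside a DATA step. The following is a minimal sketch; the data set name RECODE and the four data lines are invented for illustration.

```sas
DATA RECODE;
   INPUT OBSERVED AGE;
   * Each WHEN value list is compared with the value of OBSERVED;
   SELECT (OBSERVED);
      WHEN (1)     NEWVAL = AGE + 2;
      WHEN (2,3)   NEWVAL = AGE + 10;
      WHEN (4,5,6) NEWVAL = AGE + 20;
      OTHERWISE    NEWVAL = 0;
   END;
   DATALINES;
1 30
3 41
5 25
9 50
;
RUN;
PROC PRINT DATA=RECODE; RUN;
```

The last data line has OBSERVED=9, which matches no WHEN list, so it falls through to OTHERWISE.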
Without a Specific Select Expression
 Another way to use SELECT is without a specific select-expression. In this case, the WHEN statements include conditional expressions that should be in parentheses. For example:

SELECT;
WHEN (GP='A') STATUS2=1;
WHEN (GP='B' and SEX=1) STATUS2=2;
WHEN (GP='C' and SEX=0) STATUS2=3;
OTHERWISE STATUS2=0;
END;

 Note that in this version there is no specified comparison variable. Comparisons are specified in the WHEN statements.
 Use this version of SELECT when comparisons are more complex (and not just equal).

Do Hands On Example p 145 (DSELECT.SAS)
DATA MYDATA;SET "C:\SASDATA\SOMEDATA";
FORMAT ECONOMIC $7.;
SELECT(STATUS);
WHEN (1,2) ECONOMIC="LOW";
WHEN (3) ECONOMIC="MIDDLE";
WHEN (4,5) ECONOMIC="HIGH";
OTHERWISE ECONOMIC="MISSING";
END;
PROC PRINT DATA=MYDATA;
RUN;

Note the difference between using a specified SELECT expression (here it is STATUS) and, in the second part of this example, not having a specified expression.

6.4 GOING DEEPER: CLEANING A MESSY DATA SET
 Many of the features in the SAS language are helpful in
cleaning up messy data.
 By messy we mean data sets that are not quite ready for
analysis.
 Most data analysts experience problems dealing with files
that contain data that have coding problems and must be
fixed before a proper analysis is possible.
 This section walks you through a case study of a data set
with problems and illustrates how they might be
corrected.

A TYPICAL MESSY DATA SET – NOTE SOME OF
THE ISSUES (A FEW ARE MARKED)

Problems with the data set
 A few of the problems you might quickly note include the
following:
 Line 17 is blank.
 There are non-date values in the "DateLeft" column.
 There is a non-number in the "Age" column (>29).
 Values of Gender are mixed upper and lower cases.
 There are multiple answers in columns that should have only one.
 You can correct these issues by using SAS code. One reason for doing this in SAS code is that it leaves an audit trail of changes.

6.4.1 FIX LABELS, RENAME VARIABLES
 Often, a first step in creating a clean data set is to attach
labels to the variables to make output more “readable.”
 Use the LABEL statement to create these labels:

#   Variable Name  Type  Label
1   Subject        Char  Subject ID
2   DateArrived    Date  Date Arrived
3   TimeArrive     Time  Time Arrived
4   DateLeft       Date  Date Left
5   TimeLeft       Time  Time Left
6   Married        Num   Married?
7   Single         Num   Single?
8   Age            Num   Age Jan 1, 2014
9   Gender         Char  Gender
10  Education      Num   Years of Schooling
11  Race           Char  Race
12  How_Arrived    Char  How Arrived at Clinic
13  Top_Reason     Num   Top Reason for Coming
14  Arrival        Num   Temperature
15  Satisfaction   Num   Satisfaction Score

The variable names are the SAS-friendly names created in Excel.

HANDS ON EXERCISE P 148
 Open the file MESSY1.SAS
 Add labels to the variables. For example:

[Figure: MESSY1.SAS code with the LABEL statement marked. The marked statements show how you might rename a variable using a more descriptive name (TEMP for temperature).]

DISPLAY THE RESULTS
 The rest of the code in MESSY1.SAS displays the first 10
records so you can verify the changes
 Run the code to see the changes so far.

PROC PRINT LABEL
DATA=MYSASLIB.CLEANED
(firstobs=1 obs=10);
VAR SUBJECT EDUCATION TEMP
TOP_REASON SATISFACTION;
RUN;

Note that we’re saving changes in the file named CLEANED.

CLEANED DATA SO FAR
Note the explanatory column labels…

Other housekeeping in the data set
 There are some other housekeeping chores in the data
set to clean up the variable names. For example

The ARRIVAL variable (which has to do with temperature) is renamed to avoid confusion with the HOW_ARRIVED variable.

TEMP=ARRIVAL;
DROP ARRIVAL;
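
Put together, the labeling and renaming steps might look like the sketch below. It is a stand-alone illustration, not the contents of MESSY1.SAS: the data set name CLEANED_DEMO and the two data lines are invented, and only three of the variables are included.

```sas
DATA CLEANED_DEMO;
   INPUT SUBJECT $ EDUCATION ARRIVAL;
   * Copy ARRIVAL into a more descriptive name and then drop the original;
   TEMP = ARRIVAL;
   DROP ARRIVAL;
   LABEL SUBJECT   = 'Subject ID'
         EDUCATION = 'Years of Schooling'
         TEMP      = 'Temperature';
   DATALINES;
001 12 98.6
002 16 99.1
;
RUN;

* The LABEL option tells PROC PRINT to use labels as column headings;
PROC PRINT LABEL DATA=CLEANED_DEMO; RUN;
```

A RENAME statement (RENAME ARRIVAL=TEMP;) would be another way to get the same variable name without creating a copy.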

EXERCISE
PAUSE the tutorial and enter the remaining labels as shown in this table. Run the code… return once you’ve completed this exercise.

#   Variable Name  Type  Label
1   Subject        Char  Subject ID
2   DateArrived    Date  Date Arrived
3   TimeArrive     Time  Time Arrived
4   DateLeft       Date  Date Left
5   TimeLeft       Time  Time Left
6   Married        Num   Married?
7   Single         Num   Single?
8   Age            Num   Age Jan 1, 2014
9   Gender         Char  Gender
10  Education      Num   Years of Schooling
11  Race           Char  Race
12  How_Arrived    Char  How Arrived at Clinic
13  Top_Reason     Num   Top Reason for Coming
14  Arrival        Num   Temperature

THE COMPLETED LABEL STATEMENT
LABEL
EDUCATION='Years of Schooling'
HOW_ARRIVED='How Arrived at Clinic'
TOP_REASON='Top Reason for Coming'
SATISFACTION='Satisfaction Score'
Subject="Subject ID"
DateArrived="Date Arrived"
TimeArrive="Time Arrived"
DateLeft="Date Left"
TimeLeft="Time Left"
Married="Married?"
Single="Single?"
Age="Age Jan 1, 2014"
Gender="Gender"
Race="Race";

Note: We left out the label for ARRIVAL since we replaced that variable with TEMP.

Fix Case Problems, Allowed Categories, and Delete
Unneeded Lines
 To correct case problems, you can use the UPCASE()
or LOWCASE() function to convert data values to all
upper or all lower case.
 A second common fix is to verify that all items in a
categorical variable are allowable. For example, in the
HOW_ARRIVED variable, only CAR, BUS, or WALK is
acceptable. This statement can fix that problem…

IF HOW_ARRIVED NOT IN ('CAR', 'BUS', 'WALK') THEN HOW_ARRIVED=" ";

More fixes…
 A third easy-to-perform check is to delete irrelevant
records. In this data set, if a line does not contain a
Subject ID, we want to eliminate that record. This is done
with an IF statement
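
The three fixes described so far (case, allowed categories, and empty records) can be sketched together in one small DATA step. The data set FIXED and its four data lines are invented for illustration; the exact statements in MESSY2.SAS may differ. The third data line has no Subject ID, so it is deleted.

```sas
DATA FIXED;
   INPUT SUBJECT $ 1-3 GENDER $ 5 HOW_ARRIVED $ 7-10;
   * Standardize case before checking categories;
   GENDER = UPCASE(GENDER);
   HOW_ARRIVED = UPCASE(HOW_ARRIVED);
   * Blank out values that are not allowed;
   IF HOW_ARRIVED NOT IN ('CAR', 'BUS', 'WALK') THEN HOW_ARRIVED = ' ';
   * Drop records that have no subject ID;
   IF SUBJECT = ' ' THEN DELETE;
   DATALINES;
001 m car
002 F Bus
    f walk
004 M taxi
;
RUN;
PROC PRINT DATA=FIXED; RUN;
```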

HANDS ON EXAMPLE P 150
 Open the file MESSY2.SAS (Note it includes labels from
the previous exercise.)

[Figure: MESSY2.SAS code, marked where it fixes entries in HOW_ARRIVED and where it gets rid of empty records.]

EXERCISE – MAKE MORE CORRECTIONS
 Make these changes: Use IN() to correct GENDER

IF GENDER NOT IN('M','F') THEN GENDER="";

 Fix RACE using the following code. The only correct race codes are H, C, and AA.

IF RACE="MEX" OR RACE="M" then RACE="H";
IF RACE="A" then RACE="AA";
IF RACE="W" then RACE="C";
IF RACE="X" OR RACE="NA" then RACE="";

PAUSE – Make these changes to the code. Rerun the program and verify that the changes have taken place. Return once you’ve completed this exercise.
RESULTS – FIRST FEW (CORRECTED) RECORDS
OF OUTPUT

[Figure: first few corrected records of output, marking a RACE entry where there was previously an incorrect entry.]

Check and Fix Incorrect Categories, Fix duplicated
Variables
 Two troubling variables are MARRIED and SINGLE. The
survey asked respondents their marital status, and the
information was recorded in the data set where 1 means
yes and 0 means no.
 Technically, these two variables should be the opposite of
each other, and you should only require one of them in
the data set. However, if you look at the frequencies of
each using PROC FREQ, you discover that they are not
telling you the same thing.

Do Hands On Exercise p 151 (DISCOVER1.SAS)
 Notice how the MARRIED and SINGLE frequencies do not match…
 You must make some decision to reconcile this problem.
 The researcher should make this decision.
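
A cross-tabulation is one way to see exactly which records disagree. The sketch below is not the DISCOVER1.SAS code; the data set STATUS_DEMO and its values are invented to show the pattern. Records in the cells where MARRIED and SINGLE are both 1 or both 0 are the inconsistent ones.

```sas
DATA STATUS_DEMO;
   INPUT MARRIED SINGLE;
   DATALINES;
1 0
0 1
1 1
0 0
1 0
;
RUN;

* One-way tables for each variable plus a two-way table of both;
PROC FREQ DATA=STATUS_DEMO;
   TABLES MARRIED SINGLE MARRIED*SINGLE / NOROW NOCOL NOPERCENT;
RUN;
```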

TOP REASON FOR COMING
 This variable allows subjects to select the top reason for
coming to the clinic. The survey did not intend to allow
multiple answers, but some respondents chose more
than one answer, which results in these data problems:

SOME FIXES APPLIED TO CATEGORICAL
PROBLEMS
 This code is in the file MESSY3.SAS…

DROP MARRIED;
IF TOP_REASON NE "1" AND
TOP_REASON NE "2" AND
TOP_REASON NE "3" THEN
TOP_REASON=.;
RUN;

 Note the decision to use the SINGLE variable. Thus, MARRIED is no longer needed.
 The IF statement gets rid of any TOP_REASON values that are not allowed.

To double check this fix, go back to DISCOVER1.SAS (take MARRIED out of the PROC PRINT) and run it again to make sure the fix is correct.

Check and Fix Out-of-Range Numeric Variables
 Do Hands On Exercise p 153 (DISCOVER2.SAS) to discover
unusual minimum and maximum values in numeric
variables. This can be done using the simple code:
PROC MEANS MAXDEC=2 DATA=CLEAN.CLEANED;
RUN;

[Figure: PROC MEANS output. Note the problem values.]

Also note that AGE does not show up as a numeric variable… it is currently a character variable.

Correct “out of range” problems
IF EDUCATION=99 then EDUCATION=.;
IF TEMP LT 45 THEN TEMP=(9/5)*TEMP+32;
IF TEMP=1018 then TEMP=101.8;
IF SATISFACTION= -99 THEN SATISFACTION=.;
* Convert AGE from character to numeric;
AGEN=INPUT(AGE,5.);
DROP AGE;
LABEL AGEN="Age Jan 1, 2014";

 The TEMP LT 45 statement converts all temperatures to Fahrenheit.
 When you convert a variable from character to numeric, you can’t use the same variable name, so we chose AGEN to be the numeric version of the AGE variable.
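
The character-to-numeric conversion can be tried on its own with a few made-up values, including a non-number like the >29 entry noted earlier. In this sketch the data set name AGE_DEMO and the data lines are invented, and the ?? modifier is an addition that is not in the original code: it keeps the log free of invalid-data notes when a value cannot be read as a number. Such values simply become missing.

```sas
DATA AGE_DEMO;
   INPUT AGE $;
   * The ?? modifier suppresses invalid-data notes for values such as >29;
   AGEN = INPUT(AGE, ?? 5.);
   DROP AGE;
   LABEL AGEN = 'Age Jan 1, 2014';
   DATALINES;
34
>29
57
;
RUN;
PROC PRINT LABEL DATA=AGE_DEMO; RUN;
```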

RECHECK AND FIX AGEN
 Now that AGEN is numeric, a rerun of DISCOVER2.SAS
reveals a range problem for AGEN

[Figure: PROC MEANS output showing out-of-range values for AGEN.]

problem:

PAUSE. Rerun this code, and observe results (in DISCOVER2.SAS). Return after you’ve completed this exercise.
RESULTS (FROM DISCOVER2.SAS)
 Note that AGEN values are within an acceptable range.

ISSUES
 Statisticians (Data Scientists) often work with data
provided to them from other people such as:
 research data from an experiment or a clinical trial
 data from an on-line survey or extracted from on-line forms
 electronically gathered data from observed behaviors
 etc.
 Before making data corrections, determine who has the
authority, knowledge, and/or responsibility for making
data change decisions.
 Keep track of changes (SAS provides an audit trail) in case
there are questions in the future.

CORRECT DATE AND TIME VALUES
 The date and time the subject arrived and left the clinic are
needed to calculate how long it took to serve each patient.
However, these values are currently of character type.
 The following example illustrates how to convert the
character variables to SAS date and time values, and how to
combine them in a single “datetime” value.
 The date values are stored as character values. For example
2/7/2005
 Use INPUT() to convert them to dates:

DATEARRIVED2=INPUT(TRIM(DATEARRIVED),MMDDYY10.);
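
A minimal sketch of this conversion on two made-up date strings follows. The data set name DATE_DEMO and the second date are invented, and the FORMAT statement is an addition so that the converted values print as readable dates rather than as raw day counts.

```sas
DATA DATE_DEMO;
   INPUT DATEARRIVED $ 1-10;
   * Read the character value with the MMDDYY10. informat;
   DATEARRIVED2 = INPUT(TRIM(DATEARRIVED), MMDDYY10.);
   FORMAT DATEARRIVED2 DATE9.;
   DATALINES;
2/7/2005
11/23/2005
;
RUN;
PROC PRINT DATA=DATE_DEMO; RUN;
```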

CONVERTING THE TIME VALUE
 Converting the TIME value is a little more complex. A
TIMEARRIVE value looks like this:
 11:18:00 A

I= FIND(TIMEARRIVE," ");
TIMEARRIVE2=SUBSTR(TIMEARRIVE,1,I-1);
P=FIND(TIMEARRIVE,"P");

 Use FIND to locate the blank between the number and either A or P. Assign its position to the variable I.
 Use the SUBSTR function and I, the location of the blank before the A or P, to extract the number portion of the time into TIMEARRIVE2.
 Also, determine if there is a “P” in the value, which implies that it is PM. If P=0 then it is AM.

CONTINUE CONVERSION OF TIME
Convert TIMEARRIVE2 to a number using INPUT()

TIMEARRIVET=INPUT(TRIM(TIMEARRIVE2),TIME8.);

If the time is after noon, add 12 hours of seconds to the value…
(P>0 means that this time is in the PM (afternoon).)

IF P>0 AND TIMEARRIVET LT 43200 THEN
TIMEARRIVET=TIMEARRIVET+43200;

Convert the seconds value (TIMEARRIVET) to a DATETIME SAS variable using the DHMS function, and give it a label.

ARRIVEDT=DHMS(DATEARRIVED2,0,0,TIMEARRIVET);
Label ARRIVEDT="Date & Time Arrived";
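
The date and time steps can be strung together as in the sketch below. It is a stand-alone illustration rather than the contents of MESSY5.SAS: the data set TIME_DEMO, the column positions, and the three data lines are invented, and the FORMAT statement is added so the result prints as a readable datetime.

```sas
DATA TIME_DEMO;
   INPUT DATEARRIVED $ 1-10 TIMEARRIVE $ 12-21;
   DATEARRIVED2 = INPUT(TRIM(DATEARRIVED), MMDDYY10.);
   * Position of the blank that separates the time from A or P;
   I = FIND(TIMEARRIVE, " ");
   TIMEARRIVE2 = SUBSTR(TIMEARRIVE, 1, I-1);
   * P is greater than 0 when the value contains a P for PM;
   P = FIND(TIMEARRIVE, "P");
   TIMEARRIVET = INPUT(TRIM(TIMEARRIVE2), TIME8.);
   * Add 12 hours in seconds to afternoon times that are below noon;
   IF P > 0 AND TIMEARRIVET LT 43200 THEN TIMEARRIVET = TIMEARRIVET + 43200;
   ARRIVEDT = DHMS(DATEARRIVED2, 0, 0, TIMEARRIVET);
   FORMAT ARRIVEDT DATETIME20.;
   LABEL ARRIVEDT = "Date & Time Arrived";
   DATALINES;
2/7/2005   11:18:00 A
2/7/2005   2:05:00 P
2/7/2005   12:30:00 P
;
RUN;
PROC PRINT DATA=TIME_DEMO; RUN;
```

The third line shows why the LT 43200 check matters: 12:30 PM is already past noon in seconds, so nothing is added.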
DO HANDS ON EXAMPLE P 156
 Open MESSY5.SAS to correct the datetime variables.

RUN MESSY5.SAS, OBSERVE RESULTS.

Note the finalized Date & Time Arrived variable, which is needed to calculate the difference between when the subject arrived and when the subject left the clinic.

CALCULATE DATE & TIME SUBJECT LEFT CLINIC
To complete the calculations, correct values for TIMELEFT
using a similar process (The time subject left the clinic):

LABEL LEFTDT ="Date & Time Left";

CALCULATE HOW LONG STAYED IN CLINIC
 Once ARRIVEDT and LEFTDT are calculated, calculate how
long a subject stayed in the clinic using the INTCK()
function with a "MIN" (minutes) argument

STAYMINUTES=INTCK('MIN',ARRIVEDT,LEFTDT);

 Divide by 60 and round it off to get the number of hours stayed in the clinic:

STAYHOURS=ROUND(STAYMINUTES/60,.1);

STILL PROBLEMS WITH STAY LENGTH

 There are some STAYHOURS that are negative or too large.
 Use code to eliminate “impossible” values. For example:

IF STAYHOURS<0 or
STAYHOURS>48 then
STAYHOURS=.;
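
The stay-length calculation and the range check can be tested on a single made-up visit, as sketched here. The data set STAY_DEMO and the two datetime constants are invented for illustration, and this sketch spells the interval name out in full as 'MINUTE'.

```sas
DATA STAY_DEMO;
   * Datetime constants stand in for the converted arrival and departure values;
   ARRIVEDT = '07FEB2005:11:18:00'DT;
   LEFTDT   = '07FEB2005:13:45:00'DT;
   * Count the minute boundaries crossed between the two datetimes;
   STAYMINUTES = INTCK('MINUTE', ARRIVEDT, LEFTDT);
   STAYHOURS = ROUND(STAYMINUTES/60, .1);
   * Treat negative stays or stays over 48 hours as missing;
   IF STAYHOURS < 0 OR STAYHOURS > 48 THEN STAYHOURS = .;
   FORMAT ARRIVEDT LEFTDT DATETIME20.;
RUN;
PROC PRINT DATA=STAY_DEMO; RUN;
```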

LOOK FOR DUPLICATE RECORDS
 A final check for this data set is to determine if there are
duplicate records. Typically, this is accomplished by looking for
duplicate IDs.
 A simple way to do this is with PROC FREQ.
 This final example for this section illustrates this process.
 Do Hands On Exercise p 158 (DISCOVER3.SAS).

PROC FREQ DATA=MYSASLIB.CLEANED NOPRINT;
TABLES SUBJECT / OUT=FREQCNT;
RUN;
PROC PRINT DATA=FREQCNT;
WHERE COUNT>1;
RUN;

 Use PROC FREQ to count the number of unique SUBJECT IDs.
 The WHERE statement displays results where there is more than one SUBJECT with the same ID.

CORRECT DUPLICATE RECORDS
 The PROC FREQ identifies one duplicated record. There
are two SUBJECTS with the ID number 26.

 In the example, you discover that SUBJECT 27 was miscoded as 26, so you can fix that typo with the code:
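
The correction itself depends on knowing which of the two records is the miscoded one, which has to come from the original data source. The sketch below shows one possible form of such a fix on an invented data set IDS; the rule that the record in row 3 is the real subject 27 is purely hypothetical.

```sas
DATA IDS;
   INPUT SUBJECT $;
   DATALINES;
25
26
26
28
;
RUN;

DATA IDS_FIXED;
   SET IDS;
   * Hypothetical rule: the second record coded 26 (row 3) is really subject 27;
   IF SUBJECT = '26' AND _N_ = 3 THEN SUBJECT = '27';
RUN;

* Recheck: no SUBJECT should now have a COUNT above 1;
PROC FREQ DATA=IDS_FIXED NOPRINT;
   TABLES SUBJECT / OUT=FREQCNT;
RUN;
PROC PRINT DATA=FREQCNT; RUN;
```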

CLEANING A DATA SET: SUMMARY
 When you get a new data set, do these checks:
 Visually inspect the data set for obvious problems.
 Rename variables that have strange or unclear names.
 Fix case problems.
 Delete unneeded records.
 Use PROC FREQ to discover incorrect categorical values.
 Check for and correct any duplicated variables.
 Use PROC MEANS to check for unusual minimums or
maximums.
 Set missing value codes (such as 99 or -99) to missing.
 Convert variables not in a correct format.
 Search for and reconcile any duplicate records.
THE COMPLETED MESSY FIX
 The entire SAS code for fixing the MESSY data
set is in the file MESSY_ALL.SAS
 Review this code to see how each fix builds on
one another
 See how this code provides an audit for the
fixes, so you can verify them if needed, or
illustrate how a change was performed to the
data set.
 EXAMINE THIS CODE: Use it as a template for
6.5 SUMMARY
 This chapter provides additional information on common
programming topics for the SAS language. The subjects
covered are not exhaustive but were selected because
they are often used for preparing data for analysis. Many
more topics could have been covered, and readers are
encouraged to refer to the SAS documentation for
 Continue to Chapter 7: SAS® ADVANCED PROGRAMMING
TOPICS PART 2

