
Chapter 1: Creating SAS Data Sets: The Basics


Contents
- Introduction
- Create Simple SAS Data Set
- Reading raw data from external data files
- Reading Data Fields With Special Format
- Permanent SAS Data Set

Introduction
SAS programs consist of SAS statements. SAS statements can be broadly categorized into three groups:
1) System statements group
2) Data steps group
3) Procedure steps group (Chapter 2)

A SAS statement has two important characteristics:
- It usually begins with a SAS keyword.
- It ends with a semicolon (;).

Group                     Statement (begins with keyword)   Sample program code
System statements group   OPTIONS statement                 options nodate;
Data steps group          DATA statement                    data eg1;
                          INFILE statement                  infile 'D:\abc.txt';
                          INPUT statement                   input name $ id age;
                          RUN statement                     run;
Procedure steps group     PROC PRINT statement              proc print data=eg1;
                          RUN statement                     run;

Some Basic System Statements


All SAS system options have default settings. For example, page numbers are automatically displayed (unless you modify this default setting by using OPTIONS statements).

To modify system options, you can submit an OPTIONS statement.


OPTIONS statement
OPTIONS option1 <=v1> option2 <=v2> ;
RUN;

Commonly used options


- CENTER | NOCENTER: Controls whether output is centered or left-justified. Default: CENTER
- DATE | NODATE: Controls whether or not today's date appears at the top of each page of output. Default: DATE
- LINESIZE = n: Controls the maximum length of output lines. Possible values for n are 64 to 256.
- NUMBER | NONUMBER: Controls whether or not page numbers appear on each page of SAS output. Default: NUMBER
- PAGESIZE = n: Controls the maximum number of lines per page of output. Possible values for n are 15 to 32767.
- YEARCUTOFF = yyyy: Specifies the first year in a hundred-year span for interpreting two-digit dates. Default: 1920

Warning: The settings of the OPTIONS statement remain in effect until you modify them, or until you end your SAS session.

Question: What does the following OPTIONS statement do?
options linesize=110 nodate;

a. suppresses the date and limits the horizontal page size for text output
b. suppresses the date and limits the vertical page size for text output
c. suppresses the date and limits the vertical page size for the log
d. suppresses the date and limits the horizontal page size for the log

Answer: a
LINESIZE= controls the maximum length of output lines (the horizontal page size); the vertical page size is controlled by PAGESIZE=.

Create Simple SAS Data Set


Raw data file
Raw data is a set of data that has not yet been processed by SAS (not in SAS format).

Raw data exists in many forms, such as text files (*.txt) and Excel files (*.xls).

A typical raw data set
- Each field is either separated by at least one space (blank delimiter) or arranged in fixed columns.
- The title row may or may not be part of a raw data file.

Note: SAS can read files with other delimiters such as commas or tabs.

SAS data set


SAS processes the raw data set and creates a SAS data set. A typical SAS data set has these properties:
- One observation per row.
- By default, a missing value is represented by a period (.) for a numeric variable and by a blank string (" ") for a character variable.
- By default, values of character variables align on the left and values of numeric variables align on the right.

Question: What type of variable is the variable AcctNum in the data set below?

a. numeric
b. character
c. can be either character or numeric
d. can't tell from the data shown

Correct answer: b
It must be a character variable, because the values contain letters and underscores, which are not valid characters for numeric values. Besides, it aligns on the left.

Question: What type of variable is the variable Wear in the data set below?

a. numeric
b. character
c. can be either character or numeric
d. can't tell from the data shown

Correct answer: a
It must be a numeric variable, because the missing value is indicated by a period rather than by a blank. Besides, it aligns on the right.

Reading raw data from external data files


Data step statements: list input method
There are two list input methods:
1. Enter raw data directly into the SAS system
2. Read raw data from an external data file

If the values in your raw data file are all separated by at least one space (or another delimiter), then list input (also called free-formatted input) may be appropriate for reading the data.

1) Enter Raw Data Directly into SAS system


Syntax
DATA data_set_name ;
   INPUT varname1 <$> varname2 <$> . . . ;
   <Other DATA step statements>
   DATALINES;
. . . data . . .
. . . data . . .
;
RUN;

Example
data ex1;               /* (1) */
   input Brand $ Wear;  /* (2) */
   datalines;           /* (3) */
Acme 43
Ajax 34
Atlas .
;
run;

Description
(1) The DATA statement initiates the data input process and defines the name assigned to the created SAS data set. The DATA statement must be the first statement of a DATA step.

Rules for SAS data set names
i. A SAS name can contain from 1 to 32 characters
ii. The first character must be a letter or an underscore (_)
iii. Subsequent characters must be letters, numbers, or underscores
iv. Blanks cannot appear in SAS names

Question: Which of the following variable names is valid?
a. 4BirthDate
b. $Cost
c. _Items_
d. Tax-Rate

Correct answer: c
Variable names follow the same rules as SAS data set names. They can be 1 to 32 characters long, must begin with a letter (A-Z, either uppercase or lowercase) or an underscore, and can continue with any combination of numbers, letters, or underscores.

(2) The INPUT statement defines the list of variables contained in the data set. The $ after Brand indicates that it is a character variable, whereas Wear is a numeric variable.
Note: The rules for SAS variable names are the same as the rules for SAS data set names.

(3) Each line following the DATALINES statement is considered a data record until a line containing only a semicolon (;) is reached. A missing value must be represented by a period (.).

Add labels to SAS variables
To improve readability, a label can be attached to a variable with the LABEL statement when the SAS data set is being created.
LABEL var1 = 'Label1' var2 = 'Label2' ... ;
- A label can be any text up to 256 characters long.
- You may put as many variable names and labels as you like into one LABEL statement.
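The LABEL statement above can be sketched with the Brand/Wear data from the list input example. The data set name ex1_labeled and the label texts are illustrative assumptions, not taken from the original notes.

```sas
/* Hypothetical example: attach labels while the data set is created */
data ex1_labeled;
   input Brand $ Wear;
   label Brand = 'Brand name'
         Wear  = 'Wear size';
   datalines;
Acme 43
Ajax 34
Atlas .
;
run;

/* The LABEL option asks PROC PRINT to show labels as column headings */
proc print data=ex1_labeled label;
run;
```

Because the labels are stored with the data set, any later procedure that displays labels can reuse them without repeating the LABEL statement.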

2) Reading external raw data text file


Syntax
DATA data_set_name ;
   INFILE 'source_file_name' <options>;
   INPUT varname1 <$> varname2 <$> . . . ;
   <Other DATA step statements>
RUN;

Example
data Q7;
   infile 'C:\sas\ex1\Q7.txt';
   input Size $ Colour $ Price Cost;
run;

- 'source_file_name' is a quoted string containing the path and full name, with extension, of the text file.
- <options> describes the input file's characteristics and specifies how it is to be read with the INFILE statement. INFILE options are discussed further below.

Restrictions for list input method
- Each value is separated by a delimiter (one blank space).
- Specify each variable in the INPUT statement in the order in which it appears in the records of raw data.
- By default, a missing value must be represented by a placeholder such as a period.
- By default, data must be in standard character or numeric format.
- Character values cannot contain embedded blanks if a blank space is used as the delimiter.
- The default maximum length of character variables is 8.

Example
The maximum length of var1 is 8, so only the first 8 characters of a longer value (abcdefgh) are stored.

Solution: LENGTH statement
LENGTH varname1 $ n1 varname2 $ n2 ... varnameX $ nX ;
- Put the LENGTH statement before the INPUT statement.
- nX sets the length of varnameX in the SAS data set (the default length is 8).
- By default, SAS still reads the raw data record from delimiter to delimiter regardless of the length.

Using the LENGTH statement, the maximum length of var1 is 14.

There would be an error message if you put the LENGTH statement after the INPUT statement.
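A minimal sketch of the LENGTH fix described above. The data set name length_demo, the variable var2, and the data values are assumptions made for illustration.

```sas
/* Hypothetical data: var1 holds values longer than the default 8 characters */
data length_demo;
   length var1 $ 14;      /* declared before INPUT, so it takes effect */
   input var1 $ var2;
   datalines;
abcdefghijklmn 1
xyz 2
;
run;

proc print data=length_demo;
run;
```

With the LENGTH statement in place, the 14-character value is stored in full instead of being cut off after 8 characters.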

Column input method


Column input reads raw data in which each field is arranged in fixed (aligned) columns. In the syntax below, bc is the beginning column and ec is the ending column of a field.

Syntax (data entered directly):
DATA data_set_name;
   INPUT varname1 <$> bc<-ec> varname2 <$> bc<-ec> . . . ;
   DATALINES;
data . . .
;
RUN;

Syntax (external data file):
DATA data_set_name;
   INFILE 'source_file_name' ;
   INPUT varname1 <$> bc<-ec> varname2 <$> bc<-ec> . . . ;
RUN;

Example:
data quiz1b_q1;
   infile 'd:\temp\quiz1b_data1.txt';
   input id 21-27 name $ 1-20 age 37-38;
run;

Comparison between List input method and Column input method

List input method
- Each value is separated by a delimiter (one blank space)
- Specify each variable in the INPUT statement in the order in which it appears in the records of raw data
- By default, a missing value must be represented by a placeholder such as a period
- Character values cannot contain embedded blanks if a blank space is used as the delimiter
- Default maximum length of character variables is 8
- Raw data is not required to be within fixed columns
- By default, data must be in standard character or numeric format

Column input method
- No delimiter is required
- Can read variables in any order
- A placeholder is not required to indicate a missing value
- Allows embedded blanks in character values (since no delimiter is required)
- Character values longer than 8 are allowed
- Raw data must be contained within fixed columns
- By default, data must be in standard character or numeric format

Note: Both input methods have the same restriction: Data must be in standard format by default

INFILE Statement Options


1. DELIMITER = 'list-of-delimiting-characters'
Specifies an alternate delimiter to be used for the list input method.
Commonly used delimiters: ',' '!' '&' '09'X (tab)
If a non-blank delimiter is used:
- Character values may contain embedded spaces
- Default maximum length of character variables is 8
- Consecutive delimiters are treated as a single delimiter
- A blank field between two non-consecutive delimiters is read as a missing value

Example:
data school;
   infile datalines delimiter=',';
   length district $ 12;
   input district teachers_no students_no;
   datalines;
North Point,,,, 1000 , 30000
Central, 520 , 16000
Wan Chai, , 2500
;
run;

Notes on the example:
- A comma (,) is used as the delimiter.
- The maximum length of district has to be set to 12.
- datalines must be named in the INFILE statement when raw data is entered directly (DATALINES statement).
- Consecutive delimiters are treated as a single delimiter, and character values may contain embedded spaces.
- A blank field between two delimiters is read as a missing value.


2. DSD
- Uses a comma as the delimiter (you can change the delimiter with the DELIMITER= option)
- Treats consecutive non-blank delimiters as a missing value
- Removes quotation marks from character values
- Reads character values that contain a delimiter within a quoted string
- Default maximum length of character variables is 8

Example (uses ! as the delimiter):
data school;
   infile 'D:\district.txt' dsd delimiter='!';
   length district $ 12;
   input district teachers_no students_no;
run;
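Since the contents of the external file are not shown, here is a small self-contained sketch of the DSD behaviour using in-stream data. The data set name school_dsd and the data values are assumptions for illustration only.

```sas
/* Hypothetical data: DSD with the default comma delimiter */
data school_dsd;
   infile datalines dsd;
   length district $ 16;
   input district $ teachers_no students_no;
   datalines;
"North Point",1000,30000
Central,,16000
"Wan Chai, East",80,2500
;
run;

proc print data=school_dsd;
run;
```

In this sketch the quotation marks are stripped from the district names, the two consecutive commas on the second line give a missing teachers_no, and the comma inside the quoted string on the third line is kept as part of the value.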


3. MISSOVER
Add the MISSOVER option if data may be missing at the end of a data line. It tells SAS that if it runs out of data on a line, it should not go to the next data line to continue reading; the remaining variables are set to missing.
Example:
data case4a; input var1 var2 var3; datalines; 1 2 4 5 7 1 8 9 ; run; data case4b; infile datalines missover; input var1 var2 var3; datalines; 1 2 4 5 7 1 8 9 ; run;
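The effect of MISSOVER can be sketched with a short record followed by two complete ones. The data set names and the data lines below are assumptions chosen for illustration; they are not the original case4a/case4b data.

```sas
/* Without MISSOVER: SAS moves to the next line to find a value for var3 */
data flow_demo;
   input var1 var2 var3;
   datalines;
1 2
4 5 7
1 8 9
;
run;

/* With MISSOVER: var3 is set to missing for the short first line */
data miss_demo;
   infile datalines missover;
   input var1 var2 var3;
   datalines;
1 2
4 5 7
1 8 9
;
run;

proc print data=flow_demo;
run;
proc print data=miss_demo;
run;
```

Here flow_demo ends up with two observations (the first one borrows the 4 from the second line and the rest of that line is discarded), while miss_demo keeps all three lines as separate observations with var3 missing in the first.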

Other INFILE statement options

LRECL = logical-record-length
SAS assumes external files have a record length of 256 or less. If your data lines are long and it looks like SAS is not reading all your data, use the LRECL= option in the INFILE statement to specify a record length at least as long as the longest record in your data file.
INFILE 'C:\MyRawData\President.txt' LRECL=2000;

FIRSTOBS = n
Tells SAS at which line to begin reading data.

OBS = n
Tells SAS to stop reading after record n of the raw data file.

Example (reads records 3 through 5):
DATA icecream;
   INFILE 'D:\Sales.txt' FIRSTOBS=3 OBS=5;
   INPUT Flavor $ 1-9 Location BoxesSold;
RUN;

Reading Data Fields With Special Format


Formatted Input Method
Use formatted input for raw data records with special formats that cannot be handled by basic list input or basic column input alone.

1) Format input for character variables


INPUT varname1 : $w. varname2 : $w. ;
- The informat $w. tells SAS to read exactly w columns of characters immediately after the last encountered delimiter. It also sets the length of the character variable in the SAS data set to w.
- The informat modifier : (colon) tells SAS to use the informat supplied but to stop reading the value for this variable when a delimiter or the end of the line is encountered.

Example:
data Qaaa;
   infile datalines delimiter=',' missover;
   input var1 var2 var3 : $9. var4;
   datalines;
1, ,HELLO,7
2,4,TEXT,8
9, , ,6
21,31,SHORT
100,200,LAST LINE,999
;
run;

Error: missing informat modifier : (colon)

data Qaaa2;
   infile datalines delimiter=',' missover;
   input var1 var2 var3 $9. var4;
   datalines;
1, ,HELLO,7
2,4,TEXT,8
9, , ,6
21,31,SHORT
100,200,LAST LINE,999
;
run;

Note: If the : (colon) is missing, SAS reads exactly 9 columns of characters for var3 even if a delimiter is encountered.

2) Format input for numeric variables


INPUT varname1 : DOLLARw. ;
- The informat DOLLARw. tells SAS to read exactly w columns of numeric values with special characters immediately after the last encountered delimiter.
- It removes embedded blanks (with non-blank delimiters), thousands commas (with non-comma delimiters), dollar signs ($), and right parentheses; a left parenthesis at the start of the value is converted to a minus sign.
- It should always be used together with the informat modifier : (colon).

Example:
data case6;
   infile datalines delimiter=',';
   input age profit : dollar12.;
   datalines;
21, $6 750.55
19,$1 0000 00
22, ($3 000)
;
run;

In this example the DOLLARw. informat removes the embedded blanks and the $ signs, and reads the value in parentheses as a negative number.

3) Format input for date fields


It is convenient to store dates as numeric values so that they can be used in calculations.
Example: calculating the number of days (weeks, months, or years) between two dates.
Dates can be expressed in many different forms.
Examples: 1/31/90, 31/1/90, 31Jan1990, 90-1-31, 90-31-1

Use SAS date informats to read date values:

Informat    Date format in raw data                      Width
DATEw.      17Jan03, 17/Jan/03, 17-Jan-03, 17 Jan 03     7 <= w <= 32
MMDDYYw.    011703, 01/17/03, 01-17-03, 01 17 03         6 <= w <= 32
DDMMYYw.    170103, 17/01/03, 17-01-03, 17 01 03         6 <= w <= 32
YYMMDDw.    030117, 03/01/17, 03-01-17, 03 01 17         6 <= w <= 32
MONYYw.     Jan03, Jan/03, Jan-03, Jan 03                6 <= w <= 32
YYMMw.      0301                                         4 <= w <= 6

Example:
data date;
   infile datalines delimiter=',';
   input date1 : date9. date2 : ddmmyy8. date3 : mmddyy10.;
   datalines;
31Dec59, 01011960, 01172003
31Dec1959, 01-01-60, 01-17-03
31DEC59, 01/01/60, 01/17/03
31DEC1959, 01 01 60, 01 17 03
;
run;
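Once dates have been read with a date informat they are ordinary numbers, so the "number of days between two dates" calculation mentioned earlier is a plain subtraction. The sketch below uses assumed names (date_diff, start, finish, days_between) and assumed data values.

```sas
/* Hypothetical data: read two dates per line and subtract them */
data date_diff;
   infile datalines delimiter=',';
   input start : date9. finish : date9.;
   days_between = finish - start;   /* elapsed days as a plain number */
   format start finish date9.;      /* display the dates readably */
   datalines;
01Jan1960,01Jan1961
17Jan2003,31Jan2003
;
run;

proc print data=date_diff;
run;
```

For these two lines days_between is 366 (1960 is a leap year) and 14. The FORMAT statement only changes how start and finish are displayed; the stored values stay numeric.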

How does SAS convert calendar dates to SAS date values?
- A SAS date value is the number of days between 1 Jan 1960 and the specified date.
- Dates before 1 Jan 1960 are negative values; dates after are positive values.

Calendar date    SAS date value
1-Jan-1959       -365
1-Jan-1960       0
1-Jan-1961       366
1-Jan-1962       731

How does SAS know which century a two-digit year belongs to?
If you use two-digit year values in your data lines or external files, you should consider the YEARCUTOFF= option. This option specifies which 100-year span is used to interpret two-digit year values.
The default value of YEARCUTOFF= is 1920:
- Two-digit years 20-99 are assumed to be 1920-1999
- Two-digit years 00-19 are assumed to be 2000-2019

Date Expression    Interpreted As
12/07/41           12/07/1941
18Dec15            18Dec2015
04/15/30           04/15/1930
15Apr95            15Apr1995

To change the cutoff year value:
OPTIONS YEARCUTOFF = cutoffyear ;
For example, if you specify YEARCUTOFF=1950, then the 100-year span will be from 1950 to 2049.
options yearcutoff=1950;
Using YEARCUTOFF=1950, dates are interpreted as shown below:

Date Expression    Interpreted As
12/07/41           12/07/2041
18Dec15            18Dec2015
04/15/30           04/15/2030
15Apr95            15Apr1995
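The YEARCUTOFF=1950 interpretations above can be checked with a short program. This is a sketch: the data set name cutoff_demo and the variable names d1 and d2 are assumptions, and the final OPTIONS statement simply restores the default value stated in these notes.

```sas
options yearcutoff=1950;      /* two-digit years now fall in 1950-2049 */

data cutoff_demo;
   infile datalines delimiter=',';
   input d1 : mmddyy8. d2 : date7.;
   format d1 d2 date9.;       /* display four-digit years */
   datalines;
12/07/41,18Dec15
04/15/30,15Apr95
;
run;

proc print data=cutoff_demo;
run;

options yearcutoff=1920;      /* restore the default given in these notes */
```

Because OPTIONS settings stay in effect for the rest of the session, resetting the value at the end avoids surprising later steps that read two-digit years.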

Question: SAS date values are the number of days since which date?
a) January 1, 1901
b) January 1, 1950
c) January 1, 1960
d) January 1, 2001

Correct answer: c
A SAS date value is the number of days between 1 Jan 1960 and the specified date.

Question: In order for the date values 05May1955 and 04Mar2046 to be read correctly, what value must the YEARCUTOFF= option have?
a. a value between 1947 and 1954, inclusive
b. 1955 or higher
c. 1946 or higher
d. any value

Correct answer: d
As long as you specify an informat (e.g. date9.) with the correct field width for reading the entire date value, the YEARCUTOFF= option doesn't affect date values that have four-digit years.

Question: Which time span is used to interpret two-digit year values if the YEARCUTOFF= option is set to 1950?
a. 1950-2049
b. 1950-2050
c. 1949-2050
d. 1950-2000

Correct answer: a
The YEARCUTOFF= option specifies which 100-year span is used to interpret two-digit year values. The default value of YEARCUTOFF= is 1920. However, you can override the default and change the value of YEARCUTOFF= to the first year of another 100-year span. If you specify YEARCUTOFF=1950, then the 100-year span will be from 1950 to 2049.

Permanent SAS Data Set


The SAS data sets created so far are temporary (i.e. they are deleted when you close the SAS window). A permanent SAS data set stays on your computer after the SAS session ends.
- A SAS data set is temporary if it is stored in the SAS Work library. All SAS data sets in the Work library are deleted automatically when a SAS session ends.
- A SAS data set is permanent if it is stored in a SAS library other than Work.

Creating a SAS data library using the LIBNAME statement
LIBNAME libref 'SAS-data-library';
where
- libref is 1 to 8 characters long, begins with a letter or underscore, and contains only letters, numbers, or underscores
- SAS-data-library is the name of a SAS data library in which SAS data files are stored

The LIBNAME statement below assigns the libref Mysaslib to the SAS data library D:\.
libname Mysaslib 'd:\';

Creating a permanent SAS data set
Suppose a SAS library Mysaslib has been created. A permanent SAS data set is created by a DATA statement with a two-level name (two names separated by a period).
Syntax
DATA libref.data-set-name ;
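A minimal sketch of the two-level name in use, assuming that d:\ exists and is writable on your machine. The data set name ex1 and its contents are reused from the earlier list input example for illustration.

```sas
libname Mysaslib 'd:\';     /* libref and path from the LIBNAME example above */

data Mysaslib.ex1;          /* two-level name: stored in d:\, not in Work */
   input Brand $ Wear;
   datalines;
Acme 43
Ajax 34
Atlas .
;
run;

proc print data=Mysaslib.ex1;
run;
```

In a later SAS session, submitting the same LIBNAME statement again should make Mysaslib.ex1 available without re-running the DATA step.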

- A SAS library can be deleted: in Explorer, select the library icon, right-click, and select Delete. This only removes the connection between the physical storage location and SAS. All SAS data sets remain in the storage folder.
- If a SAS data set is deleted from a SAS library, it is deleted permanently from the storage folder.
- A SAS data set can be copied or moved from one SAS library to another.

Question: Which one of the following statements is false?
a. LIBNAME statements can be stored with a SAS program to reference the SAS library automatically when you submit the program.
b. When you delete a libref, SAS no longer has access to the files in the library. However, the contents of the library still exist on your operating system.
c. Librefs can last from one SAS session to another.
d. You can access files that were created with other vendors' software by submitting a LIBNAME statement.

Correct answer: c
The LIBNAME statement is global, which means that librefs remain in effect until you modify them, cancel them, or end your SAS session. Therefore, the LIBNAME statement assigns the libref for the current SAS session only. You must assign a libref before accessing SAS files that are stored in a permanent SAS data library.

Chapter 2: Simple SAS Reports


Contents
- PRINT Procedure - PROC PRINT
- Producing Frequency Tables - PROC FREQ
- Computing Statistics - PROC MEANS
- Defining Custom Formats - PROC FORMAT

PRINT Procedure - PROC PRINT


Syntax:
PROC PRINT DATA = data_set_name <(data-set-options)> <options> ;
   <VAR variable-list ;>
   <SUM variable-list ;>
   <BY variable-list ;>
   <WHERE where-expression ;>
   <TITLEn 'title statement' ;>
   <LABEL variable-name1 = 'label-string1' ... ;>
   <FORMAT variable-name1 format1 variable-name2 format2 ;>
RUN;

Basic Report
proc print data=Q1; run;

Data Set Options


- (FIRSTOBS = n): starts printing from the nth observation of data-set-name
- (OBS = m): stops printing at the mth observation of data-set-name
- N: prints the number of observations
- NOOBS: suppresses the Obs column in the output
- OBS = 'column header': specifies a header for the Obs column in the printout

Example: The following step prints only the 3rd through 8th observations of the data set Q1 and prints the number of observations.
proc print data=Q1 (firstobs=3 obs=8) n; run;

Example : The following output suppresses the Obs column.


proc print data=Q1 noobs ; run;

Example: The following output uses 'Observation Number' as the header for the Obs column.
proc print data=Q1 obs='Observation Number';
run;

Note: If you include both NOOBS and OBS = 'column header' in your statement, SAS suppresses the Obs column (OBS = 'column header' does not take effect).
proc print data=Q1 obs='Observation Number' noobs;
run;

Selected Variables
VAR statement
The VAR statement selects the variables that appear in your report.
proc print data=Q1; var gender cost; run;

Selected Observations
WHERE statement
Prints only the observations that meet certain conditions.

Definition                  Operator    Symbol
Equal to                    EQ          =
Not equal to                NE          ^=
Greater than                GT          >
Less than                   LT          <
Greater than or equal to    GE          >=
Less than or equal to       LE          <=
Equal to one of a list      IN
Specified substring         CONTAINS    ?
Logical and                 AND         &
Logical or                  OR          |
Logical not                 NOT         ^

proc print data=Q1;
   var gender cost;
   where cost>50;
run;
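The list and substring operators from the table can be sketched as follows. The contents of Q1 are not shown in these notes, so the data set q1_demo, the variable item, and all values below are assumptions for illustration.

```sas
/* Hypothetical data standing in for Q1 */
data q1_demo;
   input gender $ cost item : $12.;
   datalines;
F 45 notebook
M 60 textbook
F 75 backpack
M 30 pen
;
run;

/* IN with a list of values, combined with AND and the GE mnemonic */
proc print data=q1_demo;
   where gender in ('F', 'M') and cost ge 50;
run;

/* CONTAINS for a substring test, combined with OR and NOT */
proc print data=q1_demo;
   where item contains 'book' or not (cost > 40);
run;
```

With this data the first step prints the textbook and backpack rows, and the second prints the notebook, textbook, and pen rows.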

Column Totals
SUM statement
You can produce column totals for numeric variables within your report.
Example:
proc print data=Q1;
   var gender cost;
   sum cost;
run;

Specifying TITLE
TITLE statement: To make your report more meaningful, you can specify up to 10 titles by using TITLE statements.
TITLEn 'title' ;
- n is a number from 1 to 10 that specifies the line number of the title.
- Skipping some values of n leaves those lines blank.
- Titles are centered by default.
- SAS uses the same title for all subsequent output until you cancel it or define a new title.
- To cancel a title, specify a blank TITLEn statement, e.g. TITLE1;

Example:
proc print data=Q1;
   var gender cost;
   sum cost;
   title 'ABC Company';
   title3 'Transaction records';
run;
proc print data=Q2;
run;

Note 1: title is the same as title1.
Note 2: Skipping title2 means that the second line is blank.
Note 3: SAS uses the same titles when printing data set Q2.
Note 4: If you do not want the same titles to appear in the second output, specify a blank TITLE statement:
proc print data=Q2;
   title;
run;

Temporarily Assigning Labels to Variables


You can enhance your PROC PRINT report by labeling columns with more descriptive text.

LABEL statement:
PROC PRINT DATA = data_set_name LABEL ;
   LABEL variable-name1 = 'label-string1' variable-name2 = 'label-string2' ;
- A label string can be up to 256 characters long, including blanks, and must be enclosed in quotation marks.
- If you assigned labels (permanent labels) when you created the SAS data set, you can omit the LABEL statement from the PRINT procedure.

Example
proc print data=Q1 label;
   var gender cost;
   sum cost;
   title 'ABC Company';
   title3 'Transaction records';
   label cost='Transaction cost';
run;

Temporarily Assigning Formats to Variables


In your SAS reports, formats control how the data values are displayed. Formats affect only how the data values appear in output, not the actual data values as they are stored in the SAS data set.

FORMAT statement
FORMAT variable-name1 format1 variable-name2 format2 ;
- Possible system formats include COMMAn.d and DOLLARn.d (d specifies the number of decimal places), DATEw., DDMMYYw., and MMDDYYw.
- Specify a sufficiently large width (n) to contain the largest value, including special characters such as commas and dollar signs.
- If a permanent format (see later section) is assigned when the SAS data set is created, the FORMAT statement in the PRINT procedure can be omitted.

Some commonly used formats

Format        Specifies these values                                                           Example
COMMAw.d      that contain commas and decimal places                                           comma8.2
DOLLARw.d     that contain dollar signs, commas, and decimal places                            dollar6.2
MMDDYYw.      as date values of the form 09/12/97 (MMDDYY8.) or 09/12/1997 (MMDDYY10.)         mmddyy10.
DDMMYYw.      as date values of the form 12/09/97 (DDMMYY8.) or 12/09/1997 (DDMMYY10.)         ddmmyy10.
DATEw.        as date values of the form 16OCT99 (DATE7.) or 16OCT1999 (DATE9.)                date9.
WORDDATEw.    as date values of the form Apr 12, 1999                                          worddate32.
w.d           rounded to d decimal places in w spaces                                          8.2
$w.           as character values in w spaces                                                  $12.

Example
proc print data=Mylib.year_sales label noobs;
   var units amountsold;
   where salesrep='Garcia' and quarter='1';
   sum amountsold;
   label units='Units sold' amountsold='Amount sold';
   format units comma7. amountsold dollar12.2;
   title1 'Sales in first quarter by Garcia';
run;

Example

This FORMAT statement                   Displays values as
format date mmddyy8.;                   06/05/03
format net comma5.0 gross comma8.2;     1,234       5,678.90
format net gross dollar9.2;             $1,234.00   $5,678.90

Creating a Customized Layout with BY Groups


Produces a separate section of the report for each BY group of observations.

BY statement
BY variable-list ;
- If the data is not already sorted by the same variable-list, add a PROC SORT step before the PROC PRINT step.
- The same variable cannot appear in both the VAR and BY statements.

PROC SORT
PROC SORT DATA = datain <OUT = dataout> ;
   BY <DESCENDING> variable-list ;
- If dataout is not specified, datain is replaced by the sorted data set.
- By default, observations are sorted in ascending order of the specified variable-list.
- Missing values always sort low.
- Sorted order for character variables: (missing) < symbol < 0 < 1 < 11 < 2 < A < B < a

Example
proc sort data=ex1 out=sort_ex1;
   by month year;
run;
proc print data=sort_ex1 n;
   var year telephone;
   by month;
run;

Question: What happens if you submit the following program?

proc sort data=clinic.diabetes;
run;
proc print data=clinic.diabetes;
   var age height weight pulse;
   where sex='F';
run;

a. The PROC PRINT step runs successfully, printing observations in their sorted order.
b. The PROC SORT step permanently sorts the input data set.
c. The PROC SORT step generates errors and stops processing, but the PROC PRINT step runs successfully, printing observations in their original (unsorted) order.
d. The PROC SORT step runs successfully, but the PROC PRINT step generates errors and stops processing.

Correct answer: c
The BY statement is required in PROC SORT. Without it, the PROC SORT step fails. However, the PROC PRINT step prints the original data set as requested.

Question: What does PROC PRINT display by default?
a. PROC PRINT does not create a default report; you must specify the rows and columns to be displayed.
b. PROC PRINT displays all observations and variables in the data set. If you want an additional column for observation numbers, you can request it.
c. PROC PRINT displays columns in the following order: a column for observation numbers, all character variables, and all numeric variables.
d. PROC PRINT displays all observations and variables in the data set, a column for observation numbers on the far left, and variables in the order in which they occur in the data set.

Correct answer: d

Producing Frequency Tables - PROC FREQ


The FREQ procedure is both a descriptive and a statistical procedure. It produces one-way and n-way frequency tables, counts how many observations have each value, and provides percentages and cumulative statistics.

General form:
PROC FREQ DATA = sas-data-set <options>;
   <TABLE variable-list < / options> ;>
   <BY by-variables ;>
   <WHERE where-expression ;>
   <FORMAT variable-format ;>
   <TITLE 'title-text' ;>
   <LABEL variable-name1 = 'label-string1' ... ;>
RUN;

Basic FREQ Procedure

PROC FREQ DATA = sas-data-set <options>;
RUN;
By default, PROC FREQ creates a one-way table with the frequency, percent, cumulative frequency, and cumulative percent of every value of all variables in a data set.
Example:
proc print data=Q1;
run;
proc freq data=Q1;
run;

PROC FREQ statement options


1. NLEVELS: displays the number of levels for all variables in the TABLE statement
proc freq data=Q1 nlevels;
run;

2. ORDER = : specifies the order for listing the variable values
- FREQ: orders values by descending frequency count
- FORMATTED: orders values by their formatted values, ascending
- INTERNAL (default): orders values by their unformatted values, ascending
- DATA: orders values according to their order in the data set

proc freq data=Q1 order=freq;
   table strength_of_fragrance;
run;

data eg1;
   input var1 $;
   datalines;
d
f
e
a
b
b
c
e
;
run;
proc freq data=eg1 order=data;
run;

With ORDER=DATA the values are listed in order of first appearance: d (1), f (2), e (3), a (4), b (5), c (6).

Specifying Variables in PROC FREQ


TABLE statement:
TABLE variable-list < / options> ;
variable-list specifies the variables included in the report:
- For a one-way table, specify the variable name.
- For more than one variable, separate the variable names with spaces.
- For a two-way table, separate the paired variables with *.
- For more than one pair of variables, separate each pair with a space.

One-way tables
proc freq data=crew;
   table jobcode location;
run;
In this example, two one-way tables are produced.

Two-way table
proc freq data=crew;
   table location*jobcode;
run;
In this example, a two-way table is produced.

PROC FREQ produces one-way tables with cells that contain:
- frequency
- percent
- cumulative frequency
- cumulative percent

PROC FREQ produces two-way tables with cells that contain:
- cell frequency
- cell percent of total frequency
- cell percent of row frequency
- cell percent of column frequency

TABLE statement options


TABLE statement:
TABLE variable-list < / options> ;

Commonly used options in all FREQ tables:
- MISSING: includes missing values in frequency statistics, i.e. treats a missing value as a valid value
- NOPRINT: suppresses display of the table
- OUT = out-data-set: writes the frequencies to the SAS data set out-data-set

Example
proc freq data=crew;
   table location jobcode / missing out=result;
run;

Commonly used options in one-way tables:
- NOCUM: suppresses display of cumulative frequencies and percentages
- NOPERCENT: suppresses display of percentages
- OUTCUM: includes the cumulative frequency and cumulative percentage in the output data set

Commonly used options in two-way tables:
- LIST: prints cross-tabulations in list format rather than as a grid
- CROSSLIST: prints cross-tabulations in crosslist format
- NOCUM: suppresses display of cumulative frequencies and cumulative percentages in list format
- NOCOL: suppresses display of the column percentage for each cell
- NOROW: suppresses display of the row percentage for each cell
- NOFREQ: suppresses display of the frequency count for each cell
- OUTPCT: includes the percentage of column frequency, row frequency, and two-way table frequency in the output data set

Examples of one-way table options:

proc freq data=Mylib.car;
   table size / nopercent;
run;

proc freq data=Mylib.car;
   table size / nocum;
run;

proc freq data=Mylib.car;
   table size / noprint out=Q13 outcum;
run;

Examples of two-way table options:

proc freq data=crew;
   table location*jobcode / list;
run;

proc freq data=crew;
   table location*jobcode / crosslist;
run;

proc freq data=crew;
   table location*jobcode / out=result2 outpct;
run;

BY statement
The BY statement produces separate analyses of the observations in groups defined by the BY variables. If the data set is not sorted in ascending order of the BY variables, sort it using the SORT procedure with a similar BY statement.

Example
proc sort data=crew;
   by location;
run;
proc freq data=crew;
   table jobcode / missing;
   by location;
run;

Computing Statistics -- PROC MEANS


PROC MEANS
- Produces a report on variables in a SAS data set.
- Computes summary statistics such as maximum, minimum, mean, and standard deviation.
- Applies only to numeric variables; missing values are excluded from the statistical calculations.

PROC MEANS DATA = sas_data_set <requested-statistics> <options>;
   <VAR variable-list ;>
   <BY by-variables ;>
   <CLASS class-variables ;>
   <OUTPUT OUT = sas-data-set <output-statistic = output-label> ;>
   <WHERE where-expression ;>
   <TITLE 'title-text' ;>
   <LABEL variable-name1 = 'label-string1' ... ;>
RUN;

VAR statement
VAR variable-list ;
If the VAR statement is omitted, PROC MEANS reports on every numeric variable in sas-data-set.
To list more than one variable, separate the variable names with spaces.
The default reported statistics are N, MEAN, STD, MIN, and MAX.

proc means data=Mylib.Car; run;

Note: The above report shows all numeric variables (only mileage and reliability are numeric variables).

proc means data=Mylib.Car; var mileage; run;

Requested statistics
Other statistics include: RANGE, MEDIAN, SUM, NMISS, SKEWNESS, VAR, Q1, Q3, P1, P5, P10, P90, P95, P99, etc. If you list any statistics in requested-statistics, PROC MEANS no longer produces the default statistics; every statistic you want must be requested explicitly.
proc means data=Mylib.Car n mean; var MILEAGE; run;

PROC MEANS statement options


MAXDEC= - specifies the number of decimal places for the statistics
proc means data=Mylib.Car n mean std maxdec=3; var MILEAGE; run;

NOPRINT suppresses all displayed output

Group processing
Group Processing Using the CLASS Statement
PROC MEANS DATA = sas_data_set <requested-statistics> <options>; <VAR variable-list ;> CLASS class-variables ; RUN;
CLASS Statement Options:
ORDER = - specifies the order for listing the values of the CLASS variable
MISSING - treats missing values as a valid value of the CLASS variable
Note: You do not need to use PROC SORT when using the CLASS statement.

Group Processing Using the BY Statement
PROC MEANS DATA = sas_data_set <requested-statistics>; <VAR variable-list ;> BY by-variables ; RUN;
Note: If the data set is not sorted in ascending order of the BY variables, sort the data using PROC SORT with a similar BY statement.

Example
proc means data=Mylib.Car mean median order=freq; var mileage; class size; run;

Example
proc sort data=Mylib.Car out=sort_Car; by size; run; proc means data=sort_Car mean median; var mileage; by size; run;

Creating a Summarized Data Set -- OUTPUT statement


OUTPUT OUT = sas-data-set <output-statistic1 = output-name1a output-name1b output-statistic2 = output-name2a output-name2b >;
Using the OUTPUT statement without any output-statistic = specification produces the default statistics (N, MIN, MAX, MEAN, STD) for all of the variables specified in the VAR statement.
output-statistic = specifies the summary statistic to be written out. It is not necessarily identical to the requested-statistics in the PROC MEANS statement.
output-names specify the names of the variables that will be created to contain the values of the summary statistics. The output-names must be listed in the same order as the variables in the VAR statement.
Example
proc means data=Mylib.Car noprint; var RELIABILITY MILEAGE; output out=car_average mean=MEAN_REL MEAN_MILE nmiss=nm_rel nm_mile; run;

Example
Note: The values of _TYPE_ indicate which combinations of the CLASS variables were used to compute the statistics.
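As a minimal sketch of how _TYPE_ appears in a summarized data set, the step below uses one CLASS variable and writes the means to an output data set. The data set car_demo and its values are hypothetical stand-ins, not the course's Mylib.Car data. With a single CLASS variable, the output should contain one row with _TYPE_ = 0 (the overall mean across all observations) and one row per value of size with _TYPE_ = 1.

```sas
/* Hypothetical demo data: invented values for illustration only */
data car_demo;
   input size $ mileage;
   datalines;
Small 30
Small 34
Large 18
Large 22
;
run;

/* One CLASS variable: _TYPE_=0 is the overall summary, _TYPE_=1 is per size */
proc means data=car_demo noprint;
   var mileage;
   class size;
   output out=car_summary mean=mean_mile;
run;

/* The output data set also contains _FREQ_, the number of observations per row */
proc print data=car_summary;
run;
```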

Question.
The default statistics produced by the MEANS procedure are n-count, mean, minimum, maximum, and
a. median.
b. range.
c. standard deviation.
d. standard error of the mean.
Correct answer: c

Question.
Which statement will limit a PROC MEANS analysis to the variables Boarded, Transfer, and Deplane?
a. by boarded transfer deplane;
b. class boarded transfer deplane;
c. output boarded transfer deplane;
d. var boarded transfer deplane;
Correct answer: d
To specify the variables that PROC MEANS analyzes, add a VAR statement and list the variable names.

Question.
Which of the following statements is true regarding BY-group processing?
a. BY variables must be either indexed or sorted.
b. Summary statistics are computed for BY variables.
c. BY-group processing is preferred when you are categorizing data that contains few variables.
d. BY-group processing overwrites your data set with the newly grouped observations.
Correct answer: a
Unlike CLASS processing, BY-group processing requires that your data already be indexed or sorted in the order of the BY variables. You might need to run the SORT procedure before using PROC MEANS with a BY group.

Defining Custom Formats -- PROC FORMAT


You can use the FORMAT procedure to define your own custom formats for displaying values of variables. A custom format does not affect the internal data values that are stored in the SAS data set.
Once defined, a custom format is used like a SAS system format.

PROC FORMAT <LIBRARY = libref>;
VALUE <$>format-name1 range1a = 'label1a' range2a = 'label2a' ;
VALUE <$>format-name2 range1b = 'label1b' range2b = 'label2b' ;
...
RUN ;

Temporary custom format (default)
A custom format is stored in a format catalog in the WORK library, so it is stored only temporarily.
You need to submit the PROC FORMAT step only once per session, but you must re-run it each time you start a new SAS session.

Permanent custom format
The option LIBRARY = libref specifies the permanent SAS data library in which the format catalog will be stored.
You need to tell SAS where to find the defined format before using it, but you do not need to re-run the procedure.

format-name names the format that you are creating.
Must begin with a $ sign if the format applies to character values
Cannot be longer than eight characters (later SAS releases allow longer names)
Cannot be the name of an existing SAS format
Cannot end with a number
Does not end in a period

range specifies one or more values to be grouped. Values in different ranges should not overlap.
label is a text string enclosed in quotation marks (' ').
Note: In a single PROC FORMAT step, you can use several VALUE statements to define several formats.

Specifying VALUE ranges


Range          Description
1 - 10         1 to 10 inclusive (1 <= x <= 10)
1 <- 10        Greater than 1 up through 10 (1 < x <= 10)
1 -< 10        1 up to but not including 10 (1 <= x < 10)
1 - 10, 15     1 through 10 and value 15 (1 <= x <= 10 or x=15)
1, 3, 5        Values 1, 3, and 5
Low - 10       Lowest non-missing value through 10 (x <= 10)
10 - High      10 through the highest non-missing value (x >= 10)
'a' - 'g'      First character of data value matches any letter from a through g, case sensitive
'a', 'd'       First character of data value matches a or d, case sensitive
Low - 'g'      Any first character of non-missing value through g, case sensitive
'g' - High     g through any first character of non-missing value, case sensitive
Other          Any value not specified elsewhere
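The range notations above can be tried out with a small format. The sketch below is illustrative only: the format name scorefmt, the data set scores_demo, and the cut-off values are assumptions, not part of the course material. It copies each value into a second variable so that the raw and formatted values print side by side. With these ranges, 49.9 should print as Fail, 50 and 69.5 as Pass, 70 as Distinction, and the missing value should fall under OTHER because LOW does not include missing values.

```sas
/* Hypothetical format: name and cut-offs are invented for illustration */
proc format;
   value scorefmt
      low -< 50  = 'Fail'                 /* below 50, excluding 50       */
      50  -< 70  = 'Pass'                 /* 50 up to but not including 70 */
      70  - high = 'Distinction'          /* 70 and above                 */
      other      = 'Missing or invalid';  /* everything else, incl. missing */
run;

data scores_demo;
   input score @@;
   grade = score;        /* same value, displayed through the format below */
   datalines;
49.9 50 69.5 70 .
;
run;

proc print data=scores_demo;
   format grade scorefmt.;
run;
```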

Associating User-Defined Formats with Variables


Example - Creating Temporary custom format

Without using format
data eg;
input age sex income colour$;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
. 1 44000 Y
58 2 83000 W
;
run;
proc print data=eg; run;

Creating Temporary custom format
proc format;
value gender 1='Male' 2='Female';
value agegroup low-18='Teen' 19-<65='Adult' 65-high='Elder' .='Missing';
value $col 'W'='White' 'B'='Blue' 'Y'='Yellow' 'G'='Green';
run;

proc print data=eg; format age agegroup. sex gender. colour $col. income dollar8.; run;

proc freq data=eg; table age/ missing; run;

proc freq data=eg; table age/ missing; format age agegroup.; run;

Without using format (continued)
proc means data=eg mean maxdec=0 missing; var income; class age; run;

Creating Temporary custom format (continued)
proc means data=eg mean maxdec=0 missing; var income; class age; format age agegroup.; run;

Example - Creating a SAS data set using custom format

Without using format
data eg;
input age sex income colour$;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
. 1 44000 Y
58 2 83000 W
;
run;

Creating a SAS data set using custom format
proc format;
value gender 1='Male' 2='Female';
value agegroup low-18='Teen' 19-<65='Adult' 65-high='Elder' .='Missing';
value $col 'W'='White' 'B'='Blue' 'Y'='Yellow' 'G'='Green';
run;
data eg;
input age sex income colour$;
format age agegroup. sex gender. colour $col. income dollar8.;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
. 1 44000 Y
58 2 83000 W
;
run;

Note: The user defined format must be created before the DATA step using the format

Permanent custom format


Example
libname mylib 'd:\temp';
options fmtsearch=(mylib);   /* Tell SAS to search for formats in this library */
proc format library=mylib;
value gender 1='Male' 2='Female';
value agegroup low-18='Teen' 19-<65='Adult' 65-high='Elder' .='Missing';
value $col 'W'='White' 'B'='Blue' 'Y'='Yellow' 'G'='Green';
run;
data mylib.eg;
input age sex income colour$;
format age agegroup. sex gender. colour $col. income dollar8.;
datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
. 1 44000 Y
58 2 83000 W
;
run;

Question.
If you don't specify the LIBRARY= option, your formats are stored in Work.Formats, and they exist
a. only for the current procedure.
b. only for the current DATA step.
c. only for the current SAS session.
d. permanently.
Correct answer: c
If you do not specify the LIBRARY= option, formats are stored in a default format catalog named Work.Formats. As the libref Work implies, any format that is stored in Work.Formats is a temporary format that exists only for the current SAS session.

Question.
Which of the following statements will store your formats in a permanent catalog?
a. libname library 'c:\sas\formats\lib'; proc format library=library ...;
b. libname library 'c:\sas\formats\lib'; format library=library ...;
c. library='c:\sas\formats\lib'; proc format library ...;
d. library='c:\sas\formats\lib'; proc library ...;
Correct answer: a
To store formats in a permanent catalog, you first write a LIBNAME statement to associate the libref with the SAS data library in which the catalog will be stored. Then add the LIBRARY= option to the PROC FORMAT statement, specifying the name of the catalog.

Question.
When creating a format with the VALUE statement, the new format's name cannot end with a number, cannot end with a period, cannot be the name of a SAS format, and
a. cannot be the name of a data set variable.
b. must be at least two characters long.
c. must be at least eight characters long.
d. must begin with a dollar sign ($) if used with a character variable.
Correct answer: d
The name of a format that is created with a VALUE statement must begin with a dollar sign ($) if it applies to a character variable.

Question.
Which of the following FORMAT procedures is written correctly?
a. proc format library=library value colorfmt; 1='Red' 2='Green' 3='Blue' run;
b. proc format library=library; value colorfmt 1='Red' 2='Green' 3='Blue'; run;
c. proc format library=library; value colorfmt; 1='Red' 2='Green' 3='Blue' run;
d. proc format library=library; value colorfmt 1='Red'; 2='Green'; 3='Blue'; run;
Correct answer: b
A semicolon is needed after the PROC FORMAT statement. The VALUE statement begins with the keyword VALUE and ends with a semicolon after all the labels have been defined.

Question.
Which of these is false? Ranges in the VALUE statement can specify
a. a single value, such as 24 or 'S'.
b. a range of numeric values, such as 0-1500.
c. a range of character values, such as 'A'-'M'.
d. a list of numeric and character values separated by commas, such as 90,'B',180,'D',270.
Correct answer: d
You can list values separated by commas, but the list must contain either all numeric values or all character values. Data set variables are either numeric or character.

Question.
How many characters can be used in a label?
a. 40
b. 96
c. 200
d. 256
Correct answer: d
When specifying a label, enclose it in quotation marks and limit the label to 256 characters.

Question.
Which keyword can be used to label missing values as well as any values that are not specified in a range?
a. LOW
b. MISS
c. MISSING
d. OTHER
Correct answer: d
MISS and MISSING are invalid keywords, and LOW does not include missing values. The keyword OTHER can be used in the VALUE statement to label missing values as well as any values that are not specifically included in a range.

Question.
You can place the FORMAT statement in either a DATA step or a PROC step. What happens when you place the FORMAT statement in a DATA step?
a. You temporarily associate the formats with variables.
b. You permanently associate the formats with variables.
c. You replace the original data with the format labels.
d. You make the formats available to other data sets.
Correct answer: b
By placing the FORMAT statement in a DATA step, you permanently associate the defined formats with variables.

Question.
The format JOBFMT was created in a FORMAT procedure. Which FORMAT statement will apply it to the variable JobTitle in the program output?
a. format jobtitle jobfmt;
b. format jobtitle jobfmt.;
c. format jobtitle=jobfmt;
d. format jobtitle='jobfmt';
Correct answer: b
To associate a user-defined format with a variable, place a period at the end of the format name when it is used in the FORMAT statement.


Chapter 3: Basic Programming


Contents
Understanding DATA Step Processing
Debugging In DATA Step
Single Observation From Multiple Records
Creating Variables - Assignment statements
Conditional Logic Statements
Processing Group of Variables
Selecting Variables And Observations
Calculations Across Observations
Reading Mixed Record Types
Reading Fixed Number of Repeating Fields
Reading Varying Number of Repeating Fields
Reading Hierarchical Raw Data Files
SAS Functions

Understanding DATA Step Processing


In Chapter 1, you learned how to write a DATA step to create a temporary or permanent SAS data set from raw data. When you submit a DATA step, SAS processes the DATA step and then creates a new SAS data set. In this section, you can learn more about how SAS processes the DATA step. A SAS DATA step is processed in two phases:

During the compilation phase, each statement is scanned for syntax errors. Most syntax errors prevent further processing of the DATA step. When the compilation phase is complete, the descriptor portion of the new data set is created. If the DATA step compiles successfully, then the execution phase begins. During the execution phase, the DATA step reads and processes the input data. The DATA step executes once for each record in the input file, unless otherwise directed.

Compilation Phase

1. Input Buffer
At the beginning of the compilation phase, the input buffer (an area of memory) is created to hold a record from the external file.

2. Program Data Vector


After the input buffer is created, the program data vector (PDV) is created. The PDV is the area of memory where SAS builds a data set, one observation at a time. The program data vector contains two automatic variables that can be used for processing but which are not written to the data set as part of an observation.
_N_ counts the number of times that the DATA step begins to execute.
_ERROR_ signals the occurrence of an error that is caused by the data during execution. The default value is 0, which means there is no error. _ERROR_ is set to 1 when one or more errors occur.
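The two automatic variables can be observed with a PUT statement, which writes to the SAS log. The following is a minimal sketch; the data set name auto_demo and the variable iteration are hypothetical. Because _N_ and _ERROR_ are not written to the data set, the sketch copies _N_ into an ordinary variable to keep it.

```sas
/* Hypothetical demo: auto_demo and iteration are illustrative names */
data auto_demo;
   input x;
   put _n_= _error_=;    /* writes the automatic variables to the log */
   iteration = _n_;      /* an ordinary variable, so it is kept in the data set */
   datalines;
10
20
30
;
run;

/* The printed data set contains x and iteration, but not _N_ or _ERROR_ */
proc print data=auto_demo;
run;
```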

Question
Suppose you run a program that causes three DATA step errors. What is the value of the automatic variable _ERROR_ when the observation that contains the third error is processed?
a. 0
b. 1
c. 2
d. 3
Correct answer: b

3. Syntax Checking
During the compilation phase, SAS also scans each statement in the DATA step, looking for syntax errors. Syntax errors include
missing or misspelled keywords
invalid variable names
missing or invalid punctuation
invalid options.

4. Data Set Variables


As the INPUT statement is compiled, each variable that appears in it is added to the PDV. Usually, variable attributes such as length and type are determined the first time a variable is encountered. In the example below, the variable ID is defined as a character variable and is assigned the default length of 8. Income and Expense are defined as numeric variables and are assigned the default length of 8.
Moreover, any variables that are created with an assignment statement in the DATA step are also added to the program data vector. For example, the assignment statement below creates the variable NetProfit. The attributes of the variable are determined by the expression (NetProfit=Income-Expense) in the statement. Because the expression produces a numeric value, NetProfit is also defined as a numeric variable and is assigned the default length of 8.
Example :
data profit;
input ID $ Income Expense;
NetProfit=Income-Expense;
datalines;
001 1000 2000
002 300 150
003 888 777
;
run;

PDV: _N_ | _ERROR_ | ID | Income | Expense | NetProfit

5. Descriptor Portion of the SAS Data Set


At the bottom of the DATA step (in this example, when the RUN statement is encountered), the compilation phase is complete, and the descriptor portion of the new SAS data set is created. The descriptor portion of the data set includes
the name of the data set
the number of observations and variables
the names and attributes of the variables.

At this point:
The example data set contains the four variables that are defined in the INPUT statement and in the assignment statement.
_N_ and _ERROR_ are not written to the data set.
There are no observations because the DATA step has not yet executed.
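One way to look at the descriptor portion is PROC CONTENTS. The sketch below re-creates the profit data set from the example and then lists its descriptor information. Note that it runs after the DATA step has finished executing, so the reported number of observations is 3 rather than 0; the variable names, types, and lengths are the ones fixed at compile time.

```sas
data profit;
   input ID $ Income Expense;
   NetProfit = Income - Expense;
   datalines;
001 1000 2000
002 300 150
003 888 777
;
run;

/* Lists the data set name, number of observations and variables,
   and each variable's name, type, and length */
proc contents data=profit;
run;
```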

Execution Phase
After the DATA step is compiled, it is ready for execution. During the execution phase, the data portion of the data set is created. The data portion contains the data values.
Example :
data profit;
input ID $ Income Expense;
NetProfit=Income-Expense;
datalines;
001 1000 2000
002 300 150
003 888 777
;
run;

PDV: _N_ | _ERROR_ | ID | Income | Expense | NetProfit

1. Set variables in the PDV to missing and update _N_ and _ERROR_ in the PDV
At the beginning of the execution phase, the value of _N_ is 1. Because there are no data errors, the value of _ERROR_ is 0. The remaining variables are initialized to missing. Missing numeric values are represented by periods, and missing character values are represented by blanks.

PDV: _N_ = 1 | _ERROR_ = 0 | ID = (blank) | Income = . | Expense = . | NetProfit = .

2. Put a new record to input buffer and read data value to the PDV
Input Buffer
1---+----10---+----20
001 1000 2000

PDV: _N_ = 1 | _ERROR_ = 0 | ID = 001 | Income = 1000 | Expense = 2000 | NetProfit = .

3. Executes additional executable statements in DATA step
The assignment statement (NetProfit=Income-Expense;) executes.

PDV: _N_ = 1 | _ERROR_ = 0 | ID = 001 | Income = 1000 | Expense = 2000 | NetProfit = -1000

4. End of the DATA Step

At the end of the DATA step, several actions occur. First, the values in the PDV are written to the output data set as the first observation.

SAS Data Set profit (after the first iteration): ID = 001 | Income = 1000 | Expense = 2000 | NetProfit = -1000

Next, control returns to the top of the DATA step and the value of _N_ is set to 2. Finally, the variable values in the program data vector are reset to missing. The automatic variable _ERROR_ is reset to 0 at the start of each iteration; in this example it was already 0, so its value does not change.

PDV: _N_ = 2 | _ERROR_ = 0 | ID = (blank) | Income = . | Expense = . | NetProfit = .

5. Iterations of the DATA Step


You can see that the DATA step works like a loop, repetitively executing statements to read data values and create observations one by one. Each loop (or cycle of execution) is called an iteration. At the beginning of the second iteration, the value of _N_ is set to 2, and _ERROR_ is still 0. The values from the second record are held in the input buffer and then read into the PDV.

Input Buffer
1---+----10---+----20
002 300 150

PDV: _N_ = 2 | _ERROR_ = 0 | ID = 002 | Income = 300 | Expense = 150 | NetProfit = .

The assignment statement then executes, and at the end of the iteration the values in the PDV are written to the data set as the second observation.

PDV: _N_ = 2 | _ERROR_ = 0 | ID = 002 | Income = 300 | Expense = 150 | NetProfit = 150

[Figure: SAS data set profit after the second iteration]

6. End-of-File Marker
The execution phase continues until the end-of-file marker is reached in the raw data file. When there are no more records in the raw data file to be read, the data portion of the new data set is complete.

Final SAS Data Set profit
ID     Income   Expense   NetProfit
001    1000     2000      -1000
002    300      150       150
003    888      777       111
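The iterations described above can be traced in the SAS log with PUT _ALL_, which writes every variable in the PDV, including _N_ and _ERROR_. This is only a tracing sketch added around the example program; the text labels are arbitrary. The "Top" line should appear four times, because the DATA step begins a fourth iteration and stops only when the INPUT statement finds no more records, while the "Bottom" line should appear three times.

```sas
data profit;
   put 'Top of iteration:    ' _all_;   /* variables have just been reset to missing */
   input ID $ Income Expense;
   NetProfit = Income - Expense;
   put 'Bottom of iteration: ' _all_;   /* values about to be written as an observation */
   datalines;
001 1000 2000
002 300 150
003 888 777
;
run;
```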

Question.
Which of the following is not created during the compilation phase?
a. the data set descriptor
b. the first observation
c. the program data vector
d. the _N_ and _ERROR_ automatic variables
Correct answer: b
Observations are not written until the execution phase.

Question.
During the compilation phase, SAS scans each statement in the DATA step, looking for syntax errors. Which of the following is not considered a syntax error?
a. incorrect values and formats
b. invalid options or variable names
c. missing or invalid punctuation
d. missing or misspelled keywords
Correct answer: a

Question.
Unless otherwise directed, the DATA step executes
a. once for each compilation phase.
b. once for each DATA step statement.
c. once for each record in the input file.
d. once for each variable in the input file.
Correct answer: c

Question.
At the beginning of the execution phase, the value of _N_ is 1, the value of _ERROR_ is 0, and the values of the remaining variables are set to
a. 0
b. 1
c. undefined
d. missing
Correct answer: d

Question.
Suppose you run a program that causes three DATA step errors. What is the value of the automatic variable _ERROR_ when the observation that contains the third error is processed?
a. 0
b. 1
c. 2
d. 3
Correct answer: b
The default value of _ERROR_ is 0, which means there is no error. When an error occurs, whether it is one error or multiple errors, the value is set to 1.

Question.
Which of the following actions occurs at the end of the DATA step?
a. The automatic variables _N_ and _ERROR_ are incremented by one.
b. The DATA step stops execution.
c. The descriptor portion of the data set is written.
d. The values of variables created in programming statements are re-set to missing in the program data vector.
Correct answer: d

Debugging In DATA Step


Types of errors in SAS programming
1. Syntax errors
Program statements do not conform to the rules of the SAS language. Syntax errors include:
missing or misspelled keywords
invalid variable names
missing or invalid punctuation
invalid options.
2. Data errors
Some data values are not consistent with the data type specified in a program, such as reading a character value for a SAS numeric variable.
3. Logic errors
Statements are free of syntax errors and data errors but do not produce the anticipated results, such as when a + sign is used instead of a - sign in a formula.
Note: SAS can detect and report all syntax errors and data errors but will not recognize logic errors.

When an error is detected by SAS, the SAS log
displays the word ERROR
identifies the possible location of the error
gives an explanation of the error.
SAS may or may not continue executing the statements, depending on the kind of error detected.

Some commonly made errors:
Omitting a semicolon
Specifying the wrong type for a variable
Specifying more variables in the INPUT statement than there are fields in the raw data
Unbalanced quotation marks
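A data error can be reproduced with a few lines. In the sketch below (the data set err_demo and its values are made up for illustration), the second record has a character value where a numeric variable is expected. SAS should still complete the step: it writes a note about invalid data for age to the log, sets _ERROR_ to 1 for that iteration, and stores a missing value for age in that observation.

```sas
/* Hypothetical demo data with one deliberate data error */
data err_demo;
   input name $ age;
   datalines;
Ann 25
Bob abc
Cat 31
;
run;

/* All three observations are created; age is missing for Bob */
proc print data=err_demo;
run;
```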

Question
What usually happens when a syntax error is detected?
a. SAS continues processing the step.
b. SAS continues to process the step, and the SAS log displays messages about the error.
c. SAS stops processing the step in which the error occurred, and the SAS log displays messages about the error.
d. SAS stops processing the step in which the error occurred, and the Output window displays messages about the error.
Correct answer: c
Syntax errors generally cause SAS to stop processing the step in which the error occurred. When a program that contains an error is submitted, messages regarding the problem also appear in the SAS log. When a syntax error is detected, the SAS log displays the word ERROR, identifies the possible location of the error, and gives an explanation of the error.

Question
A syntax error occurs when
a. some data values are not appropriate for the SAS statements that are specified in a program.
b. the form of the elements in a SAS statement is correct, but the elements are not valid for that usage.
c. program statements do not conform to the rules of the SAS language.
d. none of the above.
Correct answer: c

Question
How can you tell whether you have specified an invalid option in a SAS program?
a. A log message indicates an error in a statement that seems to be valid.
b. A log message indicates that an option is not valid or not recognized.
c. The message "PROC running" or "DATA step running" appears at the top of the active window.
d. You can't tell until you view the output from the program.
Correct answer: b
When you submit a SAS statement that contains an invalid option, a log message notifies you that the option is not valid or not recognized. You should recall the program, remove or replace the invalid option, check your statement syntax as needed, and resubmit the corrected program.

Multiple Observations From Single Record


Some raw data files may contain more than one observation per record.
@@ (double trailing @) line-hold specifier
Typically used to read multiple SAS observations from a single data line.
Syntax
INPUT varname1 @@;
It holds the data line in the input buffer across multiple executions of the DATA step.
It prevents SAS from loading a new record into the input buffer at each DATA step iteration unless
the end of the record is detected, or
another INPUT statement without a line-hold specifier is encountered.
Example:
data profit; input ID $ @@; input Department $5.;
It should not be used with the @ pointer control (discussed in a later section), with column input, or with the MISSOVER option.
Example:
Data Q6;
input X Y @@;
datalines;
1 2 3 4 5 6 7 8
11 12 13 14
21 22 23 24 25 26 27 28
;
run;

Single Observation From Multiple Records


Some raw data files may contain more than one record per object.
Example: each observation consists of 3 records.
Method 1: Multiple INPUT statements
The number of INPUT statements equals the number of records for an object.
Works when every observation has the same number of records.
Example
data Case8a; infile 'd:\temp\list7.txt'; input id 1-4 name $ 6-16; input gender $ 1; input weight_before 1-4 weight_after 6-9; run;

Method 2: / line-pointer control
Forces a new record into the input buffer and starts reading from the beginning of that record.
Works when every observation has the same number of records.
INPUT varname / varname / varname ;
Example

data Case8b; infile 'd:\temp\list7.txt'; input id 1-4 name $ 6-16 / gender $1 / weight_before 1-4 weight_after 6-9; run;

Method 3: #n line-pointer control
Puts multiple records into the input buffer and assigns the records to the PDV in any specified order.
INPUT #n1 varname #n2 varname #n3 varname ;
nX represents the record number in the input buffer.
Example
data Case8c; infile 'd:\temp\list7.txt'; input #2 gender $1 #1 id 1-4 name $ 6-16 #3 weight_before 1-4 weight_after 6-9; run;
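The examples above read from an external file whose contents are not shown. As a self-contained sketch, the same #n technique can be tried with in-stream data; the data set name case8_demo and all data values below are made up, laid out so that each object occupies three records with the fields in the columns that the INPUT statement expects.

```sas
/* Hypothetical in-stream data: three records per object */
data case8_demo;
   input #1 id 1-4 name $ 6-16
         #2 gender $ 1
         #3 weight_before 1-4 weight_after 6-9;
   datalines;
1001 Amy Wong
F
62.5 60.1
1002 Ben Lee
M
81.0 78.4
;
run;

/* Two observations are expected, one per group of three records */
proc print data=case8_demo;
run;
```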

Creating Variables - Assignment statements


Assignment statements are used to produce new information or to change the original information.
New information can be added to a SAS data set by creating new variables with an assignment statement in a DATA step.
Syntax
variable = expression;
The left-hand side must be a variable name.
expression may contain combinations of numeric or non-numeric constants, variables, SAS functions, and mathematical operators.
When the expression contains character data, the data must be enclosed in a pair of single (or double) quotation marks.

Mathematical Operators
Operation        Operator
Addition         +
Subtraction      -
Multiplication   *
Division         /
Exponentiation   **

SAS performs exponentiation first, then multiplication and division, followed by addition and subtraction.
Parentheses can be used to override this order.
Example:
var1 = 10 * 4 + 3 ** 2      gives var1 = 49
var1 = 10 * (4 + 3) ** 2    gives var1 = 490
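The order of operations can be checked directly in a DATA step that creates no data set. This short sketch evaluates the two expressions above plus a third, fully parenthesized variant (var3 is an added illustration) and writes the results to the log.

```sas
data _null_;
   var1 = 10 * 4 + 3 ** 2;        /* exponentiation first: 40 + 9          */
   var2 = 10 * (4 + 3) ** 2;      /* parentheses, then **, then *: 10 * 49 */
   var3 = (10 * (4 + 3)) ** 2;    /* whole product squared: 70 ** 2        */
   put var1= var2= var3=;         /* log: var1=49 var2=490 var3=4900       */
run;
```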

Example

data case1;
infile datalines delimiter=',';
input name $ tomato cucumber peas grapes;
zone=14;
type='Home';
cucumber=cucumber*10;
total= tomato + cucumber + peas + grapes;
tomato_percent = tomato / total*100;
datalines;
David,10,2,40,0
Mary,15,5,10,1000
Francis,50,10,15,50
Tom,20,0, . ,20
;
run;

Note:
SAS executes each statement once during each iteration of the DATA step.
If a variable has already been assigned a value in the PDV, SAS replaces the original value with the new one.
The variable peas had a missing value for the last observation. Variables calculated from peas were also set to missing.
Note: The order of the assignment statements and the INPUT statement affects the assigned values. For example, if the two statements below are placed in this order, total is computed from the original value of cucumber rather than from the value multiplied by 10:
total= tomato + cucumber + peas + grapes; cucumber=cucumber*10;
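The effect of statement order can be seen by running both orders on the same record. The sketch below uses the values from the first observation of the example (tomato 10, cucumber 2, peas 40, grapes 0) in two small DATA steps that write to the log instead of creating data sets.

```sas
/* Order A: cucumber is rescaled before total is computed */
data _null_;
   input tomato cucumber peas grapes;
   cucumber = cucumber * 10;
   total = tomato + cucumber + peas + grapes;
   put 'Order A: ' cucumber= total=;    /* cucumber=20 total=70 */
   datalines;
10 2 40 0
;
run;

/* Order B: total is computed from the original cucumber value */
data _null_;
   input tomato cucumber peas grapes;
   total = tomato + cucumber + peas + grapes;
   cucumber = cucumber * 10;
   put 'Order B: ' cucumber= total=;    /* cucumber=20 total=52 */
   datalines;
10 2 40 0
;
run;
```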

Conditional Logic Statements


IF-THEN statements
The IF-THEN statement executes a SAS statement when the condition in the IF clause is true.
IF condition THEN statement;
where
condition is any valid SAS expression (e.g. VAR1 >= 10)
statement is what SAS should do when the condition is true, often an assignment statement

Comparison Operators
Operator     Comparison Operation
= or eq      equal to
^= or ne     not equal to
> or gt      greater than
< or lt      less than
>= or ge     greater than or equal to
<= or le     less than or equal to
in           equal to one of a list

Example

if test<85 and time<=20 then Status='RETEST'; if region in ('NE','NW','SW') then Rate=fee-25; if target>300 or sales<50000 then Bonus=salary*.05;

Logical Operators
Operator     Logical Operation
&            and
|            or
^ or ~       not

Use the AND operator to execute the THEN statement if both expressions that are linked by AND are true.
Example if status='OK' and type=3 then Count+1; if (age^=agecheck or time^=3) & error=1 then Test=1;

Use the OR operator to execute the THEN statement if either expression that is linked by OR is true.
Example if status='S' or cond='E' then Control='Stop';

Use the NOT operator with other operators to reverse the logic of a comparison.
Example if not(loghours<7500) then Schedule='Quarterly'; if region not in ('NE','SE') then Bonus=200;

Character values must be specified in the same case in which they appear in the data set and must be enclosed in quotation marks.
Example if status='OK' and type=3 then Count+1; if status='S' or cond='E' then Control='Stop'; if not(loghours<7500) then Schedule='Quarterly'; if region not in ('NE','SE') then Bonus=200;

Logical comparisons that are enclosed in parentheses are evaluated as true or false before they are compared to other expressions. For example, in the statement if (age^=agecheck or time^=3) & error=1 then Test=1; the OR comparison in parentheses is evaluated first, and its result is then combined with error=1 by the AND operator.
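Parentheses matter because, without them, SAS evaluates AND before OR. The sketch below uses three made-up variables (a, b, c) to show the same comparisons giving different results with and without parentheses.

```sas
data _null_;
   a = 1; b = 0; c = 0;

   /* No parentheses: evaluated as a=1 or (b=1 and c=1), which is true */
   if a=1 or b=1 and c=1 then put 'Without parentheses: true';
   else put 'Without parentheses: false';

   /* Parentheses force the OR first: (true) and c=1, which is false */
   if (a=1 or b=1) and c=1 then put 'With parentheses: true';
   else put 'With parentheses: false';
run;
```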

SAS sets the length of a character variable the first time the variable appears in the DATA step

Example
data case2;
input var1 @@;
if var1>20 then var2='Big';   /* sets the length of var2 to 3 */
if 11<=var1<=20 then var2='Medium';
if var1<11 then var2='Small';
datalines;
5 15 25
;
run;

Example (continued)
data case2;
input var1 @@;
if 11<=var1<=20 then var2='Medium';   /* sets the length of var2 to 6 */
if var1>20 then var2='Big';
if var1<11 then var2='Small';
datalines;
5 15 25
;
run;

data case2;
input var1 @@;
length var2 $ 8;   /* sets the length of var2 to 8 */
if var1>20 then var2='Big';
if 11<=var1<=20 then var2='Medium';
if var1<11 then var2='Small';
datalines;
5 15 25
;
run;

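A missing value of a numeric variable is smaller than any specified numeric value, so in the example below the observation with a missing age satisfies age <= 18 and is placed in group A. Example: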
data case3; input age @@; if age <=18 then agroup='A'; if 18<age<30 then agroup ='B'; if 31<= age then agroup='C'; datalines; 14 . 25 19 ; run;
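One possible way to keep missing ages out of group A is to test for the missing value explicitly and to give the lowest range a lower bound. The sketch below is an illustration rather than part of the original example: the data set name case3_checked, the group code 'U', and the adjusted upper groups are assumptions. With the same four input values, the groups should come out as A, U, B, B.

```sas
/* Hypothetical variant of the example with an explicit missing-value check */
data case3_checked;
   input age @@;
   if age = . then agroup = 'U';             /* unknown age         */
   if . < age <= 18 then agroup = 'A';       /* missing is excluded */
   if 18 < age <= 30 then agroup = 'B';
   if age > 30 then agroup = 'C';
   datalines;
14 . 25 19
;
run;

proc print data=case3_checked;
run;
```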

IF-THEN blocks
To execute more than one action when the condition is true:
IF condition THEN DO; statements; statements; END;
Example:
data case4; input course $ @@; if course='MS3215' then do; lecturer = 'AB Chan'; class_size=45; end; if course='MS3216' then do; lecturer = 'CD Ma'; class_size=30; end; datalines; MS3215 MS3216 MS3217 ; run;

IF-THEN-ELSE statements / IF-THEN-ELSE blocks
To put a number of related IF-THEN statements / IF-THEN blocks together

IF-THEN-ELSE statement:
IF condition THEN statement;
ELSE IF condition THEN statement;
ELSE IF condition THEN statement;
ELSE statement;

IF-THEN-ELSE block:
IF condition THEN DO;
statements;
statements;
END;
ELSE IF condition THEN DO;
statements;
statements;
END;

Example (IF-THEN-ELSE):
if var1=. then var2='Unknown'; else if var1 <11 then var2='Small'; else if 11<= var1<=20 then var2 ='Medium'; else if 45 >=var1 >20 then var2='Big'; else var2='Very Big';

Example (IF-THEN-ELSE block):

data case4; input course $ @@; if course='MS3215' then do; lecturer = 'AB Chan'; class_size=45; end; else if course='MS3216' then do; lecturer = 'CD Ma'; class_size=30; end; else do; lecturer = 'Other'; class_size=.; end; datalines; MS3215 MS3216 MS3217 ; run;

Processing Group of Variables


In DATA step programming, you often need to perform the same action on more than one variable. Although you can process variables individually, it is easier to handle them as a group. You can do this by using array processing.

Array statement
Defines a set of variables to be processed as a group. Any variables can be grouped as an array as long as they are either all numeric or all character.

Syntax
ARRAY arrayname[n] <$> variable_list;
- arrayname names the array. It must not be the name of a variable in the same DATA step. arrayname is not a variable and it will not appear in the PDV or in the created SAS data set.
- n is the number of variables grouped in the array. n must be enclosed in ( ), { }, or [ ].
- $ is needed if the variables are character type and have not been defined before the ARRAY statement.
- arrayname[n] in an assignment statement refers to the nth element of the array as defined in the ARRAY statement, n = 1, 2, ... In the example below, newarray[1] is var1, newarray[2] is var2 and newarray[3] is var3.

Example: Use of array
data case5;
  array newarray[3] var1 var2 var3;
  newarray[1]=1;   /* var1 */
  newarray[2]=2;   /* var2 */
  newarray[3]=3;   /* var3 */
run;

Question.
Which statement is false regarding an ARRAY statement?
a. It is an executable statement.
b. It can be used to create variables.
c. It must contain either all numeric or all character elements.
d. It must be used to define an array before the array name can be referenced.
Correct answer: a

Question.
What belongs within the braces of this ARRAY statement?
array contrib{?} qtr1-qtr4;
a. quarter
b. quarter*
c. 1-4
d. 4
Correct answer: d

Question.
For the program below, select an iterative DO statement to process all elements in the contrib array.
data work.contrib;
  array contrib{4} qtr1-qtr4;
  ...
    contrib{i}=contrib{i}*1.25;
  end;
run;
a. do i=4;
b. do i=1 to 4;
c. do until i=4;
d. do while i le 4;

Correct answer: b

Question.
What is the value of the index variable that references Jul in the statements below?
array quarter{4} Jan Apr Jul Oct;
do i=1 to 4;
  yeargoal=quarter{i}*1.2;
end;
a. 1
b. 2
c. 3
d. 4
Correct answer: c

DO loop - To process an array of variables iteratively
Syntax
DO index_variable = k TO m <BY increment_amount>;
  SAS statements
END;
- index_variable is a variable that changes value at each iteration of the loop. Iteration starts with the value k (k often equals 1).
- increment_amount is a numeric variable or constant that controls how the value of index_variable changes. The default value is 1.
- At END, index_variable changes by increment_amount.
- Iteration continues until the value of index_variable is greater than m.
Example:
data case5a;
  array newarray[3] var1 var2 var3;
  do i=1 to 3;
    newarray[i]=i;
  end;
run;

Variable i can be dropped from the data set by including a DROP statement in the DATA step:
data case5a;
  array newarray[3] var1 var2 var3;
  do i=1 to 3;
    newarray[i]=i;
  end;
  drop i;
run;
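The same loop pattern is useful when one rule must be applied to several variables. The following is a sketch only: the data set name case5b, the variables s1-s3, and the convention that 999 stands for "not recorded" are all assumptions made for illustration.

```sas
data case5b;
  input s1 s2 s3;
  array scores[3] s1 s2 s3;
  do i = 1 to 3;
    /* recode the assumed sentinel value 999 to a SAS missing value */
    if scores[i] = 999 then scores[i] = .;
  end;
  drop i;
  datalines;
10 999 30
999 20 40
;
run;
```

Without the array, the IF-THEN statement would have to be written once for each of the three variables.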

Abbreviated list of variable names
To replace a regular list of variable names.

Numbered range lists
Variables which start with the same characters and end with consecutive numbers. The numbers can start and end anywhere as long as the number sequence between them is complete.
Example:
Regular variable list                     Abbreviated list
INPUT var6 var7 var8 var9;                INPUT var6 - var9;
ARRAY narray(4) var6 var7 var8 var9;      ARRAY narray(4) var6 - var9;
PROC PRINT DATA = data1;                  PROC PRINT DATA = data1;
VAR var6 var7 var8 var9;                  VAR var6 - var9;

Example: Array ALL has 20 numeric elements. Write DO statements that refer to the following elements:
a. All elements
b. Even-numbered elements
c. Every third element, beginning with 1 (i.e. 1, 4, 7, ...)

(a)
data case_a;
  array ALL[20] var1-var20;
  do k=1 to 20;
    ALL[k]=k;
  end;
run;

(b)
data case_b;
  array ALL[20] var1-var20;
  do k=2 to 20 by 2;
    ALL[k]=k;
  end;
run;

(c)
data case_c;
  array ALL[20] var1-var20;
  do k=1 to 20 by 3;
    ALL[k]=k;
  end;
run;

Selecting Variables And Observations


Selecting variables - You can choose which variables in the PDV are written to the SAS data set.
Sometimes you might need to read and process fields that you don't want to keep in your data set. In this case, you can use the DROP statement or the KEEP statement to specify the variables that you want to drop or keep.

The DROP statement specifies a list of variables not to write to output data sets.
DROP variable_list ;
where variable_list identifies the variables to drop.

The KEEP statement specifies a list of variables to write to output data sets.
KEEP variable_list ;
where variable_list identifies the variables to keep.

Example:
data case6;
  infile datalines delimiter=',';
  input stud_id $ quiz1-quiz5;
  array quiz[5] quiz1-quiz5;
  quiz_sum=0;
  do i=1 to 5;
    quiz_sum=quiz_sum + quiz[i];
  end;
  quiz_mean=quiz_sum/5;
  keep stud_id quiz_mean;
  /* or, equivalently: drop quiz1-quiz5 quiz_sum i; */
  datalines;
S1,45,33,60,75,80
S2,67,58,75,69,55
;
run;

Selecting observations
By default, SAS writes an observation to the SAS data set at the end of each DATA step iteration. Using an OUTPUT statement in an IF-THEN statement makes SAS write an observation only when a condition is true.

Example:

data case7;
  infile datalines delimiter=',';
  input age sex $ @@;
  if sex='f' then output;
  datalines;
25, m, 18, f, 19, m, 20, m, 21, f
;
run;

Note: If the value created by an assignment statement is to be kept in the SAS data set, the assignment statement must be placed before the OUTPUT statement.
Example:
data case7;
  infile datalines delimiter=',';
  input age sex $ @@;
  if sex='f' then do;
    newvar=1;     /* assigned before OUTPUT, so the value is written */
    output;
  end;
  datalines;
25, m, 18, f, 19, m, 20, m, 21, f
;
run;

Example:
data case7;
  infile datalines delimiter=',';
  input age sex $ @@;
  if sex='f' then do;
    output;
    newvar=1;     /* assigned after OUTPUT, so newvar is missing in the data set */
  end;
  datalines;
25, m, 18, f, 19, m, 20, m, 21, f
;
run;

Writing observations to multiple data sets
To write observations to a selected SAS data set, specify the SAS data set name in the OUTPUT statement. Any SAS data set name that appears in the OUTPUT statement must also appear in the DATA statement.
Example:
data Q45am Q45pm;
  input group :$10. class :$10. enclosure $ fedtime $;
  if fedtime='am' then output Q45am;
  else if fedtime='pm' then output Q45pm;
  else if fedtime='both' then output Q45am Q45pm;
  datalines;
bears Mammalia E2 both
elephants Mammalia W3 am
flamingos Aves W1 pm
frogs Amphibia S2 pm
kangaroos Mammalia N4 am
lions Mammalia W6 pm
snakes Reptilia S1 pm
tigers Mammalia W2 both
zebras Mammalia W2 am
;
run;

Calculations Across Observations

Retaining the Values of Variables
RETAIN statement - Stops SAS from resetting the named variables to missing in the PDV at the start of each iteration
RETAIN variable1 <init_value1> variable2 <init_value2> ;
- A RETAIN statement can specify both numeric and character variables
- <init_valueN> is optional and specifies the starting value of each variable
Example: Calculate the running total
data case8;
  input month $ sales @@;
  acc_sales= acc_sales + sales;
  retain acc_sales 0;       /* starting value of acc_sales = 0 */
  datalines;
Jan 3500 Feb 2888 Mar 887 Apr 698 May 6789 Jun 906
;
run;

Example: Carry values from one observation forward to the observations that follow (EX2_DATA1.TXT)

Write a SAS DATA step to create a data set which contains the name and sales date in every observation.
data Q50;
  infile 'F:\SAS\sas\ex3\Ex2_Data1.txt';
  input name $1-15 @16 salesdate date11. salesamount 31-35;
  if name^=' ' then do;
    oldname=name;               /* remember the latest nonblank name and date */
    oldsalesdate=salesdate;
  end;
  else if name=' ' then do;
    name=oldname;               /* fill in from the retained values */
    salesdate=oldsalesdate;
  end;
  retain oldname oldsalesdate;
  drop oldname oldsalesdate;
  format salesdate date9.;
run;

Effect of missing values on running totals
Operations performed on missing values generate missing values.
Example:
data case8;
  input month $ sales @@;
  acc_sales= acc_sales + sales;
  retain acc_sales 0;       /* starting value of acc_sales = 0 */
  datalines;
Jan 3500 Feb 2888 Mar 887 Apr . May 6789 Jun 906
;                           /* the sales value for April is missing */
run;

Solution: adding an IF-THEN statement

data case8;
  input month $ sales @@;
  if sales ^=. then acc_sales= acc_sales+sales;
  retain acc_sales 0;       /* starting value of acc_sales = 0 */
  datalines;
Jan 3500 Feb 2888 Mar 887 Apr . May 6789 Jun 906
;
run;

Sum statement
Retains values from the previous iteration of the DATA step in order to cumulatively add the value of a variable across observations.
variable + expression;
- variable specifies the name of the accumulator variable, which must be numeric. variable is automatically set to 0 before the first observation is read. The value of variable is retained from one DATA step execution to the next.
- expression contains the value to be added to the variable. expression can be a variable or a constant.

The Sum statement adds the result of the expression that is on the right side of the plus sign (+) to the numeric variable that is on the left side of the plus sign. At the beginning of the DATA step, the value of the numeric variable is not set to missing as it usually is when reading raw data. Instead, the variable retains the new value in the program data vector for use in processing the next observation.
Note: The Sum statement is one of the few SAS statements that doesn't begin with a keyword.
Note: If the expression produces a missing value, the Sum statement treats it like a zero. (By contrast, in an assignment statement, a missing value is assigned if the expression produces a missing value.)
Example:

data case8;
  input month $ sales @@;
  acc_sales + sales;
  datalines;
Jan 3500 Feb 2888 Mar 887 Apr . May 6789 Jun 906
;
run;
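To make the behaviour of the Sum statement concrete, the sketch below writes the same accumulation out with a RETAIN statement and the SUM function. The data set name case8_equiv is hypothetical; the data values are those of case8.

```sas
data case8_equiv;
  input month $ sales @@;
  retain acc_sales 0;                   /* keep the value across iterations, start at 0 */
  acc_sales = sum(acc_sales, sales);    /* the SUM function ignores missing arguments */
  datalines;
Jan 3500 Feb 2888 Mar 887 Apr . May 6789 Jun 906
;
run;
```

In this form the two features of the Sum statement are visible separately: the retained, zero-initialized accumulator and the treatment of a missing sales value as zero.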

Reading Mixed Record Types


A raw data file may have more than one type of record layout, e.g. variables with different formats in different records.
Example: Records with different date formats
data case9;
  infile datalines delimiter=',';
  input salesid $ location $ ;
  if location='USA' then input saledate : mmddyy10. amount;
  if location='EUR' then input saledate : date9. amount;
  datalines;
101, USA, 1-20-2008,3445
433,EUR,30Mar2008,432.3
102,USA,4-12-2008,5320
444,EUR,26Apr2008,3433.3
;
run;

Error! The first INPUT statement releases the record when it finishes, so the second INPUT statement reads from the next record instead of reading the remaining fields of the current one.

Solution: adding @ (single trailing @) line-hold specifier

@ (single trailing @) line-hold specifier
Holds the record in the input buffer until the last statement of the DATA step is executed, or until another INPUT statement without a line-hold specifier is encountered.
Note: The term trailing indicates that the @ must be the last item that is specified in the INPUT statement. E.g.
input salesid $ location $ @ ;

Example: Records with different date formats


data case9;
  infile datalines delimiter=',';
  input salesid $ location $ @;
  if location='USA' then input saledate : mmddyy10. amount;
  if location='EUR' then input saledate : date9. amount;
  datalines;
101, USA, 1-20-2008,3445
433,EUR,30Mar2008,432.3
102,USA,4-12-2008,5320
444,EUR,26Apr2008,3433.3
;
run;

Reading Fixed Number of Repeating Fields


Example:

Each record in temp.txt consists of a group's ID followed by three experimental results. How can each group's ID be paired with one result per observation, so that three observations are derived from each record?

data temp;
  infile 'D:\temp.txt';
  input id $ @;
  input result @;
  output;
  input result @;
  output;
  input result @;
  output;
run;

Alternative:

data temp;
  infile 'D:\temp.txt';
  input id $ @;
  do i=1 to 3;
    input result @;
    output;
  end;
  drop i;
run;
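The contents of temp.txt are not shown, so the sketch below uses in-stream data instead. The data set name temp_demo, the group IDs, and the result values are all invented for illustration; only the INPUT and OUTPUT logic follows the example above.

```sas
data temp_demo;
  input id $ @;            /* read the ID and hold the record */
  do i = 1 to 3;
    input result @;        /* read the next result from the held record */
    output;                /* one observation per result */
  end;
  drop i;
  datalines;
G1 5.1 4.8 5.0
G2 6.2 6.0 5.9
;
run;
```

Each of the two records yields three observations, and every observation carries the ID of the group it came from.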

Reading Varying Number of Repeating Fields


DO-WHILE loop statement
To execute a DO loop until a condition is reached or while a condition exists, without specifying the number of iterations required.
DO WHILE (condition) ;
  SAS statements
END ;
condition is a valid SAS condition enclosed in parentheses.

Example: In EXE2_DATA5.txt, the first field is the ID of the student and the second field is the number of examination scores in that record. Create a SAS data set which contains 2 variables only, namely the student ID and the examination score. The number of observations for each student equals the number of examination scores for that student.
data Q64;
  infile 'F:\SAS\sas\ex3\Exe2_Data5.txt' missover;
  input id $ no score @;
  do while (score ^= .);    /* score is numeric, so test against the numeric missing value */
    output;
    input score @;          /* with MISSOVER, score is missing when no fields remain */
  end;
  drop no;
run;

Alternative:

data Q64;
  infile 'F:\SAS\sas\ex3\Exe2_Data5.txt';
  input id $ no @;
  do i=1 to no;
    input score @;
    output;
  end;
  keep id score;
run;

Reading Hierarchical Raw Data Files


Introduction
Raw data files can be hierarchical in structure, consisting of a header record and one or more detail records. Typically, each record contains a field that identifies the record type. Here, Employee indicates a header record that contains an employee's first name and last name. Dependent indicates a detail record that contains the name, relationship and age of an employee's dependant.

Raw Data File - LIST2_3.TXT
Employee,Adams,Cheung        header record
Dependent,Machael,C,15       detail record
Dependent,Machael,C,13       detail record
Employee,Thomas,Leung        header record
Dependent,Susan,S,26         detail record
Employee,Lewis,Chan          header record
Dependent,Richard,C,8        detail record
Employee,Dansky,Wong         header record
Employee,Nicholls,Tsang      header record
Dependent,Robert,C,12        detail record
Employee,Mary,Fong           header record
Dependent,John,S,40          detail record

You can build a SAS data set from a hierarchical file by creating one observation per detail record and storing each header record as part of the observation.
[Figure: SAS data set with one observation per detail record]

You can also build a SAS data set from a hierarchical file by creating one observation per header record and combining the information from detail records into summary variables.
[Figure: SAS data set with one observation per header record]

In this section, you learn how to read from a hierarchical file and create a SAS data set that contains either one observation for each detail record or one observation for each header record.

Creating One Observation Per Detail Record


Step 1. Retaining the Values of Variables
As you write the DATA step to read this file, remember that you want to keep the header record as a part of each observation until the next header record is encountered. To do this, you need to use a RETAIN statement to retain the values for empfname and emplname across iterations of the DATA step. Next, you need to read the first field in each record, which identifies the record's type. You also need to use the single trailing @ line-hold specifier to hold the current record so that the other values in the record can be read.
data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',';
  input type : $9. @;
  retain empfname emplname;

Step 2. Conditionally Executing SAS Statements
You can use the value of type to identify each record. If type is Employee, execute an INPUT statement to read the values for first name (empfname) and last name (emplname). However, if type is Dependent, then execute an INPUT statement to read the values for first name (depfname), relation, and age. You can tell SAS to perform a given task based on a specific condition by using an IF-THEN statement.
data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',';
  input type : $9. @;
  if type='Employee' then input empfname : $15. emplname : $15.;
  else if type='Dependent' then do;
    input depfname : $15. relation $ age;
  end;
  retain empfname emplname;

Step 3. Reading a Detail Record
Now think about what needs to happen when a detail record is read. Remember, you want to write an observation to the data set only when the value of type is Dependent. An OUTPUT statement inside the IF-THEN block makes SAS write an observation only when the condition is true (i.e. type is Dependent).
data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',';
  input type : $9. @;
  if type='Employee' then input empfname : $15. emplname : $15.;
  else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    output;
  end;
  retain empfname emplname;

Step 4. Dropping Variables and Final SAS Data Set
Because type is useful only for identifying a record's type, drop the variable from the data set.
data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',';
  input type : $9. @;
  if type='Employee' then input empfname : $15. emplname : $15.;
  else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    output;
  end;
  retain empfname emplname;
  drop type;
run;

[Figure: SAS data set with one observation per detail record]
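For readers who do not have LIST2_3.TXT on disk, the sketch below runs the same logic on in-stream data. It uses only the first five records of the file as listed earlier; the data set name case12_demo is hypothetical.

```sas
data case12_demo;
  infile datalines delimiter=',';
  input type : $9. @;                      /* read the record type, hold the record */
  if type='Employee' then input empfname : $15. emplname : $15.;
  else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    output;                                /* one observation per detail record */
  end;
  retain empfname emplname;                /* carry the header values forward */
  drop type;
  datalines;
Employee,Adams,Cheung
Dependent,Machael,C,15
Dependent,Machael,C,13
Employee,Thomas,Leung
Dependent,Susan,S,26
;
run;
```

These five records produce three observations: two for the dependants listed under Adams and one for the dependant listed under Thomas.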

Creating One Observation Per Header Record


Refer to LIST2_3.TXT. Suppose you want to generate a SAS data set that contains a list of all employees and their monthly payroll deduction for insurance, such that:
- Insurance is free for the employee
- Each employee pays $100 per month for a spouse's (S) insurance, if applicable
- Each employee pays $60 per month for a child's (C) insurance, if applicable

Step 1. Retaining the Values of Variables
As you write the DATA step to read this file, you need to think about performing several tasks. First, the values of empfname and emplname must be retained as detail records are read and summarized. Next, the value of type must be read in order to determine whether the current record is a header record or a detail record. Add a single trailing at sign (@) to hold the record so that another INPUT statement can read the remaining values.
data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',';
  input type : $9. @;
  retain empfname emplname;

Step 2. DO Group Actions for Header Records
To execute multiple SAS statements based on the value of a variable, you can use a simple DO group with an IF-THEN statement. When the condition type='Employee' is true, you need to execute several statements.

data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',';
  input type : $9. @;
  if type='Employee' then do;

First, you need to determine whether this is the first header record in the external file. You do not want the first header record to be written as an observation until the related detail records have been read and summarized. _N_ is an automatic variable whose value is the number of times the DATA step has begun to execute. The expression _n_^=1 defines a condition where the DATA step has executed more than once. Use this expression in conjunction with the previous IF-THEN statement to check for two conditions: when type='Employee' and _n_^=1 are both true, an OUTPUT statement is executed. Thus, each header record except for the first one causes an observation to be written to the data set.
data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',';
  input type : $9. @;
  if type='Employee' then do;
    if _n_^=1 then output;
    input empfname : $15. emplname : $15.;
    insurance_cost=0;
  end;

An INPUT statement reads the values of empfname and emplname. An assignment statement creates the summary variable insurance_cost and sets its value to 0.

Step 3. Reading Detail Records
When the value of type is not Employee, you need to define an alternative action. You can do this by adding an ELSE statement to the IF-THEN statement. If its value is 'Dependent', then continue to read the values of the first name, relation, and age. You want to add the cost for each person who is represented by a detail record and store the accumulated value in the summary variable insurance_cost. You have already initialized the value of insurance_cost to 0 each time a header record is read. Now, as each detail record is read, you can increment the value of insurance_cost by using a Sum statement:
- If relation = 'S', increase the cost of insurance by 100.
- If relation = 'C', increase the cost of insurance by 60.

data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',';
  input type : $9. @;
  if type='Employee' then do;
    if _n_^=1 then output;
    input empfname : $15. emplname : $15.;
    insurance_cost=0;
  end;
  else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    if relation='S' then insurance_cost+100;
    if relation='C' then insurance_cost+60;
  end;
  retain empfname emplname;
  keep empfname emplname insurance_cost;   /* insurance_cost must be kept, it is the result */
run;

Step 4. Determining the End of the External File and Final SAS Data Set
Your program writes an observation to the data set only when another header record is read and the DATA step has executed more than once. But after the last detail record is read, there are no more header records to cause the last observation to be written to the data set. You need to determine when the last record in the file is read so that you can then execute another explicit OUTPUT statement. You can determine when the current record is the last record in an external file by specifying the END= option in the INFILE statement.
INFILE 'file-name' END = variable_name ;
- variable_name is any valid SAS variable name that is not included in the INPUT statement or other assignment statements in the same DATA step
- Equals 1 if the current record is the last record in the raw data file; 0 otherwise
- Remains 0 until SAS processes the last data record
- Appears in the PDV but is not written to the SAS data set

data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',' end=eofile;
  input type : $9. @;
  if type='Employee' then do;
    if _n_^=1 then output;                 /* write the previous employee */
    input empfname : $15. emplname : $15.;
    insurance_cost=0;
  end;
  else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    if relation='S' then insurance_cost+100;
    if relation='C' then insurance_cost+60;
  end;
  if eofile=1 then output;                 /* write the last employee */
  retain empfname emplname;
  keep empfname emplname insurance_cost;   /* insurance_cost must be kept, it is the result */
run;

[Figure: SAS data set with one observation per header record]

SAS Functions
A SAS function performs a computation on one or more variables over the same observation and returns a value. SAS functions include mathematical functions, statistical functions, date functions, character functions, and others.

SAS function syntax
Function_name(<argument1> <, ..., argumentn>)
Function_name(OF abbreviated_variable_list)
Function_name(OF array_name[*])
- Function_name must be followed by a pair of parentheses
- If used in an assignment statement, the function must be placed on the right hand side
- The parentheses may contain one argument, more than one argument, or no argument (i.e. empty parentheses)
- The argument can be a variable name, a constant, another SAS function, or a valid SAS expression
- Multiple arguments are separated by commas

Mathematical functions
Function name      Description
ABS(argument)      Returns a nonnegative number that is equal in magnitude to that of the argument
EXP(argument)      Returns the value of the exponential function
LOG(argument)      Returns the natural (base e) logarithm
LOG10(argument)    Returns the logarithm to the base 10
SQRT(argument)     Returns the square root of a value
Example:

data test;
  input quantity @@;
  abs_quantity=abs(quantity);
  log_quantity=log(abs_quantity);
  sqrt_quantity=sqrt(abs_quantity);
  datalines;
1244 -1898 34232 10 242
;
run;

Truncation functions
Function name                     Description
INT(argument)                     Returns the integer portion of the argument
ROUND(argument)                   Returns the nearest integer to the argument
ROUND(argument, rounding_unit)    Rounds the first argument to the nearest multiple of the second argument

Example:

data test;
  x1=int(10.499);          /* 10 */
  x2=int(10.599);          /* 10 */
  x3=round(10.49);         /* 10 */
  x4=round(10.5);          /* 11 */
  x5=round(10.51);         /* 11 */
  x6=round(10.449,0.01);   /* 10.45 */
  x7=round(10.501,0.01);   /* 10.5 */
  x8=round(10.504,0.05);   /* 10.5 */
  x9=round(13,2);          /* 14 */
run;

Statistical functions
Function name                       Description
sum(argument, argument,...)         sum of values
mean(argument, argument,...)        average of nonmissing values
min(argument, argument,...)         minimum value
max(argument, argument,...)         maximum value
median(argument, argument,...)      median value
var(argument, argument,...)         variance of the values
std(argument, argument,...)         standard deviation of the values
N(argument, argument,...)           the number of nonmissing values
NMISS(argument, argument,...)       the number of missing values

Example: The following figure displays the first few records of a raw data set containing the student quiz scores. The first line is not part of the data set. If a student took all five quizzes, the lowest of the five quiz scores is dropped. Write a program that will compute the average quiz score based on this decision. If a student took fewer than five quizzes, compute the average of the non-missing quizzes.

data Q77;
  input ID $ Q1-Q5;
  nmiss=nmiss(of Q1-Q5);
  if nmiss=0 then average=(sum(of Q1-Q5)-min(of Q1-Q5))/4;
  /*if n=5 then average=(sum(of Q1-Q5)-min(of Q1-Q5))/4;*/
  else average=mean(of Q1-Q5);
  drop nmiss;
  datalines;
1 85 76 79 80 85
2 . 56 65 72 81
3 44 49 . . 54
;
run;
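The syntax list earlier in this section also allows the form Function_name(OF array_name[*]), which Q77 does not use. The following sketch shows that form. The data set name quiz_stats, the array name q, and the new variable names are assumptions; the two data lines are taken from Q77.

```sas
data quiz_stats;
  input id $ q1-q5;
  array q[5] q1-q5;
  n_taken = n(of q[*]);       /* number of nonmissing quiz scores */
  best    = max(of q[*]);     /* highest nonmissing score */
  average = mean(of q[*]);    /* missing scores are ignored */
  datalines;
1 85 76 79 80 85
2 . 56 65 72 81
;
run;
```

Passing the whole array with [*] gives the same result as writing of q1-q5, and it keeps working if the array is later defined over a different list of variables.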

Character functions
Function name                                Description
CAT(string-1 <, ... string-n>)               Concatenates character strings without removing leading or trailing blanks
CATS(string-1 <, ... string-n>)              Concatenates character strings and removes leading and trailing blanks
CATT(string-1 <, ... string-n>)              Concatenates character strings and removes trailing blanks
CATX(separator, string-1 <, ...string-n>)    Concatenates character strings, removes leading and trailing blanks, and inserts separators
COMPBL(source)                               Compresses multiple consecutive blanks in a character string into a single blank

Example: Create a SAS data set that joins the two fields into a single variable for the full name in the form of firstname lastname such that there is only one blank space between the first name and the last name.
data Q78;
  infile datalines delimiter=',';
  input first $ last $;
  name=catx(' ', first, last);
  datalines;
Mary,Leung
John,Wong
Jonathan,Ng
;
run;
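To see how the CAT family differs on the same inputs, the sketch below applies three of the functions to two padded character variables and writes the results to the log. The variable names a, b and r_* are assumptions, and no data set is created.

```sas
data _null_;
  length a b $8;
  a = 'Mary';                  /* stored with four trailing blanks */
  b = 'Leung';                 /* stored with three trailing blanks */
  r_cat  = cat(a, b);          /* trailing blanks of a are kept between the names */
  r_cats = cats(a, b);         /* blanks stripped, names run together */
  r_catx = catx(' ', a, b);    /* blanks stripped, one separator inserted */
  put r_cat= / r_cats= / r_catx=;
run;
```

Comparing the three lines in the log shows why CATX is the natural choice for the full-name exercise above.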

Character functions (continued)
Function name         Description
LEFT(argument)        Left aligns a SAS character expression
LENGTH(string)        Returns the length of a non-blank character string, excluding trailing blanks, and returns 1 for a blank character string
LENGTHC(string)       Returns the length of a character string, including trailing blanks
LENGTHN(string)       Returns the length of a non-blank character string, excluding trailing blanks, and returns 0 for a blank character string
LOWCASE(argument)     Converts all letters in an argument to lowercase
RIGHT(argument)       Right aligns a character expression
TRIM(argument)        Removes trailing blanks from character expressions and returns one blank if the expression is missing
TRIMN(argument)       Removes trailing blanks from character expressions and returns a null string (zero blanks) if the expression is missing
UPCASE(argument)      Converts all letters in an argument to uppercase

Function name: FIND(string,substring)
Description: Searches for a specific substring of characters within a character string that you specify
- string specifies a character constant, variable, or expression that will be searched for substrings. Tip: Enclose a literal string of characters in quotation marks.
- substring is a character constant, variable, or expression that specifies the substring of characters to search for in string. Tip: Enclose a literal string of characters in quotation marks.

Function name: SUBSTR(string, position<,length>)
Description: Extracts a substring from an argument
- string specifies any SAS character expression.
- position specifies a numeric expression that is the beginning character position.
- length specifies a numeric expression that is the length of the substring to extract. Tip: If you omit length, SAS extracts the remainder of the expression.
Example:

data test;
  infile datalines delimiter=',';
  input name :$20. sex $;
  new_name = compbl(name);
  blank_pos=find(new_name,' ');
  name_len=length(new_name);
  last_name=substr(new_name,blank_pos);    /* starts at the blank, so last_name has a leading blank */
  first_name=substr(name,1,name_len - length(last_name));
  sex=upcase(sex);
  datalines;
Mary Chan, f
Tom Ng, M
David Wong,m
Betty Chung,F
;
run;

Character functions (continued)
Function name: SCAN(string, n<, delimiter(s)>)
Description: Selects a given word from a character expression
- n specifies a numeric expression that produces the number of the word in the character string you want SCAN to select.
- delimiter specifies a character expression that produces characters that you want SCAN to use as a word separator in the character string.
Note: If you omit delimiter, SAS uses the following characters by default: blank . < ( + & ! $ * ) ; ^ / , % |
Tip: If you specify delimiter as a literal, enclose it in quotation marks.
Example:

data test;
  input name $20.;
  surname=scan(name,1,' ');
  givenname1=scan(name,2,' ');
  givenname2=scan(name,3,' ');
  givenname=catx(' ',givenname1,givenname2);
  datalines;
Chan Wai Chiu
Yau Sen Hei
Yu Tang Fei
;
run;

Example:

data Q79;
  infile datalines delimiter=' ' dsd;
  input age 1-2 @4 name :$50.;
  surname=scan(name,1,',');
  firstname=scan(name,2,',');
  drop name;
  datalines;
18 "HO, Chun Kit"
17 "LO, Yu Yin"
20 "SUM, On Man"
;
run;

Character functions (continued)
Function name: COMPRESS(<source><, chars><, modifiers>)
Description: Removes specific characters from a character string
- source specifies a source string that contains characters to remove.
- chars specifies a character string that initializes a list of characters. By default, the characters in this list are removed from the source. If you specify the K modifier in the third argument, then only the characters in this list are kept in the result. Tip: You can add more characters to this list by using other modifiers in the third argument. Tip: Enclose a literal string of characters in quotation marks.
- modifiers specifies a character string in which each character modifies the action of the COMPRESS function. Blanks are ignored. These are the characters that can be used as modifiers:
  a or A - adds letters of the Latin alphabet (A - Z, a - z) to the list of characters.
  d or D - adds numerals to the list of characters.
  i or I - ignores the case of the characters to be kept or removed.
  k or K - keeps the characters in the list instead of removing them.
  p or P - adds punctuation marks to the list of characters.
Example:

data test;
  input productcode :$10.;
  product=compress(productcode, ,'ka');   /* keep only the letters */
  code=compress(productcode, ,'a');       /* remove the letters */
  datalines;
Aa235
BXT3218
6798ZYV
316X
;
run;

Date functions
Function name                   Description
day(date)                       Extracts the day value from a SAS date value.
month(date)                     Extracts the month value from a SAS date value.
today()                         Returns the current date as a SAS date value. This function requires no arguments, but it must still be followed by parentheses.
week(date)                      Returns the week number value.
weekday(date)                   Returns the day of the week from a SAS date value, where 1=Sunday, 2=Monday, ..., 7=Saturday.
year(date)                      Extracts the year value from a SAS date value.
mdy(month,day,year)             Returns a SAS date value from numeric expressions of month, day, and year values.
                                month can be a variable that represents the month, or a number from 1-12.
                                day can be a variable that represents the day, or a number from 1-31.
                                year can be a variable that represents the year, or a number that has 2 or 4 digits.
YRDIF(sdate,edate,'Actual')     Returns the difference in years between two dates. 'Actual' uses the actual number of days between dates in calculating the number of years.
DATDIF(sdate,edate,'Actual')    Returns the actual number of days between two dates.

Example:

data test;
  input id birthday birthmonth birthyear;
  birthdate=mdy(birthmonth,birthday,birthyear);
  birthweek=week(birthdate);
  birthweekday=weekday(birthdate);
  cutoffdate='1jan2004'd;
  day_diff=datdif(cutoffdate,birthdate,'actual');
  year_diff=yrdif(cutoffdate,birthdate,'actual');
  format birthdate cutoffdate date9.;     /* display the date values as dates */
  datalines;
1 31 12 2005
2 1 1 2006
3 28 2 2006
4 31 3 2006
;
run;

Date functions (continued)
Function name: INTCK('interval',from,to)
Description: Returns the number of time intervals that occur in a given time span, where
- 'interval' specifies a character constant or variable. The value must be one of the following: DAY, WEEKDAY, WEEK, MONTH, HOUR, QTR, YEAR
- from specifies a SAS date value that identifies the beginning of the time span.
- to specifies a SAS date value that identifies the end of the time span.

The INTCK function counts intervals from fixed interval beginnings, not in multiples of an interval unit from the from value. Partial intervals are not counted. For example, WEEK intervals are counted by Sundays rather than seven-day multiples from the from argument. MONTH intervals are counted by day 1 of each month, and YEAR intervals are counted from 01JAN, not in 365-day multiples.

SAS Statement                                          Value
Weeks = intck('week','31dec2000'd,'01jan2001'd);       0
Months = intck('month','31dec2000'd,'01jan2001'd);     1
Years = intck('year','31dec2000'd,'01jan2001'd);       1

Because December 31, 2000, is a Sunday, no WEEK interval is crossed between that day and January 1, 2001. However, both MONTH and YEAR intervals are crossed.

Date functions (continued)
Function name: INTNX('interval',start-from,increment<,'alignment'>)
Description: Increments a date value by a given interval or intervals, and returns a date value, where
- 'interval' specifies a character constant or variable. The value must be one of the following: DAY, WEEKDAY, WEEK, MONTH, HOUR, QTR, YEAR
- start-from specifies a starting SAS date value.
- increment specifies a negative or positive integer that represents time intervals toward the past or future.
- 'alignment' (optional) forces the alignment of the returned date to the beginning, middle, or end of the interval.

For example, the following statement creates the variable TargetYear and assigns it a SAS date value of 13515, which corresponds to January 1, 1997.
TargetYear=intnx('year','05feb94'd,3);

The optional alignment argument lets you specify whether the date value should be at the beginning, middle, or end of the interval. When specifying date alignment in the INTNX function, use the following arguments or their corresponding aliases:
BEGINNING   B
MIDDLE      M
END         E
SAMEDAY     S
The best way to understand the alignment argument is to see its effect on identical statements. The following table shows the results of three INTNX statements that differ only in the value of alignment.
SAS Statement                               Date Value
MonthX=intnx('month','01jan95'd,5,'b');     12935 (June 1, 1995)
MonthX=intnx('month','01jan95'd,5,'m');     12949 (June 15, 1995)
MonthX=intnx('month','01jan95'd,5,'e');     12964 (June 30, 1995)

These statements count five months from January, but the returned value depends on whether alignment specifies the beginning, middle, or end day of the resulting month. If alignment is not specified, the beginning day is returned by default.
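The alignment behaviour can be checked directly in a short DATA _NULL_ step. The sketch below is only an illustration: the variable names (start, month_b, month_m, month_e) are made up for this example, and the DATE9. format is used so that the log shows readable dates instead of raw SAS date values.

```sas
data _null_;
   start = '01jan95'd;

   /* Same interval and increment, different alignment */
   month_b = intnx('month', start, 5, 'b');   /* beginning of the month */
   month_m = intnx('month', start, 5, 'm');   /* middle of the month    */
   month_e = intnx('month', start, 5, 'e');   /* end of the month       */

   /* Write the three results to the log as dates */
   put month_b= date9. month_m= date9. month_e= date9.;
run;
```

The log should show 01JUN1995, 15JUN1995, and 30JUN1995, matching the three rows of the table.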

Special functions

Function name: INPUT(source,informat)
Description: Explicit character-to-numeric conversion

where
source indicates the character variable, constant, or expression to be converted to a numeric value
a numeric informat must also be specified, as in this example: input(payrate,2.)

Function name: PUT(source,format)
Description: Explicit numeric-to-character conversion

where
source indicates the numeric variable, constant, or expression to be converted to a character value
a format matching the data type of the source must also be specified, as in this example: put(site,2.)

Question
A typical value for the character variable Target is 123,456. Which statement correctly converts the values of Target to numeric values when creating the variable TargetNo?
a. TargetNo=input(target,comma6.);
b. TargetNo=input(target,comma7.);
c. TargetNo=put(target,comma6.);
d. TargetNo=put(target,comma7.);
Correct answer: b
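A quick way to convince yourself of the answer is to run both conversions in a DATA _NULL_ step. This is a minimal sketch with made-up values; the names target, site, targetno, and sitec are used only for illustration.

```sas
data _null_;
   target = '123,456';    /* character value, 7 characters including the comma */
   site   = 12;           /* numeric value */

   targetno = input(target, comma7.);   /* character -> numeric */
   sitec    = put(site, 2.);            /* numeric -> character */

   put targetno= sitec=;
run;
```

The log shows targetno=123456 and sitec=12. The informat width must cover the whole value: with COMMA6. only the first six characters ('123,45') would be read.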

Chapter 4: Modifying and Combining SAS Data Sets


Contents
Reading Single SAS Data Set
Concatenating SAS data sets
Merging data sets

Reading Single SAS Data Set


It is often necessary to update an existing SAS data set, or to create a new SAS data set from an existing one, in order to:
select observations based on one or more conditions
keep or drop variables
rename variables
create new variables
To read an existing SAS data set, use the SET statement.

SET statement

DATA data_set_name <(data_set_options)>;
   <Other DATA step statements>
   SET sas_data_set <(data_set_options)> <options>;
   <Other DATA step statements>
RUN;

data_set_name is the name of the SAS data set to be created
sas_data_set is the name of the SAS data set to be read
Any DATA step statements can be placed before or after the SET statement

How does it work?

1. Compilation phase
No input buffer is created; a tracking pointer points to the first observation of the SAS data set to be read
The PDV is created as usual; all variables contained in the SAS data set to be read are included by default

2. Execution phase
As the SET statement is executed, the values from the observation at the pointer are copied to the PDV
At the end of each iteration of the DATA step, the values in the PDV are written to the new data set
At the beginning of each iteration, the values of variables that were read from the SAS data set with the SET statement, or that were created by a sum statement, are retained in the PDV; all other variable values are set to missing

Example: Suppose a SAS data set Scores exists in the Mylib library

data case1; set Mylib.scores; run;
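The retain-or-reset rule described above can be observed with a small test. The sketch below assumes a hypothetical three-observation data set (scores, with variables StudentID and score1) created in the WORK library; the names flag, running_total, and case1_demo are also made up for the illustration.

```sas
data scores;   /* hypothetical sample data */
   input StudentID $ score1;
   datalines;
S01 70
S02 80
S03 90
;
run;

data case1_demo;
   set scores;
   /* Ordinary assignment: flag is reset to missing at the start of each iteration */
   if _n_ = 1 then flag = 1;
   /* Sum statement: running_total is retained across iterations */
   running_total + score1;
run;

proc print data=case1_demo;
run;
```

In the printed output, flag is 1 only for the first observation and missing afterwards, while running_total accumulates to 70, 150, and 240.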

SET statement - Dropping unwanted variables
Suppose the variables score2 and score3 of SCORES are no longer wanted.
DROP= data set option: these variables (score2 and score3) are not read into the PDV and cannot be used in the DATA step
Example:
data case1; set Mylib.scores (drop=score2 score3); run;

DROP statement: these variables are kept in the PDV but are not written to the new data set; they can still be used in the DATA step
Example:
data case2; set Mylib.scores; drop score2 score3; run;

SET statement - Keeping selected variables only
Suppose only the variables StudentID and score3 are wanted.
KEEP= data set option: only these variables are read into the PDV and written to the new data set
Example:
data case3; set Mylib.scores (keep=StudentID score3); run;

KEEP statement: all variables are kept in the PDV, but only these variables are written to the new data set
Example:
data case3; set Mylib.scores; keep StudentID score3; run;
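The practical difference between the data set option and the statement shows up when a dropped variable is needed in a calculation. The following sketch assumes a hypothetical data set scores with variables StudentID, score1, score2, and score3; the names total_a, total_b, and total are made up for illustration.

```sas
data scores;   /* hypothetical sample data */
   input StudentID $ score1 score2 score3;
   datalines;
S01 70 65 80
S02 80 85 90
;
run;

/* DROP statement: score2 and score3 are in the PDV, so they can be used */
data total_a;
   set scores;
   total = score1 + score2 + score3;
   drop score2 score3;
run;

/* DROP= data set option: score2 and score3 never reach the PDV */
data total_b;
   set scores (drop=score2 score3);
   total = score1;   /* only score1 is available here */
run;
```

Both output data sets contain StudentID, score1, and total, but only the first version could use score2 and score3 to compute total (215 and 255).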

SET statement - Renaming variables
Suppose variable StudentID is to be renamed SID and variable score3 is to be renamed quiz3.
Example:
data case4; set Mylib.scores (rename=(StudentID=SID score3=quiz3)); run;

Note: The RENAME= option affects only the PDV and the new data set; the input data set is unchanged.

SET statement - Selecting the nth-mth observations

Example: Suppose a SAS data set TEMP1 contains 500 observations. Write a SAS DATA step to create a SAS data set for each of the following:
a. The new data set contains only the first 100 observations of TEMP1.
b. The new data set contains only the last 100 observations of TEMP1.
c. The new data set contains the 101st through 300th observations of TEMP1.

data Qa; set Temp1 (obs=100); run;
data Qb; set Temp1 (firstobs=401); run;
data Qc; set Temp1 (firstobs=101 obs=300); run;
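A point that is easy to miss in part (c) is that OBS= gives the number of the last observation to read, not a count of observations. The sketch below builds a hypothetical 500-observation data set (temp1, with a single variable n) so that the result can be checked; the name middle is also made up for the example.

```sas
data temp1;   /* hypothetical 500-observation data set */
   do n = 1 to 500;
      output;
   end;
run;

data middle;
   /* Start at observation 101 and stop after observation 300 */
   set temp1 (firstobs=101 obs=300);
run;
```

The data set middle contains 200 observations, with n running from 101 to 300.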

SET statement - Selecting observations conditionally

Example:

data Q16a Q16b; set Mylib.Fltattnd; IF JOBCODE='FLTAT1' THEN output Q16a; IF JOBCODE='FLTAT2' THEN output Q16b; run;

SET statement - Selecting an observation directly (direct access)
Use the POINT= option in the SET statement

point_variable = obs_number;
SET data_set_name POINT = point_variable;

point_variable specifies a temporary numeric variable
point_variable appears in the PDV but not in the final data set
obs_number is the number of the observation to be read; it must be assigned to point_variable before the SET statement executes
Example:

data case5; obsnum=102; set Mylib.booksales (keep =ID gender firstpurch) point=obsnum; output; stop; run;

The POINT= option reads only the specified observations, so SAS never reads an end-of-file indicator, which would cause an infinite loop
A STOP statement must be used to make SAS stop processing the current DATA step immediately
The DATA step normally writes an observation at the end of each iteration, but the STOP statement stops processing before the end of the DATA step, so no observation is written automatically
Use an OUTPUT statement before the STOP statement to write the observation explicitly

SET statement - Selecting every kth observation
Example: Write a SAS DATA step to select a 1000-observation subset of the data set SALE2000 by reading every tenth observation, starting from observation number 10.
data case6; do obs=10 to 10000 by 10; set Mylib.sale2000 point=obs; output; end; stop; run;

Write a SAS DATA step to select every tenth observation in SALE2000. Suppose you do not know the total number of observations in SALE2000.SAS7BDAT.
You can use the NOBS= option, which creates a temporary variable that contains the total number of observations in the input data set.
Note that the NOBS= variable can be used in executable statements that appear before the SET statement, because its value is assigned during compilation.
The loop must start at 10, not 0, because observation number 0 does not exist.

data case6q;
   do obs=10 to ttlobs by 10;
      set Mylib.sale2000 point=obs nobs=ttlobs;
      output;
   end;
   stop;
run;
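To see the NOBS= technique work without access to SALE2000, it can be tried on a small generated data set. The sketch below is an illustration under stated assumptions: sale_demo (95 observations, one variable id) and every10 are hypothetical names created only for this test.

```sas
data sale_demo;   /* hypothetical 95-observation data set */
   do id = 1 to 95;
      output;
   end;
run;

data every10;
   /* ttlobs is set at compile time, so it can be used before SET */
   do obs = 10 to ttlobs by 10;
      set sale_demo point=obs nobs=ttlobs;
      output;
   end;
   stop;   /* required: POINT= never reaches end of file */
run;
```

The result has 9 observations (id = 10, 20, ..., 90). The POINT= and NOBS= variables (obs and ttlobs) are temporary and do not appear in every10.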

SET statement - Creating a random sample with replacement
With replacement: observations can be selected more than once
The major steps:
First generate a random number, say k
Read the kth observation directly
Repeat the above two steps until the required number of observations has been selected

Generate a random number
Function RANUNI(seed) returns a value between 0 and 1
seed must be an integer
seed = 0 uses the system clock time, resulting in different output each time
To get an integer between 1 and M, use the CEIL( ) function as follows: CEIL(RANUNI(seed) * M)
The CEIL( ) function returns the smallest integer that is greater than or equal to the argument
Example:
data case7;
   samplesize=100;
   do i=1 to samplesize;
      sample_point=ceil(ranuni(0)*ttlobs);
      set Mylib.booksales (keep=ID gender firstpurch) point=sample_point nobs=ttlobs;
      output;
   end;
   stop;
   drop samplesize i;
run;

BY-group processing - To group observations for processing

DATA data_set_name;
   SET sas_data_set <(data_set_options)> <options>;
   BY variable1 <variable2 ...>;

The data set in the SET statement must be sorted by the values of the BY variables
Two temporary variables are created for each BY variable
First.variable1: equals 1 for the first observation in a BY group; 0 otherwise
Last.variable1: equals 1 for the last observation in a BY group; 0 otherwise

Example: Suppose you want to compute the total amount of money spent (M) on books by each MCODE level in BOOKSALES.SAS7BDAT

proc sort data=Mylib.booksales out=sort_booksales;
   by mcode;
run;

data case8;
   set sort_booksales (keep=mcode m);
   by mcode;
   if first.mcode=1 then total_spent=0;
   total_spent+m;
   if last.mcode=1 then output;
   drop m;
run;

Behind the scenes - PDV

Using more than one variable in the BY statement
FIRST.BY-primary-variable = 1 forces FIRST.BY-secondary-variable = 1
LAST.BY-primary-variable = 1 forces LAST.BY-secondary-variable = 1

Example: Suppose you want to compute the total amount of money spent (M) on books by each gender in each MCODE level. Here mcode is the BY-primary-variable and gender is the BY-secondary-variable.

proc sort data=Mylib.booksales out=sort_booksales;
   by mcode gender;
run;

data case8;
   set sort_booksales;
   by mcode gender;
   if first.gender=1 then total_spent=0;
   total_spent+m;
   if last.gender=1 then output;
   keep mcode gender total_spent;
run;

MCODE  GENDER  FIRST.MCODE  LAST.MCODE  FIRST.GENDER  LAST.GENDER
1      0       1            0           1             0
1      0       0            0           0             1
1      1       0            0           1             0
1      1       0            1           0             1
2      1       1            0           1             0
2      1       0            0           0             0
2      1       0            1           0             1
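Because the FIRST. and LAST. variables are temporary, they are not written to the output data set. One way to inspect them is to copy them into ordinary variables. The sketch below uses a small hypothetical data set (booksales_demo, already sorted by mcode and gender); the names flags, first_mcode, last_mcode, first_gender, and last_gender are made up for illustration.

```sas
data booksales_demo;   /* hypothetical sample data, sorted by mcode gender */
   input mcode gender m;
   datalines;
1 0 10
1 0 20
1 1 15
2 1 30
2 1 5
;
run;

data flags;
   set booksales_demo;
   by mcode gender;
   /* Copy the temporary BY-group flags into permanent variables */
   first_mcode  = first.mcode;
   last_mcode   = last.mcode;
   first_gender = first.gender;
   last_gender  = last.gender;
run;

proc print data=flags;
run;
```

The printed flags follow the same pattern as the table: whenever first_mcode is 1, first_gender is also 1, and whenever last_mcode is 1, last_gender is also 1.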

Concatenating SAS data sets


Stacking data sets - To stack or concatenate SAS data sets one on top of the other

DATA data_set_name;
   SET sas_data_set1 <(data_set_options)> sas_data_set2 <(data_set_options)> <options>;
   <Other DATA step statements>
RUN;

Any number of SAS data sets can be read in one SET statement
Common variables must have the same data type attribute
The new data set contains all of the variables and observations from all of the data sets listed in the SET statement

How does it work?
Similar to reading a single SAS data set
Observations from the first data set listed in the SET statement are read first
Then the observations from the second data set listed are read, and so on
Example:

data Jan;
   input name $ 1-20 sales;
   datalines;
Daivd Wong          4500
Francis Leung       6000
Joe Chan            3000
;
run;

data Feb;
   input name $ 1-20 sales;
   datalines;
Joe Chan            5000
Daivd Wong          6000
John Tai            4500
;
run;

data case9;
   set Jan Feb;
run;

Missing values will be generated if the stacked data sets use different variable names
Example:

data Jan;
   input name $ 1-20 sales1;
   datalines;
Daivd Wong          4500
Francis Leung       6000
Joe Chan            3000
;
run;

data Feb;
   input name $ 1-20 sales2;
   datalines;
Joe Chan            5000
Daivd Wong          6000
John Tai            4500
;
run;

data case10;
   set Jan Feb;
run;

Solution: Change to the same variable name

data case10a; set Jan (rename=(sales1=sales)) Feb(rename= (sales2=sales)); run;

Use the IN= option to determine which data set contributed the current observation

SET sas_data_set (IN = in_variable);

in_variable is a temporary numeric variable that equals 1 when the data set contributed to the current observation, 0 otherwise

data Jan;
   input name $ 1-20 sales;
   datalines;
Daivd Wong          4500
Francis Leung       6000
Joe Chan            3000
;
run;

data Feb;
   input name $ 1-20 sales;
   datalines;
Joe Chan            5000
Daivd Wong          6000
John Tai            4500
;
run;

data case11;
   set Jan (in=file1) Feb (in=file2);
   if file1=1 then month='Jan';
   if file2=1 then month='Feb';
run;

Merging data sets


To join corresponding observations from two or more SAS data sets

DATA data_set_name;
   MERGE sas_data_set1 <(data_set_options)> sas_data_set2 <(data_set_options)> <options>;
   BY variable1 <variable2 ...>;
   <Other DATA step statements>
RUN;

The data sets in the MERGE statement must be sorted by the values of the BY variables
Data set options (such as KEEP=, DROP=, RENAME=, and IN=) are used in the same way as in the SET statement
If variables that have the same name appear in more than one data set, the value of the variable is the value in the last data set that contains it

How does it work?

Compilation phase - To prepare to merge data sets, SAS
1. reads the descriptor portions of the data sets that are listed in the MERGE statement
2. reads the rest of the DATA step program
3. creates the program data vector (PDV) for the merged data set
4. assigns a tracking pointer to each data set that is listed in the MERGE statement

Execution phase
As the MERGE statement executes, SAS compares the observations at the pointers of the listed data sets to see whether the BY values match
If yes, the observations are written to the PDV in the order in which the data sets appear in the MERGE statement
If no, SAS determines which of the BY values comes first and writes the observation that contains this value to the PDV
At the end of each iteration, SAS writes the observation to the new data set, and variables created by the DATA step are set to missing in the PDV
If no data set contains any more observations in the current BY group, variables that come from the listed data sets are set to missing in the PDV; otherwise, their values are retained in the PDV

One-to-one with equal list matching
Example: Suppose the marks of MS1111 and MS1112 for each student are stored in the SAS data sets MS1111 and MS1112 respectively. To calculate the average mark for each student, the two data sets must be merged
data combinea; merge ms1111 ms1112; by id; run;

MARK in MS1112 overwrites MARK in MS1111, so the variables must be renamed before merging:


data combineb; merge ms1111(rename=(mark=mark_ms1111)) ms1112(rename=(mark=mark_ms1112)); by id; average_mark=(mark_ms1111+mark_ms1112)/2; run;

One-to-one with unequal list matching Some students took MS1111 but not MS1112, or vice versa
data combinec; merge ms1111 ms1112; by id; run;

Use the IN= option to select observations that appear in both data sets

data combined;
   merge ms1111(in=ms1 rename=(mark=mark_ms1111))
         ms1112(in=ms2 rename=(mark=mark_ms1112));
   by id;
   if ms1=1 and ms2=1 then do;
      average_mark=(mark_ms1111+mark_ms1112)/2;
      output;
   end;
run;
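The effect of the IN= variables is easiest to see on two small data sets with partly overlapping IDs. The sketch below is a self-contained illustration: the contents of ms1111 and ms1112 are invented sample marks, and the output name both is hypothetical. It uses a subsetting IF instead of an explicit OUTPUT, which is an equivalent way to keep only the matches.

```sas
data ms1111;   /* hypothetical marks, sorted by id */
   input id mark;
   datalines;
1 60
2 70
3 80
;
run;

data ms1112;   /* hypothetical marks, sorted by id */
   input id mark;
   datalines;
2 90
3 50
4 75
;
run;

data both;
   merge ms1111 (in=ms1 rename=(mark=mark_ms1111))
         ms1112 (in=ms2 rename=(mark=mark_ms1112));
   by id;
   /* Subsetting IF: continue only when both data sets contributed */
   if ms1=1 and ms2=1;
   average_mark = (mark_ms1111 + mark_ms1112) / 2;
run;
```

Only students 2 and 3 appear in both inputs, so the result has two observations, with average marks of 80 and 65.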

One-to-many / Many-to-one matching
The order of the data sets in the MERGE statement does not matter to SAS
Example: Suppose CUSTOMERID contains the profiles of customers and SALES contains the products purchased by each customer

data sale_profile; merge customerid sales; by id; run;

A one-to-many merge is the same as a many-to-one merge, although the order of the variables in the new data set is not the same
data sale_profile; merge sales customerid; by id; run;

Use the IN= option to identify the non-matches
Example: Suppose SALESA contains a list of products purchased by some customers. You want to identify the customers who did not purchase any item at all

data sale_profile; merge customerid (in=file1) salesa(in=file2); by id; if file1=1 and file2=0 then output; keep id gender age; run;
