Class 5 Notes

More Programming Ideas in SAS

This class involves several SAS programming ideas:
- User-written formats
- Conditional logic
- Arrays
- RETAIN and SUM statements
- Functions
- Reshaping data (including PROC TRANSPOSE)
- Macro programming
- Modular programming
User-Written Formats: There are two types of formats in SAS:
- Temporary formats, assigned through a procedure
- Permanent formats, assigned through a data step

Formats can be defined in SAS using PROC FORMAT. The syntax is:

PROC FORMAT;
  VALUE name
    value1 (or range1) = 'formatted-text 1'
    value2 (or range2) = 'formatted-text 2'
    ;
RUN;

where name is the name of the format, which follows the usual naming convention (at most 32 characters, starting with a letter or underscore) and, in addition, cannot end in a digit. The format is associated with a variable using

FORMAT varname name.;

Note the period after the name in the FORMAT statement. Let us take a look at Display 8.5 from Bailer.
More Programming Ideas in SAS/A Narayanan Page 1
options ls=80 nocenter nodate nonumber formdlim='=';
/*********************** Display 8.5 Bailer *****************/
data toyexample;
  input literacy @@;
  literacy_too = literacy;
  datalines;
-99 25.55 53 53.5 73.7 83 99.9 107 .
;
run;

proc format;
  value literacyfmt
    0-53    = 'First quartile'
    53<-76  = 'Second quartile'
    76<-90  = 'Third quartile'
    90<-100 = 'Fourth quartile'
    .       = 'Missing'
    OTHER   = 'Invalid';
run;

proc print data=toyexample;
  format literacy literacyfmt.;
  title 'Temporary Format via a PROC STEP';
run;

Temporary Format via a PROC STEP

Obs    literacy           literacy_too
 1     Invalid               -99.00
 2     First quartile         25.55
 3     First quartile         53.00
 4     Second quartile        53.50
 5     Second quartile        73.70
 6     Third quartile         83.00
 7     Fourth quartile        99.90
 8     Invalid               107.00
 9     Missing                  .
Since this format is assigned through a procedure (PROC PRINT) it is a temporary format. If we print the data again without associating the format we get:
proc print data=toyexample; title 'Printing without format'; run;

Printing without format

Obs    literacy    literacy_too
 1      -99.00        -99.00
 2       25.55         25.55
 3       53.00         53.00
 4       53.50         53.50
 5       73.70         73.70
 6       83.00         83.00
 7       99.90         99.90
 8      107.00        107.00
 9         .             .
Permanent Format via Data Step: We can assign a permanent format using a data step.
data toyexample2;
  set toyexample;
  format literacy literacyfmt.;
run;

proc print data=toyexample2;
  title 'Permanent format via data step';
run;

Permanent format via data step

Obs    literacy           literacy_too
 1     Invalid               -99.00
 2     First quartile         25.55
 3     First quartile         53.00
 4     Second quartile        53.50
 5     Second quartile        73.70
 6     Third quartile         83.00
 7     Fourth quartile        99.90
 8     Invalid               107.00
 9     Missing                  .
Since this is a permanent format, there is no need for a FORMAT statement in the PROC PRINT. However, the data are still stored internally as numeric values, as PROC MEANS shows:
proc means data=toyexample2 maxdec=2; var literacy literacy_too; title 'Data still stored as numbers'; run;
Data still stored as numbers

The MEANS Procedure

Variable        N    Mean    Std Dev    Minimum    Maximum
literacy        8   49.58      65.69     -99.00     107.00
literacy_too    8   49.58      65.69     -99.00     107.00

In summary,

Method of assignment    Status
PROC step               Temporary
DATA step               Permanent
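The range notation in literacyfmt is easy to misread at the endpoints: 0-53 includes 53, while 53<-76 excludes 53 and includes 76. The following is a minimal sketch for checking the labels assigned to boundary values. The format name quartfmt and the variables x and txt are illustrative names that do not appear in the notes; the ranges are copied from literacyfmt.

```sas
/* Sketch: check which label each boundary value receives.
   quartfmt, x and txt are hypothetical names for illustration. */
proc format;
  value quartfmt
    0-53    = 'First quartile'
    53<-76  = 'Second quartile'
    76<-90  = 'Third quartile'
    90<-100 = 'Fourth quartile'
    .       = 'Missing'
    OTHER   = 'Invalid';
run;

data _null_;
  do x = 0, 53, 53.01, 76, 90, 100, 100.5;
    txt = put(x, quartfmt.);   /* PUT function returns the formatted text */
    put x= txt=;               /* written to the log */
  end;
run;
```

In the log, 53 should fall in the first quartile and 53.01 in the second, while 100.5 falls through to OTHER and is labeled Invalid.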
Conditional Logic: Sometimes, an assignment is made based on a condition; if the condition is true, the assignment is done; otherwise, it is not. This is called conditional logic and can be implemented in SAS using IF-THEN statements. The syntax is:

IF <condition> THEN <assignment>;

The <condition> usually involves comparisons, which can be done using the following operators:

Mnemonic    Symbolic
EQ          =
NE          ~= or ^=
LT          <
GT          >
LE          <=
GE          >=

Sometimes, the condition involves several comparisons joined by logical operators:

IF <condition 1> AND/OR <condition 2> THEN <assignment>;

The following logical operators can be used:

Mnemonic    Symbolic
AND         &
OR          | or !
In the above structure, only one assignment can be made. A series of assignments can be made using a DO-END block:

IF <condition> THEN DO;
  <assignment>;
  <assignment>;
END;

data sportscars;
  input model $ year make $ seats color $;
  if year < 1975 then status = 'classic';
  if model = 'Corvette' or model = 'Camaro' then make = 'Chevy';
  if model = 'Miata' then do;
    make = 'Mazda';
    seats = 2;
  end;
  datalines;
Corvette 1955 . 2 black
XJ6 1995 Jaguar 2 teal
Mustang 1966 Ford 4 red
Miata 2002 . . silver
CRX 2001 Honda 2 black
Camaro 2000 . 4 red
run;
proc print data = sportscars; title "Using IF-THEN statements"; run;

Using IF-THEN statements

Obs    model       year    make      seats    color     status
 1     Corvette    1955    Chevy       2      black     classic
 2     XJ6         1995    Jaguar      2      teal
 3     Mustang     1966    Ford        4      red       classic
 4     Miata       2002    Mazda       2      silver
 5     CRX         2001    Honda       2      black
 6     Camaro      2000    Chevy       4      red
Question: Values of character variables are case-sensitive. If I had corvette instead of Corvette, what would happen?

Grouping Observations Using IF-THEN/ELSE Statements: One of the common ways to group observations and create a grouping variable is to use IF-THEN/ELSE logic:
- It is more efficient than a series of IF statements
- It creates mutually exclusive groups

IF <condition> THEN <statement>;
ELSE IF <condition> THEN <statement>;
ELSE <statement>;

- Each successive condition is checked only if all previous conditions are false
- The last statement (which is optional) is executed only if none of the conditions is true
/*********using IF-THEN/ELSE section 3.6 Little SAS Book ***********/
data homeimprovements;
  input owner $ 1-7 description $ 9-33 cost;
  if cost = . then costgroup = 'missing';
  else if cost < 2000 then costgroup = 'low';
  else if cost < 10000 then costgroup = 'medium';
  else costgroup = 'high';
  datalines;
Bob     kitchen cabinet face-lift 1253.00
Shirley bathroom addition         11350.70
Silvia  paint exterior            .
Al      backyard gazebo           3098.63
Norm    paint interior            647.77
Kathy   second floor addition     75362.93
run;

proc print data = homeimprovements;
  title 'Using IF-THEN/ELSE statements';
run;

Using IF-THEN/ELSE statements

Obs    owner      description                      cost    costgroup
 1     Bob        kitchen cabinet face-lift      1253.00    low
 2     Shirley    bathroom addition             11350.70    high
 3     Silvia     paint exterior                     .      missing
 4     Al         backyard gazebo                3098.63    medium
 5     Norm       paint interior                  647.77    low
 6     Kathy      second floor addition         75362.93    high
- When there is missing data, it is important to check for it first in these conditional statements
- Missing values are considered smaller than any numeric/character value
- Without the first IF statement, the missing cost would be assigned to the low group

Subsetting Observations Using IF: Sometimes, you want to keep (or delete) observations that satisfy a condition. For this, a subsetting IF can be used:

IF <condition>;

The following action is implied by the statement:

IF <condition> THEN OUTPUT;

If the condition is true, SAS continues processing that observation through the remainder of the data step and writes it to the data set. If the condition is false, SAS stops processing the observation and does not write it to the data set. The opposite action can be achieved by using the DELETE statement:

IF <condition> THEN DELETE;

In this case, if the condition is true, the observation is not written to the data set.
/************* using subsetting IF ----section 3.7 Little SAS Book ********/
data comedy;
  input title $ 1-26 year type $;
  if type = 'comedy';
  datalines;
A Midsummer Nights Dream   1595 comedy
Comedy of Errors           1590 comedy
Hamlet                     1600 tragedy
Macbeth                    1606 tragedy
Richard III                1594 history
Romeo and Juliet           1596 tragedy
Taming of the Shrew        1593 comedy
Tempest                    1611 romance
run;

proc print data = comedy;
  title 'Using subsetting IF';
run;

Using subsetting IF

Obs    title                       year    type
 1     A Midsummer Nights Dream    1595    comedy
 2     Comedy of Errors            1590    comedy
 3     Taming of the Shrew         1593    comedy
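The DELETE form described above can be sketched in the same way. This is a minimal illustration, not taken from the Little SAS Book; the data set name nontragedy is hypothetical, and only three single-word titles from the example are used so that simple list input works.

```sas
/* Sketch: drop the tragedies instead of keeping the comedies.
   nontragedy is a hypothetical data set name. */
data nontragedy;
  input title $ year type $;
  if type = 'tragedy' then delete;   /* observation is not written out */
  datalines;
Hamlet 1600 tragedy
Macbeth 1606 tragedy
Tempest 1611 romance
;
run;

proc print data = nontragedy;
  title 'Using IF-THEN DELETE';
run;
```

Only the Tempest observation should remain. A subsetting IF states which observations to keep, while DELETE states which ones to drop; which reads better depends on which set is easier to describe.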
Arrays: An array is a temporary data structure that is convenient for iterative processing. All items in an array must be either numeric or character; the two cannot be mixed. An array can also be thought of as a train with several cars, where each car is an item in the array. The number of cars is the length (or dimension) of the array. The general syntax for an array is:

ARRAY name (dimension) <variable list>;

The name should follow the usual naming convention: 32 characters or less, starting with a letter or underscore.

IMPORTANT: The number of variables in the variable list should equal the dimension of the array.

One of the common uses of arrays is repetitive processing. In this example, the value 9 is used as a placeholder for missing, and all ten variables are recoded in one loop using an array:
/******************* arrays *************************/
* Change all 9s to missing values;
data songs;
  input city $ 1-15 age domk wj hwow simbh kt aomm libm tr filp ttr;
  array song (10) domk wj hwow simbh kt aomm libm tr filp ttr;
  do i = 1 to 10;
    if song(i) = 9 then song(i) = .;
  end;
  datalines;
albany          54 4 3 5 9 9 2 1 4 4 9
richmond        33 5 2 4 3 9 2 9 3 3 3
oakland         27 1 3 2 9 9 9 3 4 2 3
richmond        41 4 3 5 5 5 2 9 4 5 5
berkeley        18 3 4 9 1 4 9 3 9 3 2
;
run;

proc print data = songs;
  title 'Using Arrays for Repetitive Processing';
run;

Using Arrays for Repetitive Processing

Obs  city      age  domk  wj  hwow  simbh  kt  aomm  libm  tr  filp  ttr   i
 1   albany     54    4    3    5     .     .    2     1    4    4    .   11
 2   richmond   33    5    2    4     3     .    2     .    3    3    3   11
 3   oakland    27    1    3    2     .     .    .     3    4    2    3   11
 4   richmond   41    4    3    5     5     5    2     .    4    5    5   11
 5   berkeley   18    3    4    .     1     4    .     3    .    3    2   11
If the value 9 is not replaced with missing, all summary measures and statistical analyses would treat 9 as a valid value.

Note: The array song is not part of the data set. It is a temporary data structure.

Question: Why is the value of i equal to 11?

This program can be cleaned up by making the following changes:
- use the wildcard (*) for the dimension instead of defining it explicitly
- use the DIM function as the upper bound of the loop
- use the DROP statement to drop the counter (i)
- use a name range list (domk -- ttr) instead of listing all variable names
/************** cleaning up the previous program******************/
* Change all 9s to missing values;
data songs;
  input city $ 1-15 age domk wj hwow simbh kt aomm libm tr filp ttr;
  array song (*) domk -- ttr;
  do i = 1 to dim(song);
    if song(i) = 9 then song(i) = .;
  end;
  drop i;
  datalines;
albany          54 4 3 5 9 9 2 1 4 4 9
richmond        33 5 2 4 3 9 2 9 3 3 3
oakland         27 1 3 2 9 9 9 3 4 2 3
richmond        41 4 3 5 5 5 2 9 4 5 5
berkeley        18 3 4 9 1 4 9 3 9 3 2
;
run;

proc print data = songs;
  title 'Cleaning up the Previous Program';
run;
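On the question of why the counter ends at 11 in the first version of the program: an iterative DO loop increments the index and then tests it against the upper bound, so the loop stops only once the index has passed 10. A minimal sketch, separate from the songs data, that writes the final value to the log:

```sas
/* Sketch: the index of an iterative DO loop is one step past
   the upper bound when the loop finishes. */
data _null_;
  do i = 1 to 10;
    /* the loop body would run here for i = 1, 2, ..., 10 */
  end;
  put 'After the loop: ' i=;   /* log shows i=11 */
run;
```

This is why the cleaned-up program drops i: the leftover counter carries no information about the observation.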
Another Application of ARRAY: In many situations, a single record may have to be expanded into multiple observations to satisfy the requirements of certain procedures in SAS. For example, the linear model procedure (PROC GLM) expects the response variable to be in one column and the classification variable in a separate column whose values are the same for several observations. For example, the single observation

Smith M 6 6 5 5 5 4 3

should be expanded to

Obs    name     sex    time    ADL
 1     Smith     M       1      6
 2     Smith     M       2      6
 3     Smith     M       3      5
 4     Smith     M       4      5
 5     Smith     M       5      5
 6     Smith     M       6      4
 7     Smith     M       7      3
data D6;
  input name $ sex $ t1 t2 t3 t4 t5 time6 time_7;
  ARRAY num_array{*} _NUMERIC_;
  DO inum = 1 to dim(num_array);
    time = inum;
    ADL = num_array{inum};
    output;
  END;
  keep name sex time ADL;
  datalines;
Smith M 6 6 5 5 5 4 3
Jones F 7 5 4 4 3 2 1
Fisher M 5 5 5 3 2 2 1
;
run;

proc print data=d6;
  title "dataset = D6 [reshaping data sets]";
run;

dataset = D6 [reshaping data sets]

Obs    name      sex    time    ADL
  1    Smith      M       1      6
  2    Smith      M       2      6
  3    Smith      M       3      5
  4    Smith      M       4      5
  5    Smith      M       5      5
  6    Smith      M       6      4
  7    Smith      M       7      3
  8    Jones      F       1      7
  9    Jones      F       2      5
 10    Jones      F       3      4
 11    Jones      F       4      4
 12    Jones      F       5      3
 13    Jones      F       6      2
 14    Jones      F       7      1
 15    Fisher     M       1      5
 16    Fisher     M       2      5
 17    Fisher     M       3      5
 18    Fisher     M       4      3
 19    Fisher     M       5      2
 20    Fisher     M       6      2
 21    Fisher     M       7      1
Further Application of ARRAY---Imputing Missing Values: Missing values are inevitable in data collection. One of the simplest methods of imputation is mean substitution within the same group to preserve homogeneity. The following program goes through these steps:
- Sort the data by group
- Compute the average of every variable by group and output the averages
- Merge the original data with the means
- Replace missing values with the group means
/******************* substitute missing values *************/ /******************* application using arrays *************/
data missing;
  input x1-x5 group $;
  datalines;
3 4 3 5 2 a
2 . 3 5 3 b
4 5 4 5 4 a
5 5 4 3 3 a
. 4 4 4 3 b
. . . 3 3 b
;
run;

title 'Imputation using arrays';

proc sort data=missing;
  by group;
run;

proc print data=missing;
run;

proc means data=missing;
  var x1-x5;
  output out=mean mean=xbar1-xbar5;
  by group;
run;

proc print data=mean;
run;

data meanmerge;
  merge missing mean;
  by group;
run;

proc print data=meanmerge;
run;

data imputed;
  set meanmerge;
  array x(*) x1-x5;
  array y(*) xbar1-xbar5;
  do i=1 to dim(x);
    if x(i)=. then x(i)=y(i);
  end;
  drop _type_ _freq_ xbar1-xbar5 i;
run;

proc print data=imputed;
run;
Imputation using arrays

Obs    x1    x2    x3    x4    x5    group
 1      3     4     3     5     2      a
 2      4     5     4     5     4      a
 3      5     5     4     3     3      a
 4      2     .     3     5     3      b
 5      .     4     4     4     3      b
 6      .     .     .     3     3      b

================================================================================
Imputation using arrays

group=a

The MEANS Procedure

Variable    N         Mean      Std Dev      Minimum      Maximum
x1          3    4.0000000    1.0000000    3.0000000    5.0000000
x2          3    4.6666667    0.5773503    4.0000000    5.0000000
x3          3    3.6666667    0.5773503    3.0000000    4.0000000
x4          3    4.3333333    1.1547005    3.0000000    5.0000000
x5          3    3.0000000    1.0000000    2.0000000    4.0000000

group=b

Variable    N         Mean      Std Dev      Minimum      Maximum
x1          1    2.0000000            .    2.0000000    2.0000000
x2          1    4.0000000            .    4.0000000    4.0000000
x3          2    3.5000000    0.7071068    3.0000000    4.0000000
x4          3    4.0000000    1.0000000    3.0000000    5.0000000
x5          3    3.0000000            0    3.0000000    3.0000000

================================================================================
Imputation using arrays

Obs    group    _TYPE_    _FREQ_    xbar1     xbar2      xbar3      xbar4     xbar5
 1       a         0         3        4      4.66667    3.66667    4.33333      3
 2       b         0         3        2      4.00000    3.50000    4.00000      3

================================================================================
Imputation using arrays

Obs  x1  x2  x3  x4  x5  group  _TYPE_  _FREQ_  xbar1   xbar2    xbar3    xbar4   xbar5
 1    3   4   3   5   2    a       0       3      4    4.66667  3.66667  4.33333    3
 2    4   5   4   5   4    a       0       3      4    4.66667  3.66667  4.33333    3
 3    5   5   4   3   3    a       0       3      4    4.66667  3.66667  4.33333    3
 4    2   .   3   5   3    b       0       3      2    4.00000  3.50000  4.00000    3
 5    .   4   4   4   3    b       0       3      2    4.00000  3.50000  4.00000    3
 6    .   .   .   3   3    b       0       3      2    4.00000  3.50000  4.00000    3

================================================================================
Imputation using arrays

Obs    x1    x2     x3    x4    x5    group
 1      3     4    3.0     5     2      a
 2      4     5    4.0     5     4      a
 3      5     5    4.0     3     3      a
 4      2     4    3.0     5     3      b
 5      2     4    4.0     4     3      b
 6      2     4    3.5     3     3      b
RETAIN and SUM statements: At the start of each iteration of the data step, SAS sets the variables that are read with INPUT or created by assignment statements to missing. Their values may change as SAS processes the observation, but they are set back to missing when SAS returns to the beginning of the data step. This default behavior can be changed by the RETAIN and SUM statements, which remember values from one iteration to the next. The RETAIN statement has the following form:

RETAIN <variable list> <initial value>;

The RETAIN statement keeps the value the variable had at the end of the previous iteration. The sum statement has the following form:

variable + expression;

This creates a running total; the variable is set to zero at the beginning and is retained automatically.
/**********using RETAIN and SUM statements ************/
* Using RETAIN and sum statements to find most runs and total runs;
data gamestats;
  input month 1 day 3-4 team $ 6-25 hits 27-28 runs 30-31;
  retain maxruns;
  maxruns = max(maxruns, runs);
  runstodate + runs;
  datalines;
6-19 Columbia Peaches      8  3
6-20 Columbia Peaches     10  5
6-23 Plains Peanuts        3  4
6-24 Plains Peanuts        7  2
6-25 Plains Peanuts       12  8
6-30 Gilroy Garlics        4  4
7-1  Gilroy Garlics        9  4
7-4  Sacramento Tomatoes  15  9
7-4  Sacramento Tomatoes  10 10
7-5  Sacramento Tomatoes   2  3
;
run;

proc print data = gamestats;
  title "Using RETAIN and SUM statements";
run;

Using RETAIN and SUM statements

Obs    month    day    team                   hits    runs    maxruns    runstodate
  1      6       19    Columbia Peaches         8       3        3            3
  2      6       20    Columbia Peaches        10       5        5            8
  3      6       23    Plains Peanuts           3       4        5           12
  4      6       24    Plains Peanuts           7       2        5           14
  5      6       25    Plains Peanuts          12       8        8           22
  6      6       30    Gilroy Garlics           4       4        8           26
  7      7        1    Gilroy Garlics           9       4        8           30
  8      7        4    Sacramento Tomatoes     15       9        9           39
  9      7        4    Sacramento Tomatoes     10      10       10           49
 10      7        5    Sacramento Tomatoes      2       3       10           52
Question: What would happen if there was no RETAIN statement? The maxruns value from the previous observation is not retained; so, for every observation, we take the maximum of runs and missing. Since the MAX function ignores missing values, maxruns will be the same as runs in every observation.

Without The RETAIN statement

Obs    month    day    team                   hits    runs    maxruns    runstodate
  1      6       19    Columbia Peaches         8       3        3            3
  2      6       20    Columbia Peaches        10       5        5            8
  3      6       23    Plains Peanuts           3       4        4           12
  4      6       24    Plains Peanuts           7       2        2           14
  5      6       25    Plains Peanuts          12       8        8           22
  6      6       30    Gilroy Garlics           4       4        4           26
  7      7        1    Gilroy Garlics           9       4        4           30
  8      7        4    Sacramento Tomatoes     15       9        9           39
  9      7        4    Sacramento Tomatoes     10      10       10           49
 10      7        5    Sacramento Tomatoes      2       3        3           52
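One way to read the sum statement is as shorthand for a RETAIN with an initial value of zero plus a call to the SUM function. The sketch below compares the two forms on a small made-up series that includes a missing value; the data set name totals and the variables total_a and total_b are hypothetical.

```sas
/* Sketch: a sum statement and its explicit RETAIN equivalent.
   totals, total_a and total_b are hypothetical names. */
data totals;
  input runs @@;
  total_a + runs;                  /* sum statement */
  retain total_b 0;                /* explicit equivalent: retain, start at 0 */
  total_b = sum(total_b, runs);    /* SUM ignores missing arguments */
  datalines;
3 5 . 4
;
run;

proc print data = totals;
  title 'Sum statement versus RETAIN with SUM';
run;
```

Both columns should show 3, 8, 8, 12. Writing total_b = total_b + runs instead would turn the total missing from the third observation onward, because the + operator propagates missing values.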
One-to-many Row Expansion: Sometimes, there is a need to expand one row of input to many rows in the data set. This type of reshaping is necessary as input to certain procedures like PROC GLM. The following program does that:
/************************* Display 9.4 Bailer**************************/
data D6;
  input name $ sex $ t1 t2 t3 t4 t5 time6 time_7;
  ARRAY num_array{*} _NUMERIC_;
  DO inum = 1 to dim(num_array);
    time = inum;
    ADL = num_array{inum};
    output;
  END;
  keep name sex time ADL;
  datalines;
Smith M 6 6 5 5 5 4 3
Jones F 7 5 4 4 3 2 1
Fisher M 5 5 5 3 2 2 1
;
run;

proc print data=d6;
  title "dataset = D6 [reshaping data sets]";
run;

dataset = D6 [reshaping data sets]

Obs    name      sex    time    ADL
  1    Smith      M       1      6
  2    Smith      M       2      6
  3    Smith      M       3      5
  4    Smith      M       4      5
  5    Smith      M       5      5
  6    Smith      M       6      4
  7    Smith      M       7      3
  8    Jones      F       1      7
  9    Jones      F       2      5
 10    Jones      F       3      4
 11    Jones      F       4      4
 12    Jones      F       5      3
 13    Jones      F       6      2
 14    Jones      F       7      1
 15    Fisher     M       1      5
 16    Fisher     M       2      5
 17    Fisher     M       3      5
 18    Fisher     M       4      3
 19    Fisher     M       5      2
 20    Fisher     M       6      2
 21    Fisher     M       7      1
The output statement inside the DO-END block is the key. As each record is read, the output gets executed seven times writing out the observations one at a time. The keep statement determines which variables are output to the data set. QUESTION: What will happen if there is no OUTPUT statement?
data D6;
  input name $ sex $ t1 t2 t3 t4 t5 time6 time_7;
  ARRAY num_array{*} _NUMERIC_;
  DO inum = 1 to dim(num_array);
    time = inum;
    ADL = num_array{inum};
    *output;
  END;
  keep name sex time ADL;
  datalines;
Smith M 6 6 5 5 5 4 3
Jones F 7 5 4 4 3 2 1
Fisher M 5 5 5 3 2 2 1
;
run;

proc print data=d6;
  title "dataset = D6 [reshaping data sets] without the OUTPUT statement";
run;

dataset = D6 [reshaping data sets] without the OUTPUT statement

Obs    name      sex    time    ADL
 1     Smith      M       7      3
 2     Jones      F       7      1
 3     Fisher     M       7      1
SAS Functions: Functions are pre-programmed sequences of instructions for common tasks. There are hundreds of SAS functions, including statistical, mathematical, financial, and character functions.
- Examples of mathematical functions are INT, LOG, SIN, COS, etc.
- Examples of statistical functions are MEAN, STD, SUM, etc.
- Examples of character functions are UPCASE, TRIM, SUBSTR, LENGTH, etc.

Functions have arguments and generally return a value. An example is

AVERAGE = MEAN(var1, var2, var3, ...);

Here, var1, var2, var3, ... are the arguments; AVERAGE is assigned the mean of var1, var2, var3, etc.
/********************* section 3.2 Little SAS Book *************/
/* variables are
   name        = name of contestant
   age         = age of contestant
   type        = carved or decorated
   date        = date of entry
   scr1 - scr5 = score from five judges */
data contest;
  input name $16. age 3. +1 type $1. +1 date mmddyy10.
        (scr1 scr2 scr3 scr4 scr5) (4.1);
  avgscore = mean(scr1, scr2, scr3, scr4, scr5);
  dayentered = day(date);
  type = upcase(type);
  datalines;
alicia grossman  13 c 10-28-2008 7.8 6.5 7.2 8.0 7.9
matthew lee       9 d 10-30-2008 6.5 5.9 6.8 6.0 8.1
elizabeth garcia 10 c 10-29-2008 8.9 7.9 8.5 9.0 8.8
lori newcombe     6 d 10-30-2008 6.7 5.6 4.9 5.2 6.1
jose martinez     7 d 10-31-2008 8.9 9.510.0 9.7 9.0
brian williams   11 c 10-29-2008 7.8 8.4 8.5 7.9 8.0
run;

proc print data = contest;
  title 'SAS Function';
run;

SAS Function

Obs  name              age  type   date   scr1  scr2  scr3  scr4  scr5  avgscore  dayentered
 1   alicia grossman    13   C    17833   7.8   6.5   7.2   8.0   7.9     7.48        28
 2   matthew lee         9   D    17835   6.5   5.9   6.8   6.0   8.1     6.66        30
 3   elizabeth garcia   10   C    17834   8.9   7.9   8.5   9.0   8.8     8.62        29
 4   lori newcombe       6   D    17835   6.7   5.6   4.9   5.2   6.1     5.70        30
 5   jose martinez       7   D    17836   8.9   9.5  10.0   9.7   9.0     9.42        31
 6   brian williams     11   C    17834   7.8   8.4   8.5   7.9   8.0     8.12        29
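The MEAN function in the contest program skips missing arguments, whereas an arithmetic expression returns missing as soon as one operand is missing. A minimal sketch of the difference, using made-up scores and the hypothetical variable names avg_fn and avg_arith:

```sas
/* Sketch: MEAN function versus an arithmetic average when one score is missing.
   avg_fn and avg_arith are hypothetical names; the scores are made up. */
data _null_;
  scr1 = 7.8;
  scr2 = .;
  scr3 = 7.2;
  avg_fn    = mean(scr1, scr2, scr3);      /* mean of the two non-missing values: 7.5 */
  avg_arith = (scr1 + scr2 + scr3) / 3;    /* missing, because scr2 is missing */
  put avg_fn= avg_arith=;
run;
```

The log should also carry a note that missing values were generated by the arithmetic expression, which is a useful warning sign to look for.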
The INPUT statement uses a new way of reading data when several variables share the same informat: a list of variables in parentheses followed by the informat in parentheses. The MEAN function computes the average of the judges' scores; UPCASE is a character function that changes the case to upper case. One of the nice things about the MEAN function is that it computes the mean of only the non-missing values; this way you will not get a missing answer when one of the values is missing, which is what happens with an arithmetic assignment statement.

CDF and QUANTILE functions: Two important statistical functions are CDF and QUANTILE. The CDF function finds the tail area, and the QUANTILE function finds the value of the random variable for a given tail area. Remember the statistical tables at the back of the book? The p-values and quantiles from these tables can be calculated using these functions. The syntax for the CDF function is

CDF(distribution name, value, <parameters>)

- The CDF function computes probabilities from the lower tail
- Upper tail probabilities can be computed as 1 - CDF()
- The CDF function is available for several continuous distributions (normal, t, F, etc.) and several discrete distributions (binomial, Poisson, etc.)
data cdf_examples;
  /* Z ~ N(0,1) table values */
  norm_area_left  = cdf("Normal",-1.645);
  norm_area_right = 1-cdf("Normal",-1.645);  * area above -1.645 under N(0,1);

  /* T ~ t(df) table values */
  t_area_left_06  = cdf("T",-1.645, 6);      * area <= -1.645 for t(df=6);
  t_area_left_60  = cdf("T",-1.645, 60);     * area <= -1.645 for t(df=60);
  t_area_left_600 = cdf("T",-1.645, 600);    * area <= -1.645 for t(df=600);

  /* Pr(Y<=m) for Y ~ binomial(m=successes, p=prob of success=0.5, n=trials=4) */
  bin_cdf_0 = CDF('binomial', 0, 0.50, 4);
  bin_cdf_1 = CDF('binomial', 1, 0.50, 4);
  bin_cdf_2 = CDF('binomial', 2, 0.50, 4);
  bin_cdf_3 = CDF('binomial', 3, 0.50, 4);
  bin_cdf_4 = CDF('binomial', 4, 0.50, 4);

  /* Pr(Y=m) for Y ~ binomial(p=0.5, n=4) */
  p0 = bin_cdf_0;
  p1 = bin_cdf_1 - bin_cdf_0;
  p2 = bin_cdf_2 - bin_cdf_1;
  p3 = bin_cdf_3 - bin_cdf_2;
  p4 = bin_cdf_4 - bin_cdf_3;
run;
proc print data=cdf_examples; title 'CDF Functions'; run;

CDF Functions (one observation, listed here by variable)

Variable            Value
norm_area_left      0.049985
norm_area_right     0.95002
t_area_left_06      0.075536
t_area_left_60      0.052600
t_area_left_600     0.050247
bin_cdf_0           0.0625
bin_cdf_1           0.3125
bin_cdf_2           0.6875
bin_cdf_3           0.9375
bin_cdf_4           1
p0                  0.0625
p1                  0.25
p2                  0.375
p3                  0.25
p4                  0.0625
Here, the parameters of the normal distribution are (0,1). As the degrees of freedom of the t-distribution increase, it approaches the normal distribution. The parameters of the binomial are p = probability of success and n = number of trials.

For a discrete distribution, like the binomial, P(X=x) can be found using the PDF function:

PDF(distribution name, value, <parameters>)

data pdf_examples;
  /* Pr(Y=m) for Y ~ binomial(p=prob of success=0.5, n=trials=4) */
  bin_0 = PDF('binomial', 0, 0.50, 4);
  bin_1 = PDF('binomial', 1, 0.50, 4);
  bin_2 = PDF('binomial', 2, 0.50, 4);
  bin_3 = PDF('binomial', 3, 0.50, 4);
  bin_4 = PDF('binomial', 4, 0.50, 4);
run;

proc print data=pdf_examples;
  title 'PDF Functions';
run;

PDF Functions

Obs    bin_0     bin_1    bin_2    bin_3    bin_4
 1     0.0625    0.25     0.375    0.25     0.0625
The answers are the same as the ones obtained before by differencing successive CDF values, CDF(x) - CDF(x-1).

The inverse function to CDF is QUANTILE, which returns the value of x such that F(x)=p. Here, we specify the probability and ask for the value of x. The syntax is:

QUANTILE(distribution, probability, <parameters>)
/*************** QUANTILES display 8.55 Bailer ****************/
data quant_calc;
  * z examples;
  zq_50  = QUANTILE('Normal',0.50);
  zq_90  = QUANTILE('Normal',0.90);
  zq_95  = QUANTILE('Normal',0.95);
  zq_975 = QUANTILE('Normal',0.975);
  put "Z: 50th percentile = " @25 zq_50;
  put "Z: 90th percentile = " @25 zq_90;
  put "Z: 95th percentile = " @25 zq_95;
  put "Z: 97.5th percentile = " @25 zq_975;
  put " ";

  * binomial examples;
  binq_50  = QUANTILE('Binomial',0.50,.50,4);
  binq_90  = QUANTILE('Binomial',0.90,.50,4);
  binq_95  = QUANTILE('Binomial',0.95,.50,4);
  binq_975 = QUANTILE('Binomial',0.975,.50,4);
  put "Binomial: 50th percentile = " @35 binq_50;
  put "Binomial: 90th percentile = " @35 binq_90;
  put "Binomial: 95th percentile = " @35 binq_95;
  put "Binomial: 97.5th percentile = " @35 binq_975;
  put " ";
run;
The PUT statement writes the results to the log by default. Here, the common 50th, 90th, 95th, and 97.5th percentiles of Z and the binomial are computed using the QUANTILE function.
Z: 50th percentile =    -1.15194E-17
Z: 90th percentile =    1.2815515655
Z: 95th percentile =    1.644853627
Z: 97.5th percentile =  1.9599639845

Binomial: 50th percentile =       2
Binomial: 90th percentile =       3
Binomial: 95th percentile =       4
Binomial: 97.5th percentile =     4
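Since QUANTILE is the inverse of CDF, the two can be combined to reproduce table look-ups in both directions. The following is a minimal sketch, not from Bailer; the variable names z_crit, back, t_crit and p_two are hypothetical, and the t statistic of -1.645 with 6 degrees of freedom is reused from the CDF example.

```sas
/* Sketch: critical values via QUANTILE and a two-sided p-value via CDF.
   All variable names are hypothetical. */
data _null_;
  z_crit = quantile('Normal', 0.975);            /* upper 2.5% point of N(0,1) */
  back   = cdf('Normal', z_crit);                /* CDF undoes QUANTILE: 0.975 */
  t_crit = quantile('T', 0.975, 6);              /* two-sided 5% critical value, df=6 */
  p_two  = 2 * (1 - cdf('T', abs(-1.645), 6));   /* two-sided p-value for t = -1.645 */
  put z_crit= back= t_crit= p_two=;
run;
```

By symmetry of the t-distribution, p_two should be twice the lower-tail area 0.075536 reported earlier for t(df=6).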
SAS Macro Programming: When a certain task is done repeatedly the macro facility makes your program more efficient and avoids repeating the statements over and over again. It makes production jobs a lot more efficient. Writing macro programs is like meta programming because the macro program you write in turn writes regular SAS programming code. When SAS encounters a macro program the macro processor interprets the code and in turn writes a program using regular programming statements:
SAS macro program  -->  Macro processor  -->  Regular SAS statements
When the word scanner senses macro triggers, it sends that part of the code to the macro processor to resolve the macro statements. What are macro triggers? There are two symbols (% and &) that act as macro triggers. Macro variable: One of the simplest uses of macro programming is to use a macro variable. A macro variable can be used to substitute a value throughout the program instead of hard coding different values for that variable each time. This can be done through a %LET statement: %LET macro-variable name = value;
When the macro processor encounters the macro variable in the program (identified by using &macro-variable name), it substitutes that value for every occurrence.

Local and Global Macro Variables: Macro variables can be local or global.
- A macro variable is local if it is defined inside a macro; it can be used only within that macro (between %MACRO and %MEND)
- A macro variable is global if it is defined outside of a macro in open code (for example, with %LET); it can be used anywhere in the program

Here is an example using %LET. There are three types of companies (Pharmaceuticals, Textiles, Super Market). Let us say we want to print selectively by type.
options ls=80 formdlim='=' ps=64 nodate nonumber;
data fin;
  infile 'C:\Users\narayaa.BUSINESS\Documents\Statistical Programming BANA\Class 5\fin.dat' firstobs=5;
  input type $ id ror debt_eq sales eps npm pe profit;
run;

/****** use %LET *******/
%LET type=Textil;
proc print data=fin;
  where type="&type";
  title "Data for &type Companies";
run;

Data for Textil Companies

Obs    type      id     ror    debt_eq    sales     eps    npm    pe     profit
 15    Textil    15     9.9      2.7       40.6    34.8    6.0    22    0.40820
 16    Textil    16     9.9      0.9       28.1    23.7    6.9    22    0.17003
 17    Textil    17     8.5      1.2       39.7    24.9    5.1    20    0.31763
 18    Textil    18     9.3      1.1       22.3    22.5    6.1    19    0.30667
 19    Textil    19    13.3      0.3       16.9    17.0    5.7    14    0.31658
Note: In the %LET statement, the value does not require quotation marks even though it is a character value. This assignment statement creates a macro variable called type, referenced as &type, which can be used anywhere in the program, including titles. The macro processor does not resolve macro variable references inside single quotes; so, use double quotation marks.
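The effect of the quotation marks can be checked directly in the log. A minimal sketch, assuming the same %LET as above; the data step variables dq and sq are hypothetical names:

```sas
/* Sketch: macro variable references resolve inside double quotes only.
   dq and sq are hypothetical variable names. */
%let type = Textil;

data _null_;
  length dq sq $20;
  dq = "&type";    /* resolved by the macro processor: Textil */
  sq = '&type';    /* left as the literal text &type */
  put dq= sq=;
run;
```

The log should show dq=Textil and sq=&type, which is why a WHERE clause or TITLE written with single quotes silently fails to pick up the macro variable's value.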
Passing Parameters to Macro: Let us say we want to look at a specific type of company, but only the top 5 according to sales. This involves passing parameters to a macro. First, let us define the macro:

%MACRO macro-name (parameter1=, parameter2=, ..., parameterN=);
  SAS statements;
%MEND macro-name;

This macro can be invoked using

%macro-name (parameter1=, parameter2=, ..., parameterN=);

The semicolon (;) is not necessary in the macro invocation, but it is good practice to keep it.
/************ passing parameters to macro ***********/
%macro prinit(type, sortby, dsn, nobs);
  proc sort data=&dsn;
    by descending &sortby;
  run;
  proc print data=&dsn (obs=&nobs);
    where type="&type";
    title "Largest &nobs Observations for &type Company";
  run;
%mend prinit;

%prinit(type=SuperM, sortby=sales, dsn=fin, nobs=5);

Largest 5 Observations for SuperM Company

Obs    type      id     ror    debt_eq    sales     eps    npm    pe     profit
  7    SuperM    23    10.9      1.1       18.8    16.1    1.8     9    0.32749
 10    SuperM    20    15.7      0.7       16.3    12.0    1.7     8    0.61254
 12    SuperM    21    18.4      0.2       15.7    12.2    1.7     9    0.60138
 17    SuperM    24    10.4      0.5       13.6    18.1    1.0     6    0.47901
 21    SuperM    25     9.8      1.0       12.2     5.0    1.0     7    0.59935
Now, we have built in the flexibility to choose:
- the type of company
- the sorting variable
- the data set to analyze
- the number of observations to be printed
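The same flexibility can be combined with default values by declaring keyword parameters in the macro definition. The sketch below is an illustration rather than the macro from class: the macro name topn, the work data set _sorted, and the use of the built-in sashelp.class data set (with its height and weight variables) are assumptions made so that the example runs without fin.dat.

```sas
/* Sketch: keyword parameters with default values.
   topn and _sorted are hypothetical names; sashelp.class ships with SAS. */
%macro topn(dsn=sashelp.class, sortby=height, nobs=3);
  proc sort data=&dsn out=_sorted;    /* OUT= leaves the input data set untouched */
    by descending &sortby;
  run;
  proc print data=_sorted(obs=&nobs);
    title "Largest &nobs observations of &dsn by &sortby";
  run;
%mend topn;

%topn()                        /* all defaults */
%topn(sortby=weight, nobs=5)   /* override two of the three parameters */
```

Defaults keep the common invocation short, and any parameter left out of the call falls back to the value given in the %MACRO statement.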
Conditional Logic: Conditional logic in a macro adds more flexibility to the program. The syntax is:

%IF condition %THEN %DO;
  SAS statements;
%END;
%ELSE %IF other-condition %THEN %DO;
  Other SAS statements;
%END;

Here, we want to produce two types of reports: report=a prints the entire data; otherwise (for example, report=b), only summary statistics are printed.
/****************** using conditional logic with %IF--%THEN ************/
%macro prinit1(dsn,report);
  %if &report=a %then %do;
    proc print data=&dsn;
      title "Type of Report=&report";
    run;
  %end;
  %else %do;
    proc means data=&dsn;
      var _numeric_;
      title "Type of Report=&report";
    run;
  %end;
%mend prinit1;

%prinit1(dsn=fin,report=b);

Type of Report=b

The MEANS Procedure

Variable     N          Mean       Std Dev       Minimum       Maximum
id          25    13.0000000     7.3598007     1.0000000    25.0000000
ror         25    10.8960000     2.6224162     6.7000000    18.4000000
debt_eq     25     0.7040000     0.5435071     0.2000000     2.7000000
sales       25    17.4280000     7.9697616     9.8000000    40.6000000
eps         25    13.6480000     8.3780029    -3.0000000    34.8000000
npm         25     4.4200000     2.2477767     1.0000000     8.0000000
pe          25    10.7600000     4.8500859     5.0000000    22.0000000
profit      25     0.4189075     0.1246135     0.1700290     0.6125430
Useful Options in Macro Programming: SAS has several useful options for debugging macro programs. Some of the useful options are:

Macro Option    What it does
MLOGIC          Prints details of execution logic
MPRINT          Prints the SAS code generated by the macro
SYMBOLGEN       Prints values of macro variables

Note: All details are printed in the log.

options mprint nomlogic nosymbolgen;
%prinit1(dsn=fin,report=b);
With the MPRINT option on, SAS prints the code written by the macro processor.
182  options mprint nomlogic nosymbolgen;
183  %prinit1(dsn=fin,report=b);
MPRINT(PRINIT1):   proc means data=fin;
MPRINT(PRINIT1):   var _numeric_;
MPRINT(PRINIT1):   title "Type of Report=b";
MPRINT(PRINIT1):   run;
options nomprint mlogic nosymbolgen; %prinit1(dsn=fin,report=b);
With the MLOGIC option, details about the execution of the macro are printed:

184  options nomprint mlogic nosymbolgen;
185  %prinit1(dsn=fin,report=b);
MLOGIC(PRINIT1):  Beginning execution.
MLOGIC(PRINIT1):  Parameter DSN has value fin
MLOGIC(PRINIT1):  Parameter REPORT has value b
MLOGIC(PRINIT1):  %IF condition &report=a is FALSE
NOTE: There were 25 observations read from the data set WORK.FIN.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.01 seconds
      cpu time            0.03 seconds
MLOGIC(PRINIT1):  Ending execution.
With the SYMBOLGEN option, details about the resolution of macro variables are printed:
options nomprint nomlogic symbolgen; %prinit1(dsn=fin,report=b);
SYMBOLGEN: SYMBOLGEN: SYMBOLGEN: Macro variable REPORT resolves to b Macro variable DSN resolves to fin Macro variable REPORT resolves to b
Note: You may not need all the options. Turning all of them on at once makes the log difficult to read. Suggestion: Turn the options on one at a time until you find the problem.
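The suggestion above can be sketched as follows. This is only an illustration: %show_means is a hypothetical macro written for this example (it is not the %prinit1 macro from these notes), and SASHELP.CLASS is assumed to be available as a sample data set. Each step turns on a single debugging option, and the last step switches all three off again.

```sas
/* Hypothetical macro for illustration only; not part of the class notes */
%macro show_means(dsn=, report=);
   %if &report = a %then %do;
      proc print data=&dsn (obs=5);
         title "Type of Report=&report";
      run;
   %end;
   %else %do;
      proc means data=&dsn;
         var _numeric_;
         title "Type of Report=&report";
      run;
   %end;
%mend show_means;

/* Step 1: look only at the SAS code generated by the macro */
options mprint nomlogic nosymbolgen;
%show_means(dsn=sashelp.class, report=b)

/* Step 2: look only at the execution logic (%IF conditions, parameters) */
options nomprint mlogic nosymbolgen;
%show_means(dsn=sashelp.class, report=b)

/* Step 3: look only at how macro variables resolve */
options nomprint nomlogic symbolgen;
%show_means(dsn=sashelp.class, report=b)

/* Reset the options and clear the title so later logs stay readable */
options nomprint nomlogic nosymbolgen;
title;
```

Running the same call three times keeps each log excerpt short, so it is easier to see which option reveals the problem.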
Modular Programming: The basic idea in modular programming is to subdivide the problem into smaller components that can be programmed and tested as separate modules. An excellent example of this is given in Section 8.6 (Bailer). The example uses a Monte Carlo simulation to test the robustness of the two-sample t-test to violations of the equal-variance assumption. It starts by writing the pseudo-code:
/*************** Display 8.30 ***********************/
/* Problem:  Explore whether t-test really is robust to
             violations of the equal variance assumption
   Strategy: See if the t-test operates at the nominal Type I
             error rate when the equal variance assumption
             is violated */

/* specify the conditions to be generated */
/* generate data sets reflecting these conditions */
/* calculate the test statistic */
/* accumulate results over numerous simulated data sets */
Then, expanding on specifying the conditions to be generated:

/************************ Display 8.31 ******************/
/* Problem:  Explore whether t-test really is robust to
             violations of the equal variance assumption
   Strategy: See if the t-test operates at the nominal Type I
             error rate when the equal variance assumption
             is violated */

/* specify the conditions to be generated */
Nsims = 1;              * number of simulated experiments;
Myseed = 65432;         * specify seed for random number sequence;
N1 = 10; N2 = 10;       * sample sizes from populations 1 and 2;
Mu_1 = 0; Sig_1 = 1;    * mean/sd of population 1;
Mu_2 = 0; Sig_2 = 1;    * mean/sd of population 2;

/* generate data sets reflecting these conditions */
* generate N1 observations ~ N(mu_1, sig_1^2) ;
* generate N2 observations ~ N(mu_2, sig_2^2) ;
/* calculate the test statistic */
/* accumulate results over numerous simulated data sets */
Then, generate the data sets:

/******************** Display 8.32 ********************/
/* Problem:  Explore whether t-test really is robust to
             violations of the equal variance assumption
   Strategy: See if the t-test operates at the nominal Type I
             error rate when the equal variance assumption
             is violated */

/* specify the conditions to be generated */
data simulate_2group_t;
   Nsims = 1;                 * number of simulated experiments;
   Myseed = 65432;            * specify seed for random number sequence;
   call streaminit(Myseed);   * see Section 8.11 for more descrip.;
   N1 = 10; N2 = 10;          * sample sizes from populations 1 and 2;
   Mu_1 = 0; Sig_1 = 1;       * mean/sd of population 1;
   Mu_2 = 0; Sig_2 = 1;       * mean/sd of population 2;

   do iexpt = 1 to Nsims;
      /* generate data sets reflecting these conditions */
      * generate N1 observations ~ N(mu_1, sig_1^2) ;
      do ix = 1 to N1;
         group = 1;
         Y = RAND('normal',mu_1,sig_1);
         output;
      end;
      * generate N2 observations ~ N(mu_2, sig_2^2) ;
      do ix = 1 to N2;
         group = 2;
         Y = RAND('normal',mu_2,sig_2);
         output;
      end;
   end;   * of the do-loop over simulated experiments;
run;

/* calculate the test statistic */
/* accumulate results over numerous simulated data sets */

proc print data=simulate_2group_t;
run;

proc means data=simulate_2group_t;
   var y;
   class group;
run;
Then, compute the test statistic:

/*********************** Display 8.33 ****************************/
/* Problem:  Explore whether t-test really is robust to
             violations of the equal variance assumption
   Strategy: See if the t-test operates at the nominal Type I
             error rate when the equal variance assumption
             is violated */

/* specify the conditions to be generated */
data simulate_2group_t;
   Nsims = 1;                 * number of simulated experiments;
   Myseed = 65432;            * specify seed for random number sequence;
   call streaminit(Myseed);
   N1 = 10; N2 = 10;          * sample sizes from populations 1 and 2;
   Mu_1 = 0; Sig_1 = 1;       * mean/sd of population 1;
   Mu_2 = 0; Sig_2 = 1;       * mean/sd of population 2;

   do iexpt = 1 to Nsims;
      /* generate data sets reflecting these conditions */
      * generate N1 observations ~ N(mu_1, sig_1^2) ;
      do ix = 1 to N1;
         group = 1;
         Y = RAND('normal',mu_1,sig_1);
         output;
      end;
      * generate N2 observations ~ N(mu_2, sig_2^2) ;
      do ix = 1 to N2;
         group = 2;
         Y = RAND('normal',mu_2,sig_2);
         output;
      end;
   end;   * of the do-loop over simulated experiments;
run;

/* calculate the test statistic */
ods trace on/listing;
proc ttest data=simulate_2group_t;
   by iexpt;
   class group;
   var Y;
run;
ods trace off;
After isolating the table name from the trace, run the following program:
ods output TTests=Out_TTests;
proc ttest data=simulate_2group_t;
   by iexpt;
   class group;
   var Y;
run;
ods output close;
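Before using the captured table, it can help to look at what it contains. The following is a minimal sketch that assumes the ODS OUTPUT step above has already run and created WORK.Out_TTests; it only lists the variables and prints the first few rows.

```sas
/* List the variables that PROC TTEST wrote to the TTests output table */
proc contents data=Out_TTests;
run;

/* Print the first rows: one row per test method for each value of iexpt */
proc print data=Out_TTests (obs=4);
   title 'First rows of Out_TTests';
run;

/* Clear the title so it does not carry over to later output */
title;
```

The accumulation step later in these notes relies on the Method and Probt variables in this table.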
The results of running the procedure on 4000 simulated experiments are shown below:
Results of simulation of 2-sample t-test to violations of homogeneity

The FREQ Procedure

                                          Cumulative    Cumulative
Pooled_reject    Frequency     Percent     Frequency       Percent
            0         3801       95.03          3801         95.03
            1          199        4.98          4000        100.00

                                          Cumulative    Cumulative
Satter_reject    Frequency     Percent     Frequency       Percent
            0         3807       95.18          3807         95.18
            1          193        4.83          4000        100.00
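As a rough check on the tables above, the rejection proportions can be compared with the Monte Carlo standard error of an estimated rejection rate. This sketch assumes the 4000 simulated experiments are independent and that the true rejection rate equals the nominal level $\alpha = 0.05$; the symbols $\hat{p}$ and $\mathrm{SE}$ are introduced here for illustration only.

```latex
\hat{p}_{\text{pooled}} = \frac{199}{4000} \approx 0.0498, \qquad
\hat{p}_{\text{Satter}} = \frac{193}{4000} \approx 0.0483, \qquad
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\alpha(1-\alpha)}{N_{\text{sims}}}}
  = \sqrt{\frac{0.05 \times 0.95}{4000}} \approx 0.0034
```

Both observed proportions lie within one standard error of 0.05, so neither differs from the nominal level by more than simulation noise would explain.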
The pooled t-test rejects the null hypothesis in 4.98% of the simulated experiments when the data are from normal distributions with equal variances, which is what we would expect since we chose a nominal level of 0.05.

Extension to macro programming: Can the above program to test the robustness of the t-test be written as a macro?
/****************** convert previous program into a macro *********************/
/* keyword parameters, so the macro can be called with name=value pairs */
%macro ttestsim(n1=,n2=,nsims=,mu1=,mu2=,sigma1=,sigma2=,seed=,pval=);
data simulate_2group_t;
   call streaminit(&seed);
   do iexpt = 1 to &nsims;
      /* generate data sets reflecting these conditions */
      * generate N1 observations ~ N(mu_1, sig_1^2) ;
      do ix = 1 to &n1;
         group = 1;
         Y = RAND('normal',&mu1,&sigma1);
         output;
      end;
      * generate N2 observations ~ N(mu_2, sig_2^2) ;
      do ix = 1 to &n2;
         group = 2;
         Y = RAND('normal',&mu2,&sigma2);
         output;
      end;
   end;   * of the do-loop over simulated experiments;
run;

/* calculate the test statistic */
/* Note: ODS TRACE was used to determine the output object
   containing the test statistics. This included the
   pooled-variance t-test and the Satterthwaite df
   approximation for the t-test allowing for unequal variances */
ods output TTests=Out_TTests;
proc ttest data=simulate_2group_t;
   by iexpt;
   class group;
   var Y;
run;
ods output close;

proc print data=simulate_2group_t (obs=21);
run;

/* accumulate results over numerous simulated data sets */
/* refer to Table 8.35 in Bailer (p.297) */
data out_ttests;
   set out_ttests;
   retain Pooled_p;   * RETAIN explained in Section 8.3;
   if method="Pooled" then Pooled_p = Probt;
   else do;
      Satter_p = Probt;
      Pooled_reject = (Pooled_p <= &pval);   * Boolean trick again;
      Satter_reject = (Satter_p <= &pval);
      keep iexpt Pooled_p Satter_p Pooled_reject Satter_reject;
      output;
   end;
run;

proc freq data=out_ttests;
   table Pooled_reject Satter_reject;
   title 'Results of simulation of 2-sample t-test to violations of homogeneity';
run;
%mend ttestsim;

options mprint nomlogic nosymbolgen;
%ttestsim(n1=10,n2=10,nsims=40,mu1=0,mu2=0,sigma1=1,sigma2=2,seed=45056,pval=0.05);
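Once the simulation is a macro, exploring several variance settings becomes a matter of calling it repeatedly. The sketch below is one possible way to do that, not part of the original notes: %run_scenarios and its sigma2_list parameter are hypothetical names, and the sketch assumes %ttestsim has been compiled and accepts name=value parameters as in the call above.

```sas
/* Hypothetical driver macro: runs %ttestsim once per value in sigma2_list */
%macro run_scenarios(sigma2_list=1 2 4);
   %local i s2;
   %let i = 1;
   %let s2 = %scan(&sigma2_list, &i, %str( ));
   %do %while (%length(&s2) > 0);
      /* Label the output so the scenarios can be told apart */
      footnote "sigma1 = 1, sigma2 = &s2";
      %ttestsim(n1=10, n2=10, nsims=40, mu1=0, mu2=0,
                sigma1=1, sigma2=&s2, seed=45056, pval=0.05)
      /* Move on to the next value in the list */
      %let i = %eval(&i + 1);
      %let s2 = %scan(&sigma2_list, &i, %str( ));
   %end;
   /* Clear the footnote after the last scenario */
   footnote;
%mend run_scenarios;

%run_scenarios(sigma2_list=1 2 4)
```

Keeping the scenario loop in its own macro follows the modular idea of this section: the simulation module stays unchanged while the driver decides which conditions to run.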