
SAS® Programming 2: Data Manipulation Techniques

Course Notes
SAS® Programming 2: Data Manipulation Techniques Course Notes was developed by Stacey
Syphus, Beth Hardin, and Michele Ensor. Additional contributions were made by Bruce Dawless,
Anita Hillhouse, Marty Hultgren, Mark Jordan, Eva-Maria Kegelmann, Gina Repole, Samantha
Rowland, Allison Saito, Prem Shah, Kristin Snyder, Peter Styliadis, and Kitty Tjaris. Instructional
design, editing, and production support was provided by the Learning Design and Development
team.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

SAS® Programming 2: Data Manipulation Techniques Course Notes

Copyright © 2020 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States
of America. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise,
without the prior written permission of the publisher, SAS Institute Inc.

Book code E71658, course code LWPG2V2/PG2V2_001, prepared date 01Apr2020. LWPG2V2_001

ISBN 978-1-62960-550-0
Table of Contents

Lesson 1 Controlling DATA Step Processing
1.1 Setting Up for This Course
1.2 Understanding DATA Step Processing
    Demonstration: DATA Step Processing
    Practice
1.3 Directing DATA Step Output
    Demonstration: Controlling Row Output
    Demonstration: Controlling Column Output
    Practice
1.4 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

Lesson 2 Summarizing Data
2.1 Creating an Accumulating Column
    Demonstration: Creating an Accumulating Column
    Practice
2.2 Processing Data in Groups
    Demonstration: Identifying the First and Last Row in Each Group
    Demonstration: Creating an Accumulating Column within Groups
    Practice
2.3 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

Lesson 3 Manipulating Data with Functions
3.1 Understanding SAS Functions and CALL Routines
3.2 Using Numeric and Date Functions
    Demonstration: Using Numeric Functions
    Demonstration: Shifting Date Values
    Practice
3.3 Using Character Functions
    Demonstration: Using Character Functions to Extract Words from a String
    Practice
3.4 Using Special Functions to Convert Column Type
    Demonstration: Using the INPUT and PUT Functions to Convert Column Types
3.5 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

Lesson 4 Creating Custom Formats
4.1 Creating and Using Custom Formats
    Demonstration: Creating and Using Custom Formats
    Practice
4.2 Creating Custom Formats from Tables
    Demonstration: Creating Custom Formats from Tables
    Practice
4.3 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

Lesson 5 Combining Tables
5.1 Concatenating Tables
    Demonstration: Concatenating Tables
    Practice
5.2 Merging Tables
    Demonstration: Merging Tables
5.3 Identifying Matching and Nonmatching Rows
    Demonstration: Merging Tables with Nonmatching Rows
    Practice
5.4 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

Lesson 6 Processing Repetitive Code
6.1 Using Iterative DO Loops
    Demonstration: Executing an Iterative DO Loop
    Demonstration: Using Iterative DO Loops
    Practice
6.2 Using Conditional DO Loops
    Demonstration: Using Conditional DO Loops
    Demonstration: Combining Iterative and Conditional DO Loops
    Practice
6.3 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

Lesson 7 Restructuring Tables
7.1 Restructuring Data with the DATA Step
    Demonstration: Creating a Narrow Table with the DATA Step
    Practice
7.2 Restructuring Data with the TRANSPOSE Procedure
    Demonstration: Creating a Wide Table with PROC TRANSPOSE
    Practice
7.3 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

To learn more…
For information about other courses in the curriculum, contact the
SAS Education Division at 1-800-333-7660, or send e-mail to
training@sas.com. You can also find this information on the web at
http://support.sas.com/training/ as well as in the Training Course
Catalog.

For a list of SAS books (including e-books) that relate to the topics
covered in these course notes, visit https://www.sas.com/sas/books.html or
call 1-800-727-0025. US customers receive free shipping to US
addresses.

Lesson 1 Controlling DATA Step Processing

1.1 Setting Up for This Course
1.2 Understanding DATA Step Processing
    Demonstration: DATA Step Processing
    Practice
1.3 Directing DATA Step Output
    Demonstration: Controlling Row Output
    Demonstration: Controlling Column Output
    Practice
1.4 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

Copyright © 2020, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.

1.1 Setting Up for This Course

Course Overview

[Slide graphic: Access data, Explore data, Prepare data, Analyze and report on data, Export results]

The complete SAS programming process includes accessing data, exploring and validating data,
preparing data, analyzing and reporting on data, and exporting results. But it is likely that the
majority of your time as a programmer is spent preparing data. For this reason, this course is
focused on the SAS DATA step and various procedures that expand your skills and help make you
more productive working with your data.


• This lesson, “Controlling DATA Step Processing,” digs deeper into the DATA step. You learn how
the DATA step processes data behind the scenes. Then you use this knowledge to control when
and where the DATA step outputs rows to new tables.
• In “Summarizing Data,” you learn how to create an accumulating column—in other words, how to
generate a running total. Then you learn how to process data in groups so that you can perform
an action when each group begins or ends.
• In “Manipulating Data with Functions,” you learn how to use some new functions that enable you
to manipulate numeric, date, and character values. In addition, you learn how to use functions that
change a column from one data type to another.
• In “Creating Custom Formats,” you learn how to create and use custom formats to enhance how
your data is displayed in a table or report.
• In “Combining Tables,” you learn how to concatenate tables, merge tables, and identify matching
and nonmatching rows.
• In “Processing Repetitive Code,” you learn how to save time by taking advantage of iterative and
conditional processing with DO loops.
• In “Restructuring Tables,” you learn techniques that can be used to transpose or restructure a
table.


Practicing in This Course

[Slide graphic: data used in this course: international storm and weather data, US National Park data, European tourism and trade data, and the class, cars, and shoes tables]

In this course, you analyze mainly international storm and weather data. This is real data about
storms such as hurricanes, typhoons, and cyclones that has been collected since 1980. The
practices use various tables from US national parks and European tourism and trade. The course
also uses tables from the Sashelp library to illustrate new data manipulation techniques.
• The detailed international storm data can be found at
https://www.ncdc.noaa.gov/ibtracs/index.php?name=wmo-data as part of the International Best
Track Archive for Climate Stewardship (IBTrACS). The data has been summarized and cleansed
to use in this course.
• The US National Park data can be found at https://irma.nps.gov/Stats/Reports/National. The data
has been summarized and cleansed to use in this course.
• The European tourism data can be found at http://ec.europa.eu/eurostat/data/database. The data
has been summarized and cleansed to use in this course.
• SAS sample tables are provided in the Sashelp library. See
https://support.sas.com/documentation/tools/sashelpug.pdf for documentation about the available
tables.


Practicing in This Course


• Demonstration: Performed by your instructor as an example for you to observe
• Activity: Short practice opportunities for you to work in SAS, either independently or with the guidance of your instructor
• Practice: Extended practice opportunities for you to work independently
• Case Study: A comprehensive practice opportunity at the end of the class

Case studies can be accessed on the Extended Learning page for your course.

Choosing a Practice Level


• Level 1: Solve basic problems with step-by-step guidance.
• Level 2: Solve intermediate problems with defined goals.
• Challenge: Solve complex problems with SAS Help and documentation resources.

Choose one practice to do in class based on your interest and skill level.


SAS Programming Interfaces

• SAS Studio
• SAS Enterprise Guide
• SAS windowing environment

You can use the interface of your choice, but some demonstrations in this course use features specifically in SAS Enterprise Guide.

SAS has several programming interfaces that you can use to interactively write and submit code.
These interfaces include the SAS windowing environment (the interface that is part of SAS), SAS
Enterprise Guide (a client application that runs on your PC and accesses SAS on a local or remote
server), and SAS Studio (a web-based interface to SAS that you can use on any computer).
Note: In this class, we use SAS Studio and SAS Enterprise Guide because they include the most
modern programming tools.

Accessing the Course Files

course files
    activities
    data
    demos
    practices

Make note of the location of your course files folder.


Accessing the Course Files


Programs in the activities, demos, and practices folders follow this naming convention.

p204d01.sas: Programming 2, Lesson 4, demo 1

These folders contain starter SAS programs for you to use. The file names follow this naming
convention: the name starts with p2 for Programming 2, followed by two digits for the lesson number.
Then the letter A, D, or P indicates activity, demo, or practice, followed by a sequential two-digit
number within the lesson. When you come to an activity, demo, or practice, the instructions indicate
the file that you need to open. There is also a solutions folder in the practices folder that has
complete solution programs.
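
The naming convention can be checked with a short DATA step. This is only an illustrative sketch, not part of the course files: the column names (FileName, Course, Lesson, Code, Sequence, Type) are hypothetical, and the positions passed to SUBSTR assume the pattern shown on the slide (p204d01.sas).

```sas
/* Hypothetical sketch: split a course file name into its parts. */
data _null_;
    length FileName $ 20 Type $ 8;
    FileName = "p204d01.sas";
    Course   = input(substr(FileName, 2, 1), 1.);   /* 2  = Programming 2        */
    Lesson   = input(substr(FileName, 3, 2), 2.);   /* 04 = Lesson 4             */
    Code     = lowcase(substr(FileName, 5, 1));     /* a, d, or p                */
    Sequence = input(substr(FileName, 6, 2), 2.);   /* 01 = first in the lesson  */
    if Code = "a" then Type = "activity";
    else if Code = "d" then Type = "demo";
    else if Code = "p" then Type = "practice";
    /* Write the parsed pieces to the log. */
    putlog Course= Lesson= Type= Sequence=;
run;
```

For p204d01.sas, the log line should identify course 2, lesson 4, a demo, and sequence number 1.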

Creating the Course Data

[Slide graphic: the course files folder tree (activities, data, demos, practices), with the cre8data.sas program shown beside the data folder]


1.01 Activity (Required)


1. Navigate to the location of the course files.
   SAS Studio: In the Navigation pane, expand Files and Folders.
   SAS Enterprise Guide: In the Servers list, expand Servers > Local > Files.
2. Double-click the cre8data.sas file to open the program.
3. Find the %LET statement. As directed by your instructor, provide the path to your course files.
4. Run the program and verify that a report is created listing the generated tables.

1.02 Activity (Required)


1. Open the libname.sas program in the course folder. The path macro variable should be the folder where your course files are located.
2. Run the code and verify that the library was successfully assigned in the log.
3. Navigate to your list of libraries and expand the pg2 library. Open and view the storm_summary SAS table.
   Note: In Enterprise Guide, click Libraries > Refresh to update the library list.

Be sure to run the libname.sas program each time that the SAS session is restarted.
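
The two setup activities rely on a path macro variable and a LIBNAME statement. The sketch below shows the general pattern only. The actual contents of cre8data.sas and libname.sas are not reproduced in these notes, so the folder path and the data subfolder are assumptions that you must replace with the location given by your instructor.

```sas
/* Assumption: hypothetical path. Replace it with the location of your course files. */
%let path=/home/student/pg2;

/* Assumption: the course tables are stored in a data subfolder of that path. */
libname pg2 "&path/data";

/* List the tables in the library to confirm that the assignment worked. */
proc contents data=pg2._all_ nods;
run;
```

If the path does not exist, the log reports that the library could not be assigned, which is the first thing to check when the pg2 tables are not found.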


Extending Your Learning

Use your Extended Learning page to download course files and access additional videos, papers, and other helpful resources!


1.2 Understanding DATA Step Processing

DATA Step Review


    data storm_complete;                                  /* read and write tables */
        set pg2.storm_summary_small;
        length Ocean $ 8;
        drop EndDate;                                     /* filter rows and columns */
        where Name is not missing;
        Basin=upcase(Basin);                              /* compute columns */
        StormLength=EndDate-StartDate;
        if substr(Basin,2,1)="I" then Ocean="Indian";     /* conditionally process */
        else if substr(Basin,2,1)="A" then Ocean="Atlantic";
        else Ocean="Pacific";
    run;

Program: p201a03

The DATA step is the primary tool that you use in the SAS programming language for manipulating
data. This DATA step reads and writes tables, filters rows, computes new columns, uses conditional
processing to assign values to a new column, and subsets columns. We know what this code does,
but now we will learn how these statements work behind the scenes to process data.


1.03 Activity
Open p201a03.sas from the activities folder and perform the following tasks:
1. Run the program and examine the log, PROC CONTENTS report, and
output table.
2. Move the DROP statement to the end of the DATA step, just before the
RUN statement. Run the program and examine the log, PROC CONTENTS
report, and output table. Did the results change?
3. Move the LENGTH statement between the DROP and RUN statements.
Run the program and examine the log, PROC CONTENTS report, and
output table. Did the results change?


DATA Step Processing

What happens behind the scenes when a DATA step runs?

• Compilation: establish data attributes and rules for execution
• Execution: read, manipulate, and write data

To truly understand the DATA step and take advantage of its many powerful and unique features,
you must understand exactly how the DATA step processes data behind the scenes.
The DATA step follows a very logical process that is easy to customize to your data processing
needs. When you run a DATA step, it goes through two phases: compilation and execution. In the
compilation phase, SAS prepares the code and establishes data attributes and the rules for
execution. In the execution phase, SAS follows those rules to read, manipulate, and write data.


DATA Step Processing: Compilation

Compilation
1) Check for syntax errors.
2) Create the program data vector (PDV), which includes all columns and attributes.
3) Establish the specifications for processing data in the PDV during execution.
4) Create the descriptor portion of the output table.

PDV
Season   Name   StartDate   Ocean
N 8      $ 25   N 8         $ 8

The PDV is the magic behind the DATA step's processing power!

In the compilation phase, SAS runs through the program to check for syntax errors. If there are no
errors, SAS builds a critical area of memory called the program data vector, or PDV for short. The
PDV includes each column referenced in the DATA step and its attributes, including the column
name, type, and length. The PDV is used in the execution phase to hold and manipulate one row of
data at a time. Also in the compilation phase, SAS establishes rules for the PDV based on the code,
such as which columns will be dropped, or which rows from the input table will be read into the PDV.
Finally, SAS creates the descriptor portion, or the table metadata.
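
One way to see that columns are defined during compilation rather than execution is to reference a column in a statement that never runs. The following is a minimal sketch, assuming the Sashelp.Class sample table is available. The table name compile_demo and the column name NeverSet are made up for illustration.

```sas
data work.compile_demo;
    set sashelp.class(obs=3 keep=Name Age);
    /* The condition is never true, so this assignment never executes. */
    /* The compiler still adds NeverSet to the PDV as a character      */
    /* column with a length of 3.                                      */
    if 0 then NeverSet = "abc";
run;

/* Inspect the descriptor portion of the output table. */
proc contents data=work.compile_demo varnum;
run;
```

The PROC CONTENTS report lists NeverSet alongside Name and Age, and its value is missing in every row, because only the compilation phase ever acted on it.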


DATA Step Processing: Compilation


    data storm_complete;            /* define the library and a name for the output table */
        set pg2.storm_summary_small;
        length Ocean $ 8;
        drop EndDate;
        where Name is not missing;
        Basin=upcase(Basin);
        StormLength=EndDate-StartDate;
        if substr(Basin,2,1)="I" then Ocean="Indian";
        else if substr(Basin,2,1)="A" then Ocean="Atlantic";
        else Ocean="Pacific";
    run;

Program: p201d01

The DATA statement creates the output table Storm_Complete in the Work library.

DATA Step Processing: Compilation


    set pg2.storm_summary_small;

Columns are added to the PDV in the order in which they appear in the input table. Attributes are inherited from the input table.

PDV
Name   Basin   MaxWind   StartDate   EndDate
$ 15   $ 2     N 8       N 8         N 8

To build the PDV, SAS passes through the DATA step sequentially, adding columns and their
attributes. The SET statement in this program is listed first, so all of the columns from the
storm_summary_small table are added to the PDV along with the required column attributes
name, type, and length. Optional attributes such as formats or labels might also be included for
columns that have them.


DATA Step Processing: Compilation


    length Ocean $ 8;
    StormLength=EndDate-StartDate;

The remaining columns are added to the PDV in the order in which they appear in the DATA step. Each column must have at least a name, type, and length.

PDV
Name   Basin   MaxWind   StartDate   EndDate   Ocean   StormLength
$ 15   $ 2     N 8       N 8         N 8       $ 8     N 8

Any other statements that define new columns will add to the PDV. The LENGTH statement is next
after SET, and it explicitly defines the character column Ocean with a length of 8. StormLength is
the last new column, and, based on the arithmetic expression, it is defined as a numeric column with
a default length of 8.
Ocean appears in assignment statements later in the step. However, after a column and its
attributes are established in the PDV, they cannot be changed. That is why the LENGTH statement
must occur before the IF-THEN statements. Otherwise, the assignment statement Ocean="Indian"
would be the first statement that SAS would use to define Ocean with a length of 6. Remember, SAS
is not processing data at this point, so the IF expression is not evaluated. SAS is simply looking for the
first definition of any new column that it must add to the PDV.
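
The effect of statement order on column length can be reproduced with a small sketch. It uses the Sashelp.Class sample table instead of the course data, and the column name Gender and both table names are hypothetical.

```sas
/* Without LENGTH: the first assignment that the compiler finds sets the length to 4. */
data work.no_length;
    set sashelp.class(keep=Name Sex);
    if Sex = "M" then Gender = "Male";
    else Gender = "Female";          /* stored as "Fema" because the length is 4 */
run;

/* With LENGTH before the assignments: the column is long enough for both values. */
data work.with_length;
    set sashelp.class(keep=Name Sex);
    length Gender $ 6;
    if Sex = "M" then Gender = "Male";
    else Gender = "Female";
run;
```

Comparing the two output tables shows truncated values in the first and complete values in the second, even though the assignment statements are identical.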


DATA Step Processing: Compilation


    drop EndDate;

DROP or KEEP statements flag columns that will be excluded from the output table.

PDV
Name   Basin   MaxWind   StartDate   EndDate   Ocean   StormLength
$ 15   $ 2     N 8       N 8         N 8       $ 8     N 8
                                     D

There are certain statements that are specific to the compilation phase and establish the behavior of
the PDV. The DROP statement does not remove a column from the PDV. Instead, SAS marks the
column with a drop flag so that it is dropped later in execution. In this program, EndDate will
eventually be dropped from the output data, but it is still available to use in the PDV for
calculating the column StormLength.
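
The same behavior can be tried on a sample table. This is a sketch that assumes Sashelp.Cars is available. The table name car_margin and the column name Margin are hypothetical.

```sas
data work.car_margin;
    set sashelp.cars(keep=Make Model MSRP Invoice);
    drop Invoice;                 /* flagged during compilation, excluded only on output */
    Margin = MSRP - Invoice;      /* Invoice is still available in the PDV               */
run;

/* Invoice is absent from the output table, but Margin was calculated from it. */
proc print data=work.car_margin(obs=5);
run;
```

Because the drop flag takes effect only when a row is written, the position of the DROP statement within the step does not change the result.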

DATA Step Processing: Compilation


    where Name is not missing;

The WHERE statement establishes conditions for which rows will be read from the input table into the PDV.

PDV
Name   Basin   MaxWind   StartDate   EndDate   Ocean   StormLength
$ 15   $ 2     N 8       N 8         N 8       $ 8     N 8
                                     D

The WHERE statement defines which rows are read from the input table into the PDV during
execution.
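
Because WHERE filters rows before they reach the PDV, its expression can reference only columns that exist in the input table. The sketch below illustrates the consequence. It goes slightly beyond the slide by pairing WHERE with a subsetting IF statement, and the table name long_names and the column name NameLength are hypothetical.

```sas
data work.long_names;
    set sashelp.class;
    /* Age exists in the input table, so WHERE can filter on it. */
    where Age >= 13;
    NameLength = length(Name);
    /* NameLength exists only in the PDV, so a subsetting IF is used instead. */
    /* Referencing NameLength in the WHERE statement would cause an error,    */
    /* because the column is not in Sashelp.Class.                            */
    if NameLength > 5;
run;
```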


DATA Step Processing: Compilation


The descriptor portion is created for the output table.

work.storm_complete
Name   Basin   MaxWind   StartDate   Ocean   StormLength
$ 15   $ 2     N 8       N 8         $ 8     N 8

Finally, the descriptor portion of the output table is complete. Notice that the EndDate column is not
included in the descriptor portion of the output table.


DATA Step Processing: Execution

Execution
1) Initialize the PDV.
2) Read a row from the input table into the PDV.
3) Sequentially process statements and update values in the PDV.
4) At the end of the step, write the contents of the PDV to the output table.
5) Return to the top of the DATA step.

    data output-table;
        set input-table;
        ...other statements...
    run;                      /* Implicit OUTPUT; Implicit RETURN; */

Automatic looping makes processing data easy!

When the compilation phase is complete, the program is ready for action in the execution phase. In
this phase, SAS reads data, processes it in the PDV, and outputs it to a new table.
DATA step execution acts like an automatic loop. The first time through the DATA step, the SET
statement reads the first row from the input table, and then processes any other statements in
sequence, manipulating the values in the PDV. When SAS reaches the end of the DATA step, there
is an implied OUTPUT action so that the contents of the PDV, minus any columns flagged for
dropping, are written as the first row in the output table. The DATA step then automatically loops
back to the top and executes the statements in order again, this time reading, manipulating and
outputting the next row. That implicit loop continues until all of the rows are read from the input table.
Compile-time statements such as DROP, LENGTH, and WHERE are not executed for each row.
However, because of the rules that they established in the compilation phase, their impact will be
observed in the output table.
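
The implicit loop can be made visible by writing the iteration counter to the log. This is a minimal sketch, assuming sashelp.class is available; loop_demo and HeightCm are hypothetical names.

```sas
/* Sketch: one log line per iteration of the implicit loop. */
data work.loop_demo;
    set sashelp.class;                 /* reads one row per iteration */
    HeightCm=Height*2.54;              /* executed once for each row  */
    putlog "Iteration: " _N_= Name= HeightCm=;
run;                                   /* implicit OUTPUT, then implicit RETURN */
```

Each line in the log corresponds to one pass through the step, and the value of _N_ increases by 1 on each pass.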


DATA Step Processing in Action

You can watch execution happen one statement at a time in the Enterprise Guide
DATA step debugger.

A great way to learn how data is processed in the execution phase is to watch it happen statement
by statement, and row by row. This is possible using an interactive debugging tool unique to SAS
Enterprise Guide. We will use the DATA step debugger to peek behind the scenes and watch the
impact of each statement on the values in the PDV as the step executes.
For more information about using the Enterprise Guide DATA step debugger, see
https://support.sas.com/resources/papers/proceedings17/SAS0447-2017.pdf.
The SAS windowing environment also provides an interactive DATA step debugger. It can be
accessed by adding the DEBUG option in the DATA statement:

DATA table / DEBUG;

Visit the Using the DATA Step Debugger page in SAS Help for more details.


DATA Step Processing

Scenario
Use the DATA step debugger in SAS Enterprise Guide to observe the process of execution.

Files
• p201d01.sas
• storm_summary_small – a SAS table that has one row per storm for the 1980 through 2016
storm seasons

Notes
• The DATA step is processed in two phases: compilation and execution.
• During compilation, SAS creates the program data vector (PDV) and establishes data attributes
and rules for execution.
• The PDV is an area of memory established in the compilation phase. It includes all columns that
will be read or created, along with their assigned attributes. The PDV is used in the execution
phase to hold and manipulate one row of data at a time.
• During execution, SAS reads, manipulates, and writes data. All data manipulation is performed in
the PDV.

Demo
Note: This demo must be performed in Enterprise Guide.
1. Open the p201d01.sas program in the demos folder.

2. The DATA step markers for debugging toolbar button enables debugging in the
program. If this option is enabled, you see the same icon and a green bar next to each DATA
step in your program.
3. Click the Debugger icon next to the DATA statement. The DATA Step Debugger window
appears.
4. At this point, the compilation phase is complete and the PDV is displayed on the right side of the
window. Notice that all columns read from the storm_summary_small table start with a missing
value.
5. Two additional columns are included in the PDV during execution. _ERROR_ is 0 by default but
is set to 1 whenever a data error is encountered, such as a value that cannot be read or
calculated. _N_ is initially set to 1. Each time the DATA step loops past the DATA statement, the
variable _N_ increments by 1. The value of _N_ represents the number of times that the DATA
step has iterated.

6. Click Step execution to next line to execute the highlighted SET statement and step to the
next executable statement. Recall that during the compilation phase, the WHERE statement
established a rule to read rows into the PDV only where Name is not missing. The first two rows
of the input table have missing values for Name, so the third row is read. However, because this
is the first iteration of the DATA step, _N_ is still equal to 1. Values for the Name, Basin,
MaxWind, StartDate, and EndDate columns are assigned in the PDV.


Note: Red text in the Value column represents data values that were updated with the
execution of the previously highlighted statement.
7. The assignment statement for Basin is the next executable statement. LENGTH, DROP, and
WHERE are compile-time statements. Click Step execution to next line twice to execute
the Basin and StormLength assignment statements. Notice that Basin was already in
uppercase and did not change, but a value of 6 was assigned to StormLength.

8. Click Step execution to next line to execute the IF, ELSE IF, and ELSE statements. After
line 10, Pacific is assigned to Ocean.

9. With the RUN statement highlighted, click Step execution to next line. As the concluding
step boundary for the DATA step, the RUN statement triggers an implicit output. The values in
the PDV are written as the first row in storm_complete. After the implicit output, the process
returns to the top of the DATA step.
Note: While debugging a program, the output table is not created. When the program runs
outside of the debugger, the implicit output writes rows to the output table.
10. Notice that _N_ is now 2, representing the second iteration of the DATA step. Columns read from
the SET table retain their values. However, the new computed columns, Ocean and
StormLength, are reset to missing. This action is called reinitializing the PDV.

11. Click Step execution to next line to step through the program until line 8. Notice that the
value of Basin is SI, so the IF condition is true. Execute the IF statement, and SAS assigns
Indian to the Ocean column. The remaining ELSE statements are skipped and RUN is
highlighted.
12. Execute the RUN statement. _N_ is increased to 3, and the PDV is reinitialized.

13. Click Start/continue debugger execution to proceed through the rest of execution. Close
the DATA step debugger.
Note: The DATA step debugger is available by default in other programs. To suppress the
debugger icon in the editor, click the DATA step markers for debugging toolbar button.
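
Without the debugger, the reinitialization of computed columns described in the demo can be observed in the log. The sketch below assumes sashelp.class is available; reinit_demo and HeightCm are hypothetical names. It assigns the computed column on the first iteration only, so any later value would have to be carried over from a previous iteration.

```sas
/* Sketch: computed columns are reset to missing at the top of each iteration. */
data work.reinit_demo;
    set sashelp.class(obs=3);
    /* Assign the computed column during the first iteration only. */
    if _N_=1 then HeightCm=Height*2.54;
    putlog _N_= Name= Height= HeightCm=;
run;
```

HeightCm should have a value when _N_=1 and be missing (.) on the later iterations, because it is not read from the SET table and is therefore reset each time the step returns to the top.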


Viewing Execution in the Log


If you don’t have the interactive debugger, use the PUTLOG statement to write
information about execution to the log.

PUTLOG _ALL_;        writes all columns and values in the PDV to the log
PUTLOG column=;      writes selected columns and values in the PDV to the log
PUTLOG "message";    writes a text string to the log

If you do not have access to the interactive DATA step debugger in Enterprise Guide, you can add
PUTLOG statements to your code so that you can examine the contents of the PDV at any time
during execution.
The _ALL_ keyword writes all columns in the PDV and their values to the log, and column= writes
one or more specific columns and their values to the log. A quoted "message" writes a text string
that you specify to the log.
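
The three forms can be combined in one step. This is a minimal sketch, assuming sashelp.class is available; putlog_demo and HeightCm are hypothetical names, and the _N_ test is only there to keep the log short.

```sas
/* Sketch: the three PUTLOG forms in one DATA step. */
data work.putlog_demo;
    set sashelp.class;
    HeightCm=Height*2.54;
    /* Write to the log during the first two iterations only. */
    if _N_<=2 then do;
        putlog "PDV after assignment";   /* text string */
        putlog Name= HeightCm=;          /* selected columns */
        putlog _all_;                    /* all columns, including _ERROR_ and _N_ */
    end;
run;
```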

Viewing Execution in the Log

The OBS= data set option limits the observations that are read.

data storm_complete;
set pg2.storm_summary_small(obs=2);
putlog "PDV after SET Statement";
putlog _all_;
...

Log:
PDV after SET Statement
Name=AGATHA Basin=EP MaxWind=115 StartDate=09JUN1980
EndDate=15JUN1980 Ocean= StormLength=. _ERROR_=0 _N_=1
PDV after SET Statement
Name=ALBINE Basin=SI MaxWind=. StartDate=27NOV1979
EndDate=06DEC1979 Ocean= StormLength=. _ERROR_=0 _N_=2

1.04 Activity
Open p201a04.sas from the activities folder and perform the following tasks:
1. Examine the PUTLOG statements that are in the DATA step.
2. Add two PUTLOG statements before the RUN statement to print "PDV
before RUN statement" and write all columns in the PDV to the log. Run
the program.
3. View the log. What is the value of StormLength at the end of the second
iteration of the DATA step?
4. Type NOTE: (use uppercase and include the colon) inside the quotation
marks of the following PUTLOG statement. Run the program. What
changes in the log?
putlog "NOTE: PDV before RUN statement";

Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
1. Using the DATA Step Debugger to Examine Execution Steps
Examine the National Park data that is used in most practices. Use the DATA step debugger to
follow the steps of execution in a DATA step that reads the np_final table.
Note: This practice must be performed in SAS Enterprise Guide to use the interactive DATA
step debugger. If you did not do the first activities in Enterprise Guide, first open and run
the libname.sas program.
a. In Enterprise Guide, use the Servers list to expand Servers > Local > Libraries > PG2.
Double-click np_final to open the table. The table includes one row per US national park.
Note that the first row in the table is Cape Krusenstern National Monument.
b. Become familiar with the following columns in the np_final table:
• Region (Alaska, Intermountain, Midwest, National Capital, Northeast, Pacific West, and
Southeast)
• Type (Monument, Park, Preserve, River, Seashore)
• ParkName (full name of national park)
• DayVisits (number of daily visitors in 2017)
• Campers (number of campers in 2017)
• OtherLodging (number of people in other lodging, including cabins and hotels, in 2017)
• Acres (total park size in acres)
c. Open p201p01.sas in the practices folder of the course files. Click DATA step markers for
debugging to enable debugging in the program. Click the Debugger icon next to
the DATA statement. The DATA Step Debugger window appears.
d. How many variables are in the PDV? What are the initial values?

e. Click Step execution to next line to execute the highlighted SET statement. Recall that
the first row of the np_final table is Cape Krusenstern National Monument. Why was the first
row not read into the PDV in the first iteration of the DATA step?

f. Click Step execution to next line to step through the remaining statements in the DATA
step. Which statements are executable? Which statements are compile-time only?
g. Exit the debugger and run the program to view the output table.
Note: The DATA step debugger is available by default in other programs. To suppress the
debugger icon in the editor, click DATA step markers for debugging.


Level 2
2. Using PUTLOG Statements to Examine Execution Steps
a. Open p201p02.sas in the practices folder of the course files. Examine the program and
answer the following questions:
1) Which statements are compile-time only?
2) What will be assigned for the length of Size?
b. Run the program and examine the results.
c. Modify the program to resolve the truncation of Size. Read the first five rows from the input
table.
d. Add PUTLOG statements to provide the following information in the log:
1) Immediately after the SET statement, write START DATA STEP ITERATION to the log
as a color-coded note.
2) After the Type= assignment statement, write the value of Type to the log.
3) At the end of the DATA step, write the contents of the PDV to the log.
e. Run the program and read the log to examine the messages written during execution.


1.3 Directing DATA Step Output

Controlling DATA Step Processing

data output-table;
set input-table;
...other statements...
run;
Implicit OUTPUT;
Implicit RETURN;

You can alter the default DATA step processing rules to control how the steps of
execution proceed.

Now that we have a better idea of what happens with the PDV during the compilation and execution
phases of DATA step processing, we can use the knowledge to our advantage. There are times
when the default processing rules of execution are perfectly fine for your needs. But there are other
times when you need to alter those rules to process your data in a different way. The DATA step
provides syntax that enables you to control exactly how the steps of execution proceed.


Controlling Output
Use the DATA step to create a table that includes a sales forecast for each of the
next three years.

Input table: sashelp.shoes
Output table: forecast

Let’s start by focusing on the implicit output that occurs at the RUN statement. By default, SAS reads
one row from the input table, manipulates the values, and writes that updated row to the output
table. But what if you want to control exactly when and where each row is written?
To illustrate this, let’s look at the sashelp.shoes table. Each row includes the annual sales for
Region, Product, and Subsidiary. Suppose we want to create a table that includes a sales forecast
for each of the next three years, assuming that sales increase by 5% annually.

Controlling Output
Input table: sashelp.shoes
Output table: forecast

data forecast;
set sashelp.shoes;
keep Region Product Subsidiary Year ProjectedSales;
format ProjectedSales dollar10.;
Year=1;
ProjectedSales=Sales*1.05;
Year=2;
ProjectedSales=ProjectedSales*1.05;
Year=3;
ProjectedSales=ProjectedSales*1.05;
run;

Will this program write three rows for every one row that it reads?

1.05 Activity
Enterprise Guide: Open p201a05a.sas from the activities folder.
1. Use the DATA step debugger to step through one iteration of the DATA
step. Observe the values of Year and ProjectedSales as they are updated.
2. Close the debugger and run the program. Examine the log and output
data. How many rows are in the input and output tables?

SAS Studio: Open p201a05b.sas from the activities folder.


1. Run the program. Observe the values of Year and ProjectedSales written
in the log.
2. How many rows are in the input and output tables?
Keep the program open for the next activity.

Note: If you are using the windowing environment, follow the steps for SAS Studio.


Implicit Output

data forecast;
set sashelp.shoes;
keep Region Product Subsidiary Year ProjectedSales;
format ProjectedSales dollar10.;
Year=1;
ProjectedSales=Sales*1.05;
Year=2;
ProjectedSales=ProjectedSales*1.05;
Year=3;
ProjectedSales=ProjectedSales*1.05;
run;
Implicit OUTPUT;
Implicit RETURN;


By default, SAS sequentially executes all appropriate statements in the DATA step, and when it
reaches the end of the DATA step, it implicitly outputs the data in the PDV as a row in the output
table. Then SAS automatically loops back to the top of the DATA step and goes through the same
process for the next row.
In this program, although the value of ProjectedSales is calculated for years 1, 2, and 3, the implicit
output occurs only once at the bottom of the loop. When SAS reaches the end of the DATA step, the
values in the PDV are for year 3, and those values are written to the output table.


Explicit Output

OUTPUT;

Without an explicit OUTPUT statement:
data forecast;
set sashelp.shoes;
...
run;
Implicit OUTPUT;
Implicit RETURN;

With an explicit OUTPUT statement:
data forecast;
set sashelp.shoes;
...
output;
run;
Implicit RETURN;

You can use an explicit OUTPUT statement in the DATA step to force SAS to write the contents of
the PDV to the output table at specific points in the program.
If you use an explicit OUTPUT statement anywhere in a DATA step, there is no implicit output at the
end of the DATA step. The implicit return still returns processing to the top of the DATA step.
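
As an illustration of explicit output, the sketch below writes two rows for each row that it reads. It uses sashelp.class rather than the forecast program so that the activity that follows is left open; class_long, Measure, and Value are hypothetical names.

```sas
/* Sketch: two explicit OUTPUT statements, so two output rows per input row. */
data work.class_long;
    set sashelp.class;
    length Measure $ 6;
    keep Name Measure Value;
    Measure="Height";
    Value=Height;
    output;                /* writes the PDV as it stands now */
    Measure="Weight";
    Value=Weight;
    output;                /* writes the PDV again with the new values */
run;                       /* no implicit OUTPUT here; the implicit RETURN still occurs */
```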

1.06 Activity
Modify the p201a05 program that you have open from the previous activity.
1. Add an explicit OUTPUT statement after each ProjectedSales assignment
statement. Run the program. How many rows are in the output table?
2. Comment out the final OUTPUT statement and run the program again. Are
rows where Year=3 written to the new table?


Sending Output to Multiple Tables

DATA table1 <table2...>;

OUTPUT table1 <table2...>;

data sales_high sales_low;
set sashelp.shoes;
if Sales>100000 then output sales_high;
else output sales_low;
run;

The OUTPUT statement controls when to output, but one of the great features is that it also controls
where to output. The DATA step can create multiple tables simultaneously simply by listing more
than one table in the DATA statement. You can use the OUTPUT statement followed by the name of
the table to indicate where SAS should write the contents of the PDV.
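
Because the OUTPUT statement accepts more than one table name, a single row can be routed to several tables at once. The following is a sketch under that syntax; the table name sales_all is hypothetical and is added only to show a row being written to two tables.

```sas
/* Sketch: each row goes to sales_all and to exactly one of the other two tables. */
data work.sales_all work.sales_high work.sales_low;
    set sashelp.shoes;
    if Sales>100000 then output work.sales_all work.sales_high;
    else output work.sales_all work.sales_low;
run;
```

In the log, the row count for sales_all should equal the sum of the counts for sales_high and sales_low.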


Controlling Row Output

Scenario
Create multiple output tables in a single DATA step and use IF-THEN/ELSE logic to designate which
rows are written to each table.

Files
• p201d02.sas
• storm_summary – a SAS table that has one row per storm for the 1980 through 2016 storm
seasons

Notes
• By default, the end of a DATA step causes an implicit output, which writes the contents of the PDV
to the output table.
• The explicit OUTPUT statement can be used in the DATA step to control when and where each
row is written.
• If an explicit OUTPUT statement is used in the DATA step, it disables the implicit output at the end
of the DATA step.
• One DATA step can create multiple tables by listing each table name in the DATA statement.
• The OUTPUT statement followed by a table name writes the contents of the PDV to the specified
table.

Demo
1. Open the p201d02.sas program in the demos folder and find the Demo section. Modify the
DATA statement to create three tables named indian, atlantic, and pacific.
data indian atlantic pacific;


2. Modify the IF-THEN/ELSE conditional statements to write output to the appropriate table.
data indian atlantic pacific;
set pg2.storm_summary;
length Ocean $ 8;
Basin=upcase(Basin);
StormLength=EndDate-StartDate;
MaxWindKM=MaxWindMPH*1.60934;
if substr(Basin,2,1)="I" then do;
Ocean="Indian";
output indian;
end;
else if substr(Basin,2,1)="A" then do;
Ocean="Atlantic";
output atlantic;
end;
else do;
Ocean="Pacific";
output pacific;
end;
run;
3. Add a DROP statement to remove MaxWindMPH. Highlight the DATA step, run the selected
code, and examine the output tables. Notice that MaxWindMPH has been dropped from all three
tables.
drop MaxWindMPH;


Controlling Column Output


A DROP or KEEP statement applies to all output tables listed in the DATA statement.

data sales_high sales_low;
set sashelp.shoes;
...
drop Inventory Returns;
run;

PDV (D = flagged to be dropped)
Region   Product   Subsidiary   Stores   Sales   Inventory   Returns
$ 25     $ 14      $ 12         N 8      N 8     N 8         N 8
                                                 D           D

When you use a DROP or KEEP statement in the DATA step, the column is flagged for dropping or
keeping in the PDV, so the action applies to all of the tables listed in the DATA statement.
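
A runnable version of this pattern might look like the sketch below. The IF-THEN/ELSE routing is borrowed from the earlier sales_high and sales_low example and is an assumption about what the elided statements do.

```sas
/* Sketch: one DROP statement affects every table in the DATA statement. */
data work.sales_high work.sales_low;
    set sashelp.shoes;
    drop Inventory Returns;          /* compile-time: flags both columns in the PDV */
    if Sales>100000 then output work.sales_high;
    else output work.sales_low;
run;

/* Neither report should list Inventory or Returns. */
proc contents data=work.sales_high varnum;
run;
proc contents data=work.sales_low varnum;
run;
```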

Controlling Column Output


table(DROP=col1 col2...)
table(KEEP=col1 col2...)

A DROP= or KEEP= data set option applies to the output table that it follows.

data sales_high(drop=Returns)
sales_low(drop=Inventory);
set sashelp.shoes;
...
run;

PDV (D = flagged to be dropped from the table shown)
Region   Product   Subsidiary   Stores   Sales   Inventory       Returns
$ 25     $ 14      $ 12         N 8      N 8     N 8             N 8
                                                 D (sales_low)   D (sales_high)

You can use the DROP= or KEEP= data set options to specify a unique list of columns for each table
listed in the DATA statement. The PDV keeps track of columns to drop from the specific output table.
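
DROP= and KEEP= can be mixed across tables in the same DATA statement. The sketch below is one possible arrangement; the column lists are chosen for illustration and the routing logic is assumed from the earlier example.

```sas
/* Sketch: a different column list for each output table. */
data work.sales_high(keep=Region Product Sales)
     work.sales_low(drop=Inventory Returns);
    set sashelp.shoes;
    /* All columns are in the PDV here, whatever the output options say. */
    if Sales>100000 then output work.sales_high;
    else output work.sales_low;
run;
```

The log notes should report three variables for sales_high and two fewer than the input table for sales_low.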


Controlling Column Output

Scenario
Control which columns are read in and out of the PDV with DROP= or KEEP= data set options.

Files
• p201d03.sas
• storm_summary – a SAS table that has one row per storm for the 1980 through 2016 storm
seasons

Notes
• DROP= or KEEP= data set options can be added on any table in the DATA statement.
• Columns that will be dropped are flagged in the PDV and are not dropped until the row is output to
the designated table. Therefore, dropped columns are still available for processing in the DATA
step.
• DROP= or KEEP= data set options can be added in the SET statement to control the columns that
are read into the PDV. If a column is not read into the PDV, it is not available for processing in the
DATA step.

Demo
Note: This demo must be performed in Enterprise Guide.
1. Open the p201d03.sas program in the demos folder and find the Demo section. Use the
DROP= data set option to drop MaxWindMPH from the indian table and MaxWindKM from the
atlantic table. Do not drop any columns from the pacific table.
data indian(drop=MaxWindMPH) atlantic(drop=MaxWindKM) pacific;
2. Start the DATA step debugger. Note that MaxWindMPH and MaxWindKM are included in the
PDV.
3. Close the debugger, run the program, and examine the three output tables. MaxWindMPH has
been dropped from the indian table, MaxWindKM has been dropped from the atlantic table,
and the pacific table has all columns.
4. Add a DROP= data set option in the SET statement to drop MinPressure. Start the debugger.
Notice that MinPressure is not included in the PDV.
set pg2.storm_summary(drop=MinPressure);
5. Close the debugger, run the program, and examine the three output tables. Confirm that
MinPressure has been dropped from each table.


Controlling Column Input

SET table(DROP=col1 col2...)
SET table(KEEP=col1 col2...)

input table -> PDV

If you use DROP= or KEEP= in the SET statement, the excluded columns are not
added to the PDV.

It is important to think about whether you need columns for processing when you are deciding where
to drop or keep them. When you use a DROP= or KEEP= data set option on a table in the SET
statement, the excluded columns are not read into the PDV and are not available for processing. The
option does not delete columns from the input table.
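
A minimal sketch of limiting input columns, assuming sashelp.class is available; class_heights and HeightCm are hypothetical names.

```sas
/* Sketch: only the listed columns are read into the PDV. */
data work.class_heights;
    set sashelp.class(keep=Name Height);
    HeightCm=Height*2.54;
    /* Weight was not read. A reference to Weight here would create a new  */
    /* column with missing values instead of using the input table values. */
    putlog _all_;
run;
```

The log should show only Name, Height, HeightCm, _ERROR_, and _N_, and sashelp.class itself is unchanged.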

Controlling Column Output

DATA table(DROP=col1 col2...)      DROP col1 col2...;
DATA table(KEEP=col1 col2...)      KEEP col1 col2...;

input table -> PDV -> output table

If you use DROP= or KEEP= in the DATA statement, the excluded columns are not
added to output.

When you use a DROP or KEEP statement or a DROP= or KEEP= data set option in the DATA
statement, the columns are included in the PDV and can be used for processing. They are flagged to
be dropped when an implicit or explicit output is reached.
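
The sketch below shows columns that are dropped on output but still used in a calculation. It assumes that Height in sashelp.class is in inches and Weight is in pounds; class_bmi and Bmi are hypothetical names.

```sas
/* Sketch: Height and Weight are flagged for dropping but remain in the PDV. */
data work.class_bmi(drop=Height Weight);
    set sashelp.class;
    Bmi=703*Weight/Height**2;     /* both dropped columns are available here */
    format Bmi 5.1;
run;
```

If the same DROP= option were moved to the SET statement, Height and Weight would never reach the PDV and Bmi would be missing in every row.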


1.07 Question

data indian(drop=MaxWindMPH)
atlantic(drop=MaxWindKM)
pacific;
set pg2.storm_summary;
StormLength=EndDate-StartDate;
...
run;

What would be the easiest way to drop EndDate from all three tables?

Beyond SAS Programming 2


What if you want to ...

. . . access SAS DATA step programming documentation?
• Visit the SAS Help Center and the DATA Step Programming section.

. . . learn about writing multi-threaded DATA step code in SAS Viya?
• Watch free videos about programming in SAS Viya.
• Take the Programming for SAS Viya course.

. . . use alternate syntax to IF-THEN/ELSE that is similar to SQL?
• Learn about the SELECT-WHEN-OTHERWISE statement in SAS Help.
• Complete the Challenge practice.

Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
3. Conditionally Creating Multiple Output Tables
The pg2.np_yearlytraffic table contains annual traffic counts at locations in national parks.
Parks are classified as one of five types: National Monument, National Park, National Preserve,
National River, and National Seashore.
a. Open the p201p03.sas program from the practices folder. Modify the DATA step to create
three tables: monument, park, and other. Use the value of ParkType as indicated above to
determine which table the row is output to.
b. Drop ParkType from the monument and park tables. Drop Region from all three tables.
c. Submit the program and verify the output.
The notes in the SAS log indicate how many rows are in each table.
NOTE: There were 478 observations read from the data set PG2.NP_YEARLYTRAFFIC.
NOTE: The data set WORK.MONUMENT has 84 observations and 3 variables.
NOTE: The data set WORK.PARK has 246 observations and 3 variables.
NOTE: The data set WORK.OTHER has 148 observations and 4 variables.

Level 2
4. Conditionally Creating Columns and Output Tables
The pg2.np_2017 table contains monthly public use figures for national parks, including these
columns:
a. Create a new program. Write a DATA step that creates temporary SAS tables named
camping and lodging and reads the pg2.np_2017 table.
b. Compute a new column, CampTotal, that is the sum of CampingOther, CampingTent,
CampingRV, and CampingBackcountry. Format CampTotal so that values are displayed
with commas.
c. The camping table has the following specifications:
1) includes rows if CampTotal is greater than zero
2) contains the ParkName, Month, DayVisits, and CampTotal columns
d. The lodging table has the following specifications:
1) includes rows where LodgingOther is greater than zero
2) contains only the ParkName, Month, DayVisits, and LodgingOther columns
e. Submit the program and verify the output. The notes in the SAS log indicate how many rows
are in each table.
NOTE: The data set WORK.CAMPING has 1374 observations and 4 variables.
NOTE: The data set WORK.LODGING has 383 observations and 4 variables.


Challenge
5. Processing Statements Conditionally with SELECT-WHEN Groups
SELECT and WHEN statements can be used in a DATA step as an alternative to IF-THEN
statements to process code conditionally.
a. Open the p201p05.sas program in the practices folder. The program contains the solution
programs for Practices 3 and 4.
b. Use SAS Help or online documentation to read about using SELECT and WHEN statements
in the DATA step.
c. Modify the Practice 3 program to use SELECT groups and WHEN statements.


1.4 Solutions
Solutions to Practices
1. Using the DATA Step Debugger to Examine Execution Steps
Note: Detailed steps are included in the practice. The answers to questions asked in the
practice are provided below.
How many variables are in the PDV? What are the initial values?
Ten variables are included in the PDV. Character variables are blank, and numeric
variables are periods. _ERROR_ is 0, and _N_ is 1.
Why was the first row (Cape Krusenstern National Monument) not read into the PDV in the first
iteration of the DATA step?
Cape Krusenstern National Monument is not read into the PDV because it does not meet
the WHERE statement condition. Kenai Fjords National Park is the first row where
Type="PARK".
Which statements are executable?
SET, Type= assignment statement, AvgMonthlyVisitors= assignment statement
Which statements are compile-time only?
WHERE, FORMAT, and KEEP
2. Using PUTLOG Statements to Examine Execution Steps
Which statements are compile-time only?
KEEP, WHERE, and FORMAT
What will be assigned for the length of Size?
Size will have a length of 5. The first time Size occurs in the DATA step, it is assigned a
value of Small, which is five characters.
data np_parks;
set pg2.np_final(obs=5);
putlog "NOTE: START DATA STEP ITERATION";
keep Region ParkName AvgMonthlyVisitors Acres Size;
length Size $ 6;
where Type="PARK";
format AvgMonthlyVisitors Acres comma10.;
Type=propcase(Type);
putlog Type=;
AvgMonthlyVisitors=sum(DayVisits,Campers,OtherLodging)/12;
if Acres<1000 then Size="Small";
else if Acres<100000 then Size="Medium";
else Size="Large";
putlog _all_;
run;


3. Conditionally Creating Multiple Output Tables


data monument(drop=ParkType) park(drop=ParkType) other;
set pg2.np_yearlytraffic;
if ParkType = 'National Monument' then output monument;
else if ParkType = 'National Park' then output park;
else output other;
drop Region;
run;
4. Conditionally Creating Columns and Output Tables
data camping(keep=ParkName Month DayVisits CampTotal)
lodging(keep=ParkName Month DayVisits LodgingOther);
set pg2.np_2017;
CampTotal=sum(of Camping:);
if CampTotal > 0 then output camping;
if LodgingOther > 0 then output lodging;
format CampTotal comma15.;
run;
5. Processing Statements Conditionally with SELECT-WHEN Groups
data monument(drop=ParkType) park(drop=ParkType) other;
set pg2.np_yearlytraffic;
select (ParkType);
when ('National Monument') output monument;
when ('National Park') output park;
otherwise output other;
end;
drop Region;
run;


Solutions to Activities and Questions

1.01 Activity – Correct Answer

Confirm that 56 SAS tables were created.

continued...
1.02 Activity – Correct Answer

Notice that Name is missing for the first two rows. The storm_summary table contains one
row per storm for the 1980 through 2016 storm seasons.

1.02 Activity – Correct Answer

Values for Basin:
NA – North Atlantic
SP – South Pacific
EP – East Pacific
WP – West Pacific
NI – North Indian
SI – South Indian


1.02 Activity – Correct Answer

StartDate and EndDate are numeric SAS date values.

1.03 Activity – Correct Answer
2. Move the DROP statement to the end of the DATA step, just before the
RUN statement. Run the program and examine the log, PROC CONTENTS
report, and output table. Did the results change?

No, the results are exactly the same, and there are no warnings or errors
in the log.


1.03 Activity – Correct Answer


3. Move the LENGTH statement between the DROP and RUN statements.
Run the program and examine the log, PROC CONTENTS report, and
output table. Did the results change?

Yes, the length of Ocean is 6 instead of 8, some values are truncated, and
there is a warning in the log.

WARNING: Length of character variable Ocean has
already been set. Use the LENGTH statement as
the very first statement in the DATA STEP to
declare the length of a character variable.


1.04 Activity – Correct Answer


3. What is the value of StormLength at the end of the second iteration of
the DATA step?
NOTE: PDV before RUN statement
Name=ALBINE Basin=SI MaxWind=. StartDate=27NOV1979
EndDate=06DEC1979 Ocean=Indian StormLength=9
_ERROR_=0 _N_=2

4. What changes in the log?


Adding the NOTE: prefix color-codes the text as a note in the log.


1.05 Activity – Correct Answer


2. How many rows are in the input and output tables?
NOTE: There were 395 observations read from the data set SASHELP.SHOES.
NOTE: The data set WORK.FORECAST has 395 observations and 5 variables.

The final values in the PDV are Year=3 and projected sales for year 3. The implicit output at
the end of the DATA step writes the contents of the PDV to the table.

Keep the program open for the next activity.

1.06 Activity – Correct Answer
1. Add the explicit OUTPUT statement after each ProjectedSales assignment
statement. Run the program. How many rows are in the output table?
data forecast;
    set sashelp.shoes;
    keep Region Product Subsidiary Year ProjectedSales;
    format ProjectedSales dollar10.;
    Year=1;
    ProjectedSales=Sales*1.05;
    output;
    Year=2;
    ProjectedSales=ProjectedSales*1.05;
    output;
    Year=3;
    ProjectedSales=ProjectedSales*1.05;
    output;
run;

3*395 = 1,185 rows

1.06 Activity – Correct Answer


2. Comment out the final OUTPUT statement and run the program again. Are
rows where Year=3 written to the new table?
No

When an explicit OUTPUT statement is in a DATA step, the implicit output is disabled.


1.07 Question – Correct Answer

data indian(drop=MaxWindMPH)
     atlantic(drop=MaxWindKM)
     pacific;
    set pg2.storm_summary;
    drop EndDate;
    StormLength=EndDate-StartDate;
    ...
run;

The DROP statement can still be used to flag columns in the PDV to drop from all tables.

PDV
Season Name Basin MaxWindMPH MaxWindKM MinPressure StartDate EndDate
$8 $15 $2 N8 N8 $8 N8 N8
D D D

Lesson 2 Summarizing Data
2.1 Creating an Accumulating Column
    Demonstration: Creating an Accumulating Column
    Practice

2.2 Processing Data in Groups
    Demonstration: Identifying the First and Last Row in Each Group
    Demonstration: Creating an Accumulating Column within Groups
    Practice

2.3 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

2.1 Creating an Accumulating Column

Lesson Overview

• create an accumulating total
• process data in groups

Understanding DATA step processing is critical as you perform more complex data manipulations. In
this lesson, you learn new syntax that enables you to alter the default behavior of the DATA step loop
to solve a problem. First you learn to create an accumulating column—or, in other words, generate a
running total. Then you learn to process data in groups so that you can add conditions in your
program to perform an action when each group begins or ends.


Creating an Accumulating Column


Create a new column in houston_rain that stores an accumulating total.

data houston_rain;
    set pg2.weather_houston;
    keep Date DailyRain YTDRain;
    YTDRain=YTDRain+DailyRain;
run;

Will this code produce the desired results? How will SAS process this assignment statement?

Suppose we want to look at rainfall data as part of our storm research. We start with a small
geographic area and analyze daily rainfall totals in Houston, TX , during 2017. We want to create a
new column that stores a year-to-date rainfall total. Notice the assignment statement for YTDRain.
We want to take the previous value of YTDRain and add to it the value of DailyRain for each row.


Creating an Accumulating Column

Scenario
Use the DATA step debugger in SAS Enterprise Guide to observe how the default behavior of the
PDV must be modified to create an accumulating column.

Files
• p202d01.sas
• weather_houston – a SAS table that has daily rain and temperature measurements in Houston,
TX, during 2017

Notes
• At the beginning of the first iteration of the DATA step, all column values are set to missing.
• By default, all computed columns are reset to missing at the beginning of each subsequent
iteration of the DATA step. This is called reinitializing the PDV. Columns read from the SET
statement automatically retain their value in the PDV.
• To create an accumulating column, this default behavior must be modified.

Demo
Note: This demo must be performed in Enterprise Guide.
1. Open the p202d01.sas program in the demos folder and find the Demo section. Run the
program and notice that the values for YTDRain are all missing.
2. To determine why YTDRain is missing, open the DATA step debugger. Click Step execution to
next line to execute the highlighted SET statement. The first row from weather_houston is
loaded into the PDV.
3. Notice that the YTDRain assignment statement adds a missing value and .01. If an arithmetic
expression includes a missing value as input, the answer will be missing. This prevents you from
creating the accumulating column.
4. In the PDV area of the DATA step debugger, double-click the missing value for YTDRain and
change it to 0. Click Step execution to next line to execute the YTDRain assignment
statement, and notice that the new value is 0.01.

5. Click Step execution to next line to advance past the RUN statement. This results in the
implicit output of the contents of the PDV to the output table and implicit return to the top of the
DATA step. Notice that when SAS returns to the top of the DATA step to begin the second
iteration, YTDRain is reset to missing.
Note: By default, all computed columns are reset to missing each time that the PDV is
reinitialized.
6. Double-click the missing value for YTDRain and enter .01, which is the value from the previous
row. Step through execution in the second iteration and notice that YTDRain is 1.3, the
accumulation of day 1 and day 2.
7. Exit the debugger.
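
The behavior observed in the debugger can also be reproduced without Enterprise Guide. The following is a minimal sketch that assumes a made-up three-row table named rain_sample (the table name and its values are hypothetical, chosen so that day 1 plus day 2 gives 1.3 as in the demo). Because YTDRain is a computed column, it is reset to missing at the top of every iteration, so each PUTLOG line should show a missing YTDRain, and the log should include a note that missing values were generated.

```sas
/* Hypothetical sample data; not part of the course files */
data rain_sample;
    input Day DailyRain;
    datalines;
1 0.01
2 1.29
3 0.00
;
run;

data ytd_attempt;
    set rain_sample;
    /* YTDRain is computed, so it starts each iteration as missing */
    YTDRain=YTDRain+DailyRain;   /* missing + number = missing */
    putlog "NOTE: " _N_= DailyRain= YTDRain=;
run;
```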


Retaining Values in the PDV

To successfully create an
accumulating column:
1) Set the initial value to 0.
2) Retain the value each time
that the PDV reinitializes.


In order to create an accumulating column, you must override the default behavior of the PDV. First,
you must set the initial value of YTDRain to zero instead of missing. Second, when SAS reinitializes
the PDV at the beginning of each iteration of the DATA step, you need to retain the value of
YTDRain in the PDV instead of resetting it to missing each time. If these actions can be performed
in the PDV, you can create an accumulating column.


Retaining Values in the PDV

RETAIN column <initial-value>;

1) retains the value each time that the PDV reinitializes
2) assigns an initial value

data houston2017;
    set pg2.weather_houston;
    retain YTDRain 0;
    YTDRain=YTDRain+DailyRain;
run;


RETAIN is a compile-time statement that sets a rule for one or more columns to keep their value
each time that the PDV is reinitialized instead of resetting the value to missing. It also provides the
option of establishing an initial value in the PDV before the first iteration of the DATA step. By adding
a RETAIN statement to a program, you can change the default behavior of the PDV in this scenario
to create an accumulating column.
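
To try RETAIN without the course data, the following sketch uses a made-up table named rain_sample (the name and values are assumptions for illustration only). With the initial value set to 0 and the value retained across iterations, the PUTLOG lines should show YTDRain growing row by row instead of staying missing.

```sas
/* Hypothetical sample data; not part of the course files */
data rain_sample;
    input Day DailyRain;
    datalines;
1 0.01
2 1.29
3 0.00
;
run;

data ytd_retain;
    set rain_sample;
    retain YTDRain 0;            /* start at 0 and keep the value between rows */
    YTDRain=YTDRain+DailyRain;   /* previous total plus today's rain */
    putlog "NOTE: " _N_= DailyRain= YTDRain=;
run;
```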

2.01 Activity
Open p202a01.sas from the activities folder and perform the following tasks:
1. Modify the program to retain TotalRain and set the initial value to 0.
2. Run the program and examine the results. Why are all values for
TotalRain missing after row 4?
3. Change the assignment statement to use the SUM function instead of
the plus symbol. Run the program again. Why are the results different?


Using the Sum Statement

column+expression;

data zurich2017;
    set pg2.weather_zurich;
    TotalRain+Rain_mm;
run;

• creates TotalRain and sets the initial value to 0
• retains TotalRain
• adds Rain_mm to TotalRain for each row
• ignores missing values


You can use the sum statement as a simple way to create an accumulating column. The name of the
new accumulating column is on the left of the plus sign, and the expression to add for each row
is on the right.
The sum statement does the following four things automatically:
• creates the accumulating column on the left and sets the initial value to 0
• automatically retains the value of the accumulating column in the PDV
• adds the value of the column on the right to the accumulating column for each row
• ignores missing values
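
As a rough illustration of these four behaviors, the sketch below places a sum statement next to a longer form built from RETAIN and the SUM function. The table rain_mm_sample and its values are made up for this example, and TotalRain2 is a hypothetical column added only for comparison. Row 2 has a missing Rain_mm, and both totals should carry the previous value forward on that row rather than becoming missing.

```sas
/* Hypothetical sample data with one missing measurement */
data rain_mm_sample;
    input Day Rain_mm;
    datalines;
1 2.5
2 .
3 1.0
;
run;

data totals;
    set rain_mm_sample;
    /* Sum statement: starts at 0, retained, missing values ignored */
    TotalRain+Rain_mm;
    /* Longer form that behaves the same way */
    retain TotalRain2 0;
    TotalRain2=sum(TotalRain2,Rain_mm);
run;

proc print data=totals noobs;
run;
```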


2.02 Question
What sum statement would you add to this program to create the column
named DayNum?

data zurich2017;
set pg2.weather_zurich;
YTDRain_mm+Rain_mm;
???
run;


Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
1. Producing a Running Total
The pg2.np_yearlyTraffic table contains annual traffic counts at locations in national parks.
a. Open the p202p01.sas program in the practices folder. Open the pg2.np_yearlyTraffic
table. Notice that the Count column records the number of cars that have passed through a
particular location.
b. Modify the DATA step to create a column, totTraffic, that is the running total of Count.
c. Keep the ParkName, Location, Count, and totTraffic columns in the output table.
d. Format totTraffic so that values are displayed with commas.

Level 2
2. Producing Multiple Totals
The pg2.np_yearlyTraffic table contains annual traffic counts at locations in national parks.
Parks are classified as one of five types: National Monument, National Park, National Preserve,
National River, and National Seashore.
a. Create a table, parkTypeTraffic, from the pg2.np_yearlyTraffic table. Use the following
specifications.
1) Read only the rows from the input table where ParkType is National Monument or
National Park.
2) Create two new columns named MonumentTraffic and ParkTraffic. The value of each
column should be increased by the value of Count for that park type.
3) Format the new columns so that values are displayed with commas.


b. Create a listing report of parkTypeTraffic. Use Accumulating Traffic Totals for Park Types
as the report title. Display the columns in this order: ParkType, ParkName, Location,
Count, MonumentTraffic, and ParkTraffic.

Challenge
3. Determining Maximum Amounts
The RETAIN statement can be used for other purposes besides accumulating columns. Use the
pg2.np_monthlyTraffic table, which contains monthly traffic counts at locations in national
parks. Create new columns that sequentially store the maximum value to date for Count, as well
as the corresponding values for Month and Location.
a. Create a table, cuyahoga_maxtraffic, from the pg2.np_monthlyTraffic table. Use the
following specifications.
1) Include only rows where ParkName is equal to Cuyahoga Valley NP.
2) Create three columns: TrafficMax, MonthMax, and LocationMax. Initialize TrafficMax
to 0.
3) If the current traffic count is greater than the value in TrafficMax, then set the value of
TrafficMax equal to Count, set the value of MonthMax equal to Month, and set the
value of LocationMax equal to Location.
4) Format the Count and TrafficMax columns so that values are displayed with commas.
5) Keep only the Location, Month, Count, TrafficMax, MonthMax, and LocationMax
columns in the output table.


2.2 Processing Data in Groups

Processing Data in Groups

If your data is sorted into groups, the DATA step can identify when each group begins and
ends.

Often, questions that you have about your data are related to examining values within groups.
Remember that the DATA step processes rows one at a time, in the sequence that they occur in the
input table. If the input data is sorted in groups, the DATA step can process the data within groups.

Processing Data in Groups

• Which storm names are used more than once within a season?
• What is the maximum wind measurement for each storm?
• When did the first storm occur in each basin?


Processing Data in Groups

PROC SORT DATA=input-table
        <OUT=sorted-output-table>;
    BY <DESCENDING> col-name(s);
RUN;

The PROC SORT step sorts the table into groups.

DATA output-table;
    SET sorted-output-table;
    BY <DESCENDING> col-name(s);
RUN;

The DATA step processes the data in the sorted table by groups.

First you create an output table in the desired sort sequence based on groups. Then you use the BY
statement in the DATA step to tell SAS that you want to process the data in groups.
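
The two-step pattern can be sketched with sashelp.class, a sample table that ships with SAS. The output table names class_sort and class_groups are assumptions made for this example. If the DATA step BY statement names a column that the input is not sorted by, SAS stops with an error saying that the BY variables are not properly sorted, so the sort step comes first.

```sas
/* Step 1: write a sorted copy so that the original table is unchanged */
proc sort data=sashelp.class out=class_sort;
    by Sex;
run;

/* Step 2: tell the DATA step that rows arrive grouped by Sex */
data class_groups;
    set class_sort;
    by Sex;
run;
```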

Processing Data in Groups

First.BY-column   Last.BY-column

data storm2017_max;
    set storm2017_sort;
    by Basin;
run;

The BY statement creates First./Last. variables in the PDV that can be used to identify when
each BY group begins and ends.

PDV
…other columns…   Basin   First.Basin   Last.Basin
                          D             D

When a DATA step includes a BY statement followed by a column name, two special variables are
added to the PDV: First.BY-column and Last.BY-column to indicate the first and last rows within
each group. These variables are temporary and are dropped when the row is written to the output
table.


Processing Data in Groups


PDV (First.Basin and Last.Basin are flagged D, to be dropped)

Row position                         Basin   First.Basin   Last.Basin
first row where Basin is NA          NA      1             0
subsequent rows where Basin is NA    NA      0             0
last row where Basin is NA           NA      0             1

During the execution phase, the First. and Last. variables are assigned a value of 0 or 1. The First.
variable is 1 for the first row within a group and 0 for all other rows. Similarly, the Last. variable is 1
for the last row within a group and 0 for all other rows.
These temporary variables contain important information that we can use before the variables are
dropped when the row is written to the output table.
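
One way to look at these values outside the debugger is to copy them into permanent columns. The sketch below assumes a small made-up table, storms_sample, that is already in Basin order. The table and the column names First_Basin and Last_Basin are chosen for this illustration. The printed report should show 1 in First_Basin on the first row of each basin and 1 in Last_Basin on the last row of each basin.

```sas
/* Hypothetical sample data, already sorted by Basin */
data storms_sample;
    input Basin $ Name $;
    datalines;
EP Adrian
EP Beatriz
EP Calvin
NA Arlene
NA Bret
;
run;

data flags;
    set storms_sample;
    by Basin;
    /* First./Last. are temporary, so copy them to keep them */
    First_Basin=first.Basin;
    Last_Basin=last.Basin;
run;

proc print data=flags noobs;
run;
```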


Identifying the First and Last Row in Each Group

Scenario
Use the DATA step debugger in SAS Enterprise Guide to observe how First./Last. variables are
assigned values in the PDV during execution.

Files
• p202d02.sas
• storm_2017 – a SAS table containing one row for each storm that occurred in 2017

Syntax

PROC SORT DATA=input-table
        <OUT=output-table>;
    BY <DESCENDING> col-name(s);
RUN;

DATA output-table;
    SET input-table;
    BY <DESCENDING> col-name(s);
RUN;

First.bycol
Last.bycol

Notes
• To process data in groups, the data first must be sorted by the grouping column or columns. This
can be accomplished with PROC SORT.
• The BY statement in the DATA step indicates how the data has been grouped. Each unique value
of the BY column will be identified as a separate group.
• The BY statement creates two temporary variables in the PDV for each column listed as a BY
column: First.bycol and Last.bycol.
• First.bycol is 1 for the first row within a group and 0 otherwise. Last.bycol is 1 for the last row
within a group and 0 otherwise.
• Conditional IF-THEN logic can be used based on the values of the First./Last. variable to execute
statements in the DATA step.

Demo
Note: This demo must be performed in Enterprise Guide.
1. Open the p202d02.sas program in the demos folder and find the Demo section. Run the PROC
SORT step to create a temporary table named storm2017_sort that groups the rows by Basin.
Note that in the first 20 rows the value of Basin is EP.
2. Start the DATA step debugger and examine the contents of the PDV. Because there is a BY
statement included, the variables First.Basin and Last.Basin are included in the PDV. These
variables are temporary, so they are automatically dropped before writing each row to the output
table.


3. Click Step execution to next line to execute the SET statement. The first row from the
input table is loaded into the PDV. Because it is the first row in the Basin=EP group, First.Basin
is 1. SAS is able to look ahead to the next sequential row in the input table and determine it is
not the last occurrence of EP, so Last.Basin is 0.
4. After the SET statement executes, SAS skips to the RUN statement. The BY statement is a
compile-time statement that adds the First./Last. variables to the PDV, so it is not stepped
through during execution. Click Step execution to next line again to advance past the RUN statement.
5. Execute the SET statement for the second iteration. The group value is still EP, so First.Basin
and Last.Basin are both 0 because it is not the first or last occurrence of EP in the group.
6. Click the Watch check box next to Last.Basin and click Start/continue debugger execution
to execute the program until the value of Last.Basin changes. Note that the debugger
stops when _N_ is 20. This is the last row in the EP group, so First.Basin is 0 and Last.Basin
is 1.

7. Click Start/continue debugger execution to proceed through execution until Last.Basin
changes again. Notice that when _N_ is 21, the group value for Basin changes to NA, and
First.Basin is 1.
8. Exit the debugger. Because the First.Basin and Last.Basin variables are temporary, they are
not included in the output table. Uncomment the two assignment statements to assign the values
to permanent columns to view their values for each row.
9. Run the program and examine the values for the First_Basin and Last_Basin columns.


2.03 Activity
Open p202a03.sas from the activities folder and perform the following tasks:
1. Modify the PROC SORT step to sort the rows within each value of Basin
by MaxWindMPH. Highlight the PROC SORT step and run the selected
code. Which row within each value of Basin represents the storm with
the highest wind?
2. Add the following WHERE statement immediately after the BY statement
in the DATA step. The intent is to include only the last row within each
value of Basin. Does the program run successfully?

where last.Basin=1;

Keep this program open for the next activity.

Subsetting Rows in Execution

WHERE expression;

input table -> PDV

The WHERE expression must be based on columns in the input table.

The WHERE statement is a compile-time statement that establishes rules about which rows are read
into the PDV. The WHERE expression must be based on columns that exist in the input table
referenced in the SET statement.
The values for the First. and Last. variables are not assigned until after a row is read into the PDV.
Therefore, a WHERE expression cannot reference them. Instead, you need to subset the data during the
execution phase, based on values that can be assigned or changed after a row is read into the PDV.
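
The difference can be sketched as follows, assuming a made-up table named storms_sample (the table, its values, and the 45 mph threshold are hypothetical). The WHERE statement is valid because MaxWindMPH is a column in the input table. The condition on last.Basin is written as a subsetting IF statement because that value exists only in the PDV. Note that First./Last. values are assigned among the rows that pass the WHERE filter.

```sas
/* Hypothetical sample data, already sorted by Basin */
data storms_sample;
    input Basin $ Name $ MaxWindMPH;
    datalines;
EP Adrian 45
EP Beatriz 40
NA Arlene 50
NA Bret 45
;
run;

data strong_last;
    set storms_sample;
    by Basin;
    where MaxWindMPH>=45;   /* valid: column exists in the input table */
    if last.Basin=1;        /* evaluated in execution, after the row is in the PDV */
run;
```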


Subsetting Rows in Execution

IF expression;

PDV -> output table

The IF expression can be based on any values in the PDV.

The subsetting IF statement is an executable statement, so it processes during the execution phase
in the order in which it occurs in the DATA step code. When the expression is true, the DATA step
continues to execute the remaining statements in that iteration, including any explicit OUTPUT
statements or the implicit output that occurs with the RUN statement. If the expression is not true,
the DATA step immediately stops processing statements for that particular iteration, likely skipping
the output trigger, and the row is not written to the output table.


Subsetting Rows in Execution

When the IF condition is true, SAS continues processing statements for that row.

data storm2017_max;
    set storm2017_sort;
    by Basin;
    if last.Basin=1;                 /* true: continue */
    StormLength=EndDate-StartDate;
    MaxWindKM=MaxWindMPH*1.60934;
run;
Implicit OUTPUT;
Implicit RETURN;

(p202d03)

This program uses the subsetting IF statement to delay evaluating the expression until the execution
phase, when the First. and Last. variables are assigned values in the PDV. When the expression is
true, meaning it is the last row for a particular Basin value, then SAS processes the StormLength
and MaxWindKM assignment statements. Then SAS comes to the RUN statement, which includes
the implicit output and implicit return, and the contents of the PDV are written to the output table.


Subsetting in Execution

When the IF condition is false, SAS stops processing statements for that row and returns to the
top of the DATA step.

data storm2017_max;
    set storm2017_sort;
    by Basin;
    if last.Basin=1;                 /* false: implicit RETURN happens here */
    StormLength=EndDate-StartDate;
    MaxWindKM=MaxWindMPH*1.60934;
run;
Implicit OUTPUT; (skipped)

When the subsetting IF condition is false, meaning it is not the last row for a particular Basin value,
SAS skips the remaining statements, including the implicit output, and moves on to the next iteration
of the DATA step. The subsetting IF statement not only filters the rows that are written to the output
table but also prevents unnecessary statements from executing.
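
To see which statements run for each row, PUTLOG statements can be placed before and after the subsetting IF statement. This is a sketch with a made-up table, storms_sample, sorted by Basin and then by MaxWindMPH (the table and values are assumptions). The "read" message should appear for every row, while the "kept" message should appear only for the last row of each basin.

```sas
/* Hypothetical sample data, sorted by Basin and then MaxWindMPH */
data storms_sample;
    input Basin $ Name $ MaxWindMPH;
    datalines;
EP Beatriz 40
EP Adrian 45
NA Bret 45
NA Arlene 50
;
run;

data basin_max;
    set storms_sample;
    by Basin;
    putlog "NOTE: read " Name=;           /* runs for every row */
    if last.Basin=1;                      /* false: skip the rest of this iteration */
    MaxWindKM=MaxWindMPH*1.60934;
    putlog "NOTE: kept " Name= MaxWindKM=;  /* runs only for the last row in a group */
run;
```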

2.04 Activity
Use the program from the previous activity to perform the following tasks:
1. Change the WHERE statement to a subsetting IF statement and submit
the program. How many rows are included in the output table?
2. Move the subsetting IF statement just before the RUN statement and
submit the program. How many rows are included in the output table?
3. Consider the sequence of the statements in the execution phase. Where
is the optimal placement of the subsetting IF statement?


Accumulating Column within Groups

How can we reset an accumulating column at the beginning of each group?

You have seen how to use First. or Last. variables to keep only the first or last row within groups.
This is only one of many practical benefits of this syntax. You can also use First. and Last. variables
to prompt for some action to occur at the beginning or end of each group.

Let's look at our weather data for Houston. We created an accumulating total for the DailyRain
column during the entire year using a sum statement, but what if we would like to create a
month-to-date total column? In other words, we would like to create an accumulating column
named MTDRain and reset it every time a new month begins.


Creating an Accumulating Column within Groups

Scenario
Use the First. variable to reset an accumulating column for rain totals at the beginning of each
month.

Files
• p202d03.sas
• weather_houston – a SAS table with daily weather measurements in the Houston, TX, area
during 2017

Syntax

Subsetting IF statement:

IF expression;

First.bycol
Last.bycol

Notes
• First./Last. variables can be used in combination with IF-THEN logic to execute one or more
statements at the beginning or end of a group.
• The subsetting IF statement affects which rows are written from the PDV to the output table. The
expression can be based on values in the PDV.
• When the subsetting IF expression is true, the remaining statements are executed for that
iteration, including any explicit OUTPUT statements or the implicit output that occurs with the RUN
statement.
• If the subsetting IF expression is not true, the DATA step immediately stops processing statements
for that particular iteration, likely skipping the output trigger, and the row is not written to the output
table.

Demo
1. Open the p202d03.sas program in the demos folder and find the Demo section. Highlight the
DATA step and run the selected code. Notice that YTDRain is an accumulating column that
creates a running total of DailyRain. Also notice that the data is sorted by Month and Date.
2. Add a BY statement to process the rows by groups based on the values of Month.
3. Change the new accumulating column to MTDRain in the KEEP and sum statements.
4. Reset MTDRain to 0 each time that SAS reaches the first row within a new Month group.
Highlight the DATA step and run the selected code.


data houston_monthly;
set pg2.weather_houston;
keep Date Month DailyRain MTDRain;
by Month;
if First.Month=1 then MTDRain=0;
MTDRain+DailyRain;
run;
Partial Results (Rows 177-183)


2.05 Activity
Open p202a05.sas from the activities folder. Add a subsetting IF statement to
output only the final day of each month.

data houston_monthly;
set pg2.weather_houston;
keep Date Month DailyRain MTDRain;
by Month;
if First.Month=1 then MTDRain=0;
MTDRain+DailyRain;
run;


Multiple BY Columns

Can we group the data by multiple columns?

If you sort data by more than one column, SAS arranges the data based on the first BY column, and
it then sorts the second BY column within each unique value of the first BY column. In this table,
rows are sorted by Year first, and then within Year by Qtr.


Multiple BY Columns

data sydney_summary;
set pg2.weather_sydney;
by Year Qtr;
run;

PDV columns: Year, Qtr, First.Year (D), Last.Year (D), First.Qtr (D), Last.Qtr (D), other columns
(D marks the columns that are dropped from the output table.)

First./Last. variables are created for each column in the BY statement.

If multiple columns are listed in the BY statement in the DATA step, then each column has its own
First. and Last. variables in the PDV.

Multiple BY Columns

First./Last. values for Qtr are assigned within each value of Year.

The First. and Last. variables for Qtr indicate when a quarter begins and ends within a particular
value of Year.
Note: The First. and Last. variables are dropped from the output data, but to create this example
table, assignment statements were used to create permanent variables to display the values.
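
A minimal sketch of the technique mentioned in the note follows. It assumes the sashelp.shoes sample table in place of the table shown on the slide, and the output table and column names (shoes_flags, FirstRegion, and so on) are hypothetical.

```sas
/* Sketch only: sashelp.shoes stands in for the table on the slide. */
proc sort data=sashelp.shoes out=work.shoes_sorted;
    by Region Product;
run;

data work.shoes_flags;
    set work.shoes_sorted;
    by Region Product;
    /* First./Last. variables are temporary, so copy them to */
    /* ordinary columns to see their values in the output.   */
    FirstRegion=First.Region;
    LastRegion=Last.Region;
    FirstProduct=First.Product;
    LastProduct=Last.Product;
    keep Region Product Sales FirstRegion LastRegion
         FirstProduct LastProduct;
run;
```

Viewing shoes_flags should show FirstProduct equal to 1 at the start of each Product group within a Region, even when the same Product value appeared earlier in a different Region.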


Multiple BY Columns

First.Qtr=1 because it is the first occurrence of Qtr=1 in 2018.

Notice that for this highlighted row First.Qtr is equal to 1. It is not the first time that Qtr is equal to 1
in the table, but it is the first occurrence of Qtr 1 within Year 2018.

Discussion
Summarizing data within groups can be
performed in the DATA step or in
procedures such as PROC MEANS. What
are some examples of situations when
you might choose to use either the DATA
step or PROC MEANS?
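
As one possible starting point for this discussion, the sketch below produces the same group totals both ways. It assumes the sashelp.shoes sample table, and the output table names are hypothetical.

```sas
/* Sketch only: total Sales for each Region, computed two ways. */
proc sort data=sashelp.shoes out=work.shoes_sorted;
    by Region;
run;

/* DATA step: you control the logic that runs in each group. */
data work.region_totals;
    set work.shoes_sorted;
    by Region;
    if First.Region=1 then TotalSales=0;
    TotalSales+Sales;
    if Last.Region=1;
    keep Region TotalSales;
run;

/* PROC MEANS: no sorting or accumulator logic is needed. */
proc means data=sashelp.shoes sum nway noprint;
    class Region;
    var Sales;
    output out=work.region_totals_pm(drop=_type_ _freq_) sum=TotalSales;
run;
```

The DATA step version is longer, but it can carry additional row-level logic in the same pass. The PROC MEANS version is shorter when only standard statistics are needed.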


Beyond SAS Programming 2


What if you want to ...
• . . . ask other SAS programmers questions about the DATA step? Search or post to the Base SAS
Programming community.
• . . . learn more about the Enterprise Guide DATA step debugger? Read the paper Step through Your
DATA Step: Introducing the DATA Step Debugger in SAS Enterprise Guide, or watch the video DATA
step debugger in SAS Enterprise Guide.
• . . . view examples of processing data in groups using First./Last. variables? Read the blog post
How to use FIRST.variable and LAST.variable in a BY-group analysis in SAS.


Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
4. Generating an Accumulating Column within Groups
The pg2.np_yearlyTraffic table contains annual traffic counts at locations in national parks.
Park names are grouped into park types.
a. Open the p202p04.sas program in the practices folder. Complete the PROC SORT step to
sort the pg2.np_yearlyTraffic table by ParkType and ParkName.
b. Modify the DATA step as follows:
1) Read the sorted table created in PROC SORT.
2) Add a BY statement to group the data by ParkType.
3) Create a column, TypeCount, that is the running total of Count within each value of
ParkType.
4) Format TypeCount so that values are displayed with commas.
5) Keep only the ParkType and TypeCount columns.
c. Run the program and confirm that TypeCount is reset at the beginning of each ParkType
group.
d. Modify the program to write only the last row for each ParkType to the output table.

Level 2
5. Generating an Accumulating Column within Multiple Groups
The sashelp.shoes table contains sales information for various products in each region and
subsidiary. Numbers for sales and returns are recorded for each row. Create a summary table
that includes the sum of Profit for each region and product.
a. Create a sorted copy of sashelp.shoes that is ordered by Region and Product.
b. Use the DATA step to read the sorted table and create a new table named profitsummary.
Create a column named Profit that is the difference between Sales and Returns.
c. Create an accumulating column named TotalProfit that is a running total of Profit within
each value of Region and Product. Reset TotalProfit for each new combination of Region
and Product. Run the program and verify that TotalProfit is accurate.

d. Modify the DATA step to include only the last row for each Region and Product combination.
Keep Region, Product, and TotalProfit, and format TotalProfit as a currency value.

Challenge
6. Creating Multiple Output Tables Based on Group Values
The pg2.np_acres table contains acreage amounts for national parks. The park state is also
provided. However, some parks span multiple states and therefore have multiple rows of data.
a. Create two tables from the pg2.np_acres table:
• singlestate, which contains the rows with unique park names
• multistate, which contains the rows with park names that appear in multiple states.
The parks should be grouped within their associated regions. When sorting the data, you
need to keep only the Region, ParkName, State, and GrossAcres columns.
singlestate (5 of 367 rows)

multistate (5 of 89 rows)


2.3 Solutions
Solutions to Practices
1. Producing a Running Total
data totalTraffic;
set pg2.np_yearlyTraffic;
retain totTraffic 0;
totTraffic=totTraffic+Count;
keep ParkName Location Count totTraffic;
format totTraffic comma12.;
run;

/*OR*/

data totalTraffic;
set pg2.np_yearlyTraffic;
totTraffic+Count;
keep ParkName Location Count totTraffic;
format totTraffic comma12.;
run;
2. Producing Multiple Totals
data work.parktypetraffic;
set pg2.np_yearlyTraffic;
where ParkType in ("National Monument", "National Park");
if ParkType = 'National Monument' then MonumentTraffic+Count;
else ParkTraffic+Count;
format MonumentTraffic ParkTraffic comma15.;
run;

title 'Accumulating Traffic Totals for Park Types';


proc print data=work.parktypetraffic;
var ParkType ParkName Location Count MonumentTraffic
ParkTraffic;
run;
title;


3. Determining Maximum Amounts


data cuyahoga_maxtraffic;
set pg2.np_monthlyTraffic;
where ParkName = 'Cuyahoga Valley NP';
retain TrafficMax 0 MonthMax LocationMax;
if Count>TrafficMax then do;
TrafficMax=Count;
MonthMax=Month;
LocationMax=Location;
end;
format Count TrafficMax comma15.;
keep Location Month Count TrafficMax MonthMax LocationMax;
run;
4. Generating an Accumulating Column within Groups
proc sort data=pg2.np_yearlyTraffic
out=work.sortedTraffic(keep=ParkType ParkName
Location Count);
by ParkType ParkName;
run;

data TypeTraffic;
set work.sortedTraffic;
by ParkType;
if First.ParkType=1 then TypeCount=0;
TypeCount+Count;
if Last.ParkType=1;
format typeCount comma12.;
keep ParkType TypeCount;
run;

/*ALTERNATE SOLUTION*/

data TypeTraffic;
set work.sortedTraffic;
by ParkType;
retain TypeCount 0;
if First.ParkType=1 then TypeCount=0;
TypeCount=TypeCount+Count;
if Last.ParkType=1;
format TypeCount comma12.;
keep ParkType TypeCount;
run;


5. Generating an Accumulating Column within Multiple Groups


proc sort data=sashelp.shoes out=sort_shoes;
by Region Product;
run;

data profitsummary;
set sort_shoes;
by Region Product;
Profit=Sales-Returns;
if First.Product=1 then TotalProfit=0;
TotalProfit+Profit;
if Last.Product=1;
keep Region Product TotalProfit;
format TotalProfit dollar12.;
run;
6. Creating Multiple Output Tables Based on Group Values
proc sort data=pg2.np_acres
out=sortedAcres(keep=Region ParkName State GrossAcres);
by Region ParkName;
run;

data multiState singleState;


set sortedAcres;
by Region ParkName;
if First.ParkName=1 and Last.ParkName=1
then output singleState;
else output multiState;
format GrossAcres comma15.;
run;


Solutions to Activities and Questions


2.01 Activity – Correct Answer
Open p202a01.sas from the activities folder and perform the following tasks:
1. Modify the program to retain TotalRain and set the initial value to 0.
retain TotalRain 0;

2. Run the program and examine the results. Why are all values for
TotalRain missing after row 4?
TotalRain=TotalRain+Rain_mm;

Rain_mm is missing in row 5, so all subsequent values of TotalRain are missing.

2.01 Activity – Correct Answer


3. Change the assignment statement to use the SUM function instead of
the plus symbol. Run the program again. Why are the results different?

data zurich2017;
set pg2.weather_zurich;
retain TotalRain 0;
TotalRain=sum(TotalRain,Rain_mm);
run;
The SUM function ignores missing values!

2.02 Question – Correct Answer


What sum statement would you add to this program to create the column
named DayNum?

data zurich2017;
set pg2.weather_zurich;
YTDRain_mm+Rain_mm;
DayNum+1;
run;


2.03 Activity – Correct Answer


1. Which row within each value of Basin represents the storm with the
highest wind?
The last row within each value of Basin

2. Does the program run successfully?


No, an error is generated in the log.

32 where last.Basin=1;
__________
180
ERROR: Syntax error while parsing WHERE clause.
ERROR 180-322: Statement is not valid or it is used out of proper order.


2.04 Activity – Correct Answer
1. Change the WHERE statement to a subsetting IF statement and submit
the program. How many rows are included in the output table?
Five rows, one for each value of Basin
2. Move the subsetting IF statement just before the RUN statement and
submit the program. How many rows are included in the output data?
Five rows, same as the previous program


2.04 Activity – Correct Answer


3. Consider the sequence of the statements in the execution phase. Where
is the optimal placement of the subsetting IF statement?

data storm2017_max;
set storm2017_sort;
by Basin;
if last.Basin=1;
StormLength=EndDate-StartDate;
MaxWindKM=MaxWindMPH*1.60934;
run;

Use the subsetting IF statement as early as possible so that SAS processes additional statements
only for rows that will be written to the output table.

2.05 Activity – Correct Answer


Open p202a05.sas from the activities folder. Add a subsetting IF statement to
output only the final day of each month.

data houston_monthly;
set pg2.weather_houston;
keep Date Month DailyRain MTDRain;
by Month;
if First.Month=1 then MTDRain=0;
MTDRain+DailyRain;
if last.Month=1;
run;

Equivalent statements:
if Last.Month;
if Last.Month then output;

Lesson 3 Manipulating Data with
Functions
3.1 Understanding SAS Functions and CALL Routines

3.2 Using Numeric and Date Functions
Demonstration: Using Numeric Functions
Demonstration: Shifting Date Values
Practice

3.3 Using Character Functions
Demonstration: Using Character Functions to Extract Words from a String
Practice

3.4 Using Special Functions to Convert Column Type
Demonstration: Using the INPUT and PUT Functions to Convert Column Types

3.5 Solutions
Solutions to Practices
Solutions to Activities and Questions

3.1 Understanding SAS Functions and CALL Routines

SAS Functions and CALL Routines

Categories of functions and CALL routines: Descriptive Statistics, Arithmetic, Date and Time,
External Files, Distance, Trigonometric, Financial, Random Number, Truncation, Special, Character,
State and ZIP Code, Mathematical, Probability

Many functions and CALL routines are available in SAS to manipulate your data. In SAS
documentation, functions and CALL routines are grouped by category. You should already be
familiar with some functions, but in this lesson, you learn about new functions from the Descriptive
Statistics, Date and Time, Truncation, Special, Random Number, and Character categories.


What Is a SAS Function?

function(argument1, argument2, ...);


A function performs a
specific computation
or manipulation and
returns a value.


A SAS function is a named, predefined process that can be used in a SAS program to produce a
value. The function might accept zero, one, or several arguments as input. Based on the arguments,
the function performs its specified computation or manipulation and returns a value.
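
A minimal sketch of functions that take different numbers of arguments is shown here. The column names and literal values are hypothetical and serve only to illustrate the call syntax.

```sas
data _null_;
    RunDate=today();          /* no arguments: returns the current date */
    Name=upcase('alfred');    /* one argument: returns ALFRED */
    Total=sum(10, 20, 30);    /* several arguments: returns 60 */
    put RunDate= date9. Name= Total=;
run;
```

Each function call appears on the right side of an assignment statement, and the returned value is stored in the column on the left.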

Discussion
Where can functions be used in a DATA step?


3.01 Activity
Open p203a01.sas from the activities folder and perform the following tasks:
1. Run the program. Why does the DATA step fail? Correct the error by
overwriting the value of the column Name in uppercase.
2. Examine the expressions for Mean1, Mean2, and Mean3. Each one is a
method for specifying a list of columns as arguments in a function. Run
the program and verify that the values in these three columns are the
same.
3. In the expression for Mean2, delete the keyword OF and run the
program. What do the values in Mean2 represent?


Specifying Column Lists


data quiz_summary;
set pg2.class_quiz;
Name=upcase(Name);
AvgQuiz=mean(of Q:);
format Quiz1--AvgQuiz 3.1;
/*OR*/
format _numeric_ 3.1;
run;

• The double dash includes all columns between and including the two specified columns as they
are ordered in the PDV.
• The keyword _NUMERIC_ includes all numeric columns.
• Column lists can be used in statements as well!

You can use a double dash to represent a physical range of columns as they are ordered in the data.
This FORMAT statement formats all columns from left to right from Quiz1 to AvgQuiz with the 3.1
format.
You could also use the keyword _NUMERIC_ to include all numeric columns with the 3.1 format. You
can also use the keywords _CHARACTER_ and _ALL_ to group columns.
Note: You do not need to use the OF keyword in the FORMAT statement. That is a special
requirement when you use column lists as arguments in a function or CALL routine.
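
The effect of the OF keyword can be seen in a small sketch. The columns Q1 through Q3 and their values are hypothetical.

```sas
data _null_;
    Q1=8; Q2=6; Q3=4;
    /* With OF, Q1-Q3 is a column list: the mean of 8, 6, and 4 is 6. */
    WithOf=mean(of Q1-Q3);
    /* Without OF, Q1-Q3 is a subtraction: the mean of the single */
    /* value 8-4 is 4.                                            */
    WithoutOf=mean(Q1-Q3);
    put WithOf= WithoutOf=;
run;
```

The log should show WithOf=6 and WithoutOf=4, which is why the keyword matters inside function arguments.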


SAS Column Lists

Numbered range lists
• x1-xn: Specifies all columns from x1 to xn inclusive. You can begin with any number and end with
any number as long as you do not violate the rules for user-supplied column names and the
numbers are consecutive.

Name range lists
• x--a: Specifies all columns ordered as they are in the program data vector, from x to a inclusive.
• x-numeric-a: Specifies all numeric columns from x to a inclusive.
• x-character-a: Specifies all character columns from x to a inclusive.

Name prefix lists
• REV: Specifies all the columns that begin with REV, such as REVJAN, REVFEB, and REVMAR.

Special SAS name lists
• _ALL_: Specifies all columns that are already defined in the current DATA step.
• _NUMERIC_: Specifies all numeric columns that are already defined in the current DATA step.
• _CHARACTER_: Specifies all character columns that are already defined in the current DATA step.


What Is a SAS CALL Routine?

CALL routine(argument-1 <, ...argument-n>);

A CALL routine is used in a CALL statement.

CALL routines alter column values or perform system actions. They cannot be used in assignment
statements or expressions.

A CALL routine also performs a computation or a system manipulation based on input that you
provide in arguments. However, a CALL routine does not return a value. Instead, it alters column
values or performs other system functions. For a CALL routine to be able to modify the value
of an argument, that argument must be supplied as a column name. Constants and expressions
are not valid.
All SAS CALL routines are invoked with CALL statements. In other words, the name of the routine
must appear after the keyword CALL in the CALL statement.


Using a CALL Routine to Modify Data

data quiz_report;
set pg2.class_quiz;
call sortn(of Quiz1-Quiz5);
QuizAvg=mean(of Quiz3-Quiz5);
run;

CALL SORTN sorts the values of the columns in ascending order.

Suppose you have a class of students who have taken five quizzes, and you want to drop each
student’s lowest two quiz scores and base their grade on the average of the top three scores. The
CALL SORTN routine takes the columns provided as arguments and reorders the numeric values
from low to high. Notice that the data values are not assigned to new columns, but instead they are
reordered in the existing columns. The mean score is then calculated based on the values of Quiz3
through Quiz5.
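
The reordering can be traced with a single hypothetical student. The quiz values below are made up so that the step runs without the course data.

```sas
data _null_;
    Quiz1=7; Quiz2=10; Quiz3=5; Quiz4=9; Quiz5=8;
    /* CALL SORTN reorders the values in place, low to high: 5 7 8 9 10. */
    call sortn(of Quiz1-Quiz5);
    /* Quiz3 through Quiz5 now hold the top three scores: 8, 9, and 10. */
    QuizAvg=mean(of Quiz3-Quiz5);
    put Quiz1= Quiz2= Quiz3= Quiz4= Quiz5= QuizAvg=;
run;
```

After the CALL statement, the column names no longer identify which quiz a score came from, so this approach fits cases where only the ranked values matter.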

3.02 Activity
Open p203a02.sas from the activities folder and perform the following tasks:
1. Examine the program and notice that all quiz scores for two students are
changed to missing values. Highlight the first DATA step and submit the
selected code.
2. In a web browser, access SAS Help at
http://support.sas.com/documentation. Click the Programming: SAS 9.4
and Viya link.
3. In the Syntax – Quick Links section, click CALL Routines. Use the
documentation to read about the CALL MISSING routine.
4. Simplify the second DATA step by using CALL MISSING to assign missing
values for the two students’ quiz scores. Run the step.


3.2 Using Numeric and Date Functions

Numeric Functions

• Calculate summary statistics: SUM(num1, num2, …), MEAN(num1, num2, …)
• Extract information from SAS date values: YEAR(SAS-date), MONTH(SAS-date)
• Create SAS date values: TODAY(), MDY(month,day,year)

You should be familiar with some basic numeric functions in the descriptive statistics and date/time
categories, such as SUM or MEAN to calculate summary statistics, YEAR or MONTH to extract
information from a SAS date value, and TODAY or MDY to create a SAS date value. These functions
just scratch the surface of what SAS offers in the numeric and date/time function categories.


Using Numeric Functions

Suppose you want a random number for each student, the top three quiz scores, and a top three
average.

Suppose you have a table that includes student names and five quiz grades, and you want to create
an output table that includes only the top three quiz scores, an average of the top three scores, and
a randomly generated number for each student. You can use several numeric functions to do this.

Using Numeric Functions


RAND('distribution', parameter1, ...parameterk)

The RAND function can be used to assign a random number to each student.

Using Numeric Functions


LARGEST(k, value-1 <, value-2 ...>)

The LARGEST function can be used to identify the top three quiz scores.

There is also a SMALLEST function that returns the kth smallest nonmissing value.

Using Numeric Functions


ROUND(number <, rounding-unit>)

The MEAN and ROUND functions can be used to calculate the average quiz score with one decimal
place.

Using Numeric Functions

Scenario
Use numeric functions to assign a random four-digit integer to each student, identify each student’s
three highest quiz scores, and calculate the mean of those three quizzes. Round the mean to the
nearest tenth.

Files
• p203d01.sas
• class_quiz – a SAS table containing scores for five quizzes for the 19 students in sashelp.class

Syntax

RAND('distribution', parameter1, ...parameterk)

LARGEST(k, value-1 <, value-2 ...>)

ROUND(number <, rounding-unit>)

Notes
RAND function
• The RAND function generates random numbers from a selected distribution.
• The first argument specifies the distribution, and the remaining arguments differ depending on the
distribution.
• To generate a random, uniformly distributed integer, use 'INTEGER' as the first argument. The
second and third arguments are the lower and upper limits.
LARGEST function
• The LARGEST function returns the kth largest nonmissing value.
• The first argument, k, specifies which value to return (1 for the largest, 2 for the second
largest, and so on), and the remaining arguments are the numbers to evaluate.
• There is also a SMALLEST function that returns the kth smallest nonmissing value.
ROUND function
• By default, the ROUND function rounds the first argument to the nearest integer.
• The optional second argument can be provided to indicate the rounding unit (for example, .1 for
the nearest tenth).


Demo
1. Open the p203d01.sas program in the demos folder and find the Demo section. Copy and
paste the Quiz1st assignment statement twice and modify the statements to create columns
named Quiz2nd and Quiz3rd.
data quiz_analysis;
set pg2.class_quiz;
drop Quiz1-Quiz5;
Quiz1st=largest(1, of Quiz1-Quiz5);
Quiz2nd=largest(2, of Quiz1-Quiz5);
Quiz3rd=largest(3, of Quiz1-Quiz5);
run;
2. Create a new column named Top3Avg that uses the MEAN function with the top three quiz
scores as the arguments.
Top3Avg=mean(Quiz1st, Quiz2nd, Quiz3rd);
3. Add Name in the DROP statement.
4. Before the SET statement, create a new column named StudentID. Use the RAND function with
'INTEGER' as the first argument. This generates random integers between the values specified
in the second and third arguments. To create a four-digit number, use 1000 as the lower limit and
9999 as the upper limit. Highlight the DATA step and run the selected code.
Note: Because you placed the assignment statement before the SET statement, StudentID is
the first column added to the PDV and the leftmost column in the output table.
data quiz_analysis;
StudentID=rand('integer',1000,9999);
set pg2.class_quiz;
drop Quiz1-Quiz5 Name;
Quiz1st=largest(1, of Quiz1-Quiz5);
Quiz2nd=largest(2, of Quiz1-Quiz5);
Quiz3rd=largest(3, of Quiz1-Quiz5);
Top3Avg=mean(Quiz1st, Quiz2nd, Quiz3rd);
run;
5. Modify the Top3Avg assignment statement to use the ROUND function to round the values
returned by the MEAN function to the nearest integer. Highlight the DATA step and run the
selected code.
Top3Avg=round(mean(Quiz1st, Quiz2nd, Quiz3rd));

6. Add a second argument in the ROUND function to round values to the nearest .1. Highlight the
DATA step and run the selected code.
Top3Avg=round(mean(Quiz1st, Quiz2nd, Quiz3rd), .1);
Note: Because the numbers for StudentID are randomly assigned, your output for the
StudentID column might differ.
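
If repeatable values are needed, one option is to set a seed before calling RAND. The sketch below uses the CALL STREAMINIT routine, which is not covered in this demonstration. The seed value 12345 and the table name id_sample are arbitrary choices for illustration.

```sas
data work.id_sample;
    /* CALL STREAMINIT sets the seed for the RAND function. */
    /* The seed value 12345 is arbitrary.                   */
    call streaminit(12345);
    do i=1 to 5;
        StudentID=rand('integer', 1000, 9999);
        output;
    end;
    drop i;
run;
```

Rerunning the step with the same seed should reproduce the same five StudentID values.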


Changing Numeric Precision


Function: What it does
• CEIL(number): Returns the smallest integer that is greater than or equal to the argument
• FLOOR(number): Returns the largest integer that is less than or equal to the argument
• INT(number): Returns the integer value

These functions can be used to truncate decimal values.

There are other functions that eliminate the decimal places and convert a numeric value to an
integer. The CEIL function rounds each number up to the nearest integer, and the FLOOR function
rounds each number down to the nearest integer. There is also the INT function, which simply
truncates the number and returns the integer portion only.
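
The difference among the three functions is easiest to see with a positive and a negative value. The sketch below uses hypothetical column names and two made-up inputs.

```sas
data _null_;
    do x=2.3, -2.3;
        CeilX=ceil(x);     /*  2.3 -> 3    -2.3 -> -2 */
        FloorX=floor(x);   /*  2.3 -> 2    -2.3 -> -3 */
        IntX=int(x);       /*  2.3 -> 2    -2.3 -> -2 */
        put x= CeilX= FloorX= IntX=;
    end;
run;
```

For positive values INT matches FLOOR, and for negative values INT matches CEIL, because INT drops the decimal portion instead of rounding in a fixed direction.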

3.03 Activity
Open p203a03.sas from the activities folder and perform the following tasks:
1. Notice that the expressions for WindAvg1 and WindAvg2 are the same.
Run the program and examine the output table.
2. Modify the WindAvg1 expression to use the ROUND function to round
values to the nearest tenth (.1).
3. Add a FORMAT statement to format WindAvg2 with the 5.1 format. Run
the program. What is the difference between using a function and a
format?


Date, Datetime, and Time Values

• SAS date: the number of days before (-n) or after (n) 01Jan1960 (0)
• SAS time: the number of seconds (n) after midnight, 00:00 (0)
• SAS datetime: the number of seconds before (-n) or after (n) 00:00 on 01Jan1960 (0)

Remember that dates in SAS are stored as a number that represents the number of days since
January 1, 1960. It is possible for data to also include times, or a combination of dates and times. A
time value in SAS is stored as the number of seconds from midnight. A datetime value in SAS is
stored as the number of seconds from midnight on January 1, 1960. Just like SAS date values, this
numeric storage method enables you to calculate time between two events, or sort by time or
datetime.
SAS offers a wide variety of date functions that enable you to extract information from date, time or
datetime values, perform interval calculations, or shift values based on a specified time period.
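
The three storage methods can be checked with SAS date, time, and datetime constants. The column names and values below are hypothetical.

```sas
data _null_;
    d='02JAN1960'd;              /* 1 day after 01Jan1960 */
    t='01:00:00't;               /* 3600 seconds after midnight */
    dt='02JAN1960:01:00:00'dt;   /* 86400 + 3600 = 90000 seconds */
    put d= t= dt=;
run;
```

Because no formats are applied, the log shows the raw stored numbers: d=1, t=3600, and dt=90000.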


Extracting Data from a Datetime Value


DATEPART(datetime-value) TIMEPART(datetime-value)

data storm_detail2;
set pg2.storm_detail;
WindDate=datepart(ISO_Time);
WindTime=timepart(ISO_Time);
format WindDate date9. WindTime time.;
run;

PDV
ISO_Time WindDate WindTime
628192800 7270 21600

There are many formats that enable SAS to display both the date and time component, but what if
you want to separate the date or time value and create new columns? This can be accomplished
easily with either the DATEPART or TIMEPART function. The only required argument is a SAS
datetime value. After the date or time component is isolated, you can use any relevant date or time
formats or functions to further enhance or manipulate the values.
In this example, the DATEPART and TIMEPART functions are used to create two new columns,
WindDate and WindTime. The functions return raw SAS date and SAS time numeric values, which
can then be formatted to improve the display.
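
A self-contained sketch of the same idea follows. It uses a hypothetical datetime constant instead of the ISO_Time column so that it runs without the course data.

```sas
data _null_;
    dt='02JAN1960:06:00:00'dt;   /* 86400 + 21600 = 108000 seconds */
    WindDate=datepart(dt);       /* 1 (days since 01Jan1960) */
    WindTime=timepart(dt);       /* 21600 (seconds since midnight) */
    /* Raw numeric values */
    put dt= WindDate= WindTime=;
    /* The same values displayed with formats */
    put WindDate= date9. WindTime= time.;
run;
```

The first PUT statement writes the stored numbers, and the second writes the same values with the DATE9. and TIME. formats applied.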


Calculating Date Intervals

INTCK('interval', start-date, end-date <, 'method'>)

• 'interval': the interval that you want to count
• start-date, end-date: SAS date columns
• 'method': the method for calculating intervals

The INTCK function counts the number of date or time intervals between two events. Possible
intervals include week, month, year, weekday, or hour.

Suppose we need to count the number of time intervals, such as weeks, weekdays, or months, that
have occurred between a start date and an end date. The INTCK function returns the number of
interval boundaries of a given time period that occur between two dates.


Calculating Date Intervals


Method

'discrete' or 'd'   Each interval has a fixed boundary. For example, a week ends after
                    Saturday, or a year ends on December 31.

[Calendar figure with the begin and end dates marked] This storm passes two weekly
boundaries using the default discrete method.


Let’s look at the start date and end date for a storm that began on July 21 and ended on July 31.
Consider these dates on a calendar. How many weeks passed between these two dates? It depends
on how you count when one week ends and the next week begins.
One option is to count the number of standard interval boundaries that occur between the dates. A
standard week begins on Sunday and ends on Saturday. For this storm, two end-of-week boundaries
occur. The first is between July 22 and 23, and the second is between July 29 and 30. This method
for counting interval boundaries is called discrete and is the default for the INTCK function.


Calculating Date Intervals


Method

'continuous' or 'c'   Each interval is measured relative to the start date or time.

[Calendar figure with the begin and end dates marked] This storm passes one weekly
boundary using the continuous method.


A different way to consider the number of weeks between two dates is to count weeks based on a
continuous count from the start date. The conclusion of the first and only complete week for this
storm is between July 27 and 28. This method for counting intervals is called continuous. If you
would like to use the continuous method for interval calculations, it must be specified with the letter
C (in quotation marks) as the fourth argument.
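
The two counting methods can be compared side by side in a minimal sketch. The year 2017 is an assumption (the notes give only the month and day); it was chosen because July 22 and July 29 fall on Saturdays in that year. The variable names are illustrative.

```sas
data _null_;
    /* Assumed year: 2017, when July 22 and July 29 are Saturdays */
    StartDate='21JUL2017'd;
    EndDate='31JUL2017'd;
    /* Default discrete method: counts Saturday-to-Sunday boundaries crossed */
    WeeksDiscrete=intck('week', StartDate, EndDate);
    /* Continuous method: counts complete 7-day spans measured from StartDate */
    WeeksContinuous=intck('week', StartDate, EndDate, 'c');
    putlog WeeksDiscrete= WeeksContinuous=;
run;
```

Under these assumptions the log shows WeeksDiscrete=2 and WeeksContinuous=1, matching the two calendar counts described above.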

3.04 Activity
Open p203a04.sas from the activities folder and perform the following tasks:
1. Notice that the INTCK function does not include the optional method
argument, so the default discrete method is used to calculate the
number of weekly boundaries (ending each Saturday) between StartDate
and EndDate.
2. Run the program and examine rows 8 and 9. Both storms were two days,
but why are the values assigned to Weeks different?
3. Add 'c' as the fourth argument in the INTCK function to use the
continuous method. Run the program. Are the values for Weeks in rows
8 and 9 different?


3.05 Question
What value would be assigned to Months2Pay for each expression?

ServiceDate   PayDate     Months2Pay
10JUL2018     05SEP2018   ?

Months2Pay=intck('month', ServiceDate, PayDate);

Months2Pay=intck('month', ServiceDate, PayDate, 'c');


Shifting Date Values

Customer ID   SalesDate   BillingDate
12808         10JUL2018   01AUG2018
59601         17JUL2018   01AUG2018
42616         02AUG2018   01SEP2018

Suppose you want to shift dates to the first day of the following month.


Shifting Date Values

INTNX('interval', start, increment <, 'alignment'>)

'interval'    interval that you want to shift
start         SAS date column
increment     number of intervals to shift
'alignment'   position of SAS dates in the interval

The INTNX function shifts dates or times based on an interval.


You can use the INTNX function for adjusting or shifting date values. This function enables you to
select an interval (such as week, month, year, or many others) as the first argument, and name a
SAS date column as the second argument. The third argument is the increment number, which
represents the number of intervals to shift the value of the start date. The optional fourth argument
controls the position of SAS dates within the interval.
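
As a rough illustration of the billing scenario, the sketch below shifts each sales date to the first day of the following month. The billing table and its inline rows are assumptions modeled on the slide, not a course data set.

```sas
data billing;
    /* Hypothetical rows modeled on the slide */
    input CustomerID SalesDate :date9.;
    /* Increment 1 moves to the next month; the default alignment is the beginning */
    BillingDate=intnx('month', SalesDate, 1);
    format SalesDate BillingDate date9.;
    datalines;
12808 10JUL2018
59601 17JUL2018
42616 02AUG2018
;
run;

proc print data=billing noobs;
run;
```

Because no alignment is given, each BillingDate lands on the first day of the next month (01AUG2018, 01AUG2018, and 01SEP2018).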


Shifting Date Values

Scenario
Use the INTNX function to create a new date value from an existing SAS date column.

Files
• p203d02.sas
• storm_damage – a SAS table that contains a description and damage estimates (adjusted for
inflation) for storms in the US with damages greater than one billion dollars

Syntax

INTNX('interval', start, increment <, 'alignment'>)

Notes
• The INTNX function shifts a date, time, or datetime value by a given time interval, and returns a
date, time, or datetime value.
• The first argument defines an interval, such as week, month, year, or weekday. See SAS Help for
other possible intervals.
• The second argument is the starting date, time, or datetime value.
• The third argument is the increment number, which represents the number of intervals to shift the
value of the start date. The increment number can be zero or a positive or negative integer.
• The optional fourth argument controls the position of SAS dates within the interval. Possible
values for alignment include BEGINNING (B), MIDDLE (M), END (E), or SAME (S).
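
The notes above can be tried on a single value before running the demo. This is only a sketch: the date 15AUG2017 and the variable names are assumptions chosen for illustration.

```sas
data _null_;
    Date='15AUG2017'd;    /* hypothetical storm date */
    SameMonthStart=intnx('month', Date, 0);          /* first day of the same month */
    TwoMonthsOut=intnx('month', Date, 2);            /* first day, two months later */
    PriorMonthEnd=intnx('month', Date, -1, 'end');   /* last day of the prior month */
    TenYears=intnx('year', Date, 10, 'same');        /* same day, ten years later */
    putlog SameMonthStart= date9. TwoMonthsOut= date9.;
    putlog PriorMonthEnd= date9. TenYears= date9.;
run;
```

For this date the log shows 01AUG2017, 01OCT2017, 31JUL2017, and 15AUG2027, respectively.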

Demo
1. Open the p203d02.sas program in the demos folder and find the Demo section. Notice that the
AssessmentDate column is created by using the INTNX function to shift each Date value.
Highlight the DATA step and run the selected code. Notice that each Date value has been shifted
to the first day of the same month.
2. To see the impact of the various arguments in the INTNX function, modify the arguments as
directed. Highlight the DATA step, run the selected code, and examine the results after each
modification.
a. Change the increment value to 2.
AssessmentDate=intnx('month', Date, 2);
b. Change the increment value to -1. Add 'end' as the optional fourth argument to specify
alignment.
AssessmentDate=intnx('month', Date, -1, 'end');
c. Change the alignment argument to 'middle'.
AssessmentDate=intnx('month', Date, -1, 'middle');


3. Write an assignment statement to create a new column named Anniversary that is the date of
the 10-year anniversary for each storm. Add 'same' as the optional fourth argument to specify
alignment. Keep the new column in the output table and use the DATE9. format to display the
values.
data storm_damage2;
    set pg2.storm_damage;
    keep Event Date AssessmentDate Anniversary;
    AssessmentDate=intnx('month', Date, -1, 'middle');
    Anniversary=intnx('year', Date, 10, 'same');
    format Date AssessmentDate Anniversary date9.;
run;


Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
1. Using the LARGEST and ROUND Functions
The pg2.np_lodging table contains statistics for lodging from 2010 through 2017. Each column
name starts with CL followed by the year. (For example, CL2010 contains the number of nights
stayed in 2010 for that park.)
a. Open the p203p01.sas program from the practices folder. Highlight the PROC PRINT step
and run the selected code. Examine the column names and the 10 rows printed from the
np_lodging table.
b. Use the LARGEST function to create three new columns (Stay1, Stay2, and Stay3) whose
values are the first, second, and third highest number of nights stayed from 2010 through
2017.
Note: Use column list abbreviations to avoid typing each column name.
c. Use the MEAN function to create a column named StayAvg that is the average number of
nights stayed for the years 2010 through 2017. Use the ROUND function to round values to
the nearest integer.
d. Add a subsetting IF statement to output only rows with StayAvg greater than zero. Highlight
the DATA step and run the selected code.

Level 2
2. Working with Date/Time Values
The pg2.np_hourlyrain table contains hourly rain amounts for the Panther Junction, TX, station
located in Big Bend National Park. The DateTime column contains date/time values.
a. Open the p203p02.sas program from the practices folder. Run the program and notice that
each row includes a datetime value and rain amount. The MonthlyRainTotal column
represents a cumulative total of Rain for each value of Month.
b. Uncomment the subsetting IF statement to continue processing a row only if it is the last row
within each month. After the subsetting IF statement, create the following new columns:
1) Date – the date portion of the DateTime column
2) MonthEnd – the last day of the month


c. Format Date and MonthEnd as a date value and keep only the StationName,
MonthlyRainTotal, Date, and MonthEnd columns.

Challenge
3. Creating Projected Date Values
The pg2.np_weather table contains weather-related statistics for locations in four national
parks. Determine the number of weeks between the first and last snowfall in each park for the
2015-2016 winter season.
a. Open the p203p03.sas program from the practices folder. The program contains a PROC
SORT step that creates the winter2015_2016 table. This table contains rows with dates with
some snowfall between October 1, 2015, and June 1, 2016, sorted by Code and Date. Only
the Name, Code, Date, and Snow columns are kept.
b. Modify the DATA step to create the snowforecast table based on the following
specifications:
1) Process the data in groups by Code.
2) For the first row within each Code group, create a new column named FirstSnow that is
the date of the first snowfall for that code.
3) For the last row within each Code group, do the following:
a) Create a new column named LastSnow that is the date of the last snowfall for that
code.
b) Create a new column named WinterLengthWeeks that counts the number of full
weeks between the FirstSnow and LastSnow dates.
c) Create a new column named ProjectedFirstSnow that is the same day of the first
snowfall for the next year.
d) Output the row to the new table.
Note: Be sure to retain the values of FirstSnow in the PDV so that they will be included
with the rows that are in the output table.
4) Apply the DATE7. format to the FirstSnow, LastSnow, and ProjectedFirstSnow
columns and drop the Date and Snow columns.


3.3 Using Character Functions

Character Functions

UPCASE(char)                         correct inconsistent case
PROPCASE(char <, delimiters>)        correct inconsistent case
SUBSTR(char, position <, length>)    extract characters from a string


When character strings need to be validated, cleaned, or modified, functions are a necessity. You
might already be familiar with several common character functions, such as UPCASE or
PROPCASE to correct inconsistent character case, or SUBSTR to extract characters from a
character string. There are many more character functions available in SAS, and in this section, you
learn about functions that remove, replace, find, or concatenate character strings.


Removing Characters from a String

Use character functions to clean inconsistencies in the Station and Location columns.


Let’s look at precipitation data from Japan as an example. Station codes have been recorded
inconsistently. Some have hyphens and others have spaces. Station codes should be standardized
with no spaces or symbols.

Removing Characters from a String


Function                           What it does
COMPBL(string)                     Returns a character value with all multiple blanks in the
                                   string converted to single blanks
COMPRESS(string <, characters>)    Returns a character value with specified characters
                                   removed from the string
STRIP(string)                      Returns a character value with leading and trailing blanks
                                   removed from the string

These functions remove unnecessary characters from a string.
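
The three functions can be compared in a small sketch. The Location and Station values below are assumptions made up for illustration; they are not rows from the course table.

```sas
data _null_;
    /* Hypothetical values with inconsistent spacing and symbols */
    Location='TOKYO,    JAPAN';
    Station='JP-47 662';
    NewLocation=compbl(Location);          /* runs of blanks become a single blank */
    NoBlanks=compress(Station);            /* no second argument: blanks are removed */
    NewStation=compress(Station, ' -');    /* blanks and hyphens are removed */
    Trimmed=strip('   Kyoto   ');          /* leading and trailing blanks are removed */
    putlog NewLocation= NoBlanks= NewStation= Trimmed=;
run;
```

Note that the characters to remove are listed together inside one set of quotation marks in the second argument of COMPRESS.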


3.06 Activity
Open p203a06.sas from the activities folder and perform the following tasks:
1. Complete the NewLocation assignment statement to use the COMPBL
function to read Location and convert each occurrence of two or more
consecutive blanks into a single blank.
2. Complete the NewStation assignment to use the COMPRESS function
with Station as the only argument. Run the program. Which characters
are removed in the NewStation column?
3. Add a second argument in the COMPRESS function to remove both the
space and hyphen. Both characters should be enclosed in a single set of
quotation marks. Run the program.


Extracting Words from a String


Use functions to create the City and Prefecture columns from Location.


Extracting Words from a String

SCAN(string, n <, 'delimiters'>)

string         character column
n              word to extract
'delimiters'   characters that separate words

The SCAN function is an easy way to extract words from a string.


The SCAN function extracts a particular word in sequence from a string. You provide a character
column as the first argument, and the word number that you want to extract as the second argument.
The optional third argument enables you to control the characters that are treated as delimiters,
indicating when words begin and end.
Note: The SCAN function includes an additional optional argument to provide character modifiers.
See the SCAN function documentation for details about available modifiers.


Extracting Words from a String

City=scan(Location,1);

The new column City is assigned the same length as Location.

The SCAN function treats the following characters as delimiters:
blank ! $ % & ( ) * + , - . / ; < ^ |


To extract City from Location, use the SCAN function to read Location and extract the first word.
Because no third argument is specified, SAS uses the default delimiter list, so each of these
characters indicates the end of one word and the beginning of the next word.


Using Character Functions to Extract Words from a String

Scenario
Use the SCAN function to extract words from a character column. Use the PROPCASE function to
standardize casing.

Files
• p203d03.sas
• weather_japan – a SAS table with total precipitation amounts (in millimeters) for 2017 for several
Japanese cities

Syntax

SCAN(string, n <, 'delimiters'>)

PROPCASE(string <, 'delimiters'>)

Notes
SCAN function
• The SCAN function returns the nth word in a string.
• If n is negative, the SCAN function begins reading from the right side of the string.
• The default delimiters are as follows: blank ! $ % & ( ) * + , - . / ; < ^ |
• The optional third argument enables you to specify a delimiter list. All delimiter characters are
enclosed in a single set of quotation marks.
PROPCASE function
• The PROPCASE function converts all uppercase letters to lowercase letters. It then converts to
uppercase the first character of each word.
• The default delimiters are as follows: blank / - ( . tab
• The optional second argument enables you to specify a delimiter list. All delimiter characters are
enclosed in a single set of quotation marks.
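
The notes for both functions can be tried on one value. This is a sketch only: the Location string is an assumed value shaped like the demo data, and the variable names are illustrative.

```sas
data _null_;
    length City CityProper Country $ 30;
    Location='MIYAKE-JIMA, TOKYO, JAPAN';    /* hypothetical value */
    FirstWord=scan(Location, 1);             /* default delimiters split at the hyphen */
    City=scan(Location, 1, ',');             /* the comma is the only delimiter */
    Country=scan(Location, -1);              /* a negative n counts from the right */
    CityProper=propcase(City, ' ');          /* only a blank starts a new word */
    putlog FirstWord= City= Country= CityProper=;
run;
```

With this value the log shows FirstWord=MIYAKE, City=MIYAKE-JIMA, Country=JAPAN, and CityProper=Miyake-jima.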

Demo
1. Open the p203d03.sas program in the demos folder and find the Demo section. Notice that the
DATA step creates the City and Prefecture columns by extracting the first or second word from
Location. Highlight the step and run the selected code.
2. Examine row 8 in the output data. Notice that the city name should be MIYAKE-JIMA. However,
the hyphen is a default delimiter, so MIYAKE is assigned to City and JIMA is assigned to
Prefecture.


3. In both SCAN functions, add a third argument to specify that the only delimiter is a comma.
Highlight the step and run the selected code.
data weather_japan_clean;
    set pg2.weather_japan;
    Location=compbl(Location);
    City=scan(Location, 1, ',');
    Prefecture=scan(Location, 2, ',');
run;
4. Add an additional assignment statement to create a column named Country that reads the last
word in Location.
5. Use the PROPCASE function in the City assignment statement to capitalize the first letter of
each word and convert the remaining letters to lowercase. Highlight the step and run the
selected code.
data weather_japan_clean;
    set pg2.weather_japan;
    Location=compbl(Location);
    City=propcase(scan(Location, 1, ','));
    Prefecture=scan(Location, 2, ',');
    Country=scan(Location, -1);
run;
6. Examine row 8 again in the output data. Because the hyphen is a delimiter, both Miyake and
Jima are capitalized. The proper casing for this city name should be Miyake-jima. Use the
optional second argument to specify that the only delimiter should be a space. Highlight the step
and run the selected code.
data weather_japan_clean;
    set pg2.weather_japan;
    Location=compbl(Location);
    City=propcase(scan(Location, 1, ','), ' ');
    Prefecture=scan(Location, 2, ',');
    Country=scan(Location, -1);
run;


3.07 Activity
Open p203a07.sas from the activities folder and perform the following tasks:
1. Notice the subsetting IF statement that writes rows to output only if
Prefecture is Tokyo. Run the program and notice that the output table
does not include any rows.
2. Either use the DATA step debugger in Enterprise Guide or uncomment
the PUTLOG statement to view the values of Prefecture as the step
executes. Why is the subsetting IF condition always false?
3. Modify the program to correct the logic error. Run the program and
confirm that four rows are returned.


Searching for Character Strings

FIND(string, substring <, 'modifiers'>)

string        character column
substring     substring to find
'modifiers'   I = case-insensitive search
              T = trim trailing blanks from string and substring

The FIND function returns a number that represents the first character position where
substring is found in string.


The FIND function searches for a particular substring within character values. The first argument
typically is the character column, and the second argument specifies the substring to find. The
optional third argument can be used to provide character modifiers. I makes the search case
insensitive, and T trims trailing blanks from the string and the substring. The FIND function returns
a number that indicates the start position of the substring within the string. If the substring is not
found, the FIND function returns a zero.


Here are two similar functions:

FINDC Searches a string for any character in a list of characters.

FINDW Returns the character position of a word in a string, or returns the number of the
word in a string.

Searching for Character Strings

AirportLoc=find(Station,'Airport');

Station AirportLoc
Raleigh Durham Airport 16
Airport Road 1
Cary Parkway 0


The FIND function does a case-sensitive search in the values of Station for the substring Airport.
The numeric value returned is assigned to a new column, AirportLoc.
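
The slide example can be reproduced with inline data. In this sketch the stations table and the AirportLocI column are assumptions added to contrast the case-sensitive search with the I modifier.

```sas
data stations;
    /* Hypothetical rows modeled on the slide */
    input Station $30.;
    AirportLoc=find(Station, 'Airport');          /* case-sensitive search */
    AirportLocI=find(Station, 'airport', 'i');    /* case-insensitive search */
    datalines;
Raleigh Durham Airport
Airport Road
Cary Parkway
;
run;

proc print data=stations noobs;
run;
```

Both columns contain 16, 1, and 0 for the three rows. Without the I modifier, the lowercase substring in the second assignment would not match, and AirportLocI would be 0 in every row.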


3.08 Activity
Open p203a08.sas from the activities folder and perform the following tasks:
1. Notice that the assignment statement for CategoryLoc uses the FIND
function to search for category within each value of the Summary
column. Run the program.
2. Examine the PROC PRINT report. Why is CategoryLoc equal to 0 in row 1?
Why is CategoryLoc equal to 0 in row 15?
3. Modify the FIND function to make the search case insensitive.
Uncomment the IF-THEN statement to create a new column named
Category. Run the program and examine the results.


Identifying Character Positions


Function            What it does
LENGTH(string)      Returns the length of a non-blank character string, excluding trailing
                    blanks; returns 1 for a completely blank string
ANYDIGIT(string)    Returns the first position at which a digit is found in the string
ANYALPHA(string)    Returns the first position at which an alpha character is found in the
                    string
ANYPUNCT(string)    Returns the first position at which a punctuation character is found in
                    the string

There are many other functions to identify the location of selected characters.

Similar to the FIND function, there are many other functions that return a numeric value that
identifies the location of selected characters. The LENGTH function returns a number that is the
position of the last non-blank character in a string. There are several ANY functions that return the
first position of a particular type of character, such as a digit, alpha or punctuation character.
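
A quick way to see what these functions return is to apply them to one value. The Summary string below is an assumption made up for this sketch.

```sas
data _null_;
    Summary='Category 3 hurricane   ';    /* hypothetical value with trailing blanks */
    Len=length(Summary);                  /* position of the last non-blank character */
    FirstDigit=anydigit(Summary);         /* position of the first digit */
    FirstAlpha=anyalpha(Summary);         /* position of the first letter */
    FirstPunct=anypunct(Summary);         /* 0 because no punctuation is present */
    putlog Len= FirstDigit= FirstAlpha= FirstPunct=;
run;
```

For this value the log shows Len=20, FirstDigit=10, FirstAlpha=1, and FirstPunct=0.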


Here are some similar functions:

LENGTHC    Returns the length of a character string, including trailing blanks.

LENGTHN    Returns the length of a character string, excluding trailing blanks; returns 0
           for a completely blank string.

ANYLOWER   Searches a character string for a lowercase (or uppercase) letter, and
ANYUPPER   returns the first position at which the letter is found.

ANYSPACE   Searches a character string for a whitespace character (blank, horizontal
           and vertical tab, carriage return, line feed, and form feed), and returns the
           first position at which that character is found.

Replacing Character Strings

TRANWRD(source, target, replacement)

source        character column
target        string to find
replacement   replacement string

Summary2=tranwrd(Summary, 'hurricane', 'storm');

Summary                                        Summary2
Category 3 hurricane initially...              Category 3 storm initially...
The largest (in size) Atlantic hurricane on    The largest (in size) Atlantic storm on
record...                                      record...

The TRANWRD function performs a find and replace. The first argument is usually a character
column. The second argument is the target, or the string that you want to find. The third
argument is the string that replaces the target. This assignment statement uses the TRANWRD
function to replace all instances of lowercase hurricane with storm in the Summary column.
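
Because the search is case sensitive, only exact matches of the target are replaced. The sketch below shows this with an assumed Summary value that contains both a capitalized and a lowercase occurrence.

```sas
data _null_;
    length Summary Summary2 $ 60;
    /* Hypothetical text with one capitalized and one lowercase occurrence */
    Summary='Hurricane Ana: Category 3 hurricane initially';
    Summary2=tranwrd(Summary, 'hurricane', 'storm');
    putlog Summary2=;
run;
```

The log shows Summary2=Hurricane Ana: Category 3 storm initially, so the capitalized word is left unchanged.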
Here are two similar functions:

TRANSLATE Replaces specific characters in a character string.

TRANSTRN Replaces or removes all occurrences of a substring in a character string.


Building Character Strings


Function                         What it does
CAT(string1, ... stringn)        Concatenates strings together, does not remove leading or
                                 trailing blanks
CATS(string1, ... stringn)       Concatenates strings together, removes leading and trailing
                                 blanks from each string
CATX('delimiter', string1,       Concatenates strings together, removes leading and trailing
     ... stringn)                blanks from each string, and inserts the delimiter between
                                 each string

You can use concatenation functions to combine text strings or numbers into a single string.


You can use concatenation functions to combine strings into a single character value. The
arguments can be either character or numeric values. The CAT function combines the strings as is
without removing leading or trailing blanks. CATS combines strings and removes leading and
trailing blanks. CATX combines strings, removes leading and trailing blanks, and inserts a delimiter
that you specify between each string.
Character strings can also be concatenated using the concatenation operator (||). The concatenation
operator does not trim leading or trailing blanks. Use the STRIP function to remove leading and
trailing blanks, or the TRIM function to remove trailing blanks, from values before concatenating
them.
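
The differences are easiest to see on a value that carries trailing blanks. In this sketch the Name, Season, and Day values and the ID1 through ID4 names are assumptions for illustration only.

```sas
data _null_;
    length ID1 ID2 ID3 ID4 $ 40;
    Name='Ana   ';    /* hypothetical value with trailing blanks */
    Season=2017;      /* numeric arguments are accepted by the CAT functions */
    Day=5;
    ID1=cat(Name, Season, Day);          /* keeps the trailing blanks of Name */
    ID2=cats(Name, Season, Day);         /* Ana20175 */
    ID3=catx('-', Name, Season, Day);    /* Ana-2017-5 */
    /* The operator does not trim, so STRIP and PUT prepare the pieces first */
    ID4=strip(Name) || '-' || put(Season, 4.);    /* Ana-2017 */
    putlog ID1= / ID2= / ID3= / ID4=;
run;
```

The PUT function in the last assignment converts the numeric Season explicitly, which avoids the automatic conversion that the concatenation operator would otherwise perform.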

3.09 Activity
Open p203a09.sas from the activities folder and perform the following tasks:
1. Examine the assignment statements that use the CAT and CATS functions
to create StormID1 and StormID2. Run the program. How do the two
columns differ?
2. Add an assignment statement to create StormID3 that uses the CATX
function to concatenate Name, Season, and Day with a hyphen inserted
between each value. Run the program.
3. Modify the StormID2 assignment statement to insert a hyphen only
between Name and Season.


Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
4. Using the SCAN and PROPCASE Functions
The pg2.np_monthlytraffic table contains monthly traffic statistics for national parks. However,
the data has some inconsistencies. There is no column containing park type, and the gate
location does not use proper case.
a. Open the p203p04.sas program from the practices folder. Run the program and examine
the data. Notice that ParkName includes a code at the end of each value that represents the
park type. Also notice that some of the values for Location are in uppercase.
b. Add a LENGTH statement to create a new five-character column named Type.
c. Add an assignment statement that uses the SCAN function to extract the last word from the
ParkName column and assigns the resulting value to Type.
d. Add an assignment statement to use the UPCASE and COMPRESS functions to change the
case of Region and remove any blanks.
e. Add an assignment statement to use the PROPCASE function to change the case of
Location.

Level 2
5. Searching for Character Strings
a. Open the p203p05.sas program from the practices folder. Notice that the DATA step creates
a table named parks and reads only those rows where ParkName ends with NP.
b. Modify the DATA step to create or modify the following columns:
1) Use the SUBSTR function to create a new column named Park that reads each
ParkName value and excludes the NP code at the end of the string.
Note: Use the FIND function to identify the position number of the NP string. That value
can be used as the third argument of the SUBSTR function to specify how many
characters to read.
2) Convert the Location column to proper case. Use the COMPBL function to remove any
extra blanks between words.
3) Use the TRANWRD function to create a new column named Gate that reads Location
and converts the string Traffic Count At to a blank.


4) Create a new column named GateCode that concatenates ParkCode and Gate together
with a single hyphen between the strings.

Challenge
6. Determining the Maximum Length of a Column
The pg2.np_unstructured_codes table contains a single column whose contents include
location codes and names. Create a table that efficiently stores the location code and location
name.
a. Open the p203p06.sas program from the practices folder. Run the program and examine
the output report. Notice that the Column1 column contains raw data with values separated
by various symbols. The SCAN function is used to extract the ParkCode and ParkName
values.
b. Examine the PROC CONTENTS report. Notice that ParkCode and ParkName have a length
of 200, which is the same as Column1.
Note: When the SCAN function creates a new column, the new column has the same
length as the column listed as the first argument.
c. The ParkCode column should include only the first four characters in the string. Add a
LENGTH statement to define the length of ParkCode as 4.
d. The length for the ParkName column can be optimized by determining the longest string and
setting an appropriate length. Modify the DATA step to create a new column named
NameLength that uses the LENGTH function to return the position of the last non-blank
character for each value of ParkName.
e. Use a RETAIN statement to create a new column named MaxLength that has an initial value
of zero.
f. Use an assignment statement and the MAX function to set the value of MaxLength to either
the current value of NameLength or MaxLength, whichever is larger.
g. Use the END= option in the SET statement to create a temporary variable in the PDV named
LastRow. LastRow is zero for all rows until the last row of the table, when it will be 1. Add
an IF-THEN statement to write the value of MaxLength to the log if the value of LastRow
is 1.
data parklookup;
    set pg2.np_unstructured_codes end=LastRow;
    ...
    if LastRow=1 then putlog MaxLength=;
run;


h. Highlight the DATA step and run the selected code. Examine the data to confirm that the
MaxLength column sequentially stores the maximum value for NameLength. View the log
to determine the last value of MaxLength.
i. Modify the LENGTH statement to set the length of ParkName to the maximum length. Run
the program and confirm in the PROC CONTENTS report that the lengths of the new
columns are optimized.
Note: The statements added to determine the maximum length can be deleted or
commented.


3.4 Using Special Functions to Convert Column Type

Handling Column Type

character function
arithmetic calculation

What happens when you try to perform an action on a column that isn't the proper type?


SAS offers some functions that have specialized purposes. Consider a very common issue: dealing
with column type. Each column in SAS is either character or numeric, and there are certain actions
that depend on data either being specifically character or numeric. For example, arithmetic
calculations require numeric values. And character functions such as SUBSTR or SCAN require
character values. When you try to perform an action on a column that is not the proper type, it can
cause syntax errors or incorrect results.


3.10 Activity
Open p203a10.sas from the activities folder and perform the following tasks:
1. Highlight the PROC CONTENTS step and run the selected code. What is
the type of High, Low, and Volume?
2. Highlight the DATA and PROC PRINT steps and run the selected code.
Notice that although High is a character column, the Range column is
accurately calculated.
3. Open the log. Read the note printed immediately after the DATA step.
4. Uncomment the DailyVol assignment statement and run the program. Is
DailyVol created successfully?


Automatic Conversion of Column Type

Range = High-Low;
Automatic conversion is successful because High contains standard numeric values.

DailyVol = Volume/30;
Automatic conversion fails because Volume contains nonstandard numeric values.

If an arithmetic calculation is performed on a character value, SAS automatically attempts to convert
the values for the purpose of evaluating the expression. Sometimes the automatic conversion is
successful, as it is with the High column. Because High includes only standard numeric values (just
digits and decimal points), SAS is able to interpret the character strings as numbers. Sometimes the
automatic conversion does not work, like with the Volume column. The values of Volume include
commas, resulting in missing values when automatic conversion is attempted. Similarly, if you use a
numeric column in a character expression, SAS attempts to convert the values. Sometimes it works
just fine, and other times you can get unexpected results.
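
As a minimal sketch of both outcomes, the following step uses made-up values. The table name work.autoconvert and the literal values are assumptions for illustration and are not course data. High holds a standard numeric string, so the subtraction works. Volume contains commas, so the division yields a missing value and an invalid-data note in the log.

```sas
data work.autoconvert;
    length High $ 6 Volume $ 12;
    High='12.50';          /* character, but a standard numeric string */
    Low=11.25;             /* numeric */
    Volume='5,976,252';    /* character; the commas make it nonstandard */
    Range=High-Low;        /* automatic conversion succeeds: 1.25 */
    DailyVol=Volume/30;    /* automatic conversion fails: missing value */
run;
```

After running the step, the log should contain the note that character values were converted to numeric values, followed by an invalid numeric data note for Volume.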


Conversion Functions

Function                   What it does
INPUT(source, informat)    Converts character values to numeric values using a specified informat
PUT(source, format)        Converts numeric or character values to character values using a specified format

To explicitly and accurately convert data from one type to another, you can use special functions.
• The INPUT function can be used to convert a character value to a numeric value. An informat is
used to indicate how the character string should be read.
• The PUT function can be used for converting numeric values to character values using a format to
indicate how the values should be written.
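
The two functions can be seen side by side in a short sketch. The column names and the value below are hypothetical. INPUT reads the character string with the COMMA12. informat, and PUT writes the resulting number back out with the DOLLAR10.2 format.

```sas
data _null_;
    CharAmount='1,234.50';                   /* character value */
    NumAmount=input(CharAmount,comma12.);    /* numeric 1234.5 */
    CharAgain=put(NumAmount,dollar10.2);     /* character $1,234.50, right-aligned in 10 columns */
    putlog NumAmount= CharAgain=;
run;
```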


Converting Character Values to Numeric Values

character to numeric

Date2=input(Date,date9.);

source: Date
informat: date9.
Program: p203d04

The Date column has been defined as character. Dates are best stored as numeric values that
represent the number of days since January 1, 1960. The INPUT function is used to read the
character value from Date and convert it to a numeric value. The INPUT function has two
arguments. The first argument is the source, which is the column that needs to be converted. The second
argument is the informat. The informat tells SAS how the character value is written so that it can be read correctly.
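
A small sketch of this conversion, assuming a single hard-coded value rather than a course table:

```sas
data _null_;
    Date='15OCT2018';             /* character value */
    Date2=input(Date,date9.);     /* numeric SAS date: 21472 */
    putlog Date2=;                /* unformatted number of days since January 1, 1960 */
    putlog Date2= date9.;         /* the same value displayed with a date format */
run;
```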

Informats for Converting Character to Numeric

Character      Informat     Numeric
15OCT2018      DATE9.       21472
10/15/2018     MMDDYY10.    21472
15/10/2018     DDMMYY10.    21472
123,456.78     COMMA12.     123456.78
$123,456.78    DOLLAR12.    123456.78
123456         6.           123456

The informat specifies how the character value looks so that it can be converted to a
numeric value.


Informats for Converting Character to Numeric

Character           Informat      SAS Date
15OCT2018           ANYDTDTEw.    21472
10/15/2018          ANYDTDTEw.    21472
10152018            ANYDTDTEw.    21472
20181015            ANYDTDTEw.    21472
Oct 15, 2018        ANYDTDTEw.    21472
October 15, 2018    ANYDTDTEw.    21472

The multipurpose ANYDTDTE informat can read dates written in many ways.

Informats for Converting Character to Numeric

Character values read with ANYDTDTEw.:
06JAN2018
06/01/2018
Jan 6, 2018
06 January 2018

The value 06/01/2018 is ambiguous:
OPTIONS DATESTYLE=MDY;    June 01, 2018
OPTIONS DATESTYLE=DMY;    January 06, 2018

If a date is ambiguous, such as 6/1/2018 (which can be interpreted as either June 1 or January 6),
SAS uses the DATESTYLE= system option to determine the sequence. The default value for the
DATESTYLE= option is determined by the value of the LOCALE= system option. For example, if
LOCALE= is set to English (United States), the DATESTYLE= order is MDY, so 6/1 would be read as June 1.
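
The effect of DATESTYLE= can be checked with a short sketch. The informat width of 10 and the hard-coded value are choices made for this illustration. The final OPTIONS statement returns DATESTYLE= to the locale-based default.

```sas
options datestyle=mdy;
data _null_;
    d=input('06/01/2018',anydtdte10.);
    putlog d= date9.;    /* 01JUN2018 */
run;

options datestyle=dmy;
data _null_;
    d=input('06/01/2018',anydtdte10.);
    putlog d= date9.;    /* 06JAN2018 */
run;

options datestyle=locale;    /* restore the locale-based default */
```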


Informats for Converting Character to Numeric


data work.stocks2;
set pg2.stocks2;
NewVolume1=input(Volume,comma12.);
NewVolume2=input(Volume,comma12.2);
keep volume newvolume:;
run;

Be careful not to specify a decimal value with the informat unless you want to insert a new
decimal point.
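
To see why, consider this sketch with one hard-coded value (an assumption for illustration). When the source string has no decimal point and the informat specifies d decimal places, SAS divides the value by 10 to the power d.

```sas
data _null_;
    Volume='5,976,252';
    NewVolume1=input(Volume,comma12.);     /* 5976252 */
    NewVolume2=input(Volume,comma12.2);    /* 59762.52: divided by 10**2 */
    putlog NewVolume1= NewVolume2=;
run;
```

If the source string already contains a decimal point, the decimal specification in the informat has no effect.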

3.11 Activity
Open p203a11.sas from the activities folder and perform the following tasks:
1. Examine and run the program. In the output table, verify that Date2 is
created as numeric. Notice that the table contains a character column
named Volume.
2. Add an assignment statement to create a column named Volume2. Use
the INPUT function to read Volume using the COMMA12. informat. Run
the program and verify that Volume2 is created as a numeric column.
3. In the assignment statement, change Volume2 to Volume so that you
update the value of the existing column.
4. Run the program and notice that Volume is still character. Why is the
assignment statement not changing the column type?


Converting the Type of an Existing Column

data work.stocks2;
set pg2.stocks2;
Date2=input(Date,date9.);
Volume=input(Volume,comma12.);
run;

The SET statement reads Volume into the PDV as a character column.

PDV (after reading pg2.stocks2)
Stock   Date   Open   Close   High   Low   Volume   Date2
$ 12    $ 9    N 8    N 8     $ 6    N 8   $ 12     N 8

When the SET statement reads the pg2.stocks2 table, the PDV is established and Volume is a
character column. The assignment statement cannot change the character column Volume to a
numeric column. This is similar to how a LENGTH statement has no effect after the length of a
column is set in the PDV. To change Volume to a numeric column, you must follow three steps.


Converting the Type of an Existing Column

data work.stocks2;
set pg2.stocks2(rename=(Volume=CharVolume));
Date2=input(Date,date9.);
Volume=input(CharVolume,comma12.);
drop CharVolume;
run;
Program: p203d04

Rename the input column that you want to change.

table (RENAME=(current-col-name=new-col-name))

First, use the RENAME= data set option in the SET statement to rename the input column that you
want to change. The RENAME= data set option follows the table name in the SET statement. Then
you list the current column name on the left of the equal sign and the new column name on the right.
Note: The outer set of parentheses surrounds all data set options applied to the table, and the
inner set of parentheses surrounds the columns being renamed.

Converting the Type of an Existing Column

data work.stocks2;
set pg2.stocks2(rename=(Volume=CharVolume));
Date2=input(Date,date9.);
Volume=input(CharVolume,comma12.);
drop CharVolume;
run;
Program: p203d04

Volume will be read into the PDV as CharVolume.

PDV
Stock   Date   Open   Close   High   Low   CharVolume   Date2   Volume
$ 12    $ 9    N 8    N 8     $ 6    N 8   $ 12         N 8     N 8
                                           D

In this example, Volume is renamed as CharVolume. CharVolume is added to the PDV as a
character column.


Converting the Type of an Existing Column

data work.stocks2;
set pg2.stocks2(rename=(Volume=CharVolume));
Date2=input(Date,date9.);
Volume=input(CharVolume,comma12.);
drop CharVolume;
run;
Program: p203d04

Use the INPUT function on the renamed column to create a column with the original name.

PDV
Stock   Date   Open   Close   High   Low   CharVolume   Date2   Volume
$ 12    $ 9    N 8    N 8     $ 6    N 8   $ 12         N 8     N 8

Next, in an assignment statement, use the INPUT function with the renamed column as the source.
To the left of the equal sign, use the original column name that you want to be numeric. This adds
Volume to the PDV as a numeric column.

Converting the Type of an Existing Column

data work.stocks2;
set pg2.stocks2(rename=(Volume=CharVolume));
Date2=input(Date,date9.);
Volume=input(CharVolume,comma12.);
drop CharVolume;
run;
Program: p203d04

Drop the renamed column.

PDV
Stock   Date   Open   Close   High   Low   CharVolume   Date2   Volume
$ 12    $ 9    N 8    N 8     $ 6    N 8   $ 12         N 8     N 8
                                           D

Finally, use a DROP statement to eliminate the renamed column from the output table.
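
The three steps can be tried on a small stand-in table. The names work.stocks_char and work.stocks_num and the single row of data are hypothetical; they only mimic the character Volume column in pg2.stocks2.

```sas
/* Hypothetical input table with a character Volume column */
data work.stocks_char;
    length Stock $ 12 Volume $ 12;
    Stock='ABC';
    Volume='5,976,252';
run;

data work.stocks_num;
    /* Step 1: rename the character column as it is read */
    set work.stocks_char(rename=(Volume=CharVolume));
    /* Step 2: create a numeric column with the original name */
    Volume=input(CharVolume,comma12.);
    /* Step 3: keep the renamed column out of the output table */
    drop CharVolume;
run;

/* Volume should now be listed as a numeric column */
proc contents data=work.stocks_num;
run;
```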


3.12 Multiple Choice Question


Which statement renames the existing column Product in sashelp.shoes
as Type?

a. set sashelp.shoes rename=(Type=Product);

b. set sashelp.shoes(rename=(Type=Product));

c. set sashelp.shoes(rename(Product=Type));

d. set sashelp.shoes(rename=(Product=Type));


Converting Numeric Values to Character Values

numeric to character

Day=put(Date,downame3.);

source: Date
format: downame3.
Program: p203d04

The INPUT function converts columns from character to numeric. The PUT function can be used to
do the opposite: to convert a numeric value to a character value. The PUT function has two
arguments. The first argument is the source, which is the column that you want to convert from
numeric to character. The second argument is the format. The format tells SAS how to display the
new character value.
Suppose you have a numeric Date column and want to create a Day column that has the first three
letters of the day of the week for the particular date. The PUT function reads the numeric value in
Date and converts it to character using the format DOWNAME3.
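
A brief sketch of this conversion, assuming one hard-coded date instead of a course table:

```sas
data _null_;
    Date='15OCT2018'd;           /* numeric SAS date 21472 */
    Day=put(Date,downame3.);     /* character value Mon */
    putlog Date= Day=;
run;
```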


Formats for Converting Numeric to Character

Numeric      Format        Character
21472        DATE9.        15OCT2018
21472        DOWNAME3.     Mon
21472        YEAR4.        2018
123456.78    COMMA10.2     123,456.78
123456.78    DOLLAR11.2    $123,456.78
123.456      6.2           123.46

The format specifies how the numeric value should look as a character value.


Using the INPUT and PUT Functions to Convert Column Types

Scenario
Convert character columns to numeric columns using the INPUT function and convert numeric
columns to character columns using the PUT function.

Files
• p203d04.sas
• weather_atlanta – a SAS table that contains the daily precipitation for the three airports in Atlanta,
GA, for January 2018

Syntax

DATA output-table;
SET input-table (RENAME=(current-column=new-column));
...
column1 = INPUT(source, informat);
column2 = PUT(source, format);
...
RUN;

Notes
• The INPUT function converts a character value to a numeric value using a specified informat.
• The PUT function converts a numeric or character value to a character value using a specified
format.
• SAS automatically tries to convert character values to numeric values using the w. informat.
• SAS automatically tries to convert numeric values to character values using the BEST12. format.
• If SAS automatically converts the data, a note is displayed in the SAS log. If you explicitly tell SAS
to convert the data with a function, a note is not displayed in the SAS log.
• Some functions such as the CAT functions automatically convert data from numeric to character
and also remove leading blanks on the converted data. No note is displayed in the SAS log.
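
The last three notes can be compared in one sketch. The values and the column names ViaOperator and ViaCats are hypothetical. The concatenation operator triggers automatic conversion with BEST12., which leaves leading blanks and writes a note to the log, whereas CATS converts the number without leading blanks and without a note.

```sas
data _null_;
    length ViaOperator ViaCats $ 20;
    City='Atlanta';
    ZipCode=30320;
    ViaOperator=City||ZipCode;     /* automatic conversion: blanks precede 30320, note in log */
    ViaCats=cats(City,ZipCode);    /* Atlanta30320, no conversion note */
    putlog ViaOperator= ViaCats=;
run;
```
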
Demo
1. Open the pg2.weather_atlanta table and notice the following:
• ZipCode is a numeric column.
• Date and Precip are character columns. A Precip value of T means that a trace value was
recorded, which means a very small amount of precipitation that results in no measurable
accumulation.

2. Open p203d04.sas from the demos folder and find the Demo section of the program. Run the
first DATA step.


3. View the SAS log. SAS attempts to convert the character Precip value to a numeric value using
the w. informat. SAS is successful when the character value is a legitimate numeric value such
as .27. SAS is unsuccessful when the value is equal to a non-numeric value such as T. A value
of T is converted to a missing numeric value.
NOTE: Character values have been converted to numeric
values at the places given by: (Line):(Column).
30:16
NOTE: Invalid numeric data, Precip='T' , at line 30 column 16.
Station=ATLANTA HARTSFIELD INTERNATIONAL AIRPORT, GA US AirportCode=ATL City=Atlanta
ZipCode=30320 Date=01/01/2018 Precip=T TempMax=29 TempMin=18 TotalPrecip=0 _ERROR_=1 _N_=1

4. View the output table. Notice that TotalPrecip was accurately created for each row. The sum
statement ignores the missing values for the Precip values of T.

5. Add IF-THEN/ELSE statements to the DATA step to create a new column named PrecipNum. If
Precip is not equal to T, then use the INPUT function to assign the numeric equivalent of Precip
to the PrecipNum column. Otherwise, assign 0 to PrecipNum. Use PrecipNum in the sum
statement instead of Precip. Drop the Precip column.
data atl_precip;
set pg2.weather_atlanta;
where AirportCode='ATL';
drop AirportCode City Temp: ZipCode Precip;
if Precip ne 'T' then PrecipNum=input(Precip,6.);
else PrecipNum=0;
TotalPrecip+PrecipNum;
run;
6. Run the DATA step. Notice that the SAS log no longer contains a note about character values
being converted to numeric values and no longer contains notes about invalid numeric data for
Precip='T'.
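
The same logic can be tried without the course table. In this sketch, the table name work.precip_demo and the three in-stream values are made up for illustration.

```sas
data work.precip_demo;
    input Precip $;
    /* Convert explicitly, treating a trace (T) as zero */
    if Precip ne 'T' then PrecipNum=input(Precip,6.);
    else PrecipNum=0;
    TotalPrecip+PrecipNum;    /* running total: 0, 0.27, 0.77 */
    datalines;
T
.27
0.5
;
run;
```

Because the conversion is explicit and T never reaches the INPUT function, the log should contain neither a conversion note nor an invalid-data note.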


7. Add to the DATA step to create a numeric column Date from the character column Date. Also,
format the numeric Date and drop the character Date.
data atl_precip;
set pg2.weather_atlanta(rename=(Date=CharDate));
where AirportCode='ATL';
drop AirportCode City Temp: ZipCode Precip CharDate;
if Precip ne 'T' then PrecipNum=input(Precip,6.);
else PrecipNum=0;
TotalPrecip+PrecipNum;
Date=input(CharDate,mmddyy10.);
format Date date9.;
run;
8. Run the DATA step. Confirm that you have a numeric precipitation column and a numeric date
column.

9. Run the second DATA step and notice that CityStateZip was accurately created for each row.
The CAT functions automatically convert numeric values to character values and remove leading
blanks in the converted value. SAS does not write a note to the log when values are converted
with the CAT functions.

10. Add to the DATA step to create a character column ZipCodeLast2 that contains the last two
digits of the numeric column ZipCode.
data atl_precip;
set pg2.weather_atlanta;
CityStateZip=catx(' ',City,'GA',ZipCode);
ZipCodeLast2=substr(ZipCode, 4, 2);
run;
11. View the SAS log. SAS converts the numeric ZipCode value to a character value.
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column). 27:24

12. View the output table. Notice that ZipCodeLast2 is not displaying the last two digits of the ZIP
code. When SAS automatically converts a numeric value to a character value, the BEST12.
format is used, and the resulting character value is right-aligned. The numeric value of 30320
becomes the character value of seven leading spaces followed by 30320.


13. Modify the first argument of the SUBSTR function to explicitly convert the numeric ZipCode
value to a character value using the PUT function.
data work.weather_atlanta;
set pg2.weather_atlanta;
CityStateZip=catx(' ',City,'GA',ZipCode);
ZipCodeLast2=substr(put(ZipCode,z5.), 4, 2);
run;
14. View the output table. Notice that ZipCodeLast2 now displays the last two digits of the ZIP code.
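
The difference between steps 10 and 13 can be reproduced with a single hard-coded value. The column names Auto and Explicit are hypothetical.

```sas
data _null_;
    ZipCode=30320;
    /* Automatic conversion uses BEST12., so positions 4-5 are blanks */
    Auto=substr(ZipCode,4,2);
    /* Explicit conversion with Z5. yields 30320 with no leading blanks */
    Explicit=substr(put(ZipCode,z5.),4,2);    /* 20 */
    putlog Auto= Explicit=;
run;
```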


3.13 Activity
Open p203a13.sas from the activities folder and perform the following tasks:
1. Add to the RENAME= option to rename the input column Date as
CharDate.
2. Add an assignment statement to create a numeric column Date from the
character column CharDate. The values of CharDate are stored as
01JAN2018.
3. Modify the DROP statement to eliminate all columns that begin with Char
from the output table.
4. Run the program and verify that Volume and Date are numeric columns.


Beyond SAS Programming 2

What if you want to ...

... learn more about data cleaning techniques?
• Read Cody’s Data Cleaning Techniques Using SAS.
• Take the Data Cleaning Techniques course.

... learn about using Perl regular expressions for more complex data manipulation?
• Take the SAS Programming 3: Advanced Techniques course.
• Read An Introduction to Perl Regular Expressions in SAS 9.

... see examples of functions?
• Access SAS Help for Functions and CALL Routines by Category and view examples.
• Read SAS Functions by Example.

Links
• Read Cody’s Data Cleaning Techniques Using SAS.
• Take the Data Cleaning Techniques course.
• Take the Taking Your SAS Programming Skills to the Next Level course.
• Read An Introduction to Perl Regular Expressions in SAS 9.
• Access SAS Help for Functions and CALL Routines by Category and view examples.
• Read SAS Functions by Example.


3.5 Solutions
Solutions to Practices
1. Using the LARGEST and ROUND Functions
proc print data=pg2.np_lodging(obs=10);
where CL2010>0;
run;

data stays;
set pg2.np_lodging;
Stay1=largest(1, of CL:);
Stay2=largest(2, of CL:);
Stay3=largest(3, of CL:);
StayAvg=round(mean(of CL:));
if StayAvg > 0;
keep Park Stay:;
format Stay: comma11.;
run;
2. Working with Date/Time Values
data rainsummary;
set pg2.np_hourlyrain;
by Month;
if first.Month=1 then MonthlyRainTotal=0;
MonthlyRainTotal+Rain;
if last.Month=1;
Date=datepart(DateTime);
MonthEnd=intnx('month',Date,0,'end');
format Date MonthEnd date9.;
keep StationName MonthlyRainTotal Date MonthEnd;
run;


3. Creating Projected Date Values


proc sort data=pg2.np_weather(keep=Name Code Date Snow)
out=winter2015_2016;
where date between '01Oct15'd and '01Jun16'd and Snow > 0;
by Code Date;
run;

data snowforecast;
set winter2015_2016;
retain FirstSnow;
by Code;
if first.Code then FirstSnow=Date;
if last.Code then do;
LastSnow=Date;
WinterLengthWeeks=intck('week',FirstSnow, LastSnow, 'c');
ProjectedFirstSnow=intnx('year', FirstSnow, 1, 'same');
output;
end;
format FirstSnow LastSnow ProjectedFirstSnow date7.;
drop Snow Date;
run;
4. Using the SCAN and PROPCASE Functions
data clean_traffic;
set pg2.np_monthlytraffic;
drop Year;
length Type $ 5;
Type=scan(ParkName, -1);
Region=upcase(compress(Region));
Location=propcase(Location);
run;
5. Searching for Character Strings
data parks;
set pg2.np_monthlytraffic;
where ParkName like '%NP';
Park=substr(ParkName, 1, find(ParkName,'NP') -2);
Location=compbl(propcase(Location));
Gate=tranwrd(Location, 'Traffic Count At ', ' ');
GateCode=catx('-', ParkCode, Gate);
run;

proc print data=parks;
var Park GateCode Month Count;
run;


6. Determining the Maximum Length of a Column


data parklookup;
set pg2.np_unstructured_codes end=lastrow;
length ParkCode $ 4 ParkName $ 83;
ParkCode=scan(Column1, 2, '{}:,"()-');
ParkName=scan(Column1, 4, '{}:,"()');
retain MaxLength 0;
NameLength=length(ParkName);
MaxLength=max(NameLength,MaxLength);
if lastrow=1 then putlog MaxLength=;
run;

proc print data=parklookup(obs=10);
run;

proc contents data=parklookup;
run;


Solutions to Activities and Questions


3.01 Activity – Correct Answer
1. Run the program. Why does the DATA step fail?
A function returns a value that must be used in an assignment statement
or expression.

Correct the error by overwriting the value of the column Name in uppercase.

data quiz_summary;
set pg2.class_quiz;
Name=upcase(Name);
...
run;


3.01 Activity – Correct Answer

3. In the expression for Mean2, delete the keyword OF and run the
program. What do the values in Mean2 represent?

Mean2=mean(Quiz1-Quiz5);

Without the OF keyword, Mean2 is the mean of the difference between Quiz1 and Quiz5.
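
A sketch with made-up quiz scores shows the two results side by side. The values are assumptions for illustration.

```sas
data _null_;
    Quiz1=10; Quiz2=8; Quiz3=9; Quiz4=7; Quiz5=6;
    Mean1=mean(of Quiz1-Quiz5);    /* column list: (10+8+9+7+6)/5 = 8 */
    Mean2=mean(Quiz1-Quiz5);       /* subtraction: mean of the single value 10-6 = 4 */
    putlog Mean1= Mean2=;
run;
```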


3.02 Activity – Correct Answer


Simplify the program by using CALL MISSING to assign missing values for the
two students. Run the program.

data quiz_report;
set pg2.class_quiz;
if Name in("Barbara", "James")
then call missing(of Q:);
run;
Don’t forget to take advantage of column lists!

3.03 Activity – Correct Answer


2. Modify the WindAvg1 expression to use the ROUND function to round
values to the nearest tenth (.1).
WindAvg1=round(mean(of Wind1-Wind4), .1);

3. Add a FORMAT statement to format WindAvg2 with the 5.1 format. Run
the program. What is the difference between using a function and a
format?
format WindAvg2 5.1;

The values appear the same, but the function changes the stored values, whereas
the format affects only the displayed values.
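
The distinction can be seen in a sketch with one hypothetical value. ROUND changes what is stored, while the 5.1 format changes only how the unmodified value is written.

```sas
data _null_;
    Wind=12.3456;
    Rounded=round(Wind,.1);    /* stored value is now 12.3 */
    putlog Rounded=;           /* Rounded=12.3 */
    putlog Wind= 5.1;          /* displayed as 12.3 */
    putlog Wind=;              /* stored value is still 12.3456 */
run;
```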


3.04 Activity – Correct Answer


2. Run the program and examine rows 8 and 9. Both storms were two days,
but why are the values assigned to Weeks different?
MAARUTHA spans a Saturday/Sunday boundary, and ARLENE was in the
middle of a week.

3. Add 'c' as the fourth argument in the INTCK function to use the
continuous method. Run the program. Are the values for Weeks in rows
8 and 9 different?
Both storms were shorter than seven days, so Weeks is zero.


3.05 Question – Correct Answer


What value would be assigned to Months2Pay for each expression?

ServiceDate    PayDate      Months2Pay
10JUL2018      05SEP2018    ?

Months2Pay=intck('month', ServiceDate, PayDate);
Months2Pay is 2. Two end-of-month boundaries were crossed, at the end of July and the end of August.

Months2Pay=intck('month', ServiceDate, PayDate, 'c');
Months2Pay is 1. One month boundary was crossed at August 10. The next boundary
will not occur until September 10.
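
Both expressions can be checked directly with the dates from the question. The column names Discrete and Continuous are hypothetical.

```sas
data _null_;
    ServiceDate='10JUL2018'd;
    PayDate='05SEP2018'd;
    Discrete=intck('month',ServiceDate,PayDate);          /* 2: counts calendar month boundaries */
    Continuous=intck('month',ServiceDate,PayDate,'c');    /* 1: counts full months from the 10th */
    putlog Discrete= Continuous=;
run;
```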

3.06 Activity – Correct Answer
1. Complete the NewLocation assignment statement to use the COMPBL
function to read Location and convert each occurrence of two or more
consecutive blanks into a single blank.
data weather_japan_clean;
set pg2.weather_japan;
NewLocation=compbl(Location);
run;


3.06 Activity – Correct Answer


2. Complete the NewStation assignment to use the COMPRESS function
with Station as the only argument. Run the program. Which characters
are removed in the NewStation column?
Blanks are removed.
3. Add a second argument in the COMPRESS function to specify the
characters to remove. All characters should be enclosed in a single set of
quotation marks. Run the program.

data weather_japan_clean;
set pg2.weather_japan;
NewLocation=compbl(Location);
NewStation=compress(Station,"- ");
run;


3.07 Activity – Correct Answer
2. Why is the subsetting IF condition always false?

When the SCAN function extracts Prefecture from Location, only a comma is
specified as a delimiter. The leading space is included in the returned value.

Prefecture=scan(Location, 2, ',');

PDV
Location             Prefecture
TAKADA, Tokyo, JA    " Tokyo" (with a leading space)

3.07 Activity – Correct Answer


3. Modify the program to correct the logic error. Run the program and
confirm that four rows are returned.
Possible solutions:

Prefecture=scan(Location, 2, ', ');
Adding a space as a delimiter works if there are no spaces embedded in City or Prefecture.

Prefecture=strip(scan(Location, 2, ','));
The STRIP function removes leading and trailing blanks.
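
The original problem and both fixes can be compared in a short sketch that uses the Location value from the previous slide. The column names P1 through P3 are hypothetical.

```sas
data _null_;
    Location='TAKADA, Tokyo, JA';
    P1=scan(Location,2,',');           /* leading blank is kept */
    P2=scan(Location,2,', ');          /* blank is treated as a delimiter */
    P3=strip(scan(Location,2,','));    /* leading blank is removed afterward */
    if P1='Tokyo' then putlog 'P1 matches';
    else putlog 'P1 does not match';   /* this branch runs */
    if P2='Tokyo' and P3='Tokyo' then putlog 'P2 and P3 match';
run;
```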

3.08 Activity – Correct Answer
2. Examine the PROC PRINT report. Why is CategoryLoc equal to 0 in row 1?
Lowercase category is not found in this row.

Why is CategoryLoc equal to 0 in row 15?
The word category is not in this row at all.

Lowercase category is not found in either row.

3.08 Activity – Correct Answer


3. Modify the FIND function to make the search case insensitive.
Uncomment the IF-THEN statement to create a new column named
Category. Run the program and examine the results.
data storm_damage2;
set pg2.storm_damage;
drop Date Cost;
CategoryLoc=find(Summary, 'category', 'i');
if CategoryLoc > 0 then
Category=substr(Summary, CategoryLoc, 10);
run;
Start at the number stored in CategoryLoc and read 10 characters.

3.09 Activity – Correct Answer
1. Examine the assignment statements that use the CAT and CATS functions
to create StormID1 and StormID2. Run the program. How do the two
columns differ?

StormID1 is created using the CAT function, so trailing blanks after Name are
included in the concatenated string.

StormID2 is created using the CATS function, so trailing blanks after Name are
removed in the concatenated string.

3.09 Activity – Correct Answer


2. Add an assignment statement to create StormID3 that uses the CATX
function to concatenate Name, Season, and Day with a hyphen inserted
between each value. Run the program.
StormID3=catx("-", Name, Season, Day);

3. Modify the StormID2 assignment statement to insert a hyphen only
between Name and Season.
StormID2=cats(Name, '-', Season, Day);
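
The three functions can be compared on one made-up row. The values of Name, Season, and Day below are assumptions for illustration and are not taken from the course data.

```sas
data _null_;
    length Name $ 10;
    Name='ALBERTO';                         /* three trailing blanks in a 10-byte column */
    Season=2018;
    Day=25;
    StormID1=cat(Name,Season,Day);          /* trailing blanks after Name are kept */
    StormID2=cats(Name,'-',Season,Day);     /* ALBERTO-201825 */
    StormID3=catx('-',Name,Season,Day);     /* ALBERTO-2018-25 */
    putlog StormID1= / StormID2= / StormID3=;
run;
```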


3.10 Activity – Correct Answer


1. What is the type of High, Low, and Volume?
High and Volume are character, and Low is numeric.
3. Open the log. Read the note printed immediately after the DATA step.

NOTE: Character values have been converted to numeric
values at the places given by: (Line):(Column).
31:10 32:11

4. Uncomment the DailyVol assignment statement and run the program. Is
DailyVol created successfully?
No, DailyVol is missing for all rows.
NOTE: Invalid numeric data, Volume='5,976,252', at
line 32 column 11.

3.11 Activity – Correct Answer
2. Add an assignment statement to create a column named Volume2. Use
the INPUT function to read Volume using the COMMA12. informat. Run
the program and verify that Volume2 is created as a numeric column.

data work.stocks2;
set pg2.stocks2;
Date2=input(Date,date9.);
Volume2=input(Volume,comma12.);
run;


3.11 Activity – Correct Answer


3. In the assignment statement, change Volume2 to Volume so that you
update the value of the existing column.
data work.stocks2;
set pg2.stocks2;
Date2=input(Date,date9.);
Volume=input(Volume,comma12.);
run;

4. Run the program and notice that Volume is still character. Why is the
assignment statement not changing the column type?
Volume cannot be in the PDV as both numeric and character. The
character attribute of Volume from the SET statement is controlling the
column type.

3.12 Multiple Choice Question – Correct Answer


Which statement renames the existing column Product in sashelp.shoes
as Type?

a. set sashelp.shoes rename=(Type=Product);

b. set sashelp.shoes(rename=(Type=Product));

c. set sashelp.shoes(rename(Product=Type));

d. set sashelp.shoes(rename=(Product=Type));

The correct answer is d.


3.13 Activity – Correct Answer


Rename input columns with
the undesired column type.

data work.stocks2;
set pg2.stocks2(rename=(Volume=CharVolume Date=CharDate));
Volume=input(CharVolume,comma12.);
Date=input(CharDate,date9.);
drop Char:;
run;

Use the INPUT function on the renamed columns to create columns with original names.

Drop the renamed columns.

Lesson 4 Creating Custom Formats

4.1 Creating and Using Custom Formats
    Demonstration: Creating and Using Custom Formats
    Practice
4.2 Creating Custom Formats from Tables
    Demonstration: Creating Custom Formats from Tables
    Practice
4.3 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

4.1 Creating and Using Custom Formats

Formatting Data Values

proc print data=pg2.class_birthdate noobs;
    format Height Weight 3.0 Birthdate date9.;
run;


In SAS, formats are used to specify how data values are to be displayed. In this example, the
FORMAT statement is used in the PRINT procedure to display the numeric values of Height and
Weight with a width of 3 and no decimal places and the numeric values of Birthdate as a two-digit
day, three-letter month, and four-digit year. Remember that when you use a format, the format name
contains one period.


4.01 Activity
Open p204a01.sas from the activities folder and perform the following tasks:
1. Add a FORMAT statement in the DATA step to format the following values:
   Date: Use MONYY7. to display three-letter month and four-digit year values.
   Volume: Use COMMA12. to add commas.
   CloseOpenDiff and HighLowDiff: Use DOLLAR8.2 to add dollar signs and include two decimal places.
2. Run the program and verify the formatted values in the PROC PRINT output.
3. Add a FORMAT statement in the PROC MEANS step to format the values of
Date to show only a four-digit year. Run the PROC MEANS step again.
4. What is the advantage of adding a FORMAT statement to the DATA step
versus the PROC step?

Formatting Data Values

format Registration ? Height ?;

SAS doesn’t always have a predefined format that meets your needs.


SAS supplies over a hundred formats for you to use. However, in that list of formats, you might not
find one that meets your needs. In this example, we want a format to apply to the Registration
column to display a value of C as Complete and a value of I as Incomplete. We also want a format
for the Height column to display low values as Below Average, mid-range values as Average, and
high values as Above Average.


FORMAT Procedure

PROC FORMAT;
VALUE format-name value-or-range-1 = 'formatted-value'
value-or-range-2 = 'formatted-value'
...;
RUN;
You can use the FORMAT procedure to create your own format.
• The name can be up to 32 characters in length.
• Character formats must begin with a $ followed by a letter or underscore.
• Numeric formats must begin with a letter or underscore.
• The name cannot end in a number or match an existing SAS format.

The FORMAT procedure enables you to create your own custom formats. Each VALUE statement
specifies the criteria for creating one format.


FORMAT Procedure

PROC FORMAT;
VALUE format-name value-or-range-1 = 'formatted-value'
value-or-range-2 = 'formatted-value'
...;
RUN;

value-or-range: individual value or range of values that you want to format
'formatted-value': format that you want to apply to the individual value or range of values


On the left side of the equal sign, you specify individual values or a range of values that you want to
convert to formatted values. Character values need to be in quotation marks. Numeric values do not.
On the right side of the equal sign, you specify the formatted values that you want the values on the
left side to become. Formatted values are enclosed in quotation marks.
Note: You can define more than one custom format at a time by using multiple VALUE statements
in a PROC FORMAT step.


Creating and Using Custom Formats

proc format;
    value $regfmt 'C'='Complete'        /* create format: no period in format name */
                  'I'='Incomplete';
run;

proc print data=pg2.class_birthdate;
    format Registration $regfmt.;       /* apply format: period in format name */
run;


This PROC FORMAT step creates a character format called $REGFMT. There is no period after the
format name in the VALUE statement. The format definition specifies that a character
value of a capital C should be formatted as Complete and a character value of a capital I should be
formatted as Incomplete. Notice that there is no reference to a data table or a specific column in the
PROC FORMAT step. This format can be applied to any column that contains those values in any
table.
This PROC PRINT step applies $REGFMT to the Registration column in the class_birthdate table.
There is a period after the name of the format when you use it.
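Because the format definition names no table or column, one definition can serve several tables. The following is a minimal sketch of that idea, assuming two small hypothetical tables (work.signup and work.audit, with made-up columns and values) rather than the course data:

```sas
/* Define the format once; it is not tied to any table or column */
proc format;
    value $regfmt 'C'='Complete'
                  'I'='Incomplete';
run;

/* Hypothetical table 1: the codes are stored in Registration */
data work.signup;
    input Name $ Registration $;
    datalines;
Ana C
Ben I
;
run;

/* Hypothetical table 2: the same codes in a differently named column */
data work.audit;
    input FormID Status $;
    datalines;
101 I
102 C
;
run;

/* The same format is applied to both columns */
proc print data=work.signup noobs;
    format Registration $regfmt.;
run;

proc print data=work.audit noobs;
    format Status $regfmt.;
run;
```

Both reports should display Complete and Incomplete in place of the stored codes.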


4.02 Activity
Open p204a02.sas from the activities folder and perform the following tasks:
1. In the PROC FORMAT step, modify the second VALUE statement to create
a format named HRANGE that has the following criteria:
• A range of 50 – 57 has a formatted value of Below Average.
• A range of 58 – 60 has a formatted value of Average.
• A range of 61 – 70 has a formatted value of Above Average.
2. In the PROC PRINT step, modify the FORMAT statement to format Height
with the HRANGE format.
3. Run the program and verify the formatted values in the PRINT output.
4. Why is the Height value for the first row not formatted?


Defining a Continuous Range

Put < before the ending value in a range to exclude the value.
Put < after the starting value in a range to exclude the value.

value hrange 50-<58 = 'Below Average'
             58-60  = 'Average'
             60<-70 = 'Above Average';


Here are the four ways that a range can be specified:

Range          Starting Value   Ending Value
58 – 60        Includes 58      Includes 60
58 – < 60      Includes 58      Excludes 60
58 < – 60      Excludes 58      Includes 60
58 < – < 60    Excludes 58      Excludes 60
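To see why the exclusion operators matter, it can help to compare a format whose ranges leave gaps with one whose ranges are continuous. The sketch below is illustrative only: the format names HGAP and HCONT, the table work.range_check, and the four test values are assumptions, not part of the course files.

```sas
proc format;
    /* Integer endpoints leave gaps: 57.5 and 60.5 match no range */
    value hgap  50-57  = 'Below Average'
                58-60  = 'Average'
                61-70  = 'Above Average';
    /* Continuous ranges: every value from 50 through 70 matches exactly one range */
    value hcont 50-<58 = 'Below Average'
                58-60  = 'Average'
                60<-70 = 'Above Average';
run;

/* Hypothetical test values, two of which fall between integer endpoints */
data work.range_check;
    input Height;
    Gapped=put(Height, hgap.);
    Continuous=put(Height, hcont.);
    datalines;
57
57.5
60
60.5
;
run;

proc print data=work.range_check noobs;
run;
```

In the Gapped column, the rows for 57.5 and 60.5 should show the number itself instead of a label, because a value that matches no range is displayed unformatted. The Continuous column should show a label on every row.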

Using Keywords
value hrange low-<58  = 'Below Average'     /* low: lowest possible value */
             58-60    = 'Average'
             60<-high = 'Above Average';    /* high: highest possible value */

value $regfmt 'C'   = 'Complete'
              'I'   = 'Incomplete'
              other = 'Miscoded';           /* other: all values that do not match any other value */



Keywords can be specified in the VALUE statement to enhance your selection.


• You can use LOW or HIGH as one value in a range. The LOW keyword includes missing values
for character variables and does not include missing values for numeric variables.
• You can use the keyword OTHER as a single value. OTHER matches all values that do not match
any other value or range.
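The keyword rules above can be checked with a few test rows. This is a sketch under stated assumptions: the table work.keyword_check and its values are hypothetical, and the OTHER label 'Not Recorded' on HRANGE is added here only to show where a missing numeric value lands.

```sas
proc format;
    value hrange low-<58  = 'Below Average'
                 58-60    = 'Average'
                 60<-high = 'Above Average'
                 other    = 'Not Recorded';   /* hypothetical label; catches numeric missing */
    value $regfmt 'C'   = 'Complete'
                  'I'   = 'Incomplete'
                  other = 'Miscoded';
run;

/* Hypothetical rows: one missing Height and one unexpected Registration code */
data work.keyword_check;
    input Height Registration $;
    HeightGroup=put(Height, hrange.);
    RegStatus=put(Registration, $regfmt.);
    datalines;
45 C
59 I
72 X
. C
;
run;

proc print data=work.keyword_check noobs;
run;
```

Because LOW does not include missing values for numeric variables, the row with a missing Height should fall through to OTHER and show Not Recorded. The code X matches neither 'C' nor 'I', so it should show Miscoded.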


Creating and Using Custom Formats

Scenario
Use the FORMAT procedure to create custom numeric and character formats based on single
values and a range of values.

Files
• p204d01.sas
• storm_summary – a SAS table that contains one row per storm for the 1980 through 2016 storm
seasons

Syntax

PROC FORMAT;
VALUE format-name value-or-range-1 = 'formatted-value'
value-or-range-2 = 'formatted-value'
...;
RUN;

Notes
• The FORMAT procedure is used to create custom formats.
• A VALUE statement specifies the criteria for creating one custom format.
• Multiple VALUE statements can be used within the PROC FORMAT step.
• The format name can be up to 32 characters in length, must begin with a $ followed by a letter or
underscore for character formats, and must begin with a letter or underscore for numeric formats.
• On the left side of the equal sign, you specify individual values or a range of values that you want
to convert to formatted values. Character values must be in quotation marks; numeric values are
not quoted.
• On the right side of the equal sign, you specify the formatted values that you want the values on
the left side to become. Formatted values need to be in quotation marks.
• The keywords LOW, HIGH, and OTHER can be used in the VALUE statement.
• You do not include a period in the format name when you create the format, but you do include the
period in the name when you use the format.
• Custom formats can be used in the FORMAT statement and the PUT function.
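The last note mentions the PUT function, which stores the formatted value in a new character column instead of only changing how a column is displayed. A minimal sketch, assuming a hypothetical table work.basin_groups with made-up codes (the $REGION definition matches the one built in the demo that follows):

```sas
proc format;
    value $region 'NA'='Atlantic'
                  'WP','EP','SP'='Pacific'
                  'NI','SI'='Indian'
                  ' '='Missing'
                  other='Unknown';
run;

/* Hypothetical codes; ZZ is deliberately not a defined basin */
data work.basin_groups;
    input Basin $;
    /* PUT returns the label as a character value, so BasinGroup is a real column */
    BasinGroup=put(Basin, $region.);
    datalines;
NA
EP
SI
ZZ
;
run;

proc print data=work.basin_groups noobs;
run;
```

The FORMAT statement leaves the stored value unchanged, whereas PUT creates a column that can be grouped, filtered, or saved like any other.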

Demo
1. Open p204d01.sas from the demos folder and find the Demo section of the program. Notice the
syntax for creating the STDATE format in the PROC FORMAT step.
proc format;
value stdate low - '31DEC1999'd = '1999 and before'
'01JAN2000'd - '31DEC2009'd = '2000 to 2009'
'01JAN2010'd - high = '2010 and later'
. = ' Not Supplied';
run;


2. Add a VALUE statement to the PROC FORMAT step to create the $REGION format with the
following labels:

NA            Atlantic
WP, EP, SP    Pacific
NI, SI        Indian
blank         Missing
other         Unknown

proc format;
value stdate low - '31DEC1999'd = '1999 and before'
'01JAN2000'd - '31DEC2009'd = '2000 to 2009'
'01JAN2010'd - high = '2010 and later'
. = 'Not Supplied';
value $region 'NA'='Atlantic'
'WP','EP','SP'='Pacific'
'NI','SI'='Indian'
' '='Missing'
other='Unknown';
run;
3. Highlight the PROC FORMAT step and run the selected code. Verify in the SAS log that the
formats have been output.
4. Add a FORMAT statement in the PROC FREQ step to format Basin with the $REGION format
and StartDate with the STDATE format. Highlight the PROC FREQ step and run the selected
code.
proc freq data=pg2.storm_summary;
tables Basin*StartDate;
format StartDate stdate. Basin $region.;
run;


4.03 Activity
Open p204a03.sas from the activities folder and perform the following tasks:
1. Review the PROC FORMAT step that creates the $REGION format that
assigns basin codes to groups. Highlight the step and run the selected
code.
2. Notice that the DATA step includes IF-THEN/ELSE statements to create a
new column named BasinGroup.
3. Delete the IF-THEN/ELSE statements and replace them with an
assignment statement to create the BasinGroup column. Use the PUT
function with Basin as the first argument and $REGION. as the second
argument.
4. Highlight the DATA and PROC MEANS steps and run the selected code.
How many BasinGroup values are in the summary report?

Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
1. Creating Custom Formats Based on Single Values
The pg2.np_summary table contains public use statistics from the National Park Service. The
values of the Reg column represent park region as a code. Create a format that, when applied,
displays full descriptive values for the regions with high frequency.
a. Open p204p01.sas from the practices folder. Highlight the PROC FREQ step and run the
selected code. Review the output. Notice that regional codes are used, not descriptive
values.

b. Add a VALUE statement to the PROC FORMAT step to create a format named $HIGHREG
that defines the descriptive values shown below.

Code           Value
IM             Intermountain
PW             Pacific West
SE             Southeast
other codes    All Other Regions

c. Add a FORMAT statement to the PROC FREQ step so that the $HIGHREG format is applied
to the Reg column.


d. Run the program and review the output. Verify that the descriptive values for the Reg column
are displayed.

Level 2
2. Creating Custom Formats Based on a Range of Values
The pg2.np_acres table contains acreage amounts for national parks. Create a format that,
when applied, groups acreage amounts into identified categories.
a. Open p204p02.sas from the practices folder. Before the DATA step, add a PROC FORMAT
step to create a format named PSIZE that categorizes parks based on the gross acres. Use
the ranges and values as identified below.

Range                                      Value
Less than 10,000 acres                     Small
10,000 through less than 500,000 acres     Average
500,000 and more acres                     Large

b. In the DATA step, add an assignment statement to create a new column named ParkSize.
Use the PUT function to create the new column based on the formatted values of
GrossAcres.
c. Run the program and view the output table. Verify the values of the ParkSize column.


Challenge
3. Creating Custom Formats Based on Nesting Formats
The pg2.np_weather table contains weather-related statistics for four national park locations.
Create a format that, when applied, groups dates into identified categories.
a. Access the Base SAS® 9.4 Procedures Guide. Find the PROC FORMAT section and the
VALUE statement page. Scroll to the bottom of the page to look at examples where existing
SAS formats are used for labels in a custom format.
b. Open p204p03.sas from the practices folder.
c. Add a PROC FORMAT step to create a format named DECADE that categorizes dates as
identified below.
• Dates from January 1, 2000 – December 31, 2009 are displayed with the value
2000-2009.
• Dates from January 1, 2010 – December 31, 2017 are displayed with the value
2010-2017.
• Dates from January 1, 2018 – March 31, 2018 are displayed with the value
1st Quarter 2018.
• Dates from April 1, 2018, and beyond display the actual date value using the MMDDYY10.
format.
d. Modify the PROC MEANS step so that the DECADE format is applied to the Date column.
e. Run the program and review the output. Verify that the descriptive values for the Date
column are displayed.


4.2 Creating Custom Formats from Tables

Creating Custom Formats from Tables

proc format;
value $sbfmt
'AS'='Arabian Sea'
'BB'='Bay of Bengal'
'EA'='Eastern Australia'
'WA'='Western Australia'
'CP'='Central Pacific'
'CS'='Caribbean Sea'
'GM'='Gulf of Mexico'
'MM'='Missing';
run;


It can be a bit tedious to type all the values of a custom format in the VALUE statement. If the data is
stored in a table, you can read the values from the table to create the format. In this example, the
Sub_Basin column represents the values on the left side of the equal sign, and the
SubBasin_Name column represents the formatted values on the right side of the equal sign.


Rules for the Input Table

The input table must have at least these three columns.

Column     Contents
FmtName    name of the custom format
Start      value to format
Label      label that you want to apply to the value


When you use an input table to create a format, the table must have a particular structure with at
least three specific character columns.
• FmtName contains the name of the format that you are creating. Remember that character
formats begin with a dollar sign.
• Start contains the values to the left of the equal sign in the VALUE statement (the values that you
want to format).
• Label contains the values to the right of the equal sign in the VALUE statement (the labels that
you want to apply to the values).
You might need additional columns depending on the requirements of the format. For example, if you
are specifying ranges, you need an End column in addition to a Start column.
More than likely, your input table will not have the appropriately named columns. Using a DATA step,
you can create a new version of the data that you can use for creating a format.


Rules for the Input Table

data work.sbdata;
retain FmtName '$sbfmt';
set pg2.storm_subbasincodes(rename=(Sub_Basin=Start
SubBasin_Name=Label));
keep Start Label FmtName;
run;


To build the table to generate the format, the RETAIN statement creates the FmtName column and
retains the value $sbfmt for each row. The SET statement reads the input table and renames the
Sub_Basin and SubBasin_Name columns. In the end, our new table, work.sbdata, will have the
three columns (FmtName, Start, and Label) with a row for each SubBasin value.

CNTLIN= Option

proc format cntlin=work.sbdata;
run;

Use the CNTLIN= option to specify a table for building a format.


When the table has the correct layout, reading it to create a custom format takes a single option.
Use the CNTLIN= option in the PROC FORMAT statement to name the input table. The
VALUE statement is no longer needed.
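The whole path from lookup table to usable format can be tried without the course data. The sketch below is an illustration under stated assumptions: work.codes is a hypothetical three-row stand-in for pg2.storm_subbasincodes, and work.check exists only to test the result.

```sas
/* Hypothetical lookup table standing in for pg2.storm_subbasincodes */
data work.codes;
    infile datalines dsd;
    length Sub_Basin $ 2 SubBasin_Name $ 20;
    input Sub_Basin $ SubBasin_Name $;
    datalines;
AS,Arabian Sea
BB,Bay of Bengal
GM,Gulf of Mexico
;
run;

/* Reshape into the layout that CNTLIN= expects: FmtName, Start, Label */
data work.sbdata;
    retain FmtName '$sbfmt';
    set work.codes(rename=(Sub_Basin=Start SubBasin_Name=Label));
    keep Start Label FmtName;
run;

/* Build the format from the table; no VALUE statement is written */
proc format cntlin=work.sbdata;
run;

/* Quick check: look up one code with the new format */
data work.check;
    Code='BB';
    Name=put(Code, $sbfmt.);
run;

proc print data=work.check noobs;
run;
```

If the format was built as intended, Name should contain Bay of Bengal.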


Creating Custom Formats from Tables

Scenario
Use the CNTLIN= option to specify a table from which the FORMAT procedure will build formats.

Files
• p204d02.sas
• storm_subbasincodes – a SAS table that contains seven rows showing the two-letter code
values and descriptive values for storm sub-basins.
• storm_detail – a SAS table that contains detailed data for the 1980 through 2016 storm seasons.
Each row represents one measurement for each six hours of a storm.
• storm_categories – a SAS table that contains the range values for the five storm categories.

Syntax

PROC FORMAT CNTLIN=input-table FMTLIB;
    SELECT format-names;
RUN;

Notes
• The CNTLIN= option specifies a table from which PROC FORMAT builds formats.
• The input table must contain at a minimum three character columns:
– Start, which represents the raw data values to be formatted.
– Label, which represents the formatted labels.
– FmtName, which contains the name of the format that you are creating. Character formats
start with a dollar sign.
• If you specify ranges, the input table must also contain an End column in addition to the Start
column.
• The FMTLIB option creates a report containing information about your custom formats.
• The SELECT statement selects formats for processing by the FMTLIB option.
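The notes on ranges and on FMTLIB can be combined in one short run. This is a sketch only: the table work.ranges, its three ranges, and the format name LVLFMT are hypothetical and stand in for pg2.storm_categories and CATFMT.

```sas
/* Hypothetical range table with a low bound, a high bound, and a label */
data work.ranges;
    input Low High Category $;
    datalines;
0 49 Low
50 99 Medium
100 150 High
;
run;

/* Ranges need an End column in addition to Start */
data work.rangedata;
    retain FmtName 'lvlfmt';
    set work.ranges(rename=(Low=Start High=End Category=Label));
    keep FmtName Start End Label;
run;

/* Build the format and report on it in the same step */
proc format cntlin=work.rangedata fmtlib;
    select lvlfmt;
run;
```

The FMTLIB report lists each range with its start, end, and label, which is a convenient way to confirm that the table was read as expected. Without the SELECT statement, the report would cover every format in the catalog.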

Demo
1. Open p204d02.sas from the demos folder and find the Demo section of the program. Examine
the DATA step that creates the sbdata table from the pg2.storm_subbasincodes table and the
PROC FORMAT step that imports the sbdata table. Highlight the demo program and run the
selected code. Verify that the new table contains three required columns to build a format. View
the log and confirm that the $SBFMT format was created.


2. Open the pg2.storm_categories table. This table defines a range of maximum wind speeds
(Low and High) and assigns a storm category (Category).
3. Modify the second DATA and PROC FORMAT steps to create a table named catdata that will
include the following columns. Highlight the DATA and PROC FORMAT steps and run the
selected code. View the log and confirm that the CATFMT format was created.

Column in pg2.storm_categories    Column in catdata
<none>                            FmtName (assign the value catfmt for each row)
Low                               Start
High                              End
Category                          Label

data catdata;
retain FmtName "catfmt";
set pg2.storm_categories(rename=(Low=Start
High=End
Category=Label));
keep FmtName Start End Label;
run;

proc format cntlin=catdata;
run;
4. Add a FORMAT statement in the PROC FREQ step to format Sub_basin with the $SBFMT
format and Wind with the CATFMT format. Highlight the TITLE statements and PROC FREQ
step and run the selected code.
title "Frequency of Wind Measurements for Storm Categories by SubBasin";
title2 "2016 Storms";
proc freq data=pg2.storm_detail;
/*include only Category 1-5 2016 storms with known subbasin*/
where Wind>=64 and Season=2016 and
Sub_basin not in('MM', 'NA');
tables Sub_basin*Wind / nocol norow nopercent;
format Sub_basin $sbfmt. Wind catfmt.;
run;


4.04 Activity
Open p204a04.sas from the activities folder and perform the following tasks:
1. Run the program to create the $SBFMT and CATFMT formats. View the
log to confirm that both were output.
2. Uncomment the PROC FORMAT step at the end of the program.
Highlight the step and run the selected code. A report for all formats in
the Work library is generated.
3. Add the following statement in the last PROC FORMAT step to limit the
report to selected formats. Run the step.
select $sbfmt catfmt;

4. What are the default lengths for the $SBFMT and CATFMT formats?


Location of Custom Formats


By default, custom formats are stored in the temporary Work library in a catalog named formats.

proc format;
    value $sttype
        'TS'='Tropical Storm'
        'SS'='Subtropical Storm'
        'ET'='Extratropical Storm'
        'DS'='Disturbance'
        'NR'='Not Reported';
run;

proc format cntlin=work.sbdata;
run;

NOTE: Format $STTYPE has been output.

NOTE: Format $STTYPE is already on the library WORK.FORMATS.
NOTE: Format $STTYPE has been output.


Creating Permanent Custom Formats

proc format library=pg2.myfmts;

proc format library=pg2;

proc format lib=pg2;

Use the LIBRARY= option to save formats in a permanent location.


If you want to store your formats in a location other than WORK.FORMATS, you can use
the LIBRARY= option. This option can be shortened to LIB=. You can specify the library and a
catalog name, such as PG2.MYFMTS. If you specify only the library, the default catalog, formats, is
used.


Searching for Custom Formats

options fmtsearch=(pg2.myfmts sashelp);

Search order: work.formats, library.formats, pg2.myfmts, sashelp.formats


When SAS encounters a format, it has to look up the format definition. By default, SAS searches for
formats in the formats catalog of the Work library and, if there is a library named Library, it also
searches in the formats catalog there. If you save your custom formats to another location, you
must tell SAS where to find them.
In the global OPTIONS statement, you use the FMTSEARCH= option to specify additional locations
to search. This OPTIONS statement directs SAS to search in the myfmts catalog of the Pg2 library
and the formats catalog of the Sashelp library after it searches the default locations
WORK.FORMATS and LIBRARY.FORMATS.
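Saving a format with LIBRARY= and finding it again with FMTSEARCH= can be tried end to end. The sketch below rests on several assumptions: the libref demo and the table work.reg are hypothetical, and demo is pointed at the Work folder only so that the code runs anywhere. In practice, the LIBNAME statement would name a permanent folder.

```sas
/* Assumption: demo is a hypothetical libref. It points at the Work folder */
/* so that the sketch is runnable; substitute a permanent folder in practice. */
libname demo "%sysfunc(pathname(work))";

/* Store the format in the myfmts catalog of the demo library */
proc format library=demo.myfmts;
    value $regfmt 'C'='Complete'
                  'I'='Incomplete';
run;

/* Tell SAS to search demo.myfmts after the default locations */
options fmtsearch=(demo.myfmts);

/* Hypothetical table used to confirm that the format is found */
data work.reg;
    input Name $ Registration $;
    datalines;
Ana C
Ben I
;
run;

proc print data=work.reg noobs;
    format Registration $regfmt.;
run;
```

Without the OPTIONS statement, SAS would look only in WORK.FORMATS and LIBRARY.FORMATS and would not find $REGFMT in the myfmts catalog.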


4.05 Activity
Open p204a05.sas from the activities folder and perform the following tasks:
1. In the PROC FORMAT statement, add the LIBRARY= option to save the
formats to the pg2.formats catalog.
2. Run the PROC FORMAT step and verify in the log that the two formats
were created in a permanent location.
3. Before the PROC PRINT step, add an OPTIONS statement so that SAS can
find the two permanent formats.
options fmtsearch=(pg2.formats);

4. Run the OPTIONS statement and the PROC PRINT step. Are the
Registration and Height values formatted?

Beyond SAS Programming 2


What if you want to create a template for displaying numeric values with PROC FORMAT?

picture million low-high = '09.9M'
                (prefix='$' mult=.00001);

picture mydates (default=14)
                low-high = '%Y %B' (datatype=date);

• Learn about PROC FORMAT in SAS Help.
• Browse or ask questions in the SAS Procedures community and see responses from other SAS
programmers.



Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
4. Creating a Custom Format from a Table
The pg2.np_monthlyTraffic table contains monthly traffic counts at locations in national parks.
Create a format that categorizes park codes into their type (for example, National Park, National
Seashore, and so on). The pg2.np_codeLookup table contains park codes and the associated
park types.
a. Open p204p04.sas from the practices folder. Highlight the PROC MEANS step and run the
selected code. Review the output. Notice that the traffic statistics are listed by a four-letter
park code.
b. Open the pg2.np_codeLookup table. Notice that ParkCode contains the four-letter park
code and Type contains the type of park.

c. Modify the DATA step.


1) Add a RENAME= data set option to the SET statement to rename the ParkCode column
to Start and the Type column to Label.
2) Add a RETAIN statement before the SET statement to create the FmtName column with
a value of $TypeFmt (without a period at the end).
d. In the PROC FORMAT statement, add a CNTLIN= option to build a format from the
type_lookup table.
e. In the PROC MEANS step, add a FORMAT statement so that the $TypeFmt format is applied
to the ParkCode column.


f. Run the program and review the results. Verify that the data is grouped by park types.

Level 2
5. Creating a Custom Format from a Table
The pg2.np_species table provides a detailed species list for selected national parks. Create a
format that categorizes park codes into regions (for example, Northeast or Intermountain). Use
the pg2.np_codeLookup table to create a custom format.
a. Open p204p05.sas from the practices folder. Modify the first DATA step to create the
np_lookup table that will be used to build a custom format.
1) Add a RETAIN statement to create the FmtName column with a value of $RegLbl.
2) Add a RENAME= data set option to the SET statement to rename the ParkCode column
to Start.
3) Add conditional statements to create the Label column. The Label column is equal to the
Region column unless the region is missing. In that case, the Label column is equal to a
value of Unknown.
4) Add a KEEP statement to include the Start, Label, and FmtName columns.
b. Highlight the first DATA step and run the selected code. Verify the output table.

c. Modify the PROC FORMAT step to read in the np_lookup table.


d. In the second DATA step, create a new column named Region. Use the PUT function to
create the new column based on using the $RegLbl format on the ParkCode column. Run
the program and confirm the results in the PROC FREQ output.

Challenge
6. Updating a Custom Format by Using the CNTLOUT= Option
The pg2.np_summary table contains public use statistics from the National Park Service. The
values of the Type column represent the park type as a code. A format is applied to display
descriptive values for the park types.
a. Open p204p06.sas from the practices folder. Run the program and review the results.
Notice that some of the park types are still displayed as codes because the custom format
does not include a label for those values.
b. Write a PROC FORMAT step that uses the CNTLOUT= option to create a table named
typfmtout from the existing $TypCode format. Run the step and view the output table. The
typfmtout table contains several extra columns, but the critical columns for this practice are
FmtName, Start, and Label. Notice that the values for FmtName do not include the $ as a
prefix.
c. Open the pg2.np_newcodes table. Notice that it contains the format name, the Type values,
and the labels in the FmtName, Start, and Label columns.
d. Write a DATA step that creates a table named typfmt_update by concatenating the output
table from PROC FORMAT and the pg2.np_newcodes table. Change the values of
FmtName to $TypCode and keep only the FmtName, Start, and Label columns.
e. Write a PROC FORMAT that re-creates the $TypCode format using the CNTLIN= option to
read the new table that contains the updated format values.


f. Run the PROC FREQ step again and verify that all Type codes are displayed with labels.

4.3 Solutions
Solutions to Practices
1. Creating Custom Formats Based on Single Values
proc format;
value $highreg 'IM'='Intermountain'
'PW'='Pacific West'
'SE'='Southeast'
other='All Other Regions';
run;

title 'High Frequency Regions';
proc freq data=pg2.np_summary order=freq;
tables Reg;
label Reg='Region';
format Reg $highreg.;
run;
title;
2. Creating Custom Formats Based on a Range of Values
proc format;
value psize low-<10000='Small'
10000-<500000='Average'
500000-high='Large';
run;

data np_parksize;
set pg2.np_acres;
ParkSize=put(GrossAcres,psize.);
format GrossAcres comma16.;
run;

3. Creating Custom Formats Based on Nesting Formats
proc format;
value decade
'01Jan2000'd-'31Dec2009'd = '2000-2009'
'01Jan2010'd-'31Dec2017'd = '2010-2017'
'01Jan2018'd-'31Mar2018'd = '1st Quarter 2018'
'01Apr2018'd-high = [mmddyy10.];
run;

title1 'Precipitation and Snowfall';
title2 'Note: Amounts shown in inches';
proc means data=pg2.np_weather maxdec=2 sum mean nonobs;
where Prcp > 0 or Snow > 0;
var Prcp Snow;
class Date Name;
format Date decade.;
run;
title;
4. Creating a Custom Format from a Table
data type_lookup;
retain FmtName '$TypeFmt';
set pg2.np_codeLookup(rename=(ParkCode=Start Type=Label));
keep Start Label FmtName;
run;

proc format cntlin=type_lookup;
run;

title 'Traffic Statistics';
proc means data=pg2.np_monthlyTraffic maxdec=0 mean sum nonobs;
var Count;
class ParkCode Month;
label ParkCode='Name';
format ParkCode $TypeFmt.;
run;
title;

5. Creating a Custom Format from a Table
data np_lookup;
retain FmtName '$RegLbl';
set pg2.np_codeLookup(rename=(ParkCode=Start));
if Region ne ' ' then Label=Region;
else Label='Unknown';
keep Start Label FmtName;
run;

proc format cntlin=np_lookup;
run;

data np_endanger;
set pg2.np_species;
where Conservation_Status='Endangered';
Region=put(ParkCode,$RegLbl.);
run;

title 'Number of Endangered Species by Region';
proc freq data=np_endanger;
tables Region / nocum;
run;
title;

6. Updating a Custom Format by Using the CNTLOUT= Option
/*step a*/
proc format cntlin=pg2.np_types_regions;
run;

title1 'Park Frequencies by Type';
proc freq data=pg2.np_summary;
table Type / nocum;
format Type $TypCode.;
run;
title;

/*step b*/
proc format cntlout=typfmtout;
select $TypCode;
run;

/*step d*/
data typfmt_update;
set typfmtout pg2.np_newcodes;
keep FmtName Start Label;
FmtName='$TypCode';
run;

/*step e*/
proc format cntlin=typfmt_update;
run;

/*step f*/
title1 'Park Frequencies by Type';
proc freq data=pg2.np_summary;
table Type / nocum;
format Type $TypCode.;
run;
title;

Solutions to Activities and Questions

4.01 Activity – Correct Answer

1. Add a FORMAT statement in the DATA step to format the values of the
four numeric columns.
format Date monyy7. Volume comma12.
CloseOpenDiff HighLowDiff dollar8.2;

3. Add a FORMAT statement in the PROC MEANS step to format the values
of Date to show only a four-digit year.
format Date year4.;

4. What is the advantage of adding a FORMAT statement to the DATA step
versus the PROC step?
Formats that you use in the DATA step are permanent attributes that are
stored in the descriptor portion of the table. Formats that you use in a
PROC step are temporary attributes.
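
The difference between permanent and temporary formats can be seen in a minimal sketch that uses sashelp.class. The table name work.class_fmt and the specific formats are assumptions chosen for illustration; they are not part of the activity program.

```sas
/* Format assigned in the DATA step: stored in the descriptor portion */
data work.class_fmt;
    set sashelp.class;
    format Height Weight 8.1;
run;

/* Format assigned in a PROC step: applies to this report only */
proc print data=work.class_fmt;
    format Weight 5.;
run;

/* The descriptor portion still lists 8.1 for Height and Weight */
proc contents data=work.class_fmt;
run;
```

The PROC PRINT report displays Weight with the temporary format, but the PROC CONTENTS output continues to show the format that was stored by the DATA step.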

4.02 Activity – Correct Answer


1. Modify the second VALUE statement to create a format named HRANGE
that specifies the following criteria:
value hrange 50-57='Below Average'
58-60='Average'
61-70='Above Average';

2. Modify the FORMAT statement to format Height with the HRANGE format.
format Registration $regfmt. Height hrange.;

4. Why is the Height value for the first row not formatted?
A value of 57.3 does not fit into any of the ranges. Therefore, the actual
value is displayed.
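
One way to avoid gaps between ranges is the less-than operator (-<) that is used in the PSIZE format in the practice solutions. The following is only a sketch: the format name HRANGE_FIX and the test value are hypothetical, and the activity itself does not ask for this change.

```sas
/* Hypothetical variation on HRANGE: -< excludes the upper endpoint,  */
/* so every value from 50 through 70 falls into exactly one range.    */
proc format;
    value hrange_fix 50-<58 = 'Below Average'
                     58-<61 = 'Average'
                     61-70  = 'Above Average';
run;

/* Write the formatted value of 57.3 to the log */
data _null_;
    Height = 57.3;
    HeightGroup = put(Height, hrange_fix.);
    putlog Height= HeightGroup=;
run;
```

With these ranges, 57.3 falls in the Below Average range instead of being displayed as an unformatted value.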

4.03 Activity – Correct Answer

data storm_summary;
set pg2.storm_summary;
Basin=upcase(Basin);
BasinGroup=put(Basin, $region.);
run;

How many BasinGroup values are in the summary report?
three

4.04 Activity – Correct Answer

What are the default lengths for the $SBFMT and CATFMT formats?

The length of the format matches the length of the longest label.

4.05 Activity – Correct Answer


1. In the PROC FORMAT statement, add the LIBRARY= option to save the
formats to the pg2.formats catalog.
proc format library=pg2;   or   proc format library=pg2.formats;

3. Before the PROC PRINT step, add an OPTIONS statement so that SAS can
find the two permanent formats.
options fmtsearch=(pg2);   or   options fmtsearch=(pg2.formats);

4. Are the Registration and Height values formatted? Yes

Lesson 5 Combining Tables
5.1 Concatenating Tables
Demonstration: Concatenating Tables
Practice

5.2 Merging Tables
Demonstration: Merging Tables

5.3 Identifying Matching and Nonmatching Rows
Demonstration: Merging Tables with Nonmatching Rows
Practice

5.4 Solutions
Solutions to Practices
Solutions to Activities and Questions

5.1 Concatenating Tables

Suppose you have two or more tables that have the same columns, and you need to combine all the
rows into a single table. We call this concatenating tables.

Concatenating Tables with Matching Columns

DATA output-table;
SET input-table1 input-table2 ...;
RUN;

Use the SET statement when you want to combine tables with similar data in one
output table. You can list any number of input tables.

When the columns have the same names, lengths, and types, then concatenating tables is simple.
You use the DATA step to create a new table, and list the tables that you want to combine in the SET
statement. First, SAS reads all of the rows from the first table listed in the SET statement and writes
them to the new table, then it reads and writes the rows from the second table, and so on. In the
same DATA step, you can include any other statements that you need to manipulate the rows read
from all tables.
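
The following is a minimal sketch of that pattern. The tables work.sales_q1 and work.sales_q2, their values, and the tax rate are hypothetical and exist only to make the example self-contained.

```sas
/* Hypothetical input tables with identical column names and types */
data work.sales_q1;
    input Region $ Amount;
    datalines;
East 100
West 250
;
run;

data work.sales_q2;
    input Region $ Amount;
    datalines;
East 175
West 300
;
run;

/* Rows from sales_q1 are written first, followed by rows from sales_q2 */
data work.sales_half;
    set work.sales_q1 work.sales_q2;
    AmountWithTax = Amount * 1.07;   /* executed for rows from both tables */
    format AmountWithTax dollar10.2;
run;

proc print data=work.sales_half;
run;
```

The assignment statement runs once for each row that is read, regardless of which input table the row came from.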

Concatenating Tables with Matching Columns

data class_current;
set sashelp.class pg2.class_new;
run;

In class_current, the rows from the second table (pg2.class_new) are added after
the rows from the first table (sashelp.class).

Program: p205d01

In this example, the sashelp.class table and the class_new table have the same columns and
attributes. This program concatenates the two tables to create a new table named class_current.
The class_current table has 22 rows, which includes all rows from both tables.

Concatenating Tables

Scenario
Use the DATA step and the RENAME= data set option to concatenate tables.

Files
• p205d01.sas
• storm_summary – a SAS table that contains one row per storm for the 1980 through 2016 storm
seasons
• storm_2017 – a SAS table that contains one row per storm for the 2017 storm season

Syntax

DATA output-table;
SET input-table1(rename=(current-colname=new-colname))
input-table2 ...;
RUN;

Notes
• Multiple tables listed in the SET statement are concatenated.
• SAS first reads all the rows from the first table listed in the SET statement and writes them to the
new table. Then it reads and writes the rows from the second table, and so on.
• Columns with the same name are automatically aligned. The column properties in the new table
are inherited from the first table that is listed in the SET statement.
• Columns that are not in all tables are also included in the output table.
• The RENAME= data set option can be used to rename columns in one or both tables so that they
align in the new table.
• Additional DATA step statements can be used after the SET statement to manipulate the data.
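
As a rough illustration of these notes, the sketch below uses two small hypothetical tables (work.storms_old and work.storms_new) rather than the course data. The column names mirror the demonstration, but the rows are invented.

```sas
/* Hypothetical tables whose columns only partly match */
data work.storms_old;
    input Name $ Season;
    datalines;
Alpha 2016
Beta 2016
;
run;

data work.storms_new;
    input Name $ Year Location $;
    datalines;
Gamma 2017 Gulf
;
run;

/* Without RENAME=, Season, Year, and Location are all kept,   */
/* and each is missing for rows from the table that lacks it.  */
data work.storms_unaligned;
    set work.storms_old work.storms_new;
run;

/* RENAME= aligns Year with Season; DROP= removes the extra column */
data work.storms_aligned;
    set work.storms_old
        work.storms_new(rename=(Year=Season) drop=Location);
run;
```

Comparing the two output tables shows why the data set options are placed on the input table: the columns are aligned before the rows reach the PDV.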

Demo
1. Open p205d01.sas from the demos folder and find the Demo section of the program. Modify the
SET statement to concatenate pg2.storm_summary and pg2.storm_2017. Highlight the DATA
and PROC SORT steps and run the selected code.
2. Notice that for the 2017 storms Year is populated with 2017, Location has values, and Season
is missing. Rows from the storm_summary table (starting with row 55) have Season populated,
and Year and Location are missing.
3. After pg2.storm_2017, use the RENAME= data set option to rename Year as Season. Use the
DROP= data set option to drop Location. Highlight the demo program and run the selected
code.
data storm_complete;
set pg2.storm_summary
pg2.storm_2017(rename=(Year=Season) drop=Location);
...

NOTE: There were 3118 observations read from the data set PG2.STORM_SUMMARY.
NOTE: There were 54 observations read from the data set PG2.STORM_2017.
NOTE: The data set WORK.STORM_COMPLETE has 3172 observations and 7 variables.

5.01 Activity
Open p205a01.sas from the activities folder and perform the following tasks:
1. Notice that the SET statement concatenates the sashelp.class and
pg2.class_new2 tables. Highlight the DATA step and run the selected
code. What differences do you observe between the first 19 rows and
the last 3 rows?
2. Use the RENAME= data set option to change Student to Name in the
pg2.class_new2 table. Highlight the DATA step and run the selected code.
What warning is issued in the log?
3. Highlight the two PROC CONTENTS steps and run the selected code.
What is the length of Name in sashelp.class and Student in
pg2.class_new2?

Setting Column Attributes

data class_current;
set sashelp.class
pg2.class_new2(rename=(Student=Name));
run;

PDV
Name    ...other columns...
$8

Column lengths are determined by the first table listed in the SET statement.

When multiple tables are listed in a SET statement, as the program is compiled, columns from the
first table are added to the PDV with their corresponding attributes. When SAS reads the second
table in the SET statement, any columns already in the PDV are not changed. In other words, the
LENGTH is already set and cannot be modified.
Note: If columns with the same name in multiple tables have a different type (character or
numeric), the DATA step fails.

Merging Column Attributes

data class_current;
length Name $ 9;
set sashelp.class
pg2.class_new2(rename=(Student=Name));
run;

PDV
Name    ...other columns...
$9

Use a LENGTH statement before the SET statement to establish an appropriate length.

Conflicting column lengths can be solved by using the LENGTH statement. Here we are explicitly
defining the column Name to be character with a length of 9. Notice that the LENGTH statement
comes before the SET statement so that it will establish the attributes of Name in the PDV.
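
The effect can be reproduced with a small sketch. The tables work.roster_old and work.roster_new and their lengths are hypothetical stand-ins for sashelp.class and pg2.class_new2.

```sas
/* Hypothetical tables in which Name has different lengths */
data work.roster_old;
    length Name $ 8;
    input Name $;
    datalines;
Alfred
;
run;

data work.roster_new;
    length Name $ 12;
    input Name $;
    datalines;
Christopher
;
run;

/* Name takes length 8 from the first table, so Christopher is truncated */
data work.roster_truncated;
    set work.roster_old work.roster_new;
run;

/* LENGTH before SET establishes Name as $12 in the PDV */
data work.roster_full;
    length Name $ 12;
    set work.roster_old work.roster_new;
run;

proc contents data=work.roster_full;
run;
```

Placing the LENGTH statement after the SET statement would not help, because the attributes of Name would already be set from the first table during compilation.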

Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
1. Concatenating Like-Structured Tables
Create a table that contains monthly public use statistics for 2015 and 2016 from the National
Park Service.
a. Open the p205p01.sas program in the practices folder. Complete the SET statement to
concatenate the pg2.np_2015 and pg2.np_2016 tables to create a new table, np_combine.
b. Use a WHERE statement to include only rows where Month is 6, 7, or 8.
c. Create a new column named CampTotal that is the sum of CampingOther, CampingTent,
CampingRV, and CampingBackcountry. Format the new column with commas.
Note: Use a column list to specify that all columns beginning with Camping be included as
arguments in the SUM function.

Level 2
2. Concatenating Unlike-Structured Tables
Create a table that contains monthly public use statistics for 2014, 2015, and 2016 from the
National Park Service.
a. Complete the Level 1 practice or open and run the p205p01_s.sas program in the
practices/solutions folder.
b. Open the pg2.np_2014 table and compare the column names with the np_combine table.
Which column or columns in np_2014 must be renamed to match columns in np_combine?
c. Modify the DATA step to concatenate the pg2.np_2014, pg2.np_2015, and pg2.np_2016
tables. Rename the columns as necessary to align the columns with similar values.
d. In addition to filtering rows by Month, also include only rows where ParkType is National
Park.

e. Arrange the newly created table in ascending order by ParkType, ParkCode, Year, and
Month.

5.2 Merging Tables

Merging (or joining) tables is required when columns from multiple tables need to be combined
into a single table. In the SAS Programming 1 class, we introduced the PROC SQL method for
joining tables, but that is just one option available to you. The DATA step also enables you to
merge tables in an efficient and logical process. Because the DATA step and PROC SQL use very
different methods for combining data behind the scenes, each has unique strengths. It is helpful
to know both methods so that you can choose the most efficient syntax for your situation. In this
course, we focus on the DATA step merge.

Merging Tables

one-to-one          nonmatching rows    one-to-many
A B C   C D E       A B C   C D E       A B C   C D E
    1   1               1   2               1   1
    2   2               2   3               2   1
    3   3               4   4                   2

A merge, or join, is made by matching values in a common column in the tables. There are several
scenarios that could arise. You could have a one-to-one merge in which each row in one table
matches with a single row in the other table. You could have a one-to-many merge in which each
row from one table matches with one or more rows from the other table. And you could also have
non-matches in which rows from one table do not have a match in the other table. Each of these
scenarios can be handled with the DATA step merge.

Discussion

Suppose you have two lists of people: those who are invited to a party and
those who are attending. You have to manually match equivalent names.
• How would you do this if the names are in random order?
• How would the process change if the names are in alphabetical order?

Random order              Alphabetical order
Invited     Attending     Invited     Attending
Drew        Caroline      Caroline    Caroline
Lani        Drew          Drew        Drew
Mansfield   Michael       George      Kristin
Caroline    Lani          Kristin     Lani
Kristin     Kristin       Lani        Michael
Michael                   Mansfield
George                    Michael

Merging Tables

[Slide shows the class and class_teachers tables side by side.]

With the input tables ordered by Name, SAS can compare rows sequentially to
efficiently match rows.

Suppose you want to combine class and class_teachers into a single table. Notice that the Name
column is in both tables, which is how matching rows are determined. This is a one-to-one merge,
because each Name has a corresponding match in both tables. Also notice that both tables are in
sorted order by Name.
The DATA step merge process is very similar to how most people envision matching two lists by
hand if the values are in sorted order. SAS simply compares rows sequentially as it reads from the
multiple tables, matching rows based on the value of the common column, such as Name or ID.

Merging Tables

DATA output-table;
MERGE input-table1 input-table2 ...;
BY BY-column(s);
RUN;

MERGE: List any number of input tables with one or more common columns.
BY: List the common column or columns.
The input tables must be sorted by the column (or columns) listed in the BY statement.

In the DATA step, you use a MERGE statement instead of a SET statement. You can list multiple
tables in the MERGE statement, as long as each table has the common matching column. That
matching column is then listed in the BY statement. Whenever a BY statement is used in a DATA
step, the input tables must already be sorted by the BY column or columns. Typically, the DATA
step is preceded by PROC SORT steps that arrange the rows of the input tables by the matching
column.
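
The sort-then-merge sequence might look like the sketch below. The tables work.students and work.teachers and their rows are hypothetical and are entered out of order on purpose so that the PROC SORT steps are required.

```sas
/* Hypothetical unsorted input tables that share the Name column */
data work.students;
    input Name $ Age;
    datalines;
Carol 14
Alfred 14
Barbara 13
;
run;

data work.teachers;
    input Name $ Teacher $;
    datalines;
Barbara Evans
Alfred Thomas
Carol Thomas
;
run;

/* Sort each table by the matching column; OUT= leaves the originals unchanged */
proc sort data=work.students out=work.students_sort;
    by Name;
run;

proc sort data=work.teachers out=work.teachers_sort;
    by Name;
run;

/* Merge the sorted tables by the common column */
data work.students_teachers;
    merge work.students_sort work.teachers_sort;
    by Name;
run;
```

Merging the unsorted tables directly with the same BY statement would stop with an error in the log because the BY values are not in sorted order.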

Merging Tables

data class2;
merge sashelp.class pg2.class_teachers;
by Name;
run;

Columns from sashelp.class and pg2.class_teachers are combined in the new table
(class2) by matching values of Name.

Program: p205d02

The DATA step includes the MERGE statement, listing the sorted tables. The BY statement identifies
Name as the common column in the two tables that will be used to determine matching rows.

Merging Tables: Compilation

data class2;
merge sashelp.class pg2.class_teachers;
by Name;
run;

All columns from the first table are added to the PDV.

PDV
Name  Sex  Age  Height  Weight

In the compilation phase, all the columns from the first table listed in the MERGE statement and their
attributes are added to the PDV.

Merging Tables: Compilation

data class2;
merge sashelp.class pg2.class_teachers;
by Name;
run;

Additional columns from the second table are added to the PDV. The BY column is
already in the PDV.

PDV
Name  Sex  Age  Height  Weight  Grade  Teacher

SAS then examines the second table in the MERGE statement. The BY column is already in the
PDV, but any additional columns and their attributes are added to the PDV. If there are any additional
statements in the DATA step that create new variables, they are also added to the PDV, and any
other compile-time statements are processed.

Merging Tables: Execution

data class2;
merge sashelp.class pg2.class_teachers;
by Name;
run;

Rows are read sequentially from both tables. When the BY values match, they are
both read into the PDV.

PDV
Name    Sex  Age  Height  Weight  Grade  Teacher
Alfred  M    14   69      112.5   8      Thomas

In the execution phase, SAS begins by examining the BY column value for the first row in each table.
If they match, then both rows are read into the PDV, additional statements are executed, and at the
RUN statement, the row is written to the output table. SAS returns to the top of the DATA step for the
next iteration, and advances to row 2 in both tables. Again, Name matches, so both rows are read
into the PDV, and so on. That sequential comparison process continues until all rows are read from
each table listed in the MERGE statement.

5.02 Activity
Open p205a02.sas from the activities folder and perform the following tasks:
1. Highlight the two PROC SORT steps and run the selected code. How
many rows per Name are in the teachers_sort and test2_sort tables?
2. Complete the DATA step to merge the sorted tables by Name. Run the
DATA step and examine the log and results. How many rows are in the
output table?

One-to-Many Merge

data class2;
merge teachers_sort test2_sort;
by name;
run;

The BY values match, and both rows are read into the PDV.

PDV
Name    Grade  Teacher  Subject  TestScore  _N_
Alfred  8      Thomas   Math     82         1

How does SAS process this one-to-many merge in the execution phase? SAS reads the first rows in
each table, finds a BY-value match, and writes values from both tables to the PDV. The RUN
statement triggers an implicit output and implicit return to the top of the DATA step. Because all
columns in the PDV are read via the MERGE statement, all values are automatically retained.

One-to-Many Merge

data class2;
merge teachers_sort test2_sort;
by name;
run;

The BY values do not match, but one value matches the PDV. That row is read into
the PDV and overwrites previous values.

PDV
Name    Grade  Teacher  Subject  TestScore  _N_
Alfred  8      Thomas   Reading  79         2

For the next row in each table, the BY values do not match, so SAS checks to see whether either BY
value matches the current contents of the PDV. In this case, Alfred in the test2_sort table matches
the current value for Name in the PDV. SAS reads the row from test2_sort and overwrites the
previous values for Name, Subject, and TestScore. The values for Grade and Teacher from the
first table are retained. Again, the RUN statement triggers an implicit output and an implicit return.

One-to-Many Merge

data class2;
merge teachers_sort test2_sort;
by name;
run;

Neither BY value matches the PDV.

PDV
Name    Grade  Teacher  Subject  TestScore  _N_
Alfred  8      Thomas   Math     82         3

On the next iteration, neither value for Name matches what is currently in the PDV.

One-to-Many Merge

data class2;
merge teachers_sort test2_sort;
by name;
run;

The PDV is reset to missing values when a new BY group begins.

PDV
Name    Grade  Teacher  Subject  TestScore  _N_
        .                        .          3

When SAS recognizes that it has completed reading all values for a BY group, the PDV is
reinitialized and all columns are set to missing.

One-to-Many Merge

data class2;
merge teachers_sort test2_sort;
by name;
run;

The BY values match, and both rows are read into the PDV.

PDV
Name    Grade  Teacher  Subject  TestScore  _N_
Alice   7      Evans    Math     71         3

Next, the BY values are compared again, and because we have another match, the values from both
tables are written to the PDV.
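
To watch this sequence in the log, a PUTLOG statement can be added to the DATA step. The sketch below is an illustration only: the table names work.teachers_demo, work.tests_demo, and work.scores_demo are hypothetical, and the rows reuse the values shown on the preceding slides.

```sas
/* One row per student (already in order by Name) */
data work.teachers_demo;
    input Name $ Grade Teacher $;
    datalines;
Alfred 8 Thomas
Alice 7 Evans
;
run;

/* One or more rows per student (already in order by Name) */
data work.tests_demo;
    input Name $ Subject $ TestScore;
    datalines;
Alfred Math 82
Alfred Reading 79
Alice Math 71
;
run;

/* PUTLOG writes the PDV contents to the log at the end of each iteration */
data work.scores_demo;
    merge work.teachers_demo work.tests_demo;
    by Name;
    putlog _N_= Name= Grade= Teacher= Subject= TestScore=;
run;
```

The second log line shows that Grade and Teacher are retained for Alfred while Subject and TestScore are overwritten by the second test row.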

Merging Tables

Scenario
Use the DATA step MERGE statement to combine two tables with matching rows.

Files
• p205d02.sas
• storm_summary – a SAS table that contains one row per storm for the 1980 through 2016 storm
seasons
• storm_basincodes – a SAS table that includes each two-letter basin code and the corresponding
full basin name

Syntax
If data needs to be sorted prior to the merge:

PROC SORT DATA=input-table OUT=output-table;
BY BY-column;
RUN;

DATA output-table;
MERGE input-table1 input-table2 ...;
BY BY-column(s);
RUN;

Notes
• Any tables listed in the MERGE statement must be sorted by the same column (or columns) listed
in the BY statement.
• The MERGE statement combines rows where the BY-column values match.
• This syntax merges multiple tables in both one-to-one and one-to-many situations.

Demo
1. Open the p205d02.sas program in the demos folder and find the Demo section. Highlight the
two PROC SORT steps and run the selected code. Examine the sorted tables, including the
number of rows in each. Notice that both tables include a column representing basin codes.
However, the column is named Basin in the storm_sort table and BasinCode in the
basincodes_sort table.
Note: The storm_basincodes table serves as a lookup table that includes the two-letter basin
codes and the corresponding basin name.

2. To combine the BasinName column with the columns in the storm_summary table, the tables
need to be merged. Complete the MERGE statement. Use the RENAME= data set option to
rename the BasinCode column as Basin in the basincodes_sort table. Add a BY statement to
combine the sorted tables by Basin.
Note: This will be a one-to-many merge.
data storm_summary2;
merge storm_sort basincodes_sort(rename=(BasinCode=Basin));
by Basin;
run;
3. Run the program and examine the storm_summary2 table. Notice that the BasinName values
have been matched with each of the Basin code values.

4. Scroll to the end of the storm_summary2 table. Notice that when the value of Basin is
lowercase na, the values for BasinName are missing. This is because lowercase na occurs only
in the storm_sort table and not in basincodes_sort.
Note: To view the end of the table in SAS Studio, click the Last page toolbar button.
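
One possible way to handle the lowercase values, not shown in the demo program, is to standardize the case of Basin with the UPCASE function before sorting and merging. The sketch below uses small hypothetical tables (work.storm_demo and work.basin_demo) with invented rows, and it assumes that the lookup table stores the codes in uppercase.

```sas
/* Hypothetical storm rows with inconsistent case in Basin */
data work.storm_demo;
    length Name $ 8 Basin $ 2;
    input Name $ Basin $;
    datalines;
Anna NA
Bert na
;
run;

/* Hypothetical lookup table with uppercase codes */
data work.basin_demo;
    infile datalines dlm=',';
    length BasinCode $ 2 BasinName $ 20;
    input BasinCode $ BasinName $;
    datalines;
NA,North Atlantic
;
run;

/* Standardize the case so that na and NA form one BY group */
data work.storm_clean;
    set work.storm_demo;
    Basin = upcase(Basin);
run;

proc sort data=work.storm_clean out=work.storm_sort_demo;
    by Basin;
run;

proc sort data=work.basin_demo out=work.basin_sort_demo;
    by BasinCode;
run;

data work.storm_named;
    merge work.storm_sort_demo
          work.basin_sort_demo(rename=(BasinCode=Basin));
    by Basin;
run;
```

In this sketch both storms receive a BasinName value because the BY values match exactly after the case is standardized.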

Discussion
A lookup table can be used to build a
custom format, or it can be merged with
another table to include labels. What are
the advantages of each technique?

5.3 Identifying Matching and Nonmatching Rows

Merging Tables with Nonmatching Rows

data class2;
merge pg2.class_update pg2.class_teachers;
by name;
run;

The new table (class2) includes matches and nonmatches from class_update and
class_teachers.

When you merge tables together, you might have some rows that do not have a corresponding
match in the other table. What happens to those nonmatching rows in the DATA step merge?
Suppose we have had some changes in our class: Carol moved out and David moved in, and those
names have been changed in the class_update table. Let’s go back to our simple one-to-one merge
and combine class_update with class_teachers. Class_teachers has not been updated to reflect
the student change yet, so David is in class_update but not in class_teachers, and Carol is in
class_teachers but not class_update. What will the combined table look like? Although both of the
input tables have 19 rows, the output table class2 has 20 rows because it includes both Carol and
David. The rows for Carol and David include missing values for the columns in the tables where they
were not included.

Merging Tables: Execution

data class2;
merge pg2.class_update pg2.class_teachers;
by name;
run;

The BY values do not match. SAS reinitializes the PDV and reads the next row in
sequence.

PDV
Name   Sex  Age  Height  Weight  Grade  Teacher
Carol       .    .       .       8      Thomas

What happens during execution? Assume that SAS has just finished reading the matching rows
where Name is Barbara and output a row to the new table. The next BY values are examined, and
because neither value matches Barbara, the entire PDV is set to missing values. SAS then
compares BY values in the two tables and finds that they do not match, so it reads the row from the
table with the BY value that comes first in sorted sequence (Carol before David) and writes it to the
PDV. The rest of the columns remain as missing values when the row is written to the new table.

Merging Tables: Execution

data class2;
merge pg2.class_update pg2.class_teachers;
by name;
run;

The BY values do not match. SAS reinitializes the PDV and reads the next row in
sequence.

PDV
Name   Sex  Age  Height  Weight  Grade  Teacher
David  M    11   55.3    73      .


In the next iteration of the DATA step, SAS is still on the row for David in class_update because it
has not been read yet. SAS compares David to the next unread row in class_teachers, which is
Henry. Either David or Henry would represent a new BY group, so the PDV is reset to missing
values. Between the two values, David comes first, so the row from class_update is read into the
PDV, and the values for Grade and Teacher remain as missing in the PDV.

Merging Tables: Execution


data class2;
    merge pg2.class_update pg2.class_teachers;
    by name;
run;

Sequential matching continues until all rows are read from each input table.

PDV
Name    Sex   Age   Height   Weight   Grade   Teacher
Henry   M     14    63.5     102.5    8       Thomas

In class_update, the next unread row is Henry, and there is a match to Henry in class_teachers.
This pattern of sequential reading of the rows in the input tables based on the BY column values
continues until all rows are read from each table.
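
The sequence described above can be reproduced on a small scale. The following sketch assumes two hypothetical miniature tables, work.mini_update and work.mini_teachers, with made-up Age and Grade values. They only imitate the structure of class_update and class_teachers and are not part of the course data.

```sas
/* Hypothetical stand-in for class_update (already in Name order) */
data work.mini_update;
    input Name $ Age;
    datalines;
Barbara 13
David 11
Henry 14
;
run;

/* Hypothetical stand-in for class_teachers (already in Name order) */
data work.mini_teachers;
    input Name $ Grade;
    datalines;
Barbara 7
Carol 8
Henry 8
;
run;

/* Matches and nonmatches are both written to the output table */
data work.mini_class2;
    merge work.mini_update work.mini_teachers;
    by Name;
run;

proc print data=work.mini_class2;
run;
```

If the steps run as written, work.mini_class2 should contain four rows: Barbara and Henry with values from both tables, Carol with a missing Age, and David with a missing Grade.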


Merging Tables with Nonmatching Rows

DATA output-table;
    MERGE input-table1(IN=variable)
          input-table2(IN=variable) ...;
    BY BY-column(s);
RUN;

The IN= data set option can be used to identify matching and nonmatching rows.

The output table that you create when you merge tables with nonmatching rows might not be quite
what you want. Suppose you want to include only students who are in both tables. Or, suppose you
want to identify students that are missing in one of the tables. The IN= data set option creates
temporary variables in the PDV that you can use to flag matching or nonmatching rows.


Merging Tables with Nonmatching Rows


data class2;
    merge pg2.class_update(in=inUpdate)
          pg2.class_teachers(in=inTeachers);
    by name;
run;

PDV
Name   Sex   Age   Height   Weight   Grade   Teacher   inUpdate (D)   inTeachers (D)

The IN= variables are 0 if the BY value is not in the corresponding input table and 1 if the BY value is in the corresponding input table.

p205d03

The IN= data set option follows one or more tables in the MERGE statement, and it names a
temporary variable that will be added to the PDV. The IN= variables are included in the PDV during
execution, but they are not written to the output table. Each IN= variable is associated with the
particular table that the option follows. During execution, the IN= variables are assigned a value of 0
or 1. A value of 0 means that table did not include the BY-column value for that row, and 1 means
that it did include the BY-column value.
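
Because the IN= variables are not written to the output table, one way to inspect them is to copy their values into ordinary columns. The sketch below assumes the same pg2 tables. The table name class2_flags and the columns FromUpdate and FromTeachers are hypothetical names chosen for illustration.

```sas
data work.class2_flags;
    merge pg2.class_update(in=inUpdate)
          pg2.class_teachers(in=inTeachers);
    by name;
    /* IN= variables are temporary, so copy them to regular columns */
    /* FromUpdate and FromTeachers are hypothetical column names    */
    FromUpdate=inUpdate;
    FromTeachers=inTeachers;
run;
```

The two new columns should hold 0 or 1 on every row, which makes the matching status of each Name visible in the output table.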


Merging Tables with Nonmatching Rows


data class2;
    merge pg2.class_update(in=inUpdate)
          pg2.class_teachers(in=inTeachers);
    by name;
run;

Values are assigned in the PDV during execution but not written to the output table.

How can we include only matching rows in the output table?

Let’s see what these IN= variable values would look like for each row in our class2 table. Notice that
when Name was read from both input tables, inUpdate and inTeachers are both 1. Carol is only in the
class_teachers table, so inUpdate is 0 and inTeachers is 1. The opposite is true for David. Notice that
missing values are assigned for the columns where there was no data read from one of the input
tables. The values of the IN= variables can be used to subset the output table.

5.03 Multiple Choice Question


Which statement writes only matching rows to the output table?

data class2;
    merge pg2.class_update(in=inUpdate)
          pg2.class_teachers(in=inTeachers);
    by name;
    ???
run;

a. where inUpdate=1 and inTeachers=1;
b. where inUpdate=1 or inTeachers=1;
c. if inUpdate=1 and inTeachers=1;
d. if inUpdate=1 or inTeachers=1;


Merging Tables with Nonmatching Rows

Scenario
Use the DATA step MERGE statement to combine two tables and identify nonmatching rows.

Files
• p205d03.sas
• storm_final – a SAS table that contains one row per storm for the 1980 through 2017 storm
seasons. The data has been cleaned and prepared previously using the DATA step.
• storm_damage – a SAS table that includes a description and damage estimates for storms in the
US with damages greater than one billion dollars.

Syntax

DATA output-table;
    MERGE input-table1(IN=var1) input-table2(IN=var2) ...;
    BY BY-column(s);
RUN;

Notes
• By default, both matches and nonmatches are written to the output table in a DATA step merge.
• The IN= data set option follows a table in the MERGE statement and names a variable that will be
added to the PDV. The IN= variables are included in the PDV during execution, but they are not
written to the output table. Each IN= variable relates to the table that the option follows.
• During execution, the IN= variable is assigned a value of 0 or 1. 0 means that the corresponding
table did not include the BY column value for that row, and 1 means that it did include the
BY-column value.
• The subsetting IF or IF-THEN logic can be used to subset rows based on matching or
nonmatching rows.
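
As an illustration of the last note, a subsetting IF can also isolate nonmatches. This is a minimal sketch that reuses the class tables from earlier in this section. The output table name update_only is a hypothetical name.

```sas
data work.update_only;
    merge pg2.class_update(in=inUpdate)
          pg2.class_teachers(in=inTeachers);
    by name;
    /* Keep rows whose Name is in class_update but not in class_teachers */
    if inUpdate=1 and inTeachers=0;
run;
```

With the class tables as described earlier, this should leave only the row for David, because he is the one student in class_update who is not in class_teachers.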

Demo
1. Open the p205d03.sas program in the demos folder and find the Demo section. Highlight the
first PROC SORT step and run the selected code. A table named storm_final_sort is created,
arranged by Season and Name. Because some storm names have been used more than once,
unique storms are identified by both Season and Name.
Note: Storm names are in uppercase.
2. Open pg2.storm_damage. Notice that it does not include the columns Season and Name,
which are in storm_final_sort. Season and Name must be derived from the Date and Event
columns.
3. Examine the DATA step that creates a temporary table named storm_damage. SAS functions
are used to create Season and Name with values that match the values in the storm_final_sort
table. Highlight the DATA step and the PROC SORT step that follows it, and run the selection.


4. Complete the final DATA step to merge the sorted tables by Season and Name. Highlight the
DATA step and run the selection. Notice that row 4 of the output table is storm Allen, which is
included in the storm_damage table. Therefore, that row has values read from both
input tables. Most of the values in the Cost column are missing because those storms are not
found in the storm_damage table.
data damage_detail;
    merge storm_final_sort storm_damage;
    by Season Name;
    keep Season Name BasinName MaxWindMPH MinPressure Cost;
run;
5. Use the IN= data set option after the storm_damage table to create a temporary variable named
inDamage that flags rows where Season and Name were read from the storm_damage table.
Add a subsetting IF statement to write the 38 rows from storm_damage and the corresponding
data from storm_final_sort to the output table. Highlight the DATA step and run the selection.
data damage_detail;
    merge storm_final_sort storm_damage(in=inDamage);
    by Season Name;
    if inDamage=1;
    keep Season Name BasinName MaxWindMPH MinPressure Cost;
run;


5.04 Activity
Open p205a04.sas from the activities folder and perform the following tasks:
1. Modify the final DATA step to create an additional table named
storm_other that includes all nonmatching rows.
2. Drop the Cost column from the storm_other table only.
3. How many rows are in the storm_other table?


Columns in Both Tables

weather_sanfran2016 and weather_sanfran2017

What happens if we merge these tables by Month?


Columns in Both Tables

data weather_sanfran;
    merge pg2.weather_sanfran2016 pg2.weather_sanfran2017;
    by month;
run;

PDV
Month          AvgTemp
01 - January   52.7

The 2016 value for AvgTemp is written to the PDV.

When columns with the same name are in the input tables, SAS reads the values from the tables
into the PDV according to the order in the MERGE statement. In this example, the BY value Month
matches for the first row in both tables. The weather_sanfran2016 table is listed first in the MERGE
statement, so the 2016 value for average temperature is read first into the PDV.

Columns in Both Tables

data weather_sanfran;
    merge pg2.weather_sanfran2016 pg2.weather_sanfran2017;
    by month;
run;

PDV
Month          AvgTemp
01 - January   50.6

AvgTemp is overwritten with the value from 2017 before the row is written to output.

Then the 2017 value for average temperature overwrites the 2016 value before the implicit output at
RUN.


Columns in Both Tables


data weather_sanfran;
    merge pg2.weather_sanfran2016(rename=(AvgTemp=AvgTemp2016))
          pg2.weather_sanfran2017(rename=(AvgTemp=AvgTemp2017));
    by month;
run;

Use the RENAME= data set option to give each column a unique name.

In order to ensure that both average temperature columns from the 2016 and 2017 tables are
included in the result, we must use the RENAME= data set option to give each column a unique
name.
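
Once both columns have unique names, they can be used together in the same DATA step. The following is a minimal sketch under that assumption. TempChange is a hypothetical column added only for illustration and is not part of the course example.

```sas
data weather_sanfran;
    merge pg2.weather_sanfran2016(rename=(AvgTemp=AvgTemp2016))
          pg2.weather_sanfran2017(rename=(AvgTemp=AvgTemp2017));
    by month;
    /* TempChange is a hypothetical column: 2017 value minus 2016 value */
    TempChange=AvgTemp2017-AvgTemp2016;
run;
```

Based on the January values shown on the slides (52.7 for 2016 and 50.6 for 2017), TempChange should be roughly -2.1 for the first row.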

Merging Tables without a Common Column

class_update, class_teachers, and class_rooms

What if you don’t have a common column in all the tables that you want to merge?

What if you want to merge more than two tables together, but not all of the tables have the same
matching column? In this example, class_update and class_teachers can be merged by Name.
The class_rooms table does not have a Name column, but it does have a common column
(Teacher) with the class_teachers table. How can we merge all three of these tables?


Merging Tables without a Common Column

class_update (sorted by Name) and class_teachers (sorted by Name): merge by Name

data update_teachers;
    merge pg2.class_update
          pg2.class_teachers;
    by Name;
run;

Result: update_teachers (intermediate output)

To use a DATA step merge, we must break the process into multiple steps. First, we sort two of the
tables by their common column and merge them to create an intermediate table. In our example,
class_update and class_teachers are already sorted by Name, so we can merge them by Name to
create the intermediate table, update_teachers.

Merging Tables without a Common Column

update_teachers (intermediate output): sort by Teacher

proc sort data=update_teachers;
    by Teacher;
run;

Result: update_teachers (sorted intermediate output)

Next we need to sort the intermediate table and the third table by the common column in the third
table. In our example, the class_rooms table is already sorted by the common column Teacher, so
we just need to sort the intermediate table, update_teachers, by Teacher.


Merging Tables without a Common Column

update_teachers (sorted intermediate output) and class_rooms: merge by Teacher

data class_combine;
    merge update_teachers
          pg2.class_rooms;
    by Teacher;
run;

Result: class_combine (final output)

Finally, we can merge the sorted intermediate table, update_teachers, and the third table,
class_rooms, to get the final output table.

Merging Multiple Tables with PROC SQL


proc sql;
    create table class_combine as
    select u.*, t.Teacher, r.Room
        from pg2.class_update as u
             inner join pg2.class_teachers as t
             on u.Name=t.Name
             inner join pg2.class_rooms as r
             on t.Teacher=r.Teacher;
quit;

PROC SQL can join multiple tables without a common column in one query.


Creating Multiple Tables with the DATA Step


data damage_detail
     storm_other(drop=Cost);
    merge storm_final_sort(in=inFinal)
          storm_damage(in=inDamage);
    by Season Name;
    if inDamage=1 and inFinal=1
        then output damage_detail;
    else output storm_other;
run;

PROC SQL can’t write output to multiple tables in a single query.


DATA Step Merge and PROC SQL Join


DATA step merge
• requires sorted input data
• efficient, sequential processing
• can create multiple tables for matches and nonmatches in one step
• provides additional complex data processing syntax

PROC SQL join
• does not require sorted data
• matching columns do not need the same name
• easy to define complex matching criteria between multiple tables in a single query
• can be used to create a Cartesian product for many-to-many joins

In certain scenarios, there are alternative methods for combining tables that might be more efficient
or have simpler code. In the SAS Programming 1 course, we introduced joining tables in SQL.
Because the DATA step merge and PROC SQL use different methods for processing data, they each
have unique strengths.
For example, if you want to create multiple tables including matches and nonmatches, this can be
accomplished in a single DATA step. SQL would require multiple queries. However, if you are joining
three tables without a single common column, this can be done in just one query using PROC SQL.
It is extremely valuable as a SAS programmer to know both.
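
To illustrate the point that matching columns do not need the same name in a PROC SQL join, here is a hedged sketch. It assumes the pg2.np_2016traffic table identifies parks in a column named Code and the pg2.np_codelookup table uses ParkCode and ParkName, as the practices in this lesson describe. The output table name trafficStats_sql is hypothetical.

```sas
proc sql;
    /* Unsorted inputs and differently named key columns are both fine here */
    create table work.trafficStats_sql as
    select t.*, c.ParkName
        from pg2.np_2016traffic as t
             inner join pg2.np_codelookup as c
             on t.Code=c.ParkCode;
quit;
```

Note that an inner join returns only matching rows, whereas a DATA step merge writes matches and nonmatches by default, so the two approaches are not guaranteed to produce the same number of rows.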


Beyond SAS Programming 2


What if you want to ...

. . . use PROC SQL to join tables?
• Take the SAS SQL 1: Essentials course.
• Read the book PROC SQL by Example.

. . . compare the DATA step merge and the PROC SQL join?
• Read the blog post Life saver tip for comparing PROC SQL join with SAS data step merge.
• Read the paper MERGING vs. JOINING: Comparing the DATA Step with SQL.

. . . view examples of different methods for combining data?
• Read the book Combining and Modifying SAS Data Sets: Examples.
• Take the SAS Programming 3: Advanced Techniques and Efficiencies course.


Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
3. Performing a One-to-Many Merge
The pg2.np_2016traffic table contains monthly traffic statistics from the National Park Service
for parks. Create a table that contains the monthly traffic statistics from the pg2.np_2016traffic
table and adds a column for the park name. Park name values can be found in the matching
pg2.np_codelookup table.
a. Open the p205p03.sas program in the practices folder. Submit the two PROC SORT steps.
Determine the name of the common column in the sorted tables.
b. Modify the second PROC SORT step to use the RENAME= option after the
pg2.np_2016traffic table to rename Code to ParkCode. Modify the BY statement to sort by
the new column name.
Note: You could also rename the column in the DATA step after the table in the MERGE
statement.
c. Write a DATA step to merge the sorted tables by the common column to create a new table,
work.trafficStats. Drop the Name_Code column from the output table.


Level 2
4. Writing Matches and Nonmatches to Separate Tables
The pg2.np_2016 table contains monthly public use statistics from the National Park Service for
parks by ParkCode. The pg2.np_codelookup table contains the full name for each ParkCode
value. Create a table, parkStats, that contains all park codes found in the np_2016 table. Create
a second table, parkOther, that contains ParkCode values in the np_codelookup table, but not
in the np_2016 table.
a. Determine the name of the common column in the pg2.np_codelookup and pg2.np_2016
tables.
b. Create a new program. Ensure that the data in both tables is sorted by the matching column.
c. Using a DATA step, merge the pg2.np_codelookup and pg2.np_2016 tables to create two
new tables:
1) The work.parkStats table should contain only ParkCode values that are in the np_2016
table. Keep only the ParkCode, ParkName, Year, Month, and DayVisits columns.
2) The work.parkOther table should contain all other rows. Keep only the ParkCode and
ParkName columns.
work.parkStats

work.parkOther


Challenge
5. Combining Multiple Tables with Different Matching Columns
Merge the pg2.np_codelookup, pg2.np_final, and pg2.np_species tables to create a table
that contains information about the common birds found at locations that have more than
5,000,000 visitors a year.
a. Open the p205p05.sas program in the practices folder. The first three steps sort and merge
the pg2.np_codelookup and pg2.np_final tables. Highlight the first two PROC SORT steps
and the DATA step and run the selected code. Examine the highuse table.
b. Add a subsetting IF statement in the DATA step to output only the rows in which DayVisits is
greater than or equal to 5,000,000. Highlight the DATA step and run the selected code. Why
must you use IF instead of a WHERE statement?
c. Run the final PROC SORT step to sort and subset the pg2.np_species table. Compare the
columns in the output birds table with the highuse table to determine the matching column.
d. Add a PROC SORT step to sort the highuse table by the matching column in the birds
table.
e. Add a DATA step to merge the highuse and birds tables and create a table named
birds_largepark. Include in the output table only ParkCode values that are in the highuse
table.


5.4 Solutions
Solutions to Practices
1. Concatenating Like-Structured Tables
data work.np_combine;
    set pg2.np_2015 pg2.np_2016;
    CampTotal=sum(of Camping:);
    where Month in(6, 7, 8);
    format CampTotal comma15.;
    drop Camping:;
run;
2. Concatenating Unlike-Structured Tables
data work.np_combine;
    set pg2.np_2014(rename=(Park=ParkCode Type=ParkType))
        pg2.np_2015
        pg2.np_2016;
    CampTotal=sum(of Camping:);
    where Month in(6, 7, 8) and ParkType="National Park";
    format CampTotal comma15.;
    drop Camping:;
run;

proc sort data=np_combine;
    by ParkType ParkCode Year Month;
run;
3. Performing a One-to-Many Merge
proc sort data=pg2.np_codelookup out=work.codesort;
    by ParkCode;
run;

proc sort data=pg2.np_2016traffic(rename=(Code=ParkCode))
          out=work.traf2016Sort;
    by ParkCode month;
run;

data work.trafficStats;
    merge work.traf2016Sort
          work.codesort;
    by ParkCode;
    drop Name_Code;
run;


4. Writing Matches and Nonmatches to Separate Tables


proc sort data=pg2.np_CodeLookup
          out=work.sortedCodes;
    by ParkCode;
run;

proc sort data=pg2.np_2016
          out=work.sorted_code_2016;
    by ParkCode;
run;

data work.parkStats(keep=ParkCode ParkName Year Month DayVisits)
     work.parkOther(keep=ParkCode ParkName);
    merge work.sorted_code_2016(in=inStats) work.sortedCodes;
    by ParkCode;
    if inStats=1 then output work.parkStats;
    else output work.parkOther;
run;
5. Combining Multiple Tables with Different Matching Columns
Why must you use IF instead of a WHERE statement?
You must use a subsetting IF statement because the DayVisits column is in only one of
the tables in the MERGE statement.
proc sort data=pg2.np_codelookup
          out=sortnames(keep=ParkName ParkCode);
    by ParkName;
run;

proc sort data=pg2.np_final out=sortfinal;
    by ParkName;
run;

data highuse(keep=ParkCode ParkName);
    merge sortfinal sortnames;
    by ParkName;
    if DayVisits ge 5000000;
run;

proc sort data=pg2.np_species
          out=birds(keep=ParkCode Species_ID Scientific_Name
                         Common_Names);
    by ParkCode Species_ID;
    where Category='Bird' and Abundance='Common';
run;

proc sort data=highuse;
    by ParkCode;
run;


data work.birds_largepark;
    merge birds highuse(in=inPark);
    by ParkCode;
    if inPark=1;
run;


Solutions to Activities and Questions


5.01 Activity – Correct Answer
1. What differences do you observe between the first 19 rows and the last
3 rows?
Student names are in separate columns. Height and Weight are missing.
2. Use the RENAME= data set option to change Student to Name in the
pg2.class_new2 table. Highlight the DATA step and run the selected code.
What warning is issued in the log?
data class_current;
    set sashelp.class
        pg2.class_new2(rename=(Student=Name));
run;

WARNING: Multiple lengths were specified for the variable Name by input data set(s). This can cause truncation of data.


3. Highlight the two PROC CONTENTS steps and run the selected code.
What is the length of Name in sashelp.class and Student in
pg2.class_new2?
Name has a length of 8. Student has a length of 9.
data class_current;
    set sashelp.class
        pg2.class_new2(rename=(Student=Name));
run;

5.02 Activity – Correct Answer


1. Highlight the two PROC SORT steps and run the selected code. How
many rows per Name are in the teachers_sort and test2_sort tables?
The teachers_sort table has one row for each value of Name, and
test2_sort has two rows for each value of Name.
2. Complete the DATA step to merge the sorted tables by Name. Run the
DATA step and examine the log and results. How many rows are in the
output table?
Thirty-eight rows are in the output table, two rows per Name.
data class2;
    merge teachers_sort test2_sort;
    by Name;
run;


5.03 Multiple Choice Question – Correct Answer


Which statement writes only matching rows to the output table?

data class2;
    merge pg2.class_update(in=inUpdate)
          pg2.class_teachers(in=inTeachers);
    by name;
    if inUpdate=1 and inTeachers=1;
run;

a. where inUpdate=1 and inTeachers=1;
b. where inUpdate=1 or inTeachers=1;
c. if inUpdate=1 and inTeachers=1;
d. if inUpdate=1 or inTeachers=1;

The subsetting IF statement must be used because values for the IN= variables are assigned during execution.

5.04 Activity – Correct Answer


1. Modify the final DATA step to create an additional table named
storm_other that includes all nonmatching rows.
2. Drop the Cost column from the storm_other table only.
3. How many rows are in the storm_other table? 3,054 rows
data damage_detail storm_other(drop=Cost);
    merge storm_final_sort(in=inFinal)
          storm_damage(in=inDamage);
    keep Season Name BasinName MaxWindMPH MinPressure Cost;
    by Season Name;
    if inDamage=1 and inFinal=1 then output damage_detail;
    else output storm_other;
run;

Lesson 6 Processing Repetitive Code

6.1 Using Iterative DO Loops
    Demonstration: Executing an Iterative DO Loop
    Demonstration: Using Iterative DO Loops
    Practice

6.2 Using Conditional DO Loops
    Demonstration: Using Conditional DO Loops
    Demonstration: Combining Iterative and Conditional DO Loops
    Practice

6.3 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

6.1 Using Iterative DO Loops

Processing Repetitive Code


data forecast;
    set sashelp.shoes(rename=(Sales=ProjectedSales));
    Year=1;
    ProjectedSales=ProjectedSales*1.05;
    output;
    Year=2;
    ProjectedSales=ProjectedSales*1.05;
    output;
    Year=3;
    ProjectedSales=ProjectedSales*1.05;
    output;
    keep Region Product Subsidiary Year ProjectedSales;
    format ProjectedSales dollar10.;
run;

DATA step loop

NOTE: There were 395 observations read from the data set SASHELP.SHOES.
NOTE: The data set WORK.FORECAST has 1185 observations and 5 variables

Iterative DO Loop

DO index-column = start TO stop <BY increment>;
    ... repetitive code ...
END;

Iterative DO loops are a good way to eliminate repetitive code.

To eliminate having to type repetitive code, you can use the iterative DO loop. The DO loop starts
with a DO statement and ends with an END statement. During DATA step execution, the iterative DO
loop processes statements between the DO and END statements repetitively, based on the value of
an index column.


Iterative DO Statement

DO index-column = start TO stop <BY increment>;

index-column: column whose value controls DO loop execution

do Year =

The DO statement in an iterative DO loop has logic to specify when to stop looping. The index
column goes immediately after the word DO and names a column whose value controls execution of
the DO loop. The value of this column is incremented at the end of each pass through the DO loop.
The index column is included in the output table unless you drop it. In our example, Year is the index
column.


Iterative DO Statement

DO index-column = start TO stop <BY increment>;

start: the initial value of the index column
stop: the value that the index column must exceed to stop execution of the DO loop

do Year = 1 to 3;

Execution of the DO loop is based on the start value and the stop value. Start is a number or
numeric expression that specifies the initial value of the index column. Stop is a number or numeric
expression that specifies the value that the index column must exceed to stop execution of the DO
loop. The TO keyword is specified between the start and stop values. The DO loop continues to
execute as long as the index column is within the start and stop range.
In this example, we specify a Year start value of 1 and a Year stop value of 3.
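
One way to see the stop rule in action is to write the index value to the log. The sketch below uses DATA _NULL_ and PUT statements, which are not part of the original example. They are used here only so that no table has to be created.

```sas
data _null_;
    do Year = 1 to 3;
        /* Runs while Year is within the start and stop range */
        put "Inside loop: " Year=;
    end;
    /* By now Year has been incremented past the stop value */
    put "After loop: " Year=;
run;
```

The log should show Year=1, Year=2, and Year=3 inside the loop and Year=4 after it, because the loop ends only when the index value exceeds the stop value.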


Iterative DO Statement

DO index-column = start TO stop <BY increment>;

increment: a positive or negative number for incrementing the value of the index column

do Year = 1 to 3;
do Year = 1 to 3 by 1;

If you don't specify an increment, the default is 1.

The increment is based on the value specified after the BY keyword. Each time that the DO loop
executes, the value of the increment is added to the value in the index column, and this continues
until the value exceeds the stop value. If you do not specify the BY keyword and increment, the
default increment is 1. Both of these DO statements produce the same result.
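
The increment can also be negative, in which case the index column counts down and the loop stops once the value falls below the stop value. The following sketch is an illustration only. The index column Count and its start, stop, and increment values are hypothetical.

```sas
data _null_;
    /* Count down from 10 in steps of 4 */
    do Count = 10 to 2 by -4;
        put Count=;
    end;
run;
```

This loop should execute three times, writing Count=10, Count=6, and Count=2 to the log.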

Executing an Iterative DO Loop


data forecast;
    set sashelp.shoes(rename=(Sales=ProjectedSales));
    do Year = 1 to 3;
        ProjectedSales=ProjectedSales*1.05;
        output;
    end;
    keep Region Product Subsidiary Year ProjectedSales;
    format ProjectedSales dollar10.;
run;


The repetitive code of the previous example is now located in a DO loop. For each iteration of the
DATA step, the DO loop executes three times. Each iteration updates the value of Year and
ProjectedSales and outputs a row.


Executing an Iterative DO Loop

Scenario
Use the DATA step debugger in SAS Enterprise Guide to see the execution of an iterative DO loop.

Files
• p206d01a.sas
• p206d01b.sas
• sashelp.shoes – a table supplied by SAS that contains the annual shoe sales for each
combination of Region, Product, and Subsidiary

Syntax

DATA output-table;
...
DO index-column = start TO stop <BY increment>;

. . . repetitive code . . .

END;
...
RUN;

Notes
• The iterative DO loop executes statements between the DO and END statements repetitively,
based on the value of an index column.
• The index-column parameter names a column whose value controls execution of the DO loop.
This column is included in the table that is being created unless you drop it.
• The start value is a number or numeric expression that specifies the initial value of the index
column.
• The stop value is a number or numeric expression that specifies the ending value that the index
column must exceed to stop execution of the DO loop.
• The increment value specifies a positive or negative number to control the incrementing of the
index column. The BY keyword and the increment are optional. If they are omitted, the index
column is increased by 1.


Demo
Note: This demo must be performed in SAS Enterprise Guide.
1. Open p206d01a.sas from the demos folder and find the Demo section of the program. Run the
program and view the Forecast output table. Notice that there are three rows (Year 1, 2, and 3)
for each combination of Region, Product, and Subsidiary.

2. Return to the Program tab and click the DATA step markers for debugging button
to enable debugging in the program if it is not already enabled. Click the Debugger icon next to
the DATA statement. The DATA Step Debugger window appears.

3. Click the Step execution to next line button to execute the highlighted SET statement.
4. Click the button again to execute the highlighted DO statement. Notice that the Year value is 1.
5. Click the button three times to execute the statements inside the DO loop and the END
statement. Notice that the Year value has been incremented to 2 and that processing returns to
the inside of the DO loop.
6. Continue to click the button to execute the highlighted statements inside the DO loop. Observe
the changing of values in the PDV.
7. At the end of the third iteration of the DO loop, notice that the Year value is incremented to 4 and
that processing does not return to the inside of the DO loop.

8. Close the DATA Step Debugger.


Alternative Demo
Note: This demo can be performed in any of the SAS programming interfaces.
1. Open p206d01b.sas from the demos folder and find the Demo section of the program. Notice
the three PUTLOG statements in the DATA step.
data forecast;
    putlog 'Top of DATA Step ' Year= _N_=;
    set sashelp.shoes(obs=2 rename=(Sales=ProjectedSales));
    do Year = 1 to 3;
        ProjectedSales=ProjectedSales*1.05;
        output;
        putlog 'Value of Year written to table: ' Year=;
    end;
    putlog 'Outside of DO Loop: ' Year=;
    keep Region Product Subsidiary Year ProjectedSales;
    format ProjectedSales dollar10.;
run;
2. Run the program and view the Forecast output table. Notice that there are three rows (Year 1,
2, and 3) for the first two input rows.
3. View the PUTLOG text in the SAS log.
Top of DATA Step Year=. _N_=1
Value of Year written to table: Year=1
Value of Year written to table: Year=2
Value of Year written to table: Year=3
Outside of DO Loop: Year=4
Top of DATA Step Year=. _N_=2
Value of Year written to table: Year=1
Value of Year written to table: Year=2
Value of Year written to table: Year=3
Outside of DO Loop: Year=4
Top of DATA Step Year=. _N_=3


Executing an Iterative DO Loop


Year is incremented by 1 at the bottom of the DO loop.

index = index + increment

The DO loop terminates when Year exceeds the stop value of 3.

The index column Year is incremented by 1 at the bottom of the DO loop. Then the new value is
checked to see whether execution continues. When the value exceeds the stop value, DO-loop
processing is over for that DATA step iteration. The final value of the index column is one
increment beyond the last value that was processed. In this example, the final value of Year is 4,
which is one increment beyond the stop value.
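
The final value of the index column can be checked with a small sketch that uses an increment other than 1. The column name i and the values below are illustrative assumptions, not part of the course programs.

```sas
data _null_;
    /* i takes the values 1, 5, and 9 inside the loop */
    do i = 1 to 10 by 4;
        putlog 'Inside the DO loop: ' i=;
    end;
    /* After the loop, i is 13: one increment beyond 9, the last value processed */
    putlog 'After the DO loop: ' i=;
run;
```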


Scenario
How much money is in savings each month if we save $200 per month for a year?

data YearlySavings;
    Amount=200;
    do Month=1 to 12;
        Savings+Amount;
        output;
    end;
    format Savings 12.2;
run;


This DATA step calculates the amount in savings when 200 dollars is added each month. Notice that
we are not reading any data here, so there is only one iteration of the DATA step. However, the DO
loop creates 12 rows, corresponding to 12 months of savings. At month 12, the savings is 2,400
dollars.

6.01 Activity
Open p206a01.sas from the activities folder and perform the following tasks:
1. In the DATA step, add the following sum statement after the Savings sum
statement to add 2% interest compounded monthly:
Savings+(Savings*0.02/12);

2. Run the program. How much is in savings at month 12?


3. Delete the OUTPUT statement and run the program again.
4. How many rows are created?
5. What is the value of Month?
6. What is the value of Savings?


Output inside the DO Loop

data YearlySavings;

do Month=1 to 12;

output;
end;

run;


An OUTPUT statement between the DO and END statements writes a row for each iteration of the
DO loop. In this example, there are 12 iterations. Therefore, the output happens 12 times. The value
of the index column is incremented at the bottom of the DO loop, after each row is written to the
output table.


Output outside the DO Loop

data YearlySavings;

    do Month=1 to 12;

    end;

run;    /* implicit OUTPUT */


In this example, there is no explicit OUTPUT statement, so implicit output at the end of the DATA
step is active. Because output occurs after the 12 iterations of the DO loop, the value of Month is
13, which is one increment beyond the stop value. Notice that the value of Savings is 2,426 at the
end of the DO loop for both examples. When Month is incremented to 13, the DO loop does not
execute to update the amount in Savings.
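
The two placements can be compared side by side. The following sketch fills in the slide skeletons with the savings statements from the activity. The output table names SavingsEachMonth and SavingsAtYearEnd are hypothetical names chosen for this illustration.

```sas
/* OUTPUT inside the DO loop: one row per iteration (12 rows, Month=1 to 12) */
data SavingsEachMonth;
    Amount=200;
    do Month=1 to 12;
        Savings+Amount;
        Savings+(Savings*0.02/12);
        output;
    end;
    format Savings 12.2;
run;

/* No explicit OUTPUT: implicit output writes one row after the loop ends (Month=13) */
data SavingsAtYearEnd;
    Amount=200;
    do Month=1 to 12;
        Savings+Amount;
        Savings+(Savings*0.02/12);
    end;
    format Savings 12.2;
run;
```

The last row of the first table and the single row of the second table should hold the same Savings value. Only the row count and the value of Month differ.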


DO Loop with an Input Table

data YearSavings;
    set pg2.savings;

    do Month=1 to 12;

    end;

run;    /* implicit OUTPUT */

Input table: pg2.savings. Output table: work.YearSavings.
Iterations: 4 (DATA step) x 12 (DO loop).


Now let’s look at what happens in a DATA step that is reading input data. In this example, the DATA
step loops four times because there are four rows to read from the Savings table. For each of these
4 iterations of the DATA step, there will be 12 iterations of the DO loop. Because there is no explicit
OUTPUT statement, implicit output writes four rows to the YearSavings table. The value of Savings
for each row is the value at the end of 12 months.


Nested DO Loops
data FiveYearSavings;
    set pg2.savings;

    do Year=1 to 5;
        do Month=1 to 12;

        end;
    end;

run;    /* implicit OUTPUT */

Input table: pg2.savings. Output table: work.FiveYearSavings.
Iterations: 4 (DATA step) x 5 (Year loop) x 12 (Month loop).


Suppose you want to calculate the savings after five years for each row in the input data. To do this,
you can use nested DO loops. The outer DO loop for Year iterates 5 times, once for each year. The
inner DO loop for Month iterates 12 times within each iteration of the DO loop for Year, so a total of
60 times.
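
The iteration count can be confirmed with a short sketch that reads no data. The counter column Count is a hypothetical name used only for this illustration.

```sas
data _null_;
    Count=0;
    do Year=1 to 5;
        do Month=1 to 12;
            Count+1;    /* executes once for each Year and Month combination */
        end;
    end;
    /* writes Count=60 Year=6 Month=13 */
    putlog Count= Year= Month=;
run;
```

After both loops finish, each index column is one increment beyond its stop value.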


Using Iterative DO Loops

Scenario
Modify an existing DATA step with variations of iterative DO loops and variations in the placement of
the OUTPUT statement.

Files
• p206d02.sas
• savings – a SAS table that contains the names of individuals and the amount that they are
planning to save monthly

Syntax

DATA output-table;
SET input-table;
...
DO index-column = start TO stop <BY increment>;

. . . repetitive code . . .
<OUTPUT;>

END;
...
<OUTPUT;>
RUN;

Notes
• The DO loop iterates for each iteration of the DATA step.
• An OUTPUT statement between the DO and END statements outputs one row for each iteration of
the DO loop.
• An OUTPUT statement after the DO loop outputs a row based on the final iteration of the DO loop.
The index column will be an increment beyond the stop value.
• DO loops can be nested.

Demo
1. Open the pg2.savings table. Notice that there are four rows representing different people. The
Amount value is a monthly savings value.
2. Open p206d02.sas from the demos folder and find the Demo section of the program. Run the
program and notice that four rows are created due to four rows being read from the input table.
Also, notice how the Savings value keeps increasing for each row.


3. Fix the issue by adding an assignment statement before the DO loop to set the value of Savings
to 0. Run the program and notice the corrected values for Savings.
data YearSavings;
    set pg2.savings;
    Savings=0;
    ...
run;

4. Add an outer DO loop that iterates through five years, with the existing 12-month DO loop nested
inside it. Run the program and notice that you have one row for each person. Each row represents
the savings after five years, assuming that savings are added each month. The value of Year is 6
and the value of Month is 13, an increment beyond each stop value.
data YearSavings;
    set pg2.savings;
    Savings=0;
    do Year=1 to 5;
        do Month=1 to 12;
            Savings+Amount;
            Savings+(Savings*0.02/12);
        end;
    end;
    format Savings comma12.2;
run;


5. Add an OUTPUT statement to the bottom of the outer DO loop. Run the program and notice that
you now have 5 rows for each person (a total of 20 rows). Each row represents the savings at
each of the five years.
...
do Year=1 to 5;
    do Month=1 to 12;
        Savings+Amount;
        Savings+(Savings*0.02/12);
    end;
    output;
end;
...

6. Move the OUTPUT statement to the bottom of the inner DO loop. Run the program and notice that
you now have 60 rows for each person (a total of 240 rows). Each row represents the savings at
each year and month combination.
...
do Year=1 to 5;
    do Month=1 to 12;
        Savings+Amount;
        Savings+(Savings*0.02/12);
        output;
    end;
end;
...


Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
1. Using Nested Iterative DO Loops (DATA Step with No SET Statement)
Determine the value of a retirement account after six years based on an annual investment of
$10,000 and a constant annual interest rate of 7.5%.
a. Open p206p01.sas from the practices folder. Add an iterative DO loop around the sum
statement for Invest.
1) Add a DO statement that creates the column Year with values ranging from 1 to 6.
2) Add an OUTPUT statement to show the value of the retirement account for each year.
3) Add an END statement.
b. Run the program and review the results.

c. Add an inner iterative DO loop between the sum statement and the OUTPUT statement to
include the accrued quarterly compounded interest based on an annual interest rate of 7.5%.
1) Add a DO statement that creates the column Quarter with values ranging from 1 to 4.
2) Add a sum statement to add the accrued interest to the Invest value.
Invest+(Invest*(.075/4));
3) Add an END statement.


d. Run the program and review the results.

e. Drop the Quarter column. Run the program and review the results.


Level 2
2. Using an Iterative DO Loop (DATA Step with a SET Statement)
The pg2.np_summary table contains public use statistics from the National Park Service. The
Pacific West region is anticipating the number of recreational day visitors to increase yearly by
5% for national monuments and 8% for national parks. Show the forecasted number of
recreational day visitors for each park for the next five years.
a. Open p206p02.sas from the practices folder. Run the program and review the results.
Notice that the initial program is showing the forecasted value for the next year. The next
year is based on adding one year to the year value of today’s date. Depending on the current
date, your NextYear value might be bigger than the NextYear value in the following results.

b. Add an iterative DO loop around the conditional IF-THEN statements.


1) The DO loop needs to iterate five times.
2) In the DO statement, a new column named Year needs to be created that starts at the
value of NextYear and stops at the value of NextYear plus 4.
3) A row needs to be created for each year.
c. Modify the KEEP statement to keep the column Year instead of NextYear.
d. Run the program and review the results.


e. (Optional) Modify the OUTPUT statement to be a conditional statement that outputs only on
the fifth iteration. Run the program and review the results.

Challenge
3. Using an Iterative DO Loop with a List of Values
The sashelp.cars table contains information about cars, including Make, Model, MPG_City,
and MPG_Highway. Forecast each car’s projected fuel efficiency over the next five years,
assuming a 3% increase per year.
a. Open p206p03.sas from the practices folder.
b. Add a DO loop to the DATA step to produce the following results. The MPG value is
increasing by 3% per year.


c. Modify the DO statement to produce the following results. The DO statement is now based
on a list of values instead of a value that is incremented.


6.2 Using Conditional DO Loops

Iterative DO Loop
data YearSavings;
    set pg2.savings;
    Savings=0;
    do Month=1 to 12;
        Savings+Amount;
        Savings+(Savings*0.02/12);
    end;
    drop Month;
    format Savings comma12.2;
run;

An iterative DO loop works great when you know exactly how many iterations to perform (here, 12).


Conditional DO Loop
data Savings3K;
    set pg2.savings;
    Month=0;
    Savings=0;
    do until (Savings>3000);
        Month+1;
        Savings+Amount;
        Savings+(Savings*0.02/12);
    end;
    format Savings comma12.2;
run;

The condition determines how many times the loop is executed.


There are times when you do not know how many times the DO loop needs to iterate. For example,
suppose you want the DO loop to stop when each person has saved beyond 3000 dollars. When a
row is read from the input table, the DO loop iterates as many times as necessary to meet the
condition.


Conditional DO Loops

DO UNTIL executes repetitively until a condition is true:

DO UNTIL (expression) ;
    ... repetitive code ...
END;

DO WHILE executes repetitively while a condition is true:

DO WHILE (expression) ;
    ... repetitive code ...
END;


There are two variations of the conditional DO loop. You can use either the keyword UNTIL, which
executes until a condition is true, or the keyword WHILE, which executes while a condition is true.
For both methods, the expression must be enclosed in parentheses.

Conditional DO Loops

DO UNTIL executes repetitively until a condition is true:

do until (Savings>3000);
    Month+1;
    Savings+Amount;
    Savings+(Savings*0.02/12);
end;

DO WHILE executes repetitively while a condition is true:

do while (Savings<=3000);
    Month+1;
    Savings+Amount;
    Savings+(Savings*0.02/12);
end;


The DO UNTIL loop executes until savings is greater than 3,000. The DO WHILE loop uses the
opposite of the expression. That is, the DO WHILE loop executes while the savings is less than or
equal to 3,000.
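
The equivalence of the two expressions can be sketched without an input table. The monthly amount of 250 and the column names ending in U and W are assumptions made for this illustration, and interest is omitted to keep the arithmetic easy to follow.

```sas
data _null_;
    Amount=250;    /* hypothetical monthly amount */

    /* DO UNTIL: stop as soon as savings is greater than 3000 */
    SavingsU=0;
    MonthU=0;
    do until (SavingsU>3000);
        MonthU=MonthU+1;
        SavingsU=SavingsU+Amount;
    end;

    /* DO WHILE: continue as long as savings is less than or equal to 3000 */
    SavingsW=0;
    MonthW=0;
    do while (SavingsW<=3000);
        MonthW=MonthW+1;
        SavingsW=SavingsW+Amount;
    end;

    /* writes MonthU=13 SavingsU=3250 MonthW=13 SavingsW=3250 */
    putlog MonthU= SavingsU= MonthW= SavingsW=;
run;
```

After 12 months the savings is exactly 3000, which does not exceed the target, so both loops run a thirteenth time. A WHILE expression of less than 3000 (without the equal sign) would stop one month earlier and would not match the DO UNTIL result.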


6.02 Activity
Open p206a02.sas from the activities folder and perform the following tasks:
1. Run the program and view the Savings3K table.
2. How many months until James exceeds 3000 in savings?
3. How much savings does James have at that month?
4. Change the DO UNTIL statement to a DO WHILE statement and modify
the expression to produce the same results.
5. Run the program and view the Savings3K table.
6. Are the results for James identical with the DO WHILE as compared to
the DO UNTIL?


Checking the Condition

DO UNTIL checks the condition at the bottom of the loop:

DO UNTIL (expression) ;
    ... repetitive code ...
END;

DO WHILE checks the condition at the top of the loop:

DO WHILE (expression) ;
    ... repetitive code ...
END;


Many times, a DO UNTIL and a DO WHILE produce the same results. A subtle but important
difference between the two is when the condition is checked. For a DO UNTIL loop, the condition is
always checked at the bottom of the DO loop. For a DO WHILE loop, the condition is always
checked at the top of the DO loop. This difference can cause different results to be produced.


Checking the Condition


Input table: pg2.savings2

do until (Savings>3000);     DO UNTIL always executes at least once.
do while (Savings<=3000);    DO WHILE executes only if the condition is true.

Let’s use our savings example in a slightly different scenario. This time, the input table Savings2
includes the amount of savings already earned for each individual, so we are no longer starting with
a savings of zero. Notice that Linda has already reached the desired savings of 3,000. Even though
her savings has exceeded the target, the DO UNTIL loop executes once because the condition is
not checked until the bottom of the DO loop. The DO WHILE does not execute at all because the
condition is checked at the top of the DO loop.

The important point to remember is that the statements in a DO UNTIL loop always execute at least
one time, whereas the statements in a DO WHILE loop do not execute even once if the condition is
initially false.
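
The difference can be sketched with a starting value that already exceeds the target. The starting savings of 3600 matches the value that the demo reports for Linda. The monthly amount of 100 and the column names ending in U and W are assumptions made for this illustration.

```sas
data _null_;
    Amount=100;    /* hypothetical monthly amount */

    /* DO UNTIL: the condition is checked at the bottom, so the body runs once */
    SavingsU=3600;
    MonthU=0;
    do until (SavingsU>3000);
        MonthU=MonthU+1;
        SavingsU=SavingsU+Amount;
    end;

    /* DO WHILE: the condition is checked at the top, so the body never runs */
    SavingsW=3600;
    MonthW=0;
    do while (SavingsW<=3000);
        MonthW=MonthW+1;
        SavingsW=SavingsW+Amount;
    end;

    /* writes MonthU=1 SavingsU=3700 MonthW=0 SavingsW=3600 */
    putlog MonthU= SavingsU= MonthW= SavingsW=;
run;
```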


Using Conditional DO Loops

Scenario
Modify an existing DATA step with variations of the conditional DO loop.

Files
• p206d03.sas
• savings2 – a SAS table that contains the names of individuals, the amount that they are planning
to save monthly, and the current value of their savings

Syntax

DATA output-table;
SET input-table;
...
DO UNTIL | WHILE (expression);
. . . repetitive code . . .
<OUTPUT;>
END;
RUN;

Notes
• A conditional DO loop executes based on a condition, whereas an iterative DO loop executes a
set number of times.
• A DO UNTIL executes until a condition is true, and the condition is checked at the bottom of the
DO loop. A DO UNTIL loop always executes at least one time.
• A DO WHILE executes while a condition is true, and the condition is checked at the top of the DO
loop. A DO WHILE loop does not iterate even once if the condition is initially false.
• The expression needs to be in a set of parentheses for the DO UNTIL or DO WHILE.

Demo
1. Open the pg2.savings2 table. This table contains a column named Savings that is the current
value of each person’s savings account. Notice that Linda’s value is already greater than 3000.
2. Open p206d03.sas from the demos folder and find the Demo section of the program. Notice
that the DO UNTIL expression is Savings equal to 3000. Run the program. Because Savings is
never equal to 3000, the program is in an infinite loop. Stop the infinite DO loop from running.
• In SAS Enterprise Guide, click the Stop toolbar button on the Program tab.
• In SAS Studio, click Cancel in the Running pop-up window.
3. Make the following modifications to the DATA step:
a. Replace the equal sign with a greater than symbol.
b. Add a sum statement inside the DO loop to create a column named Month that increments
by 1 for each loop.
c. Before the DO loop, add an assignment statement to reset Month to 0 each time that a new
row is read from the input table.


4. Run the program. Notice that even though Linda began with 3600 for Savings, the DO loop
executed once.
data MonthSavings;
    set pg2.savings2;
    Month=0;
    do until (Savings>3000);
        Month+1;
        Savings+Amount;
        Savings+(Savings*0.02/12);
    end;
    format Savings comma12.2;
run;

5. Change the DO UNTIL statement to a DO WHILE statement with the opposite expression so that the
condition is checked at the top of the loop. Run the program and verify that Linda’s Savings amount
is 3600.
data MonthSavings;
    set pg2.savings2;
    Month=0;
    do while (Savings<=3000);
        Month+1;
        Savings+Amount;
        Savings+(Savings*0.02/12);
    end;
    format Savings comma12.2;
run;


Combining Iterative and Conditional DO Loops

DO index-column = start TO stop <BY increment> UNTIL | WHILE (expression) ;

iterative portion: index-column = start TO stop <BY increment>
conditional portion: UNTIL | WHILE (expression)

The DO loop stops executing when the stop value is exceeded or the condition is met,
whichever is first.


The conditional DO loop can be combined with the iterative DO loop. The iterative syntax comes
before the conditional syntax. The DO loop stops executing when the stop value is exceeded or when
the condition ends the loop (the UNTIL expression becomes true or the WHILE expression becomes
false), whichever occurs first.


Iterative and Conditional DO Loop


do Month=1 to 12 until (Savings>5000);
    Savings+Amount;
    Savings+(Savings*0.02/12);
end;
DO UNTIL: At the bottom of the loop, the condition is checked before the index column is
incremented.

do Month=1 to 12 while (Savings<=5000);
    Savings+Amount;
    Savings+(Savings*0.02/12);
end;
DO WHILE: At the top of the loop, the condition is checked. At the bottom of the loop, the
index column is incremented.

Suppose you need to reach a savings value of 5,000, but you do not want to go beyond 12 months.
For the DO UNTIL loop, the condition is checked before the index column is incremented at the
bottom of the loop. So here, if the value of Savings is greater than 5000, the DO loop processing
stops and Month is not incremented. If the value of Savings is less than or equal to 5000, Month is
incremented and the value of Month is checked against the stop value.
For the DO WHILE loop, the condition is checked at the top of the loop, and the index column is
incremented at the bottom of the loop. So here, if Savings is less than or equal to 5000, the DO loop
executes and the index column is incremented at the bottom of the DO loop and checked against the
stop value. Assuming that the stop value has not been exceeded, the condition is checked again at
the top of the loop.
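
The effect on the index column can be sketched without an input table. The monthly amount of 2000 is a hypothetical value chosen so that the target is passed in the third iteration, and interest is omitted to keep the arithmetic easy to follow.

```sas
data _null_;
    Amount=2000;    /* hypothetical monthly amount */

    /* DO UNTIL: the condition is checked before Month is incremented */
    Savings=0;
    do Month=1 to 12 until (Savings>5000);
        Savings=Savings+Amount;
    end;
    putlog 'DO UNTIL: ' Month= Savings=;    /* writes Month=3 Savings=6000 */

    /* DO WHILE: Month is incremented before the condition is checked again */
    Savings=0;
    do Month=1 to 12 while (Savings<=5000);
        Savings=Savings+Amount;
    end;
    putlog 'DO WHILE: ' Month= Savings=;    /* writes Month=4 Savings=6000 */
run;
```

Both loops execute three times and reach the same Savings value, but the final value of Month differs by 1.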


Combining Iterative and Conditional DO Loops

Scenario
Combine iterative and conditional DO loops to determine the number of times a loop iterates.
Compare results for DO WHILE and DO UNTIL.

Files
• p206d04.sas
• savings2 – a SAS table that contains the names of individuals, the amount that they are planning
to save monthly, and the current value of their savings

Syntax

DATA output-table;
SET input-table;
...
DO index-column = start TO stop <BY increment> UNTIL | WHILE (expression);
. . . repetitive code . . .
END;
...
RUN;

Notes
• An iterative DO loop can be combined with a conditional DO loop. The index column is listed in the
DO statement before the UNTIL or WHILE condition.
• For an iterative loop combined with a DO UNTIL condition, the condition is checked before the
index column is incremented at the bottom of the loop.
• For an iterative loop combined with a DO WHILE condition, the condition is checked at the top of
the loop and the index column is incremented at the bottom of the loop.

Demo
1. Open p206d04.sas from the demos folder and find the Demo section of the program. The intent
of both DATA steps is to process the DO loop for each row in the pg2.savings2 table. One DATA
step uses DO WHILE and the other uses DO UNTIL. Each loop represents one month of
savings. The loop should stop iterating when Savings exceeds 5000 or when 12 months pass,
whichever comes first.


2. Run the demo program and view the two reports that are created. Notice that the values of
Savings in the DO WHILE and DO UNTIL reports match, indicating that the DO loops executed
the same number of times for each person.

3. Observe that the first row in both the DO WHILE and DO UNTIL reports has Month equal to
13. Savings did not exceed $5,000 after 12 iterations of the DO loop. The Month index variable
was incremented to 13 at the end of the 12th iteration of the loop, which triggered the end of the
loop in both DATA steps and an implicit output action to the output table.
4. Observe that in rows 2, 3, and 4, the value of Month in the DO WHILE results is 1 greater
than in the DO UNTIL results. In the DO WHILE loop, the index variable Month is incremented
before the condition is checked at the top of the next iteration, whereas in the DO UNTIL loop, the
condition is checked before Month is incremented. Therefore, the Month column in the output
data does not reliably represent the number of times that the DO loop iterated in either DATA
step.
5. To create an accurate counter for the number of iterations of a DO loop, make the following
modifications to both DATA steps:
a. Add a sum statement inside the loop to create a column named Month and add 1 for each
iteration.
b. Before the DO loop, add an assignment statement to reset Month to 0 each time that a new
row is read from the input table.
c. Change the name of the index variable to an arbitrary name, such as i.
d. Add a DROP statement to drop i from the output table.
Note: Make the same additions in both DATA steps.
data MonthSavingsW;
    set pg2.savings2;
    Month=0;
    do i=1 to 12 while (Savings<=5000);
        Month+1;
        Savings+Amount;
        Savings+(Savings*0.02/12);
    end;
    format Savings comma12.2;
    drop i;
run;


6. Run the program and examine the results. Notice that the values of Savings and Month match
for the DO WHILE and DO UNTIL reports. Month represents the number of times that the DO
loop executed for each row.


Beyond SAS Programming 2


What if you want to ...

. . . access SAS documentation and examples for the DO statement?
• Go to the SAS Help Center for the DO statement.

. . . use array processing in combination with DO loops?
• View examples of array processing.
• Take the SAS Programming 3: Advanced Techniques and Efficiencies course.




Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
4. Using a Conditional DO Loop
The pg2.np_summary table contains public use statistics from the National Park Service. The
Northeast region has seen an increase in visitors at its national monuments that previously
experienced low visitation. Determine the number of years it will take for the number of visitors to
exceed 100,000, assuming an annual 6% increase.
a. Open p206p04.sas from the practices folder. Run the program and review the results.
Notice that the first two monuments are not near 100,000 visitors, but the third monument is
near 100,000 after one year with a 6% increase.

b. Add a conditional DO loop around the assignment statement where IncrDayVisits is being
increased by 6%.
1) Add a DO UNTIL statement that executes until the value of IncrDayVisits exceeds
100,000.
2) Add an OUTPUT statement to show the increased value for each iteration.
3) Add an END statement.
c. Run the program and review the results.

d. Within the DO loop, add a sum statement to add 1 to the value of Year.
Year+1;
e. Before the DO loop, add an assignment to set the Year to 0. Add Year to the KEEP
statement.


f. Run the program and review the results.

g. How many years did it take until the number of visitors exceeded 100,000 for each national
monument?

ParkName Number of Years

African Burial Ground National Monument

Booker T. Washington National Monument

Fort Stanwix National Monument

h. Remove the OUTPUT statement. Run the program and view the results. The number for
Year should match the numbers that you specified above.

i. (Optional) Modify the DO UNTIL statement to be a DO WHILE statement that produces the
same results.
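
The relationship between DO UNTIL and DO WHILE can be sketched on a small stand-alone example. This is a minimal illustration, not the practice solution: the table name work.loop_compare, the columns Method, Balance, and Years, and the starting value of 1000 with a 10% annual increase are all hypothetical. DO UNTIL evaluates its condition at the bottom of the loop, and DO WHILE evaluates the inverse condition at the top. When the starting value already satisfies the WHILE condition, as it does here, both loops run the same number of times.

```sas
/* Hypothetical example: compare DO UNTIL and DO WHILE on the same values */
data work.loop_compare;
   length Method $ 8;

   /* DO UNTIL: the condition is evaluated at the bottom of the loop */
   Method='DO UNTIL';
   Balance=1000;
   Years=0;
   do until (Balance>2000);
      Years+1;
      Balance=Balance*1.10;
   end;
   output;

   /* DO WHILE: the inverse condition is evaluated at the top of the loop */
   Method='DO WHILE';
   Balance=1000;
   Years=0;
   do while (Balance<=2000);
      Years+1;
      Balance=Balance*1.10;
   end;
   output;
run;

proc print data=work.loop_compare noobs;
run;
```

Both rows should show the same Years value. The two forms can differ only when the condition is already met before the first iteration, because DO UNTIL always executes the loop body at least once.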

Level 2
5. Using an Iterative and Conditional DO Loop
The pg2.eu_sports table contains European Union trade amounts for sport products. Belgium
wants to see their exports exceed their imports for golf and racket products. They expect to
annually increase exports by 7% and want to achieve their goal within 10 years.
a. Open p206p05.sas from the practices folder. Run the program and review the results.
Notice that the golf export number is farther from the golf import number as compared to the
racket export and import numbers.


b. Add a conditional DO loop around the assignment statement for Amt_Export.


1) Use a DO WHILE statement that executes while the export value is less than or equal to
the import value.
2) Create a Year column that increments by a value of 1.
3) Create a row of output for each year.
c. Run the program and review the results.
Partial Results (5 of 18 Rows)

d. How many years did it take until the exports exceeded the imports, and what is the final Year
value for each sport product?

Sport_Product   Number of Years   Final Year
GOLF            ______            ______
RACKET          ______            ______

e. Modify the DO statement to include an iterative portion before the conditional portion. The
iterative portion needs to be based on Year values of 2016 to 2025 (10 years).
f. Within the DO loop, delete any statements related to the incrementing of Year.
g. Run the program and review the results. The results show 14 data rows.
h. Complete this table based on your last modification:

Sport_Product   Number of Years   Final Year   Do Exports exceed Imports?
GOLF            ______            ______       ______
RACKET          ______            ______       ______

i. Delete the OUTPUT statement.


j. Run the program and review the results.


k. Do these Year values equal the final Year values before deleting the OUTPUT statement?
Why or why not?
l. (Optional) Include a conditional OUTPUT statement within the DO loop that will show the two
rows of output with the Year values equal to the final Year values before deleting the
OUTPUT statement.

Challenge
6. Controlling Execution of DO Loop Statements with CONTINUE and LEAVE
The pg2.storm_summary table contains information about storms, including storm name, basin,
maximum wind speed, and the start and end dates. You want to calculate the duration of each
storm in days and count the number of working days lost in 2015.
a. Open p206p06.sas from the practices folder. Run the program and review the results. Note
that the values for Duration and LostWork2015 are incorrect.

b. Modify the DATA step program to correctly calculate duration and the number of lost
workdays in 2015 for each storm.
1) When calculating Duration, include both the start and end dates in the number of days.
2) Use a DO loop and accumulating variable LostWork2015 to calculate the number of
workdays lost. Within the DO loop, do the following:
• Test to see whether ThisDay is in the year 2015. If not, exit the DO loop because
there will be no further workdays that occur in 2015 for the given storm. Review the
SAS documentation for the LEAVE statement.
• If the current day of the week is Sunday or Saturday, skip the remaining statements in
the DO loop and go to the next iteration. Review the SAS documentation for the
CONTINUE statement.
• Otherwise, increment LostWork2015 by 1.
c. Run the program and review the results. The table work.storm_workdays should have 95
rows and seven columns, and the PROC PRINT step should produce these results:


6.3 Solutions
Solutions to Practices
1. Using Nested Iterative DO Loops (DATA Step with No SET Statement)
/* b. */
data retirement;
do Year = 1 to 6;
Invest+10000;
output;
end;
run;

title1 'Retirement Account Balance per Year';


proc print data=retirement noobs;
format Invest dollar12.2;
run;
title;

/* d. */
data retirement;
do Year = 1 to 6;
Invest+10000;
do Quarter = 1 to 4;
Invest+(Invest*(.075/4));
end;
output;
end;
run;

title1 'Retirement Account Balance per Year';


proc print data=retirement noobs;
format Invest dollar12.2;
run;
title;

/* f. */
data retirement;
do Year = 1 to 6;
Invest+10000;
do Quarter = 1 to 4;
Invest+(Invest*(.075/4));
end;
output;
end;
drop Quarter;
run;


title1 'Retirement Account Balance per Year';


proc print data=retirement noobs;
format Invest dollar12.2;
run;
title;
2. Using an Iterative DO Loop (DATA Step with a SET Statement)
/* d. */
data ForecastDayVisits;
set pg2.np_summary;
where Reg='PW' and Type in ('NM','NP');
ForecastDV=DayVisits;
NextYear=year(today())+1;
do Year = NextYear to NextYear+4;
if Type='NM' then ForecastDV=ForecastDV*1.05;
if Type='NP' then ForecastDV=ForecastDV*1.08;
output;
end;
format ForecastDV comma12.;
label ForecastDV='Forecasted Recreational Day Visitors';
keep ParkName DayVisits ForecastDV Year;
run;

proc sort data=ForecastDayVisits;


by ParkName;
run;

title 'Forecast of Recreational Day Visitors for Pacific West';


proc print data=ForecastDayVisits label;
run;
title;

/* e. */
data ForecastDayVisits;
set pg2.np_summary;
where Reg='PW' and Type in ('NM','NP');
ForecastDV=DayVisits;
NextYear=year(today())+1;
do Year = NextYear to NextYear+4;
if Type='NM' then ForecastDV=ForecastDV*1.05;
if Type='NP' then ForecastDV=ForecastDV*1.08;
if Year=NextYear+4 then output;
end;
format ForecastDV comma12.;
label ForecastDV='Forecasted Recreational Day Visitors';
keep ParkName DayVisits ForecastDV Year;
run;


proc sort data=ForecastDayVisits;


by ParkName;
run;

title 'Forecast of Recreational Day Visitors for Pacific West';


proc print data=ForecastDayVisits label;
run;
title;
3. Using an Iterative DO Loop with a List of Values
/* b. */
data IncMPG;
set sashelp.cars;
MPG=mean(MPG_City, MPG_Highway);
do Year=1 to 5;
MPG=MPG*1.03;
output;
end;
run;

title 'Projected Fuel Efficiency with 3% Annual Increase';


proc print data=IncMPG;
var Make Model Year MPG;
format MPG 4.1;
run;
title;

/* c. */
data IncMPG;
set sashelp.cars;
MPG=mean(MPG_City, MPG_Highway);
do Year='Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5';
MPG=MPG*1.03;
output;
end;
run;

title 'Projected Fuel Efficiency with 3% Annual Increase';


proc print data=IncMPG;
var Make Model Year MPG;
format MPG 4.1;
run;
title;


4. Using a Conditional DO Loop


/* f. */
data IncreaseDayVisits;
set pg2.np_summary;
where Reg='NE' and DayVisits<100000;
IncrDayVisits=DayVisits;
Year=0;
do until (IncrDayVisits>100000);
Year+1;
IncrDayVisits=IncrDayVisits*1.06;
output;
end;
format IncrDayVisits comma12.;
keep ParkName DayVisits IncrDayVisits Year;
run;

proc sort data=IncreaseDayVisits;


by ParkName;
run;

title1 'Years Until Northeast National Monuments Exceed 100,000 Visitors';
title2 'Based on Annual Increase of 6%';
proc print data=IncreaseDayVisits label;
label DayVisits='Current Day Visitors'
IncrDayVisits='Increased Day Visitors';
run;
title;

ParkName                                   Number of Years
African Burial Ground National Monument    14
Booker T. Washington National Monument     25
Fort Stanwix National Monument             2

/* h. */
data IncreaseDayVisits;
set pg2.np_summary;
where Reg='NE' and DayVisits<100000;
IncrDayVisits=DayVisits;
Year=0;
do until (IncrDayVisits>100000);
Year+1;
IncrDayVisits=IncrDayVisits*1.06;
end;


format IncrDayVisits comma12.;


keep ParkName DayVisits IncrDayVisits Year;
run;

proc sort data=IncreaseDayVisits;


by ParkName;
run;

title1 'Years Until Northeast National Monuments Exceed 100,000 Visitors';
title2 'Based on Annual Increase of 6%';
proc print data=IncreaseDayVisits label;
label DayVisits='Current Day Visitors'
IncrDayVisits='Increased Day Visitors';
run;
title;

/* i. */
data IncreaseDayVisits;
set pg2.np_summary;
where Reg='NE' and DayVisits<100000;
IncrDayVisits=DayVisits;
Year=0;
do while (IncrDayVisits<=100000);
Year+1;
IncrDayVisits=IncrDayVisits*1.06;
end;
format IncrDayVisits comma12.;
keep ParkName DayVisits IncrDayVisits Year;
run;

proc sort data=IncreaseDayVisits;


by ParkName;
run;

title1 'Years Until Northeast National Monuments Exceed 100,000 Visitors';
title2 'Based on Annual Increase of 6%';
proc print data=IncreaseDayVisits label;
label DayVisits='Current Day Visitors'
IncrDayVisits='Increased Day Visitors';
run;
title;


5. Using an Iterative and Conditional DO Loop


/* c. */
data IncrExports;
set pg2.eu_sports;
where Year=2015 and Country='Belgium'
and Sport_Product in ('GOLF','RACKET');
do while (Amt_Export<=Amt_Import);
Year+1;
Amt_Export=Amt_Export*1.07;
output;
end;
format Amt_Import Amt_Export comma12.;
run;

title 'Belgium Golf and Racket Products - 7% Increase in Exports';
proc print data=IncrExports;
var Sport_Product Year Amt_Import Amt_Export;
run;
title;

Sport_Product   Number of Years   Final Year
GOLF            14                2029
RACKET          4                 2019

/* g. */
data IncrExports;
set pg2.eu_sports;
where Year=2015 and Country='Belgium'
and Sport_Product in ('GOLF','RACKET');
do Year=2016 to 2025 while (Amt_Export<=Amt_Import);
Amt_Export=Amt_Export*1.07;
output;
end;
format Amt_Import Amt_Export comma12.;
run;

title 'Belgium Golf and Racket Products - 7% Increase in Exports';
proc print data=IncrExports;
var Sport_Product Year Amt_Import Amt_Export;
run;
title;


Sport_Product   Number of Years   Final Year   Do Exports exceed Imports?
GOLF            10                2025         No
RACKET          4                 2019         Yes

/* j. */
data IncrExports;
set pg2.eu_sports;
where Year=2015 and Country='Belgium'
and Sport_Product in ('GOLF','RACKET');
do Year=2016 to 2025 while (Amt_Export<=Amt_Import);
Amt_Export=Amt_Export*1.07;
end;
format Amt_Import Amt_Export comma12.;
run;

title 'Belgium Golf and Racket Products - 7% Increase in Exports';
proc print data=IncrExports;
var Sport_Product Year Amt_Import Amt_Export;
run;
title;
No, the Year values do not equal the final Year values before deleting the OUTPUT
statement. Output happens after the DO loop due to the implicit OUTPUT. The Year
column is incremented at the bottom of the DO loop before checking the DO WHILE
condition at the top of the loop.
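
The timing described above can be observed in a small stand-alone sketch. The table name work.index_demo and the columns i and x are hypothetical and are not part of the practice data. The loop has no explicit OUTPUT statement, so the only row is written by the implicit OUTPUT after the loop ends. At that point, the index has already been incremented past the last iteration that executed.

```sas
/* Hypothetical example: value of the index variable after the loop ends */
data work.index_demo;
   x=1;
   do i=1 to 10 while (x<=5);
      x=x*2;   /* the body executes for i=1, 2, and 3 */
   end;
   /* i is incremented to 4 at the bottom of the third iteration,  */
   /* then the WHILE condition fails (x=8) and the loop ends.      */
   /* The implicit OUTPUT writes one row with i=4 and x=8.         */
run;

proc print data=work.index_demo noobs;
run;
```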
/* l. */
data IncrExports;
set pg2.eu_sports;
where Year=2015 and Country='Belgium'
and Sport_Product in ('GOLF','RACKET');
do Year=2016 to 2025 while (Amt_Export<=Amt_Import);
Amt_Export=Amt_Export*1.07;
if Year=2025 or Amt_Export>Amt_Import then output;
end;
format Amt_Import Amt_Export comma12.;
run;

title 'Belgium Golf and Racket Products - 7% Increase in Exports';
proc print data=IncrExports;
var Sport_Product Year Amt_Import Amt_Export;
run;
title;


6. Controlling Execution of DO Loop Statements with CONTINUE and LEAVE


data storm_workdays;
set pg2.storm_summary;
where year(StartDate)=2015 and Name is not missing;
Duration=EndDate-StartDate+1;
LostWork2015=0;
do ThisDay = StartDate to EndDate;
/* if the current day is not in 2015, exit the DO loop */
if year(ThisDay) ne 2015 then leave;
/* if the current day is not a work day, skip the rest
of the statements in the loop, and loop again*/
if weekday(ThisDay) in (1,7) then continue;
LostWork2015+1;
end;
keep Name Basin MaxWindMPH StartDate EndDate
Duration LostWork2015;
run;

title1 'Workdays Lost in 2015 due to Storms';


title2 '(where started in 2015 and ended in 2016)';
proc print data=storm_workdays;
where year(StartDate) ne year(EndDate);
run;
title;


Solutions to Activities and Questions

6.01 Activity – Correct Answer


2. How much is in savings at month 12? 2426.16

4. How many rows are created? one


5. What is the value of Month? 13
6. What is the value of Savings? 2426.16


6.02 Activity – Correct Answer


1. Run the program and view the Savings3K table.
2. How many months until James exceeds 3000 in savings? 12
3. How much savings does James have at that month? 3,032.70
4. Change the DO UNTIL statement to a DO WHILE statement and modify
the expression to produce the same results.
do while (Savings<=3000);

5. Run the program and view the Savings3K table.


6. Are the results for James identical with the DO WHILE as compared to
the DO UNTIL? Yes

Lesson 7 Restructuring Tables
7.1 Restructuring Data with the DATA Step
    Demonstration: Creating a Narrow Table with the DATA Step
    Practice
7.2 Restructuring Data with the TRANSPOSE Procedure
    Demonstration: Creating a Wide Table with PROC TRANSPOSE
    Practice
7.3 Solutions
    Solutions to Practices
    Solutions to Activities and Questions

7.1 Restructuring Data with the DATA Step

Restructuring Tables

Sometimes you have the right information in a table, but the way that the rows and columns are
organized does not match the structure that you need. The syntax for reporting or analytic
procedures might require a particular arrangement for the groups or measures within the table. This
lesson looks at methods available in SAS to restructure your data.


Table Structure
[Slide: the class_test_wide and class_test_narrow tables shown side by side. Both tables
include the same information, but they are structured differently.]

The class_test_wide table includes one row for each student and a separate column for the math
and reading test scores. This is referred to as a wide table because the measures (test scores) are
split into multiple columns.
The class_test_narrow table includes one row per test score, so there is a column that indicates
the subject and another column for the score. This is called a narrow table because the scores are
all stacked in a single column. Both tables include the same information, but they are structured
differently.
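
The two structures can be sketched with a few inline rows. This is an illustration only: the tables are created in the Work library rather than read from the course data, and the student names and all scores other than the first math score are hypothetical. The column names TestSubject and TestScore follow the names used for the narrow table in this lesson.

```sas
/* Wide: one row per student, one column per test (hypothetical values) */
data work.class_test_wide;
   input Name $ Math Reading;
   datalines;
Alfred 82 79
Alice 71 67
;
run;

/* Narrow: one row per test score (same hypothetical values, stacked) */
data work.class_test_narrow;
   length Name $ 8 TestSubject $ 7;
   input Name $ TestSubject $ TestScore;
   datalines;
Alfred Math 82
Alfred Reading 79
Alice Math 71
Alice Reading 67
;
run;

proc print data=work.class_test_wide noobs;
run;

proc print data=work.class_test_narrow noobs;
run;
```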

7.01 Multiple Choice Question
Which table and column (or columns) could you use with PROC MEANS to
calculate an average for all test scores combined?

proc means data=???;
    var ???;
run;

a. class_test_wide, Math and Reading
b. class_test_narrow, TestScore


Discussion
When would the wide table structure be preferred?

Restructuring Data with the DATA Step

[Slide: a wide table restructured into a narrow table.]
You can use the DATA step to read one row and write multiple rows.

When your data is not in the structure needed for analytics, you can restructure it. There are multiple
techniques that can be used to restructure data, but let’s start with familiar syntax: the DATA step.
You can use statements in the DATA step to control the behavior of the PDV. You can read one row
from an input table and write several rows to an output table. Or you can read multiple rows from an
input table and write one row to an output table.


Creating a Narrow Table with the DATA Step

Scenario
Use the DATA step debugger in SAS Enterprise Guide to examine a DATA step that creates a narrow
table from a wide table.

Files
• p207d01.sas
• class_test_wide – a SAS table with one row per student and with test scores in separate columns
for reading and math tests

Notes
• The DATA step can be used to restructure tables.
• Assignment statements are used to create new columns for stacked values.
• The explicit OUTPUT statement is used to create multiple rows for each input row.

Demo
Note: This demo must be performed in Enterprise Guide.
1. Open the p207d01.sas program from the demos folder and find the Demo section.
data class_test_narrow;
set pg2.class_test_wide;
keep Name Subject Score;
length Subject $ 7;
Subject="Math";
Score=Math;
output;
Subject="Reading";
Score=Reading;
output;
run;
2. Start the DATA step debugger. Notice that the three columns from class_test_wide (Name,
Math, and Reading) are included in the PDV. Two additional columns, Subject and Score, are
added to the PDV because of assignment statements. The LENGTH statement establishes the
attributes of the Subject column that will store either Reading or Math.

3. Click Step execution to next line to execute the highlighted SET statement. The first row
from class_test_wide is loaded into the PDV.
4. Execute the two assignment statements. Values are assigned to Subject and Score for the math
test. Execute the OUTPUT statement to write Name, Subject, and Score to the output table.
5. Execute the next two assignment statements. Values are assigned to Subject and Score for the
reading test. Execute the OUTPUT statement to write Name, Subject, and Score to the output
table.
Note: At the end of this first iteration of the DATA step, two rows have been written to the
class_test_narrow table.


6. Proceed through execution of the second iteration of the DATA step. Two additional rows are
written to the output table for the test scores for Alice.
7. Close the DATA step debugger and run the program. Examine the output table and confirm that
each student has two rows corresponding to the math and reading test scores.


Restructuring Data with the DATA Step

[Slide: a narrow table restructured into a wide table.]
You can use the DATA step to read multiple rows before writing one row to the output table.

In this data, there are multiple rows for each student, each representing a test score. We want to
create a table with one row per student with each test score recorded in a separate column.

Restructuring Data with the DATA Step

[Slide: a narrow table restructured into a wide table.]
if TestSubject="Math" then Math=TestScore;
else if TestSubject="Reading" then Reading=TestScore;

You can use conditional processing in the DATA step to examine the value of TestSubject. If
TestSubject is Math, then the column Math in the wide table will have the value of TestScore
(82 for the first row). And if TestSubject is Reading, then TestScore will be assigned to the
Reading column.
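
The conditional assignments above are only one part of a narrow-to-wide DATA step. A minimal sketch of a complete step follows, assuming a small inline table named work.test_narrow that is already sorted by Name. The table names and data values are hypothetical (only the first math score of 82 comes from the slide), and the CALL MISSING statement is an added safeguard rather than something taken from the course program.

```sas
/* Hypothetical narrow input, sorted by Name */
data work.test_narrow;
   input Name $ TestSubject $ TestScore;
   datalines;
Alfred Math 82
Alfred Reading 79
Alice Math 71
Alice Reading 67
;
run;

data work.test_wide;
   set work.test_narrow;
   by Name;
   /* Hold Math and Reading in the PDV across iterations */
   retain Math Reading;
   /* Added safeguard: clear the retained values when a new student starts */
   if first.Name then call missing(Math, Reading);
   if TestSubject="Math" then Math=TestScore;
   else if TestSubject="Reading" then Reading=TestScore;
   /* Subsetting IF: write only the last row for each student */
   if last.Name;
   keep Name Math Reading;
run;

proc print data=work.test_wide noobs;
run;
```

Without the RETAIN statement, Math and Reading would be reset to missing at the start of each iteration, so the last row for each student would hold only one of the two scores.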


7.02 Activity
Open p207a02.sas from the activities folder and perform the following tasks:
1. Examine the DATA step code and run the program. Uncomment the
RETAIN statement and run the program again. Why is the RETAIN
statement necessary?
2. Add a subsetting IF statement to include only the last row per student in
the output table. Run the program.
3. What must be true of the input table for the DATA step to work?


Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
1. Restructuring a Table Using the DATA Step: Wide to Narrow
The pg2.np_2017Camping table contains public use statistics for camping in 2017 from the
National Park Service. To enable statistics to be calculated for all camping locations, restructure
the table as a narrow table.
a. Open the p207p01.sas program in the practices folder. Highlight the PROC PRINT step and
run the selected code. Note that the Tent, RV, and Backcountry columns contain visitor
counts.

b. To convert this wide table to a narrow table, the DATA step must create a new column named
CampType with the values Tent, RV, and Backcountry, and another new column named
CampCount with the numeric counts. The DATA step includes statements to output a row for
CampType='Tent'. Modify the DATA step to output additional rows for RV and Backcountry.
c. Add a LENGTH statement to ensure that the values of the CampType column are not
truncated.
d. Run the DATA step. Confirm that each ParkName value has three rows corresponding to the
Tent, RV, and Backcountry visitor counts.


Level 2
2. Restructuring a Table Using the DATA Step: Narrow to Wide
The pg2.np_2016Camping table contains public use statistics for camping in 2016 from the
National Park Service. To enable statistics to be calculated for individual camping locations,
restructure the table as a wide table.
a. Examine the pg2.np_2016Camping table to determine the three unique values of the
CampType column.
b. Write a DATA step to read pg2.np_2016camping and create camping_wide. Use
IF-THEN/ELSE statements to assign CampCount to the Tent, RV, and Backcountry
columns based on the value of CampType.
c. Use the RETAIN statement to hold the values of ParkName, Tent, RV, and Backcountry in
the PDV each time that the PDV reinitializes.
d. Use the BY statement to group the data by ParkName. Add a subsetting IF statement to
output the last row for each value of ParkName.
e. Keep the ParkName, Tent, RV, and Backcountry columns. Format Tent, RV, and
Backcountry with commas.
f. Run the program and confirm that a column exists for each unique camping location (Tent,
RV, and Backcountry).

Challenge
3. Using Arrays to Restructure a Table
The pg2.np_lodging_array table contains statistics for stays at lodging facilities in 2015, 2016,
and 2017. Create a table that contains two rows for each year (2015, 2016, and 2017),
corresponding to the lodge counts for individual parks.
Note: An array enables you to perform the same action on a group of similar columns. In this
example, Lodge2015, Lodge2016, and Lodge2017 are all numeric columns that
represent the same measure for different years. Using an array with a DO loop can
simplify repetitive code. Access SAS Help for more information about arrays.
a. Examine the np_lodging_array table. In addition to ParkName, notice that there are three
columns containing visitor lodging counts. Lodge2015, Lodge2016, and Lodge2017 contain
counts for visitors staying at lodges.
b. Open the p207p03.sas program from the practices folder. Run the program and confirm
that the output table stacks the values of the Lodge columns.


c. Create a copy of the DATA step and paste it at the end of the program. Modify the second
DATA step to use an array to simplify the repetitive processing.
1) Delete all statements between the FORMAT and RUN statements.
2) Add the following ARRAY statement after the FORMAT statement to define an array
named lodge that includes the columns Lodge2015, Lodge2016, and Lodge2017.
array Lodge[2015:2017] Lodge2015-Lodge2017;
3) Add a DO loop with an index variable, year, that will loop three times for the values 2015
to 2017.
4) Inside the DO loop, perform the following actions:
a) Create a column named Stays that will be equal to the value of each column in the
lodge array.
Note: The array name can be used in combination with the DO loop index variable
to represent each column in the array. For example, lodge[year] will be
replaced by lodge[2015] the first time through the DO loop. Lodge[2015]
represents the first column in the lodge array, which is Lodge2015.
b) Output the row to the new table.
d. Run the second DATA step and verify that the table includes three rows for each value of
ParkName.
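
The array technique described in this practice can be sketched on a different, hypothetical table so that the pattern is visible without reproducing the practice program. The table names work.sales_wide and work.sales_narrow and the columns Region, Sales2018 through Sales2020, and Amount are assumptions made for this illustration. The array bounds 2018:2020 let the DO loop index double as the year value.

```sas
/* Hypothetical wide input: one column per year */
data work.sales_wide;
   input Region $ Sales2018 Sales2019 Sales2020;
   datalines;
East 10 12 15
West 20 18 25
;
run;

data work.sales_narrow;
   set work.sales_wide;
   /* Sales[2018] refers to Sales2018, Sales[2019] to Sales2019, and so on */
   array Sales[2018:2020] Sales2018-Sales2020;
   do Year=2018 to 2020;
      Amount=Sales[Year];
      output;            /* one output row per year for each input row */
   end;
   keep Region Year Amount;
run;

proc print data=work.sales_narrow noobs;
run;
```

Each input row produces three output rows, one for each element of the array.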


7.2 Restructuring Data with the TRANSPOSE Procedure

Restructuring Data with PROC TRANSPOSE

PROC TRANSPOSE DATA=input-table <OUT=output-table>;
    <ID col-name;>
    <VAR col-name(s);>
RUN;

PROC TRANSPOSE can restructure data with simple statements.

Another way to restructure data is to use the TRANSPOSE procedure. PROC TRANSPOSE creates
an output table by restructuring the values in a SAS table, transposing selected columns into rows.
Although the DATA step is an effective way to restructure tables, the TRANSPOSE procedure can
sometimes achieve the same result with less code.


7.03 Activity
Open p207a03.sas from the activities folder and perform the following tasks:
1. Highlight the PROC PRINT step and run the selection. Note how many
rows are in the sashelp.class table.
2. Highlight the PROC TRANSPOSE step and run the selection. Answer the
following questions:

• Which columns from the input table are transposed into rows?
• What does each column in the output table represent?
• What is the name of the output table?
Note: Keep this program open for the next activity.

7.04 Activity
Use the program from the previous activity to perform the following tasks:
1. Add the OUT= option in the PROC TRANSPOSE statement to create an
output table named class_t.
2. Add the following ID statement and run the step. What changes in the
results?
id Name;

3. Add the following VAR statement and run the step. What changes in the
results?
var Height Weight;


Transposing Values within Groups

PROC TRANSPOSE DATA=input-table <OUT=output-table>;
    <VAR col-name(s);>
    <ID col-name;>
    <BY col-name(s);>
RUN;

Use the BY statement to transpose data within groups. The input table must be sorted by
the same columns that you specify in the BY statement.

Transposing Values within Groups

[Slide: a narrow table transposed into a wide table.]
by Season Basin Name;
Each unique combination of BY values creates one row in the output table.

Creating a Wide Table with PROC TRANSPOSE

Scenario
Use PROC TRANSPOSE to transpose data values within groups into rows. Use options to
customize the output table.

Files
• p207d02.sas
• storm_top4_narrow – a SAS table with four rows for each storm representing the highest wind
measurements

Syntax

PROC TRANSPOSE DATA=input-table OUT=output-table
               <PREFIX=column> <NAME=column>;
    <VAR column(s);>
    <ID column;>
    <BY column(s);>
RUN;

Notes
• PROC TRANSPOSE can be used to restructure tables.
• The OUT= option creates or replaces an output table based on the syntax used in the step.
• By default, all numeric columns in the input table are transposed into rows in the output table.
• The VAR statement lists the column or columns to be transposed.
• The output table will include a separate column for each value of the ID column. There can be only
one ID column. The ID column values must be unique in the column or BY group.
• The BY statement transposes data within groups. Each unique combination of BY values creates
one row in the output table.
• The PREFIX= option provides a prefix for each value of the ID column in the output table.
• The NAME= option names the column that identifies the source column containing the transposed
values.
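
The notes above can be sketched on a small inline table before working through the demo. The table names work.wind_narrow and work.wind_wide, the data values, and the NAME= value Source are hypothetical. The input is deliberately unsorted, so a PROC SORT step is included to satisfy the requirement that the input be sorted by the BY column.

```sas
/* Hypothetical narrow input: two wind measurements per storm, unsorted */
data work.wind_narrow;
   input Name $ WindRank WindMPH;
   datalines;
Beta 1 90
Beta 2 85
Alpha 1 120
Alpha 2 110
;
run;

/* The BY statement in PROC TRANSPOSE requires sorted input */
proc sort data=work.wind_narrow;
   by Name;
run;

proc transpose data=work.wind_narrow
               out=work.wind_wide
               prefix=Wind name=Source;
   by Name;        /* one output row per storm          */
   id WindRank;    /* values 1 and 2 name the new columns */
   var WindMPH;    /* column whose values are transposed  */
run;

proc print data=work.wind_wide noobs;
run;
```

The output table should contain one row per storm with the columns Name, Source, Wind1, and Wind2.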

Demo
1. Open the p207d02.sas program in the demos folder and find the Demo section. Run the PROC
TRANSPOSE step and examine the error in the log. The step fails because the values of ID are
not unique.
2. Add a BY statement to transpose the values within the groups of Season, Basin, and Name.
Run the program.
proc transpose data=pg2.storm_top4_narrow out=wind_rotate;
var WindMPH;
id WindRank;
by Season Basin Name;
run;


3. Notice that the unique values of WindRank (1, 2, 3, and 4) are assigned as the column names
for the transposed values of WindMPH.
Note: Enterprise Guide and SAS Studio set the VALIDVARNAME= system option to ANY,
which permits column names that do not follow standard SAS naming rules. If
VALIDVARNAME= is set to V7, underscores are added in front of leading numbers or in
place of spaces or special symbols in column names.
4. To give the transposed columns standard names, add the PREFIX=Wind option in the PROC
TRANSPOSE statement. To rename the _name_ column that identifies the source column for
the transposed values, add the NAME=WindSource option as well. Run the step.
proc transpose data=pg2.storm_top4_narrow out=wind_rotate
prefix=Wind name=WindSource;

5. Delete the NAME= option and add the DROP= data set option on the output table to drop the
_name_ column. Run the step.
proc transpose data=pg2.storm_top4_narrow
out=wind_rotate(drop=_name_) prefix=Wind;
var WindMPH;
id WindRank;
by Season Basin Name;
run;
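
To see the naming behavior described in the note in step 3, the sketch below runs a transpose without PREFIX= while VALIDVARNAME= is set to V7. The table work.rank_narrow is hypothetical sample data, and resetting the option to ANY at the end assumes that ANY was the session default, as it is in Enterprise Guide and SAS Studio.

```sas
/* Hypothetical sample data, not the course table */
data work.rank_narrow;
    input Name $ WindRank WindMPH;
    datalines;
ALEX 1 85
ALEX 2 80
;
run;

/* Apply standard SAS naming rules for this demonstration */
options validvarname=v7;

proc transpose data=work.rank_narrow out=work.rank_wide;
    var WindMPH;
    id WindRank;
    by Name;
run;

/* List the column names: under V7 the ID values 1 and 2 */
/* are expected to become _1 and _2                      */
proc contents data=work.rank_wide varnum;
run;

/* Assumption: the session default was ANY; restore the  */
/* value that your environment normally uses.            */
options validvarname=any;
```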

Transposing Values into Groups

You can also use PROC TRANSPOSE to convert one row into multiple rows (wide to narrow).

Imagine you are starting with a wide table, where each storm has four separate columns for Wind1
through Wind4. You want to transpose the values of Wind1 through Wind4 for each unique
combination of Season, Basin, and Name. This will create four rows per storm, where each row is
one of the wind measures. Minor adjustments to the PROC TRANSPOSE code enable you to
convert a wide table to a narrow table.

7.05 Activity
Open p207a05.sas from the activities folder and perform the following tasks:
1. Run the program. Notice that, by default, PROC TRANSPOSE transposes
all the numeric columns, Wind1-Wind4.
2. Add a VAR statement in PROC TRANSPOSE to transpose only the Wind1
and Wind2 columns. Run the program.
3. What are the names of the columns that contain the column names and
values that have been transposed?

Changing Column Names

PROC TRANSPOSE DATA=input-table <OUT=output-table>
<NAME=column> <PREFIX=prefix>;

proc transpose data=pg2.storm_top4_wide name=WindRank
prefix=WindMPH;
by Season Basin Name;
var wind1-wind4;
run;

When PROC TRANSPOSE stacks values into new columns, the new columns are assigned generic
names by default: _NAME_ and COL1. You can use the NAME= option to name the column that
contains the column names that were transposed from the input table. The PREFIX= option specifies
the prefix for the column or columns that contain the transposed data values.

With PREFIX=WindMPH, the transposed column is named WindMPH1. How could you change the name of
the column in the output table to exclude the number 1?

Create an output table and use the RENAME= data set option.

proc transpose data=pg2.storm_top4_wide name=WindRank
out=storm_rotate(rename=(col1=WindMPH));
by Season Basin Name;
var wind1-wind4;
run;

You can use the OUT= option to name the output table that you want to create, and then use the
RENAME= data set option to change COL1 to any name that you choose.
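
The same wide-to-narrow pattern can be run without the course library. This is a minimal sketch under stated assumptions: the table work.wind_wide and its values are invented for illustration, while the NAME= option and the RENAME= data set option are used as described above.

```sas
/* Hypothetical wide table: one row per storm, one column per wind rank */
data work.wind_wide;
    input Season Name $ Wind1 Wind2;
    datalines;
2016 ALEX 85 80
2016 BONNIE 45 40
;
run;

proc transpose data=work.wind_wide
               out=work.wind_narrow(rename=(col1=WindMPH)) /* COL1 -> WindMPH   */
               name=WindRank;                              /* _NAME_ -> WindRank */
    by Season Name;    /* input is already in Season, Name order */
    var Wind1 Wind2;   /* each listed column becomes one row     */
run;

proc print data=work.wind_narrow noobs;
run;
```

The result should contain two rows per storm, with WindRank holding the source column name (Wind1 or Wind2) and WindMPH holding the value.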

Discussion
When might you prefer to use the DATA
step instead of PROC TRANSPOSE to
restructure data and vice versa?

Beyond SAS Programming 2

What if you want to ...
• ... access SAS documentation and examples for PROC TRANSPOSE?
  Go to the SAS Help Center for the TRANSPOSE procedure.
• ... use array processing to make your DATA step restructuring code simpler?
  View examples of array processing, or take the SAS Programming 3: Advanced Techniques and
  Efficiencies course.

Put It All Together!

Practice all your new SAS skills by completing a comprehensive case study. Visit the Extended
Learning page to access the case study materials.

Join the Discussion

Discuss SAS Courses and Test Your SAS Skills

Visit the SAS Training Community to ask questions of our experts and exchange ideas with other
SAS users.
https://communities.sas.com/sas-training

Practice

If you restarted your SAS session, open and submit the libname.sas program in the course files.

Level 1
4. Restructuring a Table Using PROC TRANSPOSE: Wide to Narrow
The pg2.np_2017Camping table contains public use statistics for camping in 2017 from the
National Park Service. Convert the data from a wide table to a narrow table.
a. Open the p207p04.sas program in the practices folder. Highlight the PROC PRINT step and
run the selected code. Notice that the table contains three columns (Tent, RV, and
Backcountry) with visitor counts for each value of ParkName. In addition, notice that the
table is sorted by ParkName.

b. Add the OUT= option in the PROC TRANSPOSE statement to create a table named
work.camping2017_t.
c. Add the BY statement to group the data by ParkName. This creates one row in the output
table for each unique value of ParkName.
d. Add the VAR statement to transpose the Tent and RV columns. Highlight the PROC
TRANSPOSE step and run the selected code.
e. Use the NAME= option to specify Location as the name for the column that contains the
names of the columns from the input table.
f. Use the RENAME= data set option after the output table to rename COL1 as Count.
Highlight the PROC TRANSPOSE step and run the selected code.

Level 2
5. Restructuring a Table Using PROC TRANSPOSE: Narrow to Wide
The pg2.np_2016Camping table contains public use statistics for camping in 2016 from the
National Park Service. Convert the data from a narrow to a wide table.
a. Examine the np_2016Camping table. Notice that the table contains one row for each
location type (Tent, RV, and Backcountry) by ParkName. In addition, notice that the table is
sorted alphabetically by ParkName.
b. Write a PROC TRANSPOSE step to create a wide table named work.camping2016_t.
Include only the ParkName column and individual columns for the values of CampType.

Challenge
6. Naming Transposed Columns when the ID Column Has Duplicate Values
The pg2.weather_highlow table contains weather data for four locations. The high and low
temperatures are recorded for the months of June, July, and August.
a. Open p207p06.sas from the practices folder. Run the program and examine the output
table. Notice that the table contains two rows for each value of Location and Month. The first
row represents the high temperature and the second row is the low temperature.
b. Write a PROC TRANSPOSE step to create a table, work.lows, that contains the low
temperatures for each reporting location. Use the LET option to transpose only the last row
for each BY group. Use the values of Month as the names for the transposed columns.
Note: The LET option transposes only the last row for each BY group. Be sure your data is
sorted in the order that you require. For more information about the LET option, view
SAS Help.
c. Examine the output table and confirm that three columns (Jun, Jul, and Aug) exist for each
value of Location. The values of the month columns should be the low temperatures.
Note: Warning messages will still appear in the log indicating that the Month values are
duplicated within each value of Location.

7.3 Solutions
Solutions to Practices
1. Restructuring a Table Using the DATA Step: Wide to Narrow
data work.camping_narrow(drop=Tent RV Backcountry);
set pg2.np_2017Camping;
length CampType $11;
format CampCount comma12.;
CampType='Tent';
CampCount=Tent;
output;
CampType='RV';
CampCount=RV;
output;
CampType='Backcountry';
CampCount=Backcountry;
output;
run;
2. Restructuring a Table Using the DATA Step: Narrow to Wide
data work.camping_wide;
set pg2.np_2016Camping;
by ParkName;
keep ParkName Tent RV Backcountry;
format Tent RV Backcountry comma12.;
retain ParkName Tent RV Backcountry;
if CampType='Tent' then Tent=CampCount;
else if CampType='RV' then RV=CampCount;
else if CampType='Backcountry' then Backcountry=CampCount;
if last.ParkName;
run;
3. Using Arrays to Restructure a Table
data np_lodge_stack;
set pg2.np_lodging_array;
keep ParkName Year Stays;
format Stays comma12.;
array Lodge[2015:2017] Lodge2015-Lodge2017;
do Year=2015 to 2017;
Stays=Lodge[Year];
output;
end;
run;

4. Restructuring a Table Using PROC TRANSPOSE: Wide to Narrow
proc transpose data=pg2.np_2017camping
out=work.camping2017_transposed(rename=(COL1=Count)) name=Location;
by ParkName;
var Tent RV;
run;
5. Restructuring a Table Using PROC TRANSPOSE: Narrow to Wide
proc transpose data=pg2.np_2016camping
out=work.camping2016_transposed(drop=_name_);
by ParkName;
id CampType;
var CampCount;
run;
6. Naming Transposed Columns when the ID Column Has Duplicate Values
proc sort data=pg2.weather_highlow out=sort_highlow;
by Location;
run;

proc transpose data=sort_highlow out=lows let;
by Location;
id Month;
run;
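
The effect of the LET option can be checked on a few rows. The sketch below is illustrative only: the table work.temps, the column name Temp, and the values are assumptions, not the contents of pg2.weather_highlow.

```sas
/* Hypothetical data: for each Location and Month, the high row */
/* comes first and the low row comes second                     */
data work.temps;
    input Location $ Month $ Temp;
    datalines;
Cary Jun 88
Cary Jun 67
Cary Jul 91
Cary Jul 70
;
run;

/* Month repeats within each Location, so the ID values are     */
/* duplicated. LET permits the duplicates and keeps the last    */
/* row for each ID value, which here is the low temperature.    */
proc transpose data=work.temps out=work.temps_low(drop=_name_) let;
    by Location;
    id Month;
    var Temp;
run;

proc print data=work.temps_low noobs;
run;
```

The output should contain a single row for Cary with Jun=67 and Jul=70, and the log should still show warnings about the duplicated ID values.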

Solutions to Activities and Questions

7.01 Multiple Choice Question – Correct Answer

Which table and column (or columns) could you use with PROC MEANS to
calculate an average for all test scores combined?

a. class_test_wide, Math and Reading
b. class_test_narrow, TestScore (correct answer)

proc means data=pg2.class_test_narrow maxdec=1;
var TestScore;
run;

7.02 Activity – Correct Answer
1. Examine the DATA step code and run the program. Uncomment the
RETAIN statement and run the program again. Why is the RETAIN
statement necessary?

The RETAIN statement holds values in the PDV across multiple iterations of
the DATA step. The last row for each student includes both test scores.

[Figure: output tables without RETAIN and with RETAIN]

2. Add a subsetting IF statement to include only the last row per student in
the output table.
data class_wide;
set pg2.class_teststack;
by name;
retain Name Math Reading;
keep Name Math Reading;
if TestSubject="Reading" then Reading=TestScore;
else if TestSubject="Math" then Math=TestScore;
if last.name=1 then output;
run;

3. What must be true of the input table for the DATA step to work?
The data must be sorted by Name.
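
If the input is not already in order, a PROC SORT step can come first. The following is a sketch with made-up data: the tables work.scores, work.scores_sorted, and work.scores_wide are hypothetical, and the DATA step mirrors the one above.

```sas
/* Hypothetical unsorted narrow table */
data work.scores;
    input Name $ TestSubject $ TestScore;
    datalines;
Chris Math 81
Alex Math 92
Chris Reading 77
Alex Reading 88
;
run;

/* BY-group processing requires rows ordered by the BY column */
proc sort data=work.scores out=work.scores_sorted;
    by Name;
run;

data work.scores_wide;
    set work.scores_sorted;
    by Name;
    retain Math Reading;            /* hold scores across rows      */
    keep Name Math Reading;
    if TestSubject="Reading" then Reading=TestScore;
    else if TestSubject="Math" then Math=TestScore;
    if last.Name then output;       /* write one row per student    */
run;
```

Running the DATA step directly on work.scores, without the sort, stops with an error that the BY variables are not properly sorted.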

7.03 Activity – Correct Answer

Which columns from the input table are transposed into rows?
Only the numeric columns are transposed (Age, Height, and Weight).

What does each column in the output table represent?
Each column corresponds to a student (row) from the input table.

What is the name of the output table?
work.data1

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.DATA1 has 3 observations and 20 variables.

Keep this program open for the next activity.

7.04 Activity – Correct Answer

proc transpose data=sashelp.class out=class_t;
id Name;
var Height Weight;
run;

The values of the ID column are assigned as column names.
The VAR statement limits the columns that are transposed to rows.

7.05 Activity – Correct Answer

2. Add a VAR statement in PROC TRANSPOSE to transpose only the Wind1
and Wind2 columns. Run the program.
var Wind1 Wind2;

3. What are the names of the columns that contain the column names and
values that have been transposed?
_NAME_ and COL1
