© Copyright 2015, Simplilearn. All rights reserved.


Lesson 0— Introduction

Introduction

Hi! Welcome to the “Data Science with Statistical Analysis System, or SAS,” course offered by
Simplilearn. In this video you’ll see some interesting highlights of this course.

Why SAS

• Have you faced challenges during data processing because of the size of the data?
• Have you felt the need to combine, separate, compare, and extract data based on a specific
requirement?
• Has interpreting data been difficult because you couldn’t manipulate it?
• Have you ever wanted to learn the most in-demand Analytics technology?

Why SAS

SAS can help you achieve all this and more. It offers a variety of data analysis tools that can handle
large datasets, and it provides an end-to-end solution for the entire analytics cycle. It is widely
regarded as a leader in the commercial analytics space.

What is SAS

SAS is an integrated system of software solutions, which enables you to perform the following tasks:

• Data entry, retrieval, and management
• Report writing and graphics design
• Statistical and mathematical analysis
• Business forecasting and decision support
• Operations research and project management
• Applications development

What is SAS

Data science is concerned with organizing, packaging, and delivering data. SAS can help in all three stages.
With the tools at their disposal in SAS, Data Scientists can organize and analyze data and provide
interpretations or results.

SAS has an edge over other tools with its huge array of statistical functions, user-friendly graphical user
interface, and technical support.

What is SAS

Industries that use SAS include Automotive, Banking, Capital Markets, Consumer Goods, Defense, Health
Care, Higher Education, Manufacturing, Media, Retail, Sports, and Entertainment.

Market Trends

Demand for SAS professionals has increased dramatically compared with demand for professionals skilled in
other data analysis software.

Objectives

Simplilearn’s Data Science with SAS course will enable you to:

• Understand the role of a Data Scientist
• Use the SAS tool
• Apply Data Manipulation and Optimization techniques
• Work on Advanced Statistical concepts like Clustering, Linear Regression, and Decision Trees
• Apply data analysis methods to real-world business problems
• Understand and apply predictive modeling techniques

Objectives

Attention people!

Simplilearn provides an exciting range of learning modules for our eager learners.

Because SAS is a statistical course, Simplilearn:

• Provides content with visualization to enhance learning.
• Takes you to the syntax classroom to teach you the coding.
• Holds your attention by providing logical breaks in the form of knowledge checks.
• Provides hands-on experience with demos, assignments, and practice sessions.

Going one step further, Simplilearn introduces gaming to add an element of challenge to your learning.

Learn while you play “Organize to Analyze”. Let’s begin this course!

Simplilearn’s Data Science with SAS Course

This course enables you to learn the key concepts of SAS, which are important for Data Analytics, using
practical examples. The course comprises 32 hours of Instructor Led Training, 24 hours of eLearning, and
hands-on experience with industry projects. You will receive full support from the Simplilearn Faculty
throughout the course and mentoring for project work.

You will also be able to access three sets of assessment papers comprising 100 questions each, case
studies, and four live industry projects on the SAS tool. On successful completion, you will receive an
experience certificate.

To complete the Data Science with SAS certification, you’ll need to:

• Complete any two projects and get them evaluated by the lead trainer.
• Score at least 80% on the online exam.

Submit your queries by writing to Help and Support on www.simplilearn.com or talking directly to our
support staff with the Simplitalk and Live Chat options.

Simplilearn’s Data Science with SAS Course

Go ahead and begin the “Data Science with SAS” course. The first lesson is “Analytics Overview.”

Happy Learning!

Lesson 01 — Analytics Overview

Introduction

Hello and welcome to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.

In this lesson, “Analytics Overview,” you will learn what data analytics is and how to perform data
analysis using SAS.

What’s In It for Me

In this lesson, you will understand the concept of data analytics, its types, and its techniques. You will be
able to list the various types of analytical problems industries face and describe ways to solve them
using SAS. You will also learn about the widely used analytical tools for performing data analysis.

What is Analytics

Analytics plays a vital role not only in businesses but also in various fields such as sports, healthcare,
finance, and government. It is hard to think of any aspect of life that is not affected by analytics.

Now, you must be wondering what analytics really is.

What is Analytics

Analytics is a scientific process that examines raw data to draw meaningful conclusions from the data. It
gives insights into the information to help organizations make better decisions.

What is Analytics

The study of analytics often involves analyzing historical data to look for potential trends, to understand
the effects of certain decisions, or to evaluate the performance of the business on the basis of the
decisions made. This comprehensive knowledge of past trends and decisions can form the basis on which
corrective actions can be taken.

Data Analysis—Example

Suppose you are working with an e-commerce company and you want to run a marketing campaign to
increase your sales.

Data Analysis—Example

To do so, you need to analyze your existing campaigns and how much they help in increasing the current
business, and collect more statistical information from all the areas. This will help you examine the
key areas that can drive your business.

The tasks that you perform to increase sales through a marketing campaign are called marketing
analysis.

Data Analysis—Example

Analytics even helps companies optimize their Supply Chain performance. By analyzing their historical
data on a daily, weekly, and monthly basis, they evaluate and forecast the future demand for their
products.

Data Analysis—Example

Suppose you are working in a multinational tire company and you want to analyze the demand for tires at
two different depots across the globe. If proper analysis and evaluation are performed, you can supply the
products per the demand and maintain the required stock in the stores.

This type of analysis is called supply-demand analytics.

From these examples, it is clear that data analysis plays a vital role in every organization.

Types of Analytics

There are four distinct types of analytics:

1) Descriptive explains what has happened
2) Diagnostic suggests why it happened
3) Predictive indicates what could happen
4) Prescriptive recommends what should happen

We will learn about these types in the subsequent screens.

Descriptive Analytics

Descriptive analytics allows you to break a big chunk of data into smaller pieces, chunking out relevant
information from the data or providing a brief synopsis of what happened. This is also known as the
“simplest class of analytics.”

Let us take an example of using descriptive analytics for customer data. It includes finding answers to the
following questions:

• How many different segments of buyers are we dealing with?
• Where are these buyers located?
• How do high-value customers differ from normal customers?
• What are they interested in?
• What is the income, age, number of children, occupation, and regional breakdown of these
buyers?
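
Questions like these can be explored in SAS with simple summary procedures. The following is a minimal sketch, not a prescribed method: the dataset `customers`, its variables, and all of its values are invented for illustration. PROC FREQ counts buyers per segment and region, and PROC MEANS compares age and income across segments.

```sas
/* Hypothetical customer data, invented for illustration only */
data customers;
   input Customer_ID Segment $ Region $ Age Income;
   datalines;
1 High North 45 98000
2 Normal South 31 42000
3 Normal North 28 39000
4 High East 52 120000
5 Normal East 37 51000
6 High South 41 87000
;
run;

/* How many buyers fall in each segment and in each region? */
proc freq data=customers;
   tables Segment Region;
run;

/* How do high-value customers differ from normal customers? */
proc means data=customers n mean min max;
   class Segment;
   var Age Income;
run;
```

The DATA step and PROC step used here are explained in the lesson “Introduction to SAS.”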

Diagnostic Analytics

Diagnostic analytics is the best option if you want to go deeper into the collected data.

In diagnostic analytics, we are not concerned with “What happened?”; instead, we focus on “Why did it
happen?”

Descriptive analytics alone doesn’t provide us with answers to questions like “How do we fix this?” or “How
can we improve this?”

Predictive Analytics

Predictive analytics is another option to help us condense data. It uses different statistical, data
modeling, and data mining techniques to study recent and past trends, thereby allowing business
analysts or data scientists to make predictions.

Here is an example of using predictive analytics for a marketing campaign. It will look for answers to the
following questions:

• Who will respond to this campaign, for what product, and through which channel?
• What is the potential value of each customer and prospect?
• Who will stop the subscription to your service, and when would that be?

Prescriptive Analytics

Prescriptive analytics is the last phase of business analytics and is related to both descriptive and
predictive analytics. Descriptive analytics provides information about what has happened, and
predictive analytics helps forecast what might happen, which is probabilistic in nature. Prescriptive
analytics goes a step further: it optimizes decision making by determining the best solution available
among various choices, given the business constraints.

Areas of Analytics

Let’s look at a few types of analytics depending on the areas we use them in:

• Customer Analytics
• Financial Analytics
• Performance Analytics
• Risk Analytics

Customer Analytics

Customer Analytics is a process that helps organizations make critical decisions and deliver offers that
are anticipated. This analytics offers organizations necessary customer insights to make better
decisions. Customer analytics uses techniques such as market segmentation, predictive analytics, data
modeling, and data visualization. It plays a pivotal role in the prediction of customer behavior.

Example:

Telecom companies these days use different marketing methods to retain their customers.

Financial Analytics

This type of analytics is the new way to drive competitive advantage. It helps financial executives explore
different ways to answer specific finance-related business questions and forecast future financial
situations. In today's dynamic business environment, financial analytics helps the finance function to
bring greater value to organizations.

Companies can leverage financial analytics to take multiple views of their data and derive insights that
will help them take necessary actions.

Example:

Reading cash flow statements, balance sheets, and income statements falls under financial analytics.

Performance Analytics

Performance analytics is the practice of using data and technology to study how our business is
performing to continuously make it better. The basic functions involved in Performance Analytics are
Planning, Organizing, Staffing, Directing, and Controlling.

Example:

In Human Resource Management, the performance of the employees is monitored on a regular basis,
keeping in mind the parameters dependent on the expected outcomes.

Risk Analytics

Risk analytics tries to foresee the uncertainties of the predicted future, which helps evaluate a project’s
likely success or failure.

Risk analytics can be categorized as quantitative or qualitative.

Quantitative risk analysis quantifies the possible project results specific to a project. This analysis tries to
numerically evaluate the possibilities of various adverse events and predict the losses a company would
go through if any of these possibilities come true.

Qualitative risk analysis is performed on almost all risks and is not numerically defined. This method
involves defining various project-related threats and risks, determining the extent of these risks and
proposing corrective actions to avoid these risks.

Example:

In the Banking Industry, credit scores are built to predict an individual’s delinquency behavior and are
used to represent the creditworthiness of each individual.

Analytical Tools

So far you have learned what analytics is, its types, and the areas of analytics.

Let’s now look at the popular analytical tools available for data analysis.

Analytical Tools

There are various data analysis tools such as:

• Excel,
• SAS,
• Python,
• R,
• MATLAB, and
• Tableau Software.

Analytical Tools (contd.)

The following reasons make SAS one of the most popular tools for analyzing and visualizing data:

• Helps users understand the nature of the customers and anticipate the future by forecasting and
modeling
• Processes and manages large and complex datasets
• Works with multiple variables
• Tracks all the operations on datasets and generates output
• Provides a graphical user interface, graphs, regression results, and summary statistics

Analytical Techniques

With the help of analytical techniques, we can easily examine the complex relationships between
variables.

The following are a few analytical techniques we use in SAS to analyze data:

• Clustering
• Regression
• Decision Tree
• Time Series

Let’s acquire a basic understanding of these techniques.

Analytical Techniques (contd.)

Clustering is the process of grouping abstract objects into classes of similar objects. It is a common
technique used for statistical data analysis and is mainly involved in the process of data mining.

It is used in various applications such as market research, pattern recognition, data analysis, and image
processing.
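
As a rough illustration of clustering in SAS, the sketch below groups a handful of shoppers into two clusters with PROC FASTCLUS, a k-means style procedure that is part of SAS/STAT. The dataset `shoppers`, its variables, and its values are hypothetical, and the choice of two clusters is an assumption made for the example.

```sas
/* Hypothetical data: annual income and annual spend, in thousands */
data shoppers;
   input Shopper_ID Income Spend;
   datalines;
1 25 5
2 27 6
3 30 7
4 80 30
5 85 33
6 90 35
;
run;

/* Group the shoppers into two clusters based on Income and Spend */
proc fastclus data=shoppers maxclusters=2 out=clustered;
   var Income Spend;
run;

/* The OUT= dataset adds a CLUSTER variable holding each shopper's group */
proc print data=clustered;
run;
```

With real data, the number of clusters and the variables to cluster on would be chosen from the business question rather than fixed in advance.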

Analytical Techniques (contd.)

Regression is a statistical measure to determine the strength of the relationship between one dependent
variable (usually denoted by Y) and a series of other changing variables (known as independent
variables).

Example:

Consider sales data where we have quantity sold, amount, and marketing expenses of various products
in the company. Using regression, we can determine the relationship between quantity sold, amount,
and marketing expenses.
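
One possible way to run such a regression in SAS is PROC REG. In the sketch below, the dataset `sales` and all of its values are made up, and treating quantity sold as the dependent variable is an assumption; the lesson does not say which variable is dependent.

```sas
/* Hypothetical sales data, invented for illustration only */
data sales;
   input Product $ Quantity_Sold Amount Marketing_Expense;
   datalines;
A 120 2400 300
B 95 2100 250
C 150 2900 420
D 80 1500 180
E 200 4300 560
F 110 2000 310
G 170 3600 450
H 60 1300 150
;
run;

/* Model quantity sold (Y) as a function of amount and marketing expenses */
proc reg data=sales;
   model Quantity_Sold = Amount Marketing_Expense;
run;
quit;
```

The output includes the estimated coefficients and an R-square value, which indicate the direction and strength of the relationship.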

Analytical Techniques (contd.)

Decision tree is a form of multiple variable analysis. It allows us to predict, explain, describe, or classify an
outcome.

Example:

An example of a multiple variable analysis is the probability of a sale, or the likelihood of responding to a
marketing campaign, as a result of the combined effects of multiple input variables, factors, or
dimensions.

Analytical Techniques (contd.)

Time series analysis helps in forecasting and predicting future values based on previously observed values.

Key Takeaways

Let us now quickly recap what we have learned in the lesson:

• Analytics is a scientific process to examine raw data to draw meaningful conclusions from the
data.
• Descriptive analytics allows you to break a big chunk of data into smaller pieces.
• Diagnostic analytics is used to go deeper into the collected data.
• Predictive analytics helps condense data.
• Prescriptive analytics optimizes decision making by determining the best solution from the
available options.
• Customer Analytics is a process that helps organizations make critical decisions and deliver
offers that are anticipated.
• A few analytical techniques of SAS are Clustering, Regression, Decision Tree, and Time Series.

Conclusion

This concludes “Analytics Overview.” The next lesson is “Introduction to SAS.”

ANSWERS:

1. Question: Which one of the following is NOT a type of Business Analytics?
   Answer: d.
   Explanation: There are four distinct types of Data Analytics: Descriptive, Diagnostic, Predictive, and
   Prescriptive.

2. Question: Which one of the following is correct statement about Descriptive Analytics?
   Answer: a.
   Explanation: Descriptive Analytics allows you to break data into smaller pieces, extracting relevant
   information to get a brief synopsis of what happened.

3. Question: Predictive Analytics helps forecast what might happen; it is probabilistic in nature.
   Answer: a.
   Explanation: Predictive Analytics helps to forecast what might happen; it is probabilistic in nature.

4. Question: Which of the following areas of analytics refers to the practice of using data and
   technology to study how your business is performing to continuously make it better?
   Answer: a.
   Explanation: Performance analytics is the practice of using data and technology to study how your
   business is performing to continuously make it better.

5. Question: Which of the following is an example of Customer Analytics?
   Answer: c.
   Explanation: Customer Acquisition and Customer Retention is an example of Customer Analytics.

6. Question: Which of the following is the process of grouping abstract objects into classes of
   similar objects?
   Answer: a.
   Explanation: Clustering is the process of grouping abstract objects into classes of similar objects.

Lesson 02 — Introduction to SAS

Introduction

Hi and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.

In this lesson, “Introduction to SAS,” you will be introduced to the essential concepts of Statistical
Analysis System.

What’s In It for Me

In this lesson, you will understand what SAS is and its components. You will also get acquainted with the
SAS console.

In addition, you will learn to import/export data and list SAS’s different temporary and permanent
libraries.

What is SAS

Let’s start this lesson by defining what Statistical Analysis System is.

Statistical Analysis System, or SAS, is a software suite developed by the SAS Institute for advanced
analytics, multivariate analyses, Business Intelligence, data management, and predictive analytics.

What is SAS

SAS is a set of solutions for enterprise-wide business users, and it provides a powerful fourth-generation
programming language for performing tasks such as:

• data entry, retrieval, and management,

• statistical and mathematical analysis,

• business planning, forecasting, and decision support,

• operations research and project management, and

• quality improvement.

Before we begin with the concepts of SAS, let us install the SAS University Edition on your system.

SAS University Edition

SAS University Edition is a free version of SAS for practicing the SAS programming language.

You can download the free SAS University Edition by visiting the website shown on the screen.

http://www.sas.com/en_us/software/university-edition/download-software.html

Ensure you have the following system configuration to install the software:

1. 64-bit hardware
2. 1GB RAM
3. Microsoft Windows 7, 8, 8.1, or 10
4. Microsoft Internet Explorer 9, 10, or 11, Mozilla Firefox 21 or later, or Google Chrome 27 or later
version

Click the Installation Steps button to download the installation steps for the SAS software.

These installation steps are also available at the link shown on the screen.

http://support.sas.com/software/products/university-edition/docs/en/SASUniversityEditionQuickStartVirtualBox.pdf

Follow the installation steps carefully and enjoy working on the SAS software.

Opening SAS University Edition

Now that you have installed the SAS University Edition on your system, let’s see how to open the SAS
software.

Follow these steps to open the SAS University Edition:

Open VirtualBox by double-clicking its icon on the desktop to access the SAS University Edition.

Opening SAS University Edition

Click the Start button. VirtualBox opens the “Oracle VM VirtualBox” window.

Opening SAS University Edition

Type the link shown on the screen, http://localhost:10080/, in Internet Explorer, Mozilla Firefox, or
Google Chrome. Now, you can access the SAS University Edition Information Center.

Opening SAS University Edition

Click Start SAS Studio. SAS Studio opens in a new window.

There you have it! You now have access to SAS and can start practicing this new programming language.

Navigating In the SAS Console

SAS provides a graphical user interface that makes SAS easy to use.

Navigating In the SAS Console

SAS Studio has a navigation pane on the left side and a work area on the right side.

The navigation pane helps you access files from your system, server, or shared folder. It also holds
saved tasks, snippets, libraries, and file shortcuts for easy access.

Navigating In the SAS Console

The work area has three windows, namely CODE, LOG, and RESULTS.

Navigating In the SAS Console

The CODE window is used to write a program.

The LOG window is used to view messages about your SAS session and debug SAS programs.

The RESULTS window contains the record of the output.

Navigating In the SAS Console

To start a new program, either right-click “My Folders” under “Server Files and Folders” in the navigation
pane, click “New,” and select “SAS Program,” or just press the shortcut key “F4.”

To navigate between the programs, use the tab key.

Navigating In the SAS Console

The toolbar icons shown on the screen are used to:

• Execute the program.
• Save the program.
• Save the program under the desired name and location.
• Cut the program so that it can be pasted in the desired place.
• Paste the copied program in the desired place.
• Print the program code.
• Undo the task.
• Redo the task.
• Find the desired code and replace it with another code.
• Extract the code.

SAS Language Input Files

Well, let’s now start with the essential concepts of SAS.

When you work with SAS, you use files that are created and maintained by SAS and files that are not
related to SAS.

SAS supports the following input files:

• SAS files
• External files
• Database Management System, or DBMS, files

SAS files:

Files with formats or structures known to SAS are called SAS files. All SAS files reside in a SAS library.

A SAS file can be a SAS dataset, a catalog, a stored program, a multidimensional database file, or a
financial database file.

SAS Language Input Files

External files:

Files with formats or structures unknown to SAS are called external files. The raw data files that you want
to read into a SAS data file are referred to as external files.

SAS Language Input Files

Database Management System files:

Files that are stored in the form of databases are called Database Management System files. SAS
software enables you to write and read data to and from many common Database Management
Systems.

SAS Language Elements

The elements of the SAS language are:

• Statements,
• Expressions,
• Formats, and
• Functions.

These elements are similar to those of many other programming languages. They are used within the DATA
step or PROC step of a SAS program.

SAS Language Elements

The DATA step enables you to read raw data from, and write it to, external files and SAS files.

The PROC step is a group of procedure statements that enables you to analyze data and create
tables, reports, charts, and SQL queries.

In short, you can say that the DATA step is used to create and manipulate SAS data and the PROC step is
used to analyze the data and generate the output.

Let’s look at each step in some detail.

DATA Step

The DATA step is used to create SAS datasets, compute values, and select specific input records for
processing.

The DATA step creates the following types of output:

• SAS log
• SAS data file
• SAS view
• External data file

DATA Step

The SAS log is the default type and contains a list of processing messages.

A SAS data file is a SAS dataset that contains a data portion and a data descriptor portion. The descriptor
portion consists of the information about the contents and attributes of the SAS dataset.

A SAS view is a SAS dataset that uses descriptor information and data from other files.

An external data file is a data or text file that holds the results of DATA step processing.
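
To make the external data file output concrete, here is a small sketch that writes three raw text records to an external file and then reads them back into a SAS dataset. The fileref `rawfile` and the dataset name `electronic_ext` are hypothetical. The example uses `filename ... temp` so that it does not depend on any folder on your system; in practice you would point the fileref at a real path.

```sas
/* TEMP asks SAS for a temporary external file, so no path is assumed */
filename rawfile temp;

/* DATA _NULL_ runs a DATA step without creating a SAS dataset;
   FILE and PUT write text records to the external file */
data _null_;
   file rawfile;
   put 'LED Kara 500';
   put 'LCD Harry 400';
   put 'Iron Mary 125';
run;

/* INFILE and INPUT read the external file back into a SAS dataset */
data electronic_ext;
   infile rawfile;
   input Product_Name $ Salesman_Name $ Price;
run;

proc print data=electronic_ext;
run;
```

The first DATA step produces only an external data file and log messages. The second produces a SAS data file.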

DATA Step—Example

Let’s step into the “Syntax Classroom.” In the “Syntax Classroom,” you can learn all the essential syntax
required to work with SAS software.

DATA Step—Example

Let’s understand this with an example. Take a look at the example code written on the screen.

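/* DATA step: create the dataset Electronic from in-line data */
Data Electronic;
input Product_Name $ Salesman_Name $ Price;
Datalines;
LED Kara 500
LCD Harry 400
Mobile Lawrence 300
Iron Mary 125
;
Run;

/* PROC step: print the dataset under a title */
Proc Print data=Electronic;
title 'Electronic Dataset of Online XYZ Store';
Run;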

Here, the keyword “data” creates the dataset Electronic.

DATA Step—Example

The keyword “input” declares the input variables.

The variables declared here are product name, salesman name, and price.

DATA Step—Example

The dollar sign ($) after Product_Name and Salesman_Name declares them as character variables.

DATA Step—Example

The keyword “Datalines” indicates that the next lines contain input data.

DATA Step—Example

The keyword “Run” marks the end of a step and tells SAS to execute it.

DATA Step—Example

In this example, the product name, salesman name, and price are referred to as variables, and each row of
values (for example, LED, Kara, and 500) is called an observation.

DATA Step—Example

The “Proc Print” step prints the contents of the Electronic dataset as output.

DATA Step—Example

The keyword “title” sets the title that appears above the printed output. Here, the title is “Electronic
Dataset of Online XYZ Store”.
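
As noted earlier, the DATA step can also compute values and select specific input records. The variation below is only a sketch: the dataset name `electronic_filtered`, the variable `Discounted_Price`, the 10% discount, and the price threshold of 300 are all assumptions added for illustration.

```sas
data electronic_filtered;
   input Product_Name $ Salesman_Name $ Price;
   Discounted_Price = Price * 0.9;   /* compute a new value for each record */
   if Price >= 300;                  /* subsetting IF: keep only these records */
   datalines;
LED Kara 500
LCD Harry 400
Mobile Lawrence 300
Iron Mary 125
;
run;

proc print data=electronic_filtered;
   title 'Products Priced at 300 or More';
run;
```

Only the LED, LCD, and Mobile records pass the subsetting IF statement, so the Iron record is not written to the output dataset.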

DATA Step Processing—Compilation Phase

When you submit a DATA step for execution, it is first compiled and then executed. Let’s learn about
each phase in detail.

The compile phase checks for any syntax errors. The SAS statements written in SAS software are
compiled in this phase.

The compile phase creates an input buffer, a program data vector, and descriptor information.

DATA Step Processing—Compilation Phase

An input buffer is the area of memory into which each record of raw data is read when an INPUT
statement is executed. The input buffer is created only if the DATA step reads raw data.

DATA Step Processing—Compilation Phase

A program data vector, or PDV, is the area of memory where the SAS System builds your dataset one
observation at a time. When the program executes, data values are read from the input buffer or
created by SAS language statements and assigned to the appropriate variables in the program data
vector. From here, the variables are written to the SAS dataset as a single observation.

Descriptor information is information that SAS creates and maintains about each SAS dataset, including
dataset attributes and variable attributes.
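
One way to look at the descriptor information of a dataset is PROC CONTENTS. The short sketch below rebuilds a two-row version of the Electronic dataset so that it can be run on its own.

```sas
data electronic;
   input Product_Name $ Salesman_Name $ Price;
   datalines;
LED Kara 500
LCD Harry 400
;
run;

/* List the descriptor portion: dataset attributes and variable attributes */
proc contents data=electronic;
run;
```

The report lists dataset attributes such as the number of observations and variables, followed by each variable’s name, type, and length.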

DATA Step Processing—Execution Phase

All executable statements in the DATA step are executed once for each iteration. If your input file
contains raw data, then SAS reads a record into the input buffer. SAS then reads the values in the input
buffer and assigns the values to the appropriate variables in the program data vector. SAS also calculates
values for variables created by program statements and writes these values to the program data vector.

DATA Step Processing—Execution Phase

When the program reaches the end of the DATA step, three actions occur by default, which make using
the SAS language different from using most other programming languages. They are:

• SAS writes the current observation from the program data vector to the dataset.

• The program loops back to the top of the DATA step.

• Variables in the program data vector are reset to missing values. However, the automatic
variable _N_ is not reset but incremented by one.

SAS then builds the second observation and continues until there are no more records to read. The
dataset is then closed, and SAS goes on to the next DATA or PROC step.
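
The three default actions can be observed in the log with a small trace. The following is only a sketch: the dataset name work.pdv_demo and the PUT messages are made up for illustration, and the two data lines are borrowed from the Electronic example.

```sas
/* Sketch: trace the program data vector (PDV) across DATA step iterations. */
data work.pdv_demo;
   length Product_Name $ 8 Price 8;                  /* define the variables before the first PUT */
   put 'Top of step: ' _N_= Product_Name= Price=;    /* values before INPUT runs */
   input Product_Name $ Price;                       /* input buffer -> PDV */
   put 'After INPUT: ' _N_= Product_Name= Price=;    /* values that will be written out */
   datalines;
LED 500
LCD 400
;
run;                                                 /* implicit output and return to the top */
```

In the log, the "Top of step" lines should show Product_Name and Price as missing at the start of every iteration while _N_ increases by one. A final "Top of step" line is also expected, because SAS starts one more iteration before the INPUT statement finds that no records are left.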

DATA Step Processing—Example

Let’s understand DATA step processing with the same example used earlier in the syntax classroom.

When you submit a DATA step for execution by clicking the “Run” button, SAS automatically compiles
the DATA step and then executes it. In the compilation phase, SAS creates an input buffer for the
Electronic example, because its data is read from raw data lines and not from an existing SAS dataset.

The PDV contains all the variables in the input data: Product_Name, Salesman_Name, and Price. In
addition, two variables, _N_ and _ERROR_, are generated automatically. The “_N_” variable represents the
number of times the DATA step has iterated. The “_ERROR_” variable acts like a binary switch whose
value is 0, if no errors exist in the DATA step, or 1, if one or more errors exist.

Initially in the process, all variable values are set to missing, except the automatic variables _N_ and
_ERROR_, which start at 1 and 0 respectively. SAS represents a missing numeric value by a period and a
missing character value by a blank.

The program starts to execute and does the following:

 SAS reads the first data line into the input buffer.
 The INPUT statement then reads the data values from the input buffer and writes
them to the PDV, where they become variable values.
 When SAS returns to the top of the DATA step for the next iteration, it increments the _N_
automatic variable by 1 and resets the _ERROR_ automatic variable to 0.
 The data is printed because the program ends with a PROC step.

SAS Libraries- Creating a New Library

So far you have learned the two major statements of SAS and their execution processes.

Let’s now learn about SAS libraries.

SAS libraries allow us to store datasets and user-defined formats so that they can be used in our
programs. In general, a SAS library is a folder on our local machine or a shared drive in which we
store SAS datasets and other SAS files for our programs.

You can create your own SAS library.

Let’s step into the syntax classroom to learn the syntax used for SAS libraries. Click “Go” to enter the
syntax classroom.

SAS allows you to create your own library and to access existing libraries.

To create your own library, use the syntax shown on the screen.

LIBNAME libref 'File path here';

The keyword “LIBNAME” creates a library by associating a library name with a storage location.

“libref” represents the name of the library. The library name must be 1 to 8 characters long, must start
with a letter or an underscore, and can contain only letters, digits, and underscores.

After the libref, specify the desired file path, enclosed in quotation marks.

Note that the “LIBNAME” statement is a global statement: it is written outside the DATA step and the
PROC step.

To access a dataset stored in a library, use the syntax shown on the screen.

Libref.dataset_name

Here, libref is the name of the assigned library, and dataset_name is the name of the dataset stored in it.

When you close the SAS session, any librefs that you have assigned in your program are lost, although
the files stored in a permanent library remain on disk. This means that you need to run the LIBNAME
statement again each time you start a new SAS session.
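
The following sketch puts the LIBNAME statement and the two-level name together. The libref mylib and the folder path are hypothetical and should be replaced with a folder that exists on your system; the sample rows are taken from the Electronic example.

```sas
/* Assumption: this folder exists and is writable; the path is hypothetical. */
libname mylib '/folders/myfolders/Lesson2';

/* Create a small dataset in the temporary Work library. */
data work.Electronic;
   input Product_Name $ Salesman_Name $ Price;
   datalines;
LED Kara 500
LCD Harry 400
;
run;

/* libref.dataset_name: store a permanent copy in the new library. */
data mylib.Electronic;
   set work.Electronic;
run;

proc print data=mylib.Electronic;
run;

/* Release the libref; the dataset file stays in the folder. */
libname mylib clear;
```

After the libref is cleared, or in a later session, running the same LIBNAME statement again makes mylib.Electronic available once more.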

Permanent and Temporary SAS Libraries

There are two types of libraries in SAS:

 Permanent Library
 Temporary Library

A permanent SAS library exists on the external storage medium of your computer, and it is not deleted
when the SAS session terminates. Permanent SAS libraries are stored until you delete them.

A temporary SAS library exists only for the current SAS session.

Temporary SAS files are held in a special work space, which is assigned the default libref WORK. Note
that files in the temporary WORK library can be used in any DATA step or SAS procedure during the SAS
session, but they are typically not available for subsequent SAS sessions.

Let’s step into the classroom to understand how to use a temporary library.

Work.Data_set_name;

You can refer to a dataset in the temporary library by prefixing its name with the libref “Work”.

For example, let’s consider the same example used earlier.

Data work.Electronic;
   input Product_Name $ Salesman_Name $ Price;
   Datalines;
LED Kara 500
LCD Harry 400
Mobile Lawrence 300
Iron Mary 125
;
Run;

In this example, the dataset is explicitly created in the temporary Work library. A one-level name such as
Electronic also refers to the Work library by default, so the output is the same for both ways of coding.

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

S.No. | Question | Answer & Explanation
1 | Which of the following keywords is used to create a library in SAS? | d. The keyword LIBNAME is used to create a library in SAS.

Demo-Importing Data

Well, you have learned about the various types of libraries associated with the SAS software. Let’s now
learn about the most important concept—Importing and Exporting data with the help of a
demonstration.

Click “Server Files and Folders” in the navigation pane and browse to the file you want to import. Here, we
will import the Ecommerce data.

Right-click the Ecommerce data and select “Import Data”.

You can find the dataset name and its location as shown on the top.

If you have data in a specific worksheet in your Excel workbook, you can pass the name of your
worksheet in the Worksheet Name box. By default, SAS imports data from the first worksheet.

You can change the storage location of the output by clicking the Change button. By default, the output
dataset is saved to the Work library, which is a temporary location. The contents of this library are
deleted when you exit SAS Studio.

The Results tab shows the attributes of the new SAS dataset.

The Output Data tab shows the contents of the new dataset.
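
The same import can also be written as code with PROC IMPORT. This is a minimal sketch under stated assumptions: the file path, the worksheet name Sheet1, and the output name work.Ecommerce are hypothetical and are not taken from the demonstration.

```sas
/* Assumptions: file path, sheet name, and output dataset name are hypothetical. */
proc import datafile='/folders/myfolders/Ecommerce.xlsx'
            out=work.Ecommerce
            dbms=xlsx
            replace;
   sheet='Sheet1';      /* omit this line to read the first worksheet */
   getnames=yes;        /* the first row holds the variable names */
run;

/* Attributes of the new dataset, comparable to the Results tab. */
proc contents data=work.Ecommerce;
run;

/* First rows of the new dataset, comparable to the Output Data tab. */
proc print data=work.Ecommerce(obs=10);
run;
```

Changing out= to a two-level name in a permanent library stores the imported dataset beyond the current session.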

Demo-Exporting Data

Click the Snippets tab under the “Server Files and Folders” panel.

Click the Snippets drop-down and select “Data”.

Double-click the “Generate CSV file” option in the “Data” drop-down list.

The “Generate CSV file” window opens.

Note that in this example, the dataset car is exported. You can also change the dataset by typing the
required dataset name.

Click the Run icon.

A dialog box appears on the screen. Click “Open”.

The dataset is exported.

This concludes the demonstration on Exporting data.
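
A comparable export can be written directly with PROC EXPORT. The sketch below assumes that the exported dataset is the sample dataset sashelp.cars and that the output path is writable; both are assumptions, not details confirmed by the demonstration.

```sas
/* Assumptions: sashelp.cars as the source dataset and a hypothetical output path. */
proc export data=sashelp.cars
            outfile='/folders/myfolders/cars.csv'
            dbms=csv
            replace;     /* overwrite the file if it already exists */
run;
```

To export another dataset, change the value of data= to its libref.dataset_name.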

Assignment

Let’s practice what you have learned in this lesson.

Import the data of the North region from the Ecommerce dataset. The Ecommerce data is available in the
Downloads section.

Key Takeaways

Let us now quickly recap what we have learned in the lesson:


 The navigation pane helps you to access files from your system, server, or shared folder.
 The work area has three windows, namely CODE, LOG, and RESULTS.
 SAS supports input files such as SAS files, external files, and Database Management System, or
DBMS, files.
 The elements of SAS language are Statements, Expressions, Formats, and Functions.
 The DATA step is used to create SAS datasets, compute values, and select specific input records
for processing.
 The PROC step is a group of SAS statements that call and execute a SAS procedure.
 A temporary SAS library “Work” exists only for the current SAS session.

Conclusion

This concludes “Introduction to SAS.” The next lesson is “Combining and Modifying Datasets.”

ANSWERS:

S.No. | Question | Answer & Explanation
1 | Which of the following statements is used to declare variables in the DATA step? | c. The INPUT statement is used to declare variables in the DATA step.
2 | In which of the following phases is the syntax of a program checked? | a. The syntax of a program is checked in the compilation phase.
3 | The DATA step begins with the keyword _____. | a. The DATA step begins with the keyword DATA.
4 | Which of the following variables is generated automatically by a DATA step? | d. The values of the _N_ and the _ERROR_ variables are automatically generated for every DATA step.

Lesson 03 — Combining & Modifying Datasets

Introduction

Hi and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.

I will take you through this lesson on Combining and Modifying Datasets.

What’s In It for Me

In this lesson, you will learn the different methods used to combine datasets. You will also learn to
modify datasets, and use SAS functions and procedures to manipulate data.

Why Combine or Modify Data

A data analyst often has to combine or modify data to aid analysis. For example, a company that sells
products both online and through teleshopping keeps track of its sales in two databases. To know the
total sales for a period, it has to combine both datasets. SAS offers many methods to combine datasets,
such as concatenating, interleaving, one-to-one reading, and one-to-one merging. The choice of method
depends on the requirements and business scenarios.

Take another example where the company has sales information for the past year and now wants to
analyze it quarterly, after sorting sales from the highest to the lowest in a particular region. This sort of
data modification can be done in SAS using the functions and procedures available in the tool.
Let’s begin this lesson by learning the techniques for combining datasets.

Combining Datasets

We’ll learn four methods of combining datasets: concatenating, interleaving, one-to-one
reading, and one-to-one merging.

Concatenating Datasets

Concatenating datasets in SAS means stacking datasets one “on top” of the other into a single dataset.

The number of observations in the new dataset is the sum of the observations in the original datasets.

If a company maintains employee details department-wise and wants to have all the employee details in
one dataset for payroll processing, then by concatenating the individual department datasets it can have
the information in one dataset.

We can concatenate SAS datasets using two methods:

• SET statement and

• APPEND procedure

Let’s learn both these methods and their differences so that you will be able to choose a method based
on the combining requirements.

If the datasets that you concatenate contain the same variables, and each variable has the same
attributes in all the datasets, then the results of the SET statement and PROC APPEND are the same.

On the other hand, if the datasets contain different variables, the two methods give different results.

Concatenating Datasets - Set Statement

Let’s step into the “Syntax Classroom” to learn the syntax. The SET statement reads observations from
one or more SAS datasets so that they can be processed and modified.

The syntax of the SET statement is:

SET SAS_Data_Sets;

Here, SAS_Data_Sets is the list of two or more datasets to concatenate.

Here are the uses of the SET statement.

 The SET statement can contain multiple datasets.
 The SET statement can read observations and variables from datasets for further data processing.
 The SET statement is used in the concatenating, interleaving, and one-to-one reading methods of
combining data.

SAS SET Statement Demo

Let’s look at a demonstration of the SET statement in the SAS tool.

Look at this example:

An E-Commerce company maintains its data in two datasets “Electronic” and “Fashion” and each has the
following variables: ‘Order_ID’, ‘Products’, ‘Region’, and ‘Sales’. The company wants a consolidated
report of both datasets to understand the combined sales amount for the year. This can be done with
the concatenation method in SAS.

‘Electronic’ and ‘Fashion’ datasets are in “myfolders” of this machine under the “Lesson3” sub-folder.

Let’s import both these datasets using the PROC Import process. You can see the code has been entered
in the program editor for each dataset to import the data from the folder to the SAS application.

Select the program and click the Run icon.

In the “Output Data” tab you can see the ‘Electronic’ and ‘Fashion’ datasets that have been generated.

Now let’s write the code to concatenate the two datasets.

We’ve specified the output dataset name as ‘combinedataset.’

Use the keyword ‘SET’ to combine both the datasets.

Select the program and click the Run icon.

In the ‘Output Data’ tab, you can see the table ‘combinedataset’, which combines the ‘Fashion’ and
‘Electronic’ datasets.
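
The concatenation step of this demonstration can be sketched as follows. To keep the sketch self-contained, the two datasets are created with DATALINES instead of PROC IMPORT, and all data values are made up for illustration.

```sas
/* Hypothetical sample data; the demo imports the real files instead. */
data work.Electronic;
   input Order_ID Products $ Region $ Sales;
   datalines;
1 LED North 500
2 LCD South 400
;
run;

data work.Fashion;
   input Order_ID Products $ Region $ Sales;
   datalines;
3 Shirt East 40
4 Jeans West 60
5 Scarf North 15
;
run;

/* Concatenate: all observations of Electronic, followed by all of Fashion. */
data work.combinedataset;
   set work.Electronic work.Fashion;
run;

proc print data=work.combinedataset;
run;
```

The combined dataset holds five observations, the sum of the two and three observations in the input datasets.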

Concatenating – PROC Append

The APPEND procedure adds the observations from one SAS dataset to the end of another SAS dataset.
PROC APPEND does not process the observations of the first dataset. It adds the data of the second
dataset directly to the end of the original dataset.

The syntax of the APPEND procedure is:

PROC APPEND BASE=base-data-set <DATA=Data-set-to-append> <FORCE>;

base-data-set is the SAS dataset to which you want to append the data. If this dataset does not exist,
then SAS creates it. The base dataset becomes the most recently created dataset.

Data-set-to-append is the SAS dataset that contains the observations to add to the end of the base
dataset. If you omit this option, then PROC APPEND appends the most recently created SAS dataset to
the end of the base dataset.

The FORCE option forces PROC APPEND to concatenate the datasets in some situations in which the
procedure would otherwise fail, for example when the dataset to append contains variables that are
not in the base dataset or that have a different type or a longer length.

Demo – Concatenate Proc Append & FORCE option

Let’s now see a demonstration of the Proc Append procedure.

Let’s try to concatenate the ‘Fashion’ and ‘Electronic’ datasets using the APPEND procedure.

Use the keywords ‘PROC APPEND’ and specify the names of the datasets to be combined.

Select the program and click the Run icon.

In the ‘Log’ tab, you can see an error message that has been generated and the two datasets have not
been combined. The message says that some variable lengths are different.

We will have to use the FORCE option to concatenate these datasets.

You can see the keyword “Force” being included in the program.

Select the program and click the Run icon.

In the output data tab, you can see the combined dataset that includes both Fashion and Electronic
datasets.
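
The situation in this demonstration can be reproduced with a small sketch. The data values, the variable lengths, and the dataset name work.combined are assumptions made for illustration; the sketch appends to a copy so that the original Electronic dataset stays unchanged.

```sas
/* Hypothetical base data: Products has length 8. */
data work.Electronic;
   length Products $ 8;
   input Order_ID Products $ Sales;
   datalines;
1 LED 500
2 LCD 400
;
run;

/* Hypothetical data to append: Products is longer than in the base dataset. */
data work.Fashion;
   length Products $ 12;
   input Order_ID Products $ Sales;
   datalines;
3 Windbreaker 80
4 Jeans 60
;
run;

/* Work on a copy of the base dataset. */
data work.combined;
   set work.Electronic;
run;

/* Without FORCE this step stops because of the length difference;
   with FORCE the observations are appended and long values are truncated. */
proc append base=work.combined data=work.Fashion force;
run;
```

Because the base dataset defines the variable attributes, Products values longer than eight characters are cut to eight characters in work.combined.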

SET and Append–A Comparison

Having learned the two methods of concatenating, namely SET statement and Append procedure, let’s
now look at a comparison of these methods.

The SET statement can combine any number of datasets, while the APPEND procedure combines only two
datasets at a time.

The SET statement keeps all the variables from all the datasets and assigns missing values where a
dataset lacks a variable. The APPEND procedure keeps only the variables of the base dataset: it needs
the FORCE option when the appended dataset contains variables that are not in the base dataset, and it
drops those extra variables.

The SET statement uses explicitly defined formats, informats, and labels, while the APPEND procedure
uses those defined in the base dataset.

If a variable has different lengths in the datasets, the SET statement uses the length from the dataset
named first, while the APPEND procedure, with the FORCE option, truncates the values to match the
length in the base dataset.

The SET statement will not concatenate datasets in which a variable has different types, while the
APPEND procedure with the FORCE option concatenates them and sets the mismatched values to missing.

Interleaving Method

The interleaving method combines individual sorted datasets into one big sorted dataset.
Before combining the datasets, you have to ensure that they are sorted by the same variable
or variables. The SET statement along with the BY statement is used in this method.

For example, when dataset Electronic and dataset Fashion are interleaved by variable “Sales”, we get
dataset “OutputSales”. Let’s see how to write this code.

Data OutputSales;
   set Electronic Fashion;
   by Sales;
Run;

Note that the data should be sorted. Here, the data is sorted by the Sales field.

Interleaving - Demo

Let’s write the program in the Program Editor to interleave the ‘Electronic’ and ‘Fashion’ datasets.

You can see the two datasets displayed here. To combine them, we will write the program:

Data OutputSales;
   set Electronic Fashion;
   by Sales;
Run;

Select the program and click the Run button.

You can see the dataset combined through the interleaving method here, and it is sorted by the ‘Sales’
field.
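
Interleaving works only if both inputs are already sorted by the BY variable. The sketch below adds the sorting step with PROC SORT; the data values are made up for illustration.

```sas
/* Hypothetical sample data, deliberately not sorted by Sales. */
data work.Electronic;
   input Order_ID Products $ Sales;
   datalines;
1 LED 500
2 Iron 125
;
run;

data work.Fashion;
   input Order_ID Products $ Sales;
   datalines;
3 Jeans 60
4 Jacket 300
;
run;

/* Sort each dataset by the variable used for interleaving. */
proc sort data=work.Electronic;
   by Sales;
run;

proc sort data=work.Fashion;
   by Sales;
run;

/* Interleave: SET with BY keeps the combined dataset in Sales order. */
data work.OutputSales;
   set work.Electronic work.Fashion;
   by Sales;
run;

proc print data=work.OutputSales;
run;
```

If either input is not sorted by Sales, the DATA step reports an error stating that the BY variables are not properly sorted.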

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

S.No. | Question | Answer & Explanation
1 | For which combining method should the datasets be sorted? | c. In the interleaving method of combining, the data should be sorted by the same variable.

One-to-One Reading

One-to-one reading combines two or more SAS datasets, one "to the right" of the other, into a single
"fat" dataset. SAS pairs the observations by position: the first observation of one dataset is combined
with the first observation of the other, the second with the second, and so on. The method therefore
suits datasets in which each value of the identifying variable occurs no more than once and the
observations are in the same order.

For example:

A company maintains two datasets. The first has the variables “Order_ID”, “Sales_Amount”, and
“Product”. The second has the variables “Order_ID”, “Customer_Name”, and “Location”. Suppose
the company wants to retrieve, for each Order ID, all related information such as Sales, Product,
Customer Name, and Location in order to analyze it further for sales forecasts. The one-to-one reading
method of combining datasets can be used for this.

The syntax for one-to-one reading is:

Data onetooneread;
   Set <Dataset>;
   Set <Dataset1>;
Run;

Set is a keyword, and Dataset and Dataset1 are the names of the datasets to be combined. Each SET
statement reads one observation per iteration, so the first observation of one dataset is matched with
the first observation of the other, and so on. The DATA step stops at the end of the smaller dataset.
Let’s see a demonstration of the one-to-one read method.

One-to-One Reading - Demo

Let’s write the program in the Program Editor to combine Sales and Customer_Info datasets using the
one-to-one read method.

You can see the data inputted here. Let’s write the code to generate these two datasets.

Select the data and program and click the Run icon.

You can see the ‘Sales’ dataset and the ‘Customer_Info’ dataset here.

Let’s now write the one-to-one read code to combine these datasets.

Data onetooneread;
   set Sales;
   set Customer_Info;
Run;

Now select the data and click the Run icon.

This will combine the first observation of Sales with the first observation of Customer_Info and then the
second observation of Sales with the second observation of Customer_Info and so on to create the
one-to-one-read dataset. The DATA step stops after it reads the last observation from the smallest
dataset. For example, if you check the combined dataset, you will see that Order ID “6” from the Sales
dataset has been left out.
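
The stopping rule can be checked with a small self-contained sketch. The data values below are made up for illustration and are smaller than the datasets used in the demonstration.

```sas
/* Hypothetical sample data: three sales records ... */
data work.Sales;
   input Order_ID Sales_Amount Product $;
   datalines;
1 500 LED
2 400 LCD
3 125 Iron
;
run;

/* ... but only two customer records. */
data work.Customer_Info;
   input Order_ID Customer_Name $ Location $;
   datalines;
1 Kara Boston
2 Harry Denver
;
run;

/* One-to-one reading: one observation from each dataset per iteration. */
data work.onetooneread;
   set work.Sales;
   set work.Customer_Info;
run;

proc print data=work.onetooneread;
run;
```

The result has two observations, because the DATA step ends as soon as the shorter dataset, Customer_Info, runs out. Order_ID exists in both datasets, so the value read last, from Customer_Info, is the one that is kept.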

One-to-One Merging

One-to-one merging, like one-to-one reading, combines two or more SAS datasets, one "to the
right" of the other, into a single "fat" dataset. Use one-to-one merging when you want to combine one
observation from each dataset by position, without matching observations on the values of a variable.
For the pairing to be meaningful, the datasets should be in the same order, for example sorted by the
same variable.

For example:

Suppose the dataset Sales contains three variables: Order_ID, Sales_Amount, and Product; the dataset
Customer_Info contains three variables: Order_ID, Customer_Name, and Location; and the two datasets
are sorted by ‘Order_ID’.

The one-to-one merge syntax to combine these datasets would be:

Data onetooneread;
   Merge Sales Customer_Info;
Run;

Let’s see a demonstration of the one to one merge.

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

S.No. | Question | Answer & Explanation
1 | In which combining method in SAS does the DATA step stop reading data once it reads the last observation from the smallest dataset? | b. In the one-to-one reading method, the DATA step stops reading data once the last observation from the smallest dataset is read.

One-to-One Merge - Demo

Let’s write the program in the Program Editor to combine the Sales and Customer_Info datasets using the
one-to-one merge method.

Let’s first generate the two datasets.

You can see the inputted data of the two datasets. Select the program and click the Run icon.

You can see the first dataset Sales here and the second dataset Customer_Info here.

Let’s now write the program for the one-to-one merge.

Select the program and click the Run icon.

This will combine the first observation of Sales with the first observation of Customer_Info and the second
observation of Sales with the second observation of Customer_Info and so on to create the merged
dataset. When SAS performs a one-to-one merge, the DATA step continues to read observations
until the last observation is read from the largest dataset.
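
The difference from one-to-one reading shows up when the datasets have different numbers of observations. The following sketch uses made-up data values and a hypothetical output name, work.onetoonemerge.

```sas
/* Hypothetical sample data: three sales records ... */
data work.Sales;
   input Order_ID Sales_Amount Product $;
   datalines;
1 500 LED
2 400 LCD
3 125 Iron
;
run;

/* ... but only two customer records. */
data work.Customer_Info;
   input Order_ID Customer_Name $ Location $;
   datalines;
1 Kara Boston
2 Harry Denver
;
run;

/* One-to-one merge: MERGE without a BY statement pairs observations by position. */
data work.onetoonemerge;
   merge work.Sales work.Customer_Info;
run;

proc print data=work.onetoonemerge;
run;
```

The result has three observations, one for each observation of the larger dataset. In the third observation, Customer_Name and Location are missing because Customer_Info has no third record.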

Data Manipulation

We saw a few data combining techniques so far. Let’s now look at some data manipulation techniques.
But before we begin, what is data manipulation?

Data manipulation is the process of changing or rearranging data for further analysis. Data becomes
easier to read as it is organized in a systematic manner to facilitate study and analysis.

A popular use of data manipulation is allowing website owners to know their most popular pages and
traffic sources. Data manipulation helps in sorting and analyzing raw data to understand required
information.

Some of the data manipulation techniques are listed here.

• Delete and group observations in a dataset

• Delete variables from a dataset

• Create and modify variables

• Change variable attributes

Let’s learn each technique in detail.

Delete and group observations

The IF-THEN-ELSE statement is mainly used to group observations. It executes a SAS statement for
observations that meet a specific condition.

Syntax of IF-THEN-ELSE:

IF expression THEN statement;
<ELSE statement;>

expression is any SAS expression and is a required argument.

statement can be any executable SAS statement or DO group.
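
As an illustration of grouping, the sketch below assigns each observation to a sales group. The variable Sales_Group, the threshold of 150, the dataset name work.Sales_Groups, and the data values are assumptions made for this example.

```sas
/* Hypothetical sample data. */
data work.Electronic;
   input Order_ID Products $ Sales;
   datalines;
1 LED 500
2 LCD 400
3 Iron 125
;
run;

/* Group observations with IF-THEN-ELSE. */
data work.Sales_Groups;
   set work.Electronic;
   length Sales_Group $ 4;                        /* room for 'High' and 'Low' */
   if Sales > 150 then Sales_Group = 'High';
   else Sales_Group = 'Low';
run;

proc print data=work.Sales_Groups;
run;
```

Declaring the length of Sales_Group first is a deliberate choice: otherwise the length is taken from the first value assigned in the code, which can truncate longer values assigned later.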

Let’s see a demonstration of delete observations.

Delete Observations - Demo

Suppose you want to delete observations based on a certain condition. IF and DELETE are the two
keywords to use in the program.

For example, if you want to delete the observations with sales greater than $150 from the ‘Electronic’
dataset, write the program:

Data Dataset_Deleteobservations;
   Set Electronic;
   If Sales > 150 then delete;
Run;

Select the program and click the Run icon.

In the ‘Output Data’ tab, you can see the dataset ‘Dataset_Deleteobservations’ with sales figures of $150
and less here.

Delete and Keep variables – Demo

Sometimes, you might want to delete one or more variables from a dataset. To do this, you have to use
the DROP keyword.

Let’s understand the “delete variable” procedure through this example.

In the ‘Electronic’ dataset, if you want to delete the variables ‘Shipping_Cost’ and ‘Order_Priority’,
you have to write the program using the keyword DROP.

First, specify the output dataset name, then use the keyword DROP, and specify the variable names that
you want to delete. “Set Electronic” indicates the dataset to be used.

Now let’s write the program to keep some variables.

Specify the output dataset and use the keyword KEEP followed by the variables that you want to retain.

Mention the dataset to be used and then click the Run icon.

In the Output Tab, you can see the original dataset ‘Electronic’ with all the variables.

Here is ‘Dataset_Output1’ without the variables ‘Shipping_Cost’ and ‘Order_Priority’.

Here is ‘Dataset_Output2’ with only ‘Order_ID’ ‘Product’ and ‘Sales’ fields.
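
The two programs described in this demonstration might look like the following sketch. The data values are made up; the variable and dataset names follow the ones mentioned above.

```sas
/* Hypothetical sample data with the five variables named in the demo. */
data work.Electronic;
   input Order_ID Product $ Sales Shipping_Cost Order_Priority $;
   datalines;
1 LED 500 20 High
2 LCD 400 15 Low
;
run;

/* DROP: exclude the listed variables from the output dataset. */
data work.Dataset_Output1;
   set work.Electronic;
   drop Shipping_Cost Order_Priority;
run;

/* KEEP: retain only the listed variables in the output dataset. */
data work.Dataset_Output2;
   set work.Electronic;
   keep Order_ID Product Sales;
run;
```

With this sample data both outputs contain the same three variables; DROP is usually shorter when few variables are removed, and KEEP when few are retained.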

Modifying Variable Attributes

Variables in SAS have a number of attributes such as Name, Type, Length, Format, Label, and so on.
If you want to modify the attributes of a variable, for example, change the name to a new one, or cut
down the length of a variable, you can use the statement provided for each action in SAS.

Here is the syntax for changing the label:

LABEL variable-1=label-1...<variable-n=label-n>;

“variable-1 …” specifies the variable we want to label, and “label-1” is the label text, enclosed in
quotation marks.

The syntax of the FORMAT statement is:

FORMAT variable-1 <. . . variable-n> <format> <DEFAULT=default-format>;

Here, variable-1 is the variable we want to format.

format is the format that is applied to the listed variables.

DEFAULT=default-format specifies a temporary default format for displaying the values of variables that
are not listed in the FORMAT statement. The default format is not permanently associated with variables
in the output dataset.

Here is the syntax for renaming variables:

RENAME old-name=new-name;

Let’s see a demonstration of modifying variable attributes.

Modifying Variable Attributes - Demo

In this demonstration, you will see how to modify variable attributes using ‘Rename’, ‘Label’ and
‘Format’ keywords.

We will use the same ‘Electronic’ dataset.

The program to import and generate the dataset is already written.

Now, let’s rename the variable “Product” to “Product_Names”.

“Set” Electronic specifies the dataset to be used and the keyword ‘Rename’ indicates that the variable
‘Product’ has to be renamed to ‘Product_Names’.

Let’s change the labels of ‘Order_ID’ to ‘ID’ and of ‘Sales’ to ‘Sales_Amount.’

The difference between ‘Rename’ and ‘Label’ is that Rename permanently changes the variable name,
whereas Label retains the variable name and attaches a descriptive label that can be displayed in its place.

The keyword “Label” is followed by the variable name, an equal sign, and the label text in quotation
marks.

We are labelling ‘Order_ID’ as ‘ID’ and ‘Sales’ as ‘Sales_Amount’.

We are also going to format Sales so that the amount is displayed with a dollar sign and two decimal
places.

Use the keyword ‘Format’, mention the variable name, and type ‘dollar10.2’, which can be found in the
SAS documentation of formats. This format displays the amount with a dollar sign, commas as thousands
separators, and two decimal places.

Click the Run icon.

Click the ‘Output Data’ tab to see the results.

You can see the output dataset under the ‘column names’ option displaying the new variable names and
formats.

‘Product’ has been renamed as ‘Product_Names’. The amount in the ‘Sales’ column is displayed with a
dollar sign and two decimal places.

In the ‘column labels’ option, you can see that ‘Order_ID’ has been labelled as ‘ID’.
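
The steps of this demonstration can be combined in one DATA step, as in the following sketch. The data values and the output name work.Electronic_Modified are assumptions made for illustration.

```sas
/* Hypothetical sample data. */
data work.Electronic;
   input Order_ID Product $ Sales;
   datalines;
1 LED 1500
2 LCD 400
;
run;

data work.Electronic_Modified;
   set work.Electronic;
   rename Product = Product_Names;     /* changes the variable name itself */
   label  Order_ID = 'ID'              /* labels are display text only */
          Sales    = 'Sales_Amount';
   format Sales dollar10.2;            /* for example, 1500 is shown as $1,500.00 */
run;

/* The LABEL option makes PROC PRINT show labels instead of variable names. */
proc print data=work.Electronic_Modified label;
run;

/* Lists the names, labels, and formats stored in the dataset. */
proc contents data=work.Electronic_Modified;
run;
```

The LABEL and FORMAT statements still refer to the original variable names here, because RENAME takes effect only when the output dataset is written.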

Key Takeaways

Let’s quickly recap what we’ve learned in the lesson:

 Combining and modifying datasets create data that serves the purpose of data analysis better.

 Four methods of combining data are concatenating, interleaving, one-to-one reading, and one-
to-one-merging.

 Data manipulation techniques allow you to modify variable or observation attributes, exclude or
include data based on a criterion, or rename variables and attributes for further analysis.

Conclusion

This concludes the lesson “Combining and Modifying Datasets”. The next lesson will discuss “PROC SQL”.

ANSWERS:

S.No. Question Answer & Explanation

1 Question: Which of the following statements will you use to format the Salary variable as comma-separated with a dollar symbol?
Answer: d. FORMAT salary dollar10.2;

2 Question: What is the syntax for renaming variables?
Answer: a. Rename Old_Name=New_Name;

3 Question: Which two methods allow us to concatenate datasets in SAS?
Answer: a. We can concatenate SAS datasets by using two methods:
• SET statement and
• APPEND procedure

4 Question: Which of the following programs concatenates the data sets HR and Marketing?
Answer: b. The program to concatenate the data sets HR and Marketing:
Data output_Data;
Set HR Marketing;
Run;

5 Question: Which of the following methods continues to read observations until the last observation is read from the largest dataset to merge datasets?
Answer: a. Interleaving continues to read observations until the last observation is read from the largest dataset to merge datasets.

6 Question: Which keyword can we use to exclude variables from the output dataset?
Answer: b. The Drop statement excludes variables from the output dataset.

7 Question: Which method is used to stack datasets?
Answer: a. The SET statement stacks (concatenates) datasets.

Lesson 04 — PROC SQL

Introduction

Hi, and welcome back to the “Data Science with Statistical Analysis System or SAS” course offered by
Simplilearn.

In this lesson, “PROC SQL,” you will be introduced to the essential concepts of PROC SQL.

What’s In It for Me

In this lesson, you will understand what PROC SQL is and learn its syntax and clauses. You will be able to retrieve data from a single table and from multiple tables using joins. You will also learn how to create tables and concatenate query results.

What is PROC SQL

Let’s start this lesson by defining PROC SQL.

Structured Query Language, or SQL, is a generic database language that helps to communicate with
databases.

PROC SQL is the Base SAS implementation of SQL. It allows you to retrieve, summarize, sort, join, and concatenate datasets or databases available in SAS.

PROC SQL is used for the following:

Generate reports and summary statistics.

Retrieve and combine data from tables.

Create tables, views, and indexes.

Update and retrieve data from DBMS.

Modify a PROC SQL table by adding, modifying, or dropping columns.

PROC SQL allows you to combine the functionality of the DATA step and the PROC step into a single step.

Before we begin with the concepts of PROC SQL, let’s understand some terminology associated with it.

Terminologies of SQL

The following table lists the equivalent terms that are used in SQL, SAS, and data processing.

A table in PROC SQL is called a SAS data file in SAS and a file in data processing.

A row in SQL is called an observation in SAS and a record in data processing.

A column in SQL is called a variable in SAS and a field in data processing.

Well, let’s now learn the syntax of PROC SQL and its uses.

PROC SQL- Syntax

Let’s step into the “Syntax Classroom” to learn the syntax of PROC SQL.

PROC SQL <options>;

--------------

QUIT;

A PROC SQL step begins with the keyword “proc sql” and ends with the keyword “quit.” The statements in between are executed as they are submitted, and the keyword “quit” is used to terminate the procedure.

There are various clauses present in PROC SQL:

Select statement or SELECT and FROM clauses,

WHERE clause,

GROUP BY clause,

HAVING clause, and

ORDER BY clause.

Every PROC SQL query must have a SELECT clause and a FROM clause. A select statement displays the query's results without a separate PROC PRINT step.

However, the other clauses such as where, group by, having, and order by are optional and can be
applied according to the requirement.

Let’s understand the syntax of each clause and learn how to retrieve data from a single table using these
clauses in PROC SQL.

Retrieving Data from a Table

Select Statement

select column_name

from sql.data_set_name;

The select statement contains two clauses, namely “select clause” and “from clause.” The “select clause” specifies the columns to retrieve.

The “from clause” is used to select the dataset or table from which the data needs to be extracted.
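A minimal sketch of the select statement follows. The Electronic dataset and its column names come from later slides in this lesson, but the sample rows are invented for illustration, and the table is read from the WORK library rather than from a libref named sql.

```sas
/* Hypothetical sample rows; only the column names follow the lesson */
data Electronic;
    input Order_ID Product $ Sales Order_Priority $;
    datalines;
1 Watch 250 High
2 Iron 260 Low
3 LED 480 High
4 LCD 90 Medium
;
run;

proc sql;
    /* SELECT lists the columns; FROM names the table to read */
    select Product, Sales
        from Electronic;
quit;
```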

Where Clause

select column_name

from sql.data_set_name

where <condition>;

The “where clause” is used to extract the data that fulfills the specific condition. Note that the keyword
used for this clause is “Where.”

Order by Clause:

select column_name

from sql.data_set_name

where <condition>

order by column_name <option>;

The “order by” clause sorts the query output by one or more columns. Character columns are sorted alphabetically and numeric columns numerically. Note that the column name is mentioned after the keyword “order by.” The option, ASC for ascending (the default) or DESC for descending, follows the column name.
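The “where” and “order by” clauses can be combined as in the sketch below. The sample rows are assumptions made up for this illustration.

```sas
/* Hypothetical sample rows */
data Electronic;
    input Product $ Sales;
    datalines;
Watch 250
Iron 260
LED 480
LCD 90
;
run;

proc sql;
    select Product, Sales
        from Electronic
        where Sales > 200        /* keep only rows that meet the condition */
        order by Sales desc;     /* DESC sorts from highest to lowest */
quit;
```

With these rows, the LCD row is filtered out and the remaining rows are listed as LED, Iron, Watch.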

Group by Clause:

The keyword “group by” breaks the resultant data into subsets of rows, one subset for each distinct value of the grouping column. You should use an aggregate function either in the “select” clause or in a “having” clause to summarize each group. Some of the aggregate functions are avg (also written as mean), count, sum, and max.

select column_name, aggregate_function(column_name)

from sql.data_set_name

group by column_name

order by column_name <option>;
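As a rough illustration of the template above, the following sketch totals sales per product. The sample rows and the aliases Total_Sales and Orders are assumptions.

```sas
/* Hypothetical sample rows */
data Electronic;
    input Product $ Sales;
    datalines;
Watch 250
Watch 300
Iron 260
LED 480
;
run;

proc sql;
    select Product,
           sum(Sales) as Total_Sales,   /* one total per product */
           count(*)   as Orders         /* number of rows in each group */
        from Electronic
        group by Product
        order by Product;
quit;
```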

Having Clause:

The keyword “having” is used to set a condition on the groups.

select column_name, aggregate_function(column_name)

from sql.data_set_name

group by column_name

having <group condition>

order by column_name <option>;

The “having clause” is used after the “group by” clause. Unlike the “where clause,” which filters individual rows before grouping, the “having clause” filters whole groups and can therefore use aggregate functions.

Demo- Retrieve data from a table

In this demo, you will learn how to retrieve data from a table using the PROC SQL clauses.

We will retrieve data for all the products from the Electronic dataset that have a sum of sales greater than 450, in descending order.

The dataset “Electronic” is imported to the SAS console using the code shown on the screen.

To retrieve data using PROC SQL clauses, use the keyword proc sql. PROC SQL executes each statement without using the RUN statement.

The columns product, sales, and order priority are selected from the table “Electronic” using the keyword “Select.”

In this demo, the rows that have sales greater than 200 are selected using the Where clause.

The Group By clause is used to group data by a specified column. Here, we will group by the product column. With the GROUP BY clause, we can also use an aggregate function in the SELECT clause or in a HAVING clause.

In this demo, only the product groups that have a sum of sales greater than 450 are kept, using the Having clause. Note that the aggregate function SUM is used here.

The data is sorted in descending order using the Order By clause.

The keyword “quit” is used to terminate the procedure.

This concludes the demo on how to retrieve data from a table using the PROC SQL clauses.
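The demo code itself appears only on screen, so the following is a sketch of a query along the lines described, not the demo's actual program. The sample rows are invented, and for simplicity the query reports one total per product (the assumed alias Total_Sales) instead of selecting the sales and order priority columns directly.

```sas
/* Hypothetical sample rows */
data Electronic;
    input Product $ Sales Order_Priority $;
    datalines;
Watch 250 High
Watch 300 Medium
Iron 260 Low
LED 480 High
LCD 90 Medium
;
run;

proc sql;
    select Product, sum(Sales) as Total_Sales
        from Electronic
        where Sales > 200             /* row filter, applied before grouping */
        group by Product
        having sum(Sales) > 450       /* group filter, applied after grouping */
        order by Total_Sales desc;
quit;
```

With these sample rows, only Watch (550) and LED (480) pass the HAVING condition; Iron totals 260 and is dropped.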

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

S.No. Question Answer & Explanation

1 Question: Which of the following PROC SQL clauses uses aggregate functions?
Answer: a. The Having clause uses aggregate functions.

Selecting columns in a Table

At times, you will need to select all columns or a specific column in a table.

To select all columns in a table, use an asterisk symbol in the “select” clause.

To select a specific column in a table, use the column name in the “select” clause.

Using PROC SQL, you can also eliminate the duplicate rows from the output data. To do so, use the
keyword “distinct” in the select clause.
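The three selection styles described above might look like this. The sample rows are assumptions for illustration.

```sas
/* Hypothetical sample rows; Watch appears twice on purpose */
data Electronic;
    input Order_ID Product $ Sales;
    datalines;
1 Watch 250
2 Watch 300
3 Iron 260
;
run;

proc sql;
    select * from Electronic;                 /* all columns */
    select Product from Electronic;           /* one column, duplicates kept */
    select distinct Product from Electronic;  /* duplicate rows removed */
quit;
```

The last query returns two rows, Iron and Watch, because DISTINCT keeps one row per unique value of the selected columns.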

Creating New Variable

PROC SQL allows you to create a new variable in the query result. These columns can contain either text or calculations. You can add a text column to the query result by using a string or literal expression.

Take a look at this example program and its output dataset shown on the screen.

Proc SQL;

Select Product,Sales, (Sales*0.5) as Bonus from Electronic

Where Sales > 200

Order By Sales Desc;

Quit;

Here, a new column “Bonus” is created, whose values are derived from the sales column (50% of Sales). The generated output is shown on the screen.

Formatting the Variable in SAS

You can also change the display format of a variable and assign a label to the column.

Proc SQL;

Select Sales format = Dollar10.2 Label='Net Sales' from electronic;

Quit;

In this example, the format attribute is used to modify the display format of the sales variable, and the label attribute supplies the column heading shown in the output. The generated output is shown on the screen.

Case Expression

PROC SQL also allows you to process data conditionally. A case expression is a valid SQL expression that resolves to a column value: each row is compared against the when-conditions in order, and the result of the first true condition is returned. Using a “Case” expression in the select clause, you can derive a value based on the condition that each row fulfils.

The generated output dataset is shown on the screen.

Proc SQL;

Select Product, Discount,

Case

When Sales between 0 and 100 then 'Medium'

When Sales between 101 and 200 then 'High'

When Sales between 201 and 250 then 'Critical'

Else 'Very Critical'

End As Order_Priority1

from Electronic;

Quit;

In this example, the product and discount columns are selected from the electronic dataset, and the conditions are set on the sales column. The END keyword is required to close the case expression. Also, list the when-conditions in descending order of likelihood to increase efficiency, because SAS stops checking the case expression as soon as it finds the first true condition.

The output dataset is shown here. Note that the column “Order priority 1” is generated.

Referencing a CALCULATED Column

CALCULATED enables you to use the results of an expression in the same SELECT clause or in the WHERE
clause.

Let’s take the previous example and derive Net Profit.

To derive the Net Profit, create a Tax column, which is 5% of the sales amount, and subtract Tax from
the profit.

You must use the CALCULATED keyword with the alias to inform PROC SQL that the value is calculated
within the query.

Otherwise, the SQL code will fail with a message similar to “column Tax was not found.”

Proc Sql;

Select Product, Discount, Sales*0.05 as Tax,

(Profit-CALCULATED Tax) as Net_Profit Format=Dollar10.2,

Case

When Sales between 0 and 100 then 'Medium'

When Sales between 101 and 200 then 'High'

When Sales between 201 and 250 then 'Critical'

Else 'Very Critical'

End as Order_Priority1

From Electronic;

Quit;

The generated output is shown on the screen.

Create Totals— Example

Using SAS, you can also obtain the totals by Order_Priority1. Look at the example shown on the screen.

The SUM function, given a single column as its argument, totals the values of that column across the rows in each group.

The COUNT(*) returns the total number of rows in a group or in a table.

Proc Sql;

Select

Case

When Sales between 0 and 100 then 'Medium'

When Sales between 101 and 200 then 'High'

When Sales between 201 and 250 then 'Critical'

Else 'Very Critical'

End as Order_Priority1,

sum(Discount) as Total_Discount Format=Dollar10.2,

sum(Sales) as Total_Sales Format=Dollar10.2,

count(*) as Number_Sales

From Electronic

group by Order_Priority1;

Quit;

The generated output is shown on the screen.

SQL Pass-Through Facility

The SQL Procedure Pass-Through Facility communicates with the DBMS through the SAS/ACCESS engine.

The Pass-Through Facility allows you to do the following:

• Pass native DBMS SQL statements to a DBMS

• Display the query results formatted as a report

• Create SAS data files and views from query results

Since the database is typically optimized and indexed to handle queries, complex joins are often handled much faster with a SQL pass-through query.

Take a look at the example program shown on the screen.

Use the CONNECT statement to establish the connection to the DBMS.
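The on-screen program is not reproduced in this transcript, so the following is only a sketch of what a pass-through query can look like. It assumes a licensed SAS/ACCESS Interface to ODBC; the connection alias salesdb, the data source name, the credentials, and the DBMS table and column names are all placeholders.

```sas
proc sql;
    /* Engine, DSN, and credentials are placeholders for illustration */
    connect to odbc as salesdb (dsn="sales_dsn" user=myuser password="XXXXXXXX");

    /* The inner query is passed to the DBMS and must use its SQL dialect */
    create table work.product_totals as
        select *
            from connection to salesdb
                (select product, sum(sales) as total_sales
                     from electronic
                     group by product);

    /* Close the connection when finished */
    disconnect from salesdb;
quit;
```

Only the summarized rows travel back to SAS, which is where the speed advantage for large tables typically comes from.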

Creating a New Table

Using the “Create Table” statement, you can create a new table to define the columns and their
attributes. You can also specify a column's name, type, length, format, and label.

Let’s understand this with an example.

Proc SQL;

Create Table Electronic_Example as

Select Product,Sales, Order_Priority from Electronic

where Order_Priority = 'High';

Select * from Electronic;

Quit;

In this example, the “electronic_example” dataset is created. This dataset contains the rows from the electronic dataset whose order priority is High.

The second select statement displays the complete electronic dataset. Note that only one table is created using the “Create Table” statement.

The generated outputs are shown on the screen.
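The example above creates a table from a query. The other use mentioned for “Create Table,” defining columns and their attributes directly, might look like the sketch below. The table name Product_Master and its columns are hypothetical.

```sas
proc sql;
    /* Define an empty table: name, type, length, format, and label per column */
    create table Product_Master
        (Product    char(20) label='Product Name',
         Sales      num      format=dollar10.2 label='Net Sales',
         Order_Date num      format=date9.     label='Order Date');

    /* Writes the table definition to the SAS log */
    describe table Product_Master;
quit;
```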

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

S.No. Question Answer & Explanation

1 Question: While using the CASE expression, values are compared to all the _____.
Answer: c. While using the CASE expression, values are compared to all the When conditions.

Retrieving Data from Multiple Tables

So far you have learned how to retrieve data from a single table. Let’s now learn how to retrieve data
from multiple tables.

If you want to combine multiple tables through DATA step code, it typically requires several PROC SORT steps followed by a DATA step with a MERGE statement. However, using PROC SQL, multiple datasets are combined easily.

To select data from multiple tables, simply join the tables in a query.

Let’s step into the “Syntax classroom” to learn the syntax for selecting two tables using PROC SQL.

proc sql;

select *

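from table1, table2;

Quit;

Use the keyword “select” to choose the columns and the keyword “from” to list the tables. The asterisk symbol selects all the columns from tables 1 and 2. To select a particular column from a table, simply mention the column name after the keyword select. Note that listing two tables without a join condition pairs every row of table 1 with every row of table 2, so a “where” clause is normally added to keep only the matching rows.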

Selecting Data from Multiple Tables

Let’s now learn how to select data from multiple tables.

The data that you may need for research can come from different sources. To combine them, simply join the tables in a query.

There are two types of joins: Inner Join and Outer Join

• The Inner Join selects all rows from both tables as long as there is a match between the columns in both tables.

• The Outer Join returns the matching rows along with some or all of the unmatched rows from one or both tables.

Let’s learn about each type of join in detail.

The INNER JOIN selects all rows from both tables as long as there is a match between the columns in
both tables.

It can combine a maximum of 256 tables at a time.

Only rows that satisfy the join conditions are kept.

You can perform an inner join by using a list of table-names separated by commas with the WHERE
clause or by using the INNER JOIN and ON keywords.

Let’s take an example of how we can join the Electronic and Electronic_CustInfo datasets to attach the customer name and customer ID to each order.

You can select all columns from both tables with * and utilize the FEEDBACK option, which lets you see exactly how PROC SQL expands and implements your query.

In the log, you can see all column names with the e and c table aliases. The output obtained is shown on the screen.
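The on-screen code is not part of this transcript; a sketch of such a query could look like the following. The sample rows are invented, and joining on Order_ID is an assumption based on the assignment later in this lesson.

```sas
/* Hypothetical sample rows for both tables */
data Electronic;
    input Order_ID Product $ Sales;
    datalines;
1 Watch 250
2 Iron 260
3 LED 480
;
run;

data Electronic_CustInfo;
    input Order_ID Customer_ID Customer_Name $;
    datalines;
1 501 Asha
2 502 Ben
4 504 Carla
;
run;

/* FEEDBACK writes the expanded query, with e. and c. prefixes, to the log */
proc sql feedback;
    select *
        from Electronic as e, Electronic_CustInfo as c
        where e.Order_ID = c.Order_ID;
quit;
```

Only orders 1 and 2 appear in the result, because order 3 has no customer row and customer row 4 has no order.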

You can also customize your query by selecting only required columns in the order you prefer. Observe
the changes made in the code to select preferred columns.

You can obtain the same results by performing an inner join either with a WHERE clause or with the INNER JOIN and ON keywords.

Both approaches are used interchangeably in practice.
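The two equivalent forms can be sketched side by side. The sample rows and the Order_ID join key are assumptions for illustration.

```sas
/* Hypothetical sample rows for both tables */
data Electronic;
    input Order_ID Product $ Sales;
    datalines;
1 Watch 250
2 Iron 260
3 LED 480
;
run;

data Electronic_CustInfo;
    input Order_ID Customer_ID Customer_Name $;
    datalines;
1 501 Asha
2 502 Ben
4 504 Carla
;
run;

proc sql;
    /* Form 1: comma-separated tables with a WHERE clause */
    select e.Order_ID, c.Customer_Name, e.Product, e.Sales
        from Electronic as e, Electronic_CustInfo as c
        where e.Order_ID = c.Order_ID;

    /* Form 2: INNER JOIN with ON; returns the same rows */
    select e.Order_ID, c.Customer_Name, e.Product, e.Sales
        from Electronic as e
             inner join Electronic_CustInfo as c
             on e.Order_ID = c.Order_ID;
quit;
```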

In contrast with an inner join, an outer join keeps rows that match the condition as well as some or all of
the unmatched data from one or both tables.

There are three types of outer joins: left, right, and full.

The LEFT JOIN returns all rows from the left table (table1), with the matching rows in the right table (table2). The electronic dataset and electronic customer information dataset are taken as an example.

The output for the example program is shown on the screen.

The RIGHT JOIN returns all rows from the right table (table2), with the matching rows in the left table
(table1). The electronic dataset and electronic customer information dataset are taken as an example.

The output for the example program is shown on the screen.

The FULL OUTER JOIN returns all rows from the left table (table1) and from the right table (table2). The
electronic dataset and electronic customer information dataset are taken as an example.

The output for the example program is shown on the screen.
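Since the example programs for the three outer joins are shown only on screen, here is a sketch of what they could look like. The sample rows and the Order_ID join key are assumptions.

```sas
/* Hypothetical sample rows: order 3 has no customer, customer row 4 has no order */
data Electronic;
    input Order_ID Product $ Sales;
    datalines;
1 Watch 250
2 Iron 260
3 LED 480
;
run;

data Electronic_CustInfo;
    input Order_ID Customer_ID Customer_Name $;
    datalines;
1 501 Asha
2 502 Ben
4 504 Carla
;
run;

proc sql;
    /* LEFT JOIN: every order; customer columns are missing for order 3 */
    select e.Order_ID, e.Product, c.Customer_Name
        from Electronic as e left join Electronic_CustInfo as c
        on e.Order_ID = c.Order_ID;

    /* RIGHT JOIN: every customer row; product is missing for order 4 */
    select c.Order_ID, e.Product, c.Customer_Name
        from Electronic as e right join Electronic_CustInfo as c
        on e.Order_ID = c.Order_ID;

    /* FULL JOIN: all rows from both sides; COALESCE picks the non-missing key */
    select coalesce(e.Order_ID, c.Order_ID) as Order_ID,
           e.Product, c.Customer_Name
        from Electronic as e full join Electronic_CustInfo as c
        on e.Order_ID = c.Order_ID;
quit;
```

With these rows, the left and right joins each return three rows and the full join returns four.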

Concatenating Query Results

You can concatenate two query results using the “Union” operator. The Union operator takes the unique rows from both query results and generates a report.

Remember that “Union” does not return duplicate rows. If a row occurs more than once, then only one
occurrence is returned.

Sometimes, you need to return duplicate rows as well. In this case, you can use the keyword “Union All,” which keeps duplicate rows in the output.

You can also combine two or more query results using the operators Except, Intersect, and Outer Union.

Use the operator “Except” to produce rows that are part of the first query only.

Use the operator “Intersect” to produce rows that are common to both the queries.

Use the operator “Outer Union” to concatenate the query results.
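The operators above can be compared on two small tables. The table names North and South come from the demo that follows; the sample rows are assumptions, with one row deliberately present in both tables.

```sas
/* Hypothetical sample rows; order 2 appears in both tables */
data North;
    input Order_ID Region $ Sales_Amount;
    datalines;
1 North 100
2 Metro 200
;
run;

data South;
    input Order_ID Region $ Sales_Amount;
    datalines;
2 Metro 200
3 South 300
;
run;

proc sql;
    /* UNION ALL: duplicates kept, 4 rows */
    select * from North union all select * from South;

    /* EXCEPT: rows only in the first query, here order 1 */
    select * from North except select * from South;

    /* INTERSECT: rows common to both queries, here order 2 */
    select * from North intersect select * from South;

    /* OUTER UNION CORR: concatenates rows, aligning columns by name */
    select * from North outer union corr select * from South;
quit;
```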

Demo - Concatenating Query Results

This demo shows you how to concatenate the query results using the operator “Union.”

The two datasets, namely North and South, are imported to the SAS console.

The output table is created using the keyword “Create.”

The variables Order ID, region, and sales amount have been selected from the datasets “North” and “South” using the keyword “Select.”

The keyword “Union” is used to concatenate the two datasets. The Union operator produces all unique
rows from both queries.

Note that the variables selected in both the datasets are the same.

The keyword “quit” is used to terminate the procedure.

This concludes the demo on how to concatenate the query results using the operator “Union.”
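A sketch of a program along the lines of this demo is given below. The demo's actual code and data are not in the transcript, so the sample rows, the column names Order_ID, Region, and Sales_Amount, and the output table name Combined_Sales are assumptions.

```sas
/* Hypothetical sample rows; order 2 appears in both tables */
data North;
    input Order_ID Region $ Sales_Amount;
    datalines;
1 North 100
2 Metro 200
;
run;

data South;
    input Order_ID Region $ Sales_Amount;
    datalines;
2 Metro 200
3 South 300
;
run;

proc sql;
    /* UNION keeps one copy of the duplicated row, so 3 rows are stored */
    create table Combined_Sales as
        select Order_ID, Region, Sales_Amount from North
        union
        select Order_ID, Region, Sales_Amount from South;
quit;
```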

Activity

Let’s check your understanding. Play “Organize to Analyze.”

Read the problem carefully and analyze what needs to be done using SAS techniques.

Create a new table with a new variable which is 10% of Sales if Sales is greater than 100 and 5% of Sales
if sales is less than 100 from the Electronic Dataset.

Click each code in the correct sequence to write the program that will be the solution to the
problem. Click the dataset tab to view them.

Hint: Name the new table as “Electronic_Data1” and new variable as “Incentive.” Semicolon can be
clicked any number of times.

Let’s begin “Organize to Analyze.”

Assignment

Let’s practice what you have learned so far in this lesson. There are two Mini Projects in this lesson. Read each question carefully and then answer it. The techniques and steps to assist you are provided under the guide section.

ABC eCommerce company has to create a report in SAS from the master dataset.

The report should display the total sales and profits details for the watch, iron, LED, and LCD products in
descending order.

As a SAS programmer, write the code for the above requirement.

Follow the steps below to solve the problem:

1. Display the Product, Sales, and Profit fields.

2. Create a new table named SalesReport.

3. Look only for the watch, iron, LED, and LCD products.

4. Group the report by product to compute the sum of sales.

5. Sort the report in descending order by Product.

We recommend that you first solve the project and then view the solution to assess your learning.

You can perform this project in the installed SAS University Edition.

Go to the next screen to assess your performance.

Click Next to view the demo.

Assignment 2

ABC eCommerce company has a requirement to create a new table with the Order_ID, Order_Date, Product, and Sales variables from the “Electronic” dataset and the Customer_ID and Customer_Name variables from the “Electronic_Custinfo” dataset.

This table should contain the rows from the Electronic and Electronic_Custinfo datasets, matched on order ID, that have a sales value greater than 150, sorted in descending order.

As a SAS programmer, write the code for the above requirement.

Follow the steps below to solve the problem:

1. Look for values from both tables based on Order_ID

2. Extract Order_ID, Order_Date, Product and Sales variables from Electronic Data Set and Customer_ID, Customer_Name for all records from Electronic_Custinfo dataset

3. Create new table named “Electronic_Data”

4. Sort by sales in descending order

5. Extract rows where sales is greater than 150

We recommend that you first solve the project and then view the solution to assess your learning.

You can perform this project in the installed SAS University Edition.

Go to the next screen to assess your performance.

Click Next to view the demo.

Key Takeaways

Let’s now quickly recap the concepts you have learned in the lesson:

Structured Query Language, or SQL, is a generic database language that helps you communicate with
databases.

PROC SQL allows you to retrieve, summarize, sort, join, and concatenate datasets or databases available
in SAS.

There are various clauses present in PROC SQL:

Select statement or SELECT and FROM clauses,

WHERE clause,

GROUP BY clause,

HAVING clause, and

ORDER BY clause.

The asterisk symbol selects all the columns from the table.

The Inner Join selects all rows from both tables as long as there is a match between the columns in both tables.

The Outer Join returns the matching rows along with some or all of the unmatched rows from one or both tables.

Conclusion

This concludes “PROC SQL.” The next lesson is “Basics of Statistics.”

ANSWERS:

S.No. Question Answer & Explanation

1 Question: Which keyword do we use to end Proc SQL?
Answer: d. We end Proc SQL using the QUIT statement.

2 Question: What do we use the Order By statement for?
Answer: b. We use the Order By statement to sort data in ascending or descending order.

3 Question: We use the Select statement in Proc SQL to select columns from the datasets.
Answer: a. We use the Select statement to select columns from the datasets.

4 Question: Which keyword do we use to create tables in Proc SQL?
Answer: b. We use the Create statement to create tables in SAS.

5 Question: We can join three or more tables in Proc SQL.
Answer: a. We can join three or more tables in Proc SQL.

Lesson 05 — SAS Macros

Introduction

Hi, and welcome back to the “Data Science with Statistical Analysis System or SAS” course offered by
Simplilearn.

In this lesson, “SAS Macros,” you will be introduced to the essential concepts of SAS macros.

What’s In It for Me

In this lesson, you will learn how to minimize the amount of SAS code using SAS Macros.

You will learn how to use macro functions to manipulate character strings and text.

You will also identify the differences between automatic and user-defined macro variables.

Need for SAS Macros

Suppose you have a program that you need to run again and again. Writing the program every time is time-consuming and tiring.

SAS allows you to use macros in your program, which reduces the time spent writing the same code repeatedly.

You can use macros in SAS for the following reasons:

• Changes made in one location of your program cascade throughout your program.

• The programs are data driven, letting SAS decide what to do based on actual data values.

The purpose of the SAS macro language is to generate text that is used in SAS programs; this text can be any valid SAS code, such as statements, variables, text strings, and PROC steps.

Macro variables

Macro variables are tools that enable you to dynamically modify the text in a SAS program through symbolic
substitution. You can assign large or small amounts of text to macro variables, and after that, you can use
that text by simply referencing the variable that contains it. Macro variable values have a maximum length of
65,534 characters.

Let’s step into the syntax classroom to learn how to refer to a macro variable in the code.

Automatic Macro Variables

Macro variables defined by the macro processor are called automatic macro variables. These variables are generally global in scope, so they can be referenced anywhere in a SAS session.

To reference an automatic macro variable, use an ampersand followed by the macro variable name. The names of automatic macro variables start with the three-letter prefix “SYS.”

Following are the most used automatic macro variables:

&SYSLAST macro variable returns the name of the most recently created SAS data set.

&SYSNOBS macro variable returns the number of observations in the last data set.

&SYSDATE and &SYSDATE9 values represent the date on which a SAS session began executing in the two- and
four-digit format of the year, respectively.

&SYSDAY macro variable returns the day of the week on which the SAS job or session began executing.

&SYSTIME macro variable returns the time at which the SAS job or session began executing.

Use the command “%PUT _AUTOMATIC_” to view all available automatic macro variables.
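The automatic macro variables listed above can be inspected with %PUT, which writes text to the SAS log. In this sketch, the data set work.demo is a hypothetical name created only so that &SYSLAST has something to report.

```sas
/* Write selected automatic macro variables to the log */
%put Session started on &SYSDAY &SYSDATE9 at &SYSTIME;

/* Create a small data set so that SYSLAST has a value to report */
data work.demo;
    x = 1;
run;

%put Most recently created data set: &SYSLAST;

/* List every automatic macro variable and its value */
%put _AUTOMATIC_;
```

The values of &SYSDAY, &SYSDATE9, and &SYSTIME reflect when the session started, so they do not change while the session is running.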

Let’s understand the automatic macro variable with the help of an example.

proc print data = Electronic;

where Order_Priority = 'High';

TITLE "Status of Product Orders as of &SYSDAY &SYSDATE";

run;

The “&SYSDAY” and “&SYSDATE” are automatic macro variables created when the SAS session starts.

When the above code is run, we get the output as shown on the screen.

Note that an ampersand symbol is used to refer to those values in the title statement.

User-Defined Macro Variables

User-defined macro variables enable you to create a value once and substitute that value repeatedly within a program.

The %LET Statement creates a macro variable and assigns it a value.

User-Defined Macro Variables

Let’s step into the syntax classroom to learn the syntax of a user-defined macro variable.

The %LET Statement

To create a macro variable, specify the keyword %LET, then the name of the macro variable you want to
create, an equal sign, and the value of the macro variable.

Use the statement “%PUT _USER_;” to view all user-defined macro variables in the SAS log.

The %LET Statement (contd.)

Use the statement “%PUT _ALL_;” to view all user-defined and automatic macro variables in the SAS log.

The %LET Statement (contd.)

To delete user-defined macro variables, specify their names, without the ampersand, after the keyword
%SYMDEL.
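The statements above can be tried together in a short sketch. The macro variable names Order and Cutoff are illustrative only and are not part of the course dataset.

```sas
/* Create two user-defined macro variables (illustrative names). */
%let Order = High;
%let Cutoff = 200;

/* Show the user-defined macro variables, then every macro variable. */
%put _USER_;
%put _ALL_;

/* Delete the macro variables; note that no ampersand is used. */
%symdel Order Cutoff;
```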

User-defined Macro Variable

Look at the following example program that explains the use of a user-defined macro variable.

%LET Order = 'High';

proc print data = Electronic;
    where Order_Priority = &order;
    title "Sales as of &SYSDAY &SYSDATE";
run;

“Order” is the name of the macro variable, and 'High', including the quotation marks, is its value. The
macro processor stores every value as text, so the value can represent a number, a character string, or a date.

When the above code is run, we get the output as shown on the screen. Note that only the rows in which
Order_Priority is “High” are printed.

Macro Functions

Similar to Base SAS functions, SAS macro functions are built-in routines that enable you to perform many
types of manipulation tasks on macro variable values.

The syntax of a macro function is similar to that of a SAS function, and the two often yield similar results.
The difference is that macro functions are executed by the macro processor.

Macro functions are used to:

 Manipulate strings and text.
 Perform arithmetic and logical operations.
 Execute SAS functions.

Macro Functions – Strings and Text

Macro character functions help you to change lowercase words to uppercase, extract a substring of a
character string, get a word from a text, and so on.

Following is the list of the most popular string manipulation functions:

 %UPCASE translates letters from lowercase to uppercase.
 %SUBSTR extracts a substring from a character string.
 %SCAN extracts a word from a character string.
 %INDEX searches a character string for specified text.
 %LENGTH returns the length of a character string or text expression.
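To see these functions in action, here is a minimal sketch that applies each one to a made-up macro variable named text. The expected log output is noted in the comments.

```sas
/* An illustrative value; any text works. */
%let text = data science with sas;

%put %upcase(&text);        /* DATA SCIENCE WITH SAS */
%put %substr(&text,1,4);    /* data */
%put %scan(&text,2);        /* science */
%put %index(&text,sas);     /* 19 */
%put %length(&text);        /* 21 */
```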

Macro Functions – Logical Operations and Execution

Macro functions perform arithmetic and logical operations. These include tasks such as performing simple
arithmetic tasks, computing dates, and evaluating logical expressions.

The %EVAL function evaluates integer arithmetic or logical expressions.

The %SYSEVALF function evaluates arithmetic and logical expressions using floating-point arithmetic.

The macro function “%SYSFUNC” executes SAS functions.
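The difference between these functions is easiest to see side by side. The following sketch uses two made-up macro variables, a and b; the comments show what each statement writes to the log.

```sas
%let a = 7;
%let b = 2;

%put %eval(&a + &b);        /* integer arithmetic: 9 */
%put %eval(&a / &b);        /* integer division truncates: 3 */
%put %sysevalf(&a / &b);    /* floating-point arithmetic: 3.5 */
%put %eval(&a > &b);        /* logical expression: 1 (true) */

/* %SYSFUNC runs a SAS function; the optional second argument is a format. */
%put %sysfunc(today(), date9.);
```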

Macro Functions – Example 1

Let’s track the E-commerce dataset variables Sales and Profit for previous years, grouped by ship mode.

Look at the program shown on the screen to understand how to assign a value to a macro variable and how
to manipulate it.

PROC IMPORT DATAFILE='/folders/myfolders/E Commerce Data.xlsx'
    DBMS=XLSX
    OUT=WORK.E_Commerce;
    GETNAMES=YES;
RUN;

%let DSN=E_Commerce;
%let var=Sales Profit;

proc means data=&DSN;
    title1 "%UPCASE(%SCAN(&VAR,1)) and %UPCASE(%SCAN(&VAR,2)) for %UPCASE(&DSN) channel";
    title2 "prior to %sysfunc(year("&sysdate"d)) calendar year";
    where year(order_date) <(%sysfunc(year("&sysdate"d)));
    var &var;
    class Ship_Mode;
run;

The %LET statement creates a macro variable and assigns a value to it. Here, DSN and var are the macro
variables. The value E_Commerce is assigned to the DSN macro variable, and the text Sales Profit is assigned
to the var macro variable.

Macro Functions – Example 1

To reference a macro variable, precede the name of the macro variable with an ampersand.

The macro processor resolves the reference and substitutes the macro variable's value before the program
compiles and executes.

Thus, the reference “&DSN” is replaced with the value “E_Commerce” and the reference “&var” with the
value “Sales Profit.”

Macro Functions – Example 1

The %SCAN function extracts the nth word from a macro variable, where the words are separated by
delimiters. In ASCII environments, the default delimiters are:

blank ! $ % & ( ) * + , - . / ; < ^ |
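The effect of the delimiters can be checked with a small sketch. The macro variable name and its value below are made up for illustration; the third argument of %SCAN, when given, replaces the default delimiter list.

```sas
/* Hyphen and period are both default delimiters. */
%let file = sales-2015.report;

%put %scan(&file,1);      /* sales */
%put %scan(&file,2);      /* 2015 */

/* Use only the period as a delimiter. */
%put %scan(&file,2,.);    /* report */
```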

Macro Functions – Example 1

The first %SCAN function extracts the “Sales” value from the macro variable “&var.”

The second %SCAN function extracts the “Profit” value from the macro variable “&var.”

Macro Functions – Example 1

The %UPCASE function converts characters to uppercase before the value is substituted in a SAS
program.

Macro Functions – Example 1

The automatic macro variable “&sysdate” supplies the date on which the session started. Its year is used
in the WHERE statement as the current year, so the WHERE statement keeps only the rows whose order
year is earlier than the current year.

Macro Functions – Example 1

The %SYSFUNC function invokes the YEAR() function on a date literal built from the automatic macro
variable “&sysdate” and so returns the current year value.
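The same technique can be tried on its own in the log. This sketch stores the session year in a macro variable named this_year, a name chosen here for illustration, and uses &SYSDATE9 so that the year has four digits.

```sas
/* Extract the year from the session start date. */
%let this_year = %sysfunc(year("&sysdate9"d));

%put Session year: &this_year;
%put Previous year: %eval(&this_year - 1);
```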

Macro Functions – Example 1

When you run this code, you get the output as shown on the screen.

Macro Functions – Example 2

It’s easy to make changes and track statistics for the Aging and Discount values for prior years using SAS
macro functions.

Let’s consider the same program to track statistics for the Aging and Discount value in the previous years.

Simply change the value of the var macro variable from Sales Profit to Aging Discount and run the program.
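In terms of code, the change amounts to a single line, assuming the dataset has columns named Aging and Discount as the lesson describes. The rest of the program from Example 1 stays the same.

```sas
/* Only this assignment changes; the PROC MEANS step is reused as is. */
%let var=Aging Discount;
```

With this value, the first title line resolves to “AGING and DISCOUNT for E_COMMERCE channel.”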

Macro Functions – Example 2

In the output window, you can see the updated report for Aging and Discount statistics.

SYMBOLGEN System Option

SAS also allows you to verify the values of macro variables and display them in the SAS log.

Consider the same example. To verify the values of macro variables, simply add the statement “options
symbolgen;” at the start of the program.

SYMBOLGEN System Option

Run this code and view the log section.

You can find symbolgen messages that display the value of macro variables.
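A minimal sketch of the option at work is shown below. It reuses the DSN macro variable from the example and turns the option off again at the end.

```sas
/* Turn on SYMBOLGEN so that each resolution is reported in the log. */
options symbolgen;

%let DSN = E_Commerce;
title "Summary for &DSN";

/* Turn the option off when it is no longer needed. */
options nosymbolgen;
```

The log should then contain a line similar to “SYMBOLGEN: Macro variable DSN resolves to E_Commerce.”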

SQL Clauses for Macros

You can use PROC SQL to analyze data, calculate values, and create macro variables in a single step.

Suppose you need to store a list of Regions from E-Commerce data in the macro variable.

Look at the program shown on the screen.

proc sql noprint;
    select distinct Region
        into :Regions separated by ','
        from E_Commerce;
quit;

%put Regions=&Regions;

The NOPRINT option suppresses the report.

The DISTINCT keyword ensures that no duplicate values are stored.

SQL Clauses for Macros

The INTO clause of PROC SQL is a very convenient way to store all unique values in one macro
variable.

The SEPARATED BY clause specifies the character or characters used as a delimiter in the value of the
macro variable. Here, the unique regions are separated by a comma.

SQL Clauses for Macros (contd.)

The %PUT statement writes the value of a macro variable to the SAS log. Here “Regions” is the macro variable.
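If you do not have the E-Commerce data at hand, the same clauses can be tried on SASHELP.CLASS, a sample dataset that ships with SAS. In this sketch the macro variable names n_students and sex_list are illustrative, and the TRIMMED keyword, which removes leading and trailing blanks from a single stored value, assumes SAS 9.3 or later.

```sas
proc sql noprint;
    /* Store a single computed value. */
    select count(*) into :n_students trimmed
        from sashelp.class;

    /* Store a delimited list of unique values. */
    select distinct Sex into :sex_list separated by ','
        from sashelp.class
        order by Sex;
quit;

%put Students=&n_students Sexes=&sex_list;
```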

SQL Clauses for Macros (contd.)

The region values are written to the log as shown on the screen.

Now let's do a Knowledge check of what you have learned so far.

S.No.: 1
Question: Global variables cannot be accessed by any SAS program available in the SAS environment.
Answer & Explanation: b. Global variables can be accessed by any SAS program available in the SAS
environment.

The %Macro Statement

Let’s now look at some of the macro statements.

Sometimes, you need to interpret the sales results of various regions. Writing a separate program for each
region is repetitive and time consuming. With the %MACRO statement, you can define a macro once and
pass the required values to it as parameters. A parameter list can contain any number of macro parameters
separated by commas. Note that you cannot use a text expression to generate a macro name in a %MACRO
statement.

Let’s step into the classroom to learn the syntax of the %macro statement.

The %Macro Statement

To create a macro, follow the syntax shown on the screen.

%MACRO MacroName (Param1, Param2, …, ParamN);
    Macro Statements;
%MEND;

The %MACRO statement names the macro and lists its parameters. The %MEND statement ends the macro
definition.

The %Macro Statement (contd.)

To call a macro, follow the syntax shown on the screen.

%MacroName (Value1, Value2, …, ValueN);

You call the macro by writing a percent sign followed by the macro name and passing the required values
to it.

The %Macro Statement (contd.)

Note that semicolons are not required for macro calls, but it is a good programming practice to include one.
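As a minimal sketch of the definition and call syntax, the following macro takes two positional parameters. The macro name print_top and its parameters are made up for illustration, and SASHELP.CLASS is a sample dataset that ships with SAS.

```sas
/* Define a macro with two positional parameters. */
%macro print_top(dsn, n);
    proc print data=&dsn(obs=&n);
    run;
%mend print_top;

/* Call the macro: print the first five rows of SASHELP.CLASS. */
%print_top(sashelp.class, 5)
```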

The %Macro Statement–Example

Look at the following example to understand how to create and call a macro.

%Macro Output(Sales_Amount=);
    Proc Print Data=Electronic;
        where Sales > &Sales_Amount;
    Run;
%Mend;

%Output(Sales_Amount=200);

Here, the macro name is Output, the parameter is Sales_Amount, the statement inside the macro is a WHERE
statement that keeps the rows in which Sales is greater than the sales amount, and the value passed is 200.

When you run this code, you get the output as shown on the screen.
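A keyword parameter, like Sales_Amount above, can also be given a default value in the macro definition. The sketch below shows the idea on SASHELP.CLASS; the macro name tall and the parameter min_height are illustrative and are not part of the course material.

```sas
/* Keyword parameter with a default value of 60. */
%macro tall(min_height=60);
    proc print data=sashelp.class;
        where Height > &min_height;
    run;
%mend tall;

%tall()                 /* uses the default, 60 */
%tall(min_height=65)    /* overrides the default */
```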

Let’s step into the classroom to learn the syntax of the conditional statement.

The Conditional Statement

In macros, you can also set conditions.

%IF condition %THEN %DO;
    action;
%END;
%ELSE %DO;
    alternative action;
%END;

The action in the %THEN block is executed only if the condition is true.

If the condition is false, the %ELSE block, when present, is executed instead.
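The branching can be observed without any dataset by writing messages to the log. In this sketch the macro name check and its parameter amount are made up for illustration.

```sas
%macro check(amount);
    %if &amount >= 200 %then %do;
        %put NOTE: &amount meets the threshold.;
    %end;
    %else %do;
        %put NOTE: &amount is below the threshold.;
    %end;
%mend check;

%check(250)    /* runs the %THEN block */
%check(150)    /* runs the %ELSE block */
```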

The Conditional Statement–Example

Let’s understand this with the help of an example.

%Macro Output(Sales_Amount=);
    %If &Sales_Amount >= 200 %then %do;
        Proc Print data=Electronic;
            where Sales = &Sales_Amount;
        Run;
    %End;
    %Else %do;
        /* PROC CONTENTS describes the structure of the dataset and does not filter rows, */
        /* so the WHERE statement is omitted here. */
        Proc Contents Data=Electronic;
        Run;
    %End;
%Mend;

%Output(Sales_Amount=250);

This is the conditional statement used in the macro.

Here the macro name is Output, the parameter is Sales_Amount, and the value passed is 250.

According to the condition set, the PROC PRINT procedure is executed if the sales amount is greater than
or equal to 200, and the PROC CONTENTS procedure is executed if the sales amount is less than 200.

In this example, the passed sales amount value, 250, is greater than 200, so the PROC PRINT procedure is
executed.

When you run this code, you get the output as shown on the screen.

The Conditional Statement–Example (contd.)

If the sales amount value is passed as 150, which is below 200, the condition is false. Therefore, the %ELSE
part of the conditional statement is executed. The PROC CONTENTS procedure is in the %ELSE block, and
the output obtained is shown on the screen.

Activity

You are a SAS developer in a leading organization and need to prepare a report from an ecommerce dataset.
The condition used to extract the data changes daily with management requirements; for instance, you may
need to fetch LED products one day and watches the next. Writing separate code for each product and each
requirement every day is time consuming and tiring.

Which of the following concepts would you use to code for the above requirement?

Let the dataset name be electronic, the macro name be productwise, and the value for the macro be watch.

Assignment

Let’s practice what you have learned so far in this lesson. Read the question carefully and answer it. The
techniques and steps are provided under the guide section to assist you.

Assignment

A famous ecommerce company wants to create a macro to sort data from the Electronic dataset. It wants to
pass different variable names and a title as macro parameters, and print the dataset with the title.

As a SAS programmer, write the code for the above requirement.

Assignment

Follow these steps to solve the problem:

1. Import the required dataset

2. Create Macro with two parameters “Field” and “Title”.

3. Check if field name is sales and sort the report per the requirement.

4. Print the output.

Assignment

We recommend that you first solve the project and then view the solution to assess your learning.

You can perform this project in the installed SAS University Edition.

Go to the next screen to assess your performance.

Key Takeaways

Let’s now quickly recap the key concepts of this lesson:


 SAS allows you to use macros in your program, which reduces the time spent writing the same code
repeatedly.
 There are two types of macro variables: automatic macro variables, which SAS provides, and user-
defined macro variables, which the user creates and defines.
 SAS also allows you to verify the values of macro variables and display them in the SAS log using
symbolgen.
 The %Mend statement is used to end the %Macro statement.

Conclusion

This concludes the “SAS Macros” lesson. The next lesson is “Basics of Statistics.”

ANSWERS:

S.No.: 1
Question: Macros in SAS start with _____.
Answer & Explanation: c. Macros in SAS start with %Macro.

S.No.: 2
Question: Which of the following macro variables returns the day of the week on which a SAS job or
session began executing?
Answer & Explanation: b. The &SYSDAY macro variable returns the day of the week on which a SAS job or
session began executing.

S.No.: 3
Question: Which of the following functions extracts the nth word from a macro variable?
Answer & Explanation: a. The %SCAN function extracts the nth word from a macro variable.

Lesson 06 — Basics of Statistics

Introduction

Hi, and welcome back to the Data Science with Statistical Analysis System or SAS course offered by
Simplilearn.

In this lesson, “Basics of Statistics,” you will be introduced to the essential concepts of statistics used in
the Statistical Analysis System.

What’s In It for Me

In this lesson, you will understand what descriptive statistics is, its uses, and how it helps to analyze
data. You will learn the various testing techniques used in inferential statistics. You will also
understand the differences between parametric and non-parametric techniques.

Introduction to Statistics

Let’s begin this lesson by defining the term “Statistics.”

Statistics is a mathematical science pertaining to the collection, presentation, analysis, and
interpretation of data.

Introduction to Statistics (contd.)

It is widely used to understand the complex problems of the real world and simplify them to make well-
informed decisions.

Introduction to Statistics (contd.)

Several statistical principles, functions, and algorithms can be used to analyze primary data, build a
statistical model, and predict the outcomes.

Statistical and Non-statistical Analysis

An analysis of any situation can be done in two ways: Statistical analysis or a Non-Statistical analysis.

Statistical analysis is the science of collecting, exploring, and presenting large amounts of data to identify
the patterns and trends. Statistical analysis is also called Quantitative Analysis.

Non-statistical analysis provides generic information and works with text, sound, still images, and moving
images. Non-statistical analysis is also called Qualitative Analysis.

Although both forms of analysis provide results, statistical analysis gives more insight and a clearer
picture, a feature that makes it vital for businesses.

Major Categories of Statistics

There are two major categories of statistics: Descriptive Statistics and Inferential Statistics.

Major Categories of Statistics (contd.)

Descriptive Statistics helps organize data and focuses on the main characteristics of the data. It provides
a summary of the data numerically or graphically. Numerical measures, such as average, mode, standard
deviation or SD, and correlation are used to describe the features of a dataset.

Major Categories of Statistics (contd.)

Suppose you want to study the height of students in a classroom. In descriptive statistics, you would
record the height of every person in the classroom and then find the maximum height, minimum
height, and average height of the class.
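In SAS, this kind of summary can be sketched with PROC MEANS. The example below uses SASHELP.CLASS, a sample dataset of students that ships with SAS, as a stand-in for the classroom; it is not part of the course data.

```sas
/* Count, minimum, maximum, and average height of the students. */
proc means data=sashelp.class n min max mean;
    var Height;
run;
```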

Major Categories of Statistics (contd.)

Inferential Statistics generalizes from a sample to the larger population and applies probability theory to
draw conclusions. It allows you to infer population parameters based on the sample statistics and to model
relationships within the data. Modeling allows you to develop mathematical equations which describe the
interrelationships between two or more variables.

Major Categories of Statistics (contd.)

Consider the same example of studying the height of students in the classroom. In inferential
statistics, you would categorize height as “tall,” “medium,” and “small” and then take only a small
sample from the population to study the height of students in the classroom.

Statistical Terms

The field of statistics touches our lives in many ways. From the daily routines in our homes to the
business of making the greatest cities run, the effects of statistics are everywhere.

Statistical Terms (contd.)

There are various statistical terms that one should be aware of while dealing with statistics:

 Population
 Sample
 Variable
 Quantitative variable
 Qualitative variable
 Discrete variable
 Continuous variable

A population is the group from which data is to be collected.

Statistical Terms (contd.)

A sample is a subset of a population.

Statistical Terms (contd.)

A variable is a characteristic of the members of a population that can differ in quality or quantity from
one member to another.

Statistical Terms (contd.)

A variable differing in quantity is called a quantitative variable, for example, the weight of a person or
the number of people in a car.

Statistical Terms (contd.)

A variable differing in quality is called a qualitative variable or attribute, for example, color or the degree
of damage to a car in an accident.

Statistical Terms (contd.)

A discrete variable is one that can take only distinct, separate values, with no values possible in between.
For example, the number of children in a family.

Statistical Terms (contd.)

A continuous variable is one that can take any value between two given values. For
example, the time taken for a 100-meter run.

Types of Statistical Measures

Typically, there are four types of statistical measures used to describe the data. They are:

 Measures of Frequency
 Measures of Central Tendency
 Measures of Spread
 Measures of Position

Let’s learn each in detail.

Types of Statistical Measures (contd.)

Frequency of the data indicates the number of times a particular data value occurs in the given dataset.
The measures of frequency are number and percentage.
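As a small illustration, PROC FREQ reports both measures. This sketch uses the Sex variable of SASHELP.CLASS, a sample dataset that ships with SAS.

```sas
/* Count and percentage of each value of Sex. */
proc freq data=sashelp.class;
    tables Sex;
run;
```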

Types of Statistical Measures (contd.)

Central tendency indicates whether the data values tend to accumulate in the middle of the distribution
or toward the end. The measures of central tendency are mean, median, and mode.
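All three measures can be requested from PROC MEANS by naming them as statistic keywords. The sketch below again uses the sample dataset SASHELP.CLASS.

```sas
/* Mean, median, and mode of the students' ages. */
proc means data=sashelp.class mean median mode;
    var Age;
run;
```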

Types of Statistical Measures (contd.)

Spread describes how similar or varied the observed values are for a particular variable. The
measures of spread are standard deviation, variance, and quartiles. The measures of spread are also
called measures of dispersion.
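These measures are also available as PROC MEANS keywords. In this sketch on SASHELP.CLASS, QRANGE is the interquartile range, that is, the distance between the first and third quartiles.

```sas
/* Standard deviation, variance, quartiles, and interquartile range of height. */
proc means data=sashelp.class std var q1 q3 qrange;
    var Height;
run;
```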

Types of Statistical Measures (contd.)

Position identifies the exact location of a particular data value in the given dataset. The measures of
position are percentiles, quartiles, and standard scores.
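The following sketch shows one possible way to obtain these measures in SAS. PROC MEANS reports percentiles, and PROC STANDARD rescales a variable to mean 0 and standard deviation 1, which turns each value into a standard score. The output dataset name work.class_z is an assumption made for this example.

```sas
/* Selected percentiles of weight; P25, P50, and P75 are the quartiles. */
proc means data=sashelp.class p10 p25 p50 p75 p90;
    var Weight;
run;

/* Replace Weight with its standard score in a new dataset. */
proc standard data=sashelp.class mean=0 std=1 out=work.class_z;
    var Weight;
run;

proc print data=work.class_z(obs=5);
    var Name Weight;
run;
```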

Procedures in SAS for Descriptive Statistics

Statistical Analysis System, or SAS, provides a list of procedures to perform descriptive statistics. They
are as follows:

 Proc Print
 Proc Contents
 Proc Means
 Proc Freq
 Proc Univariate
 Proc GChart
 Proc Boxplot
 Proc Gplot

Procedures in SAS for Descriptive Statistics (contd.)

Proc Print – It prints all the variables in a SAS dataset.

Proc Contents – It describes the structure of a dataset.

Proc Means – It provides data summarization tools to compute Descriptive Statistics for variables across
all observations and within the groups of observations.

Proc Freq – It produces one-way to n-way frequency and cross-tabulation tables. Frequencies can also be
written to an output SAS dataset.

Proc Univariate - It goes beyond what PROC MEANS does and is useful in conducting some basic
statistical analyses and includes high resolution graphical features.

Proc GChart - The GCHART procedure produces six types of charts: block charts, horizontal and vertical
bar charts, pie and donut charts, and star charts. These charts graphically represent the value of a statistic
calculated for one or more variables in an input SAS dataset. The charted variables can be either numeric
or character.

Proc Boxplot - The BOXPLOT procedure creates side-by-side box-and-whisker plots of measurements
organized in groups. A box-and-whisker plot displays the mean, quartiles, and minimum and maximum
observations for a group.

Proc Gplot – The GPLOT procedure creates two-dimensional graphs, including simple scatter plots, overlay
plots in which multiple sets of data points are displayed on one set of axes, plots against a second
vertical axis, bubble plots, and logarithmic plots.
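A few of the non-graphical procedures from this list can be tried as follows. This is only a sketch on the sample dataset SASHELP.CLASS; the graphing procedures are left out because they may require additional SAS products.

```sas
/* Structure of the dataset: variables, types, lengths. */
proc contents data=sashelp.class;
run;

/* First five observations. */
proc print data=sashelp.class(obs=5);
run;

/* Detailed distribution statistics for one variable. */
proc univariate data=sashelp.class;
    var Height;
run;
```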

Demo- Descriptive Statistics

In this demo, you will learn how to use Descriptive Statistics to analyze the mean from the electronic
dataset.

Let’s import the electronic dataset into the SAS console.

In the left pane, right-click the electronic.xlsx dataset and click Import Data.

The code to import the data is generated automatically. Copy the code and paste it in the new window.

The PROC Means procedure is used to analyze the mean of the imported dataset.

The keyword DATA identifies the input dataset. In this demo, the input dataset is “electronic.”

The output obtained is shown on the screen.

Note that the number of observations, mean, standard deviation, and minimum and maximum values of
the numeric variables in the electronic dataset are obtained.
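The step described in this demo can be sketched as below, assuming the import step wrote the data to a dataset named WORK.ELECTRONIC. With no statistic keywords, PROC MEANS reports N, mean, standard deviation, minimum, and maximum for every numeric variable.

```sas
/* Default descriptive statistics for all numeric variables. */
proc means data=work.electronic;
run;
```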

This concludes the demo on how to use Descriptive Statistics to analyze the mean from the electronic
dataset.

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

S.No.: 1
Question: XYZ plywood manufacturing company wants to check the strength of its plywood. The company
picks one out of every 200 pieces of plywood as a sample to test the quality. What is this scenario an
example of?
Answer & Explanation: b. It is an example of Inferential Statistics. It allows you to infer population
parameters based on sample statistics.

S.No.: 2
Question: A report analyst creates a column chart to compare the sales of the North and South regions.
What is this scenario an example of?
Answer & Explanation: a. It is an example of Descriptive Statistics. It provides a summary of the data
numerically or graphically.

Hypothesis Testing

So far you have learned about descriptive statistics. Let’s now learn about inferential statistics.
Hypothesis testing is an inferential statistical technique to determine whether there is enough evidence
in a data sample to infer that a certain condition holds true for the entire population. To understand the
characteristics of the general population, we take a random sample and analyze the properties of the
sample. We then test whether or not the identified conclusions correctly represent the population as a
whole.

The purpose of hypothesis testing is to choose between two competing hypotheses about the value of a
population parameter.

Hypothesis Testing (contd.)

For example, one hypothesis might claim that the wages of men and women are equal, while the other
might claim that women make more than men.

Hypothesis Testing (contd.)

Hypothesis testing is formulated in terms of two hypotheses:

 Null Hypothesis, which is referred to as H0
 Alternative Hypothesis, which is referred to as H1

The null hypothesis is assumed to be true unless there is strong evidence to the contrary.

The alternative hypothesis is accepted when the evidence is strong enough to reject the null hypothesis.

Hypothesis Testing (contd.)

Let’s understand the null hypothesis and alternative hypothesis using a general example.

The null hypothesis states that no difference or effect exists between variables, and the alternative
hypothesis is any hypothesis other than the null. For example, say a pharmaceutical company has introduced
a medicine in the market for a particular disease, people have been using it for a considerable period
of time, and it is generally considered safe. Here, the statement that the medicine is safe is the
null hypothesis.

Hypothesis Testing (contd.)

To reject the null hypothesis, we need strong evidence that the medicine is unsafe. If the null hypothesis
is rejected, then the alternative hypothesis is accepted.

Variable Types

Before you perform any statistical tests with variables, it is important to recognize the nature of the
variables involved. Based on their nature, variables are classified into four types.

Variable Types (Contd.)

They are categorical or nominal variables, ordinal variables, interval variables, and ratio variables.

Nominal variables are ones which have two or more categories, and it is impossible to order the values.
Examples of nominal variables include gender and blood group.

Variable Types (Contd.)

Ordinal variables have values that are ordered logically. However, the relative distance between two data
values is not clear. Examples of ordinal variables include the size of a coffee cup (large, medium, and
small) and the rating of a product (bad, good, and best).

Variable Types (Contd.)

Interval variables are similar to ordinal variables, except that the values are measured in a way where
their differences are meaningful. With an interval scale, equal differences between scale values do have
equal quantitative meaning. For this reason, an interval scale provides more quantitative information
than the ordinal scale. The interval scale does not have a true zero point. A true zero point means that a
value of zero on the scale represents zero quantity of the construct being assessed.

An example of an interval variable is temperature measured on the Fahrenheit scale. The Fahrenheit scale
does not have a true zero point, because zero degrees Fahrenheit does not indicate a complete absence of
temperature.

Variable Types (Contd.)

Ratio scales are similar to interval scales in that equal differences between scale values have equal
quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional
property: ratios between values are meaningful. For example, the system of inches used with a common
ruler is a ratio scale. There is a true zero point because zero inches does in fact indicate a complete
absence of length.

Hypothesis Testing – Process

Let’s understand the process of hypothesis testing. There are four steps to perform when testing a
hypothesis.

The first step is to make assumptions and state the null hypothesis and the alternative hypothesis.
Assume each sample is an independent random sample and the distribution of the response variable
follows normal distribution. The null hypothesis, or H0, states that a population parameter is equal to a
value. The alternative hypothesis, or H1, states that the population parameter is different from the value
stated in the null hypothesis. The alternative hypothesis is what you believe to be true or hope to
demonstrate.

Hypothesis Testing – Process (contd.)

The second step is to select the appropriate test statistic and the level of significance.
If the population standard deviation, σ, is known and either the data is normally distributed or the
sample size n is greater than 30, you can use the normal distribution or z-statistic.
If the population standard deviation, σ, is unknown and either the data is normally distributed or the
sample size is greater than 30, you can use the t-distribution or t-statistic.

Hypothesis Testing – Process (contd.)

The third step is to compute the appropriate test statistic and calculate its p-value.
Use the formulas shown on the screen to obtain the p-value for the chosen statistic.

Hypothesis Testing – Process (contd.)

The fourth step is to compare the p-value to alpha to interpret the decision.
• If the p-value is less than or equal to alpha, the evidence against the null hypothesis is strong, so
you can reject the null hypothesis.

• If the p-value is greater than alpha, the evidence against the null hypothesis is weak, so you fail
to reject the null hypothesis.

If the p-value is exactly equal to alpha, the result is borderline. By the convention above you reject
the null hypothesis, but the conclusion should be interpreted with caution.
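
The four steps can be sketched in a short DATA step. This is only a minimal illustration and not the
formulas from the original slide: the summary values (xbar, mu0, sigma, s, and n) are hypothetical, and
the two-sided p-values are obtained with the SAS functions PROBNORM and PROBT.

```sas
/* Hypothetical summary values, chosen only for illustration */
data _null_;
    xbar  = 6.4;    /* sample mean                                  */
    mu0   = 6;      /* hypothesized mean under H0                   */
    sigma = 1.2;    /* known population standard deviation (z-test) */
    s     = 1.3;    /* sample standard deviation (t-test)           */
    n     = 36;     /* sample size                                  */
    alpha = 0.05;   /* level of significance                        */

    /* z-statistic: sigma is known */
    z   = (xbar - mu0) / (sigma / sqrt(n));
    p_z = 2 * (1 - probnorm(abs(z)));        /* two-sided p-value */

    /* t-statistic: sigma is unknown, n - 1 degrees of freedom */
    t   = (xbar - mu0) / (s / sqrt(n));
    p_t = 2 * (1 - probt(abs(t), n - 1));    /* two-sided p-value */

    /* Decision rule: 1 = reject H0, 0 = fail to reject H0 */
    reject_z = (p_z <= alpha);
    reject_t = (p_t <= alpha);

    /* Write the results to the SAS log */
    put z= p_z= reject_z= / t= p_t= reject_t=;
run;
```

In practice you would normally let a procedure such as PROC TTEST compute the statistic and p-value
from the raw data; the DATA step above only makes the arithmetic visible.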

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

Question 1: Which of the following has values in a logical order?
Answer: b. Ordinal variables take on values that can be logically ordered or ranked.

Demo-Hypothesis Testing

In this demo, you will learn how to perform hypothesis testing using SAS.

In this example, let’s test the aging, that is, the number of days taken to deliver a product, for a
random sample of observations.

The DATA statement begins the DATA step and names the dataset to be created.

The INPUT statement declares the aging variable, and the CARDS statement reads the data lines that
follow into SAS.

Let’s perform a t-test to check the null hypothesis.

Let’s assume the null hypothesis to be that the mean number of days to deliver a product is 6.

So H0 equals 6. The alpha value is the probability of rejecting the null hypothesis when it is actually
true (a Type I error). The standard choice is 5%, so alpha equals 0.05.

The var statement names the variable to be used in the analysis.

The output is shown on the screen.

Note that the p-value is greater than the alpha value, which is 0.05. Therefore, we fail to reject the null
hypothesis.
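
The program narrated in this demo is not reproduced in the transcript. A minimal sketch of what such a
program could look like is shown below. The dataset name delivery and the data values are hypothetical;
only the aging variable, H0 = 6, and alpha = 0.05 come from the narration.

```sas
/* Hypothetical sample: number of days taken to deliver a product */
data delivery;
    input aging @@;   /* @@ reads several values from each data line */
    cards;
5 7 6 8 4 6 7 5 6 9 6 5
;
run;

/* One-sample t-test of H0: mean aging = 6 at the 5% level of significance */
proc ttest data=delivery h0=6 alpha=0.05;
    var aging;
run;
```

The p-value to compare with alpha appears in the t Value and Pr > |t| columns of the PROC TTEST output.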

This concludes the demo on how to perform the hypothesis testing using SAS.

Parametric and Non-parametric Tests

Let’s now learn about hypothesis testing procedures. There are two types of hypothesis testing
procedures: parametric tests and non-parametric tests.

In statistical inference or hypothesis testing, the traditional tests, such as the t-test and ANOVA, are
called parametric tests. They assume that the data follow a specified probability distribution that is
known except for a set of free parameters.

In simple words, a test is called parametric when the population is assumed to follow a distribution
that is completely described by its parameters, such as the mean and standard deviation of a normal
distribution.

Parametric and Non-parametric Tests

If the population distribution or its parameters are not known and you still need to test a hypothesis
about the population, you use a non-parametric test. Non-parametric tests do not require any strict
distributional assumptions.

Parametric Tests

There are various parametric tests. They are as follows:

• T-test
• ANOVA
• Chi-square
• Linear regression

Let’s understand them in detail.

T-Test:

A t-test determines whether a mean differs significantly from a hypothesized value, or whether the means
of two sets of data differ significantly from each other.

The t-test is used in the following situations:

• To test if the mean is significantly different from a hypothesized value
• To test if the means of two independent groups are significantly different
• To test if the means of two dependent or paired groups are significantly different

Parametric Tests (contd.)

For example:

Let’s say you have to find out which of two regions spends more money on shopping. It’s impractical to
ask everyone in each region about their shopping expenditure.

In this case, you can estimate the average shopping expenditure by collecting sample observations from
each region.

With the help of the t-test, you can check if the difference between the two regions is significant or a
statistical fluke.
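
The three situations in which a t-test is used map onto three forms of PROC TTEST. The sketch below is
illustrative only: the datasets shoppers and shoppers_paired and the variables spend, region,
spend_before, and spend_after are hypothetical names, not part of the course data.

```sas
/* All dataset and variable names below are hypothetical */

/* 1. One sample: is the mean spend different from a hypothesized value of 500? */
proc ttest data=shoppers h0=500;
    var spend;
run;

/* 2. Two independent groups: the CLASS variable must have exactly two levels */
proc ttest data=shoppers;
    class region;
    var spend;
run;

/* 3. Two dependent (paired) groups: two measurements on the same shoppers */
proc ttest data=shoppers_paired;
    paired spend_before*spend_after;
run;
```

The second form corresponds to the two-region shopping example: a small p-value suggests that the
difference in average spend is unlikely to be a statistical fluke.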

Parametric Tests (contd.)

ANOVA:

ANOVA is a generalization of the t-test. It is used to test whether the mean of an interval dependent
variable differs across the levels of a categorical independent variable. It does so by comparing the
variance between groups with the variance within groups. We apply the ANOVA test when we want to
compare the means of two or more groups.

Parametric Tests (contd.)

For example:

Let’s extend the t-test example. Now, you want to compare how much people in various regions spend
every month on shopping. In this case, there are four groups, namely East, West, North, and South. With
the help of the ANOVA test, you can check if the difference between the regions is significant or a
statistical fluke.
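
A one-way ANOVA for this regional example could be sketched as follows. The dataset shoppers and the
variables spend and region are hypothetical; region is assumed to take the values East, West, North,
and South.

```sas
/* Hypothetical dataset: one row per shopper, with region and monthly spend */
proc anova data=shoppers;
    class region;              /* categorical independent variable      */
    model spend = region;      /* interval dependent variable = groups  */
    means region / tukey;      /* pairwise comparisons of region means  */
run;
quit;
```

The F test in the output addresses whether any regional means differ; the Tukey comparisons indicate
which pairs of regions differ. For unbalanced designs with more than one factor, PROC GLM is generally
the safer choice.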

Parametric Tests (contd.)

Chi-Square

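Chi-square is a statistical test used to compare observed data with the data you would expect to obtain
according to a specific hypothesis. Although it is listed here, the chi-square test is often classified
as non-parametric because it works on counts of categorical data and does not assume a normal
distribution.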

Parametric Tests (contd.)

Let’s understand the Chi-Square test through an example.

You have a dataset of male shoppers and female shoppers.
Let’s say you need to assess whether the probability of females purchasing items worth 500 dollars or
more is significantly different from the probability of males purchasing items worth 500 dollars or more.
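
In SAS, a test like this is typically requested through PROC FREQ with the CHISQ option. The sketch
below assumes a hypothetical dataset shoppers with one row per shopper, a gender variable, and a
big_purchase flag that marks purchases of 500 dollars or more.

```sas
/* Hypothetical dataset: gender = 'M' or 'F', big_purchase = 'Yes' or 'No' */
proc freq data=shoppers;
    tables gender*big_purchase / chisq;   /* 2x2 table plus chi-square test */
run;
```

A small p-value for the chi-square statistic suggests that the proportion of large purchases differs
between male and female shoppers.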

Parametric Tests (contd.)

Linear regression

There are two types of linear regression—simple linear regression and multiple linear regression.

Simple linear regression is used to test how well one variable predicts another variable.
Multiple linear regression tests how well several variables (the independent variables) predict
a variable of interest (the dependent variable). When using multiple linear regression, we additionally
assume that the predictor variables are not strongly correlated with one another.

Parametric Tests (contd.)

For example, finding the relationship between two variables, say Sales and Profit, is called simple
linear regression.

Finding the relationship between three or more variables, say Sales, the cost of direct marketing, and
the cost of telemarketing, is called multiple linear regression.

Let’s say an e-commerce company noticed a hike in sales because of two marketing campaigns. They have
three fields: Sales, the cost spent on the direct marketing campaign, and the cost spent on the
telemarketing campaign.

Here, Sales is denoted by S, the cost of the telemarketing campaign by TM, and the cost of the direct
marketing campaign by DM.

So checking the relationship between these three variables (Sales based on the two campaigns) is an
example of multiple regression.

Here, Sales is the dependent variable, and TM and DM are the independent variables.
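
Both kinds of regression can be sketched with PROC REG. The variable names S, TM, and DM follow the
notation used above; the dataset names campaigns and company are hypothetical.

```sas
/* Simple linear regression: how well does Sales predict Profit? */
proc reg data=company;          /* hypothetical dataset */
    model Profit = Sales;
run;
quit;

/* Multiple linear regression: Sales (S) explained by the two campaign costs */
proc reg data=campaigns;        /* hypothetical dataset */
    model S = TM DM;            /* dependent = independent variables */
run;
quit;
```

In the MODEL statement, the dependent variable goes on the left of the equals sign and the independent
variables on the right.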

Non-parametric Tests

Two commonly used non-parametric tests are the Wilcoxon rank sum test and the Kruskal-Wallis H-test.

Wilcoxon Rank Sum Test:

The Wilcoxon rank sum test is a non-parametric statistical hypothesis test used to compare two
independent samples to assess whether or not their population mean ranks differ. Its counterpart for two
related or matched samples is the Wilcoxon signed-rank test.

In the Wilcoxon rank sum test, you test the null hypothesis on the basis of the ranks of the
observations.

Non-parametric Tests (contd.)

Kruskal-Wallis H-Test:

The Kruskal-Wallis H-test is a rank-based non-parametric test used to compare two or more independent
samples of equal or different sample sizes.

In this test, you test the null hypothesis on the basis of the ranks of the observations in the
independent samples.
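
Both rank-based tests are available through PROC NPAR1WAY with the WILCOXON option. The sketch below
uses hypothetical datasets and variables (shoppers_two_regions, shoppers, region, and spend).

```sas
/* Wilcoxon rank sum test: the CLASS variable has exactly two levels */
proc npar1way data=shoppers_two_regions wilcoxon;
    class region;
    var spend;
run;

/* Kruskal-Wallis H-test: same request, CLASS variable with more than two levels */
proc npar1way data=shoppers wilcoxon;
    class region;
    var spend;
run;
```

With two groups the output includes the Wilcoxon two-sample test; the Kruskal-Wallis test is reported
whenever the WILCOXON option is specified.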

Parametric Tests-Advantages and Disadvantages

Parametric tests have both advantages and disadvantages.

The advantages of parametric tests are as follows:

• Provide information about the population in terms of parameters and confidence intervals
• Are easier to use for modeling, analyzing, and describing data with measures of central tendency and
data transformations
• Express the relationship between two or more variables
• Don’t need the data to be converted into rank order before testing

The disadvantages of parametric tests are as follows:

• Assume that the data follow a specific distribution, typically the normal distribution
• Apply only to measured variables, not to categorical attributes

Non-parametric Tests—Advantages and Disadvantages

Let’s now list the advantages and disadvantages of non-parametric tests.

The advantages of non-parametric tests are as follows:

• Simple and easy to understand
• Do not involve population parameters and sampling theory
• Make fewer assumptions
• Provide results similar to parametric procedures

The disadvantages of non-parametric tests are as follows:

• Not as efficient as parametric tests
• Difficult to perform operations on large samples manually

Key Takeaways

Let’s now quickly recap the key concepts of this lesson:

• Descriptive statistics helps organize data and focuses on the main characteristics of the data.
• Inferential statistics generalizes from a sample to the larger population and applies probability
theory to draw a conclusion.
• Hypothesis testing is an inferential statistical technique to determine whether there is enough
evidence in a data sample to infer that a certain condition holds true for the entire population.
• A test is called parametric when the population is assumed to follow a distribution that is
completely described by its parameters.
• If the population distribution or its parameters are not known and you still need to test a
hypothesis about the population, you use a non-parametric test.

Conclusion

This concludes “Basics of Statistics.” The next lesson is “Basic Statistical Procedure.”

ANSWERS:

Question 1: If the p-value is less than the alpha value, what significance does it have for hypothesis
testing?
Answer: a. If the p-value is less than alpha, we reject the null hypothesis.

Question 2: A linguistics professor determining the average student scores is an example of _____.
Answer: c. It is an example of descriptive statistics.

Question 3: What is a t-test an example of?
Answer: a. A t-test is a parametric test.

Question 4: Which of the following tests depends on the probability distribution?
Answer: a. Parametric tests depend on the probability distribution.

Lesson 07 — Statistical Procedure

Introduction

Hi, and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.

In this lesson, “Statistical Procedures,” you will be introduced to the various procedures of statistics
available in Statistical Analysis System.

What’s In It for Me

In this lesson, you will understand the various statistical procedures, such as PROC MEANS, PROC FREQ,
PROC UNIVARIATE, PROC CORR, PROC REG, and PROC ANOVA, that help perform statistical tests. You
will also learn how to create graphs and interpret the results.

Statistical Procedures

Let’s begin this lesson by defining statistical procedures.

Statistical procedures are SAS PROC steps used to summarize, analyze, and present statistical data.

Statistical Procedures (contd.)

There are various statistical procedures that help perform statistical tests:

• PROC MEANS
• PROC UNIVARIATE
• PROC FREQ
• PROC CORR
• PROC REG
• PROC ANOVA

Let’s learn each statistical procedure in detail.

PROC Means

Let’s start with the MEANS procedure.

PROC Means (contd.)

One of the most powerful and flexible procedures of SAS System is PROC MEANS. You can use it rapidly
and efficiently to analyze the values of numeric variables and place those analyses either in the output
window or in a SAS dataset or both. Mastering the basic syntax and features of this procedure will
enable you to analyze your datasets easily.

PROC Means (contd.)

PROC MEANS is used in a variety of analytic, business intelligence, reporting, and data management
situations.

PROC Means (contd.)

PROC MEANS is used to calculate descriptive statistics, estimate quantiles including the median,
calculate confidence limits for the mean, identify extreme values, and perform a t-test.

PROC Means (contd.)

Let’s step into the “Syntax Classroom” to learn the syntax of PROC Means.

PROC Means (contd.)

PROC MEANS <DATA=dataset-name> <options>;

RUN;

The syntax for the MEANS procedure is shown on the screen. By default, PROC MEANS reports the number
of observations, mean, standard deviation, and minimum and maximum values for each numeric variable in
the dataset.

PROC Means—Examples

Let’s now understand the use of SAS procedures using PROC Means as an example.

Example 1: Using a procedure with no options

You can use PROC Means without options. By default, SAS uses the last created dataset, and it generates
the means for all of the numeric variables in that dataset.

PROC MEANS;

RUN;

Look at the output shown on the screen. In this example, from the E-Commerce dataset, the number of
observations, mean, maximum and minimum values, and Standard Deviation are obtained.

Example 2: Using options on the PROC statement

SAS allows you to use various options to generate the desired output when you use PROC Means.

PROC MEANS DATA= E_Commerce;

RUN;

Note that the data= option is optional. However, it is strongly recommended you use it as it avoids
errors of omission when you revise your programs.

Example 2: Using options on the PROC statement (contd.)

You can also specify statistic keywords such as N, MEAN, MODE, and STD (standard deviation) in the
PROC MEANS statement.

Look at the example shown on the screen.

PROC MEANS DATA= E_Commerce N MEAN STD;

RUN;

Look at the output shown on the screen. In this example, only the number of observations, mean, and
standard deviation are obtained from the E_Commerce dataset.

Example 3: Using additional statements

You can also use additional statements in PROC MEANS to get the desired output.

The additional statements available in PROC MEANS are as follows:

• BY
• CLASS
• FREQ
• ID
• OUTPUT
• TYPES
• VAR
• WAYS
• WEIGHT

Example 3: Using additional statements (contd.)

The BY statement calculates the separate statistics for each BY group.

The CLASS statement identifies variables whose values define subgroups for the analysis.

The FREQ statement identifies a variable whose values represent the frequency of each observation.

The ID statement includes additional identification variables in the output dataset.

The OUTPUT statement creates an output dataset that contains specified statistics and identification
variables.

The TYPES statement identifies specific combinations of class variables to use to subdivide the data.

The VAR statement identifies the analysis variables and their order in the results.

The WAYS statement specifies the number of ways to make unique combinations of class variables.

The WEIGHT statement identifies a variable whose values weight each observation in the statistical
calculations.
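
As an illustration of how several of these statements combine, the sketch below uses CLASS, VAR, and
OUTPUT together. It assumes the E_Commerce dataset used in this lesson contains Sales, Profit, and
Ship_Mode; the output dataset name ship_summary and the new variable names are hypothetical.

```sas
/* Summarize Sales and Profit by shipping mode and save the result */
proc means data=E_Commerce noprint;     /* NOPRINT suppresses the printed report */
    class Ship_Mode;                    /* subgroups for the analysis            */
    var Sales Profit;                   /* analysis variables, in this order     */
    output out=ship_summary             /* hypothetical output dataset           */
        mean=avg_sales avg_profit       /* one name per VAR variable             */
        std=sd_sales sd_profit;
run;

/* Inspect the saved statistics */
proc print data=ship_summary;
run;
```

The output dataset also contains the automatic variables _TYPE_ and _FREQ_; the row with _TYPE_ = 0
holds the statistics for all observations combined.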

Example 3: Using additional statements (contd.)

Look at the example shown on the screen. In this example, the VAR and CLASS statements are used.

PROC Means Data= E_Commerce;

Var Sales Profit;

Class Ship_Mode;

Run;

Look at the output shown on the screen. In this example, SAS calculates the average Sales and Profit
within each Ship_Mode type.

The “Standard Class” Ship_Mode appears to have the highest average Sales and average Profit.

Example 4: Using additional statements

Going a step further, SAS helps you compute statistics such as the median, mode, quartiles, kurtosis,
and skewness.

Look at the example program shown on the screen.

PROC Means Data=E_Commerce Mean Median Mode P25 P50 P75;

Var Sales;

Class Ship_Mode;

Run;

The keyword MEAN generates the average of the Sales column for each Ship_Mode type.

The keyword MEDIAN generates the “middle” value, or median, of the Sales column for each Ship_Mode type.

Example 4: Using additional statements

The keyword MODE generates the most frequently occurring value, or mode, of the Sales column for each
Ship_Mode type.

The keyword P25 generates the first quartile (25th percentile) of the Sales column for each Ship_Mode
type.

Example 4: Using additional statements

The keyword P50 generates the second quartile (50th percentile, the same value as the median) of the
Sales column for each Ship_Mode type.

The keyword P75 generates the third quartile (75th percentile) of the Sales column for each Ship_Mode
type.

The output generated is shown on the screen.
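
The example program does not request kurtosis or skewness, although both are mentioned at the start of
this example. A possible variation that adds them is sketched below, again assuming the E_Commerce
dataset contains Sales and Ship_Mode.

```sas
/* Shape statistics alongside the mean and median, by shipping mode */
proc means data=E_Commerce n mean median skewness kurtosis;
    var Sales;
    class Ship_Mode;
run;
```

SKEWNESS and KURTOSIS are statistic keywords of PROC MEANS, specified in the same way as MEAN and
MEDIAN.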

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

Question 1: Which of the following statements is used to identify the analysis variables and their
order in the results?
Answer: a. The VAR statement identifies the analysis variables and their order in the results.

PROC FREQ

So far you have learned the use of SAS procedures using PROC Means. Let’s now learn the use of SAS
procedures using PROC FREQ.

PROC FREQ (contd.)

PROC FREQ is used to obtain frequency distributions and to analyze multi-dimensional tables.
The PROC FREQ statement invokes the procedure and optionally identifies the input dataset. By default,
PROC FREQ uses the most recently created SAS dataset.

PROC FREQ (contd.)

Let’s step into the “Syntax Classroom” to learn the syntax of PROC FREQ.

PROC FREQ (contd.)

PROC FREQ <options>;

BY variable-list;

TABLES requests / options;

WEIGHT variable;

OUTPUT <OUT= SAS-data-set><output-statistic-list>;

FORMAT;

EXACT statistic-keywords < / computation-option >;

TEST options;

The syntax for PROC FREQ is shown on the screen.

The PROC FREQ statement invokes the FREQ procedure. By default, similar to PROC MEANS, the
procedure uses the most recently created SAS dataset.

PROC FREQ (contd.)

The BY statement obtains a separate analysis for each group defined by the BY variables. The input
dataset must first be sorted by those variables.

PROC FREQ (contd.)

The TABLES statement requests cross-tabulation tables and statistics for those tables.

PROC FREQ (contd.)

The WEIGHT statement names a numeric variable that provides a weight for each observation in the
input data set.

PROC FREQ (contd.)

The OUTPUT statement creates an output data set that contains specified statistics and identification
variables.

PROC FREQ (contd.)

The EXACT statement requests exact tests or confidence limits for the specified statistics.

PROC FREQ (contd.)

The TEST statement requests asymptotic tests for measures of association and measures of agreement.

PROC FREQ (contd.)

The statements and options in PROC FREQ fall into three main categories. They are as follows:

1. Controlling the content and appearance of the frequency output
2. Requesting statistical tests
3. Writing tables and results to SAS datasets
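
One way to picture the three categories is a single PROC FREQ step with one TABLES statement for each.
The sketch assumes the E_Commerce dataset contains Ship_Mode; the variable Segment and the output
dataset ship_counts are hypothetical.

```sas
proc freq data=E_Commerce;
    /* 1. Control content and appearance: drop cumulative and percent columns */
    tables Ship_Mode / nocum nopercent;

    /* 2. Request a statistical test: chi-square test on a two-way table */
    tables Ship_Mode*Segment / chisq;      /* Segment is a hypothetical variable */

    /* 3. Write the table to a SAS dataset without printing it */
    tables Ship_Mode / out=ship_counts noprint;
run;
```

The dataset created by OUT= contains one row per value of Ship_Mode with the COUNT and PERCENT
variables.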

Demo—PROC FREQ

In this demo, you will learn how to perform the statistical procedure using PROC FREQ.

Let’s perform the statistical analysis using PROC FREQ on the electronic dataset.

PROC FREQ has no CLASS statement, so we use the BY statement to analyze groups separately. To use the
BY statement in PROC FREQ, the data should first be sorted by the BY variables.

Let’s sort the “Electronic” dataset using the PROC SORT statement.

The data is sorted by product using the BY statement.

Let’s now calculate the frequency of the sorted products from the electronic dataset using the PROC
FREQ statement.

The TABLES statement requests a frequency table for the product variable.

Look at the output shown on the screen. The product table is created with frequency, percent,
cumulative frequency, and cumulative percent columns.

This concludes the demo on how to perform the statistical procedure using PROC FREQ.
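
The demo program itself is not reproduced in the transcript. A minimal sketch consistent with the
narration is shown below; the dataset name electronic and the variable name product are taken from the
narration, and their exact spelling in the course data is an assumption.

```sas
/* Sort the dataset by product */
proc sort data=electronic;
    by product;
run;

/* One-way frequency table for product */
proc freq data=electronic;
    tables product;
run;
```

Sorting is only strictly required when a BY statement is added to PROC FREQ; a plain TABLES request
works on unsorted data as well.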

PROC UNIVARIATE

We hope you have understood the concept of PROC FREQ. Let’s now learn the next statistical procedure,
PROC UNIVARIATE.

PROC UNIVARIATE (contd.)

PROC UNIVARIATE is a powerful base statistical procedure that combines much of the functionality of
other analytical procedures, such as FREQ, MEANS, SUMMARY, and TABULATE, in a single PROC step.

PROC UNIVARIATE (contd.)

The UNIVARIATE procedure provides data summarization tools, high-resolution graphics displays, and
information on the distribution of numeric variables.

PROC UNIVARIATE (contd.)

PROC UNIVARIATE performs the following tasks:

• It calculates descriptive statistics, median, mode, range, quartiles, frequency tables, and
confidence limits.
• It tabulates extreme observations and extreme values and plots the data distribution.
• It performs tests for location and normality.
• It performs goodness-of-fit tests for fitted parametric and nonparametric distributions.
• It creates histograms, including one-way and two-way comparative histograms, as well as comparative
quantile-quantile plots and comparative probability plots.
• It creates output data sets with requested statistics, histogram intervals, and parameters of the
fitted distributions.

PROC UNIVARIATE (contd.)

Let’s step into the “Syntax Classroom” to learn the syntax of PROC UNIVARIATE.

PROC UNIVARIATE (contd.)

The simple syntax of PROC UNIVARIATE is shown on the screen.

PROC UNIVARIATE <DATA=dataset-name> <options>;

RUN;

PROC UNIVARIATE examines the distribution of your data, including an assessment of normality and the
discovery of outliers.

PROC UNIVARIATE (contd.)

The UNIVARIATE procedure allows you to include various options and statements.

Follow the syntax shown on the screen while using various options and statements.

PROC UNIVARIATE <option(s)>;

BY <DESCENDING> variable-1 <…<DESCENDING> variable-n> <NOTSORTED>;

CLASS variable-1<(variable-option(s))> <variable-2<(variable-option(s))>> </ KEYLEVEL='value1'|('value1' 'value2')>;

FREQ variable;

HISTOGRAM <variable(s)> </ option(s)>;

ID variable(s);

INSET <keyword(s) DATA=SAS-data-set> </ option(s)>;

OUTPUT <OUT=SAS-data-set> statistic-keyword-1=name(s) <… statistic-keyword-n=name(s)> <percentiles-specification>;

PROBPLOT <variable(s)> </ option(s)>;

QQPLOT <variable(s)> </ option(s)>;

VAR variable(s);

WEIGHT variable;

The HISTOGRAM statement creates a high-resolution graph of a histogram.

PROC UNIVARIATE (contd.)

The INSET statement inserts a table of summary statistics in a high-resolution graph.

PROC UNIVARIATE (contd.)

The PROBPLOT statement creates a high-resolution graph of a probability plot.

PROC UNIVARIATE (contd.)

The QQPLOT statement creates a high-resolution graph of a quantile-quantile plot.
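
The four graphics statements described above can be combined in one step. The sketch below is
illustrative: the dataset electronic and the variable aging are borrowed from the demos in this
lesson, and the specific options shown are just one reasonable choice.

```sas
proc univariate data=electronic;
    var aging;

    /* Histogram with a fitted normal density curve */
    histogram aging / normal;

    /* Table of summary statistics placed in the top-right corner of the histogram */
    inset n mean std / position=ne;

    /* Normal probability plot and quantile-quantile plot, with the reference
       line based on the estimated mean and standard deviation */
    probplot aging / normal(mu=est sigma=est);
    qqplot   aging / normal(mu=est sigma=est);
run;
```

An INSET statement applies to the plot statement that immediately precedes it, so here the inset
appears on the histogram.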

PROC UNIVARIATE (contd.)

Note that, like PROC PRINT, PROC UNIVARIATE also has a BY statement to produce separate analyses for
each value of the BY variable.

Demo—PROC UNIVARIATE

In this demo, you will learn how to perform the statistical procedure using PROC UNIVARIATE.

Let’s create a histogram using PROC UNIVARIATE.

The statement PROC UNIVARIATE invokes the UNIVARIATE procedure. We have chosen the electronic
dataset.

The VAR statement selects the analysis variables and determines their order in the report. Here, aging is
an analysis variable.

The HISTOGRAM statement creates histograms and can superimpose estimated parametric and
nonparametric probability density curves.

In this example, we will plot a normal curve. To superimpose a normal curve on the histogram, use the
NORMAL option of the HISTOGRAM statement.

Look at the output shown on the screen.

The moments, basic statistical measures, tests for location, quantiles, extreme observations,
histogram plot, and fitted normal distribution are obtained for the aging variable.

This concludes the demo on how to perform the statistical procedure using PROC UNIVARIATE.
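
The demo program is not reproduced in the transcript. A minimal sketch consistent with the narration
is shown below; the dataset name electronic and the variable aging come from the narration, and their
exact spelling in the course data is an assumption.

```sas
/* Distribution analysis of aging with a normal curve on the histogram */
proc univariate data=electronic;
    var aging;
    histogram aging / normal;
run;
```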

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

Question 1: Which of the following procedures calculates the unique values of the variable, the number
of observations at each value, a cumulative count, and a cumulative percent?
Answer: b. PROC FREQ calculates the unique values of the variable, the number of observations at each
value, a cumulative count, and a cumulative percent.

PROC CORR

Let’s now learn the next statistical procedure, PROC CORR.

PROC CORR (contd.)

PROC CORR is a correlation procedure used to check the strength of the relationship between two or more
variables. It computes simple descriptive statistics, the Pearson product-moment correlation
coefficient between variables, Spearman’s rank-order correlation, and the Kendall correlation
coefficient.

It also applies Fisher's z transformation to the Pearson product-moment and Spearman’s rank-order
correlation coefficients to obtain confidence intervals (95% by default).

PROC CORR (contd.)

• Descriptive statistics are used to describe the basic features of the data.
• The Pearson product-moment correlation, or Pearson correlation for short, is used to measure
the linear correlation between two variables.
• Spearman’s rank-order correlation is used to measure the strength of the monotonic relationship
between the ranks of two variables.
• Kendall rank correlation is used to measure the ordinal association between two variables.

PROC CORR (contd.)

Well, let’s now step into the syntax classroom to learn the syntax of PROC CORR.

PROC CORR (contd.)

PROC CORR <DATA=dataset-name> <options>;

Run;

By default, the PROC CORR statement computes Pearson product-moment correlations for the most recently
created dataset. It also computes the probabilities (p-values) used to test the null hypothesis that
each correlation is zero.

PROC CORR (contd.)

There are various options available in the PROC CORR statement, grouped under the Datasets, Statistical
Analysis, Pearson Correlation Statistics, ODS Output Graphics, and Printed Output categories.

The options in each category are listed in the slides that follow.

PROC CORR (contd.)

Datasets

Option Description
DATA= Specifies the input dataset
OUTH= Specifies the output dataset with Hoeffding’s statistics
OUTK= Specifies the output dataset with Kendall correlation statistics
OUTP= Specifies the output dataset with Pearson correlation statistics
OUTS= Specifies the output dataset with Spearman correlation statistics

PROC CORR (contd.)

Statistical Analysis

Option Description
EXCLNPWGT Excludes observations with nonpositive weight values from the analysis
FISHER Requests correlation statistics using Fisher’s Z transformation
HOEFFDING Requests Hoeffding’s measure of dependence
KENDALL Requests Kendall’s tau-b
NOMISS Excludes observations with missing analysis values from the analysis
PEARSON Requests Pearson product-moment correlation
SPEARMAN Requests Spearman rank-order correlation
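
As a brief illustration, several of these options can be requested together in the PROC CORR
statement. The dataset electronic and the variables sales and profit are borrowed from the demo later
in this lesson and are assumptions about the course data.

```sas
/* Pearson, Spearman, and Kendall correlations, plus confidence limits
   based on Fisher's z transformation */
proc corr data=electronic pearson spearman kendall fisher nomiss;
    var sales profit;
run;
```

The FISHER option applies to the Pearson and Spearman coefficients; NOMISS drops any observation with
a missing value in one of the analysis variables.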

PROC CORR (contd.)

Pearson Correlation Statistics

Option Description
ALPHA Computes Cronbach’s coefficient alpha
COV Computes covariances
CSSCP Computes corrected sums of squares and cross products
SINGULAR= Specifies the singularity criterion
SSCP Computes sums of squares and cross products
VARDEF= Specifies the divisor for variance calculations

PROC CORR (contd.)

ODS Output Graphics

Option Description
PLOTS=MATRIX Displays a scatter plot matrix for the analysis variables
PLOTS=SCATTER Displays scatter plots for pairs of analysis variables

PROC CORR (contd.)

Printed Output

Option Description
BEST= Displays the specified number of ordered correlation coefficients
NOCORR Suppresses Pearson correlations
NOPRINT Suppresses all printed output
NOPROB Suppresses P-values
NOSIMPLE Suppresses descriptive statistics
RANK Displays ordered correlation coefficients

Demo—PROC CORR

In this demo, you will learn how to perform the statistical procedure and obtain a scatter plot using
PROC CORR.

Let’s create a table of basic statistics and a correlation matrix for the Electronic dataset.

The PROC CORR statement is used to check the strength of the relationship between two or more variables.

The variables sales, profit, and discount are selected as the analysis variables using the VAR
statement.

The output is shown on the screen.

In each cell, the first value is the correlation coefficient and the second value is the p-value.

In the correlation matrix, each diagonal element is 1 because it is the correlation of a variable
with itself.

Let’s now create a scatter plot matrix.

For this, add the ODS GRAPHICS ON statement before the procedure and the ODS GRAPHICS OFF statement
after it. These statements enable graphics in the output window.

Use the PLOTS= option with the MATRIX keyword to create the scatter plot matrix for the selected variables.

The output is shown on the screen.

The scatter plot matrix is obtained for the variables sales, profit, and discount.

This concludes the demo on how to perform the statistical procedure and obtain a scatter plot using PROC
CORR.
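
The steps in this demo can be sketched as follows. This is a minimal sketch, not the code shown in the course video: the dataset name electronic, the variable names Sales, Profit, and Discount, and the data values are assumptions made up for illustration.

```sas
/* Hypothetical stand-in for the Electronic dataset (values invented for illustration) */
data electronic;
    input Sales Profit Discount;
    datalines;
120 30 0.05
250 80 0.10
90 12 0.02
310 95 0.08
175 40 0.04
;
run;

/* Descriptive statistics, correlation matrix, and a scatter plot matrix */
ods graphics on;
proc corr data=electronic plots=matrix;
    var Sales Profit Discount;
run;
ods graphics off;
```

Removing PLOTS=MATRIX and the two ODS GRAPHICS statements gives only the basic statistics and the correlation matrix.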

PROC REG

So far you have learned statistical procedures such as PROC MEANS, PROC FREQ, PROC UNIVARIATE,
and PROC CORR.

Let’s now learn the next statistical procedure: PROC REG.

PROC REG (contd.)

PROC REG is used to fit linear regression models.

PROC REG (contd.)

Let’s step into the syntax classroom to learn the syntax of PROC REG.

PROC REG (contd.)

PROC REG <options>;
MODEL dependent-variable = independent-variables;
VAR variables;
FREQ variable;
WEIGHT variable;
ID variable;
OUTPUT OUT=SAS-dataset keyword=names ...;
PLOT yvariable*xvariable = symbol ...;
RESTRICT linear_equation, ...;
TEST linear_equation, ...;
MTEST linear_equation, ...;
BY variables;

The MODEL statement specifies the dependent and independent variables in the regression model. Its
options can add a covariance matrix of the estimates and other summary statistics to the output.

PROC REG (contd.)

The OUTPUT statement requests an output dataset.

The ID statement names a variable to identify observations in the printout.

PROC REG (contd.)

The WEIGHT and FREQ statements declare variables to weight observations.

The BY statement specifies variables to define subgroups for the analysis.

Note that if you need to fit a model to the data, you must use a MODEL statement. If you need only the
options available in the PROC REG statement, the VAR statement is required and the MODEL statement
becomes optional.
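
The two situations in the note above can be sketched as follows. The dataset sales_data, its variables Order_ID, Sales, and Quantity, the output names Pred and Resid, and the data values are all assumptions invented for illustration.

```sas
/* Hypothetical data for illustration only */
data sales_data;
    input Order_ID $ Sales Quantity;
    datalines;
A1 120 3
A2 250 7
A3 90 2
A4 310 9
A5 175 5
;
run;

/* Fitting a model: MODEL is required. ID names the variable that identifies
   observations, and OUTPUT saves predicted values and residuals. */
proc reg data=sales_data;
    id Order_ID;
    model Sales = Quantity;
    output out=reg_out predicted=Pred residual=Resid;
run;
quit;

/* No model: only PROC REG statement options are used, so VAR is required */
proc reg data=sales_data corr simple;
    var Sales Quantity;
run;
quit;
```

The QUIT statement ends PROC REG, which otherwise stays active waiting for further statements.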

PROC REG (contd.)

Let’s now learn the options available in the PROC REG statement under the following categories:
datasets, ODS output graphics, traditional graphics, display options, and other options.

PROC REG (contd.)

Datasets

Option Description
DATA Specifies the input dataset
OUTEST Outputs a dataset that contains parameter estimates and other model fit summary statistics
OUTSSCP Outputs a dataset that contains sums of squares and cross products
COVOUT Outputs the covariance matrix for parameter estimates to the OUTEST= dataset
OUTSEB Outputs standard errors of the parameter estimates to the OUTEST= dataset
OUTSTB Outputs standardized parameter estimates to the OUTEST= dataset; use only with the RIDGE= or PCOMIT= option
OUTVIF Outputs the variance inflation factors to the OUTEST= dataset; use only with the RIDGE= or PCOMIT= option
PCOMIT Performs incomplete principal component analysis and outputs estimates to the OUTEST= dataset
RIDGE Performs ridge regression analysis and outputs estimates to the OUTEST= dataset
RSQUARE Outputs the number of regressors, the error degrees of freedom, and the model R-square to the OUTEST= dataset
TABLEOUT Outputs standard errors, confidence limits, and associated test statistics of the parameter estimates to the OUTEST= dataset

PROC REG (contd.)

ODS Output Graphics

Option Description
PLOTS= Produces ODS graphical displays

PROC REG (contd.)

Traditional Graphics

Option Description
ANNOTATE= Specifies an annotation dataset
GOUT= Specifies the graphics catalog in which graphics output is saved

PROC REG (contd.)

Display Options

Option Description
LINEPRINTER Creates the requested plots as line printer plots
ALL Displays all statistics, including the correlation matrix, simple statistics, and the uncorrected sums of squares and cross products matrix

PROC REG (contd.)

Other Options

Option Description
ALPHA= Sets significance value for confidence and prediction intervals and tests
SINGULAR Sets criterion for checking for singularity

Demo—PROC REG

In this demo, you will learn how to perform the statistical procedure and interpret regression results
using PROC REG.

Let’s examine the relationship between variables in the Electronic dataset.

The MODEL statement specifies the dependent and independent variables in the regression model.

In this example, let’s check how sales varies with quantity.

The variable sales is the dependent variable and quantity is the independent variable.

Look at the output shown on the screen.

The Analysis of Variance table is obtained.

Let’s interpret the regression result.

The output provides the p-value and the R-square value. The p-value is used to test the hypothesis, and
the R-square value gives the proportion of the variation in the dependent variable that is explained by
the independent variable.

Note that the output also shows fit diagnostics, residuals, and fit plot details for the dependent variable
“Sales.”

This concludes the demo on how to perform the statistical procedure and interpret regression results
using PROC REG.
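
A minimal sketch of the regression in this demo is shown below. The dataset name electronic, the variable names Sales and Quantity, and the data values are assumptions for illustration, so the numbers in the output will differ from those in the course video.

```sas
/* Hypothetical stand-in for the Electronic dataset */
data electronic;
    input Sales Quantity;
    datalines;
120 3
250 7
90 2
310 9
175 5
220 6
;
run;

/* Sales is the dependent variable; Quantity is the independent variable */
ods graphics on;
proc reg data=electronic;
    model Sales = Quantity;
run;
quit;
ods graphics off;
```

With ODS Graphics enabled, the output includes the fit diagnostics, residual, and fit plots mentioned above, along with the Analysis of Variance table and the parameter estimates.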

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

S.No. 1
Question: Which of the following statements names a variable to identify observations in the printout?
Answer & Explanation: d. The ID statement names a variable to identify observations in the printout.

PROC ANOVA

The ANOVA procedure performs analysis of variance for balanced data from a wide variety of
experimental designs. The data is balanced if there are equal numbers of observations for every
combination of the classification factors.

Whenever the data is not balanced, use the GLM procedure, whose statements are almost identical to
those of PROC ANOVA.

PROC GLM is a general procedure that works with both balanced and unbalanced data.

PROC ANOVA (contd.)

In ANOVA, a continuous response variable, known as a dependent variable, is measured under
experimental conditions identified by classification variables, known as independent variables.

The variation in the response is assumed to be due to effects in the classification, with random
error accounting for the remaining variation.

PROC ANOVA (contd.)

Let’s step into the syntax classroom to learn the syntax of PROC ANOVA.

PROC ANOVA (contd.)

PROC ANOVA <options>;
CLASS variables </ option>;
MODEL dependents = effects </ options>;
ABSORB variables;
BY variables;
FREQ variable;
MANOVA <test-options> </ detail-options>;
MEANS effects </ options>;
REPEATED factor-specification </ options>;
TEST <H=effects> E=effect;

The ABSORB statement absorbs classification effects in a model.

The CLASS statement declares the classification variables.

The FREQ statement specifies a frequency variable.

PROC ANOVA (contd.)

The MANOVA statement performs a multivariate analysis of variance.

The REPEATED statement performs multivariate and univariate repeated measures analysis of variance.

Demo-PROC ANOVA

In this demo, you will learn how to perform the statistical procedure and interpret ANOVA results
when the data is balanced.

Let’s understand PROC ANOVA using an example.

Three Voice Over talents are given 5 subjects each to read. The reading speed is recorded in words per
minute for each subject in the test. Analyze their scores.

The scores of each voice-over talent are shown.

Let’s analyze their scores using PROC ANOVA.

Import the data into the SAS console.

Note that the dataset is named Test.

Use PROC ANOVA to check variance among groups when the data is balanced.

Use the TITLE statement to give the analysis a title. Here, let’s name the analysis ANOVA.

The CLASS statement declares the classification variables. Here, Voice_Over_Talent is the
classification variable.

The MODEL statement specifies the dependent and independent variables in the model.
Here, word count is the dependent variable and voice-over talent is the independent variable.

Use the plot statement to plot the graph for words count and voice over talent.

The output is shown on the screen. Let’s interpret the result.

The F statistic is 7.14 with a p-value of 0.0091. The p-value is less than 0.05, so we reject
the null hypothesis.

We conclude that the mean word counts are not all the same across the voice-over talents.

A graphical comparison allows you to visually see the distribution of the groups.

If the p-value is low, there is generally little overlap between two or more of the groups.

This concludes the demo on how to perform the statistical procedure and interpret ANOVA results when
the data is balanced.
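
The demo can be sketched as follows. The dataset name Test and the variable Voice_Over_Talent come from the narration; the variable name Words_Count and all data values are assumptions invented for illustration, so the F statistic and p-value will not match the 7.14 and 0.0091 quoted above.

```sas
/* Hypothetical balanced data: 3 voice-over talents, 5 readings each */
data Test;
    input Voice_Over_Talent $ Words_Count;
    datalines;
A 150
A 162
A 158
A 149
A 155
B 171
B 168
B 175
B 166
B 170
C 152
C 160
C 157
C 163
C 159
;
run;

title 'ANOVA';
proc anova data=Test;
    class Voice_Over_Talent;                  /* classification variable */
    model Words_Count = Voice_Over_Talent;    /* dependent = independent */
    means Voice_Over_Talent;                  /* group means */
run;
quit;
title;
```

The empty TITLE statement at the end clears the title so that it does not carry over to later output.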

Activity

Let’s check your understanding. Play “Organize to Analyze.”

Read the problem carefully and analyze what needs to be done using SAS techniques.

Generate statistical values from the Electronic dataset for the sales variable, where sales is greater than
100 and Order Priority is “Critical.” Also, limit the output to two decimal places.


Let’s begin “Organize to Analyze.”
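
One possible program for this problem is sketched below, assuming the dataset is named electronic and has the variables Sales and Order_Priority (hypothetical names; the data values are invented for illustration). It is not necessarily the sequence the activity expects.

```sas
/* Hypothetical stand-in for the Electronic dataset */
data electronic;
    length Order_Priority $ 12;
    input Sales Order_Priority $;
    datalines;
250.456 Critical
80.2 Critical
130.75 High
410.129 Critical
95 Low
;
run;

/* MAXDEC=2 limits the output to two decimal places;
   WHERE keeps only the rows that meet both conditions */
proc means data=electronic maxdec=2;
    where Sales > 100 and Order_Priority = 'Critical';
    var Sales;
run;
```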

Assignment 01

Let’s practice what you have learned so far in this lesson. There are two Mini Projects in this lesson. Read
the question carefully and then answer them. The techniques and steps are provided to assist you under
the guide section.

Assignment 01

A consulting firm wants to perform a correlation analysis with descriptive statistics between Sales and
Profit for its E-Commerce client. The E-Commerce dataset keeps track of the number of days taken to
deliver a product, Product Category, Sales, Quantity, Profit, Discount, and Customer Information. The
firm needs to perform the analysis for products that belong to the Product Category Fashion where sales
is more than 150. It also wants to display the information graphically in the form of a symmetric
matrix plot.

As a SAS programmer, write the code for the above requirement.

Assignment 01 (contd.)

Follow these steps to solve the problem:

• Import the E-Commerce data.
• Create a table with the required variables.
• Apply the relevant PROC SQL statement to get the output.
• Plot the matrix graph.
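
The steps above could be sketched as follows. This is one possible approach rather than the course's solution: the dataset name ecommerce, the variable names Product_Category, Sales, and Profit, and the data values are assumptions, and the import step is replaced by an inline DATA step.

```sas
/* Hypothetical stand-in for the imported E-Commerce data */
data ecommerce;
    length Product_Category $ 12;
    input Product_Category $ Sales Profit;
    datalines;
Fashion 200 40
Fashion 180 25
Fashion 320 90
Fashion 120 10
Fashion 410 130
Toys 300 60
;
run;

/* Create a table with the required variables and rows */
proc sql;
    create table fashion as
    select Sales, Profit
    from ecommerce
    where Product_Category = 'Fashion' and Sales > 150;
quit;

/* Correlation with descriptive statistics and a matrix plot */
ods graphics on;
proc corr data=fashion plots=matrix;
    var Sales Profit;
run;
ods graphics off;
```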

Assignment 01 (contd.)

We recommend that you first solve the project and then view the solution to assess your learning.

You can perform this project in the installed SAS University Edition.

Go to the next screen to assess your performance.

Click Next to view the demo.

Assignment 02

XYZ, a pharmaceutical company, has developed four different medicines for headache relief. It wants to
compare the time of relief of these medicines. The company recorded the time of relief in 20 different
patients, with a group of five trying each medicine. XYZ wants to test whether all four medicines take
the same time or whether the times differ.

Following is the relevant data.

Brand 1   Brand 2   Brand 3   Brand 4
22.1      29.4      25.7      26.1
24.2      32.6      29.3      22.4
26.1      27.5      22.4      21.4
28.1      34.5      27.2      26.3
23.2      31.1      28.8      24.2

As a SAS programmer, write the code for the above requirement.

Assignment 02 (contd.)

Follow these steps to solve the problem:

• Import the relevant data
• Choose the relevant statistical procedure statement
• Name the analysis if required
• Identify the dependent and independent variables and perform relevant operations on them
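
One way to follow these steps is sketched below. The relief times are taken from the table in the assignment; the dataset name relief, the variable names Brand and Relief_Time, the brand codes B1 to B4, and the title text are assumptions. PROC ANOVA is used because the data is balanced (five patients per brand).

```sas
/* Each row of the assignment table holds one value per brand; @@ reads
   several Brand/Relief_Time pairs from the same input line */
data relief;
    input Brand $ Relief_Time @@;
    datalines;
B1 22.1 B2 29.4 B3 25.7 B4 26.1
B1 24.2 B2 32.6 B3 29.3 B4 22.4
B1 26.1 B2 27.5 B3 22.4 B4 21.4
B1 28.1 B2 34.5 B3 27.2 B4 26.3
B1 23.2 B2 31.1 B3 28.8 B4 24.2
;
run;

title 'Headache relief time by brand';
proc anova data=relief;
    class Brand;                    /* independent (classification) variable */
    model Relief_Time = Brand;      /* dependent = independent */
run;
quit;
title;
```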

Assignment 02 (contd.)

We recommend that you first solve the project and then view the solution to assess your learning.

You can perform this project in the installed SAS University Edition.

Go to the next screen to assess your performance.

Click Next to view the demo.

Key Takeaways

Let’s now quickly recap the key concepts of this lesson:

• PROC MEANS calculates the number of observations, mean, standard deviation, and
maximum and minimum values from the dataset.
• PROC FREQ is used to obtain a frequency distribution and to analyze multidimensional
tables.
• The UNIVARIATE procedure provides data summarization tools, high-resolution graphics
displays, and information on the distribution of numeric variables.
• PROC CORR is a correlation procedure used to check the strength of the relationship between
two or more variables.
• PROC REG is used to fit linear regression models.
• The ANOVA procedure performs analysis of variance (ANOVA) for balanced data from a wide
variety of experimental designs.

Conclusion

This concludes “Basic Statistical Procedure.” The next lesson is “Data Exploration.”

ANSWERS:

S.No. 1
Question: By default, PROC MEANS creates a summary report with ____.
Answer & Explanation: b. PROC MEANS creates a summary report with N, mean, standard deviation, minimum, and maximum.

S.No. 2
Question: Which of the following statements is required in PROC ANOVA?
Answer & Explanation: c. The CLASS and MODEL statements are used in PROC ANOVA.

S.No. 3
Question: Which of the following statements do we use to specify the regression model?
Answer & Explanation: b. To specify the regression model, use PROC REG.

S.No. 4
Question: Which of the following statements do we use to create a normal curve in the Histogram Chart?
Answer & Explanation: a. Use the Normal statement to create a normal curve in the Histogram Chart.

Lesson 8 — Data Exploration

Introduction

Hi, and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.

In this lesson, you will learn about Data preparation and how to summarize the data.

What’s In It for Me

In this lesson, you will learn how to perform data cleaning and convert numeric values into character
variables, and vice versa.

You will understand the various character and date/time functions.

You will also understand how SAS handles missing values in your datasets using various procedures.

Data Preparation

Let’s start this lesson by defining data preparation. Often, data scientists get data that is not in
the correct format for analysis. To convert the data to the correct format, they perform data
preparation.

Data Preparation (contd.)

Data preparation is a time-consuming task for any analytical project. Data Preparation tasks involve
collecting relevant data, sampling, and aggregating data attributes.

Data Preparation (contd.)

Data is collated at the customer or account level from different sources. These sources may
include billing and payment transactional data, demographic figures, and financial data.

Data Preparation (contd.)

In short, before you perform required analyses, you need to prepare the data you already have.

To prepare your data for the required analysis, you need to clean the data as the first step. Data cleaning
refers to the removal of data values that are incorrect from a data source.

When you clean the data, you may come across dirty data, which contains inaccurate and erroneous
data values. Such inaccuracies often arise when data is downloaded from a server or any other
source.

Therefore, you should perform data cleaning, to avoid erroneous or irrelevant data values.

Data Cleaning—Example

XYZ Company downloads sales data from the server. The column “name” in the sales report has junk
characters at the end of each name. Here the double forward slash is a junk value. Before the company
uses this sales report for analysis, it needs to clean the column “name.”

Look at the example shown on the screen. This example shows how to remove the junk value “double
forward slash” for a single observation.

Data Cleaning—Example(contd.)

The COMPRESS function removes the specified characters from a variable. It is also used to remove
unnecessary spaces from a variable. Here the COMPRESS function removes the double forward slash.

The PUT statement writes variable values to an output line. Here the variable written is “Correct_name.”

When you run the code, the output is generated, and it is shown on the screen.

The double forward slash is removed from the column “name.”
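
A minimal sketch of this cleaning step for a single observation is shown below. The value 'John Smith//' is a made-up example; the variable names name and Correct_name follow the narration.

```sas
data _null_;
    name = 'John Smith//';                /* hypothetical value with junk characters */
    Correct_name = compress(name, '/');   /* remove every forward slash */
    put Correct_name=;                    /* writes Correct_name=John Smith to the log */
run;
```

DATA _NULL_ runs the step without creating a dataset, which is convenient for checking a function on a single value.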

General Comments on Data Cleaning

Each set of data that needs to be cleaned has its own difficulties and challenges.
Answering the following questions helps the “cleaner” tackle the basic cleaning problems:

• Is there a pre-existing data source?
• Are there any business rules that need to be used during cleaning?
• What are the cleaning problems in the new data?

This information needs to be gathered before performing the analysis.

General Observations on Data Cleaning

Following are some general observations on data cleaning:

• The data is always dirtier than you thought it was.
• New problems tend to surface once the existing ones are solved.
• Data cleaning is an ongoing process; it never stops.

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

S.No. 1
Question: Which of the following functions in SAS is used to remove the character string from a variable?
Answer & Explanation: c. The compress function removes the specified characters from a variable.

Data Type Conversion

Hopefully, you have understood the need for data cleaning.

In SAS, while cleaning data, data scientists often need to change the type of a variable. Sometimes
numeric data must be converted to character variables, or vice versa.

To convert from numeric to character, use the Put function. To convert from character to numeric, use
the Input function.

Syntax Classroom

Let’s step into the syntax classroom to learn the syntax for the Put function.

Data Type Conversion (contd.)

The argument source identifies the constant, variable, or expression whose values you are required to
reformat. The source argument can be character or numeric.

Data Type Conversion (contd.)

The argument format specifies a format to use when the variable values are written. This argument must
be the name of a format with a period and optional width and decimal specifications.

Data Type Conversion (contd.)

Note that the format must be of the same type as the source, either character or numeric. That is, if the
source is character, the format name must begin with a dollar sign, and if the source is numeric, the
format name must not begin with a dollar sign.

Data Type Conversion (contd.)

By default, if the source is numeric, the resulting string is right aligned, and if the source is character, the
result is left aligned.

To overcome the default alignment, you can add an alignment specification to a format.

Following are the alignment specifications that change the default alignment:

• -L aligns the value to the left.
• -C aligns the value to the center.
• -R aligns the value to the right.

Numeric to Character Conversion

For example, look at the Electronic dataset that stores the zip code as a numeric value and
Electronic_CustomerInfo dataset that stores the zip code as a character variable.

Numeric to Character Conversion

Look at the program shown on the screen to convert a numeric value to a character variable. Here, the
Put function converts the zip code from numeric and stores it as character.

The Zw. format writes standard numeric data with leading zeros. The Z5. format adds leading zeros
whenever a value has fewer than 5 digits.

The output generated is shown on the screen.

A new character variable called zip code is created utilizing the Put function.
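
The conversion can be sketched as follows, using the general form PUT(source, format.). The dataset name electronic_zip, the variable names Zip, Zip_Code, and Zip_Left, and the zip values are assumptions for illustration. The second call also shows the left-alignment specification.

```sas
/* Hypothetical numeric zip codes */
data electronic_zip;
    input Zip;
    Zip_Code = put(Zip, z5.);      /* character value padded with leading zeros */
    Zip_Left = put(Zip, 8. -L);    /* left-aligned instead of the default right alignment */
    datalines;
2134
90210
501
;
run;

proc print data=electronic_zip;
run;

/* PROC CONTENTS confirms that Zip is numeric and Zip_Code is character */
proc contents data=electronic_zip;
run;
```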

Character to Numeric Conversion

Sometimes numeric data is imported into character variables, and it may be desirable to convert these
character variables into numeric variables.

Note that it is not possible to directly change the type of a variable. It is only possible to write the
variable to a new variable containing the same data, although with a different type.

Character to Numeric Conversion (contd.)

By renaming and dropping variables, it is possible to produce a new variable with the same name as the
original, although with a different type.

There are two methods to convert character to numeric—using the multiplication operator and using
the Input function.

The naive approach is to multiply the character variable by 1, causing SAS to perform an implicit type
conversion. SAS writes a note to this effect in the log. Look at the example code shown on the screen.

This method is considered poor programming practice and should be avoided. The preferable method to
convert a character value to numeric is the Input function. Look at the example code shown on the
screen.
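
Both methods, along with the rename-and-drop technique described earlier, can be sketched as follows. The general form is INPUT(source, informat.). The dataset names custinfo and custinfo_num, the variable names, and the values are assumptions for illustration.

```sas
/* Hypothetical data: contact numbers stored as character */
data custinfo;
    input Contact_Number $ 1-10;
    datalines;
9876543210
9123456780
;
run;

data custinfo_num;
    /* Rename the character variable so the numeric one can reuse its name */
    set custinfo(rename=(Contact_Number=Contact_Char));
    Implicit_Num   = Contact_Char * 1;           /* implicit conversion: log note, best avoided */
    Contact_Number = input(Contact_Char, 10.);   /* explicit conversion with the Input function */
    drop Contact_Char;
run;

/* PROC CONTENTS shows that Contact_Number is now numeric */
proc contents data=custinfo_num;
run;
```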

Let’s step into the syntax classroom to learn the syntax for the Input function.

Character to Numeric (contd.)

The argument source specifies a character constant, variable, or expression to which you want to apply a
specific informat.

Character to Numeric (contd.)

The argument informat refers to the SAS informat that you want to apply to the source. This argument
must be the name of an informat followed by a period, and it cannot be a character constant, variable,
or expression.

Character to Numeric (contd.)

The Input function returns the value produced when a SAS expression is converted using a specified
informat.

Character to Numeric (contd.)

The SAS code that demonstrates character to numeric conversion is shown on the screen. The input
function converts the character variable type to numeric type.

When you run this code, the output is generated and it is shown on the screen.

Note that the contact number type is numeric.

In addition to character and numeric conversions, the Put and Input functions can also be used to
convert date or time values into character variables, and vice versa.
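
A minimal sketch of a date conversion in both directions is shown below. The variable names and the date value are assumptions for illustration; DATE9. and MMDDYY10. are standard SAS informats and formats.

```sas
data _null_;
    date_char = '15JAN2015';                 /* date stored as text */
    date_num  = input(date_char, date9.);    /* character -> SAS date value */
    back_char = put(date_num, mmddyy10.);    /* SAS date value -> character */
    put date_num= back_char=;                /* writes both values to the log */
run;
```

A SAS date value is a number of days counted from January 1, 1960, so date_num prints as a plain number unless a date format is applied.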

Character Functions

The following character functions are extremely useful in data cleaning.

Character Functions (contd.)

The compress function removes specified characters from a variable. It is also used to remove
unnecessary spaces from a variable.

Character Functions (contd.)

The INDEX, INDEXC, and INDEXW functions return the starting position of a character string, character,
or word, respectively, and are extremely useful in determining where to start or stop when substringing
a variable.

Character Functions (contd.)

The Left function justifies the variable value to the left.

Character Functions (contd.)

The LENGTH function returns the number of characters within a character variable value.

Character Functions (contd.)

The LOWCASE function changes all the letters within a variable value to lowercase.

The right function justifies the variable value to the right.

The SCAN function returns a portion of the variable value as defined by a delimiter. For example, the
delimiter could be a space, comma, or semicolon.

The SUBSTR function returns a portion of the variable value based on the starting position and number
of characters.

The TRANSLATE function replaces specific characters with other characters that are specified.

The TRANWRD function replaces a portion of the character string (word) with another character
string or word. For example, a delimiter was supposed to be a comma but data in some cases contains a
colon. This function could be used to replace the colon with a comma.

The trim function removes the trailing blanks from the right-hand side of a variable value.

The UPCASE function changes all the letters within a variable value to uppercase.
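
Several of the functions described above can be tried together on one value, as in the sketch below. The input string and all variable names are assumptions for illustration; the results are written to the log.

```sas
data _null_;
    length raw $ 30;
    raw = '  Smith: John  ';                  /* hypothetical messy value */

    no_blanks  = compress(raw);               /* no second argument: removes blanks */
    pos_colon  = index(raw, ':');             /* position of the first colon */
    fixed      = tranwrd(raw, ':', ',');      /* replace the colon with a comma */
    swapped    = translate(raw, ',', ':');    /* same change, character by character */
    last_name  = scan(fixed, 1, ', ');        /* first word, split on comma or space */
    upper_name = upcase(last_name);
    lower_name = lowcase(last_name);
    first3     = substr(left(raw), 1, 3);     /* left-justify, then take 3 characters */
    len        = length(trim(left(raw)));     /* length without surrounding blanks */

    put no_blanks= pos_colon= fixed= swapped= last_name=
        upper_name= lower_name= first3= len=;
run;
```

Note the argument order of TRANSLATE: the replacement character comes before the character being replaced, which is the reverse of TRANWRD.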

SCAN Function

Let’s step into the syntax classroom to learn the syntax for the Scan function.

SCAN Function

Often, you need to extract a portion of a character variable. To do so, use the Scan function.

SCAN(TEXT,N<,DELIMITERS>);

The Scan function returns the nth word from a text expression.

SCAN Function

Text refers to a character constant, variable, or expression from which you want to extract a word.

SCAN Function

N specifies the number of the word in the character string that you want SCAN to select. If N is positive,
SCAN counts words from left to right, and if N is negative, SCAN counts words from right to left.

SCAN Function

Delimiters are a group of characters used to separate words. The default delimiters are shown on the
screen.

SCAN Function

Let’s extract the first name and last name of the customer into separate variables from the Electronic
customer information dataset.

The first name is extracted using the Scan function with n value equal to 1.

The last name is extracted using the Scan function with n value equal to 2.

The output generated is shown on the screen.

Note that the first name and last name are extracted in different columns.
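
The extraction can be sketched as follows. The dataset name names, the variable names Customer_Name, First_Name, and Last_Name, and the customer names are assumptions for illustration. A space is passed explicitly as the delimiter.

```sas
/* Hypothetical customer names */
data names;
    length Customer_Name $ 40;
    input Customer_Name $ 1-40;
    First_Name = scan(Customer_Name, 1, ' ');   /* first word */
    Last_Name  = scan(Customer_Name, 2, ' ');   /* second word */
    datalines;
John Smith
Mary Jones
;
run;

proc print data=names;
run;
```

For names with more than two words, a negative N such as -1 counts from the right and would return the last word instead.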

Date/Time Functions

Let’s now learn to extract portions of datetime values.

Date/Time functions are a set of functions that return portions of datetime, date, or time values.

These functions are especially useful for extracting the date and time from a datetime value or
converting separate month, day, and year values into a SAS date value.

The MDY function creates a SAS date value from numeric values that represent the month, day, and
year.

Date/Time Functions

Let’s step into the syntax classroom to learn the syntax for MDY function.

Date/Time Functions

MDY(month,day,year)

If the data is numeric, use MDY function to convert the separate variables into a single date value
variable. However, if the data is character then the conversion to numeric should occur first and then
the conversion to the date value should occur.

Let’s understand this with the help of an example.

The Electronic_custinfo dataset contains the month, day, and year in separate variables.

Date/Time Functions

However, there is only a single date variable in the Electronic dataset. To add the month, day, and year
details of the Electronic customer information to the Electronic dataset, use the MDY function.

The FORMAT statement displays the date in the specified format.

The DATE9. format displays the date in the form shown on the screen. Look at the output shown on the
screen.

The MDY function converts the separate variables from the Electronic_custinfo dataset into a single
variable.
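
A minimal sketch of this step is shown below. The dataset name custinfo_dates, the variable names Month, Day, Year, and Order_Date, and the values are assumptions for illustration.

```sas
/* Hypothetical data with the date parts in separate numeric variables */
data custinfo_dates;
    input Month Day Year;
    Order_Date = mdy(Month, Day, Year);   /* combine into one SAS date value */
    format Order_Date date9.;             /* display as, for example, 15JAN2015 */
    datalines;
1 15 2015
12 31 2014
;
run;

proc print data=custinfo_dates;
run;
```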

Various Date/Time Functions

Following is a list of date/time functions that are extremely useful in data cleaning.

Function | Use
Month | Returns the month from a date value
Day | Returns the day from a date value
Year | Returns the year from a date value
Hour | Returns the hour from a time value
Minute | Returns the minute from a time value
Second | Returns the second from a time value
DatePart | Returns the date only from a datetime value
Timepart | Returns the time only from a datetime value
HMS | Returns a time value from the numeric values for hour, minutes, and seconds
Today() | Returns the current date value
Date() | Returns the current date value
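
The functions in this list can be tried out in a single DATA step. The sketch below is illustrative only: the datetime value is made up, and DHMS (which builds a datetime from a date plus hour, minute, and second) is an extra function used here to create a test value; it is not part of the list above.

```sas
data _null_;
    /* Build a sample datetime: 15MAR2015 at 14:30:45 */
    dt = dhms(mdy(3, 15, 2015), 14, 30, 45);

    d  = datepart(dt);        /* date portion of the datetime  */
    t  = timepart(dt);        /* time portion of the datetime  */
    t2 = hms(14, 30, 45);     /* time value built from parts   */
    put d= date9. t= time8. t2= time8.;

    m = month(d);  dd = day(d);     y = year(d);
    h = hour(t);   mi = minute(t);  s = second(t);
    put m= dd= y= h= mi= s=;

    cur = today();            /* DATE() returns the same value */
    put cur= date9.;
run;
```

The PUT statements write the results to the SAS log, so no output dataset is created.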

Missing Value Treatment

So far you have learned how to clean the data and how to convert numeric values to character values,
and vice versa.

Consider the “west region” dataset.

Look at the code shown on the screen.

PROC PRINT DATA=West;

RUN ;

When you run this code, the output is generated as shown on the screen.

You can observe in the output that some observations show a period (.) in place of a value. The period
indicates that the numeric value is missing for those observations.

Let’s now learn how SAS handles these missing data values using SAS procedures.

As a general rule, SAS procedures that perform computations handle missing data by omitting the
missing values.

The way that missing values are handled is not always the same across SAS procedures, so let's
look at some examples.

First, let's run PROC MEANS on our data file and see how it handles the missing values. Note that
the dataset has 50 observations.

Look at the code shown on the screen.

PROC MEANS DATA=West;

VAR Sales;

RUN ;

Note that the PROC MEANS procedure is used.

Look at the output shown on the screen.

The output reports N = 37 for Sales, although the dataset has 50 observations. So, you can conclude
that PROC MEANS leaves the observations with missing values out of its computations.
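
To see the missing count directly rather than inferring it from N, the N and NMISS statistics can be requested together. The sketch below uses a made-up five-row dataset `west_demo`, not the course's West dataset.

```sas
/* Hypothetical miniature dataset; '.' marks a missing numeric value */
data west_demo;
    input Region $ Sales;
    datalines;
West 120
West .
West 340
West .
West 95
;
run;

/* N counts non-missing values, NMISS counts missing values */
proc means data=west_demo n nmiss mean;
    var Sales;
run;
```

With this sample, N is 3 and NMISS is 2, and the mean is computed from the three non-missing values only.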

Using the same example, let’s now run PROC FREQ on our data file and see how it handles the
missing values.

Look at the code shown on the screen.

PROC FREQ DATA=west;

TABLES Sales;

RUN;

Note that the PROC FREQ procedure is used.

Look at the output shown on the screen.

As you can see in the output, PROC FREQ performed its computations using just the available data. Note
that the percentages are based on the number of non-missing cases only.

Following are various SAS procedures and how they handle missing values.

PROC FREQ: By default, missing values are excluded and percentages are based on the number of non-missing
values. If you use the MISSING option in the TABLES statement, the percentages are based on the total number
of observations (non-missing and missing), and the percentage of missing values is reported in the table.
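
A short sketch of the two behaviours, using a made-up dataset `west_freq_demo` (the name and values are assumptions for illustration):

```sas
/* Hypothetical data with two missing Sales values */
data west_freq_demo;
    input Region $ Sales;
    datalines;
West 120
West .
West 340
West .
West 120
;
run;

proc freq data=west_freq_demo;
    tables Sales;             /* default: missing values excluded from percentages */
    tables Sales / missing;   /* missing values counted as their own level         */
run;
```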

PROC MEANS: If a class variable has missing values, PROC MEANS excludes those observations. To include
them, use the MISSING option in the PROC MEANS statement or in the CLASS statement.
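
The effect of the MISSING option on a class variable can be sketched as follows. The dataset `west_class_demo` and its `Segment` variable are hypothetical.

```sas
/* Hypothetical data: Segment is missing for two observations */
data west_class_demo;
    input Segment $ Sales;
    datalines;
Retail 120
. 200
Online 340
. 80
Retail 95
;
run;

/* Default: observations with a missing Segment are dropped */
proc means data=west_class_demo n mean;
    class Segment;
    var Sales;
run;

/* MISSING: a missing Segment forms its own class level */
proc means data=west_class_demo n mean missing;
    class Segment;
    var Sales;
run;
```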

PROC CORR: By default, correlations are computed from all pairs with non-missing data, that is, pairwise
deletion of missing data. The NOMISS option in the PROC CORR statement requests that correlations be
computed only from observations that have non-missing data for all variables in the VAR statement.
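
A minimal sketch contrasting the two modes, with made-up data in a hypothetical dataset `west_corr_demo`:

```sas
/* Hypothetical data with scattered missing values */
data west_corr_demo;
    input Sales Profit Quantity;
    datalines;
120 30 4
. 12 2
340 . 9
95 20 3
210 55 6
;
run;

/* Default: pairwise deletion, each correlation uses every complete pair */
proc corr data=west_corr_demo;
    var Sales Profit Quantity;
run;

/* NOMISS: listwise deletion, only rows complete on all VAR variables */
proc corr data=west_corr_demo nomiss;
    var Sales Profit Quantity;
run;
```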

PROC REG: If an observation has a missing value for any variable in the MODEL or VAR statement, that
observation is excluded from the analysis, that is, listwise deletion of missing data.

SAS has a number of procedures that help you present a report in the desired format.

One of the most commonly used procedures for data summarization is PROC REPORT.

Let’s step into the syntax classroom to learn the syntax for Proc Report.

PROC REPORT DATA= datasetname;

COLUMN variable list and column specifications;

DEFINE column / column usage and attributes;

COMPUTE column;

compute block statements;

ENDCOMP;

RUN;

The COLUMN statement describes the arrangement of all columns and of headings that span more than
one column.

The DEFINE statement describes how to use and display a report item.

The COMPUTE and ENDCOMP statements enclose one or more programming statements that PROC REPORT
executes as it builds the report.

Let’s understand PROC REPORT with the help of an example.

Look at the program shown on the screen.

The report is built from the Electronic dataset with the columns Order ID, Product, and Sales.

An incentive column is computed in the report as 10% of Sales.

The incentive variable is formatted in dollars with one decimal place.
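
The on-screen program is not reproduced in this transcript. A sketch of a program that matches the description might look like the following; the dataset `electronic_report_demo`, its variable names, and its values are assumptions made for illustration.

```sas
/* Hypothetical miniature of the Electronic dataset */
data electronic_report_demo;
    input Order_ID Product :$20. Sales;
    datalines;
1001 Laptop 1200
1002 Phone 650
1003 Tablet 300
;
run;

proc report data=electronic_report_demo nowd;
    /* Incentive is listed after Sales so Sales is available when it is computed */
    column Order_ID Product Sales Incentive;
    define Order_ID  / display 'Order ID';
    define Product   / display 'Product';
    define Sales     / display format=dollar10.;
    define Incentive / computed format=dollar10.1 'Incentive';
    compute Incentive;
        Incentive = Sales * 0.10;   /* 10% of Sales */
    endcomp;
run;
```

Sales is defined with DISPLAY usage here so that the compute block can refer to it simply as `Sales`; with the default ANALYSIS usage the reference would be `Sales.sum`.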

Look at the output shown on the screen.

The sales report table is created with Order ID, Product name, Sales, and Incentive.

Note that there is a dollar sign before the values in the Sales variable. The incentive is computed per
the given calculation.

Let’s practice what you have learned so far in this lesson. Read the question carefully and then answer
it. The techniques and steps in the guide section are provided to assist you.

A leading consulting firm wants to create a summary report grouped by Region and Customer Name for
its client. The dataset tracks Customer Name, Region, Sales, Profit, and Shipping Cost. The firm also
wants Sales, Profit, and Shipping Cost grouped under one “Data” header. As a SAS programmer, write the
code for this requirement. Note that the dataset contains a lot of junk characters, so clean the dataset
before you perform the task.

Follow the steps below to solve the problem:

1. Import the dataset
2. Perform data cleaning
3. Define the requirement
4. Use relevant code to generate the summary report

We recommend that you first solve the project and then view the solution to assess your learning.

You can perform this project in the installed SAS University Edition.

Key Takeaways

Let’s now quickly recap the concepts you have learned in the lesson:

 Optimization is a mathematical technique for finding the maximum or minimum value of a function
subject to constraints.
 Optimization techniques help cut operational costs and maximize the profit of the company.
 The types of optimization programming include linear programming, mixed integer linear
programming, quadratic programming, and nonlinear programming.
 The objective functions and constraints can be linear or nonlinear.
 PROC OPTMODEL is used to model linear, mixed integer linear, and quadratic optimization
programs.
 A solver is a method or procedure for solving an optimization problem.

Conclusion

This concludes “Data Exploration.” The next lesson is “Advanced Statistics.”

ANSWERS:

S.No. | Question | Answer & Explanation
1 | Which of the following functions returns the number of characters in SAS? | d. Length function of SAS returns the number of characters.
2 | Which of the following functions converts a Numeric value to a Character Value? | a. The Put function in SAS is used to convert Numeric to Character Value.
3 | The column statement in Proc Report is used to: | d. Column statement in Proc Report is used to describe the arrangement of columns.

Lesson 09 — Advanced Statistics

Introduction

Hi, and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.

In this lesson, “Advanced Statistics,” you will learn about clustering, decision tree, linear regression, and
logistic regression.

What’s In It for Me

In this lesson, you will learn how to create a cluster and to perform cluster analysis on the dataset.

You will learn about the decision tree.

You will also learn to identify the regression types and to analyze the variations of the variables.

Introduction to Cluster

Clustering is the process of organizing similar objects into groups.

For example, an e-commerce company wants to collect and analyze information about customers who have
bought or shown interest in an iPhone. This allows the company to target them for future sales.

The analysis that groups customers with similar behavior is called cluster analysis. It is also used to
summarize the data.

SAS clustering procedures are used to cluster observations or variables in a SAS dataset. There are various
types of cluster analyses available in SAS:

 Disjoint clusters
 Hierarchical clusters
 Overlapping clusters
 Fuzzy clusters

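Disjoint clusters place each observation in one and only one cluster.

Hierarchical clusters are organized so that one cluster can be entirely contained within another, but no
other kind of overlap between clusters is allowed.

Overlapping clusters can be constrained to limit the number of observations that belong to two clusters at
once, or left unconstrained to allow any degree of overlap.

Fuzzy clusters are defined by a probability or grade of membership of each object in each cluster.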

Following are the SAS cluster analysis procedures:

 PROC ACECLUS obtains approximate estimates of the pooled within-cluster covariance matrix when
the clusters are assumed to be multivariate normal with equal covariance matrices.
 PROC CLUSTER clusters the observations in a SAS dataset hierarchically.
 PROC DISTANCE computes the various measures of distance, dissimilarity, or similarity between the
observations of a SAS dataset.
 PROC FASTCLUS performs disjoint cluster analysis on the basis of distances computed from one or
more quantitative variables.
 PROC MODECLUS clusters the observations in a SAS dataset by using algorithms based on nonparametric
density estimates.
 PROC VARCLUS divides a set of numeric variables into disjoint or hierarchical clusters.
 PROC TREE produces a tree diagram, also known as a dendrogram or phenogram, from a dataset
created by the PROC CLUSTER or PROC VARCLUS.

Let’s learn about PROC CLUSTER in detail.

Let’s step into syntax classroom to learn the syntax of a PROC CLUSTER.

PROC CLUSTER

The syntax for the PROC CLUSTER is shown on the screen.

PROC CLUSTER METHOD=name <options>;

BY variables;

COPY variables;

FREQ variable;

ID variable;

RMSSTD variable;

VAR variables;

The PROC CLUSTER statement calls the cluster procedure.

The METHOD= option specifies the clustering method.

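Only the PROC CLUSTER statement is required. The VAR statement, and possibly the ID and COPY statements,
are usually all that is needed in addition. The FREQ statement is optional unless the RMSSTD statement is
used. The RMSSTD option in the PROC CLUSTER statement displays the root-mean-square standard deviation of
each cluster, while the RMSSTD statement names a variable that already holds these values when the input
observations are themselves cluster means.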

The COPY statement copies the listed variables from the input dataset to the OUTTREE= dataset. The
OUTTREE= option names the output dataset that stores the cluster tree.

The ID statement identifies the observations in the displayed cluster history and in the OUTTREE= dataset.

The VAR statement lists the numeric variables to be used in the cluster analysis.

Clustering Methodologies

There are various clustering methods available in SAS:

 Average method
 Centroid method
 Complete method
 Density method
 EML method
 Flexible method
 Single method
 Ward method

METHOD=AVERAGE requests average linkage. In average linkage, the distance between two clusters is the
average distance between pairs of observations, with one observation in each cluster.

METHOD=CENTROID requests the centroid method. In the centroid method, the distance between two
clusters is defined as the squared Euclidean distance between their centroids or means.

METHOD=COMPLETE requests complete linkage. In complete linkage, the distance between two
clusters is the maximum distance between an observation in one cluster and an observation in the other
cluster.

METHOD=DENSITY requests density linkage. Density linkage is a class of clustering methods that use
nonparametric probability density estimation.

METHOD=EML joins clusters to maximize the likelihood at each level of the hierarchy.

METHOD=FLEXIBLE requests the Lance-Williams flexible-beta method. The BETA= option specifies the beta
value for the flexible-beta method.

METHOD=SINGLE requests single linkage. In single linkage, the distance between two clusters is the
minimum distance between an observation in one cluster and an observation in the other cluster.

METHOD=WARD requests Ward’s minimum-variance method. In Ward’s minimum-variance method,
the distance between two clusters is the ANOVA sum of squares between the two clusters added up
over all the variables.
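
The verbal definitions of the distance-based methods above can be written compactly. In the sketch below the notation is introduced here and is not from the course slides: C_K and C_L are two clusters with N_K and N_L observations, \bar{x}_K and \bar{x}_L are their mean vectors, and d(x_i, x_j) is the distance between two observations.

```latex
\begin{aligned}
\text{Average:}  \quad & D_{KL} = \frac{1}{N_K N_L} \sum_{i \in C_K} \sum_{j \in C_L} d(x_i, x_j) \\
\text{Centroid:} \quad & D_{KL} = \lVert \bar{x}_K - \bar{x}_L \rVert^2 \\
\text{Complete:} \quad & D_{KL} = \max_{i \in C_K} \max_{j \in C_L} d(x_i, x_j) \\
\text{Single:}   \quad & D_{KL} = \min_{i \in C_K} \min_{j \in C_L} d(x_i, x_j) \\
\text{Ward:}     \quad & D_{KL} = \frac{\lVert \bar{x}_K - \bar{x}_L \rVert^2}{\frac{1}{N_K} + \frac{1}{N_L}}
\end{aligned}
```

The Ward expression is the between-cluster ANOVA sum of squares for the pair, which is why merging the pair with the smallest value keeps the within-cluster variance as low as possible at each step.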

Demo-clustering Method

This demo explains how to create clusters based on the sales and profit variables in the Electronic dataset.

Let’s perform cluster analysis on our e-commerce dataset.

Import the “Electronic” dataset to the SAS console. Follow the import steps to import the relevant dataset.

The PROC CLUSTER statement invokes the CLUSTER procedure.

The PRINT= option specifies the number of generations of the cluster history to display. Here, we have
used PRINT=7.

The SIMPLE option displays the descriptive statistics.

The METHOD= option determines the clustering method used by the procedure. In this example, we use the
CENTROID method because it is more robust to outliers than most other hierarchical methods.

We can obtain the root-mean-square standard deviation of each cluster using the RMSSTD option.

The RSQUARE option displays the R-square and semipartial R-square values.

The values of the ID variable identify the observations in the displayed cluster history and in the OUTTREE=
data set. If the ID statement is omitted, each observation is denoted by OBn, where n is the observation
number.

The VAR statement lists numeric variables to be used in the cluster analysis. In this example, we have used
the sales and profit variables from the electronic dataset.

Let’s run this program to see the output.

The cluster history table is generated, which shows the number of clusters and the variation between the
clusters at each step.

The results of cluster analysis are best summarized using a dendrogram.

In a dendrogram, the distance is plotted on X axis, and the sample units are plotted on Y axis.

The tree shows how sample units are combined into clusters. It also shows the height of each branching point
corresponding to the distance at which two clusters are joined.
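
The demo program itself is not reproduced in this transcript. A sketch that uses the options described above might look like this; the dataset `electronic_cluster_demo`, its values, and the output dataset names `tree_demo` and `clusters_demo` are assumptions for illustration, and the choice of three clusters in PROC TREE is arbitrary.

```sas
/* Hypothetical miniature of the Electronic dataset */
data electronic_cluster_demo;
    input Order_ID $ Sales Profit;
    datalines;
1001 1200 240
1002 650 90
1003 300 45
1004 1150 260
1005 700 80
1006 280 50
1007 900 150
1008 1300 300
;
run;

/* Hierarchical clustering on Sales and Profit with the centroid method */
proc cluster data=electronic_cluster_demo method=centroid
             simple rmsstd rsquare print=7
             outtree=tree_demo;
    id Order_ID;
    var Sales Profit;
run;

/* Draw the dendrogram from the OUTTREE= dataset and cut it into 3 clusters */
proc tree data=tree_demo nclusters=3 out=clusters_demo;
run;
```

The OUT= dataset from PROC TREE holds one row per original observation with its assigned cluster, which is convenient for merging the cluster labels back onto the data.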

This concludes the demo on creating clusters based on the sales and profit variables in the Electronic dataset.

K Means Clustering

As discussed earlier, there are various cluster analysis procedures. The most commonly used one is
PROC FASTCLUS, which performs K-Means clustering.

The K-Means clustering aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean.

PROC FASTCLUS is used in a variety of analytic, business intelligence, reporting, and data management
situations.

Let’s step into the syntax classroom to learn the syntax of PROC FASTCLUS.

PROC FASTCLUS <MAXCLUSTERS= n> <RADIUS= t> <options>;

VAR variables;

ID variables;

FREQ variable;

WEIGHT variable;

BY variables;

The PROC FASTCLUS statement calls the FASTCLUS procedure.

The MAXCLUSTERS=n option specifies the maximum number of clusters permitted. The default value of
MAXCLUSTERS is 100.

The RADIUS=t option specifies the minimum distance that an observation must have from the previous seeds
to be selected as a new seed. By default, t = 0.

Let’s understand K-Means clustering with the help of an example, using the same Electronic dataset.

The Electronic dataset is imported to the SAS console.

The PROC FASTCLUS statement calls the FASTCLUS procedure.

The out= option specifies the output dataset. Here, the output is stored in the “electronic dataset” table.

The MAXCLUSTERS= option sets the maximum number of clusters, and the MAXITER= option sets the maximum
number of iterations.

The sales and profit variables are chosen to perform K-Means clustering.
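
A sketch of a PROC FASTCLUS call that matches this description is shown below. The dataset `electronic_kmeans_demo`, its values, the output dataset name `kmeans_out_demo`, and the settings of three clusters and ten iterations are assumptions chosen for illustration.

```sas
/* Hypothetical miniature of the Electronic dataset */
data electronic_kmeans_demo;
    input Order_ID $ Sales Profit;
    datalines;
1001 1200 240
1002 650 90
1003 300 45
1004 1150 260
1005 700 80
1006 280 50
1007 900 150
1008 1300 300
;
run;

/* K-Means clustering on Sales and Profit */
proc fastclus data=electronic_kmeans_demo
              maxclusters=3 maxiter=10
              out=kmeans_out_demo;
    var Sales Profit;
run;

/* The OUT= dataset adds CLUSTER and DISTANCE to the original variables */
proc print data=kmeans_out_demo;
run;
```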

When you run this code, the output is generated, and it is shown on the screen.

The output lists each cluster along with the maximum distance from its seed to its observations.

For the first cluster this distance is zero, and the last cluster has the largest value.

Now let's do a Knowledge check of what you have learned so far.

S.No. | Question | Answer & Explanation
1 | Which of the following methods is used to join clusters to maximize the likelihood at each level of the hierarchy? | a. The Method= EML joins clusters to maximize the likelihood at each level of the hierarchy.

Decision Tree

So far you have learned about clustering, and how to perform cluster analysis using SAS.

Let’s now learn the next concept of this lesson “Decision Tree.”

A decision tree is a powerful multivariate analysis technique that identifies ways to split a dataset into
branch-like segments.

In decision trees, each segment or branch is called a node. The bottom nodes of a decision tree are called
leaves.

Decision trees can also support other modeling approaches, for example by selecting inputs or creating
dummy variables for a regression equation.

Decision trees are useful because they find the relationship between the input values and the target values
in a group of observations.

Let’s step into the syntax classroom to learn the syntax of decision trees.

PROC DTREE

PROC DTREE options ;

EVALUATE / options ;

MODIFY specifications ;

MOVE specifications ;

QUIT ;

RECALL ;

RESET options ;

SAVE ;

SUMMARY / options ;

TREEPLOT / options ;

VARIABLES / options ;

VPC specifications ;

VPI specifications ;

The decision tree procedure begins with the PROC DTREE statement and terminates with the QUIT statement.

The EVALUATE statement evaluates the decision tree and calculates the optimal decisions.

The MODIFY statement is used to change either the type of a stage or the reward from an outcome.

The MOVE statement is used to change the order of the stages.

The QUIT statement terminates the processing.

The RECALL statement tells PROC DTREE to recall the decision model that was saved previously with a SAVE
statement.

The RESET statement is used to reset the options after the procedure has started.

The SAVE statement saves the current decision model.

The SUMMARY statement displays the optimal decision summary.

The TREEPLOT statement plots the current decision tree.

The VARIABLES statement specifies the variables in the input dataset.

The VPC statement computes the value of perfect control, that is, the value of being able to control an
uncertainty.

The VPI statement computes the value of perfect information.

Decision tree — Example

Many financial decisions are difficult to analyze because of the variety of available strategies and the
continuous nature of the problems.

Look at the example that has been taken from the SAS university edition.

A loan officer must decide whether to approve or deny an application for a one-year $30,000 loan at the
current interest rate of 15%. If the application is approved, the borrower will either pay off the loan in
full after one year or default. Based on experience, the default rate is about 36 out of 700. If the loan is
denied, the money is put in government bonds at an interest rate of 8%.

To obtain more information about the applicant, the loan officer can engage a credit investigation unit, at a
cost of $500 per person, that gives either a positive or a negative recommendation for making the loan.
Past experience with this investigator shows that, of those who ultimately paid off their loans, 570 out of
664 were given a positive recommendation. On the other hand, 6 out of the 36 who defaulted had also been
given a positive recommendation by the investigator.
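
Before turning to PROC DTREE, the expected values behind this problem can be checked by hand. The DATA step below is a sketch under two stated assumptions that are not spelled out in the text: a repaid loan earns one year of interest (15% of $30,000), and a default loses the whole principal. It also takes the number of defaulters as 700 − 664 = 36. All variable names are illustrative.

```sas
data _null_;
    loan = 30000;  rate_loan = 0.15;  rate_bond = 0.08;  fee = 500;
    n_total = 700; n_paid = 664;      n_default = n_total - n_paid;  /* 36 */
    pos_paid = 570; pos_default = 6;  /* positive recommendations */

    pay_ok      = loan * rate_loan;   /* interest earned if repaid           */
    pay_default = -loan;              /* assumption: whole principal is lost */
    pay_deny    = loan * rate_bond;   /* bond interest if the loan is denied */

    /* Approve without investigation */
    ev_approve = (n_paid * pay_ok + n_default * pay_default) / n_total;

    /* With investigation: take the better action after each recommendation */
    n_pos  = pos_paid + pos_default;
    n_neg  = n_total - n_pos;
    ev_pos = max((pos_paid * pay_ok + pos_default * pay_default) / n_pos, pay_deny);
    ev_neg = max(((n_paid - pos_paid) * pay_ok
                 + (n_default - pos_default) * pay_default) / n_neg, pay_deny);
    ev_invest = (n_pos * ev_pos + n_neg * ev_neg) / n_total - fee;

    put ev_approve= 10.2 pay_deny= 10.2 ev_invest= 10.2;
run;
```

Under these assumptions the expected values are about 2,725.71 for approving outright, 2,400.00 for denying, and 3,332.29 for ordering the investigation and then approving only on a positive recommendation.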

The following code invokes the DTREE procedure to solve this decision problem.

title 'Loan Grant Decision';

proc dtree stagein=Stage6 probin=Prob6 payoffs=Payoff6
           summary target=investigation nowarning;
   /* charge the cost of the credit investigation */
   modify 'Order investigation' reward -500;
   evaluate;
   OPTIONS LINESIZE=85;
   summary / target=Application;
   OPTIONS LINESIZE=80;
quit;

The TITLE statement defines the title of the problem. Here, the title is Loan Grant Decision.

The STAGEIN= dataset gives the structure of the decision problem.

The PROBIN= dataset gives the probability distributions for the random events at the chance nodes.

The PAYOFFS= data set gives the payoffs for the various scenarios.

When you run this code, the output is generated, and it is shown on the screen.

The loan officer should order the credit investigation and approve the loan application if the investigator
gives the applicant a positive recommendation.

Regression

Let’s now learn the last concept of this lesson: regression.

Regression is used to formulate a functional relationship between a set of independent or explanatory
variables and a dependent or response variable. The independent variables are represented as “X.” The
dependent variable is represented as “Y.”

Mathematically, regression is denoted as shown on the screen.

Y= f (X1, X2, X3,…,Xn)

There are two types of dependent variables in regression: continuous and binary.

Variables that take numeric values on a continuous scale are called continuous variables. Examples: sales,
profit, and quantity.

Variables that take only two values, coded as 1 or 0, are called binary variables. Examples: yes or no,
true or false, and buy or not buy.

Based on the type of dependent variable, regression is classified into two types: linear regression and
logistic regression.

Linear Regression

Linear regression is an approach to model the relationship between a continuous dependent variable and one
or more explanatory or independent variables.

Remember that independent variables can be continuous or discrete types.
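
In linear regression the function f takes a linear form. As a sketch in standard notation (the coefficients \beta and the error term \varepsilon are introduced here and do not appear in the course slides):

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon
```

Here \beta_0 is the intercept, each \beta_i is the expected change in Y for a one-unit change in X_i with the other variables held fixed, and \varepsilon is the part of Y that the model does not explain.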

Let’s step into the syntax classroom to learn the syntax of linear regression.

PROC REG <options>;

MODEL dependent-variable = independent-variables;

VAR variables;

FREQ variable;

WEIGHT variable;

ID variable;

OUTPUT OUT=SASdataset keyword=names...;

PLOT yvariable*xvariable = symbol ...;

RESTRICT linear_equation,...;

TEST linear_equation,...;

MTEST linear_equation,...;

BY variables;

The MODEL statement specifies the dependent and independent variables in the regression model.

The OUTPUT statement requests an output dataset and names the variables to contain predicted values,
residuals, and other output values.

The ID statement names a variable to identify observations in the printout.

The WEIGHT and FREQ statements declare variables that are used to weight observations.

The BY statement specifies variables to define subgroups for the analysis. The analysis is repeated for each
value of the BY variable.

The RESTRICT statement applies restrictions on the parameter estimates.

The TEST statement tests hypotheses about the estimated parameters.

The MTEST statement can validate hypotheses involving several dependent variables (multivariate regression
models).

Remember that the PROC REG statement is always accompanied by one or more MODEL statements to
specify regression models. One OUTPUT statement may follow each MODEL statement.

There are two types of linear regression based on the number of independent variables: simple
linear regression and multiple linear regression.

In simple linear regression, a single independent variable is used to predict the value of a dependent variable.

In multiple linear regression, two or more independent variables are used to predict the value of a dependent
variable.

Let’s understand these types of linear regression with the help of an example.

Simple Linear Regression — Example

Look at the program shown on the screen.

The objective of this program is to check how sales vary with profit in the Electronic dataset.

Here, Sales is the dependent variable and Profit is the independent variable. This is an
example of simple linear regression because there is only one independent variable.
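
The on-screen program is not reproduced in this transcript. A minimal sketch of such a model is shown below, using a made-up dataset `electronic_reg_demo`; the values are illustrative and will not reproduce the figures quoted from the course output.

```sas
/* Hypothetical miniature of the Electronic dataset */
data electronic_reg_demo;
    input Sales Profit;
    datalines;
1200 240
650 90
300 45
1150 260
700 80
280 50
900 150
1300 300
;
run;

/* Simple linear regression: one dependent and one independent variable */
proc reg data=electronic_reg_demo;
    model Sales = Profit;
run;
quit;
```

PROC REG is interactive, so the QUIT statement is what ends the procedure after the model has been fitted.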

When you run this code, the output is generated, and it is shown on the screen.

From the output, you can infer that the p-value for Profit is less than 0.05, and therefore Profit is
statistically significant at the 5 percent level (95 percent confidence).

Also, note that the R-square value is 79.7 percent, which tells you that profit explains about 79.7 percent
of the variation in sales.

Multiple Linear Regression — Example

Let’s now look at the program shown on the screen.

The objective of this program is to check how sales vary with profit and quantity in the
Electronic dataset.

Here, Sales is the dependent variable and Profit and Quantity are the independent
variables. This is an example of multiple linear regression because there is more than one independent variable.
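
A sketch of the multiple regression, again with a made-up dataset (`electronic_mreg_demo`) whose values are illustrative only. The OUTPUT statement is added here to show how predicted values and residuals can be saved; the names `reg_out_demo`, `Predicted`, and `Residual` are assumptions.

```sas
/* Hypothetical miniature of the Electronic dataset */
data electronic_mreg_demo;
    input Sales Profit Quantity;
    datalines;
1200 240 4
650 90 3
300 45 1
1150 260 5
700 80 2
280 50 2
900 150 3
1300 300 4
;
run;

/* Multiple linear regression: two independent variables */
proc reg data=electronic_mreg_demo;
    model Sales = Profit Quantity;
    output out=reg_out_demo p=Predicted r=Residual;  /* save fitted values and residuals */
run;
quit;
```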

When you run this code, the output is generated, and it is shown on the screen.

From the output, we obtain the R-square value, which measures how much of the variation in sales is
explained by quantity and profit. Note that the R-square value is 86 percent.

The absolute t-values for Profit and Quantity are greater than 1.96, which means that both variables are
significant at the 5 percent level (95 percent confidence).

Also, note that the p-values for Profit and Quantity are less than 0.05, and hence both variables are
found to be significant.

Logistic Regression

Let’s look at the second type of regression: logistic regression.

Logistic regression is the regression analysis to conduct when the dependent variable is dichotomous or
binary. Like all regression analyses, logistic regression is a predictive analysis.

Logistic regression is used to describe data and to explain the relationship between one dependent binary
variable and one or more metric independent variables. Metric independent variables are variables that are
measured on an interval or a ratio scale.
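
The model behind this description links the probability of the event to the independent variables through the log-odds. As a sketch in standard notation (the symbols p and \beta are introduced here, not taken from the course slides):

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n,
\qquad p = P(Y = 1 \mid X_1, \dots, X_n)
```

Solving for p gives p = 1 / (1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}), which always lies between 0 and 1 and can therefore be read as a probability.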

Logistic regression is used in areas such as insurance, marketing, sales, operations, health, and
gaming.

Logistic Regression

PROC LOGISTIC <options>;
   BY variables;
   CLASS variable <(options)> <variable <(options)>> </ options>;
   CONTRAST 'label' effect values <, effect values,> </ options>;
   EXACT <'label'> <INTERCEPT> <effects> </ options>;
   FREQ variable;
   <label:> MODEL events/trials = <effects> </ options>;
   OUTPUT <OUT=SAS-data-set> <keyword=name <keyword=name>> </ option>;
   ROC <'label'> <specification> </ options>;
   ROCCONTRAST <'label'> <contrast> </ options>;
   SCORE <options>;
   STRATA effects </ options>;
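
As an illustration of this syntax, here is a minimal sketch of a PROC LOGISTIC step. The dataset customers and the variables purchased (coded 0 or 1), income, age, and region are hypothetical names chosen for this example; they do not come from the course datasets.

```sas
/* Sketch only: customers, purchased, income, age, and region are hypothetical names. */
proc logistic data=customers;
   class region / param=ref;                        /* categorical predictor */
   model purchased(event='1') = income age region;  /* models the probability that purchased = 1 */
run;
```

The EVENT= option tells the procedure which value of the binary dependent variable is treated as the event of interest.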

ANSWERS:

S.No. Question, followed by Answer & Explanation

1 Which of the following clusters places each observation in one cluster?
Answer: a. Disjoint clusters place each observation in one cluster.

2 Which of the following statements is used to display the root-mean-square standard deviation of each cluster?
Answer: c. The RMSSTD statement is used to display the root-mean-square standard deviation of each cluster.

3 Which of the following statements defines the number of iterations?
Answer: b. The statement “Maxiter” defines the number of iterations.

4 The bottom nodes of a decision tree are called ______.
Answer: b. The bottom nodes of a decision tree are called leaves.

5 Which of the following analyses is performed if the dependent variable has binary values?
Answer: b. Logistic regression is the regression analysis that is conducted if the dependent variable is
dichotomous or binary.

Lesson 10 — Working with Time Series Data

Introduction

Hi, and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.

In this lesson, “Working with Time Series Data,” you will understand what time series analysis is and how
to work with time series data.

What’s In It for Me

In this lesson, you will learn how to read SAS date and datetime values.

You will learn the patterns and terminologies of time series.

You will list the various time series models available in SAS.

You will also learn how to plot, transform, transpose, and interpolate time series data in SAS datasets.

Need for Time Series Analysis

Let’s begin this lesson by understanding the need for time series analysis. Consider datasets such as the daily
sales of an e-commerce company, the weekly production of a shoe manufacturing company, the number of
tickets sold by an airline every month, the yearly GDP of a developing country, and so on.

Need for Time Series Analysis (contd.)

Have you noticed that all these datasets include time?

Of course, these datasets include time variables—year and month.

Time-series analysis is used to analyze such types of datasets.

Need for Time Series Analysis (contd.)

Time series analysis works with observations listed in time order. The observations can come from a single
sample or from multiple samples. It is also used to forecast patterns based on historical time-interval data.

Goals of Time Series Analysis

Let’s understand the goals of time series analysis. The main goals of time series analysis are as follows:

• Identifying the patterns in correlated data
• Understanding and modeling the data
• Forecasting short-term trends from previous patterns
• Understanding how a single event changes the time series

Time Series Analysis

Let’s step into the “Syntax Classroom” to learn the syntax of time series.

The tasks that you perform to increase sales through a marketing campaign are called marketing analysis.

Time Series Analysis

Syntax of PROC TIMESERIES

PROC TIMESERIES DATA=<input-data-set>
                OUT=<output-data-set>;
   ID <time-ID-variable> INTERVAL=<frequency>
                         ACCUMULATE=<statistic>;
   VAR <time-series-variables>;
RUN;

The syntax for PROC TIMESERIES is shown on the screen. The TIMESERIES procedure forms time series
from time-stamped transactional input data. It provides results using the Output Delivery System, or
ODS.

Time Series Analysis (contd.)

The ACCUMULATE= option in the ID or VAR statement specifies how the observations within each time
period are accumulated. You can use values such as NONE, TOTAL, AVERAGE, MINIMUM, MAXIMUM, and
MEDIAN.

Time Series Analysis (contd.)

The INTERVAL= option in the ID statement specifies the frequency, or width, of each time interval. You can
use values such as DAY, MONTH, and YEAR.

Time Series Analysis—Examples

Let’s now understand time series analysis using an example.

An e-commerce company wants to analyze the records associated with each of its customers over time.
The dataset keeps track of Order Date, Customer ID, Customer Name, Product Category, Product,
Sales, and Profit.

In this case, you can analyze each record using the Time Series Procedure.

Time Series Analysis—Examples(contd.)

Look at the example program shown on the screen.

The OUT= option specifies where the resulting time series data for each customer is stored. Here,
the resulting time series data is stored in the Ecommerce_Monthly dataset.

The INTERVAL=MONTH option specifies that the transactions are aggregated on a monthly basis.

The ACCUMULATE=TOTAL option specifies that the transactions within each month are summed.

Time Series Analysis—Examples(contd.)

When a BY statement appears in the PROC TIMESERIES step, the procedure expects the input data to be
sorted by the BY variables and then by the ID variable. We can use PROC SORT to order the E_commerce
data by “Customer_Name” and “Order_Date.” Note that “Customer_Name” must appear before
“Order_Date” in the BY statement of the sort procedure.
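
The example program itself is not reproduced in this transcript. The following sketch is consistent with the options described above. The dataset names E_commerce and Ecommerce_Monthly and the variable names Customer_Name, Order_Date, Sales, and Profit are taken from the narration; the exact on-screen code may differ.

```sas
/* Sketch only: names are taken from the narration, the on-screen code may differ. */

/* Sort by the BY variable first, then by the time ID variable. */
proc sort data=E_commerce;
   by Customer_Name Order_Date;
run;

/* Accumulate the transactions into one monthly total per customer. */
proc timeseries data=E_commerce out=Ecommerce_Monthly;
   by Customer_Name;
   id Order_Date interval=month accumulate=total;
   var Sales Profit;
run;
```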

Look at the output shown on the screen.

Time Series Analysis—Examples(contd.)

In this example, each BY group associated with the BY variable “Customer_Name” contains one
observation per month for that customer. Each observation contains the variables “Sales” and
“Profit,” whose values are the monthly totals.

Within each customer, the records are in ascending order of month (Jan → Feb → Mar → ... → Dec 2015).

Time Series Analysis—Options

There are various options available in Time Series Analysis in SAS. Some of the options used in the time
series analysis are as follows:

• CROSSPLOTS=option
• MAXERROR=number
• PLOTS=option
• PRINT=option
• SORTNAMES

Time Series Analysis—Options(contd.)

CROSSPLOTS=option specifies the desired cross-variable graphical output. The CROSSPLOTS= option
produces results similar to the datasets listed in parentheses next to the preceding options.

By default, the TIMESERIES procedure produces no graphical output. You can use plotting options such
as SERIES and CCF to plot the output graph.

Time Series Analysis—Options(contd.)

MAXERROR=number limits the number of warning and error messages produced during the execution of
the procedure to the specified value. The default is MAXERROR=50. This option is particularly useful in
BY-group processing, where it can be used to suppress recurring messages.

Time Series Analysis—Options(contd.)

PLOTS=option specifies the desired univariate graphical output. By default, the TIMESERIES
procedure produces no graphical output. You can use plotting options such as SERIES, RESIDUAL, CYCLES,
and HISTOGRAM to produce graphical output.

Time Series Analysis—Options(contd.)

PRINT=option specifies the desired printed output. By default, the TIMESERIES procedure produces no
printed output. You can use printing options such as DECOMP, SEASONS, TRENDS, DESCSTATS, and
SUMMARY to produce printed output.

Time Series Analysis—Options(contd.)

SORTNAMES specifies that the variables specified in the VAR and CROSSVAR statements be processed in
sorted order by the variable names. This option allows the output data sets to be pre-sorted by the
variable names.
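
As a rough illustration, the options above are placed on the PROC TIMESERIES statement itself. The sketch below reuses the E_commerce example from earlier in this lesson; the particular choice of option values is an assumption made for illustration only.

```sas
/* Sketch only: the option values are chosen for illustration. */
ods graphics on;   /* needed for PLOTS= output */

proc timeseries data=E_commerce out=Ecommerce_Monthly
                print=(descstats seasons)   /* printed output */
                plots=(series)              /* univariate graphical output */
                maxerror=10;                /* cap on warning and error messages */
   by Customer_Name;
   id Order_Date interval=month accumulate=total;
   var Sales;
run;
```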

Reading Date and Datetime Values

SAS provides a selection of informats for reading SAS date and datetime values. A SAS informat is an
instruction that converts character-string values into the numerical values of a SAS variable.

To see today’s date in the SAS log, type the command shown on the screen.

%put today is: %sysfunc(today());

As written, this command prints the SAS date value, which is the number of days since January 1, 1960.
To print a readable date, pass a format as the second argument, for example %sysfunc(today(), date9.).

Look at the example shown on the screen.

Reading Date and Datetime Values (contd.)

The ANYDTDTE informat is used to convert text strings into SAS date values. Look at the output shown
on the screen. The dates are displayed in the same format even though they were written in various formats.

SAS also provides formats to convert the representation of date and datetime values used by SAS. A SAS
format is an instruction that converts the internal numerical value to a character string that can be
printed or displayed.

Let’s consider the same example chosen for the informat. Look at the output shown on the screen. The
dates are displayed in the desired format.
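
The on-screen example is not reproduced in this transcript. The sketch below shows one way the informat and format could be combined. The dataset name order_dates, the variable name order_date, and the three sample values are made up for illustration.

```sas
/* Sketch only: dataset name, variable name, and sample values are hypothetical. */
data order_dates;
   /* The ANYDTDTE informat reads dates written in different styles. */
   input order_date :anydtdte20.;
   /* The DATE9. format controls how the stored date value is displayed. */
   format order_date date9.;
   datalines;
15JAN2015
2015-01-15
01/15/2015
;
run;

proc print data=order_dates;
run;
```

How an ambiguous value such as 01/02/2015 is read depends on the DATESTYLE= system option, so it is safer to test the informat on a sample of your own data.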

Time Series Patterns

We hope you have understood what time series analysis is, its goals, and its available options. Let’s now look
at time series patterns. There are four types of time series patterns. They are as follows:

• Trend
• Seasonality
• Cyclic
• Random

A trend pattern exists when there is a long-term increase or decrease in the data. It does not have to be
linear. Sometimes, a trend is described as “changing direction” when it changes from an increasing
trend to a decreasing trend or vice versa. An example is the rising and falling trend pattern of the stock
market.

Time Series Patterns (contd.)

A seasonality pattern is a repeating pattern with a fixed period. A seasonal pattern exists when a series is
influenced by seasonal factors, for example, the quarter of the year, the month, or the day of the week.
Note that seasonality is always of a fixed and known period.

Time Series Patterns (contd.)

A cyclic pattern exists when the data exhibits rises and falls that are not of a fixed period. The duration of a
cycle depends on the type of business or industry being analyzed, but it is usually at least two years. Overall,
the average length of a cycle is longer than the length of a seasonal pattern. The business cycle, an
economy’s recurring pattern of growth, recession, and recovery, is an example.

Time Series Patterns (contd.)

A random pattern exists when the data shows none of the three patterns: trend, seasonality, or cyclic
behavior.

For example, the daily change in the S&P 500 index has no trend, seasonality, or cyclic behavior.

Knowledge Check

Now let's do a Knowledge check of what you have learned so far.

S.No. Question, followed by Answer & Explanation

1 Which of the following patterns is always of a fixed and known period?
Answer: c. Seasonality is always of a fixed and known period.

White Noise Process

Based on the correlation between its values at different times, the data can be of two types. The data
can be uncorrelated with zero mean and constant variance or correlated with constant mean and
variance.

A series is called white noise if the data is completely random in nature.

White Noise Process(contd.)

A white noise process has a zero mean, a constant variance, and no correlation between its values at
different times. Plots of white noise series exhibit erratic, jumpy, and unpredictable behaviour.

Since values are uncorrelated, previous values do not help us forecast future values.
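
The three properties of a white noise series can also be written compactly. In the sketch below, the series is written as e_t and the constant variance as sigma squared; both symbols are notation assumed here, not taken from the slides.

```latex
\begin{aligned}
E(e_t) &= 0 && \text{(zero mean)} \\
\operatorname{Var}(e_t) &= \sigma^2 && \text{(constant variance)} \\
\operatorname{Cov}(e_t, e_s) &= 0 \quad \text{for } t \neq s && \text{(no correlation between values at different times)}
\end{aligned}
```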

White Noise Process (contd.)

A scatter plot of such a series across time shows no pattern, so forecasting future values from past values
is not possible.

Therefore, if the data behaves like white noise, avoid performing time series analysis; if it does not, time
series analysis is appropriate. For example, the stock price of TATA Motors may vary from day to day in an
uncorrelated way, so forecasting its future values from its past values is not possible. In this case, use the
average of the data as the forecast for the next day. Similarly, the daily change in the S&P 500 index has no
trend, seasonality, or cyclic behaviour.

Time Series Model

There are various time series models available in SAS. They are as follows:

• Auto Regressive Model, or AR model
• Moving Average Model, or MA model
• Autoregressive and Moving Average, or ARMA model
• Autoregressive Integrated Moving Average, or ARIMA model

Time Series Model (contd.)

Auto Regressive Model:

The Auto Regressive, or AR, model is used to forecast a time series using its past values $Y_{t-1}$, $Y_{t-2}$,
$Y_{t-3}$, and so on. The equation for the auto regressive model is shown on the screen.

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + E_t$$

Here, $Y_t$ is a function of past values of the same variable, $E_t$ is the error term, $c$ is a constant, and
$\phi_1$ to $\phi_p$ are the parameters.

Time Series Model (contd.)

Moving Average Model:

The Moving Average, or MA, model is used to forecast a time series if $Y_t$ depends only on the random
error terms.

The equation for the moving average model is shown on the screen.

$$Y_t = E_t + \phi_1 E_{t-1} + \phi_2 E_{t-2} + \cdots + \phi_q E_{t-q}$$

Here, $Y_t$ is a function of the current and past error terms, $E_t$ is the error term, and $\phi_1$ to $\phi_q$
are the parameters.

The error terms here are assumed to be white noise processes with a zero mean and constant variance.

Time Series Model (contd.)

Autoregressive and Moving Average model or ARMA:

The Autoregressive and Moving Average, or ARMA, model is used to forecast a time series using both its
past values and the error terms.

It is referred to as ARMA(p, q), where p is the number of autoregressive terms and q is the number of
moving average terms.

The equation for the Autoregressive and Moving Average model is shown on the screen.

$$Y_t = B_0 + B_1 Y_{t-1} + B_2 Y_{t-2} + \cdots + B_p Y_{t-p} + E_t + \phi_1 E_{t-1} + \phi_2 E_{t-2} + \cdots + \phi_q E_{t-q}$$

Time Series Model (contd.)

Autoregressive Integrated Moving Average Model or ARIMA:

The Autoregressive Integrated Moving Average, or ARIMA, model predicts a value in a response time
series as a linear combination of its own past values, past errors, and current and past values of other
time series. The order of an ARIMA model is usually denoted by the notation shown on the screen.

ARIMA(p, d, q), where

p is the order of the autoregressive part
d is the order of the differencing
q is the order of the moving-average process

Time Series Model (contd.)

If no differencing is done (d = 0), the models are usually referred to as ARMA(p, q) models. The equation
for the Autoregressive Integrated Moving Average model with d = 0, that is, ARIMA(p, 0, q), is shown on
the screen.

$$Y_t = B_0 + B_1 Y_{t-1} + B_2 Y_{t-2} + \cdots + B_p Y_{t-p} + E_t + \phi_1 E_{t-1} + \phi_2 E_{t-2} + \cdots + \phi_q E_{t-q}$$

Stationarity of a Time Series

A series is said to be strictly stationary if the distribution of y at time t is the same as at any other
point of time. This implies that the mean, variance, and covariance of the series are time invariant.

A series is said to be weakly stationary, or covariance stationary, if its mean, variance, and covariance
are constant over time.

Stationarity of a Time Series (contd.)

a) The mean is constant:

$$E(Y_1) = E(Y_2) = E(Y_3) = \cdots = E(Y_t) = \mu \quad \text{(a constant)}$$

b) The variance is constant:

$$\mathrm{Var}(Y_1) = \mathrm{Var}(Y_2) = \mathrm{Var}(Y_3) = \cdots = \mathrm{Var}(Y_t) = \sigma^2 \quad \text{(a constant)}$$

c) The covariance is constant for a given lag k:

$$\mathrm{Cov}(Y_1, Y_{1+k}) = \cdots = \mathrm{Cov}(Y_5, Y_{5+k}) = \gamma_k$$

Stages of ARIMA Modelling

The estimation and forecasting of univariate time series is carried out using ARIMA models, which are
often referred to as Box-Jenkins models. Remember that this model is applicable only if the variable is
stationary.

There are three stages in ARIMA modelling. They are as follows:

• Identification stage
• Estimation and diagnostic checking stage
• Forecasting stage

Let’s learn about each stage in detail.

Identification Stage

The following are the two considerations when forecasting a time series using ARIMA modeling:

• Ensure the variables are stationary.
• Ensure the variables do not belong to the white noise category.

You can difference a non-stationary variable to make it stationary.

Identification Stage (contd.)

In the identification stage, perform the following tasks:

• Specify the response series and identify candidate ARIMA models for it.
• Perform a stationarity test to determine if differencing is necessary.

Use the IDENTIFY statement to specify the response series and identify candidate ARIMA models for it.

The IDENTIFY statement

• Reads the time series that are to be used in later statements,
• Differences the time series if requested, and
• Computes autocorrelations, inverse autocorrelations, partial autocorrelations, and cross-correlations.

The analysis of the IDENTIFY statement output usually suggests one or more ARIMA models that could
be fit.

Estimation and Diagnostic Checking Stage

In the estimation and diagnostic checking stage, perform the following tasks:

• Specify the ARIMA model to fit to the specified variable and estimate its parameters.
• Judge the adequacy of the model.
• Review the significance tests, goodness-of-fit statistics, and tests for white noise residuals.

Significance tests for the parameters are used to identify unnecessary terms in the model.

Goodness-of-fit statistics aid in comparing this model with others.

Tests for white noise residuals indicate whether the residual series contains additional information that
might be used by a more complex model.

Estimation and Diagnostic Checking Stage (contd.)

Use the ESTIMATE statement to specify the ARIMA model to fit to the variable specified in the previous
IDENTIFY statement and to estimate the parameters of that model.

The ESTIMATE statement also produces diagnostic statistics to help you judge the adequacy of the
model.

Forecasting Stage

In the forecasting stage, you use the FORECAST statement to forecast future values of the time series
and to generate confidence intervals for these forecasts from the ARIMA model produced by the
preceding ESTIMATE statement.

Now let's do a Knowledge check of what you have learned so far.

S.No. Question, followed by Answer & Explanation

1 Which of the following statements helps you judge the adequacy of a model?
Answer: c. The ESTIMATE statement produces diagnostic statistics to help you judge the adequacy of a
model.

Stages of ARIMA modeling -Example

Consider the electronic dataset as an example and let’s forecast the Sales variable using the ARIMA
model.

Demo

Proc Arima Data=Electronic;
   Identify Var=Sales nlag=24;
Run;

The PROC ARIMA statement invokes the ARIMA procedure, which analyzes and forecasts time series using
ARIMA models.

The IDENTIFY statement produces descriptive statistics, a time series plot of the series, the sample
autocorrelation function (ACF) plot, the inverse autocorrelation function (IACF) plot, the partial
autocorrelation function (PACF) plot, and a check for white noise. You use this output to judge whether
the series is stationary and whether it is white noise.

These autocorrelation function plots show the degree of correlation with past values of the series as a
function of the lag, that is, the number of periods in the past at which the correlation is computed.

The NLAG= option controls the number of lags for which the autocorrelations are shown. By default, the
autocorrelation functions are plotted to lag 24.

Let’s run this program.

In this example, the white noise hypothesis is rejected strongly, which indicates that the values of the
series are autocorrelated. Also, the series is non-stationary, as the autocorrelations do not die out quickly
as the lag increases.

Since the series is non-stationary, let’s perform differencing to make the series stationary.

Let’s write the code to make the series stationary.

Identify Var=Sales(1);

To difference the SALES series, use another IDENTIFY statement and specify the first difference of
SALES as the series to analyze.

Instead of modeling the SALES series itself, we can model the change in SALES from one period to the
next period.

Well, let’s run this code now.

Notice that this statement evaluates the change in sales between periods, whereas the earlier
Identify Var=Sales statement evaluated the total sales amount.

Let’s now perform the estimation and diagnostic checking stage of the ARIMA model.

We can use the ESTIMATE statement to specify the ARIMA model to fit to the variable specified in the
previous IDENTIFY statement and to estimate the parameters of that model.

Here, let’s use AR(1) to predict the change in sales. The P= option specifies the order of the
autoregressive part. Here, the value is P=1 (first order).

Note that there are other candidate models, such as MA(1) and ARMA(1,1), that could be fit to the
series.

Estimate p=1;

Let’s run this code now.

The p-value for the autoregressive parameter is 0.0024 (less than 0.05), so this term is highly significant.
On the other hand, the p-value for MU indicates that the mean term adds very little to the model.

The test statistics for the residuals series indicate whether the residuals are uncorrelated (white noise)
or contain additional information that might be used by a more complex model. In this case, the test
statistics reject the no-autocorrelation hypothesis at a high level of significance (p = 0.0029 for the first
six lags). This means that the residuals are not white noise, and so the AR(1) model is not a fully
adequate model for this series.

Let’s now perform the forecasting Stage.

To produce the forecast output, use the FORECAST statement after the ESTIMATE statement for the
model you decide is best.

Note that if the last model fit is not the best, then repeat the ESTIMATE statement for the best model
before you use the FORECAST statement.

Let’s use the LEAD= option to specify how many periods ahead to forecast. In this example program, the
sales series is forecast for one year ahead from the most recently available SALES figure. So, let’s use
lead=12.

Let’s use the INTERVAL= option to indicate the interval of the data. In this example, the data is monthly,
so let’s use interval=month.

The ID= option specifies the ID variable, which is typically a SAS date, time, or datetime variable. In this
example, let’s use id=date.

The OUT= option writes the forecasts to the output dataset. In this example, let’s store the forecasted
data in the “results” dataset.

forecast lead=12 interval=month id=date out=results;

run;

Let’s run this code now.

We have obtained the time series forecasts for all the months of the next year.

The model for this example is written as ARIMA(1,1,1) when the IDENTIFY statement specifies d = 1 and
the final ESTIMATE statement specifies p = 1 and q = 1. If the final ESTIMATE statement is the
Estimate p=1 statement shown earlier, the model is ARIMA(1,1,0).
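
For reference, the statements discussed in this demo can be assembled into a single PROC ARIMA step, as sketched below. The dataset Electronic and the variables Sales and date come from the narration. The second ESTIMATE statement (p=1 q=1) is an assumption based on the ARIMA(1,1,1) notation mentioned above; the transcript does not show it being run.

```sas
/* Sketch only: the second ESTIMATE statement is assumed, not shown in the demo. */
proc arima data=Electronic;
   /* Identification stage: inspect the raw series, then its first difference. */
   identify var=Sales nlag=24;
   identify var=Sales(1);

   /* Estimation and diagnostic checking stage. */
   estimate p=1;        /* AR(1) on the differenced series */
   estimate p=1 q=1;    /* ARMA(1,1) candidate, giving ARIMA(1,1,1) overall */

   /* Forecasting stage: uses the most recently estimated model. */
   forecast lead=12 interval=month id=date out=results;
run;
quit;
```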

Plot, Transform, Transpose, and Interpolate

So far you have learned the various time series models. Let’s now learn how to plot, transform,
transpose, and interpolate time series data in SAS datasets.

Plot Time Series

To plot the time series, use the procedures shown on the screen.

Procedure        Description
PROC GPLOT       Produces high-resolution color graphics plots
PROC PLOT        Produces low-resolution line-printer plots
PROC TIMEPLOT    Plots time series data vertically on the page instead of
                 horizontally across the page
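
As a rough sketch, a time series plot with PROC GPLOT could look like the following. It reuses the Electronic dataset with the variables Sales and date from the ARIMA demo; the SYMBOL settings and the MONYY7. format are choices made for this illustration. PROC GPLOT requires SAS/GRAPH.

```sas
/* Sketch only: symbol settings and the date format are illustrative choices. */
symbol1 interpol=join value=dot;   /* connect the points with a line */

proc gplot data=Electronic;
   plot Sales*date;                /* vertical axis * horizontal axis */
   format date monyy7.;
run;
quit;
```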

Plot, Transform, Transpose, and Interpolate (contd.)

Transform Time Series:

It is often useful to transform time series for analysis or forecasting.

Transformations are used to restrict the range of a series, to make a nonlinear trend linear, and to
stabilize the variance.
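
A common transformation is the natural logarithm, which can be applied in a DATA step. The sketch below assumes the Electronic dataset and its Sales variable from the earlier demo; the output dataset name electronic_log and the variable name log_sales are hypothetical.

```sas
/* Sketch only: electronic_log and log_sales are hypothetical names. */
data electronic_log;
   set Electronic;
   /* The log is defined only for positive values. */
   if Sales > 0 then log_sales = log(Sales);
run;
```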

Plot, Transform, Transpose, and Interpolate (contd.)

Transpose Time Series:

The TRANSPOSE procedure is used to transpose datasets from one form to another.

The TRANSPOSE procedure can transpose variables and observations within BY groups.
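
As an illustration, the sketch below turns the monthly observations of each customer into one row per customer with one column per month. It assumes the Ecommerce_Monthly dataset created in the PROC TIMESERIES example, which is already sorted by Customer_Name; the output dataset name sales_by_month and the prefix m_ are hypothetical.

```sas
/* Sketch only: sales_by_month and the m_ prefix are hypothetical names. */
proc transpose data=Ecommerce_Monthly out=sales_by_month prefix=m_;
   by Customer_Name;   /* one output row per customer */
   id Order_Date;      /* values of Order_Date name the new columns */
   var Sales;          /* variable whose values fill the new columns */
run;
```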

Plot, Transform, Transpose, and Interpolate (contd.)

Interpolate Time Series:

The EXPAND procedure interpolates a time series. By default, the EXPAND procedure performs
interpolation by first fitting cubic spline curves to the available data and then computing needed
interpolating values from the fitted spline curves.
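
A minimal sketch of such an interpolation is shown below. It converts a quarterly series to a monthly one using the default cubic spline method. The dataset names quarterly_sales and monthly_sales and the variables date and sales are hypothetical; PROC EXPAND requires SAS/ETS.

```sas
/* Sketch only: dataset and variable names are hypothetical. */
proc expand data=quarterly_sales out=monthly_sales
            from=qtr to=month;
   id date;
   /* OBSERVED=TOTAL treats each value as the total for its period. */
   convert sales / observed=total;
run;
```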

Assignment

Let’s practice what you have learned so far in this lesson. Read the questions carefully and then answer
them. Techniques and steps are provided to assist you under the guide section.

Assignment(contd.)

A pharmaceutical company wants to forecast daily Sales based on its Sales Dataset. The dataset keeps a
track of Order_ID, Product, Product_Category, Sales, Profit, and Order_Priority. As a SAS programmer,
write the code for this requirement.

Assignment(contd.)

Follow these steps to solve the problem:

• Import the dataset.
• Perform the white noise residual test and the stationarity test.
• Judge the adequacy of the model and perform significance tests.
• Forecast the time series ahead.

Assignment(contd.)

We recommend that you first solve the project and then view the solution to assess your learning.

You can perform this project in the installed SAS University Edition.

Key Takeaways

Let’s now quickly recap the key concepts of this lesson:

• The TIMESERIES procedure forms time series from the input time-stamped transactional
data.
• A SAS informat is an instruction that converts the character-string values into the numerical
value of a SAS variable.
• A SAS format is an instruction that converts the internal numerical value to a character string
that can be printed or displayed.
• There are three stages in ARIMA modelling: the Identification stage, the Estimation and
diagnostic checking stage, and the Forecasting stage.
• The TRANSPOSE procedure can transpose variables and observations within BY groups.

Conclusion

This concludes “Working with Time Series Data.” The next lesson is “Data Optimization Using SAS.”

ANSWERS:

S.No. Question, followed by Answer & Explanation

1 Which of the following procedures forms time series from the input time-stamped transactional data?
Answer: a. The TIMESERIES procedure forms a time series from the input time-stamped transactional
data.

2 Which of the following statements specifies the desired UNIVARIATE graphical output?
Answer: b. PLOTS= option specifies the desired UNIVARIATE graphical output.

3 A white noise process has a _____.
Answer: c. A white noise process is one with a zero mean, a constant variance, and no correlation
between its values at different times.

4 Which of the following models predicts a value in a response time series as a linear combination of its
own past values, past errors, and current and past values of other time series?
Answer: d. An Autoregressive Integrated Moving Average, or ARIMA, model predicts a value in a response
time series as a linear combination of its own past values, past errors, and current and past values of
other time series.

Lesson 11 — Designing Optimization Models

Introduction

Hi, and welcome back to the “Data Science with Statistical Analysis System or SAS” course offered by
Simplilearn.

In this lesson, “Designing Optimization Models,” you will learn how to solve the various types of
optimization problems.

What’s In It for Me

In this lesson you will understand the need for optimization in industries. You will learn the problems
involved in optimization.

In addition, you will learn how to perform optimization using Statistical Analysis System.

Need for Optimization

Let’s start this lesson by defining what Optimization is.

Optimization is a mathematical technique for finding the maximum or minimum value of a function
subject to constraints. Optimization techniques are important in many industries today, and they form a
major part of the area of Operational Research.

Optimization helps a company cut down its operational costs and maximize its profit.

Need for Optimization (contd.)

Let’s understand this with an example.

A company is organizing a bus trip to Vegas for 400 of its employees. The admin team has contacted an
agency that has 10 large buses and 8 small buses, with seating capacities of up to 50 and 40 people,
respectively. However, only 9 drivers are available in a shift. The rental cost is $800 for a large bus and
$600 for a small bus. The admin team has to calculate how many buses of each type to charter at the
least possible cost.

This kind of linear problem can be solved using the optimization techniques of SAS.

Finding the minimum transport cost that satisfies all the constraints is an optimization problem.
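
One way to state the bus problem in SAS is with PROC OPTMODEL, which is part of SAS/OR. The sketch below is an illustration only: the choice of PROC OPTMODEL and the names Large, Small, Cost, Seats, and Drivers are assumptions made here, and the course may solve the problem with a different procedure.

```sas
/* Sketch only: procedure choice and all names are assumptions for illustration. */
proc optmodel;
   /* Decision variables: number of buses of each type to charter. */
   var Large >= 0 <= 10 integer;   /* at most 10 buses with 50 seats */
   var Small >= 0 <= 8 integer;    /* at most 8 buses with 40 seats */

   /* Objective: minimize the total rental cost. */
   min Cost = 800*Large + 600*Small;

   /* Constraints: seat all 400 employees with at most 9 drivers. */
   con Seats: 50*Large + 40*Small >= 400;
   con Drivers: Large + Small <= 9;

   solve;
   print Large Small Cost;
quit;
```

Because the number of buses must be a whole number, the variables are declared as INTEGER, which makes this a mixed integer linear program rather than a plain linear program.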

Optimization Programming

Before we deal with the optimization problems, let’s understand the various types of optimization
programming.

The various types of optimization programming are linear programming, mixed integer linear
programming, quadratic programming, and nonlinear programming.

Optimization Programming(contd.)

Linear programming is a technique to maximize or minimize a function of several variables, such as cost,
time, and production, subject to the constraints of the problem. Use linear programming when the
variables are real numbers and both the objective and the constraints are linear functions of those variables.
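
For reference, a linear program is commonly written in the following general form. The symbols (n variables x_j, cost coefficients c_j, constraint coefficients a_ij, and limits b_i) are generic notation introduced here, not taken from the course slides.

```latex
\begin{aligned}
\text{minimize (or maximize)} \quad & c_1 x_1 + c_2 x_2 + \dots + c_n x_n \\
\text{subject to} \quad & a_{i1} x_1 + a_{i2} x_2 + \dots + a_{in} x_n \le b_i, \quad i = 1, \dots, m \\
& x_j \ge 0, \quad j = 1, \dots, n
\end{aligned}
```

Every term is a constant multiplied by a single variable, which is what makes the problem linear.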

Optimization Programming(contd.)

Mixed integer linear programming is used when some or all of the decision variables are constrained to
take integer values at the optimal solution. The integer variables may be binary (0 or 1) or general whole
numbers. The use of integer variables greatly expands the scope of useful optimization problems that you
can define and solve.

Optimization Programming(contd.)

Quadratic programming is used to solve optimization problems in which the objective is a quadratic
function of the variables, subject to linear constraints. The standard form of a quadratic equation is
shown on the screen.

$ax^2 + bx + c = 0$
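
For several variables, a quadratic program is commonly written in matrix form as sketched below. The symbols Q, c, A, and b are generic notation introduced here, not taken from the course slides.

```latex
\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\, x^{T} Q x + c^{T} x \\
\text{subject to} \quad & A x \le b
\end{aligned}
```

With a single variable, the objective reduces to a quadratic expression of the same shape as the left-hand side of the equation above.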

Optimization Programming(contd.)

Nonlinear programming is used if the objective function or any of the constraints is a nonlinear function.
A function is nonlinear when it cannot be written as a sum of constant multiples of the variables, for
example when it contains powers or products of the variables.

Optimization Problems

An optimization problem consists of minimizing or maximizing an objective function subject to
constraints imposed on the variables of that function.

The objective functions and constraints can be linear or nonlinear.

There are various types of constraints such as bound constraints, equality constraints, inequality
constraints, or integer constraints.

Optimization Problems(contd.)

The mathematical form of an optimization problem is called a mathematical program. When this
mathematical program is passed to the relevant algorithm, the algorithm determines the values of the
decision variables that maximize or minimize the objective while staying within the defined limits.

Optimization Problems(contd.)

So, optimization can be defined as the process of finding the values of the decision variables that give
the best value of the objective within the defined limits.

If the constraints of an optimization problem are linear and the objective is either linear or quadratic,
the problem can be solved using a dedicated SAS optimization procedure.

Optimization Problems(contd.)

Optimization problems are classified into four types based on the functional form of their
objectives and constraints. They are:

• Linear optimization problem

• Mixed integer linear optimization problem

• Quadratic optimization problem

• Nonlinear optimization problem

Let’s learn the various procedures used to solve these types of optimization problems.


Optimization Problems(contd.)

Linear Optimization Problem:

PROC OPTLP is used to solve linear optimization problems. It reads the problem in mathematical
programming system format, or MPS format. This format is used to describe linear programming and
integer programming problems.

MPS files are mostly plain text and follow specific conventions for the order in which the parts of the
problem are specified.

Optimization Problems(contd.)

Mixed Integer Linear Optimization Problem:

PROC OPTMILP is used to solve mixed integer linear problems. These are linear problems in which
some or all of the decision variables are constrained to be integers.

It requires the mixed integer linear program to be specified in a SAS dataset that follows the MPS format.

Optimization Problems(contd.)

Quadratic Optimization Problem:

PROC OPTQP is used to solve quadratic optimization problems, that is, problems with a
quadratic objective function and a collection of linear constraints.

The input problem needs to be specified in quadratic programming system, or QPS, format.

Optimization Problems(contd.)

Nonlinear Optimization Problem:

PROC OPTMODEL provides an optimization modeling language, and it is used to model nonlinear
optimization programs.

A nonlinear optimization problem is one in which the objective function, or at least one of the equality
or inequality constraints, is nonlinear.

PROC OPTMODEL

PROC OPTMODEL is also used to model linear, mixed integer linear, and quadratic optimization
programs.

PROC OPTMODEL is used mostly for the following reasons:

• You can declare a model, pass it directly to various solvers such as primal simplex, dual simplex,
interior point, and network simplex, and review the solver result.

• You can also save an instance of a linear model in dataset form for use by the OPTLP procedure.
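
The second point can be sketched as follows: the model is declared in PROC OPTMODEL, saved as an MPS-format dataset, and then solved by PROC OPTLP. The variables, constraints, and dataset names (lpmodel, lpsolution) are hypothetical and chosen only for illustration.

```sas
proc optmodel;
   var x >= 0, y >= 0;                 /* hypothetical decision variables */
   con limit1: x + 2*y <= 14;          /* hypothetical linear constraints */
   con limit2: 3*x - y >= 0;
   max profit = 3*x + 4*y;             /* hypothetical linear objective */

   /* write the model to a SAS dataset in MPS format instead of solving it here */
   save mps lpmodel;
quit;

/* solve the saved model with PROC OPTLP and keep the primal solution */
proc optlp data=lpmodel primalout=lpsolution;
run;

/* list the solution values */
proc print data=lpsolution;
run;
```

Saving the model this way lets the same problem instance be reused by other steps without repeating the declarations.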

PROC OPTMODEL(contd.)

A solver is a method or procedure for solving an optimization problem. The solvers used for linear
programming, mixed integer linear programming, quadratic programming, and nonlinear programming
are LP, MILP, QP, and NLP, respectively.

Let’s step into the “Syntax Classroom” to learn the syntax of the PROC OPTMODEL.

PROC OPTMODEL(contd.)

Generally, the PROC OPTMODEL syntax is written as shown on the screen.

The OPTMODEL procedure includes a modeling language and solvers for several classes of
mathematical programming problems.

PROC OPTMODEL(contd.)

The VAR statement is used to declare the variables.

PROC OPTMODEL(contd.)

The CON statement is used to declare the constraints.

PROC OPTMODEL(contd.)

The MIN and MAX statements are used to declare minimization and maximization objectives.

PROC OPTMODEL(contd.)

The SOLVE statement is used to run the solver.

PROC OPTMODEL(contd.)

The print statement is used to show the output.

PROC OPTMODEL(contd.)

Note that a PROC OPTMODEL step ends with the QUIT statement.
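
Putting the statements from the last few slides together, a PROC OPTMODEL step has roughly the following shape. This is a minimal sketch: the variable names, constraint names, and coefficients are hypothetical and are not the ones shown on the course screen.

```sas
proc optmodel;                     /* invoke the procedure */
   var x >= 0, y >= 0;             /* VAR declares the decision variables */
   con c1: x + y <= 10;            /* CON declares a named constraint */
   con c2: x - y >= 2;
   max f = 2*x + 3*y;              /* MAX (or MIN) declares the objective */
   solve;                          /* SOLVE runs a solver chosen by SAS */
   print f x y;                    /* PRINT shows the results */
quit;                              /* QUIT ends the procedure */
```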

PROC OPTMODEL(contd.)

PROC OPTMODEL statements are divided into three categories. They are:

• PROC statement

• Declaration statements

• Programming statements

PROC Statement

The PROC statement invokes the procedure and sets initial option values. The various PROC statement
options are shown on the screen.

CDIGITS=number specifies the expected number of decimal digits of accuracy for nonlinear
constraints.

ERRORLIMIT=number | NONE specifies the maximum number of error messages that can be
displayed.

FD=FORWARD | CENTRAL selects the method used to approximate numeric derivatives when
analytic derivatives are unavailable.

INTFUZZ=number specifies the tolerance for rounding the bounds on integer and binary variables
to integer values.

MAXLABLEN=number specifies the maximum length for MPS row and column labels.

PMATRIX=number adjusts the density evaluation of a two-dimensional array to affect how it is
displayed.

PDIGITS=number requests that the PRINT statement display the specified number of significant digits
for numeric columns for which no format is specified.
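
As an illustration, options are written on the PROC statement itself. The sketch below sets three of the options listed above; the option values and the small nonlinear model are hypothetical.

```sas
/* show 4 significant digits in PRINT output, stop after 10 error messages,
   and use central differences where derivatives must be approximated */
proc optmodel pdigits=4 errorlimit=10 fd=central;
   var x init 1;                          /* hypothetical variable with a starting value */
   min f = (x - 2)**2 + exp(-x);          /* hypothetical nonlinear objective */
   solve;
   print x f;
quit;
```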

The declaration statements declare optimization model components.

Declaration Statements

The various declaration statements are shown on the screen.

CON declares a constraint.

IMPVAR declares implicit variables, which are named optimization expressions.

MAX declares a maximization objective.

MIN declares a minimization objective.

NUMBER declares a number type parameter.

PROBLEM declares a problem.

SET declares a set type parameter.

STRING declares a string type parameter.

VAR declares optimization variables.
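
A short sketch showing most of these declaration statements in one model follows. The products, coefficients, and names (ITEMS, profit, hours, plant, units, and so on) are hypothetical; PROBLEM is left out to keep the sketch small.

```sas
proc optmodel;
   /* parameters: input data that the solver does not change */
   set ITEMS = 1..2;                          /* SET: index set for two products */
   number profit{ITEMS} = [30 50];            /* NUMBER: profit per unit */
   number hours{ITEMS}  = [2 4];              /* NUMBER: machine hours per unit */
   string plant = 'Plant 1';                  /* STRING: a text label */

   /* VAR: one decision variable per product */
   var units{ITEMS} >= 0;

   /* IMPVAR: a named expression built from the variables */
   impvar total_hours = sum{i in ITEMS} hours[i]*units[i];

   /* CON: constraints on the variables */
   con capacity: total_hours <= 40;
   con at_most{i in ITEMS}: units[i] <= 15;

   /* MAX: maximization objective */
   max total_profit = sum{i in ITEMS} profit[i]*units[i];

   solve;
   print plant units total_hours total_profit;
quit;
```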

Programming Statements

The programming statements read and write data, invoke the solver, and print the results.

The various programming statements are shown on the screen.

= assigns a value to a variable or parameter.

CALL invokes a library subroutine.

CLOSEFILE closes the opened file.

COFOR executes the statement repeatedly with support for concurrent solver invocations.

CONTINUE terminates one iteration of a loop statement.

CREATE DATA creates a new SAS dataset.

USE PROBLEM selects the current problem.

SOLVE invokes a PROC OPTMODEL solver.

PRINT outputs string and numeric data.
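
The sketch below uses several of these programming statements together: assignment, a loop with CONTINUE, SOLVE, PRINT, and CREATE DATA. The model and the names (odd_count, solution, xval, yval, profit) are hypothetical.

```sas
proc optmodel;
   /* count the odd numbers from 1 to 5 with a loop */
   number odd_count init 0;
   for {i in 1..5} do;
      if mod(i, 2) = 0 then continue;   /* CONTINUE skips to the next iteration */
      odd_count = odd_count + 1;        /* = assigns a new value */
   end;
   print odd_count;                     /* PRINT outputs the value */

   /* a small hypothetical linear model */
   var x >= 0, y >= 0;
   con limit: x + y <= 10;
   max total = 3*x + 2*y;

   solve with lp;                       /* SOLVE invokes the LP solver */
   print x y total;

   /* CREATE DATA saves the solution values to a SAS dataset */
   create data solution from xval=x yval=y profit=total;
quit;
```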

Optimization- Example 1

Let’s solve some optimization problems using the Statistical Analysis System (SAS). Each example has a
problem statement, analysis, required code, and output.


Optimization- Example 1

Problem Statement

A manufacturer produces two products, X and Y, with two machines, A and B. The cost of producing
each unit and the working plan for X and Y are shown on the screen.

The cost of producing each unit and the working plan for machine A are shown in table 1.

The cost of producing each unit and the working plan for machine B are shown in table 2.

The week starts with a stock of 30 units of X and 90 units of Y, and there is a demand for 75 units of X
and 95 units of Y.

Plan the production so that the week ends with the maximum stock.

Optimization- Example 1

From the given conditions, the constraints are derived, and they are shown on the screen. The question
asks for the maximum stock at the end of the week.

So the objective equation is derived, and it is shown on the screen.

We have now identified the objective function and the constraints.
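
The part of the formulation that depends only on the stock and demand figures can be worked out directly. As a sketch, assume x and y denote the units of X and Y produced during the week, and that “stock” means the combined units of both products left at the end of the week. The machine constraints depend on the tables, which are not reproduced in this transcript.

```latex
\begin{aligned}
&30 + x \ge 75 \;\Rightarrow\; x \ge 45 \\
&90 + y \ge 95 \;\Rightarrow\; y \ge 5 \\
&\text{maximize } (30 + x - 75) + (90 + y - 95) = x + y - 50
\end{aligned}
```

With the values reported in the output for this example, x = 45 and y = 6.25, this objective evaluates to 45 + 6.25 - 50 = 1.25, which agrees with the reported maximum.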

The variables X and Y are real numbers greater than or equal to zero, and the objective and every
constraint are linear in X and Y. So, the problem is a linear optimization problem.

Let’s solve this problem using SAS’s PROC OPTMODEL.

Note that a licensed version of SAS is required to solve the optimization problems.

Optimization- Example 1

Following is the code required to optimize the linear problem using SAS’s PROC OPTMODEL. Use the
procedure “PROC OPTMODEL” to tell SAS to optimize the problem.

First, declare the variables and introduce the logical constraints, if any. Here, the variables X and Y are
set as greater than or equal to zero.

Second, set the constraints of the problem using the keyword “con.” Here, there are four constraints
involved in this problem.

Third, set the objective function of the problem. Here, the objective is to maximize the stock. So, use
the keyword “max.” Note that “f” is the name that holds the maximum value of the function.

Fourth, use the solve keyword to solve the optimization problem. SAS decides the best solver for the
computation. You can also name the relevant solver, such as LP, NLP, MILP, or QP. Here, the
relevant solver is LP, as the problem is a linear optimization problem.

Finally, use the print statement to print the required values. Here, the values of f, X, and Y are printed on
the screen.
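
The five steps can be sketched as below. The demand constraints and the objective follow from the stock and demand figures in the problem statement. The machine-time coefficients are NOT given in this transcript: the values used here are assumptions, chosen only because they reproduce the output reported for this example.

```sas
proc optmodel;
   /* step 1: units of X and Y produced this week */
   var x >= 0, y >= 0;

   /* step 2: four constraints.
      ASSUMED machine data (not in this transcript): minutes per unit
      and minutes available per week on machines A and B */
   con machine_a: 50*x + 24*y <= 2400;
   con machine_b: 30*x + 33*y <= 2100;

   /* demand must be met from opening stock plus production */
   con demand_x: 30 + x >= 75;
   con demand_y: 90 + y >= 95;

   /* step 3: maximize the combined stock left at the end of the week */
   max f = (30 + x - 75) + (90 + y - 95);

   /* step 4: name the LP solver explicitly */
   solve with lp;

   /* step 5: show the objective and the variable values */
   print f x y;
quit;
```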

Optimization- Example 1

The output for this example is shown on the screen.

The solver used in this example is dual simplex. Look at the solution status. The status is “optimal,”
which shows that an optimal solution has been found.

From the output, we infer that the value of X is 45 and the value of Y is 6.25.

The maximum stock at the end of the week is 1.25 units.

Optimization- Example 2

A mathematician has analyzed a problem and derived the following equation. He needs to calculate the
minimum value of that equation. So, instead of solving it manually, he approaches a SAS programmer to
optimize the equation.

Minimize: 4*x1 + 5*x1**2 + 3*x1**2 + 7*x2 + 6*x1*x2

He also has the following constraints on the variables.

Constraints:

x1 - x2 <= 5

x1 + 2*x2 >= 50

x1 >= 0

x2 >= 0

Optimization- Example 2

Analysis:

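This equation contains squared terms and a product of variables, so it is a nonlinear equation. Because
the highest power is two and the constraints are linear, it is more specifically a quadratic optimization
problem.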

Let’s solve this problem using SAS’s PROC OPTMODEL.

Note that SAS’s licensed version is required to solve the optimization problems.

Optimization- Example 2

Code:

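Following is the code required to minimize the quadratic equation using SAS’s PROC OPTMODEL.

proc optmodel;
   /* decision variables with lower bounds of zero */
   var x1 >= 0, x2 >= 0;
   /* linear constraints */
   con con1: x1 - x2 <= 5;
   con con2: x1 + 2 * x2 >= 50;
   /* quadratic objective to be minimized */
   minimize f = 4 * x1 + 5*x1**2 + 3*x1**2 + 7*x2 + 6*x1*x2;
   solve;
   /* show the objective value and the variable values */
   print f x1 x2;
quit;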

Use the procedure “PROC OPTMODEL” to tell SAS to optimize the problem.

First, declare the variables and introduce the logical constraints, if any. Here, the variables x1 and x2 are
set as greater than or equal to zero.

Second, set the constraints of the problem using the keyword “con.” Here, there are four constraints
involved in this problem, but only two of them need the “con” keyword because x1 and x2 are already set
as greater than or equal to zero.

Third, set the objective function of the problem. Here, the objective is to find the minimum
value of the equation. So, use the keyword “minimize.” Note that “f” is the name that holds the minimum
value of the function.

Fourth, use the solve keyword to solve the optimization problem. SAS decides the best solver for the
computation. You can also name the relevant solver, such as LP, NLP, MILP, or QP. Here, the
relevant solver is QP, as the problem has a quadratic objective and linear constraints.

Finally, use the print statement to print the required values. Here, the values of f, x1, and x2 are printed
on the screen.

Optimization- Example 2

The output for this example is shown on the screen.

The solver used in this example is NLPC. Look at the solution status. The status is “optimal,” which
shows that an optimal solution has been found.

From the output, we infer that the value of x1 is 0 and the value of x2 is 25.

The minimum value of the function is 175.
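
As a quick check, the reported solution can be substituted into the objective and into the two constraints as they are written in the code (con1 and con2):

```latex
\begin{aligned}
f(0, 25) &= 4(0) + 5(0)^2 + 3(0)^2 + 7(25) + 6(0)(25) = 175 \\
x_1 - x_2 &= 0 - 25 = -25 \le 5 \\
x_1 + 2 x_2 &= 0 + 2(25) = 50 \ge 50
\end{aligned}
```

Both constraints hold, and the second one is met with equality, so the solution sits on the boundary of the feasible region.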

Assignment

Let’s practice what you have learned so far in this lesson. Read the questions carefully and then answer
them.

Assignment

A farmer wants to adjust the mix of components in a fertilizer for the current crop. He bought plant
food mix A and plant food mix B.

Each cubic yard of food mix A contains 20 pounds of phosphoric acid, 30 pounds of nitrogen, and 5
pounds of potash.

Each cubic yard of food mix B contains 10 pounds of phosphoric acid, 30 pounds of nitrogen, and 10
pounds of potash.

He requires a minimum of 460 pounds of phosphoric acid, 960 pounds of nitrogen, and 220 pounds of
potash.

If food mix A costs $30 per cubic yard and food mix B costs $35 per cubic yard, how many cubic yards of
each food mix should the farmer blend to meet the minimum chemical requirements at the lowest cost?

As a SAS programmer, write the code for the above requirement.

Assignment

Let Y be the number of cubic yards of food mix A and X be the number of cubic yards of food mix B.

The constraints for the equation are derived and shown on the screen. The objective equation is shown
on the screen.

Minimize F = 30Y + 35X
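
The constraints follow from the nutrient content of each mix and the minimum requirements in the problem statement. Written out as a sketch, with Y for food mix A and X for food mix B as defined above:

```latex
\begin{aligned}
\text{minimize} \quad & F = 30Y + 35X \\
\text{subject to} \quad & 20Y + 10X \ge 460 \quad \text{(phosphoric acid)} \\
& 30Y + 30X \ge 960 \quad \text{(nitrogen)} \\
& 5Y + 10X \ge 220 \quad \text{(potash)} \\
& X \ge 0, \; Y \ge 0
\end{aligned}
```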

We recommend that you first solve the problem and then view the solution to assess your learning.

You need a licensed version of SAS to solve this problem.


Assignment

The output for this example is shown on the screen.

The solver used in this example is dual simplex. Look at the solution status. The status is “optimal,”
which shows that an optimal solution has been found.

From the output, we infer that the value of X is 12 and the value of Y is 20.

The minimum cost is $1,020.
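
One way to write the assignment in PROC OPTMODEL is sketched below. The constraint names are illustrative; the coefficients come from the problem statement, with y for food mix A and x for food mix B.

```sas
proc optmodel;
   /* cubic yards of each mix: y is food mix A, x is food mix B */
   var y >= 0, x >= 0;

   /* minimum chemical requirements in pounds */
   con phosphoric: 20*y + 10*x >= 460;
   con nitrogen:   30*y + 30*x >= 960;
   con potash:      5*y + 10*x >= 220;

   /* total cost in dollars */
   min f = 30*y + 35*x;

   solve with lp;
   print f x y;
quit;
```

Substituting the reported solution gives 30(20) + 35(12) = 1020, and the nitrogen and potash requirements are met exactly.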

Key Takeaways

Let’s now quickly recap the concepts you have learned in the lesson:

• Optimization is a mathematical technique for finding the maximum or minimum value of
a function subject to constraints.

• Optimization techniques cut down operational costs and maximize the profit of the
company.

• The various types of optimization programming are linear programming, mixed integer linear
programming, quadratic programming, and nonlinear programming.

• The objective functions and constraints can be linear or nonlinear.

• PROC OPTMODEL is also used to model linear, mixed integer linear, and quadratic
optimization programs.

• A solver is a method or procedure for solving an optimization problem.

Conclusion

This concludes the course “Statistical Analysis System.” Enjoy learning with Simplilearn.

ANSWERS:

S.No.  Question                                      Answer & Explanation

1      A _____ is the mathematical form of an        a.
       optimization problem.                         The mathematical form of an optimization
                                                     problem is called a mathematical program.

2      Which of the following formats is used to     b.
       describe linear programming and integer       MPS format is used to describe linear
       programming problems?                         programming and integer programming
                                                     problems.

3      Which of the following functions declares     a.
       the constraints?                              Con declares a constraint.
