© Copyright 2015, Simplilearn. All rights reserved.
Introduction
Hi! Welcome to the “Data Science with Statistical Analysis System, or SAS,” course offered by
Simplilearn. In this video you’ll see some interesting highlights of this course.
Why SAS
Have you faced challenges during data processing because of the size of the data?
Have you felt the need to combine, separate, compare, and extract data based on a specific
requirement?
Has interpreting data been difficult because you couldn’t manipulate it?
Have you ever wanted to learn the most in-demand Analytics technology?
SAS can help you achieve all this and more. It offers a variety of data analysis tools that can deal with
large data. SAS provides an end-to-end solution for the entire Analytics cycle. It’s the undisputed leader
in the commercial analytics space.
What is SAS
SAS is an integrated system of software solutions, which enables you to perform the following tasks:
Applications development
Data science is concerned with organizing, packing, and delivering data. SAS can help in all three stages.
With the tools at their disposal in SAS, Data Scientists can organize, analyze, and provide interpretations
or results.
SAS has an edge over other tools with its huge array of statistical functions, user-friendly graphical user
interface, and technical support.
Industries that use SAS include Automotive, Banking, Capital Markets, Consumer Goods, Defense, Health
Care, Higher Education, Manufacturing, Media, Retail, Sports, Entertainment, and so on.
Market Trends
Demand for SAS professionals has increased dramatically compared to other data analysis software
professionals.
Objectives
Simplilearn’s Data Science with SAS course will enable you to:
Attention people!
Simplilearn provides an exciting range of learning modules for our eager learners.
Simplilearn wants to hold your attention by providing logical breaks in the form of knowledge checks.
Going one step further, Simplilearn introduces gaming to add an element of challenge to your learning.
Learn while you play “Organize to Analyze”. Let’s begin this course!
Simplilearn’s Data Science with SAS Course
This course enables you to learn the key concepts of SAS, which are important for Data Analytics, using
practical examples. The course comprises 32 hours of Instructor Led Training, 24 hours of eLearning, and
hands-on experience with industry projects. You will receive full support from the Simplilearn Faculty
throughout the course and mentoring for project work.
You will also be able to access three sets of assessment papers comprising 100 questions each, case
studies, and four live industry projects on the SAS tool. On successful completion, you will receive an
experience certificate.
Complete any two projects and get them evaluated by the lead trainer.
Submit your queries by writing to Help and Support on www.simplilearn.com or talking directly to our
support staff with the Simplitalk and Live Chat options.
Go ahead and begin the “Data Science with SAS” course. The first lesson is “Analytics Overview.”
Happy Learning!
Lesson 01 — Analytics Overview
Introduction
Hello and welcome to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.
In this lesson “Analytics Overview,” you will learn what data analytics is and the ways to perform data
analysis, using SAS.
What’s In It for Me
In this lesson, you will understand the concept of data analytics, its types, and techniques. You will be
able to list the various types of analytical problems industries face, and describe ways to solve those
using SAS. You will also learn the various widely used analytical tools to perform data analysis.
What is Analytics
Analytics plays a vital role not only in businesses but also in various fields such as sports, healthcare,
finance, and government. It is hard to think of any aspect of life that is not affected by analytics.
Analytics is a scientific process that examines raw data to draw meaningful conclusions from the data. It
gives insights into the information to help organizations make better decisions.
The study of analytics often involves analyzing historical data to look for potential trends, to understand
the effects of certain decisions, or to evaluate the performance of the business on the basis of the
decisions made. This comprehensive knowledge of past trends and decisions can form the basis on which
corrective actions can be taken.
Data Analysis—Example
Suppose you are working with an Ecommerce company and you want to run a marketing campaign to
increase your sales.
To do so, you need to analyze your existing campaigns and how much they contribute to the current
business, and collect additional statistical information from all the areas. This will help you examine the
key areas that can drive your business.
The tasks that you perform to increase sales through a marketing campaign are called marketing
analysis.
Analytics even helps companies optimize their supply chain performance. By analyzing their historical
data on a daily, weekly, and monthly basis, they evaluate and forecast the future demand for their
products.
Suppose you are working in a multinational tire company and you want to analyze the demand of tires at
two different depots across the globe. If proper analysis and evaluation is performed, you can supply the
products per the demand and maintain the required stock in the stores.
From these examples, it is clear that data analysis plays a vital role in every organization.
Types of Analytics
Descriptive Analytics
Descriptive analytics allows you to break a big chunk of data into smaller pieces, extracting relevant
information from the data or providing a brief synopsis of what happened. This is also known as the
“simplest class of analytics.”
Let us take an example of using descriptive analytics for customer data. It includes finding answers to the
following questions:
Diagnostic Analytics
Diagnostic analytics is the best option if you want to go deeper into the collected data.
In diagnostic analytics, we are not concerned with “what happened”; instead, we focus on “why it
happened.”
Descriptive Analytics doesn’t provide us with answers to questions like “How do we fix this?” or “How
can we improve this?”
Predictive Analytics
Predictive analytics is another option to help us condense data. It uses different statistical, data
modeling, and data mining techniques to study the latest and past trends, thereby allowing the business
analysts or data scientists to make predictions.
Here is an example of using Predictive Analytics for a marketing campaign. It will look for answers to the
following questions:
Who will respond to this campaign, and for what product and through which channel?
What are the potential values of each customer and prospect?
Who will stop the subscription to your service, and when would that be?
Prescriptive Analytics
Prescriptive analytics is the last phase of business analytics and is related to both descriptive and
predictive analytics. While descriptive analytics provides information about what has happened and
predictive analytics helps forecast what might happen, which is probabilistic in nature, prescriptive
analytics optimizes decision making by determining the best solution available among various choices,
given the business constraints.
Areas of Analytics
Let’s look at a few types of analytics depending on the areas we use them in:
Customer Analytics
Financial Analytics
Performance Analytics
Risk Analytics
Customer Analytics
Customer Analytics is a process that helps organizations make critical decisions and deliver offers that
are anticipated. This analytics offers organizations necessary customer insights to make better
decisions. Customer analytics uses techniques such as market segmentation, predictive analytics, data
modeling, and data visualization. It plays a pivotal role in the prediction of customer behavior.
Example:
Telecom companies these days use different marketing methods to retain their customers.
Financial Analytics
This type of analytics is the new way to drive competitive advantage. It helps financial executives explore
different ways to answer specific finance-related business questions and forecast future financial
situations. In today's dynamic business environment, financial analytics helps the finance function to
bring greater value to organizations.
Companies can leverage financial analytics to take multiple views of their data and derive insights that
will help them take necessary actions.
Example:
Reading cash flow statements, balance sheets, and income statements comes under financial analytics.
Performance Analytics
Performance analytics is the practice of using data and technology to study how our business is
performing to continuously make it better. The basic functions involved in Performance Analytics are
Planning, Organizing, Staffing, Directing, and Controlling.
Example:
In Human Resource Management, the performance of the employees is monitored on a regular basis,
keeping in mind the parameters dependent on the expected outcomes.
Risk Analytics
Risk analysis tries to foresee the uncertainties in a predicted future, which helps evaluate a project’s
likely success or failure.
Quantitative risk analysis quantifies the possible results of a project. This analysis tries to
numerically evaluate the probabilities of various adverse events and predict the losses a company would
incur if any of these events occurred.
Qualitative risk analysis is performed on almost all risks and is not numerically defined. This method
involves defining various project-related threats and risks, determining the extent of these risks and
proposing corrective actions to avoid these risks.
Example:
In the banking industry, credit scores are built to predict an individual’s delinquency behavior and are
used to represent the creditworthiness of each individual.
Analytical Tools
So far you have learned what analytics is, its types, and the areas of analytics.
Let’s now look at the popular analytical tools available for data analysis.
Excel,
SAS,
Python,
R,
MATLAB, and
Tableau Software.
The following reasons make SAS one of the best and most popular tools to visualize data:
Helps users understand the nature of the customers and anticipate the future by forecasting and
modelling
Processes and manages large and complex datasets
Works with multiple variables
Tracks all the operations of datasets and generates output
Provides better Graphical User Interface, Graphs, Regression results, and Summary statistics
Analytical Techniques
With the help of analytical techniques, we can easily examine the complex relationships between
variables.
• Clustering
• Regression
• Decision Tree
• Time Series
Clustering is the process of grouping abstract objects into classes of similar objects. It is a common
technique used for statistical data analysis and is mainly involved in the process of data mining.
It is used in various applications such as market research, pattern recognition, data analysis, and image
processing.
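As a minimal sketch of clustering in SAS, the step below groups the rows of the SASHELP.IRIS sample dataset into three clusters with PROC FASTCLUS. The choice of dataset, the number of clusters, and the output name work.iris_clusters are assumptions made for illustration; they are not part of the course material.

```sas
/* Group iris flowers into 3 clusters based on four measurements. */
/* MAXCLUSTERS=3 and the output dataset name are illustrative.    */
proc fastclus data=sashelp.iris maxclusters=3 out=work.iris_clusters;
    var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Cross-tabulate the assigned cluster against the known species */
/* to see how well the groups line up.                           */
proc freq data=work.iris_clusters;
    tables Species*Cluster / nopercent norow nocol;
run;
```

The OUT= dataset carries a CLUSTER variable for each observation, which is what the PROC FREQ step tabulates.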
Regression is a statistical technique used to estimate the strength of the relationship between one
dependent variable (usually denoted by Y) and a series of other changing variables (known as
independent variables).
Example:
Consider sales data where we have quantity sold, amount, and marketing expenses of various products
in the company. Using regression, we can determine the relationship between quantity sold, amount,
and marketing expenses.
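The sales example could be sketched in SAS roughly as follows. The dataset work.sales, its variable names, and every data value are made up for illustration, and treating quantity sold as the dependent variable is an assumption, since the text does not say which variable is being explained.

```sas
/* Hypothetical sales data: all values are invented for illustration. */
data work.sales;
    input Product $ Quantity Amount Marketing;
    datalines;
A 120 2400 300
B 95 1900 220
C 150 3100 410
D 80 1500 180
E 130 2700 350
F 110 2150 260
;
run;

/* Fit a linear regression with Quantity as the dependent variable */
/* and Amount and Marketing as the independent variables.          */
proc reg data=work.sales;
    model Quantity = Amount Marketing;
run;
quit;
```

The parameter estimates and R-square in the output indicate the direction and strength of the fitted relationship.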
A decision tree is a form of multiple-variable analysis. It allows us to predict, explain, describe, or
classify an outcome.
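One way to try a decision tree in SAS is PROC HPSPLIT, sketched below on the SASHELP.IRIS sample dataset. This assumes a SAS/STAT release that includes PROC HPSPLIT; the dataset and the growing and pruning choices are illustrative assumptions and are not the example used in the course.

```sas
/* Classification tree that predicts Species from four measurements. */
/* Assumes PROC HPSPLIT is available in the installed SAS/STAT.      */
proc hpsplit data=sashelp.iris;
    class Species;                                   /* categorical target */
    model Species = SepalLength SepalWidth PetalLength PetalWidth;
    grow entropy;                                    /* splitting criterion */
    prune costcomplexity;                            /* pruning method */
run;
```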
Time series analysis helps in forecasting future values based on previously observed values.
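A hedged sketch of forecasting from previously observed values uses PROC ESM (exponential smoothing) on the SASHELP.AIR sample series. It assumes the SAS/ETS procedures are available in your installation; the 12-month horizon, the Winters model, and the output name work.air_forecast are illustrative choices.

```sas
/* Forecast 12 months ahead from a monthly series.                   */
/* Assumes SAS/ETS (PROC ESM) is available; SASHELP.AIR has the      */
/* variables DATE and AIR.                                           */
proc esm data=sashelp.air out=work.air_forecast lead=12 print=forecasts;
    id date interval=month;          /* time ID and its frequency */
    forecast air / model=winters;    /* trend plus seasonality */
run;
```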
Key Takeaways
Analytics is a scientific process to examine raw data to draw meaningful conclusions from the
data.
Descriptive analytics allows you to break a big chunk of data into smaller pieces.
Diagnostic analytics is used to go deeper into the collected data.
Predictive analytics helps condense data.
Prescriptive analytics optimizes decision making by determining the best solution from the
available options.
Customer Analytics is a process that helps organizations make critical decisions and deliver
offers that are anticipated.
A few analytical techniques of SAS are Clustering, Regression, Decision Tree, and Time Series.
Conclusion
ANSWERS:
Lesson 02 — Introduction to SAS
Introduction
Hi and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.
In this lesson, “Introduction to SAS,” you will be introduced to the essential concepts of the Statistical
Analysis System.
What’s In It for Me
In this lesson, you will understand what SAS is and its components. You will also get acquainted with the
SAS console.
In addition, you will learn to import/export data and list SAS’s different temporary and permanent
libraries.
What is SAS
Let’s start this lesson by defining what Statistical Analysis System is.
Statistical Analysis System, or SAS, is a software suite developed by the SAS Institute for advanced
analytics, multivariate analyses, Business Intelligence, data management, and predictive analytics.
SAS is a set of solutions for enterprise-wide business users, and it provides a powerful fourth-generation
programming language for performing tasks such as:
• quality improvement.
Before we begin with the concepts of SAS, let us install the SAS University Edition on your system.
SAS University Edition
You can download the free SAS University Edition by visiting the website shown on the screen.
http://www.sas.com/en_us/software/university-edition/download-software.html
Ensure you have the following system configuration to install the software:
1. 64-bit hardware
2. 1GB RAM
3. Microsoft Windows 7, 8, 8.1, or 10
4. Microsoft Internet Explorer 9, 10, or 11, Mozilla Firefox 21 or later, or Google Chrome 27 or later
version
Click the Installation Steps button to download the installation steps for the SAS software.
These installation steps are also available at the link shown on the screen.
http://support.sas.com/software/products/university-edition/docs/en/SASUniversityEditionQuickStartVirtualBox.pdf
Follow the installation steps carefully and enjoy working with the SAS software.
Opening SAS University Edition
Now that you have installed the SAS University Edition on your system, let’s see how to open the SAS
software.
Open VirtualBox by double-clicking its icon on the desktop to access the SAS University Edition.
Click the Start button. The “Oracle VM VirtualBox” window opens.
Type the link shown on the screen, http://localhost:10080/, in Internet Explorer, Mozilla Firefox, or
Google Chrome. You can now access the SAS University Edition Information Center.
Click Start SAS Studio. SAS Studio opens in a new window.
There you have it! You now have access to SAS and can start practicing this new programming language.
Navigating In the SAS Console
SAS provides a graphical user interface that makes SAS easy to use.
SAS Studio has a navigation pane on the left side and a work area on the right side.
The navigation pane helps you to access files from your system, server, or shared folder. It also has
saved tasks, snippets, libraries, and file shortcuts for easy access.
The work area has three windows, namely CODE, LOG, and RESULTS.
The LOG window is used to view messages about your SAS session and debug SAS programs.
To start a new program, either right-click “My Folders” under “Server Files and Folders” in the navigation
pane, click “New,” and select “SAS Program,” or just press the shortcut key “F4.”
This icon is used to save the program in the desired name and location.
This icon is used to cut the program to paste it in the desired place.
This icon is used to paste the copied program in the desired place.
This icon is used to find the desired code and replace it with another code.
SAS Language Input Files
When you work with SAS, you use files that are created and maintained by SAS and files that are not
related to SAS.
SAS files
External files
Database Management System, or DBMS, files
SAS files:
Files with formats or structures known to SAS are called SAS files. All SAS files reside in a SAS library.
A SAS file can be a SAS dataset, a catalog, a stored program, a multidimensional database file, or a
financial database file.
External files:
Files with formats or structures unknown to SAS are called external files. Raw data that you want
to read into a SAS data file is stored in external files.
Files that are stored in the form of databases are called Database Management System files. SAS
software enables you to write and read data to and from many common Database Management
Systems.
SAS Language Elements
The SAS language consists of statements, expressions, formats, and functions similar to those of many
other programming languages.
These elements are used within the DATA step or PROC step of a SAS program.
The DATA step enables you to read and write raw data from and to external files and SAS files.
The PROC step is a group of procedure statements that enables you to analyze data and create
tables, reports, charts, and SQL queries.
In short, you can say that the DATA step is used to create and manipulate SAS data and the PROC step is
used to analyze the data and generate the output.
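The division of labor between the two steps can be sketched with the SASHELP.CLASS sample dataset. The dataset choice, the output name work.class_bmi, and the BMI calculation are illustrative assumptions: the DATA step creates and manipulates data, and the PROC step analyzes it.

```sas
/* DATA step: create a new dataset and compute a new variable.      */
/* SASHELP.CLASS stores Height in inches and Weight in pounds.      */
data work.class_bmi;
    set sashelp.class;
    bmi = (weight / (height**2)) * 703;   /* body mass index */
run;

/* PROC step: analyze the data and generate output. */
proc means data=work.class_bmi mean min max;
    var bmi;
run;
```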
DATA Step
The DATA step is used to create SAS datasets, compute values, and select specific input records for
processing. It can produce the following types of output:
SAS log
SAS data file
SAS view
External data file
SAS log is the default type and contains a list of processing messages.
A SAS data file is a SAS dataset that contains a data portion and a data descriptor portion. The descriptor
portion consists of the information about the contents and attributes of SAS dataset.
SAS view is a SAS dataset that uses descriptor information and data from other files.
DATA Step—Example
Let’s step into the “Syntax Classroom.” In the “Syntax Classroom,” you can learn all the essential syntax
required to work with SAS software.
Let’s understand this with an example. Take a look at the example code written on the screen.
Data Electronic;
Datalines;
Run;
Here, the keyword “data” creates the dataset electronic.
The variables declared here are product name, salesman name, and price.
The dollar symbol declares the product name and salesman name as character variables.
The keyword “Datalines” indicates that the next lines contain input data.
In this example, the product name, salesman name, and price are referred to as variables, and each row
of their values is called an observation.
The “Proc Print” step is used to print the contents of the Electronic dataset.
The keyword “title” adds a title to the output. Here, the output is titled “Electronic Dataset of Online
XYZ Store.”
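Putting the pieces described in this walkthrough together, a complete program might look like the sketch below. The variable names Product, Salesman, and Price and all three data rows are hypothetical stand-ins, because the on-screen code and its data are not reproduced in this text.

```sas
/* Create the Electronic dataset from in-stream data.          */
/* Variable names and data values are hypothetical examples.   */
data Electronic;
    input Product $ Salesman $ Price;   /* $ marks character variables */
    datalines;
Laptop Ravi 650
Phone Maria 300
Tablet Chen 250
;
run;

/* Print the dataset with a title above the output. */
title "Electronic Dataset of Online XYZ Store";
proc print data=Electronic;
run;
title;   /* clear the title so it does not carry over to later output */
```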
DATA Step Processing—Compilation Phase
When you submit a DATA step for execution, it is first compiled and then executed. Let’s learn about
each phase in detail.
The compile phase checks for any syntax errors. The SAS statements written in SAS software are
compiled in this phase.
The compile phase creates an input buffer, a program data vector, and descriptor information.
An input buffer is the area of memory into which each record of raw data is read when an INPUT
statement is executed. The input buffer is created only when the DATA step reads raw data.
A program data vector, or PDV, is the area of memory where the SAS System builds your dataset one
observation at a time. When the program executes, data values are read from the input buffer or
created by SAS language statements and assigned to the appropriate variables in the program data
vector. From here, the variables are written to the SAS dataset as a single observation.
Descriptor information is information that SAS creates and maintains about each SAS dataset, including
dataset attributes and variable attributes.
DATA Step Processing—Execution Phase
All executable statements in the DATA step are executed once for each iteration. If your input file
contains raw data, then SAS reads a record into the input buffer. SAS then reads the values in the input
buffer and assigns the values to the appropriate variables in the program data vector. SAS also calculates
values for variables created by program statements and writes these values to the program data vector.
When the program reaches the end of the DATA step, three actions occur by default, which make using
the SAS language different from using most other programming languages. They are:
• SAS writes the current observation from the program data vector to the dataset.
• SAS returns to the top of the DATA step.
• Variables in the program data vector are reset to missing values. However, the automatic
variable _N_ is not reset but incremented by one.
SAS then builds the second observation and continues until there are no more records to read. The
dataset is then closed, and SAS goes on to the next DATA or PROC step.
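The end-of-iteration behavior described above can be observed by writing the program data vector to the log with PUT statements. This is a small sketch; the dataset name work.pdv_demo, the variable names, and the data values are assumptions made for illustration.

```sas
/* Write the PDV to the log before and after each INPUT to watch    */
/* values being reset to missing and _N_ being incremented.         */
data work.pdv_demo;
    put "Top of iteration: " _all_;       /* values before reading a record */
    input Product $ Price;
    PriceWithTax = Price * 1.1;           /* variable created by a program statement */
    put "After INPUT:      " _N_= _ERROR_= Product= Price= PriceWithTax=;
    datalines;
Laptop 650
Phone 300
;
run;
```

Check the LOG window after running the step: at the top of each iteration the variables read or computed in the previous pass show as missing again, while _N_ keeps counting upward.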
DATA Step Processing—Example
Let’s understand DATA step processing with the same example used earlier in the syntax classroom.
When you submit a DATA step for execution by clicking the “Run” button, SAS automatically compiles
the DATA step and then executes it. In the compilation phase, SAS creates an input buffer for the
Electronic dataset to hold the data, because the data is read as raw data rather than from a SAS dataset.
The PDV contains all the variables (product name, salesman name, and price) in the input dataset. In
addition, two variables, _N_ and _ERROR_, are generated automatically. The “_N_” variable represents
the number of times the DATA step has iterated. The “_ERROR_” variable acts like a binary switch whose
value is 0 if no errors exist in the DATA step, or 1 if one or more errors exist.
Initially in the process, all variable values are set to missing, except for the _N_ and _ERROR_ automatic
variables, which start at 1 and 0. Missing numeric values in SAS are represented by a period, and missing
character values by a blank.
SAS reads the first data line into the input buffer.
The INPUT statement then reads the data values from the input buffer and writes
them to the PDV, where they become variable values.
SAS increments the _N_ automatic variable by 1 and resets the _ERROR_ automatic variable to 0
at the end of each iteration.
The data is printed because a PROC PRINT step follows the DATA step.
SAS Libraries- Creating a New Library
So far you have learned the two major steps of SAS and their execution processes.
SAS libraries allow us to store datasets and user-defined formats so that they can be used in our
programs. In general, a SAS library is a folder located on our local machine or a shared drive that we use
to store data for SAS programs.
Let’s step into the syntax classroom to learn the syntax used for SAS libraries. Click go to enter into the
syntax classroom.
SAS allows you to create your own library and to access the existing library.
To create your own library, use the syntax shown on the screen.
“libref” represents the name of the library. The library name must be no longer than 8
characters and must start with a letter or an underscore.
After the libref, you should specify the desired file path in quotation marks.
Note that the “LIBNAME” statement is a global statement; it is written outside the DATA step and PROC step.
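A LIBNAME statement that follows these rules might look like the sketch below. The libref mylib and the folder path are hypothetical; the path assumes the shared-folder layout commonly used with SAS University Edition, and the folder must already exist.

```sas
/* Assign the libref MYLIB to an existing folder.                   */
/* Both the libref and the path are hypothetical examples.          */
libname mylib "/folders/myfolders/sasdata";

/* Save a small dataset permanently in the new library. */
data mylib.electronic;
    input Product $ Price;     /* hypothetical variables and values */
    datalines;
Laptop 650
Phone 300
;
run;

/* Show the attributes of the assigned library in the log. */
libname mylib list;
```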
To access the stored library, use the syntax shown on the screen.
Libref.dataset_name
Here, libref is the stored library name. The dataset name represents the name of the stored dataset.
When you close the SAS session after performing your tasks, the librefs that you defined in your
program are cleared, although the files in a permanent library remain on disk. This means that you need
to resubmit the LIBNAME statement each time you start a new SAS session.
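In a new session, the pattern could look like the following sketch. The libref mylib and the path are hypothetical, and the folder is assumed to exist already, possibly holding datasets saved in an earlier session.

```sas
/* Reassign the libref at the start of each new SAS session. */
/* The libref and path are hypothetical examples.            */
libname mylib "/folders/myfolders/sasdata";

/* List the datasets stored in the library; a stored dataset */
/* is then referenced as mylib.dataset_name.                 */
proc contents data=mylib._all_ nods;
run;

/* Release the libref when it is no longer needed. */
libname mylib clear;
```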
Permanent and Temporary SAS Libraries
Permanent Library
Temporary Library
A permanent SAS library exists on the external storage medium of your computer, and it is not deleted
when the SAS session terminates. Permanent SAS libraries are stored until you delete them.
A temporary SAS library exists only for the current SAS session.
Temporary SAS files are held in a special work space, which is assigned the default libref WORK. Note
that files in the temporary WORK library can be used in any DATA step or SAS procedure during the SAS
session, but they are typically not available for subsequent SAS sessions.
Let’s step into the classroom to understand how to use a temporary library.
Work.Data_set_name;
Data work.Electronic;
Datalines;
Run;
This example creates the dataset in the temporary WORK library. Writing “Data Electronic;” without the libref gives the same output, because SAS uses the WORK library by default.
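The two ways of coding a temporary dataset can be compared side by side. This is a sketch only: the variables and the two data rows are made up for illustration, since the lesson does not show the contents of the Electronic data.

```sas
/* Hypothetical sample rows; the real Electronic data is not shown in the lesson. */
data work.Electronic;          /* two-level name: libref.dataset */
  input Order_ID Sales;
  datalines;
1 120
2 180
;
run;

data Electronic;               /* one-level name: SAS assumes WORK */
  input Order_ID Sales;
  datalines;
1 120
2 180
;
run;
```

Both steps write to the same dataset, WORK.Electronic, so the second step simply replaces the output of the first.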
Knowledge Check
Question 1: Which of the following keywords is used to create a library in SAS?
Answer: d. The keyword LIBNAME is used to create a library in SAS.
Demo-Importing Data
You have learned about the types of libraries in SAS. Let’s now learn how to import and export data with the help of a demonstration.
Click “Server Files and Folders” in the navigation pane and browse to the file you want to import. Here, we will import the Ecommerce data.
You can find the dataset name and its location as shown on the top.
If you have data in a specific worksheet in your Excel workbook, you can pass the name of your
worksheet in the Worksheet Name box. By default, SAS imports data from the first worksheet.
You can change the storage location of the output by clicking the Change button. By default, the output dataset is saved to the Work library, which is a temporary location. The contents of this library are deleted when you exit SAS Studio.
The Results tab shows the attributes of the new SAS dataset.
The Output Data tab shows the contents of the new dataset.
Demo -Exporting Data
Double-click the “Generate CSV file” option from the “Data” drop-down list.
Note that in this example, the dataset car is exported. You can also change the dataset by typing the
required dataset name.
Assignment
Import the data of the North region from the Ecommerce dataset. The Ecommerce data is available under Downloads.
Key Takeaways
Conclusion
This concludes “Introduction to SAS.” The next lesson is “Combining and Modifying Datasets.”
Lesson 03 — Combining & Modifying Datasets
Introduction
Hi and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.
I will take you through this lesson on Combining and Modifying Datasets.
What’s In It for Me
In this lesson, you will learn the different methods used to combine datasets. You will also learn to
modify datasets, and use SAS functions and procedures to manipulate data.
Why Combine or Modify Data
A data analyst often has to combine or modify data to aid analysis. For example, a company that sells products both online and through teleshopping keeps track of its sales in two databases. To find the total sales for a period, it has to combine both datasets. SAS offers many methods to combine datasets, such as concatenating, interleaving, one-to-one reading, and one-to-one merging. The choice of method depends on the requirements and the business scenario.
Take another example where the company has sales information for the past year and now wants to analyze it by quarter, after sorting sales from the highest to the lowest in a particular region. This sort of data modification can be done using the functions and procedures available in SAS.
Let’s begin this lesson by learning the techniques for combining datasets.
Combining Datasets
We’ll learn four methods of combining datasets: concatenating, interleaving, one-to-one reading, and one-to-one merging.
Concatenating Datasets
Concatenating datasets in SAS means stacking datasets one “on top” of the other into a single dataset.
The number of observations in the new dataset is the sum of the observations in the original datasets.
If a company maintains employee details department-wise and wants to have all the employee details in
one dataset for payroll processing, then by concatenating the individual department datasets it can have
the information in one dataset.
You can concatenate datasets in two ways:
• SET statement
• APPEND procedure
Let’s learn both these methods and their differences so that you will be able to choose a method based
on the combining requirements.
If the datasets that you concatenate contain the same variables, and each variable has the same
attributes in all the datasets, then the results of the SET statement and PROC APPEND are the same.
On the other hand, if the datasets contain different variables, the two methods give different results.
Concatenating Datasets - Set Statement
Let’s step into the “Syntax Classroom” to learn the syntax. The SET statement allows you to read and
modify datasets.
Set SAS_Data_sets;
SAS SET Statement Demo
An E-Commerce company maintains its data in two datasets “Electronic” and “Fashion” and each has the
following variables: ‘Order_ID’, ‘Products’, ‘Region’, and ‘Sales’. The company wants a consolidated
report of both datasets to understand the combined sales amount for the year. This can be done with
the concatenation method in SAS.
‘Electronic’ and ‘Fashion’ datasets are in “myfolders” of this machine under the “Lesson3” sub-folder.
Let’s import both these datasets using the IMPORT procedure (PROC IMPORT). You can see that code has been entered in the program editor for each dataset to import the data from the folder into the SAS application.
In the “Output Data” tab you can see the ‘Electronic’ and ‘Fashion’ datasets that have been generated.
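The import step for one of the datasets might look like the sketch below. The file path, the file type (CSV), and the file name are assumptions, because the transcript does not show the actual code from the program editor.

```sas
/* Assumptions: the path, the CSV file type, and the file name are placeholders. */
proc import datafile="/folders/myfolders/Lesson3/Electronic.csv"
            out=work.Electronic
            dbms=csv
            replace;
  getnames=yes;   /* read variable names from the first row */
run;
```

A second PROC IMPORT step with a different DATAFILE= and OUT= value would bring in the ‘Fashion’ dataset in the same way.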
Select the program and click the Run icon.
In the ‘Output tab’ you can see the name of the table ‘combinedataset’ which has the ‘Fashion’ and
‘Electronic’ datasets combined.
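A minimal sketch of the concatenation step in this demo, assuming that the ‘Electronic’ and ‘Fashion’ datasets already exist in the WORK library:

```sas
data combinedataset;
  set Electronic Fashion;   /* all rows of Electronic, then all rows of Fashion */
run;

proc print data=combinedataset;
run;
```

The number of observations in combinedataset is the sum of the observations in the two input datasets.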
Concatenating – PROC Append
The APPEND procedure adds the observations from one SAS dataset to the end of another SAS dataset.
PROC APPEND does not process the observations of the first dataset. It adds the data of the second
dataset directly to the end of the original dataset.
base-data-set is the SAS dataset to which you want to append the data. If this dataset does not exist, then SAS creates it. When PROC APPEND completes, base-data-set becomes the most recently created dataset.
Data-set-to-append is the SAS dataset that contains the observations to add to the end of the base dataset. If you omit this option, then PROC APPEND adds the data in the most recently created SAS dataset to the end of the base dataset.
The FORCE option forces PROC APPEND to concatenate the datasets in some situations where it would otherwise stop with an error, for example when a variable is longer in the appended dataset than in the base dataset.
Demo – Concatenate Proc Append & FORCE option
Let’s try to concatenate the ‘Fashion’ and ‘Electronic’ datasets using the APPEND procedure.
Use the keywords ‘PROC APPEND’ and specify the names of the datasets to be combined.
In the ‘Log’ tab, you can see an error message that has been generated and the two datasets have not
been combined. The message says that some variable lengths are different.
You can see the keyword “Force” being included in the program.
In the output data tab, you can see the combined dataset that includes both Fashion and Electronic
datasets.
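The demo can be approximated with the sketch below. It is not the exact code from the screen: the copy step and the name Fashion_All are assumptions, added so that the original ‘Fashion’ dataset is left unchanged.

```sas
/* Work on a copy so the original Fashion dataset is not modified
   (the name Fashion_All is an assumption). */
data Fashion_All;
  set Fashion;
run;

/* FORCE lets the append go ahead when variable lengths differ. */
proc append base=Fashion_All data=Electronic force;
run;
```

Without FORCE, the step stops with the error described above when a variable is longer in the DATA= dataset than in the base dataset.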
SET and Append–A Comparison
Having learned the two methods of concatenating, namely the SET statement and the APPEND procedure, let’s now look at a comparison of these methods.
The SET statement can be used to combine any number of datasets, while the APPEND procedure combines only two datasets at a time.
The SET statement keeps all the variables from all the datasets and assigns missing values where appropriate, while the APPEND procedure needs the FORCE option to concatenate datasets that contain different variables.
The SET statement uses explicitly defined formats, informats, and labels, while the APPEND procedure takes these from the base dataset.
If a variable has different lengths in the datasets, the SET statement uses the length from the dataset named first, while the APPEND procedure (with the FORCE option) truncates the values to match the length in the base dataset.
The SET statement will not concatenate datasets in which the same variable has different types, while the APPEND procedure concatenates them when the FORCE option is used.
Interleaving Method
The interleaving method is a way of combining individual sorted datasets into one big sorted dataset.
However, before combining the datasets you have to ensure that they are sorted by the same variable
or variables. The SET statement along with the BY statement is used in this method.
For example, when dataset Electronic and dataset Fashion are interleaved by the variable “Sales”, we get the dataset “OutputSales”. Let’s see how to write this code.
Data OutputSales;
Set Electronic Fashion;
by Sales;
Run;
Note that both datasets must already be sorted by the BY variable. Here, the data is sorted by the Sales field.
Interleaving - Demo
Let’s write the program in the Program Editor to interleave the ‘Electronic’ and ‘Fashion’ datasets.
You can see the two datasets displayed here. To combine them, we will write the program:
Data OutputSales;
Set Electronic Fashion;
by Sales;
Run;
You can see the dataset combined through the interleaving method here, and it is sorted by the ‘Sales’ field.
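Because interleaving needs sorted inputs, a complete program usually sorts first. The sketch below assumes that ‘Electronic’ and ‘Fashion’ exist in the WORK library and that both contain the variable Sales.

```sas
/* Sort each input by the BY variable before interleaving. */
proc sort data=Electronic; by Sales; run;
proc sort data=Fashion;    by Sales; run;

data OutputSales;
  set Electronic Fashion;
  by Sales;               /* rows from both inputs come out in Sales order */
run;
```

If either input is not sorted by Sales, the DATA step stops with an error saying that the BY variables are not properly sorted.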
Knowledge Check
Question 1: For which combining method should the datasets be sorted?
Answer: c. In the interleaving method of combining, the data should be sorted by the same variable.
One-to-One Reading
One-to-one reading combines two or more SAS datasets, one "to the right" of the other, into a single "fat" dataset. In one-to-one reading, observations are combined by position: the first observation of one dataset is paired with the first observation of the other, the second with the second, and so on. The result is meaningful when each value of the common variable occurs no more than once in each dataset and the observations are in the same order.
For example:
A company maintains two records. The first record has the variables “Order ID”, “Sales_Amount”, and “Product”. The second record has the variables “Order ID”, “Customer_Name”, and “Location”. Suppose the company wants to see all the information related to an Order ID, such as Sales, Product, Customer Name, and Location, to analyze it further for sales forecasts. The one-to-one reading method of combining datasets can be used for this.
Data onetooneread;
Set <Dataset>;
Set <Dataset1>;
Run;
Set is a keyword and Dataset refers to the names of the datasets to be combined. Set will read the
observations from each dataset matching the first one with the first and so on. It will stop at the end of
the smaller dataset. Let’s see a demonstration of the one-to-one read method.
One-to-One Reading - Demo
Let’s write the program in the Program Editor to combine Sales and Customer_Info datasets using the
one-to-one read method.
You can see the data inputted here. Let’s write the code to generate these two datasets.
Select the data and program and click the Run icon.
You can see the ‘Sales’ dataset and the ‘Customer_Info’ dataset here.
Let’s now write the one-to-one read code to combine these datasets.
Data onetooneread;
set Sales;
set Customer_Info;
Run;
This will combine the first observation of Sales with the first observation of Customer_Info, the second observation of Sales with the second observation of Customer_Info, and so on, to create the one-to-one-read dataset. The DATA step stops after it reads the last observation from the smallest dataset. For example, if you check the combined dataset, it has ignored Order ID “6” in the Sales dataset.
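The stopping behavior can be seen in a small self-contained sketch. The rows below are hypothetical and much shorter than the demo data: Sales has three observations and Customer_Info has two.

```sas
/* Hypothetical rows: Sales has 3 observations, Customer_Info has 2. */
data Sales;
  input Order_ID Sales_Amount Product $;
  datalines;
1 120 Phone
2 340 Laptop
3 90 Cable
;
run;

data Customer_Info;
  input Order_ID Customer_Name $ Location $;
  datalines;
1 Asha Delhi
2 Ben Pune
;
run;

data onetooneread;
  set Sales;
  set Customer_Info;   /* reaching the end of this dataset stops the DATA step */
run;
```

Here onetooneread contains two observations. The third observation of Sales is read but never written, because the second SET statement has no more rows to read.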
One-to-One Merging
One-to-one merging, like one-to-one reading, also combines two or more SAS datasets, one "to the right" of the other, into a single "fat" dataset. Use one-to-one merging when you want to combine one observation from each dataset, but it is not important to match observations. Because observations are paired by position, no BY statement is used; the datasets need to be sorted by a common variable only when you add a BY statement to match observations on that variable.
For example:
Suppose the dataset Sales contains three variables: Order_ID, Sales_Amount, and Product; and the dataset Customer_Info contains three variables: Order_ID, Customer_Name, and Location.
The one-to-one merge syntax to combine these datasets would be:
Data onetooneread;
Merge Sales Customer_Info;
Run;
Knowledge Check
Question 1: In which combining method in SAS does the dataset stop reading data once it reads the last observation from the smallest dataset?
Answer: b. In the one-to-one reading method, the dataset stops reading data once the last observation from the smallest dataset is read.
One-to-One Merge - Demo
Let’s write the program in the Program Editor to combine the Sales and Customer_Info datasets using the one-to-one merge method.
You can see the input data of the two datasets. Select the program and click the Run icon.
You can see the first dataset, Sales, and the second dataset, Customer_Info, here.
This will combine the first observation of Sales with the first observation of Customer_Info, the second observation of Sales with the second observation of Customer_Info, and so on, to create the merged dataset. When SAS performs a one-to-one merge, the DATA step continues to read observations until the last observation is read from the largest dataset.
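A minimal sketch of the merge step, assuming the Sales and Customer_Info datasets from the demo already exist in the WORK library. The output name onetoonemerge is an assumption, chosen to keep the result apart from the one-to-one reading output.

```sas
data onetoonemerge;            /* output name is an assumption */
  merge Sales Customer_Info;   /* no BY statement: rows are paired by position */
run;
```

When Sales has more observations than Customer_Info, the extra rows are kept, and Customer_Name and Location are missing in them. This is the main difference from one-to-one reading, which drops those rows.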
Data Manipulation
We saw a few data combining techniques so far. Let’s now look at some data manipulation techniques.
But before we begin, what is data manipulation?
Data manipulation is the process of changing or rearranging data for further analysis. Data becomes
easier to read as it is organized in a systematic manner to facilitate study and analysis.
A popular use of data manipulation is allowing website owners to know their most popular pages and
traffic sources. Data manipulation helps in sorting and analyzing raw data to understand required
information.
Delete and group observations
The IF-THEN-ELSE statement is mainly used to group observations. It executes a SAS statement for observations that meet a specific condition.
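Grouping with IF-THEN-ELSE can be sketched as below. The output dataset name, the new variable Sales_Group, and the 150 cut-off are assumptions made for illustration.

```sas
data Electronic_Grouped;            /* output name is an assumption */
  set Electronic;
  length Sales_Group $ 6;           /* hypothetical new variable */
  if Sales > 150 then Sales_Group = "High";
  else Sales_Group = "Low";
run;
```

The LENGTH statement comes first so that the new character variable is wide enough for every value assigned to it.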
Delete Observations - Demo
Suppose you want to delete observations based on a certain condition. IF and DELETE are the two keywords to use in the program.
For example, if you want to delete observations with values greater than $150 in the Sales field of the ‘Electronic’ dataset, write the program:
Data Dataset_Deleteobservations;
Set Electronic;
If Sales > 150 then delete;
Run;
In the ‘Output Data’ tab, you can see the dataset ‘Dataset_Deleteobservations’ with sales figures of $150 and less.
Delete and Keep variables – Demo
Sometimes, you might want to delete one or more variables from a dataset. To do this, you have to use the DROP keyword.
In the ‘Electronic’ dataset, if you want to delete the variables ‘Shipping Cost’ and ‘Order Priority’, you have to write the program using the keyword DROP.
First, specify the output dataset name, then use the keyword DROP, and specify the variable names that you want to delete. Set ‘Electronic’ indicates the dataset to be used.
To retain only certain variables instead, specify the output dataset and use the keyword KEEP followed by the variables that you want to retain. Mention the dataset to be used and then click the Run icon.
In the Output Tab, you can see the original dataset ‘Electronic’ with all the variables.
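The DROP and KEEP steps might look like the sketch below. The output dataset names are assumptions, and so are the variable names Shipping_Cost and Order_Priority, since the transcript shows them only with spaces.

```sas
/* Assumption: the variables are named Shipping_Cost and Order_Priority. */
data Electronic_Drop;
  set Electronic;
  drop Shipping_Cost Order_Priority;   /* every other variable is kept */
run;

data Electronic_Keep;
  set Electronic;
  keep Order_ID Sales;                 /* every other variable is dropped */
run;
```

Neither step changes ‘Electronic’ itself, which is why the original dataset still shows all its variables.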
Modifying Variable Attributes
Variables in SAS have a number of attributes such as Name, Type, Length, Format, Label, and so on.
If you want to modify the attributes of a variable, for example, change the name to a new one, or cut down the length of a variable, you can use the statement that SAS provides for each action.
Default Format is the temporary format for displaying values of variables which are not in the FORMAT
statement. Default format is not permanently associated with variables in the output dataset.
RENAME old-name=new-name;
Modifying Variable Attributes - Demo
In this demonstration, you will see how to modify variable attributes using the ‘Rename’, ‘Label’, and ‘Format’ keywords.
“Set Electronic” specifies the dataset to be used, and the keyword ‘Rename’ indicates that the variable ‘Product’ has to be renamed to ‘Product_Names’.
The difference between ‘Rename’ and ‘Label’ is that Rename changes the variable name itself, whereas Label retains the variable name but attaches descriptive text that is displayed in its place in output.
“Label” is followed by the variable name, the equal sign, and the label text in quotation marks.
We are also going to format Sales so that the amount is displayed with a dollar sign and a decimal point
followed by two zeroes.
Use the keyword ‘Format’, mention the variable name, and type ‘dollar10.2’, which can be found in the
SAS dictionary of formats. This format will display the amount with a dollar sign, a comma, and two
decimal places.
You can see the output dataset under the ‘column names’ option displaying the new variable names and
formats.
‘Product’ has been renamed as ‘Product_Names’. The amount in the ‘Sales’ column is displayed with a
dollar sign and a decimal point followed by two zeroes.
In the ‘column labels’ option, you can see that ‘Order ID’ has been labelled as ‘ID’.
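The three changes in this demonstration can be written in one DATA step. This is a sketch: the output dataset name and the variable name Order_ID are assumptions, while the rename target, the label text, and the format come from the demonstration.

```sas
data Electronic_Modified;                 /* output name is an assumption */
  set Electronic;
  rename Product = Product_Names;         /* changes the variable name */
  label Order_ID = "ID";                  /* keeps the name, adds display text */
  format Sales dollar10.2;                /* for example, 1234.5 shows as $1,234.50 */
run;
```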
Key Takeaways
Combining and modifying datasets creates data that better serves the purpose of data analysis.
The four methods of combining data are concatenating, interleaving, one-to-one reading, and one-to-one merging.
Data manipulation techniques allow you to modify variable or observation attributes, exclude or include data based on criteria, or rename variables and attributes for further analysis.
Conclusion
This concludes the lesson “Combining and Modifying Datasets”. The next lesson will discuss “PROC SQL”.
Lesson 04 — PROC SQL
Introduction
Hi, and welcome back to the “Data Science with Statistical Analysis System or SAS” course offered by
Simplilearn.
In this lesson, “PROC SQL,” you will be introduced to the essential concepts of PROC SQL.
What’s In It for Me
In this lesson, you will understand the concept of data analytics, its types, and techniques. You will be
able to list the various types of analytical problems industries face, and describe ways to solve those
using SAS. You will also learn the various widely used analytical tools to perform data analysis.
What is PROC SQL
Structured Query Language, or SQL, is a generic database language that helps to communicate with
databases.
PROC SQL is the Base SAS implementation of SQL. It allows you to retrieve, summarize, sort, join, and concatenate datasets or databases available in SAS.
PROC SQL allows you to combine the functionality of the DATA step and the PROC step into a single step.
Before we begin with the concepts of PROC SQL, let’s understand some terminology associated with it.
Terminologies of SQL
The following table lists the equivalent terms that are used in SQL, SAS, and data processing.
The PROC SQL table is termed a SAS data file in SAS and file in data processing.
The row in SQL is termed an observation in SAS and record in data processing.
The column in SQL is termed a variable in SAS and field in data processing.
Well, let’s now learn the syntax of PROC SQL and its uses.
PROC SQL- Syntax
Let’s step into the “Syntax Classroom” to learn the syntax of PROC SQL.
PROC SQL;
--------------
QUIT;
A PROC SQL step begins with the keyword “proc sql” and ends with the keyword “quit.” The keyword “quit” is used to terminate the procedure.
A PROC SQL query can contain the following clauses:
SELECT clause,
FROM clause,
WHERE clause,
GROUP BY clause,
HAVING clause,
ORDER BY clause.
Every PROC SQL query must have at least one select statement. It displays the query's results without the PRINT procedure.
However, the other clauses such as where, group by, having, and order by are optional and can be applied according to the requirement.
Let’s understand the syntax of each clause and learn how to retrieve data from a single table using these
clauses in PROC SQL.
Retrieving Data from a Table
Select Statement
select column_name
from sql.data_set_name;
The select statement contains two clauses, namely the “select clause” and the “from clause.” The “select clause” is used to choose the columns to retrieve.
The “from clause” is used to select the dataset or table from which the data needs to be extracted.
Where Clause
select column_name
from sql.data_set_name
where <condition>;
The “where clause” is used to extract the data that fulfills the specific condition. Note that the keyword
used for this clause is “Where.”
Order by Clause:
select column_name
from sql.data_set_name
where <condition>
order by column_name <option>;
The “order by” clause sorts the output by one or more columns. It allows you to sort the output data in both alphabetical and numerical order. Note that the column name is mentioned after the keyword “order by.” The sort option, such as “desc” for descending order, is set after the column name.
Group by Clause:
The keyword “group by” breaks the resultant data into subsets of rows. You should use an aggregate function in either the “select” clause or a “having” clause to summarize each group. Some of the aggregate functions are avg (or mean), count, sum, and max.
select column_name, aggregate_function(column_name)
from sql.data_set_name
group by column_name;
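Having Clause:
The “having” clause keeps only the groups that fulfil a condition, which is usually written with an aggregate function.
select column_name, aggregate_function(column_name)
from sql.data_set_name
group by column_name
having <condition>;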
Demo- Retrieve data from a table
In this demo, you will learn how to retrieve data from a table using the PROC SQL clauses.
In this demo, we will retrieve data of all the products from the Electronic dataset, which have sum of
sales greater than 450, in a descending order.
The dataset “Electronic” is imported to the SAS console using the code shown on the screen.
To retrieve data using PROC SQL clauses, use the keyword proc sql. PROC SQL executes the program
without using the RUN statement.
The columns product, sales, and order priority are selected from the table “Electronic” using the
keyword “Select.”
In this demo, the products that have sales greater than 200 are selected using the WHERE clause.
The GROUP BY clause is used to group data by a specified column. Here, we will group by the product column. With the GROUP BY clause, we can also use an aggregate function in the SELECT clause or in a HAVING clause.
In this demo, only the product groups whose sum of sales is greater than 450 are kept. Note that the aggregate function SUM is used here.
This concludes the demo on how to retrieve data from a table using the PROC SQL clauses.
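The clauses used in this demo can be put together as in the sketch below. It is a simplified version, not the code from the screen: it assumes the columns are named Product and Sales, and it selects only the grouping column and the total.

```sas
proc sql;
  select Product,
         sum(Sales) as Total_Sales
    from Electronic
    where Sales > 200                 /* filters rows before grouping */
    group by Product
    having sum(Sales) > 450           /* filters groups after summing */
    order by Total_Sales desc;        /* highest total first */
quit;
```

WHERE and HAVING do different jobs here: WHERE removes individual rows, and HAVING removes whole product groups.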
Knowledge Check
Question 1: Which of the following PROC SQL clauses uses aggregate functions?
Answer: a. The Having clause uses aggregate functions.
Selecting columns in a Table
At times, you will need to select all columns or a specific column in a table.
To select all columns in a table, use an asterisk symbol in the “select” clause.
To select a specific column in a table, use the column name in the “select” clause.
Using PROC SQL, you can also eliminate the duplicate rows from the output data. To do so, use the
keyword “distinct” in the select clause.
Creating New Variable
PROC SQL allows you to create a new variable (column) in the query result. These columns can contain either text or calculations. You can add a text column to the query result by using a string or literal expression.
Take a look at this example program and its output dataset shown on the screen.
Proc SQL;
Quit;
Here, a new column “bonus,” is created, where observations are derived from the sales column. The
generated output is shown on the screen.
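A sketch of a query that adds a calculated column and a text column. The 10% bonus rate, the text value, and the column names Product and Category are assumptions, because the program from the screen is not reproduced in the transcript.

```sas
proc sql;
  select Product,
         Sales,
         Sales * 0.10 as Bonus,          /* the 10% rate is an assumption */
         "Electronics" as Category       /* constant text column */
    from Electronic;
quit;
```

The new columns exist only in the query result. The ‘Electronic’ dataset itself is not changed.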
Formatting the Variable in SAS
You can also change the format of the variable and assign a new label to the dataset.
Proc SQL;
Quit;
In this example, the attribute format is used to modify the format of the sales variable and a label is used
to name the output dataset. The generated output is shown on the screen.
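Column attributes can be set directly in the select clause. The sketch below is an assumption about what such a program could look like: it attaches a format and a label to the Sales column, and the label text is made up.

```sas
proc sql;
  select Product,
         Sales format=dollar10.2 label="Sales Amount"
    from Electronic;
quit;
```

In the printed result, the Sales column appears under the heading “Sales Amount” with a dollar sign and two decimal places.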
Case Expression
PROC SQL also allows you to process data conditionally. A case expression is a valid SQL expression that resolves to a table column, in which each value is compared against the when-conditions. Using a “Case” expression in the select clause, you can derive values based on the set conditions.
Proc SQL;
Case
End As Order_Priority1
from Electronic;
Quit;
In this example, from the Electronic dataset, the product and discount columns are selected and the condition is set on the sales column. The END keyword is required in the case expression. Also, set the conditions in descending order to increase efficiency, because SAS stops checking the case expression as soon as it finds the first true condition.
The output dataset is shown here. Note that the column “Order_Priority1” is generated.
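A complete case expression might look like the sketch below. Only the ‘High’ band (101 to 200) appears in the course material; the other band, the ELSE value, and the selected columns are assumptions added to make the example complete.

```sas
proc sql;
  select Product,
         Sales,
         case
           when Sales > 200 then 'Very High'             /* assumed band */
           when Sales between 101 and 200 then 'High'
           else 'Low'                                    /* assumed default */
         end as Order_Priority1
    from Electronic;
quit;
```

The conditions are listed from the highest band down, and each row takes the value of the first condition that is true.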
Referencing a CALCULATED Column
CALCULATED enables you to use the results of an expression in the same SELECT clause or in the WHERE
clause.
To derive the Net Profit, create a Tax column, which is 5% of the sales amount, and subtract Tax from
the profit.
You must use the CALCULATED keyword with the alias to inform PROC SQL that the value is calculated
within the query.
Otherwise, the SQL code will fail with a message similar to “column Tax was not found.”
Proc Sql;
Case
When Sales between 101 and 200 then 'High'
End as Order_Priority1
From Electronic;
Quit;
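The Tax and Net Profit calculation described for the CALCULATED keyword can be sketched as follows. It assumes that the Electronic dataset has a Profit column, which the transcript mentions but does not show.

```sas
proc sql;
  select Product,
         Sales,
         Sales * 0.05 as Tax,                      /* 5% of the sales amount */
         Profit - calculated Tax as Net_Profit     /* Profit is an assumed column */
    from Electronic;
quit;
```

Writing “Profit - Tax” without the CALCULATED keyword would fail, because Tax is an alias defined in the same query and not a column of the table.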
Create Totals— Example
Using SAS, you can also obtain the totals by Order_Priority1. Look at the example shown on the screen.
When it is given a single column as its argument, the SUM function adds up the values of that column across the rows of each group.
Proc Sql;
Select
Case
End as Order_Priority1,
sum(Sales) as Total_Sales Format=Dollar10.2,
count(*) as Number_Sales
From Electronic
group by Order_Priority1;
Quit;
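A complete version of the totals query could look like the sketch below. As before, only the ‘High’ band comes from the course material; the other band and the ELSE value are assumptions.

```sas
proc sql;
  select case
           when Sales > 200 then 'Very High'             /* assumed band */
           when Sales between 101 and 200 then 'High'
           else 'Low'                                    /* assumed default */
         end as Order_Priority1,
         sum(Sales) as Total_Sales format=dollar10.2,
         count(*) as Number_Sales
    from Electronic
    group by Order_Priority1;
quit;
```

The result has one row per priority band, with the total sales amount and the number of sales in that band.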
SQL Pass-Through Facility
The SQL Procedure Pass-Through Facility communicates with the DBMS through the SAS/ACCESS engine.
The Pass-Through Facility allows you to do the following:
• Pass native DBMS SQL statements to a DBMS
• Display the query results formatted as a report
• Create SAS datafiles and views from query results
Since the database is typically optimized and indexed to handle queries, complex joins are handled much
faster with a SQL pass-through query.
Take a look at the example program shown on the screen.
Use the CONNECT keyword to establish the connection to the DBMS.
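A pass-through query generally has the shape sketched below. The engine name (oracle), the connection alias (mydb), the connection options, and the macro variables &dbuser and &dbpass are all assumptions; the actual options depend on the SAS/ACCESS product and the DBMS in use.

```sas
/* Sketch only: engine, alias, connection options, and table are assumed. */
proc sql;
  connect to oracle as mydb (user=&dbuser password=&dbpass path="mydbpath");

  /* The query in parentheses is sent to the DBMS and runs there as native SQL. */
  select *
    from connection to mydb
      (select order_id, sales
         from electronic
        where sales > 100);

  disconnect from mydb;
quit;
```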
Creating a New Table
Using the “Create Table” statement, you can create a new table and define its columns and their
attributes. For each column, you can specify a name, type, length, format, and label.
Creating a New Table
Proc SQL;
Quit;
In this example, the “electronic_example” dataset is created. This dataset contains the rows from the
electronic dataset that have a high order priority.
The second select statement is used to show the complete electronic dataset. Note that only one table is
created using the “Create Table” statement.
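The program described above might be sketched as follows. The WHERE condition is an assumption: it reuses the 'High' range (Sales between 101 and 200) from the earlier CASE example to stand in for "high order priority."

```sas
/* Sketch: assumes Electronic contains a numeric Sales column. */
proc sql;
  /* First statement: creates the new table (no report is printed). */
  create table Electronic_Example as
    select *
    from Electronic
    where Sales between 101 and 200;   /* assumed definition of high priority */

  /* Second statement: only displays the complete Electronic dataset. */
  select *
  from Electronic;
quit;
```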
Knowledge Check
S.No. 1
Question: While using the CASE expression, values are compared to all the _____.
Answer & Explanation: c. While using the CASE expression, values are compared to all the When conditions.
Retrieving Data from Multiple Tables
So far you have learned how to retrieve data from a single table. Let’s now learn how to retrieve data
from multiple tables.
Retrieving Data from Multiple Tables
Combining multiple tables with Base SAS code requires several PROC SORT steps and a DATA step with a
MERGE statement. With PROC SQL, however, multiple datasets can be combined more easily.
To select data from multiple tables, simply join the tables in a query.
Retrieving Data from Multiple Tables
Let’s step into the “Syntax classroom” to learn the syntax for selecting from two tables using PROC SQL.
Retrieving Data from Multiple Tables
proc sql;
select *
Quit;
Use the keyword “select” to choose the columns to retrieve. The asterisk symbol selects all the columns
from tables 1 and 2. To select a particular column from a table, simply mention the column name after the
keyword select.
Selecting Data from Multiple Tables
The data that you need for research can come from different sources. To combine them, simply
join the tables in a query.
There are two types of joins: Inner Join and Outer Join.
• The Inner Join selects the rows from both tables for which there is a match between the join columns
in both tables.
• The Outer Join returns the matching rows plus some or all of the unmatched rows from one or both tables.
Selecting Data from Multiple Tables
The INNER JOIN selects all rows from both tables as long as there is a match between the columns in
both tables.
Selecting Data from Multiple Tables
You can perform an inner join by using a list of table-names separated by commas with the WHERE
clause or by using the INNER JOIN and ON keywords.
Selecting Data from Multiple Tables
Let’s take an example of how to join the Electronic and Electronic_CustInfo datasets to attach the
customer name and customer ID to each order.
You can select all columns from both tables with * and use the FEEDBACK option.
The FEEDBACK option lets you see exactly how PROC SQL is implementing your query.
In the log, you can see all column names prefixed with the e and c table aliases. The output obtained is
shown on the screen.
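The join with the FEEDBACK option can be tried on two small stand-in tables, as in the sketch below. The table names Electronic_Demo and CustInfo_Demo, their rows, and the use of Order_ID as the join key are assumptions made so that the example is self-contained.

```sas
/* Hypothetical sample data standing in for Electronic and Electronic_CustInfo. */
data Electronic_Demo;
  input Order_ID Product $ Sales;
  datalines;
101 Watch 150
102 Iron 90
103 LED 210
;
run;

data CustInfo_Demo;
  input Order_ID Customer_ID Customer_Name $;
  datalines;
101 1 Asha
103 2 Ravi
104 3 Maya
;
run;

/* FEEDBACK writes the expanded query (with e. and c. prefixes) to the log. */
proc sql feedback;
  select *
  from Electronic_Demo as e, CustInfo_Demo as c
  where e.Order_ID = c.Order_ID;   /* inner join: only orders 101 and 103 match */
quit;
```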
Selecting Data from Multiple Tables
You can also customize your query by selecting only required columns in the order you prefer. Observe
the changes made in the code to select preferred columns.
Selecting Data from Multiple Tables
You can obtain the same results by performing the inner join either with a WHERE clause or with the
INNER JOIN and ON keywords.
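The two equivalent forms can be compared side by side in the following sketch. It assumes that the Electronic and Electronic_CustInfo datasets already exist and share an Order_ID column; the selected column list is illustrative.

```sas
proc sql;
  /* Form 1: comma-separated table list with a WHERE clause */
  select e.Order_ID, c.Customer_ID, c.Customer_Name, e.Product, e.Sales
  from Electronic as e, Electronic_CustInfo as c
  where e.Order_ID = c.Order_ID;

  /* Form 2: INNER JOIN ... ON, which returns the same rows */
  select e.Order_ID, c.Customer_ID, c.Customer_Name, e.Product, e.Sales
  from Electronic as e
       inner join Electronic_CustInfo as c
       on e.Order_ID = c.Order_ID;
quit;
```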
Selecting Data from Multiple Tables
In contrast with an inner join, an outer join keeps rows that match the condition as well as some or all of
the unmatched data from one or both tables.
There are three types of outer joins: left, right, and full.
Selecting Data from Multiple Tables
The LEFT JOIN returns all rows from the left table (table1), along with the matching rows from the right
table (table2). The electronic dataset and electronic customer information dataset are taken as an example.
Selecting Data from Multiple Tables
The RIGHT JOIN returns all rows from the right table (table2), with the matching rows in the left table
(table1). The electronic dataset and electronic customer information dataset are taken as an example.
Selecting Data from Multiple Tables
The FULL OUTER JOIN returns all rows from the left table (table1) and from the right table (table2). The
electronic dataset and electronic customer information dataset are taken as an example.
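The three outer joins differ only in the join keyword, as the sketch below illustrates. It assumes that Electronic and Electronic_CustInfo exist and share an Order_ID column; the selected columns are illustrative.

```sas
proc sql;
  /* LEFT JOIN: every order, with customer columns missing when unmatched */
  select e.Order_ID, e.Sales, c.Customer_Name
  from Electronic as e
       left join Electronic_CustInfo as c
       on e.Order_ID = c.Order_ID;

  /* RIGHT JOIN: every customer record, with order columns missing when unmatched */
  select c.Order_ID, e.Sales, c.Customer_Name
  from Electronic as e
       right join Electronic_CustInfo as c
       on e.Order_ID = c.Order_ID;

  /* FULL JOIN: all rows from both tables; COALESCE picks whichever key is present */
  select coalesce(e.Order_ID, c.Order_ID) as Order_ID, e.Sales, c.Customer_Name
  from Electronic as e
       full join Electronic_CustInfo as c
       on e.Order_ID = c.Order_ID;
quit;
```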
Concatenating Query Results
You can concatenate two query results using the “Union” operator. The Union operator keeps the unique
rows from both query results and generates a report.
Remember that “Union” does not return duplicate rows. If a row occurs more than once, then only one
occurrence is returned.
Concatenating Query Results
Sometimes, you need to return duplicate rows as well. In this case, you can use the keyword “Union All,”
which keeps duplicate rows in the output.
You can also combine two or more query results using the operators Except, Intersect, and Outer
Union.
Use the operator “Except” to produce rows that are part of the first query only.
Use the operator “Intersect” to produce rows that are common to both the queries.
Demo - Concatenating Query Results
This demo shows you how to concatenate the query results using the operator “Union.”
The two datasets, North and South, are imported into SAS.
The variables Order ID, Region, and Sales Amount have been selected from the datasets “North” and
“South” using the keyword “Select.”
The keyword “Union” is used to concatenate the two datasets. The Union operator produces all unique
rows from both queries.
Note that the variables selected in both the datasets are the same.
This concludes the demo on how to concatenate the query results using the operator “Union.”
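The set operators can be compared on a small example such as the sketch below. The rows and the column names Order_ID, Region, and Sales_Amount are hypothetical; one row is deliberately present in both tables to show how duplicates are handled.

```sas
/* Hypothetical sample data; the row "2 North 80" appears in both tables. */
data North;
  input Order_ID Region $ Sales_Amount;
  datalines;
1 North 120
2 North 80
;
run;

data South;
  input Order_ID Region $ Sales_Amount;
  datalines;
3 South 200
2 North 80
;
run;

proc sql;
  /* UNION: unique rows from both queries (3 rows) */
  select Order_ID, Region, Sales_Amount from North
  union
  select Order_ID, Region, Sales_Amount from South;

  /* UNION ALL: keeps the duplicate (4 rows) */
  select Order_ID, Region, Sales_Amount from North
  union all
  select Order_ID, Region, Sales_Amount from South;

  /* EXCEPT: rows found only in the first query (order 1) */
  select Order_ID, Region, Sales_Amount from North
  except
  select Order_ID, Region, Sales_Amount from South;

  /* INTERSECT: rows common to both queries (order 2) */
  select Order_ID, Region, Sales_Amount from North
  intersect
  select Order_ID, Region, Sales_Amount from South;
quit;
```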
Activity
Read the problem carefully and analyze what needs to be done using SAS techniques.
Create a new table with a new variable which is 10% of Sales if Sales is greater than 100 and 5% of Sales
if sales is less than 100 from the Electronic Dataset.
Click each code in the correct sequence to write the program that will be the solution to the
problem. Click the dataset tab to view them.
Hint: Name the new table as “Electronic_Data1” and new variable as “Incentive.” Semicolon can be
clicked any number of times.
Assignment
Let’s practice what you have learned so far in this lesson. There are two Mini Projects in this lesson. Read
each question carefully and then answer it. The techniques and steps are provided to assist you under
the guide section.
Assignment
ABC eCommerce company has to create a report in SAS from the master dataset.
The report should display the total sales and profits details for the watch, iron, LED, and LCD products in
descending order.
Assignment
3. Look only for the watch, iron, LED and LCD products
Assignment
We recommend that you first solve the project and then view the solution to assess your learning.
You can perform this project in the installed SAS University Edition.
Assignment 2
ABC eCommerce company has a requirement to create a new table with the variables Order_ID,
Order_Date, Product, and Sales from the “Electronic” dataset and Customer_ID and Customer_Name
from the “Electronic_Custinfo” dataset.
This table should contain the rows from the Electronic and Electronic_Custinfo datasets that have a
sales value greater than 150, sorted by order ID in descending order.
Assignment 2
2. Extract the Order_ID, Order_Date, Product, and Sales variables from the Electronic dataset and
Customer_ID and Customer_Name for all records from the Electronic_Custinfo dataset
Assignment 2
We recommend that you first solve the project and then view the solution to assess your learning.
You can perform this project in the installed SAS University Edition.
Key Takeaways
Let’s now quickly recap the concepts you have learned in the lesson:
Structured Query Language, or SQL, is a generic database language that helps you communicate with
databases.
PROC SQL allows you to retrieve, summarize, sort, join, and concatenate datasets or databases available
in SAS.
WHERE clause,
GROUP BY clause,
ORDER BY clause.
The asterisk symbol selects all the columns from the table.
The Inner Join selects all rows from both tables as long as there is a match between the columns in both
tables.
The Outer Join returns the matching rows plus some or all of the unmatched rows from one or both tables.
Conclusion
ANSWERS:
Lesson 05 — SAS Macros
Introduction
Hi, and welcome back to the “Data Science with Statistical Analysis System or SAS” course offered by
Simplilearn.
In this lesson, “SAS Macros,” you will be introduced to the essential concepts of SAS macros.
What’s In It for Me
In this lesson, you will learn how to minimize the amount of SAS code using SAS Macros.
You will learn how to use macro functions to manipulate character strings and text.
You will also identify the differences between automatic and user-defined macro variables.
Need for SAS Macros
You have a program and you need to run it over and over again. Writing the program every time is time
consuming and tiring.
SAS allows you to use macros in your program which reduces the time spent writing the same code
repeatedly.
Need for SAS Macros
• Changes made in one location of your program cascade throughout your program.
• The programs are data driven, letting SAS decide what to do based on actual data values.
The purpose of the SAS macro language is to generate text that is used in SAS programs; this text can be
any valid SAS code, such as statements, variables, text strings, and PROC steps.
Macro variables
Macro variables are tools that enable you to dynamically modify the text in a SAS program through symbolic
substitution. You can assign large or small amounts of text to macro variables, and after that, you can use
that text by simply referencing the variable that contains it. Macro variable values have a maximum length of
65,534 characters.
Macro variables
Let’s step into the syntax classroom to learn how to refer to a macro variable in the code.
Automatic Macro Variables
Macro variables defined by the macro processor are called automatic macro variables. These variables are
global in scope, so they are available anywhere in the SAS session.
To reference an automatic macro variable, use an ampersand followed by the macro variable name, which
starts with the three-letter prefix “SYS.”
The &SYSLAST macro variable returns the name of the most recently created SAS data set.
Automatic Macro Variables
The &SYSNOBS macro variable returns the number of observations in the last data set.
Automatic Macro Variables
&SYSDATE and &SYSDATE9 values represent the date on which a SAS session began executing in the two- and
four-digit format of the year, respectively.
Automatic Macro Variables
The &SYSDAY macro variable returns the day of the week on which the SAS job or session began executing.
Automatic Macro Variables
The &SYSTIME macro variable returns the time at which the SAS job or session began executing.
Automatic Macro Variables
Use the command “%PUT _AUTOMATIC_” to view all available automatic macro variables.
Automatic macro variables
Let’s understand the automatic macro variable with the help of an example.
run;
The “&SYSDAY” and “&SYSDATE” are automatic macro variables created when the SAS session starts.
When the above code is run, we get the output as shown on the screen.
Note that an ampersand symbol is used to refer to those values in the title statement.
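A title that uses automatic macro variables might be written as in the sketch below. The choice of the sashelp.class sample table and the title wording are assumptions; the point is that the title text must be in double quotation marks for the macro references to resolve.

```sas
/* Macro references resolve inside double quotes, not single quotes. */
title "Report created on &SYSDAY, &SYSDATE";

/* sashelp.class is a sample table shipped with SAS, used here as a stand-in. */
proc print data=sashelp.class(obs=5);
run;

title;   /* clear the title so it does not carry over to later output */
```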
User-Defined Macro Variables
User-defined macro variables enable you to create a value once and substitute that value
repeatedly within a program.
User-Defined Macro Variables
Let’s step into the syntax classroom to learn the syntax of a user-defined macro variable.
The %LET Statement
To create a macro variable, after the keyword %LET, specify the name of the macro variable you want to
create, an equal sign, and then the value of the macro variable.
Use the command “%PUT _user_” to view all user-defined macro variables in the SAS log.
The %LET Statement
Use the command “%PUT _ALL_” to view all user-defined and automatic macro variables in the SAS log.
The %LET Statement
To delete user-defined macro variables, specify the variable names after the “%SYMDEL” statement.
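The life cycle of a user-defined macro variable can be sketched in a few statements. The name Order and the value High follow the lesson's later example; everything is written to the SAS log.

```sas
%let Order = High;        /* create the macro variable Order with the value High */

%put _user_;              /* list all user-defined macro variables in the log    */
%put Order resolves to: &Order;

%symdel Order;            /* delete the macro variable (no ampersand here)       */
```

Note that %SYMDEL takes the variable name without an ampersand; writing &Order there would pass the variable's value instead of its name.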
User-defined Macro Variable
Look at the following example program that explains the use of a user-defined macro variable.
run;
“High” is the value, and it can be any numeric, text, or date value. “Order” is the name of the macro
variable.
When the above code is run, we get the output as shown on the screen. Note that only the column “order
with high value” is generated.
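A program of this kind might look like the following sketch. The dataset Electronic_Sample, its rows, and the column name Order_Priority are hypothetical stand-ins; only the macro variable name Order and the value High come from the lesson.

```sas
/* Hypothetical sample data. */
data Electronic_Sample;
  input Order_ID Order_Priority $;
  datalines;
1 High
2 Low
3 High
;
run;

%let Order = High;                       /* value used in the filter and the title */

title "Orders with &Order priority";
proc print data=Electronic_Sample;
  where Order_Priority = "&Order";       /* resolves to "High" before execution    */
run;
title;
```

Changing the single %LET statement is enough to rerun the same report for a different priority.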
Macro Functions
Similar to Base SAS functions, SAS macro functions are built-in programming routines that enable you to
perform many types of data manipulation tasks.
The syntax of a macro function is similar to that of a SAS function, and the two yield similar results.
The difference is that macro functions are executed by the macro processor.
Macro Functions – Strings and Text
Macro character functions help you to change lowercase words to uppercase, extract a substring of a
character string, get a word from a text, and so on.
Macro Functions – Logical Operations and Execution
Macro functions also perform arithmetic and logical operations. These include tasks such as performing
simple arithmetic, computing dates, and evaluating logical expressions.
The %SYSEVALF function evaluates arithmetic and logical expressions using floating-point arithmetic.
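The effect of floating-point evaluation is easiest to see next to the integer-only %EVAL function, which the lesson does not cover but which is part of the same macro facility. The variable names a and b are illustrative.

```sas
%let a = 10;
%let b = 4;

%put Integer arithmetic:        %eval(&a / &b);        /* writes 2   */
%put Floating-point arithmetic: %sysevalf(&a / &b);    /* writes 2.5 */
%put Logical expression:        %sysevalf(&a > &b);    /* writes 1 (true) */
```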
Macro Functions – Example 1
Let’s track the E-commerce dataset variables “Sales” and “Profit” for previous years, grouped by the ship
mode type.
Look at the program shown on the screen to understand how to assign a value to a macro variable and how
to manipulate it.
DBMS=XLSX
OUT=WORK.E_Commerce;
GETNAMES=YES;
RUN;
%let DSN=E_Commerce;
title1 "%UPCASE(%SCAN(&VAR,1)) and %UPCASE(%SCAN(&VAR,2)) for %UPCASE(&DSN) channel";
var &var;
class Ship_Mode;
run;
The %LET statement creates a macro variable and assigns a value to it. Here, DSN and VAR are the macro
variables. The value E_Commerce is assigned to the DSN macro variable, and the value Sales Profit is
assigned to the VAR macro variable.
Macro Functions – Example 1
To refer to a macro variable, precede the name of the macro variable with an ampersand symbol.
The macro processor resolves the reference and substitutes the macro variable's value before the program
compiles and executes.
Thus, the reference “&DSN” is replaced with the value “E_Commerce” and “&var” with the value “Sales Profit.”
Macro Functions – Example 1
The %SCAN function extracts the nth word from a macro variable, where the words are separated by
delimiters. The default delimiters are shown on the screen.
Macro Functions – Example 1
The first %SCAN function extracts the “Sales” value from the macro variable “&var.”
The second %SCAN function extracts the “Profit” value from the macro variable “&var.”
Macro Functions – Example 1
The %UPCASE function converts characters to uppercase before substituting that value in a SAS
program.
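The behavior of %SCAN and %UPCASE can be checked in the SAS log with a few %PUT statements. The sketch uses the same VAR value as the example program.

```sas
%let VAR = Sales Profit;

%put First word:  %scan(&VAR, 1);             /* writes Sales  */
%put Second word: %scan(&VAR, 2);             /* writes Profit */
%put Upper case:  %upcase(%scan(&VAR, 1));    /* writes SALES  */
```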
Macro Functions – Example 1
The automatic macro variable “&sysdate” is used in the WHERE clause to determine the current year and
obtain the data prior to it. Here, the WHERE clause extracts rows whose year value is less than the current year.
The %SYSFUNC function lets the macro processor call the YEAR() function on the value of the automatic
macro variable “&sysdate” to extract the current year.
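Putting the pieces together, the program discussed in this example might resemble the sketch below. The choice of PROC MEANS and the date column name Order_Date are assumptions, since those parts of the program are not shown in the text; the title, VAR, and CLASS statements follow the lesson.

```sas
/* Sketch: assumes WORK.E_Commerce exists with Sales, Profit, Ship_Mode,
   and a SAS date column named Order_Date (assumed name). */
%let DSN = E_Commerce;
%let VAR = Sales Profit;

proc means data=&DSN;
  /* Keep only rows from years before the year in which the session started. */
  where year(Order_Date) < %sysfunc(year("&sysdate9"d));
  title1 "%UPCASE(%SCAN(&VAR,1)) and %UPCASE(%SCAN(&VAR,2)) for %UPCASE(&DSN) channel";
  var &VAR;
  class Ship_Mode;
run;
title1;
```

To report on other variables, only the %LET VAR statement needs to change.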
Macro Functions – Example 1
When you run this code, you get the output as shown on the screen.
Macro Functions – Example 2
It’s easy to make changes and track statistics for the Aging and Discount values for prior years using SAS
macro functions.
Let’s consider the same program to track statistics for the Aging and Discount value in the previous years.
Simply change the value of the macro variable from Sales Profit to Aging Discount and run the program.
Macro Functions – Example 2
In the output window, you can see the updated report for Aging and Discount statistics.
Macro Functions—Logical Operations and Execution
SAS also allows you to verify the values of macro variables and display them in the SAS log.
Consider the same example. To verify the values of macro variables, simply add the statement “options
symbolgen;” to the program.
SYMBOLGEN System Option
In the SAS log, you can find SYMBOLGEN messages that display the values of macro variables.
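A short way to see these messages is sketched below, using the DSN macro variable from the example program.

```sas
options symbolgen;        /* turn on resolution messages */

%let DSN = E_Commerce;
%put The dataset is &DSN;
/* The log shows a line such as:
   SYMBOLGEN:  Macro variable DSN resolves to E_Commerce */

options nosymbolgen;      /* turn the messages off again */
```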
SQL Clauses for Macros
You can use PROC SQL to analyze data, calculate values, and create macro variables in a single step.
Suppose you need to store a list of Regions from the E-Commerce data in a macro variable.
from E_Commerce ;
quit;
%put Regions=&Regions;
SQL Clauses for Macros
The INTO clause of the PROC SQL procedure is a very convenient way to store all unique values in one macro
variable.
The SEPARATED BY clause specifies the character or characters used as a delimiter in the value of the
macro variable. Here, the unique regions are separated by a comma.
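The INTO and SEPARATED BY clauses work together as in the sketch below. The column name Region is an assumption; the dataset name and the macro variable name Regions follow the lesson.

```sas
/* Sketch: assumes WORK.E_Commerce has a character column named Region. */
proc sql noprint;                      /* NOPRINT: store the value, print no report */
  select distinct Region
    into :Regions separated by ','     /* colon marks Regions as a macro variable   */
  from E_Commerce;
quit;

%put Regions=&Regions;                 /* writes the comma-separated list to the log */
```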
SQL Clauses for Macros (contd.)
The %PUT statement writes the value of a macro variable to the SAS log. Here “Regions” is the macro variable.
Now let's do a Knowledge check of what you have learned so far.
S.No. 1
Question: Global variables cannot be accessed by any SAS program available in the SAS environment.
Answer & Explanation: b. Global variables can be accessed by any SAS program available in the SAS environment.
The %Macro Statement
Sometimes, you need to interpret the sales results of various regions. Writing a program for each region is
repetitive and time consuming. Using the %MACRO statement, you can define the program once and pass the
required parameters to it.
A parameter list can contain any number of macro parameters separated by commas. Note that you cannot
use a text expression to generate a macro name in a %MACRO statement.
Let’s step into the classroom to learn the syntax of the %macro statement.
The %Macro Statement
Macro Statements;
%MEND;
The %Macro Statement (contd.)
You can call the macro by mentioning the macro name and passing the required values into it.
The %Macro Statement (contd.)
Note that semicolons are not required for macro calls, but including one is considered good programming practice.
The %Macro Statement–Example
Look at the following example to understand how to create a macro with the %MACRO statement and call it.
%Macro Output(Sales_Amount=);
Run;
%Mend;
%Output(Sales_Amount=200);
Here, the macro name is Output, the parameter is Sales_Amount, the macro statement is a WHERE condition
that keeps the rows in which Sales is greater than the sales amount, and the value passed is 200.
When you run this code, you get the output as shown on the screen.
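A full version of such a macro might be sketched as follows. The use of PROC PRINT and of the Electronic dataset with a numeric Sales column are assumptions; the macro name, parameter, and value follow the lesson.

```sas
/* Sketch: assumes a dataset named Electronic with a numeric Sales column. */
%macro Output(Sales_Amount=);
  proc print data=Electronic;
    where Sales > &Sales_Amount;    /* parameter value is substituted here */
  run;
%mend Output;

/* Call the macro, passing 200 as the keyword parameter value. */
%Output(Sales_Amount=200)
```

Calling the macro again with a different value, for example %Output(Sales_Amount=500), reruns the same report with a new threshold.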
Let’s step into the classroom to learn the syntax of the conditional statement.
The Conditional Statement
action;
%END;
The mentioned action will be executed only if the condition set is fulfilled.
The Conditional Statement–Example
%Macro Output(Sales_Amount=);
Run;
%End;
%Else %do;
Run;
%End;
%Mend;
%Output(Sales_Amount=250);
Here, the macro name is Output, the parameter is Sales_Amount, and the value is 250.
According to the condition set, the PROC PRINT procedure is executed if the sales amount is greater than
200; otherwise, the PROC CONTENTS procedure is executed.
Note that the sales amount value passed here is 250. Because 250 is greater than 200, the PROC PRINT
procedure is executed.
When you run this code, you get the output as shown on the screen.
The Conditional Statement–Example (contd.)
If the sales amount value is passed as 150, which is below 200, the set condition becomes false. Therefore,
the else part of the conditional statement gets executed. The PROC CONTENTS procedure is present in the
else part, and the output obtained is shown on the screen.
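The conditional macro described in this example can be sketched in full as shown below. The Electronic dataset name is an assumption; the threshold of 200 and the two procedures follow the lesson.

```sas
/* Sketch: assumes a dataset named Electronic exists. */
%macro Output(Sales_Amount=);
  %if &Sales_Amount > 200 %then %do;
    proc print data=Electronic;       /* runs when the value is greater than 200 */
    run;
  %end;
  %else %do;
    proc contents data=Electronic;    /* runs for any value of 200 or less       */
    run;
  %end;
%mend Output;

%Output(Sales_Amount=250)   /* condition is true: PROC PRINT runs     */
%Output(Sales_Amount=150)   /* condition is false: PROC CONTENTS runs */
```

The %IF test is evaluated by the macro processor, so only one of the two PROC steps is generated for each call.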
Activity
You are a SAS developer in a leading organization and need to prepare a report from an ecommerce dataset.
The condition to extract the data varies based on the management requirements daily, say, if you need to
fetch the LED products or watches for instance. You felt that writing code for each product and varied
requirements daily was time consuming and tiring.
Which of the following concepts would you use to code for the above requirement?
Let the dataset name be electronic, the macro name be productwise, and the value for macro be watch.
Assignment
Let’s practice what you have learned so far in this lesson. Read the question carefully and answer it. The
techniques and steps are provided to assist you under the guide section.
Assignment
A famous ecommerce company wants to create a macro to sort data from the Electronic dataset. It wants to
pass different variable names and a title as macro parameters, and print the dataset with the title.
Assignment
3. Check if field name is sales and sort the report per the requirement.
Assignment
We recommend that you first solve the project and then view the solution to assess your learning.
You can perform this project in the installed SAS University Edition.
Key Takeaways
Conclusion
This concludes the “SAS Macros” lesson. The next lesson is “Basics of Statistics.”
ANSWERS:
S.No. 1
Question: Macros in SAS start with _____.
Answer & Explanation: c. Macros in SAS start with %Macro.
Lesson 06 — Basics of Statistics
Introduction
Hi, and welcome back to the Data Science with Statistical Analysis System or SAS course offered by
Simplilearn.
In this lesson, “Basics of Statistics,” you will be introduced to the essential concepts of statistics used in
the Statistical Analysis System.
What’s In It for Me
In this lesson, you will understand what Descriptive Statistics is, its uses, and how it helps to analyze
data. You will learn the various testing techniques used in inferential statistics. You will also
understand the differences between parametric and non-parametric techniques.
Introduction to Statistics
Introduction to Statistics (contd.)
Statistics is widely used to understand the complex problems of the real world and simplify them to make
well-informed decisions.
Introduction to Statistics (contd.)
Several statistical principles, functions, and algorithms can be used to analyze primary data, build a
statistical model, and predict the outcomes.
Statistical and Non-statistical Analysis
An analysis of any situation can be done in two ways: statistical analysis or non-statistical analysis.
Statistical analysis is the science of collecting, exploring, and presenting large amounts of data to identify
patterns and trends. Statistical analysis is also called Quantitative Analysis.
Non-statistical analysis provides generic information and includes text, sound, still images, and moving
images. Non-statistical analysis is also called Qualitative Analysis.
Although both forms of analysis provide results, statistical analysis gives more insight and a clearer
picture, a feature that makes it vital for businesses.
Major Categories of Statistics
There are two major categories of statistics: Descriptive Statistics and Inferential Statistics.
Major Categories of Statistics (contd.)
Descriptive Statistics helps organize data and focuses on the main characteristics of the data. It provides
a summary of the data numerically or graphically. Numerical measures, such as average, mode, standard
deviation or SD, and correlation are used to describe the features of a dataset.
Major Categories of Statistics (contd.)
Suppose you want to study the height of students in a classroom. In descriptive statistics, you would
record the height of every person in the classroom and then find out the maximum height, minimum
height, and average height of the population.
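In SAS, such a descriptive summary can be produced with PROC MEANS. The sketch below uses sashelp.class, a small sample table of students shipped with SAS, as a stand-in for the classroom data.

```sas
/* sashelp.class contains one row per student, including a Height column. */
proc means data=sashelp.class n min max mean maxdec=1;
  var Height;    /* count, minimum, maximum, and average height */
run;
```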
Major Categories of Statistics (contd.)
Inferential Statistics generalizes from a sample to the larger dataset and applies probability theory to draw a conclusion. It
allows you to infer population parameters based on the sample statistics and to model relationships
within the data. Modeling allows you to develop mathematical equations which describe the
interrelationships between two or more variables.
Major Categories of Statistics (contd.)
Consider the same example of calculating the height of students in the classroom. In Inferential
Statistics, you would categorize height as “tall,” “medium,” and “small” and then take only a small
sample from the population to study the height of students in the classroom.
Statistical Terms
The field of statistics touches our lives in many ways. From the daily routines in our homes to the
business of making the greatest cities run, the effects of statistics are everywhere.
Statistical Terms (contd.)
There are various statistical terms that one should be aware of while dealing with statistics:
Population
Sample
Variable
Quantitative variable
Qualitative variable
Discrete variable
Continuous variable
Statistical Terms (contd.)
A variable is a feature that is characteristic of any member of the population differing in quality or
quantity from another member.
Statistical Terms (contd.)
A variable differing in quantity is called a quantitative variable, for example, the weight of a person or
the number of people in a car.
Statistical Terms (contd.)
A variable that differs in quality is called a qualitative variable or attribute, for example, color or the
degree of damage to a car in an accident.
A discrete variable is one that cannot take any value between two adjacent permitted values. For example,
the number of children in a family can be 2 or 3 but not 2.5.
A continuous variable is one that can take any value between two given values. For example, the time
taken for a 100-meter run.
Types of Statistical Measures
Typically, there are four types of statistical measures used to describe the data. They are:
Measures of Frequency
Measures of Central Tendency
Measures of Spread
Measures of Position
The frequency of a data value is the number of times that value occurs in the given dataset.
The measures of frequency are count and percentage.
Central tendency indicates whether the data values tend to accumulate in the middle of the distribution
or toward the end. The measures of central tendency are mean, median, and mode.
Spread describes how similar or varied the observed values of a particular variable are. The
measures of spread are standard deviation, variance, and quartiles. Measures of spread are also
called measures of dispersion.
Position identifies the exact location of a particular data value in the given dataset. The measures of
position are percentiles, quartiles, and standard scores.
Procedures in SAS for Descriptive Statistics
Statistical Analysis System, or SAS, provides a list of procedures to perform descriptive statistics. They
are as follows:
Proc Print
Proc Contents
Proc Means
Proc Freq
Proc Univariate
Proc GChart
Proc Boxplot
Proc Gplot
Proc Means – Provides data summarization tools to compute descriptive statistics for variables across
all observations and within groups of observations.
Proc Freq – Produces one-way to n-way frequency and cross-tabulation tables. The frequencies can also be
written to an output SAS dataset.
Proc Univariate – Goes beyond what PROC MEANS does. It is useful for basic statistical analyses and
includes high-resolution graphical features.
Proc GChart – Produces six types of charts: block charts, horizontal and vertical bar charts, pie and
donut charts, and star charts. These charts graphically represent the value of a statistic
calculated for one or more variables in an input SAS dataset. The charted variables can be either numeric
or character.
Proc Boxplot – Creates side-by-side box-and-whisker plots of measurements organized in groups. A
box-and-whisker plot displays the mean, quartiles, and minimum and maximum observations for a group.
Proc Gplot – Creates two-dimensional graphs, including simple scatter plots, overlay plots in which
multiple sets of data points are displayed on one set of axes, plots against a second vertical axis,
bubble plots, and logarithmic plots.
Demo- Descriptive Statistics
In this demo, you will learn how to use descriptive statistics to analyze the mean of the variables in the
electronic dataset.
In the left pane, right-click the electronic.xlsx file and click Import Data.
The code to import the data is generated automatically. Copy the code and paste it into a new window.
The MEANS procedure is used to analyze the mean of the imported dataset.
The DATA= option identifies the input dataset. In this demo, the input dataset is “electronic.”
Note that the number of observations, mean, standard deviation, and minimum and maximum values of
the variables in the electronic dataset are obtained.
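The demo code itself is not reproduced in this transcript. The following is a minimal sketch of the PROC MEANS step, assuming a small hand-typed stand-in for the imported data. The variable names product, aging, and sales and all of the values are hypothetical, not the contents of electronic.xlsx.

```sas
/* Hypothetical stand-in for the imported electronic dataset */
data electronic;
    input product $ aging sales;
    datalines;
TV 5 1200
Radio 7 150
Phone 6 800
TV 8 1350
Phone 4 760
;
run;

/* With no statistic keywords, PROC MEANS reports N, mean,
   standard deviation, minimum, and maximum for each numeric variable */
proc means data=electronic;
run;
```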
This concludes the demo on how to use Descriptive Statistics to analyze the mean from the electronic
database.
Knowledge Check
S.No.: 1
Question: XYZ plywood manufacturing company wants to check the strength of its plywood. The company
picks one out of every 200 pieces of plywood as a sample to test the quality. What is this scenario an
example of?
Answer & Explanation: b. It is an example of Inferential Statistics. It allows you to infer population
parameters based on sample statistics.
Hypothesis Testing
So far you have learned about descriptive statistics. Let’s now learn about inferential statistics.
Hypothesis testing is an inferential statistical technique to determine whether there is enough evidence
in a data sample to infer that a certain condition holds true for the entire population. To understand the
characteristics of the general population, we take a random sample and analyze the properties of the
sample. We then test whether or not the identified conclusions correctly represent the population as a
whole.
The purpose of hypothesis testing is to choose between two competing hypotheses about the value of a
population parameter.
For example, one hypothesis might claim that the wages of men and women are equal, while the other
might claim that women make more than men.
The null hypothesis is assumed to be true unless there is strong evidence to the contrary.
The alternative hypothesis is accepted when the null hypothesis is rejected.
Let’s understand the null hypothesis and the alternative hypothesis using a general example.
The null hypothesis states that no variation or difference exists between variables, and the alternative
hypothesis is any hypothesis other than the null. For example, say a pharmaceutical company has
introduced a medicine in the market for a particular disease, people have been using it for a
considerable period of time, and it is generally considered safe. The statement that the medicine is safe
is the null hypothesis.
To reject the null hypothesis, we need strong evidence that the medicine is unsafe. If the null hypothesis
is rejected, then the alternative hypothesis is accepted.
Variable Types
Before you perform any statistical test, it is important to recognize the nature of the variables
involved. Based on their nature, variables are classified into four types.
They are categorical or nominal variables, ordinal variables, interval variables, and ratio variables.
Nominal variables are ones which have two or more categories, and it is impossible to order the values.
Examples of nominal variables include gender and blood group.
Ordinal variables have values ordered logically. However, the relative distance between two data values
is not clear. Examples of ordinal variables include the size of a coffee cup (small, medium, and large)
and the rating of a product (bad, good, and best).
Interval variables are similar to ordinal variables, except that the values are measured in a way where
their differences are meaningful. With an interval scale, equal differences between scale values have
equal quantitative meaning. For this reason, an interval scale provides more quantitative information
than an ordinal scale. However, the interval scale does not have a true zero point. A true zero point
means that a value of zero on the scale represents zero quantity of the construct being assessed.
An example of an interval variable is temperature measured on the Fahrenheit scale. The Fahrenheit scale
does not have a true zero point, because 0 degrees Fahrenheit does not indicate an absence of temperature.
Ratio scales are similar to interval scales in that equal differences between scale values have equal
quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional
property: ratios of values are meaningful. For example, the system of inches used with a common ruler is
a ratio scale. There is a true zero point because zero inches does in fact indicate a complete absence of
length.
Hypothesis Testing – Process
Let’s understand the process of hypothesis testing. There are four steps to test a hypothesis about
any variable.
The first step is to make assumptions and state the null hypothesis and the alternative hypothesis.
Assume each sample is an independent random sample and the response variable follows a normal
distribution. The null hypothesis, or H0, states that a population parameter is equal to a specific
value. The alternative hypothesis, or H1, states that the population parameter is different from the value
stated in the null hypothesis. The alternative hypothesis is what the researcher believes to be true or
hopes to demonstrate.
The second step is to select the appropriate test statistic and the level of significance.
If the population standard deviation, σ, is known and either the data is normally distributed or the
sample size n is greater than 30, you can use the normal distribution or z-statistic.
If the population standard deviation, σ, is unknown and either the data is normally distributed or the
sample size is greater than 30, you can use the t-distribution or t-statistic.
The third step is to calculate the p-value. Compute the appropriate test statistic and make the decision.
Use the formulas shown on the screen to obtain the p-value depending on the statistic.
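The on-screen formulas are not reproduced in this transcript. As a reminder, and assuming a one-sample test of a population mean against a hypothesized value μ0 (an assumption, since the slide itself is unavailable), the standard test statistics are:

```latex
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
\qquad\qquad
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \quad \text{with } n - 1 \text{ degrees of freedom}
```

Here x̄ is the sample mean, σ is the known population standard deviation, s is the sample standard deviation, and n is the sample size. For a two-sided test, the p-value is the probability, under H0, of obtaining a statistic at least as extreme in absolute value as the one observed.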
The fourth step is to compare the p-value to alpha to interpret the decision.
If the p-value is less than alpha, the evidence against the null hypothesis is strong, so
you reject the null hypothesis.
If the p-value is greater than alpha, the evidence against the null hypothesis is weak, so you fail
to reject the null hypothesis.
If the p-value is exactly equal to alpha, the evidence is borderline, neither strong nor weak. In
this case, you draw your own conclusions.
Knowledge Check
S.No.: 1
Question: Which of the following has values in a logical order?
Answer & Explanation: b. Ordinal variables take on values that can be logically ordered or ranked.
Demo-Hypothesis Testing
In this demo, you will learn how to perform hypothesis testing using SAS.
In this example, let’s test the aging values, the number of days taken to deliver a product, of certain
observations from a random sample.
The INPUT statement declares the aging variable, and the CARDS statement reads the data into SAS.
Let’s assume the null hypothesis to be that the mean number of days to deliver a product is 6.
So H0 is that the mean equals 6. The alpha value is the probability of rejecting a null hypothesis that is
actually true. The standard choice is 5%, and hence alpha equals 0.05.
Note that the p-value is greater than the alpha value, which is 0.05. Therefore, we fail to reject the null
hypothesis.
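The demo program is not included in this transcript. A minimal sketch of such a test might look like the following, assuming PROC TTEST is the procedure used (the source does not name it) and using made-up aging values. The dataset name delivery is also hypothetical.

```sas
/* Hypothetical sample of delivery times in days */
data delivery;
    input aging @@;   /* @@ reads several observations from one line */
    cards;
5 7 6 8 4 6 7 5 6 9
;
run;

/* One-sample t-test of H0: mean aging = 6 at alpha = 0.05 */
proc ttest data=delivery h0=6 alpha=0.05;
    var aging;
run;
```

In the output, compare the reported p-value (labeled Pr > |t|) with 0.05 to decide whether to reject H0.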
This concludes the demo on how to perform the hypothesis testing using SAS.
Parametric and Non-parametric Tests
Let’s now learn about hypothesis testing procedures. There are two types of hypothesis testing
procedures: parametric tests and non-parametric tests.
In statistical inference or hypothesis testing, the traditional tests, such as the t-test and ANOVA, are
called parametric tests. They depend on the specification of a probability distribution that is fully
determined except for a set of free parameters.
In simple words, if the form of the population distribution is assumed to be known and only its
parameters are to be tested, then the test is called a parametric test.
If the form of the population distribution is not known and you are still required to test a hypothesis
about the population, then the test is called a non-parametric test. Non-parametric tests do not require
any strict distributional assumptions.
Parametric Tests
T-test
ANOVA
Chi-square
Linear regression
T-Test:
A t-test determines whether the means of two sets of data are significantly different from each other.
For example:
Let’s say you have to find out which of two regions spends more money on shopping. It’s
impractical to ask everyone in the regions about their shopping expenditure.
In this case, you can estimate the shopping expenditure by collecting sample observations from
each region.
With the help of the t-test, you can check whether the difference between the regions is significant or a
statistical fluke.
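As an illustration of the shopping example, here is a small sketch of a two-sample t-test in SAS. The dataset name spend, the variable names region and amount, and all values are hypothetical.

```sas
/* Hypothetical monthly shopping expenditure sampled from two regions */
data spend;
    input region $ amount;
    datalines;
East 220
East 310
East 275
East 260
West 190
West 240
West 205
West 230
;
run;

/* The CLASS variable must have exactly two levels for a two-sample t-test */
proc ttest data=spend;
    class region;
    var amount;
run;
```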
ANOVA:
ANOVA is a generalization of the t-test. It is used to test whether the mean of an interval dependent
variable differs across the levels of a categorical independent variable. When we want to compare the
means of two or more groups, we apply the ANOVA test.
For example:
Let’s look at the same example used for the t-test. Now, you want to check how much people in
various regions spend every month on shopping. In this case, there are four groups, namely East, West,
North, and South. With the help of the ANOVA test, you can check whether the difference between the
regions is significant or a statistical fluke.
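The four-region comparison could be sketched with PROC ANOVA as follows. The data values and the names region_spend, region, and amount are hypothetical. The sample is deliberately balanced (equal group sizes), which is the case PROC ANOVA is designed for.

```sas
/* Hypothetical monthly shopping expenditure, three observations per region */
data region_spend;
    input region $ amount;
    datalines;
East 220
East 310
East 275
West 190
West 240
West 205
North 300
North 330
North 290
South 210
South 250
South 235
;
run;

/* One-way ANOVA: does mean amount differ across regions? */
proc anova data=region_spend;
    class region;
    model amount = region;
run;
quit;
```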
Chi-Square
Chi-square is a statistical test used to compare observed data with data you would expect to obtain
according to a specific hypothesis.
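One common use is a chi-square test of independence on a cross-tabulation. The sketch below uses the CHISQ option of the TABLES statement in PROC FREQ. The survey table, its variable names, and the counts are made up for illustration.

```sas
/* Hypothetical cell counts: shopping channel preferred in each region */
data survey;
    input region $ channel $ count;
    datalines;
East Online 30
East Store 20
West Online 18
West Store 32
;
run;

/* WEIGHT treats count as the cell frequency;
   CHISQ requests the chi-square test and EXPECTED prints expected counts */
proc freq data=survey;
    weight count;
    tables region*channel / chisq expected;
run;
```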
Linear regression
There are two types of linear regression: simple linear regression and multiple linear regression.
Simple linear regression is used when one wants to test how well one variable predicts another variable.
Multiple linear regression allows one to test how well multiple variables (independent variables) predict
a variable of interest. When using multiple linear regression, we additionally assume that the predictor
variables are not strongly correlated with one another.
For example, finding the relationship between any two variables, say Sales and Profit, is called simple
linear regression.
Finding the relationship between three or more variables, say Sales, cost, and telemarketing, is called
multiple linear regression.
Let’s say an e-commerce company noticed a hike in sales because of two marketing campaigns. They
have three fields: Sales, the cost spent on the direct marketing campaign, and the cost spent on the
telemarketing campaign.
Here, Sales is denoted by S, the cost of the telemarketing campaign by TM, and the cost of the direct
marketing campaign by DM.
Checking the relationship between these three variables (Sales based on the campaigns) is an example of
multiple regression.
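The campaign example can be sketched with PROC REG, using the variable names S, TM, and DM from the text. The dataset name campaign and all values are hypothetical.

```sas
/* Hypothetical sales (S) and campaign costs (TM, DM) */
data campaign;
    input S TM DM;
    datalines;
200 10 15
240 12 18
210 11 14
300 16 22
280 15 19
260 13 21
320 18 23
230 12 15
;
run;

/* Multiple linear regression: S is the dependent variable,
   TM and DM are the predictors */
proc reg data=campaign;
    model S = TM DM;
run;
quit;
```

Dropping DM from the MODEL statement (model S = TM;) would turn this into a simple linear regression.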
Non-parametric Tests
Some of the non-parametric tests are the Wilcoxon rank sum test and the Kruskal-Wallis H-test.
The Wilcoxon rank sum test compares two independent samples. In this test, you test the null
hypothesis on the basis of the ranks of the observations.
A related test, the Wilcoxon signed-rank test, is a non-parametric statistical hypothesis test used to
compare two related or matched samples to assess whether or not their population mean ranks differ.
Kruskal-Wallis H-Test:
In this test, you can test the null hypothesis on the basis of the ranks of the independent samples.
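Both rank-based tests are available through PROC NPAR1WAY, which this lesson does not otherwise cover. A minimal sketch with hypothetical data and names: with three groups, the WILCOXON option produces the Kruskal-Wallis test; with exactly two groups, it also reports the Wilcoxon rank sum test.

```sas
/* Hypothetical expenditure samples from three independent regions */
data rank_spend;
    input region $ amount;
    datalines;
East 220
East 310
East 275
West 190
West 240
West 205
North 300
North 330
North 290
;
run;

/* Rank-based comparison of amount across the levels of region */
proc npar1way data=rank_spend wilcoxon;
    class region;
    var amount;
run;
```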
Parametric Tests-Advantages and Disadvantages
Provide information about the population in terms of parameters and confidence intervals
Easier to use in modeling, analyzing, and for describing data with central tendencies and data
transformations
Express the relationship between two or more variables
Don’t need to convert data into rank order to test
Non-parametric Tests—Advantages and Disadvantages
Key Takeaways
Descriptive Statistics helps organize data and focuses on the main characteristics of the data.
Inferential Statistics generalizes from a sample to the larger population and applies probability theory
to draw a conclusion.
Hypothesis testing is an inferential statistical technique to determine whether there is enough
evidence in a data sample to infer that a certain condition holds true for the entire population.
If the form of the population distribution is assumed to be known and only its parameters are to be
tested, then the test is called a parametric test.
If the form of the population distribution is not known and you are still required to test a
hypothesis about the population, then the test is called a non-parametric test.
Conclusion
This concludes “Basics of Statistics.” The next lesson is “Basic Statistical Procedure.”
Lesson 07 — Statistical Procedure
Introduction
Hi, and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.
In this lesson, “Statistical Procedures,” you will be introduced to the various procedures of statistics
available in Statistical Analysis System.
What’s In It for Me
In this lesson, you will understand the various statistical procedures such as PROC Means, PROC FREQ,
PROC UNIVARIATE, PROC CORR, PROC REG, and PROC ANOVA that help perform statistical tests. You
will also learn how to create graphs and interpret the results.
Statistical Procedures
The statistical procedures are used to analyze, represent, and calculate statistical data.
There are various statistical procedures that help perform statistical tests:
PROC Means
PROC UNIVARIATE
PROC FREQ
PROC CORR
PROC REG
PROC ANOVA
PROC Means
One of the most powerful and flexible procedures of SAS System is PROC MEANS. You can use it rapidly
and efficiently to analyze the values of numeric variables and place those analyses either in the output
window or in a SAS dataset or both. Mastering the basic syntax and features of this procedure will
enable you to analyze your datasets easily.
PROC MEANS is used in a variety of analytic, business intelligence, reporting, and data management
situations.
PROC MEANS is used to calculate descriptive statistics, estimate quartiles including the median,
calculate confidence limits for the mean, identify extreme values, and perform a t-test.
Let’s step into the “Syntax Classroom” to learn the syntax of PROC Means.
PROC MEANS DATA=dataset-name;
RUN;
The basic syntax for the MEANS procedure is shown above. By default, PROC MEANS calculates the
number of observations, mean, standard deviation, and minimum and maximum values for the numeric
variables in the dataset.
PROC Means—Examples
Let’s now understand the use of SAS procedures using PROC Means as an example.
You can use PROC Means without options. By default, SAS uses the last created dataset, and it generates
the means for all of the numeric variables in that dataset.
PROC MEANS;
RUN;
Look at the output shown on the screen. In this example, from the E-Commerce dataset, the number of
observations, mean, maximum and minimum values, and Standard Deviation are obtained.
Example 2: Using options on the PROC statement
SAS allows you to use various options to generate the desired output when you use PROC Means.
PROC MEANS DATA=dataset-name;
RUN;
Note that the DATA= option is optional. However, it is strongly recommended that you use it, as it avoids
errors of omission when you revise your programs.
You can also use statistic keywords such as N, MEAN, MODE, and STD (standard deviation) after the
keyword “PROC Means.”
PROC MEANS DATA=electronic N MEAN STD;
RUN;
Look at the output shown on the screen. In this example, from the electronic dataset, only the number of
observations, mean, and standard deviation are obtained.
Example 3: Using additional statements
In addition, you can use additional statements in PROC Means to get the desired output.
BY
CLASS
FREQ
ID
OUTPUT
TYPES
VAR
WAYS
WEIGHT
The CLASS statement identifies variables whose values define subgroups for the analysis.
The FREQ statement identifies a variable whose values represent the frequency of each observation.
The OUTPUT statement creates an output dataset that contains specified statistics and identification
variables.
The TYPES statement identifies specific combinations of class variables to use to subdivide the data.
The VAR statement identifies the analysis variables and their order in the results.
The WAYS statement specifies the number of ways to make unique combinations of class variables.
The WEIGHT statement identifies a variable whose values weight each observation in the statistical
calculations.
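The statements above can be combined in one step. The following sketch uses the CLASS, VAR, and OUTPUT statements together. The orders data, the output names mode_summary, avg_sales, and avg_profit, and all values are hypothetical, with shortened Ship_Mode labels.

```sas
/* Hypothetical order data */
data orders;
    input Ship_Mode $ Sales Profit;
    datalines;
Standard 250 40
Standard 310 55
First 180 20
First 220 35
Second 150 15
Second 200 30
;
run;

/* CLASS defines the subgroups, VAR picks the analysis variables,
   OUTPUT saves the group means to a dataset */
proc means data=orders n mean std;
    class Ship_Mode;
    var Sales Profit;
    output out=mode_summary mean=avg_sales avg_profit;
run;

/* The output dataset also contains the automatic _TYPE_ and _FREQ_ variables */
proc print data=mode_summary;
run;
```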
Look at the example shown on the screen. In this example, the VAR and CLASS statements are used.
Proc Means;
Var Sales Profit;
Class Ship_Mode;
Run;
Look at the output shown on the screen. In this example, SAS calculates the average Sales and Profit
within each Ship_Mode type.
The “Standard Class” Ship_Mode appears to have the highest average Sales and average Profit.
Example 4: Using additional statements
Going a step further, SAS helps you to compute the median, mode, quartiles, kurtosis, and skewness.
Proc Means Mean Median Mode P25 P50 P75;
Var Sales;
Class Ship_Mode;
Run;
The keyword MEAN generates the average of the Sales column for each Ship_Mode type.
The keyword MEDIAN generates the “middle” value, or median, of the Sales column for each Ship_Mode type.
The keyword MODE generates the most frequent value, or mode, of the Sales column for each Ship_Mode type.
The keyword P25 generates the first quartile of the Sales column for each Ship_Mode type.
The keyword P50 generates the second quartile, which is the median, of the Sales column for each
Ship_Mode type.
The keyword P75 generates the third quartile of the Sales column for each Ship_Mode type.
Knowledge Check
S.No.: 1
Question: Which of the following statements is used to identify the analysis variables and their order in
the results?
Answer & Explanation: a. The VAR statement identifies the analysis variables and their order in the
results.
PROC FREQ
So far you have learned the use of SAS procedures using PROC Means. Let’s now learn the use of SAS
procedures using PROC FREQ.
PROC FREQ is used to obtain frequency distributions and to analyze multi-dimensional tables.
The PROC FREQ statement invokes the procedure and optionally identifies the input dataset. By default,
PROC FREQ uses the most recently created SAS dataset.
Let’s step into the “Syntax Classroom” to learn the syntax of PROC FREQ.
PROC FREQ <options>;
BY variable-list;
TABLES requests </ options>;
WEIGHT variable;
FORMAT;
TEST options;
RUN;
The PROC FREQ statement invokes the FREQ procedure. By default, similar to PROC MEANS, the
procedure uses the most recently created SAS dataset.
The BY statement obtains a separate analysis for each group defined by the BY variables (the data must
first be sorted by these variables).
The TABLES statement requests cross-tabulation tables and statistics for those tables.
The WEIGHT statement names a numeric variable that provides a weight for each observation in the
input data set.
The OUTPUT statement creates an output data set that contains specified statistics and identification
variables.
The EXACT statement requests exact tests or confidence limits for the specified statistics.
The TEST statement requests asymptotic tests for measures of association and measures of agreement.
The statements and options in PROC FREQ can be categorized into three primary ways. They are as
follows:
Demo—PROC FREQ
In this demo, you will learn how to perform the statistical procedure using PROC FREQ.
Let’s perform the statistical analysis using PROC FREQ on the electronic dataset.
In PROC FREQ, we use the BY statement instead of the CLASS statement. To use the BY statement in
PROC FREQ, the data must be sorted by the BY variables.
Let’s sort the electronic dataset using PROC SORT.
Let’s now calculate the frequency of the sorted products from the electronic dataset using PROC FREQ.
Look at the output shown on the screen. The product table is created with frequency, percent,
cumulative frequency, and cumulative percent columns.
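The demo code is not reproduced here. A minimal sketch of the two steps, assuming a hand-typed stand-in for the electronic dataset with a hypothetical product variable, might look like this:

```sas
/* Hypothetical stand-in for the electronic dataset */
data electronic;
    input product $ aging;
    datalines;
TV 5
Radio 7
Phone 6
TV 8
Phone 4
TV 6
;
run;

/* Sort by product so the table (and any BY-group processing) is ordered */
proc sort data=electronic;
    by product;
run;

/* One-way frequency table of product: frequency, percent,
   cumulative frequency, and cumulative percent */
proc freq data=electronic;
    tables product;
run;
```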
This concludes the demo on how to perform the statistical procedure using PROC FREQ.
PROC UNIVARIATE
We hope you have understood the concept of PROC FREQ. Let’s now learn the next statistical procedure,
PROC UNIVARIATE.
PROC UNIVARIATE is a powerful base statistical procedure that combines other analytical procedures
such as FREQ, MEANS, SUMMARY, and TABULATE into a single PROC step.
The UNIVARIATE procedure provides data summarization tools, high-resolution graphics displays, and
information on the distribution of numeric variables.
It calculates descriptive statistics, median, mode, range, quartiles, frequency tables, and
confidence limits.
It tabulates extreme observations and extreme values and plots the data distribution.
It performs tests for location and normality.
It performs goodness-of-fit tests for fitted parametric and nonparametric distributions.
It creates histograms—one-way and two-way comparative histograms, comparative quantile-
quantile plots, and comparative probability plots.
It creates output data sets with requested statistics, histogram intervals, and parameters of the
fitted distributions.
Let’s step into the “Syntax Classroom” to learn the syntax of PROC UNIVARIATE.
PROC UNIVARIATE DATA=dataset-name;
Run;
The keyword “PROC UNIVARIATE” examines the distribution of your data, including an assessment of
normality and discovery of outliers.
The PROC UNIVARIATE procedure allows you to include various options and statements.
Follow the syntax shown on the screen while using various options and statements.
FREQ variable;
ID variable(s);
PROBPLOT <variable(s)> </ option(s)>;
VAR variable(s);
WEIGHT variable;
Note that, like PROC Print, PROC UNIVARIATE also has a BY statement to produce separate analyses for
each value of the variable specified.
Demo—PROC UNIVARIATE
In this demo, you will learn how to perform the statistical procedure using PROC UNIVARIATE.
The statement PROC UNIVARIATE invokes the UNIVARIATE procedure. We have chosen the electronic
dataset.
The VAR statement selects the analysis variables and determines their order in the report. Here, aging is
an analysis variable.
The HISTOGRAM statement creates histograms and superimposes the estimated parametric and
nonparametric probability density curves.
In this example, we will plot a normal curve. To superimpose a fitted normal curve on the histogram, use
the NORMAL option in the HISTOGRAM statement.
The moments, basic statistical measures, tests for location, quantiles, extreme observations,
histogram, and fitted normal distribution are obtained for the aging variable.
This concludes the demo on how to perform the statistical procedure using PROC UNIVARIATE.
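The steps of this demo can be sketched roughly as follows. The demo's own code is not reproduced in this text, so this is an approximation under stated assumptions: SASHELP.CLASS stands in for the electronic dataset, and height stands in for the aging variable.

```sas
/* Sketch of the demo steps with stand-in data and variable names. */
proc univariate data=sashelp.class;
  var height;                  /* analysis variable                     */
  histogram height / normal;   /* histogram with a fitted normal curve  */
run;
```

The NORMAL option on the HISTOGRAM statement also adds goodness-of-fit tests for the normal distribution to the output.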
Knowledge Check
S.No. 1
Question: Which of the following procedures calculates the unique values of the variable, the number of
observations at each value, a cumulative count, and a cumulative percent?
Answer & Explanation: b. PROC FREQ calculates the unique values of the variable, the number of
observations at each value, a cumulative count, and a cumulative percent.
PROC CORR
PROC CORR (contd.)
PROC CORR is a correlation procedure used to check the strength of the relationship between two or more
variables. It computes simple descriptive statistics, the Pearson product-moment correlation coefficient
between variables, Spearman’s rank-order correlation, and the Kendall correlation coefficient.
It also applies Fisher's Z transformation to the Pearson product-moment and Spearman’s rank-order
correlation coefficients to get 95% confidence intervals.
PROC CORR (contd.)
Descriptive statistics are used to describe the basic features of the data.
The Pearson product-moment correlation, or Pearson correlation for short, is used to measure
the linear correlation between two variables.
Spearman’s rank-order correlation is used to measure the strength and direction of the monotonic
association between two ranked variables.
Kendall rank correlation is used to measure the ordinal association between two variables.
PROC CORR (contd.)
Well, let’s now step into the syntax classroom to learn the syntax of PROC CORR.
PROC CORR (contd.)
Run;
The PROC CORR statement computes Pearson product-moment correlations for the most recently created
dataset when no DATA= option is given. It also computes the p-values used to test the null hypothesis
that each correlation is zero.
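A basic PROC CORR call can be sketched as follows. The sketch assumes the SASHELP.CLASS sample table as a stand-in dataset. It names the input explicitly with DATA= rather than relying on the most recently created dataset.

```sas
/* Pearson correlations (the default) plus Spearman rank correlations. */
/* SASHELP.CLASS and its variables are stand-ins for illustration.     */
proc corr data=sashelp.class pearson spearman;
  var height weight age;
run;
```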
PROC CORR (contd.)
Various options are available in the PROC CORR statement under the Datasets, Statistical Analysis,
Pearson Correlation Statistics, ODS Output Graphics, and Printed Output categories.
The options in each category are listed below.
PROC CORR (contd.)
Datasets
Option Description
DATA Specifies the input dataset
OUTH Specifies the output dataset with Hoeffding’s statistics
OUTK Specifies the output dataset with Kendall correlation statistics
OUTP Specifies the output dataset with Pearson correlation statistics
OUTS Specifies the output dataset with Spearman correlation statistics
PROC CORR (contd.)
Statistical Analysis
Option Description
EXCLNPWGT Excludes observations with nonpositive weight values from the analysis
FISHER Requests correlation statistics using Fisher’s Z transformation
HOEFFDING Requests Hoeffding’s measure of dependence
KENDALL Requests Kendall’s tau-b
NOMISS Excludes observations with missing analysis values from the analysis
PEARSON Requests Pearson product-moment correlation
SPEARMAN Requests Spearman rank-order correlation
PROC CORR (contd.)
Pearson Correlation Statistics
Option Description
ALPHA Computes Cronbach’s coefficient alpha
COV Computes covariances
CSSCP Computes corrected sums of squares and cross products
SINGULAR Specifies the singularity criterion
SSCP Computes sums of squares and cross products
VARDEF Specifies the divisor for variance calculations
PROC CORR (contd.)
ODS Output Graphics
Option Description
PLOTS=MATRIX Displays the scatter plot matrix
PLOTS=SCATTER Displays scatter plots for pairs of variables
PROC CORR (contd.)
Printed Output
Option Description
BEST= Displays the specified number of ordered correlation coefficients
NOCORR Suppresses Pearson correlations
NOPRINT Suppresses all printed output
NOPROB Suppresses P-values
NOSIMPLE Suppresses descriptive statistics
RANK Displays ordered correlation coefficients
Demo—PROC CORR
In this demo, you will learn how to perform the statistical procedure and obtain a scatter plot using
PROC CORR.
Let’s create a basic statistics and correlation matrix table for the electronic dataset.
The statement PROC CORR is used to check the strength of the relationship between two or more variables.
The variables sales, profit, and discount are selected as the analysis variables using the VAR
statement.
In each cell of the correlation matrix, the first value is the correlation coefficient and the second
value is the p-value.
On the diagonal of the correlation matrix, the correlation coefficient is 1 because each diagonal element
is the correlation of a variable with itself.
To obtain the scatter plot, enclose the procedure between the ODS GRAPHICS ON and ODS GRAPHICS OFF
statements. These statements add graphics to the output window.
Use the PLOTS= option with the MATRIX value to create the scatter plot matrix for the selected variables.
The scatter plot matrix is obtained for the variables sales, profit, and discount.
This concludes the demo on how to perform the statistical procedure and obtain scatter plot using PROC
CORR.
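The scatter plot matrix step of this demo can be sketched as follows. The demo's own code is not reproduced in this text, so the sketch assumes SASHELP.CLASS as a stand-in for the electronic dataset, with height, weight, and age standing in for sales, profit, and discount.

```sas
/* Turn ODS Graphics on, request the scatter plot matrix, then turn it off. */
ods graphics on;

proc corr data=sashelp.class plots=matrix;
  var height weight age;   /* stand-ins for sales, profit, and discount */
run;

ods graphics off;
```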
PROC REG
So far you have learned statistical procedures such as PROC MEANS, PROC FREQ, PROC UNIVARIATE,
and PROC CORR.
PROC REG (contd.)
Let’s step into the syntax classroom to learn the syntax of PROC REG.
PROC REG (contd.)
VAR variables;
FREQ variable;
WEIGHT variable;
ID variable;
RESTRICT linear_equation,...;
TEST linear_equation,...;
MTEST linear_equation,...;
BY variables;
The MODEL statement specifies the dependent and independent variables in the regression model. The
MODEL statement provides the output with a covariance matrix and other summarized statistical values.
PROC REG (contd.)
Note that if you need to fit a model to the data, you must use a MODEL statement. If you need only
the options available in the PROC REG statement, the VAR statement is required and the MODEL statement
becomes optional.
PROC REG (contd.)
Let’s now learn the various options available in the PROC REG statement under the Datasets, ODS Output
Graphics, Traditional Graphics, Display Options, and Other Options categories.
The options in each category are listed below.
PROC REG (contd.)
Datasets
Option Description
DATA Specifies the input dataset
OUTEST Outputs a dataset that contains parameter estimates and other model fit summary statistics
OUTSSCP Outputs a dataset that contains sums of squares and cross products
COVOUT Outputs the covariance matrix for parameter estimates to the OUTEST= dataset
OUTSEB Outputs standard errors of the parameter estimates to the OUTEST= dataset
OUTSTB Outputs standardized parameter estimates to the OUTEST= dataset; use only with the RIDGE= or PCOMIT= option
OUTVIF Outputs the variance inflation factors to the OUTEST= dataset; use only with the RIDGE= or PCOMIT= option
PCOMIT Performs incomplete principal component analysis and outputs estimates to the OUTEST= dataset
RIDGE Performs ridge regression analysis and outputs estimates to the OUTEST= dataset
RSQUARE Outputs the number of regressors, the error degrees of freedom, and the model R-square to the OUTEST= dataset
TABLEOUT Outputs standard errors, confidence limits, and associated test statistics of the parameter estimates to the OUTEST= dataset
PROC REG (contd.)
ODS Output Graphics
Option Description
PLOTS= Produces ODS graphical displays
PROC REG (contd.)
Traditional Graphics
Option Description
ANNOTATE= Specifies an annotation dataset
GOUT= Specifies the graphics catalog in which graphics output is saved
PROC REG (contd.)
Display Options
Option Description
LINEPRINTER Creates the requested plots as line printer plots
ALL Displays all statistics, including the correlation matrix, simple statistics, and the uncorrected sums of squares and cross products matrix
PROC REG (contd.)
Other Options
Option Description
ALPHA= Sets the significance level for confidence and prediction intervals and tests
SINGULAR= Sets the criterion for checking for singularity
Demo—PROC REG
In this demo, you will learn how to perform the statistical procedure and interpret regression results
using PROC REG.
The MODEL statement specifies the dependent and independent variables in the regression model.
In this example, let’s check the relationship between sales and quantity.
The variable sales is the dependent variable and quantity is the independent variable.
The ANOVA table provides the p-value, and the fit statistics provide the R-square value. The p-value is
used to test the hypothesis, and the R-square value is the proportion of the variation in the dependent
variable that is explained by the independent variable.
Note that the output also shows fit diagnostics, residuals, and fit plot details for the dependent variable
“Sales.”
This concludes the demo on how to perform the statistical procedure and interpret regression results
using PROC REG.
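A simple regression of this kind can be sketched as follows. This is a sketch under stated assumptions, not the demo's code: SASHELP.CLASS stands in for the electronic dataset, with weight as the dependent variable (in place of sales) and height as the independent variable (in place of quantity).

```sas
/* Simple linear regression: dependent = independent. */
proc reg data=sashelp.class;
  model weight = height;
run;
quit;   /* PROC REG is interactive, so QUIT ends the procedure */
```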
Knowledge Check
S.No. 1
Question: Which of the following statements names a variable to identify observations in the
printout?
Answer & Explanation: d. The ID statement names a variable to identify observations in the printout.
PROC ANOVA
The ANOVA procedure performs analysis of variance for balanced data from a wide variety of
experimental designs. The data is balanced if there are equal numbers of observations for every
combination of the classification factors.
Whenever the data is not balanced, use the GLM procedure, whose statements are almost identical to
those of PROC ANOVA.
PROC GLM is a general procedure that works with both balanced and unbalanced data.
PROC ANOVA (contd.)
The variation in the response is assumed to be due to effects in the classification, with random
error accounting for the remaining variation.
PROC ANOVA (contd.)
Let’s step into the syntax classroom to learn the syntax of PROC ANOVA.
PROC ANOVA (contd.)
ABSORB variables;
BY variables;
FREQ variable;
PROC ANOVA (contd.)
The REPEATED statement performs multivariate and univariate repeated measures analysis of variance.
Demo-PROC ANOVA
In this demo, you will learn how to perform the statistical procedure and interpret ANOVA results
when the data is balanced.
Let’s understand PROC ANOVA using an example.
Three Voice Over talents are given 5 subjects each to read. The reading speed is recorded in words per
minute for each subject in the test. Analyze their scores.
Use PROC ANOVA to check variance among groups when the data is balanced.
Use the title statement to give the title for the analysis. Here, let’s name the analysis as ANOVA.
The CLASS statement declares the classification variables. Here, Voice_Over_Talent is the classification variable.
The MODEL statement specifies the dependent and independent variables in the model.
Here, word count is the dependent variable and voice-over talent is the independent variable.
Use the plot statement to plot the graph for word count and voice-over talent.
The F statistic is 7.14 with a p-value of 0.0091. The p-value is less than 0.05, so we reject
the null hypothesis.
We conclude that the mean word counts were not all the same across the voice-over talents.
A graphical comparison allows you to visually see the distribution of the groups.
If the p-value is low, there is little chance of overlap between two or more of the groups.
This concludes the demo on how to perform the statistical procedure and interpret ANOVA results when
the data is balanced.
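The ANOVA steps described in this demo can be sketched as follows. The demo's data is not reproduced in this text, so the values below are invented purely for illustration and will not reproduce the F statistic quoted above. The dataset name reading_speed and the variable name words_count are also assumptions.

```sas
/* Invented, balanced data: 3 talents x 5 readings each. */
data reading_speed;
  input voice_over_talent $ words_count @@;
  datalines;
A 150 A 162 A 148 A 155 A 160
B 170 B 165 B 172 B 168 B 175
C 158 C 149 C 153 C 161 C 156
;
run;

title "ANOVA";
proc anova data=reading_speed;
  class voice_over_talent;                 /* classification variable */
  model words_count = voice_over_talent;   /* dependent = independent */
run;
quit;
title;   /* clear the title */
```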
Activity
Read the problem carefully and analyze what needs to be done using SAS techniques.
Generate statistical values from the Electronic dataset for the sales variable, where sales is greater than
100 and Order Priority is “Critical.” Also, limit the output to two decimal places.
Click each code in the correct sequence to write the program that will be the solution to the
problem. Click the dataset tab to view them.
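One possible shape of a solution is sketched below. It is not the course's answer key. The Electronic dataset is not available in this text, so the sketch builds a tiny invented dataset (electronic_demo), and the variable names sales and order_priority are assumptions.

```sas
/* Invented sample rows standing in for the Electronic dataset. */
data electronic_demo;
  length order_priority $10;
  input sales order_priority $;
  datalines;
250.456 Critical
80.5 Critical
310.2 High
145.789 Critical
;
run;

/* MAXDEC=2 limits the printed statistics to two decimal places. */
proc means data=electronic_demo maxdec=2;
  where sales > 100 and order_priority = "Critical";
  var sales;
run;
```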
Assignment 01
Let’s practice what you have learned so far in this lesson. There are two mini projects in this lesson. Read
each question carefully and then answer it. The techniques and steps are provided under the guide section
to assist you.
Assignment 01
A consulting firm wants to perform a correlation analysis with descriptive statistics between Sales and
Profit for its e-commerce client. The e-commerce dataset keeps track of the number of days used to
deliver a product, Product Category, Sales, Quantity, Profit, Discount, and Customer Information. The firm
needs to perform the analysis for products that belong to the Product Category Fashion where sales is more
than 150. It also wants to display the information graphically in the form of a symmetric matrix plot.
Assignment 01 (contd.)
We recommend that you first solve the project and then view the solution to assess your learning.
You can perform this project in the installed SAS University Edition.
Assignment 02
XYZ, a pharmaceutical company, has developed four different medicines for headache relief. It wants to
compare the time of relief of these medicines. The company recorded the time of relief in 20 different
patients, with a group of five trying each medicine. XYZ wants to test whether all four medicines take the
same time or whether the times differ.
Assignment 02 (contd.)
We recommend that you first solve the project and then view the solution to assess your learning.
You can perform this project in the installed SAS University Edition.
Key Takeaways
The “PROC Means” calculates the number of observations, Mean, Standard Deviation, and
maximum and minimum values from the dataset.
The PROC FREQ is used to obtain a frequency distribution and to analyze multidimensional
tables.
The UNIVARIATE procedure provides data summarization tools, high-resolution graphics
displays, and information on the distribution of numeric variables.
PROC CORR is a correlation procedure used to check the strength of the relationship between two or more
variables.
The ANOVA procedure performs analysis of variance (ANOVA) for balanced data from a wide
variety of experimental designs.
Conclusion
This concludes “Basic Statistical Procedure.” The next lesson is “Data Exploration.”
Lesson 8 — Data Exploration
Introduction
Hi, and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.
In this lesson, you will learn about Data preparation and how to summarize the data.
What’s In It for Me
In this lesson, you will learn how to perform data cleaning and convert numeric values into character
variables, and vice versa.
You will also understand how SAS handles missing values in your datasets using various procedures.
Data Preparation
Let’s start this lesson by defining data preparation. Often, Data Scientists get data that is not in the
correct format for analysis. To convert the data to the correct format for analysis, they perform data
preparation.
Data Preparation(contd.)
Data preparation is a time-consuming task for any analytical project. Data Preparation tasks involve
collecting relevant data, sampling, and aggregating data attributes.
Data Preparation(contd.)
Data is collated at the customer or account level from different sources. These sources may
include billing and payment transactional data, demographic figures, and financial data.
Data Preparation(contd.)
In short, before you perform required analyses, you need to prepare the data you already have.
To prepare your data for the required analysis, you need to clean the data as the first step. Data cleaning
refers to the removal of data values that are incorrect from a data source.
When you clean the data, you may come across dirty data. These data contain inaccurate and erroneous
data values. The inaccuracy happens quite often when data is downloaded from the server or any other
source.
Therefore, you should perform data cleaning, to avoid erroneous or irrelevant data values.
Data Cleaning—Example
XYZ Company downloads sales data from the server. The column “name” in the sales report has a junk
character at the end of each name. Here the forward double slash is a junk value. Before the company
uses this sales report for analysis, it needs to clean the column “customer name.”
Look at the example shown on the screen. This example shows how to remove the junk value “forward
double slash” for a single observation.
Data Cleaning—Example(contd.)
The COMPRESS function removes the specified characters from a variable. It is also used to remove
unnecessary spaces from a variable. Here, the COMPRESS function removes the forward slashes.
The PUT statement writes variable values to an output line, which by default goes to the SAS log. Here,
the variable written is “Correct_name.”
When you run the code, the output is generated, and it is shown on the screen.
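The cleaning step for a single observation can be sketched as follows. The on-screen code is not reproduced in this text, so the name value below is an invented example. Only the COMPRESS call and the PUT statement come from the description above.

```sas
/* Remove the trailing "//" junk characters from one value. */
data _null_;
  name = "John Smith//";                 /* invented example value   */
  correct_name = compress(name, "/");    /* drop every "/" character */
  put correct_name=;                     /* writes to the SAS log    */
run;
```

The log line should read correct_name=John Smith.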
General Comments on Data Cleaning
Each set of data that needs to be cleaned has its own set of difficulties and challenges.
Therefore, the following information allows the “cleaner” to tackle all problems in the basic
cleaning line.
General observation for Data Cleaning
Knowledge Check
S.No. Question Answer & Explanation
1. Which of the following functions in c.
SAS is used to remove the character The compress function removes the specified
Data Type Conversion
In SAS, while cleaning the data, data scientists often need to change or convert the
type of a variable. Sometimes, it is required to change numeric data to character variables, or
vice versa.
To convert from numeric to character, use the Put function. To convert from character to numeric, use
the Input function.
Syntax Classroom
Let’s step into the syntax classroom to learn the syntax for the Put function.
Data Type Conversion(contd.)
The argument source identifies the constant, variable, or expression whose values you are required to
reformat. The source argument can be character or numeric.
Data Type Conversion(contd.)
The argument format specifies a format to use when the variable values are written. This argument must
be the name of a format with a period and optional width and decimal specifications.
Data Type Conversion(contd.)
Note that the format must be of the same type as the source, either character or numeric. That is, if the
source is character, the format name must begin with a dollar sign, and if the source is numeric, the
format name must not begin with a dollar sign.
Data Type Conversion(contd.)
By default, if the source is numeric, the resulting string is right aligned, and if the source is character, the
result is left aligned.
To overcome the default alignment, you can add an alignment specification to a format.
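The default alignment and an alignment specification can be compared with a short sketch like this one. The variable names are hypothetical. The brackets in the PUT statements are only there to make the leading or trailing blanks visible in the log.

```sas
data _null_;
  amount = 42;
  right_aligned = put(amount, 8.);      /* default for a numeric source */
  left_aligned  = put(amount, 8. -L);   /* -L left aligns the result    */
  /* $CHAR8. keeps leading blanks so the alignment shows in the log */
  put "[" right_aligned $char8. "]";
  put "[" left_aligned  $char8. "]";
run;
```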
Numeric to Character Conversion
For example, look at the Electronic dataset that stores the zip code as a numeric value and
Electronic_CustomerInfo dataset that stores the zip code as a character variable.
Numeric to Character Conversion
Look at the program shown on the screen to convert a numeric value to a character variable. Here, the
Put function converts the zip code from numeric and stores it as character.
The Zw. format writes standard numeric data with leading zeros. The Z5. format adds leading zeros whenever
a value has fewer than 5 digits.
A new character variable called zip code is created using the Put function.
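The numeric-to-character conversion described above can be sketched as follows. The on-screen program is not reproduced in this text, so the dataset name zip_demo, the variable names, and the two sample values are assumptions.

```sas
/* Convert a numeric zip code to a 5-character value with leading zeros. */
data zip_demo;
  input zip_num;
  zip_code = put(zip_num, z5.);   /* 2139 becomes "02139" */
  datalines;
2139
90210
;
run;

proc print data=zip_demo;
run;
```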
Character to Numeric Conversion
Sometimes numeric data is imported into character variables, and it may be desirable to convert these
character variables into numeric variables.
Note that it is not possible to directly change the type of a variable. It is only possible to write the
variable to a new variable containing the same data, although with a different type.
Character to Numeric Conversion(contd.)
By renaming and dropping variables, it is possible to produce a new variable with the same name as the
original, although with a different type.
There are two methods to convert character to numeric—using the multiplication operator and using
the Input function.
The naive approach is to multiply the character variable by 1, causing SAS to perform an implicit type
conversion.
SAS performs an implicit character-to-numeric conversion and writes a note to this effect in the log. Look
at the example code shown on the screen.
This method is considered poor programming practice and should be avoided. A preferable method to
convert a character value to a numeric value is to use the Input function. Look at the example code
shown on the screen.
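The two methods can be placed side by side in a short sketch. The variable names and the sample value are hypothetical, and the informat 8. is one reasonable choice for a plain integer string.

```sas
data _null_;
  char_value = "12345";

  /* Implicit conversion: works, but SAS writes a conversion NOTE to the log. */
  implicit_num = char_value * 1;

  /* Explicit conversion with the INPUT function and a numeric informat. */
  explicit_num = input(char_value, 8.);

  put implicit_num= explicit_num=;
run;
```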
Let’s step into the syntax classroom to learn the syntax for input function.
Character to Numeric (contd.)
The argument source specifies a character constant, variable, or expression to which you want to apply a
specific informat.
Character to Numeric (contd.)
The argument informat refers to the SAS informat that you want to apply to the source. This argument
must be the name of an informat followed by a period, and it cannot be a character constant, variable,
or expression.
Character to Numeric (contd.)
The Input function returns the value produced when a SAS expression is converted using a specified
informat.
Character to Numeric (contd.)
The SAS code that demonstrates character to numeric conversion is shown on the screen. The input
function converts the character variable type to numeric type.
When you run this code, the output is generated and it is shown on the screen.
In addition to character and numeric conversions, the Put and Input functions can also be used in the
conversion of date or time values into character variables and vice versa.
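The date conversions mentioned above can be sketched as follows. The variable names and the sample date are invented for illustration. DATE9. is used as the informat and MMDDYY10. as the format.

```sas
data _null_;
  date_text  = "15MAR2015";
  date_value = input(date_text, date9.);     /* character to SAS date value   */
  back_text  = put(date_value, mmddyy10.);   /* SAS date value to character   */
  put date_value= back_text=;
run;
```

A SAS date value is stored as a number of days, so date_value prints as a plain number unless a date format is applied.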
Character Functions
Following is the list of character functions that are extremely useful in data cleaning:
Character Functions(contd.)
The compress function removes specified characters from a variable. It is also used to remove
unnecessary spaces from a variable.
Character Functions(contd.)
The INDEX, INDEXC, and INDEXW functions return the starting position for a character, character string, or
word and are extremely useful in determining where to start or stop when sub-stringing a variable.
The LENGTH function returns the number of characters in a character variable value.
The LOWCASE function changes all the letters in a variable value to lowercase.
The RIGHT function justifies the variable value to the right.
The SCAN function returns a portion of the variable value as defined by a delimiter. For example, the
delimiter could be a space, comma, or semicolon.
The SUBSTR function returns a portion of the variable value based on the starting position and number of
characters.
The TRANSLATE function replaces specific characters in a variable value with the characters that you specify.
The TRANWRD function replaces a portion of the character string (word) with another character
string or word. For example, a delimiter was supposed to be a comma but the data in some cases contains a
colon. This function could be used to replace the colon with a comma.
The TRIM function removes the trailing blanks from the right-hand side of a variable value.
The UPCASE function changes all the letters in a variable value to uppercase.
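Several of these character functions can be tried together in one short step. The value "Smith:John" and all variable names are invented for illustration.

```sas
data _null_;
  length fixed $20;
  raw       = "Smith:John";
  pos       = index(raw, ":");           /* position of the colon: 6        */
  last_name = substr(raw, 1, pos - 1);   /* characters before the colon     */
  fixed     = tranwrd(raw, ":", ",");    /* replace the colon with a comma  */
  upper     = upcase(raw);
  lower     = lowcase(raw);
  n_chars   = length(raw);               /* 10 */
  put pos= last_name= fixed= upper= lower= n_chars=;
run;
```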
SCAN Function
Let’s step into the syntax classroom to learn the syntax for the Scan function.
SCAN Function
Most of the time, you need to extract a portion of a character variable. To do so, use the Scan
function.
SCAN(TEXT,N<,DELIMITERS>);
The Scan function returns the nth word from a text expression.
SCAN Function
N specifies the number of the word in the character string that you want SCAN to select. If N is positive,
SCAN counts words from left to right, and if N is negative, SCAN counts words from right to left.
SCAN Function
Delimiters are a group of characters used to separate words. The default delimiters are shown on the
screen.
SCAN Function
Let’s extract the first name and last name of the customer into separate variables from the electronic
customer information dataset.
The first name is extracted using the Scan function with n value equal to 1.
The last name is extracted using the Scan function with n value equal to 2.
Note that the first name and last name are extracted in different columns.
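The extraction can be sketched as follows. The customer information dataset is not available in this text, so the sketch uses one invented name, and the variable names are assumptions. A blank is passed as the delimiter explicitly.

```sas
data _null_;
  length customer_name $30 first_name last_name $15;
  customer_name = "Jane Doe";                 /* invented example value */
  first_name = scan(customer_name, 1, " ");   /* first word             */
  last_name  = scan(customer_name, 2, " ");   /* second word            */
  put first_name= last_name=;
run;
```

Using -1 as the second argument would return the last word instead, which can help when some names contain a middle name.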
Date/Time Functions
Date/Time functions are a set of functions that return portions of datetime, date, or time values.
These functions are especially useful for extracting the date and time from a datetime value or
converting separate month, day, and year values into a SAS date value.
The MDY function creates a SAS date value from numeric values that represent the month, day, and
year.
Date/Time Functions
Let’s step into the syntax classroom to learn the syntax for MDY function.
Date/Time Functions
MDY(month,day,year)
If the data is numeric, use the MDY function to convert the separate variables into a single date value
variable. However, if the data is character, convert it to numeric first and then
convert it to the date value.
The Electronic_custinfo dataset contains month, day, and year in separate variables.
Date/Time Functions
However, there is only a single date variable in the Electronic dataset. To add the month, day, and year
details of electronic customer information to the Electronic dataset, use the MDY function.
The DATE9. format displays the date in the form shown on the screen. Look at the output shown on the
screen.
The MDY function converts the separate variables from the Electronic Custinfo dataset into a single
variable.
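Both cases, numeric parts and character parts, can be sketched as follows. The values and variable names are invented for illustration and are not taken from the Electronic Custinfo dataset.

```sas
data _null_;
  /* Numeric parts: pass them straight to MDY. */
  month = 3; day = 15; year = 2015;
  order_date = mdy(month, day, year);
  put order_date= date9.;

  /* Character parts: convert to numeric with INPUT first. */
  month_c = "03"; day_c = "15"; year_c = "2015";
  order_date2 = mdy(input(month_c, 2.), input(day_c, 2.), input(year_c, 4.));
  put order_date2= date9.;
run;
```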
Various Date/Time Functions
Following is a list of date/time functions that are extremely useful in data cleaning.
Function Use
Month Returns the month from a date value
Missing Value Treatment
So far you have learned how to clean the data and convert numeric data values to character
variables, and vice versa.
RUN ;
When you run this code, the output is generated and it is shown on the screen.
You can observe in the output that some observations show a period (.). This implies that
there are missing numeric values for these observations.
Missing Value Treatment
Let’s now learn how SAS handles these missing data values using SAS procedures.
As a general rule, SAS procedures that perform computations handle missing data by omitting the
missing values.
The way that missing values are eliminated is not always the same among SAS procedures, so let's
look at some examples.
First, let's run PROC MEANS on our dataset and see how it handles the missing values. Note that
there are 50 observations.
VAR Sales;
RUN ;
The output table reports 37 observations, although the dataset has 50. So, you can conclude that
PROC MEANS ignores the observations with missing values.
Using the same example, let’s now run PROC FREQ on our dataset and see how it handles the
missing values.
TABLES Sales;
RUN;
As you can see in the output, PROC FREQ performed its computations using only the available data. Note
that the percentages are based on the number of non-missing cases.
The following are some common SAS procedures and how they handle missing values.
In PROC FREQ, missing values are excluded by default and percentages are based on the number of
non-missing values. If you use the MISSING option in the TABLES statement, the percentages are based on
the total number of observations (non-missing and missing) and the percentage of missing values is
reported in the table.
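The difference between the default behavior and the MISSING option can be sketched as follows. The
dataset name WORK.ELECTRONIC is an assumption; Sales is the variable used in the earlier example.

```sas
/* Sketch only: WORK.ELECTRONIC is an assumed dataset name */
proc freq data=work.electronic;
    tables Sales;             /* default: missing values are left out of the percentages */
    tables Sales / missing;   /* missing values are counted as a level of Sales */
run;
```

Comparing the two tables side by side shows how the percentage base changes.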
In PROC MEANS, observations with missing values in a CLASS variable are excluded by default. To include
them, use the MISSING option in the PROC MEANS statement or the CLASS statement.
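A minimal sketch of the MISSING option in PROC MEANS is shown below. The dataset name WORK.ELECTRONIC and
the class variable Region are hypothetical and are used only to illustrate the option.

```sas
/* Sketch only: dataset and CLASS variable are assumed for illustration */
proc means data=work.electronic missing;
    class Region;   /* observations with a missing Region form their own group */
    var Sales;
run;
```

The option affects only the CLASS variable. Missing values of the analysis variable Sales are still
left out of the statistics.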
In PROC CORR, correlations are computed by default from all pairs with non-missing data, that is,
pairwise deletion of missing data. The NOMISS option in the PROC CORR statement requests that
correlations be computed only from observations that have non-missing data for all variables in the VAR
statement.
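The NOMISS option can be tried with a step like the one below. The dataset name WORK.ELECTRONIC is an
assumption, and the three variables are borrowed from other examples in this course.

```sas
/* Sketch only: WORK.ELECTRONIC is an assumed dataset name */
proc corr data=work.electronic nomiss;   /* listwise deletion instead of pairwise */
    var Sales Profit Quantity;
run;
```

Removing NOMISS returns to the default pairwise deletion, so each correlation may then be based on a
different number of observations.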
In PROC REG, an observation is excluded from the analysis if any of the variables in the MODEL or VAR
statement is missing, that is, listwise deletion of missing data.
SAS has a number of procedures to help you present a report in the desired format.
One of the most commonly used procedures for data summarization is PROC REPORT.
Let’s step into the syntax classroom to learn the syntax of PROC REPORT.
PROC REPORT DATA=datasetname;
COLUMN variables;
DEFINE variable / options;
COMPUTE column;
ENDCOMP;
RUN;
The COLUMN statement describes the arrangement of all columns and of headings that span more than
one column.
The DEFINE statement describes how to use and display a report item.
The COMPUTE and ENDCOMP statements enclose one or more programming statements that PROC REPORT
executes as it builds the report.
Let’s understand PROC REPORT with the help of an example.
A report with the Order ID, Product, and Sales columns is created from the Electronic dataset.
Look at the output shown on the screen.
The sales report is created with the Order ID, Product name, Sales, and Incentive columns.
Note that there is a dollar sign before the values in the Sales column. The incentive is computed
according to the given calculation.
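A report of this kind could be sketched as follows. The dataset name WORK.ELECTRONIC, the variable names
Order_ID and Product, and the 10 percent incentive rate are all assumptions, since the course shows the
actual calculation only on screen.

```sas
/* Sketch only: names and the 10 percent rate are assumed for illustration */
proc report data=work.electronic nowd;
    column Order_ID Product Sales Incentive;
    define Order_ID  / display 'Order ID';
    define Product   / display 'Product';
    define Sales     / analysis sum format=dollar12.2 'Sales';
    define Incentive / computed format=dollar12.2 'Incentive';
    compute Incentive;
        /* an analysis variable is referenced as variable.statistic */
        Incentive = Sales.sum * 0.10;
    endcomp;
run;
```

The DOLLAR12.2 format is what places the dollar sign in front of the Sales values.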
Let’s practice what you have learned so far in this lesson. Read the questions carefully and then answer
them. The techniques and steps in the guide section are provided to assist you.
A leading consulting firm wants to create a summary report grouped by Region and Customer Name for
its client. The dataset tracks Customer Name, Region, Sales, Profit, and Shipping Cost. The firm also
wants to group Sales, Profit, and Shipping Cost under one “Data” header. As a SAS programmer, write the
code for this requirement. Note that the dataset has a lot of junk characters. Clean the dataset before
you perform the task.
Follow the below steps to solve the problem:
We recommend that you first solve the project and then view the solution to assess your learning.
You can perform this project in the installed SAS University Edition.
Key Takeaways
Let’s now quickly recap the concepts you have learned in the lesson:
Optimization is a mathematical technique for finding the maximum or minimum value of a function
subject to constraints.
Optimization techniques cut down operational costs and maximize the profit of the company.
The various types of optimization programming are linear programming, mixed integer linear
programming, quadratic programming, and nonlinear programming.
The objective functions and constraints can be linear or nonlinear.
PROC OPTMODEL is also used to model linear, mixed integer linear, and quadratic optimization
programs.
A solver is a method or procedure for solving an optimization problem.
Conclusion
This concludes “Data Exploration.” The next lesson is “Advanced Statistical Techniques.”
Lesson 09 — Advanced Statistics
Introduction
Hi, and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.
In this lesson, “Advanced Statistics,” you will learn about clustering, decision tree, linear regression, and
logistic regression.
What’s In It for Me
In this lesson, you will learn how to create a cluster and to perform cluster analysis on the dataset.
You will also learn to identify the regression types and to analyze the variations of the variables.
Introduction to Cluster
Suppose an e-commerce company wants to collect and analyze information about customers who have bought
or shown interest in an iPhone. This allows the company to target them for future sales.
The analysis that groups customers with similar behavior is called cluster analysis. It is also used to
summarize data.
SAS clustering procedures are used to cluster observations or variables in a SAS dataset. There are
various types of clusters available in SAS:
Disjoint clusters
Hierarchical clusters
Overlapping clusters
Fuzzy clusters
Disjoint clusters place each object in one and only one cluster.
Hierarchical clusters are organized so that one cluster can be entirely contained within another, but no
other kind of overlap between clusters is allowed.
Overlapping clusters can be constrained to limit the number of objects that belong to two clusters at
once, or left unconstrained to allow any degree of overlap.
Fuzzy clusters are defined by a probability or grade of membership of each object in each cluster.
PROC ACECLUS obtains approximate estimates of the pooled within-cluster covariance matrix when
the clusters are assumed to be multivariate normal with equal covariance matrices.
PROC CLUSTER clusters the observations in a SAS dataset hierarchically.
PROC DISTANCE computes the various measures of distance, dissimilarity, or similarity between the
observations of a SAS dataset.
PROC FASTCLUS performs disjoint cluster analysis on the basis of distances computed from one or
more quantitative variables.
PROC MODECLUS finds disjoint clusters of observations in a SAS dataset by using nonparametric density
estimation.
PROC VARCLUS divides a set of numeric variables into disjoint or hierarchical clusters.
PROC TREE produces a tree diagram, also known as a dendrogram or phenogram, from a dataset
created by PROC CLUSTER or PROC VARCLUS.
Let’s step into the syntax classroom to learn the syntax of PROC CLUSTER.
PROC CLUSTER
PROC CLUSTER METHOD=name <options>;
BY variables;
COPY variables;
FREQ variable;
ID variable;
RMSSTD variable;
VAR variables;
Only the PROC CLUSTER statement is required. The VAR statement, and possibly the ID and COPY statements,
are usually the only others needed. The FREQ statement is optional unless the RMSSTD statement is used.
The RMSSTD statement names a variable that holds the root mean square standard deviation of each
cluster, for input data whose observations are cluster means.
The COPY statement copies the listed variables from the input dataset to the OUTTREE= dataset. The
OUTTREE= option names the output dataset.
The ID statement identifies the observations in the displayed cluster history and in the OUTTREE= dataset.
The VAR statement lists the numeric variables to be used in the cluster analysis.
Clustering Methodologies
Average method
Centroid method
Complete method
Density method
EML method
Flexible method
Single method
Ward method
METHOD=AVERAGE requests average linkage. In average linkage, the distance between two clusters is the
average distance between pairs of observations, one in each cluster.
METHOD=CENTROID requests the centroid method. In the centroid method, the distance between two
clusters is defined as the squared Euclidean distance between their centroids or means.
METHOD=COMPLETE requests complete linkage. In complete linkage, the distance between two clusters is
the maximum distance between an observation in one cluster and an observation in the other cluster.
METHOD=DENSITY requests density linkage. Density linkage is a class of clustering methods that use
nonparametric probability density estimates.
METHOD=EML joins clusters to maximize the likelihood at each level of the hierarchy.
METHOD=FLEXIBLE requests the Lance-Williams flexible-beta method. The BETA= option specifies the beta
value for this method.
METHOD=SINGLE requests single linkage. In single linkage, the distance between two clusters is the
minimum distance between an observation in one cluster and an observation in the other cluster.
METHOD=WARD requests Ward’s minimum-variance method. In Ward’s minimum-variance method, the distance
between two clusters is the ANOVA sum of squares between the two clusters added up over all the
variables.
Demo-clustering Method
This demo explains how to create clusters based on the sales and profit in the Electronic dataset.
Import the “Electronic” dataset into the SAS console by following the import steps.
The PRINT= option specifies the number of generations of the cluster history to display. Here, we have
used PRINT=7.
The METHOD= option determines the clustering method used by the procedure. In this example, we use the
CENTROID method because it is more robust to outliers than most other hierarchical methods.
The RMSSTD option displays the root mean square standard deviation of each cluster.
The RSQUARE option displays the R-square and semipartial R-square values.
The values of the ID variable identify the observations in the displayed cluster history and in the OUTTREE=
data set. If the ID statement is omitted, each observation is denoted by OBn, where n is the observation
number.
The VAR statement lists the numeric variables to be used in the cluster analysis. In this example, we
have used the Sales and Profit variables from the Electronic dataset.
The cluster history table is generated, which shows the number of clusters and the variance between
clusters.
In a dendrogram, the distance is plotted on the X axis, and the sample units are plotted on the Y axis.
The tree shows how sample units are combined into clusters. Each branching point corresponds to the
distance at which two clusters are joined.
This concludes the demo on creating clusters based on the sales and profit in the Electronic dataset.
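The options discussed in this demo can be put together roughly as shown below. The dataset names
WORK.ELECTRONIC and WORK.TREE and the ID variable Order_ID are assumptions; the demo code itself appears
only on screen.

```sas
/* Sketch only: dataset names and the ID variable are assumed for illustration */
proc cluster data=work.electronic method=centroid print=7 rmsstd rsquare
             outtree=work.tree;
    var Sales Profit;   /* numeric variables used for clustering */
    id Order_ID;        /* labels observations in the cluster history */
run;

/* Draw the dendrogram from the OUTTREE= dataset */
proc tree data=work.tree;
run;
```

PROC TREE reads the OUTTREE= dataset, so the two steps must use the same dataset name.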
K Means Clustering
As discussed earlier, there are various cluster analysis procedures. The most used cluster analysis procedure
is PROC FASTCLUS or K-Means Clustering.
The K-Means clustering aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean.
PROC FASTCLUS is used in a variety of analytic, business intelligence, reporting, and data management
situations.
Let’s step into the syntax classroom to learn the syntax of PROC FASTCLUS.
K Means Clustering
PROC FASTCLUS MAXCLUSTERS=n | RADIUS=t <options>;
VAR variables;
ID variables;
FREQ variable;
WEIGHT variable;
BY variables;
The MAXCLUSTERS=n option specifies the maximum number of clusters permitted. The default value of
MAXCLUSTERS= is 100.
The RADIUS=t option sets the minimum distance criterion for selecting new seeds: an observation is
considered as a new seed only if its distance to every previous seed exceeds t. By default, t = 0.
Let’s understand K-Means clustering with the help of an example.
Let’s perform K-Means clustering on the same Electronic dataset.
The Electronic dataset is imported into the SAS console.
The OUT= option specifies the output dataset. Here, the output is stored in the “electronic dataset” table.
The MAXCLUSTERS= option sets the maximum number of clusters, and the MAXITER= option sets the maximum
number of iterations.
The Sales and Profit variables are chosen to perform K-Means clustering.
When you run this code, the output is generated and shown on the screen.
The cluster summary lists, for each cluster, the maximum distance from the seed to an observation.
In this output, that distance is zero for the first cluster and largest for the last cluster.
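A K-Means run along these lines might look like the sketch below. The dataset names and the values
MAXCLUSTERS=3 and MAXITER=10 are assumptions chosen for illustration, not the values used in the course.

```sas
/* Sketch only: dataset names and option values are assumed for illustration */
proc fastclus data=work.electronic maxclusters=3 maxiter=10
              out=work.electronic_clusters;
    var Sales Profit;
run;

/* The OUT= dataset adds the variables CLUSTER and DISTANCE to each observation */
proc print data=work.electronic_clusters (obs=10);
run;
```

The CLUSTER variable holds the cluster number, and DISTANCE holds the distance from the observation to
its cluster seed.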
Now let's do a Knowledge check of what you have learned so far.
Question 1: Which of the following methods is used to join clusters to maximize the likelihood at each
level of the hierarchy?
Answer: a.
Explanation: The Method= EML joins clusters to maximize the likelihood at each level of the hierarchy.
Decision Tree
So far you have learned about clustering, and how to perform cluster analysis using SAS.
Let’s now learn the next concept of this lesson “Decision Tree.”
A decision tree is a powerful multivariate analysis technique that identifies ways to split a dataset
into branch-like segments.
In a decision tree, each segment or branch is called a node. The bottom nodes of a decision tree are
called leaves.
Decision trees can complement other modeling approaches, for example by selecting inputs or by creating
dummy variables for a regression equation.
Decision trees are useful because they find the relationship between the input values and the target
values in a group of observations.
Let’s step into the syntax classroom to learn the syntax of decision trees.
PROC DTREE
PROC DTREE options ;
EVALUATE / options ;
MODIFY specifications ;
MOVE specifications ;
QUIT ;
RECALL ;
RESET options ;
SAVE ;
SUMMARY / options ;
TREEPLOT / options ;
VARIABLES / options ;
VPC specifications ;
VPI specifications ;
The decision tree procedure begins with the PROC DTREE statement and terminates with the QUIT statement.
The EVALUATE statement evaluates the decision tree and calculates the optimal decisions.
The MODIFY statement is used to change either the type of a stage or the reward from an outcome.
The RECALL statement tells PROC DTREE to recall the decision model that was saved previously with a SAVE
statement.
The RESET statement is used to reset the options after the procedure has started.
The VPC statement computes the value of perfect control, that is, the value of being able to control an
uncertainty.
Decision tree — Example
Many financial decisions are difficult to analyze because of the variety of available strategies and the
continuous nature of the problems.
Look at the following example, which has been taken from the SAS University Edition.
A loan officer must decide whether to approve or deny an application for a one-year $30,000 loan at the
current interest rate of 15 percent. If the application is approved, the borrower will either pay off
the loan in full after one year or default. Based on experience, the default rate is about 36 out of 700.
If the loan is denied, the money is put in government bonds at an interest rate of 8 percent.
To obtain more information about the applicant, the loan officer can engage a credit investigation unit,
at a cost of $500 per person, that gives either a positive or a negative recommendation for making the
loan. Past experience with this investigator shows that, of those who ultimately paid off their loans,
570 out of 664 were given a positive recommendation. On the other hand, 6 out of the 36 who had
defaulted had also been given a positive recommendation by the investigator.
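Before turning to PROC DTREE, the simplest branch of this problem can be checked by hand. The sketch
below compares approving the loan without an investigation against denying it. It assumes that the
payoff is the interest earned when the loan is repaid and that a default loses the full principal;
these payoff definitions are assumptions, not values confirmed by the course.

```sas
/* Sketch only: payoff definitions are assumed for illustration */
data _null_;
    loan           = 30000;
    p_default      = 36 / 700;      /* default rate from experience */
    payoff_paid    = loan * 0.15;   /* interest earned if the loan is repaid */
    payoff_default = -loan;         /* assumed: the full principal is lost */
    payoff_deny    = loan * 0.08;   /* interest from government bonds */
    ev_approve = (1 - p_default) * payoff_paid + p_default * payoff_default;
    put ev_approve= dollar12.2 payoff_deny= dollar12.2;
run;
```

Under these assumptions, approving without an investigation has an expected value of roughly 2,725.71,
compared with 2,400 for denying. PROC DTREE extends this comparison to the investigation branch.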
The following code invokes the DTREE procedure to solve this decision problem.
proc dtree
evaluate;
OPTIONS LINESIZE=85;
summary / target=Application;
OPTIONS LINESIZE=80;
The TITLE statement defines the title of the problem. Here, the title is Loan Grant Decision.
The STAGEIN= data set gives the structure of the decision problem.
The PROBIN= data set gives the probability distributions for the random events at the chance nodes.
The PAYOFFS= data set gives the payoffs for the various scenarios.
When you run this code, the output is generated, and it is shown on the screen.
The loan officer should order the credit investigation and approve the loan application if the investigator
gives the applicant a positive recommendation.
Regression
There are two types of dependent variables in regression: continuous and binary.
Variables that take values on a continuous numeric scale are called continuous variables. For example:
sales, profit, and quantity.
Variables that take only two values, such as 1 or 0, are called binary variables. For example: yes or
no, true or false, and buy or not buy.
Based on these dependent variables, regression is classified into two types: linear regression and
logistic regression.
Linear Regression
Linear regression is an approach to model the relationship between a continuous dependent variable and one
or more explanatory or independent variables.
Let’s step into the syntax classroom to learn the syntax of linear regression.
Linear Regression
PROC REG <options>;
MODEL dependents = regressors </ options>;
OUTPUT <OUT=SAS-data-set> <keyword=names>;
VAR variables;
FREQ variable;
WEIGHT variable;
ID variable;
RESTRICT linear_equation,...;
TEST linear_equation,...;
MTEST linear_equation,...;
BY variables;
The MODEL statement specifies the dependent and independent variables in the regression model.
The OUTPUT statement requests an output dataset and names the variables to contain predicted values,
residuals, and other output values.
The BY statement specifies variables to define subgroups for the analysis. The analysis is repeated for each
value of the BY variable.
The MTEST statement tests hypotheses in multivariate regression models, where there are several
dependent variables.
Remember that the PROC REG statement is always accompanied by one or more MODEL statements to
specify regression models. One OUTPUT statement may follow each MODEL statement.
There are two types of linear regression based on the number of independent variables: simple linear
regression and multiple linear regression.
In simple linear regression, a single independent variable is used to predict the value of a dependent variable.
In multiple linear regression, two or more independent variables are used to predict the value of a dependent
variable.
Let’s understand these types of linear regression with the help of an example.
Simple Linear Regression — Example
The objective of this program is to check the variation in sales based on the profit in the Electronic
dataset.
Here, the variable sales is the dependent variable and the variable profit is the independent variable.
This is an example of simple linear regression because there is one independent variable.
When you run this code, the output is generated, and it is shown on the screen.
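A simple linear regression of this kind can be sketched as follows. The dataset name WORK.ELECTRONIC is
an assumption; Sales and Profit are the variables named in the example.

```sas
/* Sketch only: WORK.ELECTRONIC is an assumed dataset name */
proc reg data=work.electronic;
    model Sales = Profit;   /* dependent = independent */
run;
quit;   /* PROC REG is interactive, so QUIT ends the procedure */
```

The output includes the parameter estimates with their t-values and p-values, and the R-square of the
model.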
From the output, you can infer that the p-value for profit is less than 0.05, so the variable profit is
significant at the 5 percent level (95 percent confidence).
Also, note that the R-square value is 79.7 percent, which means that profit explains about 79.7 percent
of the variation in sales.
Multiple Linear Regression — Example
The objective of this program is to check the variation in sales based on the profit and quantity in the
Electronic dataset.
Here, the variable sales is the dependent variable and the variables profit and quantity are the
independent variables. This is an example of multiple linear regression because there is more than one
independent variable.
When you run this code, the output is generated, and it is shown on the screen.
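The multiple regression differs from the simple one only in the MODEL statement, as sketched below. The
dataset names WORK.ELECTRONIC and WORK.REG_OUT and the output variable names are assumptions.

```sas
/* Sketch only: dataset and output variable names are assumed for illustration */
proc reg data=work.electronic;
    model Sales = Profit Quantity;   /* two independent variables */
    output out=work.reg_out p=Predicted_Sales r=Residual;
run;
quit;
```

The OUTPUT statement is optional. Here it saves the predicted values and residuals for later checks.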
From the output, we obtain the R-square value, which is the proportion of the variation in sales
explained by quantity and profit. Note that the R-square value is 86 percent.
The t-values for profit and quantity are greater than 1.96, which means that the variables are
significant at the 5 percent level.
Also, note that the p-values for the profit and quantity variables are less than 0.05, which confirms
that the variables are significant.
Logistic Regression
Logistic regression is regression analysis conducted if the dependent variable is dichotomous or binary. Like
all regression analyses, logistic regression is a predictive analysis.
Logistic regression is used to describe data and to explain the relationship between one dependent binary
variable and one or more metric independent variables. Metric independent variables are variables that are
measured on an interval or a ratio scale.
Logistic regression is used in areas such as insurance, marketing, sales, operations, health, and
gaming.
Logistic Regression
BY variables ;
FREQ variable ;
SCORE <options> ;
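The statements listed above are available in PROC LOGISTIC, which is assumed here to be the procedure
this slide refers to. The sketch below fits a binary model and scores new data. The dataset names and
the variables Purchased, Sales, and Profit are hypothetical.

```sas
/* Sketch only: procedure choice, dataset names, and variables are assumptions */
proc logistic data=work.customers;
    /* Purchased is assumed to be coded 1 (buy) or 0 (not buy) */
    model Purchased(event='1') = Sales Profit;
    score data=work.new_customers out=work.scored;
run;
```

The EVENT='1' option tells the procedure which outcome to model, so the estimated probabilities refer to
a purchase.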
Lesson 10 — Working with Time Series Data
Introduction
Hi, and welcome back to the Data Science with Statistical Analysis System, or SAS, course offered by
Simplilearn.
In this lesson, “Working with Time Series Data,” you will understand what time series analysis is and how
to work with time series data.
What’s In It for Me
In this lesson, you will learn how to read SAS date and datetime values.
You will also learn how to plot, transform, transpose, and interpolate time series data in SAS datasets.
Need for Time Series Analysis
Let’s begin this lesson by understanding the need for time series analysis. Many datasets are recorded
over time, such as the daily sales of an e-commerce company, the weekly production of a shoe
manufacturing company, the number of tickets sold by an airline every month, the yearly GDP of a
developing country, and so on.
Time series analysis works with observations listed in time order. The observations can come from a
single sample or from multiple samples. It is also used to forecast patterns based on historical time
interval data.
Goals of Time Series Analysis
Let’s understand the goals of time series analysis. The main goals of time series analysis are as follows:
Time Series Analysis
Let’s step into the “Syntax Classroom” to learn the syntax of time series.
The tasks that you perform to increase sales through a marketing campaign are called marketing
analysis.
PROC TIMESERIES DATA=<input-data-set>
OUT=<output-data-set>;
ID <time-ID-variable> INTERVAL=<frequency>
ACCUMULATE=<statistic>;
VAR <time-series-variables>;
RUN;
The syntax for PROC TIMESERIES is shown on the screen. The TIMESERIES procedure forms time series
from the input time-stamped transactional data. It provides results using the Output Delivery System, or
ODS.
The ACCUMULATE= option in the ID or VAR statement specifies how observations are accumulated within each
time period. Its values include NONE, TOTAL, AVERAGE, MINIMUM, MAXIMUM, and MEDIAN.
The INTERVAL= option in the ID statement specifies the frequency or width of each time interval. Its
values include DAY, MONTH, and YEAR.
Time Series Analysis—Examples
An E-commerce Company wants to analyze the records associated with each of its customers over time.
The dataset keeps a track of Order Date, Customer ID, Customer Name, Product Category, Product,
Sales, and Profit.
In this case, you can analyze each record using the Time Series Procedure.
The OUT= option specifies the storage location of the resulting time series data for each customer. Here,
the resulting time series data is stored in the Ecommerce_Monthly dataset.
The INTERVAL= Month option specifies that the transactions are to be aggregated on a monthly basis.
The ACCUMULATE = TOTAL option specifies the sum of the transactions to be calculated.
When a BY statement appears in the PROC TIMESERIES step, the procedure expects the input data to be
sorted by the BY variables together with the ID variable. We can use PROC SORT to order the E_commerce
data by “Customer_Name” and “Order_Date.” Note that “Customer_Name” must appear before
“Order_Date” in the sort procedure.
In this example, each BY group associated with the BY variable “Customer_Name” contains one
observation per month for that customer. Each observation contains the variables “Sales” and
“Profit,” whose values are the monthly totals.
Within each customer, the records are sorted in ascending date order (Jan→Feb→Mar…………→Dec2015).
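The sort and the time series step described in this example can be sketched as follows. The library
WORK and the exact dataset spellings are assumptions; the variable names and option values come from the
text above.

```sas
/* Sketch only: library and dataset spellings are assumed for illustration */
proc sort data=work.e_commerce;
    by Customer_Name Order_Date;   /* BY variable first, then the time ID */
run;

proc timeseries data=work.e_commerce out=work.ecommerce_monthly;
    by Customer_Name;
    id Order_Date interval=month accumulate=total;   /* monthly sums */
    var Sales Profit;
run;
```

If the data is not sorted this way, the BY statement causes the step to stop with an error, which is why
PROC SORT comes first.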
Time Series Analysis—Options
There are various options available in Time Series Analysis in SAS. Some of the options used in the time
series analysis are as follows:
CROSSPLOTS = option
MAXERROR = number
PLOTS = option
PRINT= option
SORTNAMES
CROSSPLOTS = option specifies the cross-variable graphical output desired. The CROSSPLOTS= option
produces results similar to the datasets listed in parentheses next to the preceding options.
By default, the TIMESERIES procedure produces no graphical output. You can use plotting options such
as Series and CCF to plot the output graph.
MAXERROR=number limits the number of warning and error messages that are produced during the
execution of the procedure to the specified value. The default is MAXERROR=50. This option is
particularly useful in BY-group processing, where it can be used to suppress recurring messages.
677
© Copyright 2015, Simplilearn. All rights reserved.
Time Series Analysis—Options(contd.)
PLOTS=option specifies the desired univariate graphical output. By default, the TIMESERIES
procedure produces no graphical output. You can use plotting options such as SERIES, RESIDUAL, CYCLES,
and HISTOGRAM to request the graphical output.
Time Series Analysis—Options(contd.)
PRINT=option specifies the desired printed output. By default, the TIMESERIES procedure produces no
printed output. You can use printing options such as DECOMP, SEASONS, TRENDS, DESCSTATS, and
SUMMARY to produce printed output.
Time Series Analysis—Options(contd.)
SORTNAMES specifies that the variables specified in the VAR and CROSSVAR statements be processed in
sorted order by the variable names. This option allows the output data sets to be pre-sorted by the
variable names.
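As a rough sketch, the options above could be combined in one PROC TIMESERIES step as follows. It assumes the E_commerce data with the variables named earlier; the output dataset names are hypothetical, and the particular plot and print choices are only examples.

```sas
ods graphics on;   /* graphical output needs ODS Graphics */

/* The BY variable and the ID variable must be in sorted order */
proc sort data=E_commerce out=E_commerce_sorted;
   by Customer_Name Order_Date;
run;

proc timeseries data=E_commerce_sorted out=monthly_totals
                plots=(series histogram)     /* univariate plots          */
                crossplots=(series ccf)      /* cross-variable plots      */
                print=(descstats seasons)    /* printed tables            */
                maxerror=10                  /* cap on repeated messages  */
                sortnames;                   /* process variables by name */
   by Customer_Name;
   id Order_Date interval=month accumulate=total;
   var Sales;
   crossvar Profit;   /* cross-variable plots relate Sales to Profit */
run;
```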
Reading Date and Datetime Values
SAS provides a selection of informats for reading SAS date and datetime values. A SAS informat is an
instruction that converts character-string values into the numerical values of a SAS variable.
To see today's date in the SAS log, type the command shown on the screen.
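The command on the screen is not reproduced in this transcript. One possible way to write today's date to the SAS log, offered only as an illustration, is a DATA _NULL_ step:

```sas
data _null_;
   today = today();                  /* current date as a SAS date value */
   put "Today is " today date9.;     /* DATE9. displays it as ddMONyyyy  */
run;
```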
Reading Date and Datetime Values(contd.)
The ANYDTDTE informat is used to convert text strings into SAS date values. Look at the output shown
on the screen. The dates are displayed in the same format although they were written in various formats.
SAS also provides formats to convert the representation of date and datetime values used by SAS. A SAS
format is an instruction that converts the internal numerical value to a character string that can be
printed or displayed.
Let’s consider the same example chosen for the informat. Look at the output shown on the screen. The
dates are displayed in the desired format.
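The on-screen example is not included in the transcript, so the following is a stand-in sketch with made-up input values. It reads dates written in different styles with the ANYDTDTE informat and then displays the same internal value with two different formats. How an ambiguous value such as 03/04/2015 is read depends on the DATESTYLE= system option, so the sample values are chosen to be unambiguous.

```sas
data dates;
   input raw_date anydtdte20.;   /* informat: text -> SAS date value       */
   as_number   = raw_date;       /* unformatted: days since 01JAN1960      */
   as_worddate = raw_date;       /* same value, to be shown another way    */
   format raw_date date9. as_worddate worddate20.;   /* formats: value -> text */
   datalines;
01JAN2015
2015-02-15
03/31/2015
;
run;

proc print data=dates;
run;
```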
Time Series Patterns
We hope you now understand what time series analysis is, what its goals are, and which options are available. Let’s now look
at time series patterns. Time series data can show four types of patterns. They are as follows:
Trend
Seasonality
Cyclic
Random
A trend pattern exists when there is a long-term increase or decrease in the data. It does not have to be
linear. A trend is sometimes said to be “changing direction” when it switches from an increasing
trend to a decreasing trend or vice versa. An example is the rising and falling trend pattern of the stock
market.
Time Series Patterns(contd.)
A seasonality pattern is defined as a repeating pattern with a fixed period. A seasonal pattern exists
when a series is influenced by seasonal factors, for example, the quarter of the year, the month, or the day
of the week. Note that seasonality is always of a fixed and known period.
Time Series Patterns(contd.)
A cyclic pattern exists when the data exhibits rises and falls that are not of a fixed period. The duration of a cycle
depends on the type of business or industry being analyzed, but it is usually at least two years. Overall,
the length of cycles is on average longer than the length of a seasonal pattern. The business cycle, an
economy's recurring pattern of growth, recession, and recovery, is an example.
Time Series Patterns(contd.)
A random pattern exists when the data shows none of the other three patterns: Trend,
Seasonality, and Cyclic.
For example, the daily change in the S&P 500 index has no trend, seasonality, or cyclic behavior.
Knowledge Check
S.No.: 1
Question: Which of the following patterns is always of a fixed and known period?
Answer & Explanation: c. Seasonality is always of a fixed and known period.
White Noise Process
Based on the correlation between its values at different times, time series data can be of two types. The data
can be uncorrelated with zero mean and constant variance, or correlated with constant mean and
variance.
White Noise Process(contd.)
A white noise process has a zero mean, a constant variance, and no correlation between its values at
different times. Plots of white noise series exhibit erratic, jumpy, and unpredictable behaviour.
Since values are uncorrelated, previous values do not help us forecast future values.
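The three properties of white noise can be written compactly. The symbols are chosen here for illustration: $\varepsilon_t$ is the value of the series at time $t$, and $\sigma^2$ is its constant variance.

```latex
\mathbb{E}[\varepsilon_t] = 0, \qquad
\operatorname{Var}(\varepsilon_t) = \sigma^2, \qquad
\operatorname{Cov}(\varepsilon_t, \varepsilon_s) = 0 \quad \text{for all } t \neq s .
```

The last condition is the one that rules out forecasting: knowing $\varepsilon_s$ tells you nothing linear about $\varepsilon_t$.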
White Noise Process(contd.)
The scatter plot of such a series across time will indicate no pattern; hence, forecasting future values from past values is
not possible.
Therefore, if the data shows the white noise feature, avoid performing Time Series Analysis, and vice
versa. For example, the day-to-day changes in the stock price of TATA Motors may be
uncorrelated. Forecasting the future values from past values is then not possible. In this case, to forecast for the next day,
calculate the average of the data. Similarly, the daily change in the S&P 500 index has no trend,
seasonality, or cyclic behaviour.
Time Series Model
Various time series models are available in SAS. They are as follows:
Auto Regressive (AR) model
Moving Average (MA) model
Autoregressive and Moving Average (ARMA) model
Autoregressive Integrated Moving Average (ARIMA) model
Time Series Model(contd.)
The Auto Regressive, or AR, model is used to forecast time series using the past values Y(t-1), Y(t-2), Y(t-3), and
so on. The equation for the auto regressive model is shown on the screen.
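The on-screen equation is not part of this transcript. One common way to write an autoregressive model of order $p$ is given below, where $c$ (a constant), $\phi_1, \dots, \phi_p$ (the autoregressive parameters), and $\varepsilon_t$ (a white noise error term) are notation assumed for this sketch.

```latex
Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t
```

With $p = 1$ this reduces to $Y_t = c + \phi_1 Y_{t-1} + \varepsilon_t$, so each value is a scaled copy of the previous value plus noise.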
Time Series Model(contd.)
The Moving Average, or MA, model is used to forecast time series if Yt depends only on the random
error terms.
The equation for the moving average model is shown on the screen.
Here Yt is a function of past error terms, Et is the error term, and ϕ1 to ϕp are the parameters.
The error terms here are assumed to be white noise processes with a zero mean and constant variance.
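Since the slide itself is not reproduced, here is one plausible form of the moving average equation. It keeps the parameter names $\phi_1, \dots, \phi_p$ from the narration, writes the error term $E_t$ as $\varepsilon_t$, and adds a mean $\mu$ as an assumption.

```latex
Y_t = \mu + \varepsilon_t + \phi_1 \varepsilon_{t-1} + \phi_2 \varepsilon_{t-2} + \dots + \phi_p \varepsilon_{t-p}
```

Many textbooks write the moving average parameters as $\theta_1, \dots, \theta_q$ instead, and some software uses the opposite sign in front of them, so check the convention before comparing estimates.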
Time Series Model(contd.)
The Autoregressive and Moving Average, or ARMA, model is used to forecast time series using both the
past values and error terms.
It is referred to as ARMA(p,q), where p is the number of autoregressive terms and q is the number of moving average terms.
The equation for the Autoregressive and Moving Average model is shown on the screen.
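A sketch of the ARMA(p,q) equation, combining $p$ past values and $q$ past errors, is given below. The notation is assumed for illustration: $\phi_i$ are the autoregressive parameters, $\theta_j$ are the moving average parameters (named differently here only to keep the two sets apart), $c$ is a constant, and $\varepsilon_t$ is white noise.

```latex
Y_t = c + \sum_{i=1}^{p} \phi_i \, Y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

Setting $q = 0$ leaves a pure AR(p) model, and setting $p = 0$ leaves a pure MA(q) model.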
Time Series Model(contd.)
The Autoregressive Integrated Moving Average, or ARIMA, model predicts a value in a response time
series as a linear combination of its own past values, past errors, and current and past values of other
time series. The order of an ARIMA model is usually denoted by the notation shown on the screen,
ARIMA(p,d,q), where p is the order of the autoregressive part, d is the order of differencing, and q is the
order of the moving average part.
Time Series Model(contd.)
If no differencing is done (d = 0), the models are usually referred to as ARMA(p, q) models. The equation
for the Autoregressive Integrated Moving Average model ARIMA(p, d = 0, q) is shown on the screen.
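To make the role of $d$ concrete, the differencing step can be sketched as follows, using the backshift operator $B$ (an assumed notation, defined by $B\,Y_t = Y_{t-1}$):

```latex
W_t = (1 - B)^d \, Y_t ,
\qquad \text{for example } d = 1: \; W_t = Y_t - Y_{t-1} .
```

The differenced series $W_t$ is then modeled as ARMA(p,q). With $d = 0$ there is no differencing, $W_t = Y_t$, and the model is simply ARMA(p,q) on the original series.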
Stationarity of a Time Series
A series is said to be strictly stationary if the marginal distribution of y at time t is the same as at any other
point in time. This implies that the mean, variance, and covariance of the series are time invariant.
A series is said to be weakly stationary, or covariance stationary, if its mean, variance, and covariance are constant over time.
Stationarity of a Time Series(contd.)
Mean is constant
Stages of ARIMA Modelling
The estimation and forecasting of univariate time series is carried out using ARIMA models, which are
often referred to as Box-Jenkins models after Box and Jenkins. Remember that this approach is applicable only if
the variable is stationary. ARIMA modeling has three stages:
Identification stage
Estimation and diagnostic checking stage
Forecasting stage
Let’s learn about each stage in detail.
Identification Stage
Following are the two considerations to forecast time series using ARIMA modeling:
Specify the response series and identify candidate ARIMA models for it.
Perform a stationarity test to determine whether differencing is necessary.
Use the IDENTIFY statement to carry out both tasks.
Estimation and diagnostic checking stage
In the Estimation and diagnostic checking stage, perform the following tasks:
Specify the ARIMA model to fit to the specified variable and estimate the parameters.
Judge the adequacy of the model.
Examine the significance tests, the goodness-of-fit statistics, and the tests for white noise residuals.
Significance tests for the parameters are used to identify unnecessary terms in the model.
Tests for white noise residuals indicate whether the residual series contains additional information that
might be used by a more complex model.
Estimation and diagnostic checking stage(contd.)
Use the ESTIMATE statement to specify the ARIMA model to fit to the variable specified in the previous
IDENTIFY statement and to estimate the parameters of that model.
The ESTIMATE statement also produces diagnostic statistics to help you judge the adequacy of the
model.
Forecasting Stage
In the forecasting stage, you use the FORECAST statement to forecast future values of the time series
and to generate confidence intervals for these forecasts from the ARIMA model produced by the
preceding ESTIMATE statement.
Now let's do a Knowledge check of what you have learned so far.
S.No.: 1
Question: Which of the following statements helps you judge the adequacy of a model?
Answer & Explanation: c. The ESTIMATE statement produces diagnostic statistics to help you judge the adequacy of a
model.
Stages of ARIMA Modeling - Example
Consider the electronic dataset as an example, and let’s forecast the Sales variable using the ARIMA
model.
Demo
Run;
The PROC ARIMA statement invokes the ARIMA procedure, which is used to model and forecast the time series.
The IDENTIFY statement is used to check the stationarity of a variable, and it performs a white noise test. It
also produces descriptive statistics, a time series plot of the series, the sample autocorrelation function (ACF) plot,
the inverse autocorrelation function (IACF) plot, and the partial autocorrelation function (PACF) plot.
These autocorrelation function plots show the degree of correlation between the series and its past values at each lag for
which the correlation was computed.
The NLAG= option controls the number of lags for which the autocorrelations are shown. By default, the
autocorrelation functions are plotted to lag 24.
In this example, the white noise hypothesis is strongly rejected, and the mean of the working series is not
zero. Also, the series is non-stationary, as the autocorrelation trends are not similar.
Since the series is non-stationary, let’s perform differencing to make the series stationary.
Let’s write the code to make the series stationary.
Identify Var=Sales(1);
To difference the SALES series, use another IDENTIFY statement and specify the first difference of
SALES to analyze.
Instead of modeling the SALES series itself, we can model the change in SALES from one period to the
next period.
Notice that this statement evaluates the change in sales between periods instead of evaluating
the total sales amount (as the Identify Var=Sales statement does).
Let’s now perform the estimation and diagnostic checking stage of the ARIMA model.
We can use the ESTIMATE statement to specify the ARIMA model to fit to the variable specified in the
previous IDENTIFY statement and to estimate the parameters of that model.
Here, let’s use AR(1) to predict the change in sales. The P= option specifies the order of the autoregressive
part (here, first order).
Note that there are various other candidate models for the series, such as MA(1) and ARMA, which the
autocorrelation plots can help you choose between.
Estimate p=1;
The p-value for the autoregressive parameter is 0.0024 (less than 5%), so this term is highly significant.
On the other hand, the p-value for MU indicates that the mean term adds very little to the model.
The test statistics for the residuals series indicate whether the residuals are uncorrelated (white noise)
or contain additional information that might be used by a more complex model. In this case, the test
statistics reject the no-autocorrelation hypothesis at a high level of significance (p = 0.0029 for the first
six lags). This means that the residuals are not white noise, and so the AR(1) model is not a fully
adequate model for this series.
To produce the forecast output, use the FORECAST statement after the ESTIMATE statement for the
model you decide is best.
Note that if the last model fit is not the best, then repeat the ESTIMATE statement for the best model
before you use the FORECAST statement.
Let’s use the LEAD= option to specify how many periods ahead to forecast. In this example program, the
sales series is forecasted for one year ahead from the most recently available SALES figure. So, let’s use
lead=12.
Let’s use the INTERVAL= option to indicate the interval of the data. In this example, the data is monthly, so let’s use
interval=month.
The ID= option specifies the ID variable, which is typically a SAS date, time, or datetime variable. In this
example, let’s use id=date.
The OUT= option writes the forecasts to the output dataset. In this example, let’s store the forecasted
data in the “results” dataset.
run;
We have obtained the time series forecasts for each month of the next year.
The ARIMA model for this example is denoted ARIMA(1,1,1), since the
IDENTIFY statement specified d = 1, and the final ESTIMATE statement specified p = 1 and q = 1.
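Putting the pieces of the demo together, the whole session might look like the sketch below. Only fragments of the code survive in this transcript, so the dataset name electronic and the date variable name date are assumptions based on the narration, and the second ESTIMATE statement is inferred from the ARIMA(1,1,1) summary.

```sas
proc arima data=electronic;
   /* Identification: inspect the original series (24 lags is the default) */
   identify var=Sales nlag=24;
   run;

   /* Identification: analyze the first difference of Sales (d = 1) */
   identify var=Sales(1);
   run;

   /* Estimation and diagnostic checking: AR(1) on the differenced series */
   estimate p=1;
   run;

   /* Another candidate: one AR and one MA term, that is, ARIMA(1,1,1) */
   estimate p=1 q=1;
   run;

   /* Forecasting: 12 monthly periods ahead, saved to the "results" dataset */
   forecast lead=12 interval=month id=date out=results;
   run;
quit;
```

PROC ARIMA is interactive, so each RUN statement executes the statements entered so far, and QUIT ends the procedure. The FORECAST statement uses the model from the most recent ESTIMATE statement.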
Plot, Transform, Transpose, and Interpolate
So far you have learned the various time series models. Let’s now learn how to plot, transform,
transpose, and interpolate time series data in SAS datasets.
To plot the time series, use the procedures shown on the screen.
Procedure       Description
PROC GPLOT      Produces high-resolution color graphics plots
PROC PLOT       Produces low-resolution line-printer-type plots
PROC TIMEPLOT   Plots time series data vertically on the page instead of horizontally across the page
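As an illustration, two of these procedures could be called as shown below. The dataset monthly_sales and its variables Order_Date and Sales are hypothetical names, and PROC GPLOT is available only where SAS/GRAPH is installed.

```sas
/* High-resolution plot: Sales on the vertical axis, time on the horizontal */
symbol1 interpol=join value=dot;
proc gplot data=monthly_sales;
   plot Sales*Order_Date;
run;
quit;

/* Text plot with time running down the page */
proc timeplot data=monthly_sales;
   id Order_Date;
   plot Sales;
run;
```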
Plot, Transform, Transpose, and Interpolate(contd.)
A time series is transformed to deal with a restricted range, a non-linear trend, or a variance that is not stable.
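A logarithmic transformation is one frequently used example, since it tends to stabilize a variance that grows with the level of the series. The sketch below assumes a hypothetical dataset monthly_sales with a positive-valued variable Sales.

```sas
data sales_log;
   set monthly_sales;
   /* The logarithm is defined only for positive values */
   if Sales > 0 then log_Sales = log(Sales);
run;
```

Forecasts made on the log scale need to be transformed back, for example with the EXP function, before they are reported.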
Plot, Transform, Transpose, and Interpolate(contd.)
The TRANSPOSE procedure is used to transpose datasets from one form to another.
The TRANSPOSE procedure can transpose variables and observations within BY groups.
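A sketch of transposing within BY groups is shown below. It assumes a hypothetical dataset monthly_sales with exactly one observation per customer per month, and it turns the monthly rows of each customer into columns.

```sas
proc sort data=monthly_sales out=monthly_sorted;
   by Customer_Name Order_Date;
run;

/* One output row per customer, one column per month (M_JAN2015, ...) */
proc transpose data=monthly_sorted out=sales_wide prefix=M_;
   by Customer_Name;
   id Order_Date;               /* formatted date values name the new columns */
   format Order_Date monyy7.;
   var Sales;
run;
```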
Plot, Transform, Transpose, and Interpolate(contd.)
The EXPAND procedure interpolates a time series. By default, the EXPAND procedure performs
interpolation by first fitting cubic spline curves to the available data and then computing needed
interpolating values from the fitted spline curves.
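For instance, a quarterly series could be interpolated to monthly values as sketched here. The dataset quarterly_sales, its date variable date, and the output names are assumptions made for illustration.

```sas
proc expand data=quarterly_sales out=monthly_sales
            from=qtr to=month;
   id date;
   /* Cubic spline interpolation is the default, stated here for clarity */
   convert Sales=Sales_monthly / method=spline;
run;
```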
Assignment
Let’s practice what you have learned so far in this lesson. Read the questions carefully and then answer
them. Techniques and steps are provided to assist you under the guide section.
Assignment(contd.)
A pharmaceutical company wants to forecast daily Sales based on its Sales Dataset. The dataset keeps
track of Order_ID, Product, Product_Category, Sales, Profit, and Order_Priority. As a SAS programmer,
write the code for this requirement.
Assignment(contd.)
We recommend that you first solve the project and then view the solution to assess your learning.
You can perform this project in the installed SAS University Edition.
Key Takeaways
Conclusion
This concludes “Working with Time Series Data.” The next lesson is “Data Optimization Using SAS.”
ANSWERS:
Lesson 11 — Designing Optimization Models
Introduction
Hi, and welcome back to the “Data Science with Statistical Analysis System or SAS” course offered by
Simplilearn.
In this lesson, “Designing Optimization Models,” you will learn how to solve the various types of
optimization problems.
What’s In It for Me
In this lesson you will understand the need for optimization in industries. You will learn the problems
involved in optimization.
In addition, you will learn how to perform optimization using Statistical Analysis System.
Need for Optimization
Optimization is a mathematical technique for finding the maximum or minimum value of a function
subject to constraints. Optimization techniques are important in many industries today, and they form a
major part of the area of Operational Research.
Optimization cuts down operational costs and maximizes the profit of the company.
Need for Optimization(contd.)
A company is organizing a bus trip to Vegas for 400 of its employees. The admin team has contacted an
agency that has 10 large buses and 8 small buses, with seating capacities of up to 50 and 40 people, respectively. However,
only 9 drivers are available in a shift. The rental cost is $800 for a large bus and
$600 for a small bus. The admin team has to calculate how many buses of each type to charter at the least
possible cost.
This kind of linear problem can be solved using the optimization techniques of SAS.
Finding the minimum transport cost that satisfies all the constraints is an optimization problem.
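As a preview of the PROC OPTMODEL syntax covered later in this lesson, the bus problem might be written as follows. The variable and constraint names are made up for this sketch, and the reading of the fleet (10 buses with 50 seats and 8 buses with 40 seats) follows the problem statement above.

```sas
proc optmodel;
   /* Number of buses of each type: whole numbers within the fleet size */
   var large integer >= 0 <= 10;
   var small integer >= 0 <= 8;

   /* Seat all 400 employees */
   con seats: 50*large + 40*small >= 400;
   /* No more buses than available drivers */
   con drivers: large + small <= 9;

   /* Total rental cost */
   min cost = 800*large + 600*small;

   solve;
   print cost large small;
quit;
```

Checking the corner points by hand points to 4 large and 5 small buses, which seat exactly 400 people with 9 drivers at a cost of $6,200.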
Optimization Programming
Before we deal with the optimization problems, let’s understand the various types of optimization
programming.
The various types of optimization programming are linear programming, mixed integer linear
programming, quadratic programming, and nonlinear programming.
Optimization Programming(contd.)
Linear programming is a technique to maximize or minimize a function of several variables, such as cost,
time, and production, subject to the constraints of the problem. If the variables are real numbers and the
objective and constraints are linear functions of those variables, then use linear programming for optimization.
Optimization Programming(contd.)
Mixed integer linear programming is used when some or all of the decision variables are constrained to take integer values at
the optimal solution. The integer values may be binary numbers or whole numbers. The use of integer
variables greatly expands the scope of useful optimization problems that you can define and solve.
Optimization Programming(contd.)
Quadratic programming is used to solve optimization problems in which the objective is a quadratic
function of the variables, subject to linear constraints. The standard form of a quadratic equation is shown on the
screen.
ax^2 + bx + c = 0
Optimization Programming(contd.)
Nonlinear programming is used if the objective function or any of the constraints is a nonlinear function of the variables.
An equation in which the variables do not all appear as simple first-power terms, for example one with a
squared term or a product of two variables, is referred to as a nonlinear equation.
Optimization Problems
The general optimization problem is that of minimizing or maximizing an objective function subject to
constraints imposed on the variables of that function.
There are various types of constraints, such as bound constraints, equality constraints, inequality
constraints, and integer constraints.
Optimization Problems(contd.)
The mathematical form of an optimization problem is called a mathematical program. When this
mathematical program is fed to the relevant algorithm, the algorithm determines the optimal values of the decision
variables, that is, the values that maximize or minimize the objective while staying within the defined limits.
Optimization Problems(contd.)
So, optimization can be defined as the process of determining the optimal values for a given objective and set of limits.
If the constraints of an optimization problem are linear and the objective is either linear or quadratic, the
optimization problem can be solved using a SAS procedure.
Optimization Problems(contd.)
Optimization problems are classified into four types based on the functional form of
the objectives and constraints. They are:
Linear optimization problems
Mixed integer linear optimization problems
Quadratic optimization problems
Nonlinear optimization problems
Optimization Problems(contd.)
PROC OPTLP is used to solve linear optimization problems. It uses the mathematical programming
system, or MPS, format. This format is used to describe linear programming and integer
programming problems.
Files in MPS format are mostly text files, and they follow specific conventions for the order in which the problem is
specified.
Optimization Problems(contd.)
PROC OPTMILP is used to solve mixed integer linear problems. These are linear problems in which
the decision variables are integer constrained.
It requires a SAS dataset that specifies the mixed integer linear program in the MPS format.
Optimization Problems(contd.)
PROC OPTQP is used to solve quadratic optimization problems, which have a
quadratic objective function and a collection of linear constraints.
The input problem needs to be specified in the quadratic programming system, or QPS, format.
Optimization Problems(contd.)
PROC OPTMODEL is an optimization modeling language, and it is used to model nonlinear
optimization programs.
A nonlinear optimization problem is one in which the objective function, or at least one of the equality or
inequality constraints, is nonlinear.
PROC OPTMODEL
PROC OPTMODEL is also used to model linear, mixed integer linear, and quadratic optimization
programs.
You can declare a model, pass it directly to various solvers such as primal simplex, dual simplex,
network simplex, and interior point, and review the solver result.
You can also save an instance of a linear model in dataset form for use by the OPTLP procedure.
PROC OPTMODEL(contd.)
A solver is a method or procedure to resolve an optimization problem. The solvers used for linear
programming, mixed integer linear programming, quadratic programming, and nonlinear programming
are LP, MILP, QP, and NLP, respectively.
Let’s step into the “Syntax Classroom” to learn the syntax of the PROC OPTMODEL.
PROC OPTMODEL(contd.)
The OPTMODEL procedure includes the modeling language and solvers for several classes of
mathematical programming problems.
PROC OPTMODEL(contd.)
The objective functions are used to define minimum and maximum objectives.
PROC OPTMODEL(contd.)
Note that the PROC OPTMODEL ends with the quit statement.
PROC OPTMODEL(contd.)
• PROC statement
• Declaration statements
• Programming statements
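The syntax slides are not reproduced in this transcript, so the following is only a minimal sketch of how the three groups of statements fit together. The small model in it (variables x and y, constraints c1 and c2, objective z) is invented for illustration.

```sas
proc optmodel;                  /* PROC statement: invokes the procedure */

   /* Declaration statements: variables, constraints, objective */
   var x >= 0, y >= 0;
   con c1: x + y <= 4;
   con c2: x + 3*y <= 6;
   max z = 3*x + 2*y;

   /* Programming statements: call a solver and report the results */
   solve with lp;
   print z x y;

quit;                           /* PROC OPTMODEL ends with QUIT */
```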
PROC Statement
The PROC statement invokes the procedure and sets initial option values. The various PROC statement
options are shown on the screen.
CDIGITS=number specifies the expected number of decimal digits of accuracy for nonlinear
constraints.
ERRORLIMIT=number | NONE specifies the maximum number of error messages that can be
displayed.
FD=FORWARD | CENTRAL selects the method used to approximate numeric derivatives when
analytic derivatives are unavailable.
INTFUZZ=number specifies the tolerance for rounding the bounds on integer and binary variables
to integer values.
MAXLABLEN=number specifies the maximum length for MPS row and column labels.
PMATRIX=number adjusts the density evaluation of a two-dimensional array to affect how it is
displayed.
PDIGITS=number requests that the PRINT statement display the specified number of significant digits for numeric
columns for which no format is specified.
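A few of these options could be set on the PROC statement as in the sketch below. The option values and the one-variable model are arbitrary choices for illustration, not recommendations.

```sas
/* Print 4 significant digits, stop after 10 error messages, and use
   central differences if numeric derivatives are ever needed */
proc optmodel pdigits=4 errorlimit=10 fd=central;
   var x init 1;                      /* starting value for the solver */
   min f = (x - 2)**2 + exp(-x);      /* a small nonlinear objective   */
   solve;
   print x f;
quit;
```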
Declaration Statements
Programming Statements
The programming statements read and write data, invoke the solver, and print the results.
COFOR executes the statement repeatedly with support for concurrent solver invocations.
Optimization- Example 1
Let’s solve some optimization problems using the Statistical Analysis System. Each example has a
problem statement, analysis, required code, and output.
Optimization- Example 1
Problem Statement
A manufacturer produces two products, X and Y, with two machines, A and B. The cost of producing
each unit and the working plan of X and Y are shown on the screen.
The cost of producing each unit and the working plan of machine A are shown in table 1.
The cost of producing each unit and the working plan of machine B are shown in table 2.
The week starts with a stock of 30 units of X and 90 units of Y and a demand of 75 units of X and 95 units
of Y.
Plan the production to end the week with the maximum stock.
Optimization- Example 1
From the given conditions, the constraints are derived, and they are shown on the screen. From the
question, we need to obtain the maximum stock at the end of the week.
The variables X and Y are real numbers, and they are greater than or equal to zero. Also, the objective and the
constraints are linear functions of X and Y. So, the problem is termed a linear problem.
Note that a licensed version of SAS is required to solve the optimization problems.
Optimization- Example 1
Following is the code required to optimize the linear equation using SAS’s PROC OPTMODEL. Use the
procedure “PROC OPTMODEL” to tell SAS to solve the optimization problem.
First, set the variables and introduce the logical constraints, if any. Here, the variables X and Y are set as
greater than or equal to zero.
Second, set the constraints of the problem using the keyword “con”. Here, there are four constraints
involved in this problem.
Third, set the objective function of the problem. Here, the objective is to maximize the
stock. So, use the keyword “Max.” Note that “F” is the variable that holds the maximum value of the function.
Fourth, use the solve keyword to solve the optimization problem. SAS decides the best solver method of
computation. You can also mention the relevant solver, such as LP, NLP, MILP, or QP. Here, the
relevant solver is LP, as the problem is a linear optimization problem.
Finally, use the print statement to print the required values. Here, the values of F, X, and Y are printed on
the screen.
Optimization- Example 1
The solver used in this example is dual simplex. Look at the solution status. The status is “optimal,” which
shows that an optimal solution was found.
Optimization- Example 2
A mathematician has analyzed and derived the following equation. He needs to calculate the minimum
output for that equation. So, instead of solving it manually, he approaches a SAS programmer to
optimize the equation.
Constraints:
x1 - x2 <= 5
x1 + x2 >= 50
x1 >= 0
x2 >= 0
Optimization- Example 2
Analysis:
This equation contains squared terms, and therefore it is termed a nonlinear (quadratic) equation.
Note that a licensed version of SAS is required to solve the optimization problems.
Optimization- Example 2
Code:
Following is the required code to optimize the quadratic equation using SAS’s PROC OPTMODEL.
proc optmodel;
   var x1 >= 0, x2 >= 0;
   con con1: x1 - x2 <= 5;
   con con2: x1 + 2*x2 >= 50;
   minimize f = 4*x1 + 5*x1**2 + 3*x1**2 + 7*x2 + 6*x1*x2;
   solve;
   print f x1 x2;
quit;
Use the procedure “PROC OPTMODEL” to tell SAS to solve the optimization problem.
First, set the variables and introduce the logical constraints, if any. Here, the variables x1 and x2 are set as
greater than or equal to zero.
Second, set the constraints of the problem using the keyword “con.” Here, there are four constraints
involved in this problem. Note that x1 and x2 are already set as greater than or equal to zero, so only two
“con” statements are needed.
Third, set the objective function of the problem. Here, the objective is to find the minimum
value of the equation. So, use the keyword “minimize.” Note that “f” is the variable that holds the minimum
value of the function.
Fourth, use the solve keyword to solve the optimization problem. SAS decides the best solver method of
computation. You can also mention the relevant solver, such as LP, NLP, MILP, or QP. Here, the
relevant solver is QP, as the problem is a quadratic optimization problem.
Finally, use the print statement to print the required values. Here, the values of f, x1, and x2 are printed on
the screen.
Optimization- Example 2
The solver used in this example is NLPC. Look at the solution status. The status is “optimal,” which
shows that the solver found an optimal solution.
Assignment
Let’s practice what you have learned so far in this lesson. Read the questions carefully and then answer
them.
A farmer wants to adjust the mix of components in a fertilizer for the current crop. He bought plant
food mix A and plant food mix B.
Each cubic yard of food mix A contains 20 pounds of phosphoric acid, 30 pounds of nitrogen, and 5
pounds of potash.
Each cubic yard of food mix B contains 10 pounds of phosphoric acid, 30 pounds of nitrogen, and 10
pounds of potash.
He requires a minimum of 460 pounds of phosphoric acid, 960 pounds of nitrogen, and 220 pounds of
potash.
If food mix A costs $30 per cubic yard and food mix B costs $35 per cubic yard, how many cubic yards of
each food mix should the farmer blend to meet the minimum chemical requirements at a minimal cost?
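The constraints and the objective equation for the problem are derived and shown on the screen.
Minimize F = 30y + 35x
Here, going by the cost coefficients, y is the number of cubic yards of food mix A and x is the number
of cubic yards of food mix B.
We recommend that you first solve the problem yourself and then view the solution to assess your learning.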
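The constraints shown on the slide are not reproduced in this transcript. As a sketch, they can be derived from the problem statement. This assumes y is the number of cubic yards of food mix A and x the number of cubic yards of food mix B, an assumption based on the cost coefficients 30 and 35 in the objective:

```latex
\begin{aligned}
\text{minimize}\quad   & F = 30y + 35x \\
\text{subject to}\quad & 20y + 10x \ge 460 && \text{(phosphoric acid, pounds)} \\
                       & 30y + 30x \ge 960 && \text{(nitrogen, pounds)} \\
                       & 5y + 10x \ge 220  && \text{(potash, pounds)} \\
                       & x \ge 0,\; y \ge 0
\end{aligned}
```

Each row takes the per-cubic-yard content of the two mixes and requires the blend to reach the stated minimum for that chemical.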
The solver used in this example is Dual Simplex. Look at the solution status. The status is “optimal,”
which shows that the solver found an optimal solution.
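The transcript does not show the solution code for the assignment. The following is a minimal sketch of how it could be written in PROC OPTMODEL, following the same steps as Example 2. The variable names y (food mix A) and x (food mix B), the constraint labels, and the objective name “cost” are assumptions chosen for illustration:

```sas
proc optmodel;
   /* assumed naming: y = cubic yards of mix A, x = cubic yards of mix B */
   var x >= 0, y >= 0;
   /* minimum chemical requirements, in pounds */
   con phosphoric: 20*y + 10*x >= 460;
   con nitrogen:   30*y + 30*x >= 960;
   con potash:      5*y + 10*x >= 220;
   /* total cost in dollars */
   minimize cost = 30*y + 35*x;
   /* the problem is linear, so the LP solver is named explicitly */
   solve with lp;
   print cost y x;
quit;
```

Checking the corner points of the feasible region by hand gives 20 cubic yards of food mix A and 12 cubic yards of food mix B at a cost of $1,020, so the printed values should agree with that.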
Key Takeaways
Let’s now quickly recap the concepts you have learned in the lesson:
Optimization is a mathematical technique to find the maximum or minimum value of
a function subject to constraints.
Optimization techniques can help cut operational costs and maximize the profit of the
company.
The various types of optimization programming are linear programming, mixed integer linear
programming, quadratic programming, and nonlinear programming.
The objective functions and constraints can be linear or nonlinear.
PROC OPTMODEL is also used to model linear, mixed integer linear, and quadratic
optimization problems.
A solver is a method or procedure used to solve an optimization problem.
Conclusion
This concludes the course “Statistical Analysis System.” Enjoy learning with Simplilearn.