
Introduction to Data Science

Data
Data comes from the Latin word, "datum,"
meaning a "thing given." Although the term
"data" has been used since as early as the
1500s, modern usage started in the 1940s
and 1950s as practical electronic computers
began to input, process, and output data.
Data is a set of values of qualitative or
quantitative variables. It is information in raw or
unorganized form; it may be facts, figures,
characters, symbols, etc.

Data is collected by a huge range of organizations and institutions, including businesses (e.g., sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations).
The inventor of the World Wide Web, Tim Berners-Lee, is often
quoted as having said, "Data is not information, information is not
knowledge, knowledge is not understanding, understanding is not
wisdom."

[DIKW pyramid: Data → Information → Knowledge → Wisdom]
Data
Data are numbers, words or images that have yet to
be organized or analyzed to answer a specific
question.

Information
Produced through processing, manipulating and
organizing data to answer questions, adding to the
knowledge of the receiver.
Knowledge
What is known by a person or persons. Involves
interpreting information received, adding relevance
and context to clarify the insights the information
contains.

Wisdom
Wisdom is the synthesis of knowledge and
experiences into insights that deepen one's
understanding of relationships and the meaning of
life.
Characteristics of Data
• Accuracy
Data should be sufficiently accurate for the intended use and should be captured only once, although it may have multiple uses. Data should be captured at the point of activity.
This characteristic refers to the exactness of the data. It cannot have any erroneous elements and must convey the correct message without being misleading. For example, accuracy in healthcare might be more important than in another industry (which is to say, inaccurate data in healthcare could have more serious consequences) and, therefore, justifiably worth higher levels of investment.
• Validity
Data should measure what is intended to be measured. Data should be stored and used in compliance with relevant requirements, including the correct application of any rules or definitions.
For example, on surveys, items such as gender and nationality are typically limited to a set of options and open answers are not permitted. Any answers other than these would not be considered valid based on the survey's requirement. This is the case for most data and must be carefully considered when determining its quality.
• Reliability
Data should reflect stable and consistent data collection methods. Progress toward performance targets should reflect process changes rather than variations in data collection approaches or methods.
Many systems in today's environments use and/or collect the same source data. Regardless of what source collected the data or where it resides, it cannot contradict a value residing in a different source or collected by a different system. There must be a stable and steady mechanism that collects and stores the data without contradiction or unwarranted variance.
• Timeliness
Data should be captured as quickly as possible after the event or activity and must be available for the intended use within a reasonable time period. Data must be available quickly and frequently enough to support information needs and to influence decisions.
Data collected too soon or too late could misrepresent a situation and drive inaccurate decisions.
• Relevance
Data captured should be relevant to the purposes for which it is to be used. This will require a periodic review of requirements to reflect changing needs.
For example, every organization will have to redetermine on a regular basis what data is relevant in achieving the business objectives. These objectives change from time to time, and with them the relevance of the data that needs to be collected, stored and managed.
• Completeness
Data requirements should be clearly specified based on the information needs of the organisation and data collection processes matched to these requirements.
Incomplete data is as dangerous as inaccurate data. Gaps in data collection lead to a partial view of the overall picture to be displayed. Without a complete picture of how operations are running, uninformed actions will occur. It's important to understand the complete set of requirements that constitute a comprehensive set of data to determine whether or not the requirements are being fulfilled.
Data Collection Techniques
Information you gather can come from a range of sources.
Likewise, there are a variety of techniques to use when gathering
primary data. Listed below are some of the most common data
collection techniques.

Interviews
Questionnaires and Surveys
Observations
Focus Groups
Case Studies
Documents and Records
Basic types of Data
There are two basic types of data: numerical and categorical data.

Numerical data: data to which a number is assigned as a quantitative value.

Categorical data: data defined by the classes or categories into which an individual member falls. Categorical data represents characteristics. Therefore it can represent things like a person's gender, language, etc.
Types of Numerical data
• Discrete: Reflects a number obtained by counting—no decimal.
• Continuous: Reflects a measurement; the number of decimal places depends on the precision of the measuring device.
• Ratio scale: Order and distance implied. Differences can be compared; has a true zero. Ratios can be compared.
Examples: Height, weight, blood pressure.
A weight of 4 grams is twice a weight of 2 grams, because weight is a ratio variable.
• Interval scale: Order and distance implied. Differences can be compared; no true zero. Ratios cannot be compared.
Example: Temperature in Celsius.
A temperature of 100 degrees C is not twice as hot as 50 degrees C, because temperature in Celsius is not a ratio variable.
Categorical data

Defined by the classes or categories into which an individual member falls.

• Nominal Scale: Name only -- gender, hair color, ethnicity.

• Ordinal Scale: Nominal categories with an implied order -- low, medium, high.
Why Data Types are important?

Data types are an important concept because statistical methods can only be used with certain data types. You have to analyze continuous data differently than categorical data; otherwise the analysis would be wrong. Therefore, knowing the types of data you are dealing with enables you to choose the correct method of analysis.
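To make this concrete, here is a small illustrative Python sketch (not part of the original notes): it builds a hypothetical pandas DataFrame and picks a different summary depending on whether a column is numeric or categorical.

import pandas as pd

# Hypothetical example data: one continuous and one categorical variable
df = pd.DataFrame({
    "height_cm": [170.2, 165.5, 181.0, 158.3, 174.9],   # continuous (ratio scale)
    "gender": ["F", "M", "M", "F", "F"],                  # categorical (nominal scale)
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Continuous data: summarize with mean and standard deviation
        print(col, "mean:", df[col].mean(), "std:", df[col].std())
    else:
        # Categorical data: summarize with frequency counts
        print(col, "counts:")
        print(df[col].value_counts())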
Data Science Overview
Data Science
Why all the excitement?
Where does data come from?
What can you do with the data?
What is Data Science?
How to do Data Science?
Who are Data Scientists?
Why all the excitement?
1. The decline in the price of sensors (like barcode
readers) and other technology over recent decades has
made it cheaper and easier to collect a lot more data.
2. Similarly, the declining cost of storage has made it
practical to keep lots of data hanging around, regardless
of its quality or usefulness.
3. Many people’s attitudes about privacy seem to have
accommodated the use of Facebook and other
platforms where we reveal lots of information about
ourselves.
4. Researchers have made significant advances in the
"machine learning" algorithms that form the basis of
many data mining techniques.
Data Analysis Has Been Around for a While…

[Pioneers of data analysis: R.A. Fisher, W.E. Deming, Peter Luhn, Howard Dresner]
Exciting new effective applications of data analytics

e.g., Google Flu Trends: detecting outbreaks two weeks ahead of Centers for Disease Control data.

New models are estimating which cities are most at risk for spread of the Ebola virus. The prediction model is built on various data sources, types and analysis.
A history of the (Business) Internet: 1997
Sponsored search
• Google revenue is around $50 bn/year from marketing, 97% of the company's revenue.
• Sponsored search uses an auction – a pure competition for marketers trying to win access to consumers.
• In other words, a competition for models of consumers – their likelihood of responding to the ad – and of determining the right bid for the item.
• There are around 30 billion search requests a month. Perhaps a trillion events of history between search providers.
Where does data come from?
“Big Data” Sources
Big Data EveryWhere!
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
– Social Network
“Data is the New Oil”
– World Economic Forum 2011
3 Vs of Big Data

• Raw Data: Volume
• Change over time: Velocity
• Data types: Variety
Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes to 35 zettabytes
• Data volume is increasing exponentially

[Chart: exponential increase in collected/generated data]
Characteristics of Big Data:
2-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missing opportunities
• Examples
– E-Promotions: Based on your current location, your purchase history and what you like → send promotions right now for the store next to you
– Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction
Characteristics of Big Data:
3-Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data

To extract knowledge, all these types of data need to be linked together.
Who's Generating Big Data

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
The Model Has Changed…
• The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data
What can you do with the data?
• Data Science is currently a popular interest of organizations

• High demand for people trained in Data Science
– databases, warehousing, data architectures
– data analytics – statistics, machine learning
• Big Data – gigabytes/day or more
Examples:
– Walmart, cable companies (ads linked to content, viewer
trends), airlines/Orbitz, HMOs, call centers, Twitter (500M
tweets/day), traffic surveillance cameras, detecting fraud,
identity theft...
• supports “Business Intelligence”
– quantitative decision-making and control
– finance, inventory, pricing/marketing, advertising
– need data for identifying risks, opportunities, conducting
“what-if” analyses
Big Data Application
Understanding and Targeting Customers
This is one of the biggest and most publicized areas of big data use today. Here, big data is used to better understand customers and their behaviors and preferences.

Understanding and Optimizing Business Processes
Big data is also increasingly used to optimize business processes. Retailers are able to optimize their stock based on predictions generated from social media data, web search trends and weather forecasts.
Improving Healthcare and Public Health
The computing power of big data analytics enables us to decode entire DNA strings in minutes and will allow us to find new cures and better understand and predict disease patterns.

Improving Science and Research
Science and research is currently being transformed by the new possibilities big data brings. Take, for example, CERN, the nuclear physics lab with its Large Hadron Collider, the world's largest and most powerful particle accelerator. Experiments to unlock the secrets of our universe - how it started and works - generate huge amounts of data.

Optimizing Machine and Device Performance
Big data analytics help machines and devices become smarter and more autonomous. For example, big data tools are used to operate Google's self-driving car. The Toyota Prius is fitted with cameras, GPS as well as powerful computers and sensors to safely drive on the road without the intervention of human beings. We can even use big data tools to optimize the performance of computers and data warehouses.
Financial Trading
High-Frequency Trading (HFT) is an area where big data finds a lot of use today. Here, big data algorithms are used to make trading decisions. Today, the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.
Improving Security and Law Enforcement
Big data is applied heavily in improving security and enabling law enforcement. The National Security Agency (NSA) in the U.S. uses big data analytics to foil terrorist plots (and maybe spy on us). Others use big data techniques to detect and prevent cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data to detect fraudulent transactions.
What is Data Science?
“Data Science” an Emerging Field
Data Science – A Definition

Data Science is the science which uses computer science, statistics and machine learning, visualization and human-computer interactions to collect, clean, integrate, analyze, visualize and interact with data to create data products.

Data science includes data analysis as an important component of the skill set required for many jobs in this area, but it is not the only necessary skill.
• Data scientists play active roles in the design and implementation work of four related areas: data architecture, data acquisition, data analysis, and data archiving.
• Key skills highlighted by the brief case study include communication skills, data analysis skills, and ethical reasoning skills.
How to do Data Science?
Tools
Workflow
Creativity
The Tools of Data Science
Different analytical tools are used by thousands of data analysts worldwide.
The open source R system for data analysis and visualization.
The single most popular and powerful tool, outside of R, is a proprietary statistical system called SAS (pronounced "sass"). SAS contains a powerful programming language that provides access to many data types, functions, and language features.
Next in line in the statistics realm is SPSS, a package used by many scientists (the acronym used to stand for Statistical Package for the Social Sciences). SPSS is much friendlier than SAS.
R, SPSS, and SAS grew up as statistics packages, but there are also many general-purpose programming languages and packages - such as Java, Python, MATLAB, Hadoop, Excel, SQL and ETL tools - that incorporate features valuable to data scientists.
Who are Data Scientists?
Data scientists play the most active roles in the four A's of data: data architecture, data acquisition, data analysis, and data archiving.

Skills
Learning the application domain - The data scientist
must quickly learn how the data will be used in a
particular context.
Communicating with data users - A data scientist
must possess strong skills for learning the needs and
preferences of users. Translating back and forth
between the technical terms of computing and
statistics and the vocabulary of the application domain
is a critical skill.
Seeing the big picture of a complex system - After
developing an understanding of the application domain,
the data scientist must imagine how data will move
around among all of the relevant systems and people.
Knowing how data can be represented - Data
scientists must have a clear understanding about how
data can be stored and linked, as well as about
"metadata”.
Data transformation and analysis - When data
become available for the use of decision makers, data
scientists must know how to transform, summarize, and
make inferences from the data. As noted above, being
able to communicate the results of analyses to users is
also a critical skill here.
Visualization and presentation - Although numbers often have the edge in precision and detail, a good data display (e.g., a bar chart) can often be a more effective means of communicating results to data users.

Attention to quality - No matter how good a set of data may be, there is no such thing as perfect data. Data scientists must know the limitations of the data they work with, know how to quantify its accuracy, and be able to make suggestions for improving the quality of the data in the future.

Ethical reasoning - If data are important enough to collect, they are often important enough to affect people's lives. Data scientists must understand important ethical issues such as privacy, and must be able to communicate the limitations of data to try to prevent misuse of data or analytical results.
The data scientist also needs to have excellent
communication skills, be a great systems
thinker, have a good eye for visual displays,
and be highly capable of thinking critically
about how data will be used to make
decisions and affect people’s lives.
Data Science Process
1. Known Unknown?
2. We’d like to know…?
3. Outcomes?
4. What Data?
5. Hypothesis?

The World (the system): product manufactured, goods shipped, product purchased, phone calls made, energy consumed, fraud committed, repair requested
Ingest Raw Data: transactions, web-scraping, web-clicks & logs, sensor data, mobile data, docs/e-mails/spreadsheets, social feeds
Munch Data: MapReduce, ETL, data wrangling, data cleaning, data reduction, sampling; select, join, bind
The Dataset: interdependency? correlation? covariance? causality? dimensionality? missing values? relevant?
Data Science Process continued
Pipeline: The Dataset → Explore Data → Represent Data → Discover Data → Learn from Data → Deliver Insight (Data Product, Visualize Insight)

Learning from data: description and inference, data and algorithms, models, machine learning, networks and graphs, regression and prediction, classification and clustering, modeling, simulation, optimization, visualization, experiment and iteration
Delivering insight: objectives, levers, actionable and predictive results, immediate impact, business value, easy to explain
Identifying Data Problems

Data Science is different from other areas such as mathematics or statistics. Data Science is an applied activity and data scientists serve the needs and solve the problems of data users. Before you can solve a problem, you need to identify it, and this process is not always as obvious as it might seem.
Follow the Data
In data science, one key to success is to "follow the
data." In most cases, a data scientist will not help to
design an information system from scratch. Instead,
there will be several or many legacy systems where data
resides; a big part of the challenge to the data scientist
lies in integrating those systems.
The critical starting point would be to follow the data. The data scientist needs to be like a detective, finding out in a substantial degree of detail the content, format, senders, receivers, transmission methods, repositories, and users of data at each step in the process and at each organization where the data are processed or housed.
Fortunately there is an extensive area of study and practice called
"data modeling" that provides theories, strategies, and tools to
help with the data scientist’s goal of following the data. These
ideas started in earnest in the 1970s with the introduction by
computer scientist Ed Yourdon of a methodology called Data Flow
Diagrams. A more contemporary approach, that is strongly linked
with the practice of creating relational databases, is called the
entity relationship model. Professionals using this model develop
Entity-Relationship Diagrams (ERDs) that describe the structure
and movement of data in a system.
Goal of Data Science

• Discovering what we don't know from data.
• Obtaining predictive, actionable insight from data.
• Turning data into data products that have business impact.
• Communicating relevant business stories from data.
• Building confidence in decisions that drive business value.
Data Science Applications
• Transaction Databases → Market Analysis and Management, Corporate Analysis & Risk Management, Fraud Detection (Security and Privacy)
• Wireless Sensor Data → Smart Home, Real-time Monitoring, Internet of Things
• Text Data, Social Media Data → Product Review and Consumer Satisfaction (Facebook, Twitter, LinkedIn), E-discovery
• Software Log Data → Automatic Troubleshooting
• Genotype and Phenotype Data → Patient-Centered Care, Personalized Medicine
Contrast: Databases

                Databases                    Data Science
Data Value      "Precious"                   "Cheap"
Data Volume     Modest                       Massive
Examples        Bank records,                Online clicks,
                Personnel records,           GPS logs,
                Census,                      Tweets,
                Medical records              Building sensor readings
Priorities      Consistency,                 Speed,
                Error recovery,              Availability,
                Auditability                 Query richness
Structured      Strongly (Schema)            Weakly or none (Text)
Properties      Transactions, ACID*          CAP* theorem (2/3),
                                             eventual consistency
Realizations    SQL                          NoSQL: MongoDB, CouchDB,
                                             HBase, Cassandra, Riak,
                                             Memcached, Apache River, …

ACID = Atomicity, Consistency, Isolation and Durability
CAP = Consistency, Availability, Partition Tolerance
What If Analysis
What-if analysis asks: what will happen to the solution if an input variable, an assumption, or a parameter value is changed?

What-if analysis allows businesses to analyze their financial picture to forecast and make other financial decisions.

Example: What will happen to the total inventory cost if the cost of carrying inventory increases by 10 percent?
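A minimal Python sketch of the same what-if idea, with a simple cost model and made-up inventory figures (all numbers and the cost formula below are illustrative assumptions, not from the lecture):

# What-if analysis: recompute an outcome after changing one input.

def total_inventory_cost(annual_demand, order_cost, unit_cost, carrying_rate, order_qty):
    """Total cost = ordering cost + carrying cost + purchase cost."""
    ordering = (annual_demand / order_qty) * order_cost
    carrying = (order_qty / 2) * unit_cost * carrying_rate
    purchase = annual_demand * unit_cost
    return ordering + carrying + purchase

base = total_inventory_cost(annual_demand=12000, order_cost=50,
                            unit_cost=4.0, carrying_rate=0.20, order_qty=1000)

# What if the carrying cost rate increases by 10 percent?
scenario = total_inventory_cost(annual_demand=12000, order_cost=50,
                                unit_cost=4.0, carrying_rate=0.20 * 1.10, order_qty=1000)

print(f"Base total cost:    {base:,.2f}")
print(f"What-if total cost: {scenario:,.2f}")
print(f"Change:             {scenario - base:,.2f}")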
Basics of What-If Analysis
• Analysis quickly completed with Excel.
• Performed by changing values in input cells.
• Dependent cell:
– usually contains a formula;
– changes when input data changes.
• Model worksheet contains the what-if analysis.
Three approaches to answering 'What if' questions:

• Scenario analysis
• Sensitivity analysis
• Simulation
Goal Seek
Goal Seek optimizes a goal and provides a solution when one variable changes.

Goal-seeking analysis calculates the values of the inputs necessary to achieve a desired level of an output (goal). It represents a backward solution approach.
Goal Seek
• Only one variable will change.
• Maximizes the result within the other financial constraints.
• Example: maximize the cost of a remodeling project where payments do not exceed $1,200 per month.
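As a rough Python analogue of Goal Seek for a loan example, the sketch below assumes a 6% annual rate and a 30-year term (both illustrative) and searches for the principal whose monthly payment is exactly $1,200; the payment function is the standard annuity formula (essentially what Excel's PMT computes, ignoring sign conventions):

from scipy.optimize import brentq

rate = 0.06 / 12        # assumed 6% annual interest, monthly compounding
n_months = 30 * 12      # assumed 30-year loan
target_payment = 1200.0

def monthly_payment(principal):
    """Standard annuity payment formula."""
    return principal * rate / (1 - (1 + rate) ** -n_months)

# Goal Seek: adjust the principal until payment - target == 0
principal = brentq(lambda p: monthly_payment(p) - target_payment, 1.0, 1_000_000.0)
print(f"Maximum principal for a $1,200 payment: {principal:,.2f}")
print(f"Check: payment = {monthly_payment(principal):,.2f}")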
Use Goal Seek
1. Complete the worksheet.
2. Select the cell containing the formula.
3. From the Tools menu select Goal Seek.

In the Goal Seek dialog box:
• Set cell – contains the formula.
• To value – the value you want the set cell to be (max payment).
• By changing cell – the cell to change to obtain the desired result (principal).

Completed Goal Seek: the result shows the maximum amount of the loan.
Solver
Solver
• What-if solutions often affect more than
one factor.
Example: how to change production quantity
given multiple variables; i.e. multiple product
lines, available resources.
• Solver determines optimum value of data
by changing other data factors.
• Constraints can be used to limit how values
change.
Prepare for Solver
Use Solver
1. From the Tools menu select Solver.
2. Determine the target cell – the limiting conditions for the problem.
3. Determine whether the target cell should be maximized, minimized or set equal to a value (Solver Parameters dialog box).
Use Solver
4. Enter the changing (or adjusting) cells that
contain variables that will change the
results.
Use Solver
5. Determine the constraints that will limit the
changing values.
 Cells that change are called changing or
adjustable cells.
6. Click Add to add a new constraint.
Use Solver
Completed Solver Parameters dialog box.
7. Click Solve.
Solver reports that it has found a solution (Solver Results dialog box); choose to keep the solution or restore the original values. The changing cells now contain the solution.
Scenarios
Scenario analysis is the process of analyzing possible future events by considering alternative possible outcomes.

• Scenarios create several different solutions for a complex business problem.
• Change our inputs/assumptions to better understand possible outcomes of the analysis.
• Helps us to formalize one or more possible answers to questions about the future.
Scenario Manager
• Scenarios are useful when data is uncertain.
• A scenario may contain up to 32 variables.
• Use Scenario Manager to determine best-case and worst-case scenarios.
• Scenario Manager finds solutions to complex business problems.
• Use named cells to help analyze results.
Procedures for Planning Analysis

• Define the problem
• Input values
• Dependent cells
• Results
• Complete the analysis
Define the Problem
• The problem must be clearly defined to begin
the what-if analysis.
Scenario: a couple want to purchase a home and
cannot have payments larger than $1,200 per
month.
Problem definition: what is the maximum
purchase price they can pay for a house?
Input Values
• Determine the data input values:
Amount of the loan (principal)
Interest Rate
Length of loan

Guess
Dependent Cells
• Determine which cells contain formulas –
dependent cells.

Input cells

Dependent cell
Results
• What are the results given the input data?
Monthly payment =$1,945.79
• Do the results match the requirements?
No, payment must be max $1,200
• If not, change the input data to obtain the
needed results.
What happens if interest rate changes?
What happens if purchase price changes?
Complete the Analysis
• Enter variables and determine the result.
• Make changes in data until desired result is
obtained.
• Analysis completion will differ depending on
which what-if tool is used.
Use Scenario Manager
1. Create a worksheet with known information and create the scenarios.
2. From the Tools menu select Scenarios, then click Add (Scenario Manager dialog box).
3. Name the scenario.
4. Identify the changing cells on the original worksheet.
5. Add comments as needed.
6. Click OK.
• Edit the amounts for that specific scenario (Scenario Values dialog box).
• For each scenario enter the variables that apply.
• Continue to add scenarios until each has been set up.
7. Select a scenario.
8. Click Show.
9. Variables in the current worksheet are replaced by those in the scenario.
Scenario Summary Report
• A summary of the results of all scenarios can
be displayed in a separate worksheet.
• Access Scenario Summary dialog box.
Sensitivity Analysis
Involves changing the values of an input to a model or formula incrementally and measuring the related change in outcome(s).

Sensitivity analysis can be performed using a data table in Excel.
Data Tables
A data table automates data analysis and
organizes the results when one or two
variables change.
One-Variable Data Table
• A one-variable data table is used to evaluate
financial information for decision making.
• Only one variable is changing:
In a business, what happens when net sales
change?
How do interest rates affect the monthly payment
on a car loan?
Use a One-Variable Data Table
• Define the Problem:
What effect will interest rates have on loan
payments?
• Create the worksheet.
Use a One-Variable Data Table

• Enter the input values to be tested.
• Enter the formula to determine the amount of the payment.
Use a One-Variable Data Table
1. From the Data menu select Table.
2. Determine if input values have been entered
as columns or as rows.
Use a One-Variable Data Table—Cont.
3. Enter the input cell – the cell in the original worksheet that contains the variable (interest rate).
4. Enter the input cell as an absolute reference.
Completed Data Table
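For comparison, a one-variable data table can be reproduced outside Excel with a short loop; the loan amount, term and the grid of rates below are assumptions used only for illustration:

# One-variable data table: monthly payment for a range of interest rates.
principal = 200_000          # assumed loan amount
n_months = 30 * 12           # assumed 30-year term

def payment(annual_rate):
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -n_months)

print(f"{'Annual rate':>12} {'Payment':>12}")
for pct in [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0]:
    print(f"{pct:>11.1f}% {payment(pct / 100):>12.2f}")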
Two-Variable Data Table
• Makes financial comparisons when two
variables change.
• Uses one formula to evaluate two sets of
variables.
Example: What happens if interest rate and length
of loan change?
Use a Two-Variable Data Table
1. Complete the worksheet.
Use a Two-Variable Data Table

2. Place the formula between the two variables.
3. Create a data table with both variables.
Use a Two-Variable Data Table
4. In Row input cell, enter the cell that contains the original value of the principal.
5. In Column input cell, enter the cell that contains the original value of the interest rate.
Completed Two-Variable Data Table
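A sketch of the two-variable case in Python, varying the interest rate and the length of the loan as in the example above (the principal and the grids of rates and terms are assumed values):

# Two-variable data table: payment as rate (rows) and loan length (columns) vary.
principal = 200_000          # assumed loan amount

def payment(annual_rate, years):
    r = annual_rate / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)

years_options = [15, 20, 25, 30]
print("rate\\years " + "".join(f"{y:>10}" for y in years_options))
for pct in [4.0, 5.0, 6.0, 7.0]:
    row = "".join(f"{payment(pct / 100, y):>10.2f}" for y in years_options)
    print(f"{pct:>9.1f}% {row}")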
MBI Corporation makes special-purpose computers. A decision must be made: how many computers should be produced next month at the Boston plant? Two types of computers are considered: the CC-7, which requires 300 days of labor and Rs 10,000 in materials, and the CC-8, which requires 500 days of labor and Rs 15,000 in materials. The profit contribution of each CC-7 is Rs 8,000, whereas that of each CC-8 is Rs 12,000. The plant has a capacity of 200,000 working days per month and the material budget is Rs 8 million per month. Marketing requires that at least 100 units of the CC-7 and at least 200 units of the CC-8 be produced each month. The problem is to maximize the company's profits by determining how many units of the CC-7 and how many units of the CC-8 should be produced each month.
Decision Variables :
X1=units of CC-7 Produced
X2=units of CC-8 Produced
Objective :
Maximize Z = 8,000 X1 + 12,000 X2
Constraints :
300 X1 + 500 X2 <= 200,000 (Labor Constraint)
10,000 X1 + 15,000 X2 <= 8,000,000 (Budget Constraint)
X1 >=100
X2 >= 200
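The same model can be sketched in Python with SciPy's linear-programming routine in place of Excel Solver; this is a minimal formulation of the problem stated above, not the lecture's worksheet:

from scipy.optimize import linprog

# linprog minimizes, so the profit coefficients are negated.
c = [-8000, -12000]                      # maximize 8000*X1 + 12000*X2
A_ub = [[300, 500],                      # labor: 300*X1 + 500*X2 <= 200,000
        [10000, 15000]]                  # budget: 10,000*X1 + 15,000*X2 <= 8,000,000
b_ub = [200_000, 8_000_000]
bounds = [(100, None),                   # marketing: X1 >= 100
          (200, None)]                   # marketing: X2 >= 200

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
x1, x2 = res.x
print(f"CC-7 units: {x1:.1f}, CC-8 units: {x2:.1f}")
print(f"Maximum profit: Rs {-res.fun:,.0f}")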
Advanced Statistical Applications
Performing statistical analysis in Excel is very convenient.

Excel has various built-in functions that allow you to perform all sorts of statistical calculations.

The Analysis ToolPak in Excel also works as a tool for
* Summarizing data
* Fitting data (simple linear regression, multiple regression)
* Hypothesis testing (t-test)
What is Data Analysis?
Data analysis is the process used to get results from raw data that can be used to make decisions.

Results of data analysis can be used for:
• Detecting trends
• Making predictions
The Descriptive Statistics
Descriptive Statistics answer basic
questions about the central tendency
and dispersion of data observations.

* Numerical Summaries
* Measures of location
* Measures of variability
Click on Data → Select Data Analysis → Descriptive Statistics → Select input range → Check Summary statistics → OK
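A rough Python equivalent of that Descriptive Statistics output, computed with pandas on a made-up input range:

import pandas as pd

scores = pd.Series([23, 45, 37, 49, 31, 28, 40, 44, 36, 37])

summary = {
    "Mean": scores.mean(),
    "Standard Error": scores.sem(),          # std / sqrt(n)
    "Median": scores.median(),
    "Mode": scores.mode().iloc[0],
    "Standard Deviation": scores.std(),      # sample (n-1) denominator
    "Sample Variance": scores.var(),
    "Kurtosis": scores.kurtosis(),
    "Skewness": scores.skew(),
    "Range": scores.max() - scores.min(),
    "Minimum": scores.min(),
    "Maximum": scores.max(),
    "Sum": scores.sum(),
    "Count": scores.count(),
}
for name, value in summary.items():
    print(f"{name:20s} {value:10.4f}")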
Hypothesis Testing
• H0 = null hypothesis
– There is no significant difference
• H1 = alternative hypothesis
– There is a significant difference

Excel will automatically calculate t-values to compare:
• Means of two datasets with equal variances
• Means of two datasets with unequal variances
• Two sets of paired data

If abs(t-score) < abs(t-critical): accept H0.

Data > Data Analysis > t-Test: Two sample . . .

As a rule of thumb, you can use "equal variances" if the ratio of the variances is < 3.
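A sketch of the same two-sample t-test in Python with SciPy, using made-up samples and the variance-ratio rule of thumb from above:

from scipy import stats

sample1 = [55.1, 54.8, 55.4, 54.6, 55.2, 54.9, 55.0, 55.3]
sample2 = [50.2, 52.9, 49.8, 51.5, 50.1]

# Rule of thumb: treat variances as equal only if their ratio is within a factor of 3
var_ratio = stats.tvar(sample1) / stats.tvar(sample2)
equal_var = (1 / 3) < var_ratio < 3

t_stat, p_two_tail = stats.ttest_ind(sample1, sample2, equal_var=equal_var)
print(f"variance ratio: {var_ratio:.2f}, equal_var={equal_var}")
print(f"t Stat: {t_stat:.4f}, P(T<=t) two-tail: {p_two_tail:.4f}")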
t-Test – two sample
t-Test: Two-Sample Assuming Unequal Variances

                               Variable 1   Variable 2
Mean                           54.99931     50.90014
Variance                       1.262476     7.290012
Observations                   22           5
Hypothesized Mean Difference   0
df                             4
t Stat                         3.329922
P(T<=t) one-tail               0.014554
t Critical one-tail            2.131846
P(T<=t) two-tail               0.029108
t Critical two-tail            2.776451

This output gives the probability of drawing two random samples from a normally distributed population and getting the mean of sample #1 this much larger than the mean of sample #2. The mean of sample #1 is larger at a significance level of α = 0.03 (or "at the 3% significance level"), because p < 0.03.

t > t Critical, so the mean of sample #1 is significantly different from the mean of sample #2.
Paired sample t-test
The paired sample t-test compares the means of two variables for a single group. The procedure computes the difference between values of the two variables for each case and tests whether the average differs from zero.

Example: for high blood pressure, all the patients are measured at the beginning of the study, given a medicine/treatment, and measured again after treatment.
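A minimal SciPy sketch of the paired test for the blood-pressure example; the before/after readings are invented for illustration:

from scipy import stats

before = [158, 162, 171, 155, 166, 160, 149, 169]
after  = [150, 157, 163, 151, 160, 158, 147, 161]

t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")
# A small p-value suggests the mean difference (before - after) is not zero.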
ANOVA : Analysis of variance
A One-Way Analysis of Variance is a way to test the equality
of three or more means at one time by using variances.
Assumptions
•The populations from which the samples were obtained
must be normally or approximately normally distributed.
•The samples must be independent.
•The variances of the populations must be equal.
Hypotheses
The null hypothesis will be that all population means are
equal, the alternative hypothesis is that at least one mean
is different.
Summary Table
All of this sounds like a lot to remember, and it is. However, there is a table which makes things really nice.

Source     SS              df      MS               F
Between    SS(B)           k-1     SS(B) / (k-1)    MS(B) / MS(W)
Within     SS(W)           N-k     SS(W) / (N-k)
Total      SS(W) + SS(B)   N-1
F test statistic

Recall that an F variable is the ratio of two independent chi-square variables divided by their respective degrees of freedom. Also recall that the F test statistic is the ratio of two sample variances; well, it turns out that's exactly what we have here. The F test statistic is found by dividing the between-group variance by the within-group variance. The degrees of freedom for the numerator are the degrees of freedom for the between group (k-1) and the degrees of freedom for the denominator are the degrees of freedom for the within group (N-k).
So How big is F?
Since F is
Mean Square Between / Mean Square Within = MSG / MSE

a large value of F indicates relatively more difference between groups than within groups (evidence against H0).
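A short one-way ANOVA sketch in Python with SciPy, on three made-up groups:

from scipy import stats

group_a = [23, 25, 28, 22, 26]
group_b = [30, 31, 29, 33, 32]
group_c = [24, 27, 26, 25, 28]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# A large F (small p) is evidence against H0 that all group means are equal.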
Correlation
Correlation is the extent to which variables in two different data series tend to move together (or apart) over time. A correlation coefficient describes the strength of the correlation between two series. Values lie in the range [-1.0, 1.0].

Click on Data → Select Data Analysis → Correlation → Select input range → OK
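The same correlation can be computed in Python with pandas; the two series below are made-up stand-ins for the salary and ticket-price changes:

import pandas as pd

df = pd.DataFrame({
    "player_salary_change": [5, 8, -2, 12, 7, 15, 9],
    "ticket_price_change":  [3, 6,  0, 10, 4, 11, 8],
})

r = df["player_salary_change"].corr(df["ticket_price_change"])
print(f"Correlation coefficient r = {r:.2f}")
print(df.corr())   # full correlation matrix, similar to the ToolPak output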
[Chart: Correlation between change in player salary and change in ticket price, 1995-2001; correlation coefficient r = .75]
Regression Analysis

Correlation measures whether a linear relationship exists between two series of data. Linear regression attempts to find the relationship between the two series and expresses this relation with a linear equation.
Linear equation in the form:

y = mx + b

Data → Data Analysis → Regression

Select a dependent variable (y) and an independent variable (x).
What does this output tell us?
It describes the relationship in terms of an equation:
Y= -66672 + 10.64X
The value of -66672 is known as the Intercept as it gives
the value of y when x = 0.
The value of 10.64 is known as the Slope or the Gradient
and measures the increase in the value of Y that results
from a one unit increase in the value of X.
How good is the fit?
The R-Square statistic describes how much of the variation in the Y variable was explained by variation in the X variable.
R-Square = 1 is perfect.
R-Square > 0.5 is considered good.
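A simple-linear-regression sketch in Python with SciPy that reports the slope, intercept, R-Square and p-value discussed above (x and y are invented data):

from scipy import stats

x = [10000, 12000, 15000, 18000, 20000, 22000, 25000]      # e.g. advertising spend
y = [40000, 62000, 95000, 125000, 148000, 168000, 200000]  # e.g. sales

result = stats.linregress(x, y)
print(f"Slope:     {result.slope:.2f}")
print(f"Intercept: {result.intercept:.2f}")
print(f"R-Square:  {result.rvalue**2:.3f}")   # coefficient of determination
print(f"p-value:   {result.pvalue:.4g}")      # significance of the linear fit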
Fitting Data
• For linear regression, use Data > Data Analysis > Regression
– Don't use a trendline – not enough analysis
– X Variable → slope
– R2 – Coefficient of Determination
• A value near 1.0 means that the value of y depends strongly on the value of x. It does NOT prove the dependence is linear.
– Use residual plots to show linearity
• If the relationship is linear, then residuals should be random around zero
– Use p-values to show significance of the linear fit
• p = probability that points are arranged like this by chance
• p = probability of getting this apparent correlation by drawing values at random from two unrelated, normally-distributed populations
• A low value of p means that the fit is significant
Multiple Regression
Multiple regression is the appropriate method of analysis
when the research problem involves a single metric dependent
variable presumed to be related to two or more metric
independent variables. The objective of multiple regression
analysis is to predict the changes in the dependent variable in
response to change in the independent variables. This
objective is most often achieved through the statistical rule of
least squares.

For example , organization attempt to predict it’s sales from


information on its expenditure for advertising , the number of
salespeople, and the number of stores carrying it’s product.
The regression analysis in previous example has only
considered situations in which the value of Y depended
upon one independent X variable. However it is easy to
envisage circumstances in which Y depends upon two or
more independent x variables - call them X1 and X2.
Under these circumstance we would want to fit a
regression equation of the form: y = a + b*X1 + c*X2.
this is known as Multiple Regression.
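A sketch of such a multiple regression in Python using statsmodels; SALES, ADVEX and PRICE are hypothetical columns invented to mirror the example, not the lecture's data set:

import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "SALES": [52000, 55000, 61000, 58000, 65000, 70000, 68000, 72000, 75000, 80000, 78000, 83000],
    "ADVEX": [61000, 66000, 75000, 72000, 83000, 92000, 89000, 96000, 101000, 110000, 107000, 115000],
    "PRICE": [480, 470, 450, 460, 440, 420, 430, 410, 400, 380, 390, 370],
})

X = sm.add_constant(data[["ADVEX", "PRICE"]])   # adds the intercept term a
model = sm.OLS(data["SALES"], X).fit()
print(model.summary())                          # coefficients, R Square, p-values, ANOVA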
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.97
R Square            0.94
Adjusted R Square   0.92
Standard Error      1030.04
Observations        12.00

ANOVA
             df      SS              F       Significance F
Regression   2.00    145881218.66    68.75   0.00
Residual     9.00    9548781.34
Total        11.00   155430000.00

             Coefficients   Standard Error   P-value   Lower 95%   Upper 95%
Intercept    19668.96       2728.32          0.00      13497.07    25840.86
ADVEX        0.53           0.05             0.00      0.41        0.65
PRICE        -6.41          0.78             0.00      -8.18       -4.63
Statistical Quality Control
The population is the set of entities under study. It is a collection of people, items, or events about which you want to make inferences.
Sample: a sample is a subset of people, items, or events from a larger population that you collect and analyze to make inferences. To represent the population well, a sample should be randomly collected and adequately large.
Importance of Sampling
• Typically it is impossible to survey/measure every member of the entire population because not all members are observable. Even if it is possible to study the entire population, it is often costly to do so and would take a great deal of time.
• We use a sample to draw inferences about the population under study, given some conditions.
Statistic: a measure of some characteristic of the data in a sample (e.g., the mean height of men in the sample) is called a statistic.
Parameter: a parameter is a statistical constant that describes a feature of a population. A measure of some characteristic of the population (e.g., the mean height of all men) is called a parameter.
Mean (Arithmetic): The mean (or average) is the most popular and well-known measure of central tendency.

Advantages:
• An important property of the mean is that it includes every value in your data set as part of the calculation.
• In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
• It can be used with both discrete and continuous data, although its use is most often with continuous data.

Disadvantage:
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.
Measure:
The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set and they have values x1, x2, ..., xn, the sample mean is usually denoted by x̄ (x bar).

Median: The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data.

Mode: The mode is the most frequent score in our data set. It corresponds to the highest bar in a bar chart or histogram.
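A tiny Python illustration of the three measures, using the standard library's statistics module on a made-up score list with one outlier:

import statistics

scores = [65, 72, 72, 78, 81, 85, 90, 300]    # note the outlier, 300

print("Mean:  ", statistics.mean(scores))     # pulled up by the outlier
print("Median:", statistics.median(scores))   # robust to the outlier
print("Mode:  ", statistics.mode(scores))     # most frequent score (72)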
Type of Variable              Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/Ratio (not skewed)   Mean
Interval/Ratio (skewed)       Median

Standard Deviation: The standard deviation is a measure of the spread of scores within a set of data. We can estimate the population standard deviation from a sample standard deviation. These two standard deviations - sample and population standard deviations - are calculated differently.
The sample standard deviation formula is:
s = √( Σ(X − x̄)² / (n − 1) )
where,
s = sample standard deviation
X = score
x̄ = sample mean
n = number of scores in the sample

The population standard deviation formula is:
σ = √( Σ(X − µ)² / N )
where,
σ = population standard deviation
µ = population mean
N = number of scores in the population
Variance: variance measures the variability (volatility) from an average or mean, and since volatility is a measure of risk, the variance statistic can help determine the risk.
Importance of variance:
Use variance to see how individual numbers relate to each other within a data set. A drawback to variance is that it gives added weight to numbers far from the mean (outliers).

The formula for the variance in a population is:
σ² = Σ(X − µ)² / N

The formula to estimate the variance from a sample is:
s² = Σ(X − x̄)² / (n − 1)
SAMPLING VARIABILITY
The sampling variability of a statistic refers to how much the statistic varies from sample to sample and is usually measured by its standard error; the smaller the standard error, the less the sampling variability. For example, the standard error of the mean is a measure of the sampling variability of the mean. Recall that the formula for the standard error of the mean is
SEM = σ / √n

Importance of Standard Error
The standard error is an estimate of the standard deviation of a statistic. The standard error is important because it is used to compute other measures, like confidence intervals and margins of error.
Problem 1
Which of the following statements is true.
I. The standard error is computed solely from sample attributes. 
II. The standard deviation is computed solely from sample attributes. 
III. The standard error is a measure of central tendency.
(A) I only 
(B) II only 
(C) III only 
(D) All of the above. 
(E) None of the above.
Ans:

Q. A teacher sets an exam for their pupils. The teacher wants to summarize the
results the pupils attained as a mean and standard deviation. Which standard
deviation should be used?

A.
Q. A researcher has recruited males aged 45 to 65 years old for an exercise training
study to investigate risk markers for heart disease (e.g., cholesterol). Which standard
deviation would most likely be used?

Q. One of the questions on a national census survey asks for respondents' age. Which standard deviation would be used to describe the variation in all ages received from the census?

A.
What is a Z-Score?
A z-score is a measure of how many standard deviations below or above the population mean a raw score is. A z-score is also known as a standard score and it can be placed on a normal distribution curve.

A z-score can tell you where a person's weight lies compared to the average population's mean weight.

The Z Score Formula: One Sample

The basic z score formula for a sample is:
z = (x – μ) / σ
Example: let's say we have a test score of 190. The test has a mean (μ) of 150 and a standard deviation (σ) of 25. Assuming a normal distribution, the z score is:
z = (x – μ) / σ = (190 – 150) / 25 = 1.6

The z score tells you how many standard deviations from the mean your score is.

Z Score Formula: Standard Error of the Mean

For multiple samples, we describe the standard deviation of the sample means using the z score formula:
z = (x̄ – μ) / (σ / √n)
This z-score tells you how many standard errors there are between the sample mean and the population mean.
Sample problem: In general, the mean height of women is 65″ with a standard deviation of 3.5″. What is the probability of finding a random sample of 50 women with a mean height of 70″, assuming the heights are normally distributed?
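A sketch of the sample problem in Python (the numbers come straight from the problem statement; scipy.stats.norm supplies the normal probability):

import math
from scipy import stats

mu, sigma, n, sample_mean = 65.0, 3.5, 50, 70.0

sem = sigma / math.sqrt(n)                 # standard error of the mean
z = (sample_mean - mu) / sem               # z = (x-bar - mu) / (sigma / sqrt(n))
p = 1 - stats.norm.cdf(z)                  # P(sample mean >= 70)

print(f"SEM = {sem:.3f}, z = {z:.2f}, probability = {p:.2e}")
# z is about 10 standard errors, so the probability is essentially zero.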
Point Estimation and
Interval Estimation
Population proportion
A part of a population with a particular attribute, expressed as a fraction, decimal or
percentage of the whole population.
Formula: The population proportion is the number of members in the population with
a particular attribute divided by the number of members in the population.

Example: a. If 6 out of 40 students plan to go to graduate school, the proportion of all students who plan to go to
graduate school is estimated as ________. The standard error of this estimate is ________. 
b. If 54 out of 360 students plan to go to graduate school, the proportion of all students who plan to go to graduate
school is estimated as ________. The standard error of this estimate is ________.
Statistical inference
Statistical inference is the process of inference from the sample to a population
with calculated degree of certainty. The two common forms of statistical
inference are:
• Estimation
• Null hypothesis tests of significance (NHTS)

Estimation in Statistics
Estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.

Examples of parameters include:
• p: called "the population proportion"
• μ: called "the population mean"
• σ: called the "population standard deviation"
Point estimates are single points that are used to infer parameters directly. For example,
• Sample proportion p̂ ("p hat") is the point estimator of p
• Sample mean x̄ ("x bar") is the point estimator of μ
• Sample standard deviation s is the point estimator of σ

Point estimates and parameters represent fundamentally different things.
• Point estimates are calculated from the data; parameters are not.
• Point estimates vary from study to study; parameters do not.
• Point estimates are random variables; parameters are constants.
Statistical estimation
A random sample is drawn from the population; every member of the population has the same chance of being selected in the sample. Statistics computed from the random sample are then used to estimate the population parameters.
Statistical estimation
An estimate can be a point estimate or an interval estimate:
• Point estimate: sample mean, sample proportion
• Interval estimate: confidence interval for the mean, confidence interval for the proportion

The point estimate is always within the interval estimate.
Point Estimate vs. Interval Estimate
An estimate of a population parameter may be expressed in two ways:

Point estimate: A point estimate of a population parameter is a single value of a
statistic.

Interval estimate: An interval estimate is defined by two numbers, between which a
population parameter is said to lie.
Confidence Intervals
Statisticians use a confidence interval to express the accuracy and uncertainty
associated with a particular sampling method. A confidence interval consists of three
parts.

A confidence level.

A statistic.

A margin of error.
The confidence level describes the uncertainty of a sampling method. The probability part of a confidence interval is called a confidence level. The statistic and the margin of error define an interval estimate that describes the accuracy of the method. The interval estimate of a confidence interval is defined by
the sample statistic ± margin of error.
Interpreting an interval estimate as a 95% confidence interval means that if we used the same sampling method to select different samples and compute different interval estimates, the true population parameter would fall within a range defined by the sample statistic ± margin of error 95% of the time.
Confidence intervals are preferred to point estimates, because confidence intervals indicate (a) the accuracy of the estimate and (b) the uncertainty of the estimate.

Estimate − Margin of error  ←  Estimate  →  Estimate + Margin of error
Margin of Error
In a confidence interval, the range of values above and below the sample statistic is called the margin of error.
For example, suppose an election survey reports that the independent candidate will receive 30% of the vote. The report states that the survey had a 5% margin of error and a confidence level of 95%. These findings result in the following confidence interval: we are 95% confident that the independent candidate will receive between 25% and 35% of the vote.
Estimating µ with the help of sampling distribution of the mean with known σ
The SDM indicates that:
• x-bar is an unbiased estimate of μ;
• the SDM tends to be normal when the population is normal or when the sample is
large;
• the standard deviation of the SDM is equal to σ/√n . This is called standard error of
the mean (SEM) and reflects the accuracy of x-bar as an estimate of μ:

SEM =σ/ √n
Suppose a measurement that has σ = 10.
o A sample of n = 1 for this variable derives SEM =
o A sample of n = 4 derives SEM =
o A sample of n = 16 derives SEM =

Confidence Interval for μ with known σ

Let (1−α)100% represent the confidence level of a confidence interval; α is the "lack of confidence".
A (1−α)100% CI for μ is given by:
x̄ ± z1−α/2 · (σ / √n)

The reason we use z1−α/2 instead of z1−α in this formula is because the random error is split between underestimates (left tail of the SDM) and overestimates (right tail of the SDM). The confidence level 1−α area lies between −z1−α/2 and z1−α/2.
The common levels of confidence and their associated alpha levels and z quantiles:
(1−α)100%   α     z1−α/2
90%         .10   1.64
95%         .05   1.96
99%         .01   2.58
Calculate,
i) the 90% CI for µ for a sample of size n = 10 with SEM = 4.30 and x̄ = 29.0. The z value = 1.64.

μ: ?
Margin of error: ?

ii) the 95% CI for µ and margin of error with the same x̄ and SEM?
iii) the 99% CI for μ and margin of error with the same x̄ and SEM?

Confidence Level   Confidence Interval   Confidence Interval Length
90%
95%
99%

Suppose a population with σ = 15 and unknown mean μ. We take a random sample of 10 observations from this population and observe the following values: {21, 42, 5, 11, 30, 50, 28, 27, 24, 52}. Based on these 10 observations, x̄ = ?, SEM = ?, and a 95% CI for μ = ?
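A short Python sketch for this exercise, using the ten observations and the z quantiles listed above:

import math

data = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
sigma = 15.0                     # population standard deviation (given)
n = len(data)

x_bar = sum(data) / n
sem = sigma / math.sqrt(n)       # SEM = sigma / sqrt(n)

for conf, z in [(0.90, 1.64), (0.95, 1.96), (0.99, 2.58)]:
    margin = z * sem
    print(f"{conf:.0%} CI: {x_bar:.1f} ± {margin:.1f}  ->  ({x_bar - margin:.1f}, {x_bar + margin:.1f})")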
Sample Size Requirements for estimating µ
Let m represent the margin of error and n the required sample size.

From the margin-of-error equation m = z1−α/2 · σ/√n we get
n = (z1−α/2 · σ / m)²

Given a standard deviation σ = 15, we want to estimate µ with 95% confidence.

i) What sample size is required to achieve a margin of error of 5?

ii) What sample size is required to achieve a margin of error of 2.5?
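The same sample-size calculation as a Python sketch, using σ = 15 and z = 1.96 from the exercise:

import math

sigma = 15.0
z = 1.96            # 95% confidence (from the table of z quantiles above)

for m in (5.0, 2.5):
    n = math.ceil((z * sigma / m) ** 2)    # round up to the next whole observation
    print(f"margin of error {m}: required sample size n = {n}")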


Problem 1
Which of the following statements is true.
I. When the margin of error is small, the confidence level is high. 
II. When the margin of error is small, the confidence level is low. 
III. A confidence interval is a type of point estimate. 
IV. A population mean is an example of a point estimate.
(A) I only 
(B) II only 
(C) III only 
(D) IV only. 
(E) None of the above.

Solution
The correct answer is (E).
Estimating p with the Sampling distribution of the proportion
The proportion for a sample is
p̂ = number of successes in the sample / n

In large samples, the sampling distribution of p̂ is approximately normal with a mean of p and standard error of the proportion SEP:
SEP = √(p·q / n)
where n = sample size and q = 1 − p.

Confidence interval for p

This approach should be used only in samples that are large. If n·p·q ≥ 5, then proceed with this method.

An approximate (1−α)100% CI for p is given by
p̂ ± z1−α/2 · √(p̂·q̂ / n)

Here, q̂ = 1 − p̂.
Example 1: A sample of 57 individuals reveals 17 smokers. Use the npq rule to determine the suitability of the method, then estimate the 95% CI for p.

Example 2: Out of 2673 people surveyed, 170 have risk factor X. We want to determine the population prevalence of the risk factor with 95% confidence.
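A small Python sketch covering both examples; the npq check plugs p̂ in for p (an assumption on my part, since the slide does not say which to use):

import math

def proportion_ci(successes, n, z=1.96):
    p_hat = successes / n
    q_hat = 1 - p_hat
    if n * p_hat * q_hat < 5:
        print(f"Warning: n*p*q = {n * p_hat * q_hat:.1f} < 5, normal approximation is doubtful")
    margin = z * math.sqrt(p_hat * q_hat / n)
    return p_hat, p_hat - margin, p_hat + margin

for successes, n in [(17, 57), (170, 2673)]:
    p_hat, low, high = proportion_ci(successes, n)
    print(f"{successes}/{n}: p-hat = {p_hat:.3f}, 95% CI = ({low:.3f}, {high:.3f})")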

Sample size requirement for estimating p

To achieve a margin of error m,
n = (z1−α/2)² · p*·q* / m²
where p* represents an educated guess for the proportion and q* = 1 − p*. When no reasonable guess of p is available, use p* = 0.50.
Example 1: We want to estimate the prevalence of smoking in a population with 95% confidence. How large a sample is needed to achieve a margin of error of 0.05 if we assume the prevalence of smoking is roughly 30%?

Example 2: How large a sample is needed to shrink the margin of error to 0.03?
PIVOT TABLE
AND
OPTIMIZATION USING
SOLVER
PIVOT TABLE
Definition: A pivot table is a user-created summary table of the original spreadsheet. We create the table by defining which fields to view and how the information should be displayed. Based on our field selections, Excel organizes the data so we see a different view of our data. A pivot table is a way to present information in a report format.
Use:
• A pivot table can aggregate your information.
• Showing a new perspective by moving columns to rows or vice versa.
• A comparative study can be made by using this table.
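A pandas sketch of the same idea; the tiny sales table below is a made-up stand-in for Sales.xlsx:

import pandas as pd

sales = pd.DataFrame({
    "Region":      ["East", "East", "West", "West", "North", "North"],
    "SalesPerson": ["Asha", "Ben",  "Asha", "Cara", "Ben",   "Cara"],
    "Product":     ["Pen",  "Book", "Pen",  "Book", "Pen",   "Book"],
    "Amount":      [120,    300,    150,    220,    90,      410],
})

# Rows = sales person, columns = region, values = total sales amount
pivot = pd.pivot_table(sales, index="SalesPerson", columns="Region",
                       values="Amount", aggfunc="sum", fill_value=0, margins=True)
print(pivot)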
Pivot Table Structures
The main areas of the pivot table.

(1) PivotTable Field List – this section in the top right displays the
fields in our spreadsheet. We may check a field or drag it to a
quadrant in the lower portion.
Open Sales.xlsx and perform the following:
1. Show the region-wise selling pattern for all salespersons and their total sales amount.
2. Display the product-wise sales for each region.
3. Compare the monthly selling performance for each salesperson.
4. Draw a pivot chart showing the monthly regional selling status. Change the chart according to product sales.

5. Open student.xlsx. Display the month-wise sum of scores for all subjects and their grand total.
6. Display the highest score for each student.
7. Display the pivot chart for each student's monthly score.
The Data worksheet in the Groceriespt.xlsx file contains more than 900 rows of sales
data. Each row contains the number of units sold and revenue of a product at a
store as well as the month and year of the sale. The product group (fruit, milk, cereal,
or ice cream) is also included. You would like to see a breakdown of sales during
each year of each product group and product at each store. You would also like to
be able to show this breakdown during any subset of months in a given year (for
example, what the sales were from January through June).

Determine the following using groceries worksheet:

• Amount spent per year in each store on each product


• Total spending at each store
• Total spending for each year
Travel database:
From information in a random sample of 925 people, I know the gender, the age, and the amount these people
spent on travel last year. How can I use this data to determine how gender and age influence a person’s travel
expenditures? What can I conclude about the type of person to whom I should mail the brochure?

To understand this data, you need to break it down as follows:


• Average amount spent on travel by gender
• Average amount spent on travel for each age group
• Average amount spent on travel by gender for each age group
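A possible pandas sketch of that breakdown; the column names (Gender, Age, Spend) and the age bins are assumptions, since the travel database itself is not shown:

import pandas as pd

# travel = pd.read_excel("travel.xlsx")   # assumed columns: Gender, Age, Spend
travel = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "Age":    [25,  34,  47,  52,  61,  29,  38,  66],
    "Spend":  [800, 650, 1200, 900, 1500, 500, 950, 1700],
})
travel["AgeGroup"] = pd.cut(travel["Age"], bins=[0, 30, 45, 60, 120],
                            labels=["<=30", "31-45", "46-60", "60+"])

print(travel.groupby("Gender")["Spend"].mean())      # average spend by gender
print(travel.groupby("AgeGroup")["Spend"].mean())    # average spend by age group
print(pd.pivot_table(travel, index="AgeGroup", columns="Gender",
                     values="Spend", aggfunc="mean"))  # by gender within age group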
OPTIMIZATION – using Solver
SOLVER
Definition: Solver is an Excel add-in that can solve problems by enabling a Target cell to achieve some goal. This goal may be to minimise, maximise, or achieve some target value. It solves the problem by adjusting a number of input cells according to a set of criteria or constraints which are defined by the user.

For Solver to present a solution, it needs certain items of information from us:
• Target Cell: The cell where the result will appear. This cell must contain a formula.
• Changing Cells: The cells containing the variable values that Solver will update whilst trying to calculate the result.
• Constraint cells: Cells containing values used in conditions that need to be met.
Before running Solver, ensure that all the data needed is on the sheet. This includes the formula in the target cell, the changing cells and the constraints.
Terms you should know

• Objective – the target, or what we want to do. It may be to maximize, minimize, or hit a specific value.
• Subject to the Constraints:
Constraints are the rules which define the limits of the possible solutions to the problem. A constraint is the element, factor or subsystem that works as a bottleneck. It restricts an entity, project, or system (such as a manufacturing or decision-making process) from achieving its potential with reference to its goal.

• Optimal Solution:
The alternative or approach that best fits the situation, employs resources in the most effective and efficient manner, and yields the highest possible return under the circumstances.
Optimization
Optimization is a mathematical discipline that concerns the finding of minima and maxima of functions, subject to so-called constraints.

When trying to find an optimal solution, a binding constraint is a factor that the solution depends on directly: if you change it, the optimal solution will have to change. A non-binding constraint doesn't affect the optimal solution and can be changed without changing the solution.

• Iteration means the process of repeating a set of instructions a specified number of times, usually with the aim of approaching a desired goal, target or result. Each repetition of the process is also called an "iteration," and the results of one iteration are used as the starting point for the next iteration.
Types of Problem that could be handled by Solver
• Product Mix: Determine how many products of each type to assemble from certain parts to maximize profits while not exceeding available parts inventory.
• Machine Allocation: Allocate production of a product to different machines (with different capacities, startup cost and operating cost) to meet the production target at minimum cost.
• Blending: Determine which raw materials from different sources to blend to produce a substance with certain desired qualities at minimum cost.
• Process Selection: Decide which of several processes (with different speeds, costs, etc.) should be used to make a desired quantity of product in a certain amount of time, at minimum cost.
• Cutting Stock: Determine how to cut larger pieces (of wood, steel, etc.) into smaller pieces of desired sizes, each needed in certain quantities, in order to minimize waste.
Solver –Optimization Tool
A manufacturer produces four products A, B, C and D by using two types of
machines (lathe and milling machines). The time required on the two machines to
manufacture one unit of each of the four products, the profit per unit of the product
and the total time available on the two types of machines per day are given below:
•Find the number of units to be manufactured of each product per day for
maximizing the profit.
•Find profit value, if minimum qty. of Product A and B will be 30 and 10, respectively.
Machine                  Time required per unit (minutes)                 Total time available
                         Product A   Product B   Product C   Product D    per day (minutes)
Lathe machine                7           10           4           9             1200
Milling machine              3           40           1           1              800
Profit per unit (Rs.)       45          100          30          50
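The same product-mix model can be written as a small linear program. The sketch below uses scipy.optimize.linprog in place of Excel Solver, with the data taken from the table above; it is an illustration of the formulation, not part of the original exercise.

```python
# A sketch of the product-mix model with scipy.optimize.linprog.
# Variables: units of products A, B, C, D made per day.
from scipy.optimize import linprog

profit = [45, 100, 30, 50]                 # Rs. per unit of A, B, C, D
c = [-p for p in profit]                   # linprog minimizes, so negate profit
A_ub = [[7, 10, 4, 9],                     # lathe minutes per unit
        [3, 40, 1, 1]]                     # milling minutes per unit
b_ub = [1200, 800]                         # minutes available per day
bounds = [(0, None)] * 4                   # non-negative production

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x, -res.fun)                     # optimal mix and maximum profit

# Second part of the question: require at least 30 units of A and 10 of B.
bounds2 = [(30, None), (10, None), (0, None), (0, None)]
res2 = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds2, method="highs")
print(res2.x, -res2.fun)
```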
Unit Cost of Transportation from sources M, P and T to the Destinations A, B, C and D
are given.

A B C D
M 0.6 0.56 0.22 0.4
P 0.36 0.3 0.28 0.58
T 0.65 0.68 0.55 0.42

The available capacities at M, P, and T are 9000, 12000 and 13000 units, respectively.

The Demand at the destinations are 7500, 8500, 9500 and 8000 units, respectively.

Formulate the Transportation problem and solve it using SOLVER.
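As a sketch of that formulation (again with scipy.optimize.linprog standing in for SOLVER): ship x[i][j] units from source i to destination j at minimum cost, subject to the capacities and demands above. Since total capacity (34,000) exceeds total demand (33,500), the supply constraints are inequalities and the demand constraints are equalities.

```python
# A sketch of the transportation model with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

cost = np.array([[0.60, 0.56, 0.22, 0.40],   # M -> A, B, C, D
                 [0.36, 0.30, 0.28, 0.58],   # P
                 [0.65, 0.68, 0.55, 0.42]])  # T
supply = [9000, 12000, 13000]
demand = [7500, 8500, 9500, 8000]

m, n = cost.shape
c = cost.flatten()                     # 12 decision variables, row-major

# Each source ships at most its capacity.
A_ub = np.zeros((m, m * n))
for i in range(m):
    A_ub[i, i * n:(i + 1) * n] = 1

# Each destination receives exactly its demand.
A_eq = np.zeros((n, m * n))
for j in range(n):
    A_eq[j, j::n] = 1

res = linprog(c, A_ub=A_ub, b_ub=supply, A_eq=A_eq, b_eq=demand,
              bounds=[(0, None)] * (m * n), method="highs")
print(res.x.reshape(m, n), res.fun)    # shipment plan and minimum total cost
```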


A Bank processes checks seven days a week. Different number of workers needed
each day to process checks. For example, 13 workers are needed on Tuesday, 15
workers are needed on Wednesday, and so on. All bank employees work five
consecutive days. Find the minimum number of employees that the Bank can have
and still meet its labour requirements based on the following data.
Day worker starts   Number starting   Mon   Tues.   Wed.   Thurs.   Fri.   Sat.   Sun.
Monday                     0           1      1      1       1       1      0      0
Tuesday                    0           0      1      1       1       1      1      0
Wednesday                  0           0      0      1       1       1      1      1
Thursday                   0           1      0      0       1       1      1      1
Friday                     0           1      1      0       0       1      1      1
Saturday                   0           1      1      1       0       0      1      1
Sunday                     0           1      1      1       1       0      0      1

Number working                         0      0      0       0       0      0      0
                                      >=     >=     >=      >=      >=     >=     >=
Number needed                         17     13     15      17       9      9     12
If, when you click Solve, you see the message "Solver could not find a feasible solution," this does not mean that you made a mistake in your model but, rather, that with the limited resources available the stated requirements cannot all be met.
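The staffing problem can also be written as a small integer program. The sketch below uses scipy.optimize.linprog; the integrality argument needs SciPy 1.9 or later (without it you get the LP relaxation, which can then be rounded up). This is an illustrative formulation, not part of the original exercise.

```python
# A sketch of the bank-scheduling model. Variable x[d] is the number of employees
# who start their 5 consecutive working days on day d (Mon..Sun).
import numpy as np
from scipy.optimize import linprog

# cover[d] = the seven-day 0/1 pattern worked by someone starting on day d,
# matching the rows of the table above.
cover = np.array([[1, 1, 1, 1, 1, 0, 0],   # starts Monday
                  [0, 1, 1, 1, 1, 1, 0],   # Tuesday
                  [0, 0, 1, 1, 1, 1, 1],   # Wednesday
                  [1, 0, 0, 1, 1, 1, 1],   # Thursday
                  [1, 1, 0, 0, 1, 1, 1],   # Friday
                  [1, 1, 1, 0, 0, 1, 1],   # Saturday
                  [1, 1, 1, 1, 0, 0, 1]])  # Sunday
needed = [17, 13, 15, 17, 9, 9, 12]

c = np.ones(7)                    # minimize the total number of employees
A_ub = -cover.T                   # number working each day >= number needed
b_ub = -np.array(needed)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 7,
              integrality=np.ones(7), method="highs")   # integrality: SciPy >= 1.9
print(res.x, res.fun)             # employees starting each day, minimum headcount
```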
Multi-criteria Decision Making
and
Analytic Hierarchy Process (AHP)
Definition
MCDM Type
Characteristics
Criteria type
Solution type
Methods
Multiple criteria decision making (MCDM) refers to making
decisions in the presence of multiple non-commensurable and
conflicting criteria, different units of measurement among the
criteria, and the presence of quite different alternatives.

MCDM problems are common in everyday life. Multi criterion


Decision-Making (MCDM) analysis has some unique
characteristics such as, the presence of multiple conflicting
criteria. In personal context, a house or a car one buys may be
characterized in terms of price, size, style, safety, comfort, etc. In
business context, MCDM problems are more complicated and
usually of large scale.
Normally, in problems associated with selection and assessment, the number of alternative solutions is limited. In design problems, however, the potential alternative solutions can be infinite. In that case the problem is referred to as a multiple objective optimisation problem instead of a multiple attribute decision problem. Our focus will be on problems with a finite number of alternatives.
Types of MCDM
There exist two distinctive types of MCDM problems due to the different
problems settings:
• one type having a finite number of alternative solutions and
• the other an infinite number of solutions.
Main Features of MCDM
• Multiple attributes/criteria often form a hierarchy.
• Conflict among criteria.
• Hybrid nature 1) Incommensurable units.
2) Mixture of qualitative and quantitative attributes.
3) Mixture of deterministic and probabilistic attributes.
• Uncertainty 1) Uncertainty in subjective judgments.
2) Uncertainty due to lack of data or incomplete information.
• Large Scale
• Assessment may not be conclusive

MCDM Solutions
All criteria in a MCDM problem can be classified into two categories.
• Criteria that are to be maximised are in the profit criteria category.
• Similarly, criteria that are to be minimised are in the cost criteria category.
An ideal solution to a MCDM problem would maximise all profit criteria and minimise
all cost criteria.
Types of solutions:
• Non-dominated solutions: the preferred solutions.
• Dominated solutions: an alternative is dominated if another alternative is at least as good on every criterion and strictly better on at least one.
• Satisficing (satisfying) solutions.

MCDM Methods
There are two types of MCDM methods. One is compensatory and the other is non-
compensatory.
There are three steps in utilizing any decision-making technique involving numerical
analysis of alternatives:
• Determining the relevant criteria and alternatives
• Attaching numerical measures to the relative importance of the criteria and to the impact of the alternatives on these criteria
• Processing the numerical values to determine a ranking of each alternative
Numerous MCDM methods, such as,
• ELECTRE-3 and 4,
• Promethee-2
• Compromise Programming,
• Cooperative Game theory,
• Composite Programming,
• Analytical Hierarchy Process,
• Multi-Attribute Utility Theory,
• Multicriterion Q-Analysis etc.
are employed for different applications.

The WSM Method


The weighted sum model (WSM) is probably the most commonly used approach,
especially in single dimensional problems. If there are m alternatives and n criteria
then,
A*_WSM = max_i Σ_{j=1..n} a_ij · w_j , for i = 1, 2, …, m,
where A*_WSM is the WSM score of the best alternative, n is the number of decision criteria, m is the number of alternatives, a_ij is the actual value of the i-th alternative in terms of the j-th criterion, and w_j is the weight of the j-th criterion.
In single-dimensional cases, where all the criteria share the same unit, the WSM can be used without difficulty. Difficulty with this method emerges when it is applied to multi-dimensional MCDM problems.
The WPM Method
The weighted product model (WPM) is very similar to the WSM. The main
difference is
that instead of addition, there is multiplication. Each alternative is compared with
the others by multiplying a number of ratios, one for each criterion. Each ratio is
raised to the power equivalent to the relative weight of the corresponding criterion.
In order to compare two alternatives AK and AL, the following product has to be
calculated
R(A_K / A_L) = Π_{j=1..n} (a_Kj / a_Lj)^{w_j}
where n is the number of criteria, a_ij is the actual value of the i-th alternative in terms of the j-th criterion, and w_j is the weight of the j-th criterion.
If the ratio R(A_K / A_L) is greater than or equal to one, it indicates that alternative A_K is more desirable than alternative A_L (in the maximization case). The best
Example 4-1:
Suppose that an MCDM problem involves four criteria, which are expressed in exactly the same unit,
and three alternatives. The relative weights of the four criteria were determined to be: W1 = 0.20, W2
= 0.15, W3 = 0.40, and W4 = 0.25. The corresponding aij values are assumed to be as follows:

Criteria and weights:

Alt.     C1      C2      C3      C4
Weight   0.20    0.15    0.40    0.25
A1       25      20      15      30
A2       10      30      20      30
A3       30      10      30      10

The scores of the three alternatives are (Using WSM):


A1(WSM score) = 25×0.20 + 20×0.15 + 15×0.40 + 30×0.25 = 21.50.
Similarly, A2(WSM score) = 22.00,
and A3(WSM score) = 20.00.

Therefore, the best alternative (in the maximization case) is alternative A2 (because it has the
highest WSM score; 22.00). Moreover, the following ranking is derived: A2 > A1 > A3 (where ">" stands
for "better than").
USING WPM (expressing all criteria in terms of the same unit is not needed): when the WPM is applied, the following values are derived:

R(A1/A2) = 1.007 > 1.

Similarly, R(A1/A3) = 1.067 > 1,

and R(A2/A3) = 1.059 > 1.

Therefore, the best alternative is A1, since it is superior to all the other alternatives. Moreover, the
ranking of these alternatives is as follows: A1 > A2 > A3.
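The arithmetic in Example 4-1 is easy to check by hand, and the short numpy sketch below reproduces both the WSM scores and the WPM ratios; it is provided purely as a verification aid.

```python
# A numpy sketch that reproduces the WSM and WPM calculations of Example 4-1.
import numpy as np

a = np.array([[25, 20, 15, 30],    # A1 on C1..C4
              [10, 30, 20, 30],    # A2
              [30, 10, 30, 10]])   # A3
w = np.array([0.20, 0.15, 0.40, 0.25])

wsm = a @ w                        # weighted sums: [21.5, 22.0, 20.0]
print("WSM scores:", wsm)

# WPM: compare alternatives pairwise via the weighted product of ratios.
def wpm_ratio(k, l):
    return np.prod((a[k] / a[l]) ** w)

print("R(A1/A2) =", wpm_ratio(0, 1))   # ~1.007 -> A1 preferred to A2
print("R(A1/A3) =", wpm_ratio(0, 2))   # ~1.067
print("R(A2/A3) =", wpm_ratio(1, 2))   # ~1.059
```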
The AHP method

The Analytic Hierarchy Process (AHP) decomposes a complex MCDM problem into
a system of hierarchies. The final step in the AHP deals with the structure of an m*n
matrix ( Where m is the number of alternatives and n is the number of criteria). The
matrix is constructed by using the relative importance of the alternatives in terms of
each criterion. It deals with complex problems which involve the consideration of
multiple criteria/alternatives simultaneously.

AHP is based on the pairwise comparison method: the process of comparing entities in pairs to judge which of the two is preferred, or has a greater amount of some quantitative property, or whether the two entities are identical. A paired comparison is usually a method of comparing one entity with another of similar status, typically on the grounds of overall performance.

Prof. Thomas L. Saaty (1980) originally developed the Analytic Hierarchy Process (AHP) to enable decision making in situations characterized by multiple attributes and alternatives.
Major steps in applying the AHP techniques are:

1 Develop a hierarchy of factors impacting the final decision. This is known as the AHP decision model. The lowest level of the hierarchy contains the alternatives being compared.

2 Elicit pairwise comparisons between the factors using inputs from users/managers.

3 Evaluate relative importance weights at each level of the hierarchy.

4 Combine the relative importance weights to obtain an overall ranking of the alternatives.

While comparing two criteria, the simple rule recommended by Saaty (1980) is used. Thus, while comparing two attributes X and Y, we assign values in the following manner, based on the relative preference of the decision maker. To fill the lower triangular part of the matrix, we use the reciprocals of the upper triangular entries.
Intensity of importance     Definition
1                           Equal importance
3                           Moderate importance of one over the other
6                           Strong importance
7                           Very strong importance
9                           Absolute importance
2, 4, 5, 8                  Intermediate values
Reciprocals of the above    If activity i has one of the above numbers assigned to it when
                            compared with activity j, then j has the reciprocal value when
                            compared with i.
1.1 – 1.9                   When elements are close and nearly indistinguishable

Table 1: Scale used for pairwise comparison


Estimating Consistency and Sensitivity Analysis
Sensitivity analysis is an extension of AHP. It provides information about the robustness of a decision: the weights used in the pairwise comparisons are varied, and the resulting change in the overall evaluation is examined to see whether the ranking is stable.

Step – 1. Multiply each value in the first column of the pairwise comparison matrix by the corresponding relative priority (weight).
Step – 2. Repeat Step – 1 for the remaining columns.

Step – 3. Add the vectors resulting from steps 1 and 2.

Step – 4. Divide each element of the vector of weighted sums obtained in steps 1–3 by the corresponding priority value.

Step – 5. Compute the average of the values found in step 4. Let λ be this average.
Compute the random index, RI, using the ratio RI = 1.98(n − 2)/n.
Accept the matrix if the consistency ratio, CR, is less than 0.10, where
CR = CI / RI
CI (consistency index) = (λmax − n) / (n − 1), with n the number of items compared.
If CR < 0.10, the degree of consistency is satisfactory: the decision maker's comparisons are probably consistent enough to be useful.

Standard Random Index (RI) by number of alternatives:

No. of alternatives (n)    3      4      5      6      7      8
RI                        0.58   0.9    1.12   1.24   1.32   1.41
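The consistency check (steps 1-5 above) is easy to automate. The sketch below applies it to the cost comparison matrix from the vendor example that follows; the column-normalization priority vector used here is the simple approximation described in that example, not the only way to derive AHP priorities.

```python
# A numpy sketch of the consistency check for a 3x3 pairwise comparison matrix.
import numpy as np

A = np.array([[1,   1/3, 6],
              [3,   1,   7],
              [1/6, 1/7, 1]])            # cost comparison matrix from the example below
n = A.shape[0]

# Priority vector: normalize each column, then average across the rows.
w = (A / A.sum(axis=0)).mean(axis=1)

weighted_sum = A @ w                     # steps 1-3
lam = (weighted_sum / w).mean()          # steps 4-5: estimate of lambda_max
CI = (lam - n) / (n - 1)
RI = 0.58                                # tabulated value for n = 3 (formula 1.98*(n-2)/n gives ~0.66)
CR = CI / RI
print(w, lam, CI, CR)                    # CR < 0.10 -> judgments are acceptably consistent
```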
Example: A company decided to outsource some parts of their product. Three different companies submitted tenders for the required parts. Three factors are important in selecting the best fit: cost, reliability of the product, and delivery time of the orders. The prices offered are as follows:

ABC – 100/- per gross
XYZ – 80/- per gross
PQR – 144/- per gross

(1 gross = 12 dozen = 144 units)

Criteria: Cost, Reliability, Delivery Time
Alternatives (evaluated under each criterion): ABC, XYZ, PQR
Terms of price are compared as follows: XYZ is moderately preferred to ABC and very strongly preferred to PQR, whereas ABC is strongly preferred to PQR.

• Since XYZ is moderately preferred to ABC, ABC's entry in the XYZ row is 3 and XYZ's entry in the ABC row is 1/3.

• Since XYZ is very strongly preferred to PQR, PQR's entry in the XYZ row is 7 and XYZ's entry in the PQR row is 1/7.

• Since ABC is strongly preferred to PQR, PQR's entry in the ABC row is 6 and ABC's entry in the PQR row is 1/6.

The cost comparison matrix looks like :

ABC XYZ PQR


ABC 1 1/3 6
XYZ 3 1 7
PQR 1/6 1/7 1
Priority Vector for COST according to three companies: ABC(0.298), XYZ (0.632), PQR (0.069).
The Reliability comparison matrix looks like :
ABC XYZ PQR
ABC 1 7 2
XYZ 1/7 1 5
PQR 1/2 1/5 1

Priority Vector for Reliability according to three companies: ABC(0.571), XYZ (0.278), PQR (0.151).

The delivery time comparison matrix looks like:


ABC XYZ PQR
ABC 1 8 1
XYZ 1/8 1 1/8
PQR 1 8 1

Priority Vector for Delivery time according to three companies: ABC(0.471), XYZ (0.059), PQR
(0.471).
Comparison Matrix for Criteria:

Cost Reliability Delivery


Cost 1 7 9
Reliability 1/7 1 7
Delivery 1/9 1/7 1

Priority Vector for criteria : Cost(0.729), Reliability (0.216), Delivery time (0.055).

Overall Priority Vector : ABC= (0.729)*(0.298)+(0.216)*(0.571)+(0.055)*(0.471) =0.366


XYZ = (0.729)*(0.632)+(0.216)*(0.278)+(0.055)*(0.059)= 0.524,
PQR =(0.729)*(0.069)+(0.216)*(0.151)+(0.055)*(0.471)= 0.109.

Priority for outsource : XYZ > ABC > PQR.
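The whole synthesis above can be reproduced with a few lines of numpy, shown below as a verification sketch. The priority() helper uses the same column-normalization approximation as the worked example; exact AHP implementations often use the principal eigenvector instead.

```python
# A numpy sketch that reproduces the final AHP synthesis for the vendor example.
import numpy as np

def priority(M):
    """Approximate priority vector: normalize each column, then average across rows."""
    M = np.asarray(M, dtype=float)
    return (M / M.sum(axis=0)).mean(axis=1)

cost     = [[1, 1/3, 6], [3, 1, 7], [1/6, 1/7, 1]]
relia    = [[1, 7, 2], [1/7, 1, 5], [1/2, 1/5, 1]]
delivery = [[1, 8, 1], [1/8, 1, 1/8], [1, 8, 1]]
criteria = [[1, 7, 9], [1/7, 1, 7], [1/9, 1/7, 1]]

local = np.column_stack([priority(cost), priority(relia), priority(delivery)])
crit_w = priority(criteria)
overall = local @ crit_w                # one overall score per vendor (ABC, XYZ, PQR)
for name, score in zip(["ABC", "XYZ", "PQR"], overall):
    print(name, round(score, 3))        # roughly 0.37, 0.52, 0.11 -> XYZ > ABC > PQR
```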


A control chart is one of the seven statistical tools of quality control. It is used to determine whether a manufacturing process or a business process is in a state of statistical control. If the analysis of the chart shows that the process is currently in control, i.e., stable, then there is no need to take any action or to make corrections or changes to the process.

But if it indicates that the process being monitored is not in control, the chart analysis can be used to determine the main sources of the variation responsible for the degraded performance. Time series data is a typical example of where control charts are used.
A control chart consists of:
1) Points representing a statistic (e.g., the mean or range) of measurements of a quality characteristic in samples taken from the process at different times.
2) The mean of this statistic, calculated using all the available samples.
3) A central line drawn at the value of this mean.
4) The standard error of the statistic, also estimated using all the samples.
5) Upper and lower control limits.
The control chart is a graph used to study how a process changes over time. Data
are plotted in time order. A control chart always has a central line for the average,
an upper line for the upper control limit and a lower line for the lower control limit.
These lines are determined from historical data. By comparing current data to
these lines, you can draw conclusions about whether the process variation is
consistent (in control) or is unpredictable (out of control, affected by special
causes of variation).
Control charts for variable data are used in pairs. The top chart monitors the
average, or the centering of the distribution of data from the process. The bottom
chart monitors the range, or the width of the distribution.
If you think of the data as shots at a target, the average is where the shots cluster, and the range is how tightly they are clustered. Control charts for attribute data are used singly.
When to Use a Control Chart
• When controlling ongoing processes by finding and correcting problems as they occur.
• When predicting the expected range of outcomes from a process.
• When determining whether a process is stable (in statistical control).
• When analyzing patterns of process variation from special causes (non-routine events) or common causes (built into the process).
• When determining whether your quality improvement project should aim to prevent specific problems or to make fundamental changes to the process.

Control Chart Basic Procedure


• Choose the appropriate control chart for your data.
• Determine the appropriate time period for collecting and plotting data.
• Collect data, construct your chart and analyze the data.
• Look for “out-of-control signals” on the control chart. When one is identified, mark it on the chart
and investigate the cause. Document how you investigated, what you learned, the cause and how
it was corrected.
Out-of-control signals
• A single point outside the control limits. In Figure 1, point sixteen is above the UCL (upper control limit).
• Two out of three successive points on the same side of the centerline and farther than 2σ from it. In Figure 1, point 4 sends that signal.
• Four out of five successive points on the same side of the centerline and farther than 1σ from it. In Figure 1, point 11 sends that signal.
• A run of eight in a row on the same side of the centerline. Or 10 out of 11, 12 out of 14, or 16 out of 20. In Figure 1, point 21 is the eighth in a row above the centerline.
• Obvious consistent or persistent patterns that suggest something unusual about your data and your process.
Figure 1 Control Chart: Out-of-Control Signals
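Two of the rules above (a point beyond a limit, and a run of eight on one side of the centre line) are easy to check programmatically. The sketch below applies them to made-up subgroup means; the illustrative limits match the worked X-bar chart example later in this section.

```python
# A short sketch of two of the out-of-control checks described above.
import numpy as np

means = np.array([10.2, 11.0, 10.8, 11.4, 9.9, 10.5, 12.1, 12.2, 12.0,
                  12.3, 12.4, 12.2, 12.5, 12.1, 12.6, 14.9])   # made-up data
CL, UCL, LCL = 11.6, 14.63, 8.57                               # illustrative limits

beyond_limits = np.where((means > UCL) | (means < LCL))[0]
print("Points beyond the control limits:", beyond_limits)

side = np.sign(means - CL)              # +1 above the centre line, -1 below
runs = [i for i in range(7, len(side)) if abs(side[i-7:i+1].sum()) == 8]
print("Eighth point of a run on one side:", runs)
```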

When you start a new control chart, the process may be out of control. If so,
the control limits calculated from the first 20 points are conditional limits.
When you have at least 20 sequential points from a period when the process is
operating in control, recalculate control limits.
Types
Depending on the number of process characteristics to be monitored, there are two
basic types of control charts.
• The first, referred to as a univariate control chart, is a graphical display (chart) of
one quality characteristic.
• The second, referred to as a multivariate control chart, is a graphical display of a
statistic that summarizes or represents more than one quality characteristic.

Characteristics of control charts


If a single quality characteristic has been measured or computed from a sample, the
control chart shows the value of the quality characteristic versus the sample number
or versus time. In general, the chart contains a center line that represents the mean
value for the in-control process. Two other horizontal lines, called the upper control
limit (UCL) and the lower control limit (LCL), are also shown on the chart. These
control limits are chosen so that almost all of the data points will fall within these
limits as long as the process remains in-control.
Importance of Control Chart
The control limits might be 0.001 probability limits. The probability of a point falling above the upper limit would be one
out of a thousand, and similarly, a point falling below the lower limit would be one out of a thousand. We would be
searching for an assignable cause if a point would fall outside these limits. Where we put these limits will determine
the risk of undertaking such a search when in reality there is no assignable cause for variation.
Since two out of a thousand is a very small risk, the 0.001 limits may be said to give practical assurance that, if a point falls outside these limits, the variation was caused by an assignable cause. It must be noted that two out of one thousand is a purely arbitrary number.
The decision would depend on the amount of risk the management of the quality control program is willing to take. In
general it is customary to use limits that approximate the 0.002 standard.

For normal distribution, the 0.001 probability limits will be very close to the 3σ limits.

If distribution is skewed, say in the positive direction, the 3-sigma limit will fall short of the upper 0.001 limit, while the
lower 3-sigma limit will fall below the 0.001 limit. How much this risk will be increased will depend on the degree of
skewness.

If variation follows a Poisson distribution, for example, for which np = 0.8, the risk of exceeding the upper limit by
chance would be raised by the use of 3-sigma limits from 0.001 to 0.009 and the lower limit reduces from 0.001 to 0.
For a Poisson distribution the mean and variance both equal np. Hence the upper 3-sigma limit is 0.8 + 3 sqrt(0.8) =
3.48 and the lower limit is 0 (here sqrt denotes "square root"). For np = 0.8 the probability of getting more than 3
successes is 0.009.
Different types of control chart for attributes:
1) p – chart: This chart depicts the fraction nonconforming, i.e., the proportion of defective product produced by a manufacturing process. It is also known as the control chart for fraction nonconforming.

2) np – chart: This chart depicts the number of nonconforming units. It is almost the same as the p – chart.

3) c – chart: This chart depicts the number of defects or non-conformities that are
produced in a manufacturing process.

4) u – chart: This chart depicts the non-conformities per unit that are produced by a
manufacturing process.
Dealing with out-of-control findings
If a data point falls outside the control limits, we assume that the process is probably out of control and that an
investigation is warranted to find and eliminate the cause or causes.

Does this mean that when all points fall within the limits, the process is in control? Not necessarily. If the plot looks
non-random, that is, if the points exhibit some form of systematic behavior, there is still something wrong. For
example, if the first 25 of 30 points fall above the center line and the last 5 fall below the center line, we would wish to
know why this is so. Statistical methods to detect sequences or nonrandom patterns can be applied to the
interpretation of control charts.

Quality Control Charts


In all production processes, we need to monitor the extent to which our products meet their specifications.
In the most general terms, there are two enemies of product quality:
1) Deviations from specifications.
2) Excessive variability around the target specifications.
There are two types of control charts:
1) Variables control charts: applicable to data that follow a continuous distribution.
2) Attributes control charts: applicable to data that follow a discrete distribution. Attribute data is data that can be classified into one of several categories.
In quality control, the classifications conforming and nonconforming are commonly used.
Control Charts for Variables:
Variable control charts are constructed to monitor statistical control of continuous data, for both the mean and the variability. The X̄ chart is the most common chart for monitoring the mean; the R chart and the s chart are the common charts for monitoring the variability. Control charts for variables are classified according to the statistic being plotted:
X̄ chart: shows subgroup averages (means).
R chart: displays subgroup ranges.
S chart: displays subgroup standard deviations.

The ‘R’ Chart

It is used to monitor the variability of a process for small sample sizes (< 10), or to simplify the calculations made by process operators. It is so named because the statistic being plotted is the sample range. With this chart, the estimate of the standard deviation of the process is R-bar / d2.
The ‘S’ Chart
It is used to monitor the variability of the process for large sample sizes (≥ 10), or when a computer is available for automatic calculation. It is so named because the statistic being plotted is the sample standard deviation. With this chart, the estimate of the standard deviation of the process is s-bar / c4.

The 'X¯ ' Chart


It is so named because the statistic being plotted is the sample mean. The reason we take a sample is that we cannot always be sure of the distribution of the process. By using the sample mean, the central limit theorem can be invoked to justify an assumption of normality.

For quality characteristics measured on a continuous scale, the analysis examines both the process mean and its variability, with a mean chart aligned above its corresponding S- or R-chart.
The most common type of display actually contains two line charts and two corresponding histograms: an X-bar chart and an R chart.
In both line charts, the horizontal axis represents the different samples; the vertical axis of the X-bar chart represents the means of the characteristic of interest, while the vertical axis of the R chart represents the ranges.

The R chart is therefore a chart of process variability: the larger the variability, the larger the range. Along with the centre line, a typical chart also includes two additional horizontal lines representing the upper and lower control limits, UCL and LCL. The points representing the samples are usually connected by a line. Whenever this line moves outside the upper or lower control limit, or exhibits systematic patterns across consecutive samples, a quality problem may exist.
Formulae for the X-bar chart

X-bar(i) = (X1 + X2 + … + Xn) / n = mean of the i-th sample, where Xj is the j-th data value in the sample and n is the sample size.

X-double-bar = (sum of the g sample means) / g = mean of the means of the g samples (centre line of the X-bar chart), where g is the number of samples.

sigma_x = standard deviation of the sample means; the estimate of the standard deviation of the population is R-bar / d2, where d2 is a parameter that depends on the sample size n.

UCL (X-bar chart) = X-double-bar + A2 * R-bar
LCL (X-bar chart) = X-double-bar − A2 * R-bar
where A2 is a parameter that depends on the sample size; its value can be obtained directly from the standard tables.

Formulae for the R chart

Ri = Xmax(i) − Xmin(i) = range of the i-th sample, where Xmax(i) and Xmin(i) are the maximum and minimum values of the data in the i-th sample.

R-bar = (sum of the g sample ranges) / g = mean range of the g samples (centre line of the R chart).

UCL (R chart) = D4 * R-bar
LCL (R chart) = D3 * R-bar
where D3 and D4 are parameters that depend on the sample size and are read from the standard tables.

Data for the X-bar chart, R chart, p chart and c chart

Sum of the means of 20 samples = 232
Average of the mean values of the 20 samples = 232 / 20 = 11.6 (centre line of the X-bar chart)
Average of the ranges of the 20 samples = 4.15 (centre line of the R chart)

Upper control limit of the X-bar chart = 11.6 + A2 * 4.15 = 11.6 + 0.729 * 4.15 = 14.63  (A2 = 0.729 for samples of size 4)
Lower control limit of the X-bar chart = 11.6 − A2 * 4.15 = 11.6 − 0.729 * 4.15 = 8.57
Upper control limit of the R chart = D4 * R-bar = 2.282 * 4.15 = 9.47 ≈ 9.5  (D4 = 2.282 for samples of size 4)
Lower control limit of the R chart = D3 * R-bar = 0 * 4.15 = 0  (D3 = 0 for samples of size 4)

(Figure: the resulting X-bar chart and R chart.)
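The control-limit arithmetic above can be reproduced in a few lines; the sketch below simply repeats the calculation for subgroups of size 4, using the tabulated constants quoted in the text.

```python
# A sketch reproducing the control-limit arithmetic above for subgroups of size n = 4
# (A2 = 0.729, D3 = 0, D4 = 2.282 from the standard factor tables).
xbar_bar, r_bar = 11.6, 4.15       # grand mean and average range of the 20 samples
A2, D3, D4 = 0.729, 0.0, 2.282

ucl_x = xbar_bar + A2 * r_bar      # 14.63
lcl_x = xbar_bar - A2 * r_bar      # 8.57
ucl_r = D4 * r_bar                 # 9.47
lcl_r = D3 * r_bar                 # 0.0
print(ucl_x, lcl_x, ucl_r, lcl_r)
```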
p-chart formulae

p-bar = (total number of defective units) / (total number of units inspected) = centre line (CL) of the p chart
UCL(p) = p-bar + 3 * sqrt( p-bar * (1 − p-bar) / n )
LCL(p) = p-bar − 3 * sqrt( p-bar * (1 − p-bar) / n )
where n is the sample size. The sample size in a p chart must be large enough for defective units to appear in most samples.
Sometimes the LCL in a p chart becomes negative; in such cases the LCL should be taken as 0.

c-chart formulae

c-bar = (total number of defects) / (number of samples) = centre line (CL) of the c chart
UCL(c) = c-bar + 3 * sqrt(c-bar)
LCL(c) = c-bar − 3 * sqrt(c-bar)
PivotTables

Use a PivotTable report to analyze


and summarize your data.
What is an Excel Pivot Table?
• An interactive worksheet table
◦ Provides a powerful tool for summarizing large amounts of tabular data
• Similar to a cross-tabulation table
◦ A pivot table classifies numeric data in a list based on other fields in the list
• General purpose:
◦ Quickly summarize data from a worksheet or from an external source
◦ Calculate totals, averages, counts, etc. based on any numeric fields in your table
◦ Generate charts from your pivot tables
Why?
• A PivotTable is a way to present information in a report format.

• PivotTable reports can help to analyze numerical data and answer questions about it.

• E.g.:
◦ Who sold the most, and where.
◦ Which quarters were the most profitable, and which product sold best.
Data Cube
Where to place data fields
• Page Fields: display data as pages and allows
you to filter to a single item
• Row Fields: display data vertically, in rows
• Column Fields: display data horizontally,
across columns
• Data Items: numerical data to be summarized
Pivot Table Advantages
• Interactive: easily rearrange them by moving, adding, or deleting fields
• Dynamic: results are automatically recalculated whenever fields are added or dropped, or whenever categories are hidden or displayed
• Easy to update: “refreshable” if the original worksheet data changes
Sample Data
Creating a PivotTable
• Click in the Excel table or select the range of data for the PivotTable
• In the Tables group on the Insert tab, click the PivotTable button
• Click the Select a table or range option button and verify the
reference in the Table/Range box
• Click the New Worksheet option button or click the Existing
worksheet option button and specify a cell
• Click the OK button
• Click the check boxes for the fields you want to add to the
PivotTable (or drag fields to the appropriate box in the layout
section)
• If needed, drag fields to different boxes in the layout section
Creating a PivotTable
Creating a PivotTable
Adding a Report Filter
to a PivotTable
• A report filter allows you to filter the
PivotTable to display summarized data for one
or more field items or all field items in the
Report Filter area
Filtering PivotTable Fields
• Filtering a field lets you focus on a subset of
items in that field
• You can filter field items in the PivotTable by
clicking the field arrow button in the
PivotTable that represents the data you want
to hide and then uncheck the check box for
each item you want to hide
Refreshing a PivotTable
• You cannot change the data directly in the PivotTable. Instead, you must edit the Excel table and then refresh, or update, the PivotTable to reflect the current state of the source data
• Click the PivotTable Tools Options tab on the
Ribbon, and then, in the Data group, click the
Refresh button
Grouping PivotTable Items
• When a field contains numbers, dates, or times, you can automatically combine items in the rows of a PivotTable into groups
Creating a PivotChart
• A PivotChart is a graphical representation of
the data in a PivotTable
• A PivotChart allows you to interactively add,
remove, filter, and refresh data fields in the
PivotChart similar to working with a
PivotTable
• Click any cell in the PivotTable, then, in the
Tools group on the PivotTable Tools Options
tab, click the PivotChart button
Creating a PivotChart
Data Cleansing and
Preprocessing
Data Preprocessing
Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Why Data Preprocessing?

• Data in the real world is dirty


– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality
data
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how trustworthy are the data?
– Interpretability: how easily the data can be
understood?
Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
Data Preprocessing

• Why preprocess the data?


• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Cleaning
Data cleaning is a technique that is applied to
remove the noisy data and correct the
inconsistencies in data. Data cleaning involves
transformations to correct the wrong data. Data
cleaning is performed as a data preprocessing step
while preparing the data for a data warehouse.
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation = “ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (assuming the
tasks in classification—not effective when the percentage of missing values
per attribute varies considerably)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new
class?!
• Use the attribute mean to fill in the missing value
• Use the most probable value to fill in the missing value: inference-based
such as Bayesian formula or decision tree
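A minimal pandas sketch of a few of the options listed above is shown below; the column names and values are illustrative only, and the inference-based option (Bayesian or decision-tree imputation) is omitted for brevity.

```python
# A minimal pandas sketch of common ways to handle missing values.
import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 48000],
                   "occupation": ["clerk", None, "engineer", "clerk", None]})

dropped   = df.dropna()                                      # ignore (drop) incomplete tuples
constant  = df.fillna({"occupation": "unknown"})             # global constant for a category
mean_fill = df.fillna({"income": df["income"].mean()})       # attribute mean for a numeric column
print(dropped, constant, mean_fill, sep="\n\n")
```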
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which requires data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
• Regression
– smooth by fitting the data into regression functions
Simple Discretization Methods: Binning

• Equal-width (distance) partitioning:


– It divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width
of intervals will be: W = (B-A)/N.
– The most straightforward
– But outliers may dominate presentation
– Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately
same number of samples
– Good data scaling
– Managing categorical attributes can be tricky.
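The difference between the two partitioning schemes, and smoothing by bin means, can be seen in the short pandas sketch below; the price values are illustrative sample data only.

```python
# A pandas sketch contrasting equal-width and equal-depth binning,
# plus smoothing by (equal-depth) bin means, on illustrative data.
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

equal_width = pd.cut(prices, bins=3)          # 3 intervals of equal width
equal_depth = pd.qcut(prices, q=3)            # 3 bins with roughly equal counts

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = prices.groupby(equal_depth).transform("mean")
print(pd.DataFrame({"price": prices, "eq_width": equal_width,
                    "eq_depth": equal_depth, "smoothed": smoothed}))
```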
Data Preprocessing

• Why preprocess the data?


• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Integration
Data Integration is a data preprocessing technique that
merges the data from multiple heterogeneous data sources
into a coherent data store. Data integration may involve
inconsistent data and therefore needs data cleaning.
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from
different sources are different
– possible reasons: different representations, different
scales, e.g., metric vs. British units
Handling Redundancy in Data Integration

• Redundant data occur often when integration of multiple


databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality

Correlation Analysis (Nominal Data)

• Χ2 (chi-square) test:
χ2 = Σ (Observed − Expected)^2 / Expected
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson’s product moment coefficient):

r(A,B) = [ Σ_{i=1..n} (ai − mean(A)) · (bi − mean(B)) ] / [ (n − 1) · σA · σB ]
       = [ Σ_{i=1..n} (ai · bi) − n · mean(A) · mean(B) ] / [ (n − 1) · σA · σB ]

where n is the number of tuples, mean(A) and mean(B) are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai · bi) is the sum of the AB cross-products.
• If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
• r(A,B) = 0: uncorrelated (no linear relationship); r(A,B) < 0: negatively correlated.
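The formula can be checked directly against numpy's built-in correlation function, as in the small sketch below (the two data columns are made up).

```python
# A quick numpy check of the correlation coefficient formula above.
import numpy as np

a = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
b = np.array([2.1, 4.3, 6.2, 8.8, 10.1])
n = len(a)

r_manual = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
r_numpy = np.corrcoef(a, b)[0, 1]
print(r_manual, r_numpy)          # both ~0.996: strongly positively correlated
```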
Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.

Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
Data Transformation:
Normalization

• min-max normalization:
v' = (v − minA) / (maxA − minA)
• z-score normalization:
v' = (v − meanA) / stand_devA
• normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
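The three schemes above are straightforward to apply column-wise; the numpy sketch below runs all three on one illustrative column of values.

```python
# A numpy sketch of the three normalization schemes above (illustrative data).
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

min_max = (v - v.min()) / (v.max() - v.min())   # maps the values onto [0, 1]
z_score = (v - v.mean()) / v.std()              # mean 0, standard deviation 1

j = 0                                           # decimal scaling: smallest j with max(|v'|) < 1
while np.abs(v / 10 ** j).max() >= 1:
    j += 1
decimal_scaled = v / 10 ** j

print(min_max, z_score, decimal_scaled, sep="\n")
```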
Data Preprocessing

• Why preprocess the data?


• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Reduction Strategies
• Warehouse may store terabytes of data: Complex data
analysis/mining may take a very long time to run on the
complete data set
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
• Data reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Numerosity reduction
– Discretization and concept hierarchy generation
Data Cube Aggregation
• The lowest level of a data cube
– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation which is enough to solve the
task
Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse
– Density and distance between points, which is critical to clustering,
outlier analysis, becomes less meaningful
– The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
– Avoid the curse of dimensionality
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Dimensionality reduction techniques
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling

Parametric Data Reduction: Regression

• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a linear
function of multidimensional feature vector
Regression Analysis

• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors)
• The parameters are estimated so as to give a "best fit" of the data
• Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships

(Figure: data points with a fitted regression line y = x + 1; the fitted value Y1' corresponds to the observation (X1, Y1).)


Regression Analysis

• Linear regression: Y = w X + b
– Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
– Using the least squares criterion to the known values of Y1, Y2, …, X1, X2,
….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
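As a small illustration of fitting Y = wX + b by least squares and keeping only the two coefficients (the essence of parametric data reduction), the numpy sketch below uses made-up data; it is not tied to any particular dataset in these notes.

```python
# A small numpy least-squares sketch of the linear regression Y = w*X + b.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])          # illustrative data, roughly y = x + 1

X = np.column_stack([x, np.ones_like(x)])        # design matrix for [w, b]
(w, b), *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
print(w, b)                                      # store only these two parameters
```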
Dimensionality Reduction

• Feature selection (i.e., attribute subset selection):


– Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features

– reduces the number of attributes appearing in the discovered patterns, making them easier to understand


Clustering
• Partition data set into clusters, and one can store cluster
representation only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
Sampling
• Sampling: obtaining a small sample to represent the whole data set of N records
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance in
the presence of skew
– Develop adaptive sampling methods, e.g., stratified
sampling.

Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population
• Stratified sampling:
– Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of
the data)
– Used in conjunction with skewed data
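The sampling variants above are easy to illustrate with pandas; in the sketch below the "segment" column is a made-up, deliberately skewed stratification field.

```python
# A pandas sketch of simple random vs. stratified sampling (illustrative data).
import pandas as pd

df = pd.DataFrame({"segment": ["A"] * 90 + ["B"] * 10,
                   "value": range(100)})

simple    = df.sample(n=10, random_state=1)                   # simple random, without replacement
with_repl = df.sample(n=10, replace=True, random_state=1)     # with replacement

# Stratified: draw the same fraction from each partition, so the small
# segment B is still represented.
stratified = df.groupby("segment", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=1))
print(simple["segment"].value_counts(), stratified["segment"].value_counts(), sep="\n\n")
```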
Data Preprocessing

• Why preprocess the data?


• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Discretization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization:
divide the range of a continuous attribute into intervals
– Some classification algorithms only accept categorical
attributes.
– Reduce data size by discretization
– Prepare for further analysis
Discretization and Concept Hierarchy

• Discretization
– reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
• Concept hierarchies
– reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
Discretization for numeric data

• Binning

• Histogram analysis

• Clustering analysis
Data Preprocessing

• Why preprocess the data?


• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Summary

• Data preparation is a big issue for both warehousing and


mining
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• A lot of methods have been developed, but this is still an active area of research
Knowledge Discovery

Here is the list of steps involved in the knowledge discovery process

Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation
Data Cleaning − In this step, the noise and inconsistent data is
removed.
Data Integration − In this step, multiple data sources are
combined.
Data Selection − In this step, data relevant to the analysis task
are retrieved from the database.
Data Transformation − In this step, data is transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in
order to extract data patterns.
Pattern Evaluation − In this step, data patterns are evaluated.
Knowledge Presentation − In this step, knowledge is
represented.
Data Integration
Data Integration is a data preprocessing technique
that merges the data from multiple heterogeneous
data sources into a coherent data store. Data
integration may involve inconsistent data and
therefore needs data cleaning.
Data Cleaning
Data cleaning is a technique that is applied to
remove the noisy data and correct the
inconsistencies in data. Data cleaning involves
transformations to correct the wrong data. Data
cleaning is performed as a data preprocessing step
while preparing the data for a data warehouse.
Data Selection
Data Selection is the process where data relevant to
the analysis task are retrieved from the database.
Sometimes data transformation and consolidation are
performed before the data selection process.
Clusters
Cluster refers to a group of similar kind of objects.
Cluster analysis refers to forming group of objects that
are very similar to each other but are highly different
from the objects in other clusters.
Data Transformation
In this step, data is transformed or consolidated into
forms appropriate for mining, by performing summary
or aggregation operations.
Knowledge Discovery Process
– Data mining is the core of the knowledge discovery process.

(Figure: the knowledge discovery pipeline: Databases → Data Integration → Data Cleaning → Preprocessed Data → Selection → Task-relevant Data → Data Transformation → Data Mining → Pattern Evaluation / Interpretation → Knowledge.)
