Data
Data comes from the Latin word, "datum,"
meaning a "thing given." Although the term
"data" has been used since as early as the
1500s, modern usage started in the 1940s
and 1950s as practical electronic computers
began to input, process, and output data.
Data is a set of values of qualitative or
quantitative variables. It is information in raw or
unorganized form; it may consist of facts, figures,
characters, symbols, etc.
[Figure: the DIKW pyramid, with Data at the base, then Information, Knowledge, and
Wisdom at the top.]
Data
Data are numbers, words or images that have yet to
be organized or analyzed to answer a specific
question.
Information
Produced through processing, manipulating and
organizing data to answer questions, adding to the
knowledge of the receiver.
Knowledge
What is known by a person or persons. Involves
interpreting information received, adding relevance
and context to clarify the insights the information
contains.
Wisdom
Wisdom is the synthesis of knowledge and
experiences into insights that deepen one's
understanding of relationships and the meaning of
life.
Characteristics of Data
Accuracy
Data should be sufficiently accurate for the intended use
and should be captured only once, although it may have
multiple uses. Data should be captured at the point of
activity.
Data Collection Methods
Interviews
Questionnaires and Surveys
Observations
Focus Groups
Case Studies
Documents and Records
Basic types of Data
There are two basic types of data: numerical and
categorical data.
Why are Data Types Important?
Howard Dresner
Exciting new, effective applications of data analytics,
e.g., Google Flu Trends: detecting outbreaks two weeks
ahead of Centers for Disease Control data.
Exponential increase in collected/generated data.
Characteristics of Big Data:
2 - Speed (Velocity)
• Data is being generated fast and needs to be
processed fast
• Online data analytics
• Late decisions mean missed opportunities
• Examples
– E-Promotions: based on your current location, your purchase history, and
what you like, send promotions right now for the store next to you
• Mobile devices
(tracking all objects all the time)
• Progress and innovation are no longer hindered by the ability to collect data
• But, by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
The Model Has Changed…
• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
What can you do with the data?
• Data Science is currently a popular interest of
organizations
Skills
Learning the application domain - The data scientist
must quickly learn how the data will be used in a
particular context.
Communicating with data users - A data scientist
must possess strong skills for learning the needs and
preferences of users. Translating back and forth
between the technical terms of computing and
statistics and the vocabulary of the application domain
is a critical skill.
Seeing the big picture of a complex system - After
developing an understanding of the application domain,
the data scientist must imagine how data will move
around among all of the relevant systems and people.
Knowing how data can be represented - Data
scientists must have a clear understanding about how
data can be stored and linked, as well as about
"metadata”.
Data transformation and analysis - When data
become available for the use of decision makers, data
scientists must know how to transform, summarize, and
make inferences from the data. As noted above, being
able to communicate the results of analyses to users is
also a critical skill here.
Visualization and presentation - Although numbers often
have the edge in precision and detail, a good data display (e.g.,
a bar chart) can often be a more effective means of
communicating results to data users.
[Figure: the data science workflow and its business value. Stages: discover data,
represent data, learn from data, visualize insight, deliver insight (data product).
Objectives range from description and inference to actionable, predictive modeling
with immediate impact. Levers include data and algorithms; techniques include machine
learning, simulation, networks and graphs, regression and prediction, classification
and clustering, optimization, visualization, and experiment and iteration.]
Identifying Data Problems
Scenario analysis
Sensitivity analysis
Simulation
Goal Seek
Goal Seek optimizes a goal and provides a solution
when one variable changes.
[Screenshot: the Solver Results dialog box, with options to keep the solver solution
or restore the original values; the changing cells are listed.]
Scenarios
Scenario analysis is the process of analyzing
possible future events by considering
alternative possible outcomes.
Dependent Cells
• Determine which cells contain formulas –
dependent cells.
Results
• What are the results given the input data?
Monthly payment = $1,945.79
• Do the results match the requirements?
No, the payment must be at most $1,200.
• If not, change the input data to obtain the
needed results.
What happens if interest rate changes?
What happens if purchase price changes?
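A minimal Python sketch of this what-if and Goal Seek logic follows; the loan inputs (price, rate, term) are hypothetical, since the slides give only the resulting payment.

```python
# A Goal Seek-style what-if sketch. All loan inputs are hypothetical.
def monthly_payment(price: float, annual_rate: float, years: int) -> float:
    """Standard amortized-loan payment formula."""
    r = annual_rate / 12              # monthly interest rate
    n = years * 12                    # number of payments
    return price * r * (1 + r) ** n / ((1 + r) ** n - 1)

# What-if: the current inputs give a payment that is too high.
print(monthly_payment(250_000, 0.07, 20))     # roughly $1,938 > $1,200

# Goal Seek: bisect on the purchase price until the payment is $1,200.
lo, hi = 0.0, 250_000.0
for _ in range(60):                           # 60 halvings give ample precision
    mid = (lo + hi) / 2
    if monthly_payment(mid, 0.07, 20) > 1_200:
        hi = mid                              # still too expensive
    else:
        lo = mid
print(round(mid, 2))                          # affordable purchase price
```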
Complete the Analysis
• Enter variables and determine the result.
• Make changes in data until desired result is
obtained.
• Analysis completion will differ depending on
which what-if tool is used.
Use Scenario Manager
1. Create a worksheet with known information.
2. From the Tools menu select Scenarios.
3. In the Scenario Manager, click Add to create scenarios.
Scenario Summary Report
• A summary of the results of all scenarios can
be displayed in a separate worksheet.
• Access Scenario Summary dialog box.
Sensitivity Analysis
Involves changing the values of an input to
a model or formula incrementally and
measuring the related change in
outcome(s).
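A small sketch of this idea in Python, reusing the hypothetical loan model from above: increment one input (the interest rate) and record the related change in the outcome (the payment).

```python
# Sensitivity analysis sketch: vary one input incrementally and
# measure the related change in the outcome. Inputs are hypothetical.
def monthly_payment(price, annual_rate, years):
    r, n = annual_rate / 12, years * 12
    return price * r * (1 + r) ** n / ((1 + r) ** n - 1)

for rate in (0.05, 0.06, 0.07, 0.08, 0.09):       # increment the interest rate
    print(f"{rate:.0%} -> {monthly_payment(150_000, rate, 20):,.2f}")
```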
* Summarizing data
* Fitting data (simple linear regression, multiple regression)
* Hypothesis testing (t-test)
What is Data Analysis?
Data analysis is the process used to derive results from
raw data that can be used to make decisions.
* Numerical Summaries
* Measures of location
* Measures of variability
Click Data → Data Analysis → Descriptive Statistics → select the input
range → check Summary statistics → OK.
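The same summary statistics can be sketched outside Excel, e.g. with pandas (the sample values below are made up):

```python
# Descriptive statistics sketch with pandas; the sample is made up.
import pandas as pd

data = pd.Series([12, 15, 11, 19, 14, 16, 13, 18])
print(data.describe())          # count, mean, std, min, quartiles, max
print("median:", data.median())
print("variance:", data.var())  # sample variance (divides by n - 1)
```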
Hypothesis Testing
• H0 = null hypothesis
– There is no significant difference
• H1 = alternative hypothesis
– There is a significant difference
                              Variable 1    Variable 2
Mean                          54.99931      50.90014
Variance                      1.262476      7.290012
Observations                  22            5
Hypothesized Mean Difference  0
df                            4
t Stat                        3.329922
P(T<=t) one-tail              0.014554
t Critical one-tail           2.131846
P(T<=t) two-tail              0.029108
t Critical two-tail           2.776451

This output gives the probability of drawing two random samples from a normally
distributed population and getting the mean of sample #1 this much larger than the
mean of sample #2. The mean of sample #1 is larger at a significance level of α = 0.03
(or "at the 3% significance level"), because p < 0.03.
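A sketch of this kind of test in Python: Welch's unequal-variance t-test is assumed here, since the table's df (4) is far smaller than n1 + n2 − 2, and the generated samples are hypothetical stand-ins for the original data.

```python
# Welch two-sample t-test sketch with SciPy (hypothetical samples whose
# means and variances roughly mimic the table above).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample1 = rng.normal(55.0, 1.1, size=22)
sample2 = rng.normal(51.0, 2.7, size=5)

t_stat, p_two_tail = stats.ttest_ind(sample1, sample2, equal_var=False)
print(t_stat, p_two_tail)       # reject H0 when p is below the chosen alpha
```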
Source    SS              df      MS                      F
Between   SS(B)           k − 1   MS(B) = SS(B)/(k − 1)   F = MS(B)/MS(W)
Within    SS(W)           N − k   MS(W) = SS(W)/(N − k)
Total     SS(B) + SS(W)   N − 1

F test statistic = MSG / MSE (between-groups mean square over within-groups mean square)
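A minimal one-way ANOVA sketch with SciPy (the three groups are made up); f_oneway computes exactly this F = MS(between)/MS(within) ratio.

```python
# One-way ANOVA sketch with SciPy; the three groups are hypothetical.
from scipy import stats

group_a = [23, 25, 21, 27, 24]
group_b = [30, 28, 33, 29, 31]
group_c = [26, 24, 27, 25, 28]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)          # small p suggests the group means differ
```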
[Chart: change in ticket price and change in player salary, 1995–2001.]
Regression Analysis
y = mx+ b
Data → Data Analysis → Regression
Multiple Regression
Multiple regression is the appropriate method of analysis
when the research problem involves a single metric dependent
variable presumed to be related to two or more metric
independent variables. The objective of multiple regression
analysis is to predict the changes in the dependent variable in
response to changes in the independent variables. This
objective is most often achieved through the statistical rule of
least squares.
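A least-squares multiple-regression sketch with NumPy; the data (one dependent variable, two metric independent variables) are hypothetical.

```python
# Multiple regression by least squares with NumPy; the data are made up.
import numpy as np

X = np.array([[1, 2.0, 3.0],    # leading 1s estimate the intercept b0
              [1, 3.0, 5.0],
              [1, 4.0, 4.0],
              [1, 5.0, 7.0],
              [1, 6.0, 8.0]])
y = np.array([10.0, 14.0, 15.0, 20.0, 23.0])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes squared error
b0, b1, b2 = coef
print(f"y = {b0:.2f} + {b1:.2f} x1 + {b2:.2f} x2")
```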
Regression Statistics
Multiple R         0.97
R Square           0.94
Adjusted R Square  0.92
Observations       12.00
ANOVA
df SS F Significance F
Advantage:
• An important property of the mean is that it includes every value in your data set as part of the calculation.
• In addition, the mean is the only measure of central tendency where the sum of the deviations of each value
from the mean is always zero.
• It can be used with both discrete and continuous data, although its use is most often with continuous data.
Disadvantage:
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.
Measure:
The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So,
if we have n values in a data set with values x1, x2, ..., xn, the sample mean, usually denoted by x̄ (x-bar), is
x̄ = (x1 + x2 + ... + xn) / n.
Median: The median is the middle score for a set of data that has been arranged in order of magnitude. The
median is less affected by outliers and skewed data.
Mode
The mode is the most frequent score in our data set. On a histogram or bar chart, it is represented by the
highest bar.
Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
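A quick sketch of the three measures with Python's statistics module (the scores, including the outlier 45, are hypothetical):

```python
# Mean, median, and mode with Python's statistics module.
import statistics

scores = [4, 7, 7, 8, 10, 12, 45]       # hypothetical set with an outlier (45)
print(statistics.mean(scores))          # pulled upward by the outlier
print(statistics.median(scores))        # robust middle value
print(statistics.mode(scores))          # most frequent score (7)
```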
The population standard deviation formula is:
σ = √( Σ(xᵢ − μ)² / N )
Q. A teacher sets an exam for their pupils. The teacher wants to summarize the
results the pupils attained as a mean and standard deviation. Which standard
deviation should be used?
A. The population standard deviation, since the pupils form the entire group of interest.
Q. A researcher has recruited males aged 45 to 65 years old for an exercise training
study to investigate risk markers for heart disease (e.g., cholesterol). Which standard
deviation would most likely be used?
A. The sample standard deviation, since the recruits are a sample of a larger population.
Q. One of the questions on a national census survey asks for respondents' age.
Which standard deviation would be used to describe the variation in all ages received
from the census?
A. The population standard deviation, since a census covers the entire population.
What is a Z-Score?
A z-score is a measure of how many standard deviations below or above the population mean a raw score is. A
z-score is also known as a standard score and it can be placed on a normal distribution curve.
For example, a z-score can tell you where a person's weight lies relative to the population's mean weight.
The z-score tells you how many standard deviations from the mean your score is: z = (x − μ) / σ.
Example: a. If 6 out of 40 students plan to go to graduate school, the proportion of all students who plan to go to
graduate school is estimated as ________. The standard error of this estimate is ________.
b. If 54 out of 360 students plan to go to graduate school, the proportion of all students who plan to go to graduate
school is estimated as ________. The standard error of this estimate is ________.
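A sketch of how such estimates are computed: the sample proportion is p̂ = successes/n and its standard error is √(p̂(1 − p̂)/n).

```python
# Sample proportion and its standard error.
import math

def proportion_and_se(successes: int, n: int):
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

print(proportion_and_se(6, 40))    # part (a): 0.15, SE ~ 0.056
print(proportion_and_se(54, 360))  # part (b): 0.15, SE ~ 0.019
```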
Statistical inference
Statistical inference is the process of inference from the sample to a population
with calculated degree of certainty. The two common forms of statistical
inference are:
• Estimation
• Null hypothesis tests of significance (NHTS)
Estimation in Statistics
Estimation refers to the process by which one makes inferences about a
population, based on information obtained from a sample.
[Diagram: statistical estimation, in which statistics computed from a random sample
are used to estimate the population parameters.]
SEM = σ / √n
Suppose a measurement has σ = 10.
o A sample of n = 1 for this variable gives SEM = 10 / √1 = 10
o A sample of n = 4 gives SEM = 10 / √4 = 5
o A sample of n = 16 gives SEM = 10 / √16 = 2.5
The reason we use z(1−α/2) instead of z(1−α) in this formula is that the random error is split between
underestimates (left tail of the SDM) and overestimates (right tail of the SDM). The confidence level
1 − α area lies between −z(1−α/2) and z(1−α/2).
The common levels of confidence and their associated alpha levels and z quantiles:
(1−α)100%   α     z(1−α/2)
90%         .10   1.64
95%         .05   1.96
99%         .01   2.58
i) 90% CI for μ and margin of error?
ii) 95% CI for μ and margin of error with the same x̄ and SEM?
iii) 99% CI for μ and margin of error with the same x̄ and SEM?
Suppose a population with σ = 15 and unknown mean μ. We take a random sample of 10 observations from this
population and observe the following values: {21, 42, 5, 11, 30, 50, 28, 27, 24, 52}. Based on these 10
observations, x̄ = ?, SEM = ?, and a 95% CI for μ = ?
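A sketch computing the quantities this exercise asks for, using the known σ = 15 and the 95% z quantile:

```python
# Mean, SEM, and 95% CI for the exercise above (sigma = 15 is known).
import math

values = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
n, sigma = len(values), 15

x_bar = sum(values) / n                    # 29.0
sem = sigma / math.sqrt(n)                 # ~4.74
z = 1.96                                   # 95% confidence quantile
print(x_bar, sem)
print((x_bar - z * sem, x_bar + z * sem))  # roughly (19.7, 38.3)
```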
Sample Size Requirements for estimating µ
m represents the margin of error and n the required sample size; a common formula is n = (z(1−α/2) σ / m)².
Estimating p with the Sampling Distribution of the Proportion
Sample proportion:
p̂ = number of successes in the sample / n
In large samples, the sampling distribution of p̂ is approximately normal, with a mean of p and a standard
error of the proportion SEP = √(p(1 − p)/n).
Example 1: A sample of 57 individuals reveals 17 smokers. Use the npq rule to determine the suitability of
the method, then estimate the 95% CI for p.
Example 2: Out of 2673 people surveyed, 170 have risk factor X. We want to determine the population
prevalence of the risk factor with 95% confidence.
where p* represents an educated guess for the proportion and q* = 1 − p*. When no reasonable guess of p is
available, use p* = 0.50.
Example 1: Calculate the sample size needed to estimate, with 95% confidence, the prevalence of smoking in a
population. How large a sample is needed to achieve a margin of error of 0.05 if we assume the prevalence of
smoking is roughly 30%?
Example 2: How large a sample is needed to shrink the margin of error to 0.03?
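A sketch of the sample-size rule n = z² p* q* / m² applied to both examples:

```python
# Sample size for estimating a proportion: n = z^2 * p* * q* / m^2.
import math

def sample_size(p_star: float, margin: float, z: float = 1.96) -> int:
    return math.ceil(z**2 * p_star * (1 - p_star) / margin**2)

print(sample_size(0.30, 0.05))   # Example 1: ~323
print(sample_size(0.30, 0.03))   # Example 2: ~897
```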
PIVOT TABLE AND OPTIMIZATION USING SOLVER
PIVOT TABLE
Definition: A pivot table is a user-created summary table of the original
spreadsheet. We create the table by defining which fields to view
and how the information should be displayed. Based on our field
selections, Excel organizes the data so we see a different view of our
data. A pivot table is a way to present information in a report format.
Use:
• A pivot table can aggregate your information.
• It can show a new perspective by moving columns to rows or vice
versa.
• A comparative study can be made by using this table.
Pivot Table Structures
The main areas of the pivot table.
(1) PivotTable Field List – this section in the top right displays the
fields in our spreadsheet. We may check a field or drag it to a
quadrant in the lower portion.
Open Sales.xlsx and perform the following:
1. Show the region-wise selling pattern for all salespersons and their total sales
amount.
2. Display the product-wise sales for each region.
3. Compare the monthly selling performance of each salesperson.
4. Draw a pivot chart showing the monthly regional selling status. Change the chart
according to product sales.
5. Open student.xlsx. Display the month-wise sum of scores for all subjects and their
grand total.
6. Display the highest score for each student.
7. Display the pivot chart for students' monthly scores.
The Data worksheet in the Groceriespt.xlsx file contains more than 900 rows of sales
data. Each row contains the number of units sold and revenue of a product at a
store as well as the month and year of the sale. The product group (fruit, milk, cereal,
or ice cream) is also included. You would like to see a breakdown of sales during
each year of each product group and product at each store. You would also like to
be able to show this breakdown during any subset of months in a given year (for
example, what the sales were from January through June).
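A pandas sketch of the same breakdown; the column names (Month, Year, Store, Group, Product, Revenue) are assumptions, since only the description of Groceriespt.xlsx is given.

```python
# Pivot-style breakdown with pandas; column names are assumed.
import pandas as pd

df = pd.read_excel("Groceriespt.xlsx", sheet_name="Data")

subset = df[df["Month"].between(1, 6)]   # e.g., January through June
report = subset.pivot_table(
    index=["Group", "Product"],          # rows: product group, then product
    columns=["Year", "Store"],           # columns: year, then store
    values="Revenue",
    aggfunc="sum",
)
print(report)
```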
• Optimal Solution:
The alternative or approach that best fits the situation, employs
resources in the most effective and efficient manner, and yields
the highest possible return under the circumstances.
Optimization
is a mathematical discipline that concerns the
finding of minima and maxima of functions, subject to
so-called constraints.
Unit shipping costs from sources M, P, T to destinations A, B, C, D:
       A      B      C      D
M      0.60   0.56   0.22   0.40
P      0.36   0.30   0.28   0.58
T      0.65   0.68   0.55   0.42
The available capacities at M, P, and T are 9000, 12000 and 13000 units, respectively.
The demands at the destinations are 7500, 8500, 9500 and 8000 units, respectively.
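A sketch of this transportation problem as a linear program with SciPy; it assumes demand must be met exactly while each source ships at most its capacity.

```python
# Transportation problem sketch with scipy.optimize.linprog,
# using the 3x4 cost matrix above (M, P, T -> A, B, C, D).
import numpy as np
from scipy.optimize import linprog

costs = np.array([[0.60, 0.56, 0.22, 0.40],
                  [0.36, 0.30, 0.28, 0.58],
                  [0.65, 0.68, 0.55, 0.42]])
supply = [9000, 12000, 13000]
demand = [7500, 8500, 9500, 8000]

c = costs.flatten()                       # 12 shipment variables x[i, j]
A_ub, b_ub, A_eq, b_eq = [], [], [], []
for i in range(3):                        # ship at most each source's capacity
    row = np.zeros(12); row[i * 4:(i + 1) * 4] = 1
    A_ub.append(row); b_ub.append(supply[i])
for j in range(4):                        # meet each destination's demand
    col = np.zeros(12); col[j::4] = 1
    A_eq.append(col); b_eq.append(demand[j])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.x.reshape(3, 4))                # optimal shipments
print(res.fun)                            # minimum total cost
```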
[Worksheet excerpt from a scheduling model: the Number working cells (decision
variables, initially 0) must satisfy Number working >= Number needed for each day:
17, 13, 15, 17, 9, 9, 12.]
When you click Solve, you’ll see the message, “Solver could not find a feasible solution.” This message does
not mean that you made a mistake in your model but, rather, that with limited resources, you can’t meet
demand for all products.
Multi-criteria Decision Making
and
the Analytical Hierarchy Process
Definition
MCDM Type
Characteristics
Criteria type
Solution type
Methods
Multiple criteria decision making (MCDM) refers to making
decisions in the presence of multiple non-commensurable and
conflicting criteria, different units of measurement among the
criteria, and the presence of quite different alternatives.
MCDM Solutions
All criteria in a MCDM problem can be classified into two categories.
• Criteria that are to be maximised are in the profit criteria category.
• Similarly, criteria that are to be minimised are in the cost criteria category.
An ideal solution to a MCDM problem would maximise all profit criteria and minimise
all cost criteria.
Types of solutions:
• Non-dominated solutions: preferred solutions
• Dominated solutions: an alternative (solution) is dominated if another alternative is at
least as good on every criterion and strictly better on at least one
• Satisficing solutions
MCDM Methods
There are two types of MCDM methods. One is compensatory and the other is non-
compensatory.
There are three steps in utilizing any decision-making technique involving numerical
analysis of alternatives:
• Determining the relevant criteria and alternatives
• Attaching numerical measures to the relative importance of the criteria and to the impact
of the alternatives on these criteria
• Processing the numerical values to determine a ranking of each alternative
Numerous MCDM methods, such as,
• ELECTRE-3 and 4,
• Promethee-2
• Compromise Programming,
• Cooperative Game theory,
• Composite Programming,
• Analytical Hierarchy Process,
• Multi-Attribute Utility Theory,
• Multicriterion Q-Analysis etc.
are employed for different applications.
Decision matrix (criteria weights in the wt. row):
Alt.   C1     C2     C3     C4
wt.    0.20   0.15   0.40   0.25
A1     25     20     15     30
A2     10     30     20     30
A3     30     10     30     10
Therefore, the best alternative (in the maximization case) is alternative A2 (because it has the
highest WSM score; 22.00). Moreover, the following ranking is derived: A2 > A1 > A3 (where ">" stands
for "better than").
USING WPM (expressing all criteria in terms of the same unit is not needed): when the WPM is
applied, the following values are derived, e.g.
P(A1/A2) = (25/10)^0.20 × (20/30)^0.15 × (15/20)^0.40 × (30/30)^0.25 = 1.007 > 1.
Therefore, the best alternative is A1, since it is superior to all the other alternatives. Moreover, the
ranking of these alternatives is as follows: A1 > A2 > A3.
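A sketch of the WSM and WPM calculations on the decision matrix above (weights read from the wt. row):

```python
# WSM and WPM sketch on the decision matrix above.
import numpy as np

weights = np.array([0.20, 0.15, 0.40, 0.25])
A = np.array([[25, 20, 15, 30],   # A1
              [10, 30, 20, 30],   # A2
              [30, 10, 30, 10]])  # A3

print(A @ weights)                # WSM score of each alternative

def wpm_ratio(x, y, w):
    """WPM compares two alternatives via a product of performance ratios."""
    return np.prod((x / y) ** w)

print(wpm_ratio(A[0], A[1], weights))   # P(A1/A2) ~ 1.007 > 1, so A1 beats A2
```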
The AHP method
The Analytic Hierarchy Process (AHP) decomposes a complex MCDM problem into
a system of hierarchies. The final step in the AHP deals with the structure of an m × n
matrix (where m is the number of alternatives and n is the number of criteria). The
matrix is constructed by using the relative importance of the alternatives in terms of
each criterion. It deals with complex problems which involve the consideration of
multiple criteria/alternatives simultaneously.
Prof. Thomas L. Saaty (1980) originally developed the Analytic Hierarchy Process
(AHP) to enable decision making in situations characterized by multiple attributes
Major steps in applying the AHP techniques are:
1. Develop a hierarchy of factors impacting the final decision. This is known as the AHP
decision model. The last level of the hierarchy contains the alternatives (here, the three
candidates).
2. Elicit pairwise comparisons between the factors using inputs from users/managers.
While comparing two criteria, we use the simple scale recommended by Saaty (1980). Thus,
while comparing two attributes X and Y, we assign values in the following manner, based
on the relative preference of the decision maker. To fill the lower triangular matrix,
we use the reciprocal values of the upper diagonal.
Intensity of Importance    Definition
1                          Equal importance
3                          Moderate importance of one over the other
5                          Strong importance
7                          Very strong importance
9                          Absolute importance
2, 4, 6, 8                 Intermediate values
Reciprocals of the above   If activity i has one of the above numbers assigned to it
                           when compared with activity j, then j has the reciprocal
                           value when compared with i.
1.1 – 1.9                  When elements are close and nearly indistinguishable
Step 1. Multiply each value in the first column of the pairwise comparison matrix by the
corresponding relative priority (weight).
Step 2. Repeat Step 1 for the remaining columns.
Step 3. Sum the values across each row to obtain the vector of weighted sums.
Step 4. Divide each element of the vector of weighted sums obtained in Steps 1–3 by the
corresponding priority value.
Step 5. Compute the average of the values found in Step 4. Let λmax be this average.
Compute the random index, RI, using the ratio:
RI = 1.98 (n − 2) / n
Accept the matrix if the consistency ratio, CR, is less than 0.10, where
CR = CI / RI
CI: Consistency Index = (λmax − n) / (n − 1), n = number of items compared.
If the consistency ratio CR < 0.10, the degree of consistency is satisfactory:
the decision maker's comparisons are probably consistent enough to be useful.
No. of alternatives (n)   3      4      5      6      7      8
RI                        0.58   0.9    1.12   1.24   1.32   1.41
Example: A company decided to outsource some parts of their product. Three
different companies submitted tenders for the required parts. Three factors are
important in selecting the best fit: cost, reliability of the product, and delivery
time of the orders. The prices offered by them are as follows:
1 gross = 12 dozen = 144
• Since XYZ is moderately preferred to ABC, ABC's entry in the XYZ row is 3 and XYZ's entry in the ABC
row is 1/3.
• Since XYZ is very strongly preferred to PQR, PQR's entry in the XYZ row is 7 and XYZ's entry in the
PQR row is 1/7.
• Since ABC is moderately to strongly preferred to PQR, PQR's entry in the ABC row is 6 and ABC's
entry in the PQR row is 1/6.
Priority Vector for Reliability according to three companies: ABC(0.571), XYZ (0.278), PQR (0.151).
Priority Vector for Delivery time according to three companies: ABC(0.471), XYZ (0.059), PQR
(0.471).
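A sketch turning pairwise judgments like the XYZ/ABC/PQR comparisons above into a priority vector and consistency check, using the column-normalization approximation:

```python
# AHP sketch: pairwise matrix for XYZ, ABC, PQR from the bullets above,
# priority vector by column normalization, and the consistency check.
import numpy as np

M = np.array([[1.0, 3.0, 7.0],     # XYZ row
              [1/3, 1.0, 6.0],     # ABC row
              [1/7, 1/6, 1.0]])    # PQR row
n = M.shape[0]

priorities = (M / M.sum(axis=0)).mean(axis=1)   # normalize columns, average rows
weighted_sum = M @ priorities                    # Steps 1-3
lam_max = (weighted_sum / priorities).mean()     # Steps 4-5
CI = (lam_max - n) / (n - 1)                     # consistency index
CR = CI / 0.58                                   # RI = 0.58 for n = 3
print(priorities, CR)                            # accept if CR < 0.10
```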
Comparison Matrix for Criteria:
Priority Vector for criteria : Cost(0.729), Reliability (0.216), Delivery time (0.055).
Control Charts
A control chart indicates whether the process being monitored is in control. If it
indicates that the process is not in control, analysis of the chart can be used to
determine the main source of the variation behind the degraded performance. A typical
setting in which control charts are used is time-series data.
A control chart consists of:
1) Points representing a statistic (e.g., mean, range) of measurements of a quality
characteristic in samples taken from the process at different times.
2) The mean of this statistic, calculated using all the available samples.
3) A central line drawn at the mean value of the statistic.
4) The standard error of the statistic, also evaluated using all the samples.
5) Upper and lower control limits.
The control chart is a graph used to study how a process changes over time. Data
are plotted in time order. A control chart always has a central line for the average,
an upper line for the upper control limit and a lower line for the lower control limit.
These lines are determined from historical data. By comparing current data to
these lines, you can draw conclusions about whether the process variation is
consistent (in control) or is unpredictable (out of control, affected by special
causes of variation).
Control charts for variable data are used in pairs. The top chart monitors the
average, or the centering of the distribution of data from the process. The bottom
chart monitors the range, or the width of the distribution.
The average is where the shots are clustering, and the range is how tightly they
are clustered. Control charts for attribute data are used singly.
When to Use a Control Chart
When controlling ongoing processes by finding and correcting problems
as they occur.
When predicting the expected range of outcomes from a process.
When determining whether a process is stable (in statistical control).
When analyzing patterns of process variation from special causes (non-
routine events) or common causes (built into the process).
When determining whether your quality improvement project should aim
to prevent specific problems or to make fundamental changes to the
process.
When you start a new control chart, the process may be out of control. If so,
the control limits calculated from the first 20 points are conditional limits.
When you have at least 20 sequential points from a period when the process is
operating in control, recalculate control limits.
Types
Depending on the number of process characteristics to be monitored, there are two
basic types of control charts.
• The first, referred to as a univariate control chart, is a graphical display (chart) of
one quality characteristic.
• The second, referred to as a multivariate control chart, is a graphical display of a
statistic that summarizes or represents more than one quality characteristic.
For normal distribution, the 0.001 probability limits will be very close to the 3σ limits.
If distribution is skewed, say in the positive direction, the 3-sigma limit will fall short of the upper 0.001 limit, while the
lower 3-sigma limit will fall below the 0.001 limit. How much this risk will be increased will depend on the degree of
skewness.
If variation follows a Poisson distribution, for example, for which np = 0.8, the risk of exceeding the upper limit by
chance would be raised by the use of 3-sigma limits from 0.001 to 0.009 and the lower limit reduces from 0.001 to 0.
For a Poisson distribution the mean and variance both equal np. Hence the upper 3-sigma limit is 0.8 + 3 sqrt(0.8) =
3.48 and the lower limit is 0 (here sqrt denotes "square root"). For np = 0.8 the probability of getting more than 3
successes is 0.009.
Different types of control charts for attributes:
1) p – chart: This chart depicts the fraction of nonconforming or defective product
that is produced in a manufacturing process. It is sometimes also known as the control
chart for fraction nonconforming.
2) np – chart: This chart depicts the number of nonconforming items. It is almost the
same as the p – chart.
3) c – chart: This chart depicts the number of defects or nonconformities that are
produced in a manufacturing process.
4) u – chart: This chart depicts the nonconformities per unit that are produced by a
manufacturing process.
Dealing with out-of-control findings
If a data point falls outside the control limits, we assume that the process is probably out of control and that an
investigation is warranted to find and eliminate the cause or causes.
Does this mean that when all points fall within the limits, the process is in control? Not necessarily. If the plot looks
non-random, that is, if the points exhibit some form of systematic behavior, there is still something wrong. For
example, if the first 25 of 30 points fall above the center line and the last 5 fall below the center line, we would wish to
know why this is so. Statistical methods to detect sequences or nonrandom patterns can be applied to the
interpretation of control charts.
For quality characteristics measured on a continuous scale, the analysis examines both
the process mean and its variability, with a mean chart aligned above its corresponding
S or R chart.
The most common type of display contains two charts and two corresponding histograms:
one is called an X-bar chart and the other is called an R chart.
In both line charts, the horizontal axis represents the different samples; the vertical
axis of the X-bar chart represents the means of the characteristic of interest, while
the vertical axis of the R chart represents the ranges.
The R chart is therefore a chart of process variability: if the variability is large,
the range will be large as well. Along with the center line, a typical chart also
includes two additional horizontal lines representing the upper and lower control
limits, the UCL and LCL. In general, the points in the chart representing the samples
are connected by a line. Whenever this line moves outside the upper or lower control
limit, or exhibits systematic patterns across consecutive samples, a quality problem
may exist.
Formula for the X-bar chart: control limits = grand mean ± A2 × R-bar
Grand mean of the 20 sample means = 11.6 (center line of the X-bar chart)
Average of the ranges of the 20 samples = R-bar = 4.15 (center line of the R chart)
Upper control limit of the X-bar chart = 11.6 + A2 × 4.15 (A2 = 0.729) = 14.63
Lower control limit of the X-bar chart = 11.6 − A2 × 4.15 (A2 = 0.729) = 8.57
Upper control limit of the R chart = D4 × 4.15 (D4 = 2.282) = 9.47 ≈ 9.5
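A sketch of these limit calculations in Python; n = 4 observations per sample is assumed, since A2 = 0.729 and D4 = 2.282 are the standard constants for n = 4.

```python
# X-bar and R chart limits; n = 4 per sample is assumed (A2 = 0.729,
# D4 = 2.282, D3 = 0 are the standard constants for n = 4).
x_double_bar = 11.6      # grand mean of the 20 sample means
r_bar = 4.15             # average range of the 20 samples
A2, D3, D4 = 0.729, 0.0, 2.282

print(x_double_bar + A2 * r_bar)   # UCL of the X-bar chart: ~14.63
print(x_double_bar - A2 * r_bar)   # LCL of the X-bar chart: ~8.57
print(D4 * r_bar)                  # UCL of the R chart: ~9.47
print(D3 * r_bar)                  # LCL of the R chart: 0
```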
p-chart formulae:
CL = p̄ (total number nonconforming / total number inspected) = centre line of p chart
UCL, LCL = p̄ ± 3√( p̄(1 − p̄)/n )
Sometimes the LCL in a p chart becomes negative; in such cases the LCL should be taken as 0.
c-chart formulae:
CL = c̄ (average number of defects per sample) = centre line of c chart
UCL, LCL = c̄ ± 3√c̄
PivotTables
E.g.:
◦ Who sold the most, and where.
◦ Which quarters were the most profitable, and which
product sold best.
Data Cube
Where to place data fields
• Page Fields: display data as pages and allows
you to filter to a single item
• Row Fields: display data vertically, in rows
• Column Fields: display data horizontally,
across columns
• Data Items: numerical data to be summarized
Pivot Table Advantages
Interactive: easily rearrange them by
moving, adding, or deleting fields
Dynamic: results are automatically
recalculated whenever fields are added or
dropped, or whenever categories are
hidden or displayed
Easy to update: “refreshable” if the
original worksheet data changes
Sample Data
Creating a PivotTable
• Click in the Excel table or select the range of data for the PivotTable
• In the Tables group on the Insert tab, click the PivotTable button
• Click the Select a table or range option button and verify the
reference in the Table/Range box
• Click the New Worksheet option button or click the Existing
worksheet option button and specify a cell
• Click the OK button
• Click the check boxes for the fields you want to add to the
PivotTable (or drag fields to the appropriate box in the layout
section)
• If needed, drag fields to different boxes in the layout section
Adding a Report Filter
to a PivotTable
• A report filter allows you to filter the
PivotTable to display summarized data for one
or more field items or all field items in the
Report Filter area
Filtering PivotTable Fields
• Filtering a field lets you focus on a subset of
items in that field
• You can filter field items in the PivotTable by
clicking the field arrow button in the
PivotTable that represents the data you want
to hide and then uncheck the check box for
each item you want to hide
Refreshing a PivotTable
• You cannot change the data directly in the
PivotTable. Instead, you must edit the Excel
table, and then refresh, or update, the
PivotTable to reflect the current state of the
art objects list
• Click the PivotTable Tools Options tab on the
Ribbon, and then, in the Data group, click the
Refresh button
Grouping PivotTable Items
• When a field contains numbers, dates, or
times, you can combine items in the rows of a
PivotTable and combine them into groups
automatically
Creating a PivotChart
• A PivotChart is a graphical representation of
the data in a PivotTable
• A PivotChart allows you to interactively add,
remove, filter, and refresh data fields in the
PivotChart similar to working with a
PivotTable
• Click any cell in the PivotTable, then, in the
Tools group on the PivotTable Tools Options
tab, click the PivotChart button
Data Cleansing and
Preprocessing
Data Preprocessing
Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Why Data Preprocessing?
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
Correlation Analysis (Nominal Data)
• Χ2 (chi-square) test
χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
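A chi-square independence-test sketch with SciPy (the 2×2 contingency counts are hypothetical):

```python
# Chi-square test of independence with SciPy; the counts are hypothetical.
from scipy.stats import chi2_contingency

observed = [[250, 200],            # cross-tabulation of two nominal variables
            [50, 1000]]
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)                     # large chi2 / small p: variables are related
```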
Correlation Analysis (Numeric Data)
r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σA σB) = (Σᵢ aᵢbᵢ − n Ā B̄) / ((n − 1) σA σB)
where n is the number of tuples, Ā and B̄ are the means of A and B, and σA and σB are
their standard deviations.
[Figure: scatter plots showing correlation values ranging from −1 to 1.]
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
Data Transformation: Normalization
• min-max normalization
v′ = (v − minA) / (maxA − minA)
• z-score normalization
v′ = (v − meanA) / stand_devA
• normalization by decimal scaling
v′ = v / 10^j, where j is the smallest integer such that Max(|v′|) < 1
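A sketch of the three normalizations with NumPy (the values are made up):

```python
# The three normalizations with NumPy; the values are made up.
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

min_max = (v - v.min()) / (v.max() - v.min())     # scaled into [0, 1]
z_score = (v - v.mean()) / v.std(ddof=1)          # mean 0, std 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))   # smallest j with max|v'| < 1
decimal = v / 10**j
print(min_max, z_score, decimal, sep="\n")
```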
Parametric Data Reduction: Regression
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a linear
function of multidimensional feature vector
Regression Analysis
• Linear regression: Y = w X + b
– Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
– Apply the least squares criterion to the known values of Y1, Y2, …, and X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
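A sketch of fitting Y = w X + b by least squares with NumPy (the data are made up):

```python
# Fitting Y = w X + b by least squares with NumPy; the data are made up.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

w, b = np.polyfit(X, Y, deg=1)   # least-squares line through the points
print(w, b)                      # estimated regression coefficients
```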
Dimensionality Reduction
Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population
• Stratified sampling:
– Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of
the data)
– Used in conjunction with skewed data
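A sketch of these sampling schemes with pandas (toy data with skewed strata):

```python
# Sampling schemes with pandas; toy data with two skewed strata.
import pandas as pd

df = pd.DataFrame({"value": range(100),
                   "stratum": ["A"] * 80 + ["B"] * 20})

wor = df.sample(n=10, random_state=1)               # without replacement
wr = df.sample(n=10, replace=True, random_state=1)  # with replacement
strat = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=1))   # ~10% of each stratum
print(strat["stratum"].value_counts())              # proportions preserved
```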
Data Preprocessing
• Discretization
– reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
• Concept hierarchies
– reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
Discretization for numeric data
• Binning
• Histogram analysis
• Clustering analysis
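A binning sketch with pandas, echoing the age example above; the bin edges and labels are assumptions:

```python
# Discretization by binning with pandas; bin edges and labels are assumed.
import pandas as pd

ages = pd.Series([13, 22, 25, 31, 47, 52, 66, 70])
labels = pd.cut(ages, bins=[0, 30, 55, 100],
                labels=["young", "middle-aged", "senior"])
print(labels)   # interval labels replace the actual numeric values
```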
Data Preprocessing
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation
Data Cleaning − In this step, noise and inconsistent data are
removed.
Data Integration − In this step, multiple data sources are
combined.
Data Selection − In this step, data relevant to the analysis task
are retrieved from the database.
Data Transformation − In this step, data is transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in
order to extract data patterns.
Pattern Evaluation − In this step, data patterns are evaluated.
Knowledge Presentation − In this step, knowledge is
represented.
Data Integration
Data Integration is a data preprocessing technique
that merges the data from multiple heterogeneous
data sources into a coherent data store. Data
integration may involve inconsistent data and
therefore needs data cleaning.
Data Cleaning
Data cleaning is a technique that is applied to
remove the noisy data and correct the
inconsistencies in data. Data cleaning involves
transformations to correct the wrong data. Data
cleaning is performed as a data preprocessing step
while preparing the data for a data warehouse.
Data Selection
Data Selection is the process where data relevant to
the analysis task are retrieved from the database.
Sometimes data transformation and consolidation are
performed before the data selection process.
Clusters
Cluster refers to a group of similar kind of objects.
Cluster analysis refers to forming group of objects that
are very similar to each other but are highly different
from the objects in other clusters.
Data Transformation
In this step, data is transformed or consolidated into
forms appropriate for mining, by performing summary
or aggregation operations.
Knowledge Discovery Process
– Data mining: the core of the knowledge discovery process.
[Diagram of the knowledge discovery process: Databases → Data Integration →
Data Cleaning → Selection of task-relevant data → Data Transformation
(preprocessed data) → Data Mining → Interpretation → Knowledge.]