Lecture Set 01
Bumsoo Kim
05/22/2012 The George Washington University School of Business Department of Decision Sciences
Outline
- Importance of analytics
- What is statistics?
- Describing data: Summary Measures
- Introduction to the software: StatTools
Importance of Analytics
- Every day, 15 million GB of new information is generated, eight times what is contained in all U.S. libraries.
- What to analyze: Asking the right questions
- How to analyze: Using the proper tools and correct interpretation
- Increasing availability of user-friendly and very powerful desktop software packages to collect and analyze data
Importance of Analytics
- Increasing importance of data, growth of the field of Analytics
- Competitive advantage to firms that can better use the available data to make better decisions is perhaps the biggest motivation
- Bottom line: no excuse for not basing decisions on good, sound analysis of data!
Importance of Analytics
US Army, FedEx, United Airlines, IBM, Goldman Sachs, Google, Facebook, Gap, MediCare, Capital One, Marriott, WalMart, BestBuy, Geico
Importance of Analytics
Data summary, regression, hypothesis testing, forecasting, data mining, network analysis, simulation, decision trees, optimization
Importance of Analytics
Analytics in a Nutshell
Data Analysis:
Decision Making:
Introduction
Statistics deals with principles and methods of extracting and processing information
Statistics involves:
What is statistics?
DATA -> PROCESS -> INFORMATION
whole population
Example:
Randomly select 500 voters from all of the people eligible to vote; record their responses.
The numerical values of the 500 responses are the data.
The data values are 0s and 1s.
Data are numerical results of measurements on specified variables.
If 300 voters said Yes, the sample proportion is 300/500 = 0.6 (60%).
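The voter example can be sketched in a few lines of Python. This is only an illustration: the population below (its size and its 60/40 split) is made up, and the names `population`, `sample` and `sample_proportion` are hypothetical.

```python
import random

# Hypothetical population of eligible voters: 1 = Yes, 0 = No.
# The size and the 60/40 split are made-up values for illustration.
random.seed(1)
population = [1] * 60_000 + [0] * 40_000

# Randomly select 500 voters (without replacement) and record their responses.
sample = random.sample(population, 500)

# The data are 0s and 1s, so the sample proportion is the share of 1s.
sample_proportion = sum(sample) / len(sample)
print(f"Yes responses: {sum(sample)} of {len(sample)}")
print(f"Sample proportion: {sample_proportion:.3f}")

# The case quoted in the example: 300 Yes responses out of 500.
example_proportion = 300 / 500
print(example_proportion)  # 0.6
```

The sample proportion changes from one random sample to the next, which is why inferential statistics is needed to say anything about the population value.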
Descriptive Statistics - Data summarization
Inferential Statistics - Use of sample data to make inferences about a population parameter
Population: A set of objects that is of interest to investigators
Sample: A subset of the population
Discrete
Continuous
Ordinal
Ex: Ranking in a race, ABC-type grades
Nominal
Ex: Social Security number, zip code, gender
Describing Data
Horizontal axis shows intervals or classes
Vertical axis shows frequency or relative frequency per class
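As a rough sketch of what a histogram tabulates, the following Python counts the frequency and relative frequency per class. The scores, the class boundaries and the helper name `class_frequencies` are made up for illustration; StatTools builds the same kind of table from its own dialog.

```python
def class_frequencies(values, low, width, n_classes):
    """Count the values in each class [low + k*width, low + (k+1)*width).

    The last class also includes its upper endpoint.
    """
    high = low + width * n_classes
    counts = [0] * n_classes
    for v in values:
        if not low <= v <= high:
            raise ValueError(f"{v} is outside [{low}, {high}]")
        k = min(int((v - low) // width), n_classes - 1)
        counts[k] += 1
    return counts

# Made-up exam scores.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 80, 82, 85, 91, 97]

counts = class_frequencies(scores, low=50, width=10, n_classes=5)
relative = [c / len(scores) for c in counts]

# Text version of the histogram: one row per class.
for k, (c, r) in enumerate(zip(counts, relative)):
    left = 50 + 10 * k
    print(f"{left:3d}-{left + 10:<3d} | {'#' * c:<5} {c:2d}  {r:.2f}")
```

Relative frequencies always sum to 1, which makes histograms of data sets with different sizes comparable.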
Shapes of Histograms
1-) Symmetric Histogram: - A histogram is symmetric if it has a single peak and looks approximately the same to the left and right of the peak.
[Figure: symmetric histogram; horizontal axis from .45 to .55]
Shapes of Histograms
2-) Skewed to the right (positively skewed) Histogram:
- A histogram is skewed to the right if it has a single peak and the values of the distribution extend much farther to the right than to the left of the peak.
[Figure: histogram skewed to the right; horizontal axis from 10 to 30]
Shapes of Histograms
3-) Skewed to the left (negatively skewed) Histogram:
- A histogram is skewed to the left if it has a single peak and the values of the distribution extend much farther to the left than to the right of the peak.
[Figure: histogram skewed to the left; horizontal axis from 40 to 100]
Shapes of Histograms
4-) Bimodal Distribution Histogram:
- Some histograms have two or more peaks. This might indicate that the data comes from two different populations.
[Figure: bimodal histogram; horizontal axis from .48 to .62]
Descriptive Statistics
1-) Measures of Central Location: Mean, median and mode
2-) Measures of Variability: Variance and standard deviation
3-) Other measures: Min, max and range
Mean
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$
Median
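$$\text{Median} = \begin{cases} y_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd;} \\ \frac{1}{2}\left( y_{\left(\frac{n}{2}\right)} + y_{\left(\frac{n}{2}+1\right)} \right), & \text{if } n \text{ is even.} \end{cases}$$
Here $y_{(k)}$ denotes the k-th smallest observation.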
Mode
Useful when the variable of interest contains categorical measurements such as gender, address, salary range, etc.
What does the mode represent if the values are continuous (salaries, for instance)?
Effect of extreme values on mean, median and mode
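A small Python sketch of the three measures of central location, and of how one extreme value affects each of them. The salary figures are made up, and `median_by_formula` is a hypothetical helper that follows the odd/even rule for the median using the sorted observations.

```python
from statistics import mean, median, mode

def median_by_formula(values):
    """Median from the sorted data: middle value if n is odd,
    average of the two middle values if n is even."""
    y = sorted(values)      # y[k - 1] is the k-th smallest observation
    n = len(y)
    if n % 2 == 1:
        return y[(n + 1) // 2 - 1]
    return 0.5 * (y[n // 2 - 1] + y[n // 2])

# Made-up salaries in thousands of dollars.
salaries = [40, 42, 45, 45, 48, 50, 52]
with_extreme = salaries + [400]   # add one extreme value

for label, data in [("original", salaries), ("with extreme value", with_extreme)]:
    print(label)
    print("  mean  :", mean(data))
    print("  median:", median_by_formula(data), "(library:", median(data), ")")
    print("  mode  :", mode(data))
```

In this made-up data the mean moves from 46 to 90.25 when the extreme salary is added, while the median only moves from 45 to 46.5 and the mode stays at 45.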
Measures of Variability
Variance: The variance is essentially the average of squared deviations from the mean.
$$\sigma^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n}$$
$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$
Note that when n is large enough, the two values are practically the same.
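To make the two variance formulas concrete, here is a minimal Python sketch that computes both the divide-by-n and the divide-by-(n - 1) versions and compares them for a small and a larger data set. The data values and function names are made up for illustration.

```python
def variance_divide_by_n(values):
    """Average of squared deviations from the mean (divide by n)."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / n

def variance_divide_by_n_minus_1(values):
    """Sum of squared deviations from the mean divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

# Made-up data with n = 8.
small = [2, 4, 4, 4, 5, 5, 7, 9]
# The same values repeated 125 times, so n = 1000.
large = small * 125

for label, data in [("n = 8", small), ("n = 1000", large)]:
    v_n = variance_divide_by_n(data)
    v_n1 = variance_divide_by_n_minus_1(data)
    print(f"{label}: divide by n -> {v_n:.4f}, divide by n - 1 -> {v_n1:.4f}")
```

The two results differ by the factor n / (n - 1), which is why the gap shrinks as n grows.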
Measures of Variability
Standard Deviation: The standard deviation is defined as the square root of the variance.
=> Standard Deviation = $\sqrt{\text{Variance}}$
The standard deviation is measured in the original units, such as dollars, so it is easier to interpret.
Both the variance and the standard deviation can be used as measures of risk.
Measures of Variability
Interpretation of the Standard Deviation: Empirical Rules
For a set of measurements having a mound-shaped symmetric histogram (normally distributed), the interval:
$\bar{y} \pm 2s$ contains approximately 95% of the measurements;
$\bar{y} \pm 3s$ contains approximately all of the measurements.
The approximation may be poor if the data are severely skewed or bimodal, or contain outliers.
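The empirical rules can be checked on simulated data. The sketch below draws made-up, normally distributed values (the mean of 100 and standard deviation of 15 are arbitrary choices) and measures the share of observations within 2 and 3 sample standard deviations of the sample mean.

```python
import random
from statistics import mean, stdev

# Made-up, mound-shaped data: 10,000 draws from a normal distribution.
random.seed(0)
data = [random.gauss(100, 15) for _ in range(10_000)]

y_bar = mean(data)
s = stdev(data)   # sample standard deviation (divide by n - 1)

def share_within(values, center, half_width):
    """Fraction of values inside [center - half_width, center + half_width]."""
    return sum(abs(v - center) <= half_width for v in values) / len(values)

within_2s = share_within(data, y_bar, 2 * s)
within_3s = share_within(data, y_bar, 3 * s)

print(f"within 2s of the mean: {within_2s:.3f}")
print(f"within 3s of the mean: {within_3s:.3f}")
```

Running the same check on skewed or bimodal data is a quick way to see how far the rule can drift.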
Other Measures
Minimum: Smallest value in the dataset
Maximum: Largest value in the dataset
Range (Max − Min): Difference between the largest value in the dataset and the smallest one
p-th percentile (quartile): Value of x that exceeds p% of the measurements and is less than the remaining (100 − p)%
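A short Python sketch of these measures on made-up data. Percentiles have several computing conventions; the `"inclusive"` interpolation method of `statistics.quantiles` is assumed here, and other software (including StatTools) may return slightly different values.

```python
from statistics import quantiles

# Made-up measurements.
data = [12, 15, 17, 21, 22, 25, 28, 30, 34, 41, 45, 58]

smallest = min(data)
largest = max(data)
data_range = largest - smallest
print("min:", smallest, "max:", largest, "range:", data_range)

# quantiles(..., n=100) returns the 99 cut points between percentiles,
# so index p - 1 holds the p-th percentile.
percentiles = quantiles(data, n=100, method="inclusive")
p90 = percentiles[89]

# Share of measurements below the 90th percentile. With only 12 values
# this is close to, but not exactly, 90%.
share_below = sum(v < p90 for v in data) / len(data)
print(f"90th percentile: {p90}, share of data below it: {share_below:.2f}")
```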
[Figure: box plot labeled with outlier, left-hand whisker, Q1 (25th percentile), interquartile range, Q3 (75th percentile), right-hand whisker]
Note that the higher the IQR (interquartile range), the higher the variability.
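The quantities behind a box plot can be sketched as follows. The slides do not say how outliers are flagged, so the 1.5 × IQR fences used here are an assumption (a common convention), as are the `"inclusive"` quartile method, the data and the helper name `box_plot_summary`.

```python
from statistics import quantiles

def box_plot_summary(values, k=1.5):
    """Quartiles, IQR and points outside the k * IQR fences.

    The k = 1.5 fence rule is an assumed convention, not taken from the slides.
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}

# Made-up data with one unusually large value.
data = [12, 15, 17, 21, 22, 25, 28, 30, 34, 41, 45, 120]

summary = box_plot_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A wider gap between Q1 and Q3 means the middle half of the data is more spread out, which is the sense in which a higher IQR signals higher variability.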
[Figure: scatterplots of Salary versus Culture and Salary versus Sports]
- Time vs. a continuous variable such as monthly stock returns, quarterly sales, weekly interest rates, yearly earnings, etc.
- Example: The data contain monthly closing prices for the Dow Jones index from January 1947 through January 1993. The monthly returns from the index are also shown. Each return is the monthly percentage change in the index.
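The return calculation in the example can be sketched as below. The prices are made up (they are not the Dow Jones data), the function name `monthly_returns` is hypothetical, and returns are expressed as fractions, so 0.05 means a 5% change.

```python
def monthly_returns(closing_prices):
    """Percentage change from each month's close to the next, as a fraction."""
    return [
        (current - previous) / previous
        for previous, current in zip(closing_prices, closing_prices[1:])
    ]

# Made-up monthly closing prices.
prices = [100.0, 105.0, 102.9, 108.045]

returns = monthly_returns(prices)
for month, r in enumerate(returns, start=2):
    print(f"month {month}: {r:+.2%}")
```

A series of n closing prices yields n - 1 returns, because the first month has no previous close to compare against.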
[Figures: time series plot of the index against Row; time series plot of Return]
Summary
- Importance of Analytics
- Making sense of data
- Descriptive statistics vs inferential statistics
- Graphing data
- Measures of central tendency
- Measures of variability
- Two-variable summaries: Scatterplots
- Cross-sectional vs time series data
- MS Excel & StatTools Demonstration
Lecture 2
Topics :
Things to do :
- Spend time on Excel, StatTools with provided datasets
- Read Chapter 4 in the textbook
- Form your groups & you may attempt Questions 1 & 2 in Assignment 1
- Go over these notes & check Ch. 1, 2 & 3 for clarifications