MBAD6221JUD Lecture1.

MBAD 6221 JUD
Lecture Set 01
Bumsoo Kim
05/22/2012 The George Washington University School of Business Department of Decision Sciences
Outline
Overview of the syllabus
Class profile
Blackboard
Importance of analytics
What is statistics?
Describing data: Summary Measures
Introduction to the software: StatTools
Importance of Analytics
Every day, 15 million GB of new information is generated, eight times what is contained in all U.S. libraries.
What to analyze: Asking the right questions
How to analyze: Using the proper tools and correct interpretation
Increasing availability of a number of user-friendly and very powerful desktop software packages to collect and analyze data
Increasing importance of data, growth of the field of Analytics
Competitive advantage to firms that can better use the available data to make better decisions is perhaps the biggest motivation
Bottom line: no excuse for not basing decisions on good, sound analysis of data!
US Army, FedEx, United Airlines, IBM, Goldman Sachs, Google, Facebook, Gap, MediCare, Capital One, Marriott, WalMart, BestBuy, Geico
Data summary, regression, hypothesis testing, forecasting, data mining, network analysis, simulation, decision trees, optimization
Analytics in a Nutshell
Data Analysis:
Summary Statistics => Statistical Description (mean, median, mode, range, variance)
Relationships among Variables

- Linear and non-linear relationships
Statistical Inference (Population vs. Sample), Forecasts and Estimates
Decision Making:

Decision Analysis under Uncertainty
Optimization
Sensitivity Analysis
Introduction
Statistics deals with principles and methods of extracting and processing information
Statistics involves:
Collection, display, analysis and interpretation of information
Decision Making under Uncertainty vs. Certainty

What is statistics?
DATA => PROCESS => INFORMATION
Statistics is the process whereby data is transformed into information

Use the information to reach conclusions about the population
Information obtained from sample data => inference regarding the whole population
What conclusion can be reached about the population?
Example
Randomly select 500 voters from all of the people eligible to vote; record their responses.
The numerical values of the 500 responses are the data. The data values are 0s and 1s.
Data are numerical results of measurements on specified variables.
If 300 voters said Yes, the sample percentage is 300/500 = 0.6. The sample percentage is 60%.
What conclusion can be reached about the population percentage?
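The arithmetic in the voter example can be sketched in a few lines of Python. The list of responses below is a hypothetical stand-in for the recorded data (1 = Yes, 0 = No), built to match the 300-out-of-500 figure in the example.

```python
# Hypothetical coded responses: 1 = "Yes", 0 = "No" (300 Yes out of 500).
responses = [1] * 300 + [0] * 200

n = len(responses)                  # sample size
yes_count = sum(responses)          # number of 1s in the data
sample_proportion = yes_count / n   # 300 / 500

print(f"n = {n}, yes = {yes_count}, sample percentage = {sample_proportion:.0%}")
```

The sample percentage describes only these 500 voters; saying something about the population percentage is the job of inferential statistics.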
Descriptive vs. Inferential Statistics
Descriptive Statistics - Data summarization
Inferential Statistics - Use of sample data to make inferences about a population parameter
Population: A set of objects that is of interest to investigators
Sample: A subset of the population
Typical Types of Variables
Variables and observations Quantitative (Numerical) Variables
Discrete
Ex: Number of arrivals to a bank in a given hour
Continuous
Ex: Income, return
Qualitative (Categorical) Variables
Ordinal
Ex: Ranking in a race, ABC type grades
Nominal
Ex: Social Security number, zip code, gender
Describing Data
Frequency Tables and Histograms

A frequency table lists the number of observations of a particular variable that fall in different categories
A histogram is a bar chart of these frequencies
Horizontal axis shows intervals or classes
Vertical axis shows frequency or relative frequency per class
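A frequency table of the kind described above can be sketched as follows. The arrival counts are hypothetical, and each distinct value is treated as its own class; for a continuous variable the values would first be grouped into intervals.

```python
from collections import Counter

# Hypothetical sample: number of arrivals to a bank in each of 12 hours.
arrivals = [3, 5, 4, 3, 6, 5, 3, 4, 5, 3, 7, 4]

freq = Counter(arrivals)   # class -> frequency
n = len(arrivals)

print("value  frequency  relative frequency")
for value in sorted(freq):
    rel = freq[value] / n
    # The row of '#' characters is a crude text histogram bar.
    print(f"{value:>5}  {freq[value]:>9}  {rel:>18.3f}  {'#' * freq[value]}")
```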
Shapes of Histograms
1-) Symmetric Histogram:
- A histogram is symmetric if it has a single peak and looks approximately the same to the left and right of the peak.
[Figure: histogram, horizontal axis from .45 to .55]
Example: Distribution of Diameters of Elevator Rails in a Production Facility
2-) Skewed to the right (positively skewed) Histogram:
- A histogram is skewed to the right if it has a single peak and the values of the distribution extend much farther to the right than to the left of the peak.
[Figure: histogram, horizontal axis ticks from 10 to 30]
Example: Distribution of Times Between Arrivals to a Bank
3-) Skewed to the left (negatively skewed) Histogram:
- A histogram is skewed to the left if it has a single peak and the values of the distribution extend much farther to the left than to the right of the peak.
[Figure: histogram, horizontal axis from 40 to 100]
Example: Distribution of Exam Scores
4-) Bimodal Distribution Histogram:
- Some histograms have two or more peaks. This might indicate that the data comes from two different populations.
[Figure: histogram, horizontal axis from .48 to .62]
Example: Distribution of Diameters of Elevator Rails
Descriptive Statistics
1-) Measures of Central Location: Mean, median and mode
2-) Measures of Variability: Variance and standard deviation
3-) Other measures: Min, max and range
Measures of Central Location

Mean: The mean is the average of all values of the variable. If the data represent a sample from a larger population, we call this measure the sample mean and denote it by
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$
And the population mean is denoted by $\mu$.

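Median: The median is the middle observation when the data are listed from smallest to largest (ascending order). Writing $y_{(k)}$ for the $k$-th smallest observation:
$$\text{Median} = \begin{cases} y_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd;} \\ \frac{1}{2}\left( y_{\left(\frac{n}{2}\right)} + y_{\left(\frac{n}{2}+1\right)} \right), & \text{if } n \text{ is even.} \end{cases}$$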

Mode: The mode is the most frequently occurring value.
Used when the variable of interest contains categorical measurements such as gender, address, salary range, etc.
What does the mode represent if the values are continuous (salaries, for instance)?
Effect of extreme values on mean, median and mode
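The effect of extreme values on the three measures can be seen in a small sketch using Python's standard `statistics` module. The salary figures are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical salaries, in thousands of dollars.
salaries = [40, 45, 45, 50, 55]
with_outlier = salaries + [500]   # add one extreme value

for label, data in [("original", salaries), ("with outlier", with_outlier)]:
    print(f"{label:>12}: "
          f"mean = {statistics.mean(data)}, "
          f"median = {statistics.median(data)}, "
          f"mode = {statistics.mode(data)}")
```

In this sketch the single extreme value moves the mean from 47 to 122.5, while the median only moves from 45 to 47.5 and the mode stays at 45.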
Measures of Variability
Variance: The variance is essentially the average of squared deviations from the mean.
$$\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}$$
=> Population variance
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$
=> Sample variance (unbiased estimator of the population variance)
Note that when n is large enough, the two values are practically the same
Standard Deviation: The standard deviation is defined as the square root of the variance.
=> Standard Deviation = $\sqrt{\text{Variance}}$
The standard deviation is measured in original units, such as dollars, and it is easier to interpret.
Both the variance and the standard deviation can be used as a measure of risk.
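A minimal worked sketch of the variance and standard deviation definitions, using a hypothetical data set of five measurements. The only difference between the two variances is the divisor (n versus n - 1).

```python
import math

# Hypothetical data set of n = 5 measurements.
y = [2, 4, 4, 5, 10]
n = len(y)

y_bar = sum(y) / n                               # mean
squared_devs = [(yi - y_bar) ** 2 for yi in y]   # squared deviations from the mean

pop_var = sum(squared_devs) / n            # population variance: divide by n
sample_var = sum(squared_devs) / (n - 1)   # sample variance: divide by n - 1

pop_sd = math.sqrt(pop_var)                # standard deviations, in original units
sample_sd = math.sqrt(sample_var)

print(f"mean = {y_bar}")
print(f"population variance = {pop_var:.2f}, sd = {pop_sd:.3f}")
print(f"sample variance     = {sample_var:.2f}, sd = {sample_sd:.3f}")
```

Here the squared deviations sum to 36, so the population variance is 36/5 = 7.2 and the sample variance is 36/4 = 9. With a larger n the two divisors differ by much less in relative terms.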
Interpretation of the Standard Deviation: Empirical Rules
For a set of measurements having a mound-shaped symmetric histogram (normally distributed), the interval:
$\bar{y} \pm 1s$ contains approximately 68% of the measurements;
$\bar{y} \pm 2s$ contains approximately 95% of the measurements;
$\bar{y} \pm 3s$ contains approximately all of the measurements.
The approximation may be poor if the data are severely skewed or bimodal, or contain outliers.
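The percentages in the empirical rule come from the normal distribution. As a quick check, the sketch below uses `statistics.NormalDist` from the Python standard library to compute the probability of falling within 1, 2 and 3 standard deviations of the mean for an exactly normal variable; real data will only match these figures approximately.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

coverage = {}
for k in (1, 2, 3):
    # Probability of falling within k standard deviations of the mean.
    coverage[k] = z.cdf(k) - z.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage[k]:.4f}")
```

The three values are roughly 0.6827, 0.9545 and 0.9973, which the rule rounds to 68%, 95% and "approximately all".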
Other Measures
Minimum: Smallest value in the dataset
Maximum: Largest value in the dataset
Range (Max - Min): Difference between the largest value in the dataset and the smallest one
p-th percentile: Value x that exceeds p% of the measurements and is less than the remaining (100-p)%; the quartiles are the 25th, 50th and 75th percentiles
Describing the Data Sets with Boxplots

The boxplots can be used in two ways: either to describe a single variable in a data set or compare two (or more) variables.
[Figure: boxplot with labels for the outlier, left-hand whisker, Q1 (25th percentile), interquartile range, Q3 (75th percentile) and right-hand whisker]
Note that the higher the IQR (interquartile range), the higher the variability
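The quantities a boxplot displays can be computed directly. The sketch below uses hypothetical data and `statistics.quantiles` with its default method; other software (including Excel and StatTools) may use slightly different percentile conventions and give slightly different quartiles. The 1.5 x IQR fences used to flag outliers are a common convention assumed here, not something stated in these slides.

```python
import statistics

# Hypothetical measurements with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]

q1, q2, q3 = statistics.quantiles(data, n=4)   # quartile cut points
iqr = q3 - q1                                  # interquartile range
data_range = max(data) - min(data)             # Max - Min

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"range = {data_range}, outliers = {outliers}")
```

Note how the single value 30 inflates the range to 29 while the IQR, which depends only on the middle half of the data, stays at 6.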
Two variables summaries - Scatter plots

Both variables need to be quantitative (numerical, not categorical)
Suppose one variable is labeled X and the other Y. A scatter plot is a graph of the (X,Y) pairs and is used to assess the simultaneous behavior of two quantitative variables.
Example: A survey questions members of 100 households about their spending habits. The data contains salary, expenses for cultural activities, expenses for sports-related activities and expenses for dining out over the past year.
Are there any linear relationships between these variables?
Two variables summaries - Scatter plots

Salary vs. the other expenses
[Figure: three scatter plots, Bivariate Fit of Salary By Culture, Salary By Sports and Salary By Dining, with Salary on the vertical axis]
Which pairs have positive or negative linear relationships?
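Besides reading the scatter plot, one common way to put a number on the direction and strength of a linear relationship is the correlation coefficient, which these slides do not cover. The sketch below is an illustration only: the function name `correlation` and the five household records are hypothetical, not the survey data.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical records for five households (not the survey data).
salary = [40000, 50000, 60000, 70000, 80000]
dining = [1000, 1400, 1900, 2300, 3000]   # rises with salary
sports = [1800, 1500, 1300, 900, 600]     # falls as salary rises

r_dining = correlation(salary, dining)
r_sports = correlation(salary, sports)
print(f"salary vs dining: r = {r_dining:+.3f}")   # positive relationship
print(f"salary vs sports: r = {r_sports:+.3f}")   # negative relationship
```

A value near +1 corresponds to points rising along a line, a value near -1 to points falling along a line, and a value near 0 to no linear pattern.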
Cross Sectional Data

Cross sectional data is data based on observations taken at a particular point in time
- Analysis of cross-sectional data usually consists of comparing the differences among the subjects.
- Example: Attributes of voters in District at the end of 2011
- Use of PivotTables in Excel to summarize data
Time Series Data

A time series is an ordered sequence of observations. The ordering is usually through time, particularly in terms of equally spaced time intervals.
It is used to track the changes in a variable over time.
- Time vs. a continuous variable such as monthly stock returns, quarterly sales, weekly interest rates, yearly earnings, etc.
- Example: The data contains monthly closing prices for the Dow Jones index from January 1947 through January 1993. The monthly returns from the index are also shown. Each return is the monthly percentage change in the index.
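The return calculation described in the example (each return is the percentage change from one month's close to the next) can be sketched as follows. The closing values are hypothetical, not actual Dow Jones figures.

```python
# Hypothetical monthly closing values (not actual Dow Jones data).
closing = [100.0, 110.0, 99.0, 108.9]

# Return for each month = (this month's close - last month's close) / last month's close.
returns = [(curr - prev) / prev for prev, curr in zip(closing, closing[1:])]

for month, r in enumerate(returns, start=2):
    print(f"month {month}: return = {r:+.1%}")
```

A series of n closing values yields n - 1 returns, since the first month has no previous close to compare against.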
Time Series Data

[Figure: Dow Jones Closing index vs. time (monthly data); vertical axis ClosingIndex from 0 to 3500, horizontal axis Row from 0 to 600]
[Figure: Return vs. time (monthly data); vertical axis Return from -0.1 to 0.1, horizontal axis Row from 0 to 600]
Summary

Importance of Analytics
Making sense of data
Descriptive statistics vs inferential statistics
Graphing data
Measures of central tendency
Measures of variability
Two variables summaries - Scatterplots
Cross Sectional vs Time series data
MS Excel & StatTools Demonstration
Lecture 2
Topics:
- Rules and definitions of probability
- Statistical Independence
- Probability Trees
Things to do:
- Go over these notes & check Ch. 1, 2 & 3 for clarifications
- Spend time on Excel, StatTools with provided datasets
- Read Chapter 4 in the textbook
- Form your groups; you may attempt Questions 1 & 2 in Assignment 1
