
MBAD 6221 JUD

Lecture Set 01
Bumsoo Kim
05/22/2012
The George Washington University School of Business
Department of Decision Sciences

Outline

- Overview of the syllabus
- Class profile
- Blackboard

- Importance of analytics
- What is statistics?
- Describing data: Summary Measures
- Introduction to the software: StatTools

Importance of Analytics

- Every day, 15 million GB of new information is generated, eight times what is contained in all U.S. libraries.
- What to analyze: asking the right questions
- How to analyze: using the proper tools and correct interpretation
- Increasing availability of user-friendly and very powerful desktop software packages to collect and analyze data

Importance of Analytics

- Increasing importance of data and growth of the field of analytics
- Perhaps the biggest motivation: a competitive advantage for firms that can better use the available data to make better decisions
- Bottom line: no excuse for not basing decisions on good, sound analysis of data!

Importance of Analytics

US Army, FedEx, United Airlines, IBM, Goldman Sachs, Google, Facebook, Gap, MediCare, Capital One, Marriott, WalMart, BestBuy, Geico

Importance of Analytics

Data summary, regression, hypothesis testing, forecasting, data mining, network analysis, simulation, decision trees, optimization

Importance of Analytics

Analytics in a Nutshell
Data Analysis:

- Statistical Description => Summary Statistics (mean, median, mode, range, variance)
- Relationships among Variables: linear and non-linear relationships
- Statistical Inference (Population vs. Sample), Forecasts and Estimates

Decision Making:

- Decision Analysis under Uncertainty
- Optimization
- Sensitivity Analysis

Introduction

Statistics deals with principles and methods of extracting and processing information

Statistics involves:

Collection, display, analysis and interpretation of information

Decision Making under Uncertainty vs. Certainty



What is statistics?
DATA => PROCESS => INFORMATION

Statistics is the process whereby data is transformed into information.

- Use the information to reach conclusions about the population
- Information obtained from sample data => inference regarding the whole population
- What conclusion can be reached about the population?

Example
Randomly select 500 voters from all of the people eligible to vote; record their responses.

- The numerical values of the 500 responses are the data.
- The data values are 0s and 1s.
- Data are numerical results of measurements on specified variables.
- If 300 voters said Yes, the sample proportion is 300/500 = 0.6.
- The sample percentage is 60%.
- What conclusion can be reached about the population percentage?
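The arithmetic in the voter example can be sketched in a few lines of Python. The list `responses` below is a hypothetical stand-in for the recorded 0/1 answers (1 = Yes, 0 = No); it is constructed to match the 300-out-of-500 count in the example, not taken from real poll data.

```python
# Hypothetical poll data: 1 = "Yes", 0 = "No".
# Built to match the example (300 Yes answers out of 500 voters).
responses = [1] * 300 + [0] * 200

n = len(responses)                       # sample size
sample_proportion = sum(responses) / n   # share of 1s in the sample

print(f"n = {n}, sample proportion = {sample_proportion:.2f}")
print(f"sample percentage = {100 * sample_proportion:.0f}%")
```

The sample proportion is a descriptive summary of these 500 responses only; what it implies about all eligible voters is the inferential question the slide closes with.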

Descriptive vs. Inferential Statistics

- Descriptive Statistics: data summarization
- Inferential Statistics: use of sample data to make inferences about a population parameter

- Population: a set of objects that is of interest to investigators
- Sample: a subset of the population

Typical Types of Variables

Variables and observations

Quantitative (Numerical) Variables
- Discrete. Ex: Number of arrivals to a bank in a given hour
- Continuous. Ex: Income, return

Qualitative (Categorical) Variables
- Ordinal. Ex: Ranking in a race, ABC type grades
- Nominal. Ex: Social Security number, zip code, gender

Describing Data

Frequency Tables and Histograms


- A frequency table lists the number of observations of a particular variable that fall in different categories.
- A histogram is a bar chart of these frequencies.
  - Horizontal axis shows intervals or classes
  - Vertical axis shows frequency or relative frequency per class
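As a rough illustration of how a frequency table underlies a histogram, the sketch below counts observations in equal-width classes and prints a text histogram. The exam scores, the class width of 10, and the helper `frequency_table` are all assumptions made for this example; the course itself uses Excel and StatTools for this step.

```python
# Hypothetical exam scores (illustrative only, not the lecture's data).
scores = [55, 62, 68, 71, 74, 75, 78, 81, 83, 85, 88, 90, 92, 95, 97]

def frequency_table(data, start, width, n_classes):
    """Count observations in equal-width classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for x in data:
        k = (x - start) // width          # index of the class containing x
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

counts = frequency_table(scores, start=50, width=10, n_classes=5)
relative = [c / len(scores) for c in counts]   # relative frequency per class

# Text histogram: one row per class, one '*' per observation.
for k, (c, r) in enumerate(zip(counts, relative)):
    lo = 50 + 10 * k
    print(f"{lo:3d}-{lo + 9:3d} | {'*' * c:<5} {c:2d} ({r:.2f})")
```

Changing `width` changes the apparent shape of the histogram, so the choice of classes is a judgment call rather than a fixed rule.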


Shapes of Histograms
1-) Symmetric Histogram:
- A histogram is symmetric if it has a single peak and looks approximately the same to the left and right of the peak.

[Figure: symmetric histogram; horizontal axis ticks from .45 to .55]

Example: Distribution of Diameters of Elevator Rails in a Production Facility

Shapes of Histograms
2-) Skewed to the right (positively skewed) Histogram:
- A histogram is skewed to the right if it has a single peak and the values of the distribution extend much farther to the right than to the left of the peak.

[Figure: right-skewed histogram; horizontal axis ticks from 10 to 30]

Example: Distribution of Times Between Arrivals to a Bank

Shapes of Histograms
3-) Skewed to the left (negatively skewed) Histogram:
- A histogram is skewed to the left if it has a single peak and the values of the distribution extend much farther to the left than to the right of the peak.

[Figure: left-skewed histogram; horizontal axis ticks from 40 to 100]

Example: Distribution of Exam Scores

Shapes of Histograms
4-) Bimodal Distribution Histogram:
- Some histograms have two or more peaks. This might indicate that the data comes from two different populations.

[Figure: bimodal histogram; horizontal axis ticks from .48 to .62]

Example: Distribution of Diameters of Elevator Rails

Descriptive Statistics
1-) Measures of Central Location: Mean, median and mode
2-) Measures of Variability: Variance and standard deviation
3-) Other measures: Min, max and range

Measures of Central Location


Mean: The mean is the average of all values of the variable. If the data represent a sample from a larger population, we call this measure the sample mean and denote it by

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

The population mean is denoted by $\mu$.

Measures of Central Location


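Median: The median is the middle observation when the data are listed from smallest to largest (ascending order).

$$\text{Median} = \begin{cases} y_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd;} \\ \frac{1}{2}\left( y_{\left(\frac{n}{2}\right)} + y_{\left(\frac{n}{2}+1\right)} \right), & \text{if } n \text{ is even,} \end{cases}$$

where $y_{(k)}$ denotes the k-th smallest observation.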

Measures of Central Location


Mode: The mode is the most frequently occurring value.

- Useful when the variable of interest contains categorical measurements such as gender, address, salary range, etc.
- What does the mode represent if the values are continuous (salaries, for instance)?
- Effect of extreme values on the mean, median and mode
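The effect of an extreme value on the three measures of central location can be seen in a small sketch. The salary figures (in thousands of dollars) are hypothetical, and Python's standard `statistics` module stands in for the Excel/StatTools functions used in the course.

```python
import statistics

salaries = [40, 45, 45, 50, 55]      # hypothetical salaries, in $1000s
with_outlier = salaries + [500]      # same data plus one extreme value

def summarize(data):
    """Return the three measures of central location for a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
    }

before = summarize(salaries)
after = summarize(with_outlier)

print("without outlier:", before)
print("with outlier:   ", after)
```

In this example the single extreme salary pulls the mean from 47 to 122.5, while the median moves only from 45 to 47.5 and the mode stays at 45.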


Measures of Variability
Variance: The variance is essentially the average of squared deviations from the mean.

$$\sigma^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n}$$

=> Population variance

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$

=> Sample variance (unbiased estimator of the population variance)

Note that when n is large enough, the two values are practically the same.
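A short sketch can make the difference between the two divisors concrete. The data values are hypothetical; the manual calculation follows the two formulas (divide by n versus n - 1), and the standard-library functions `statistics.pvariance` and `statistics.variance` are used only as a cross-check.

```python
import statistics

y = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical observations
n = len(y)
y_bar = sum(y) / n               # mean of the observations

# Sum of squared deviations from the mean.
ss = sum((yi - y_bar) ** 2 for yi in y)

pop_var = ss / n                 # divisor n     -> population variance
sample_var = ss / (n - 1)        # divisor n - 1 -> sample variance

print(f"population variance = {pop_var:.4f}")
print(f"sample variance     = {sample_var:.4f}")

# Cross-check against the standard library.
print(statistics.pvariance(y), statistics.variance(y))
```

With only eight observations the two results differ noticeably (4 versus about 4.57); the ratio between them is n / (n - 1), which approaches 1 as n grows.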


Measures of Variability
Standard Deviation: The standard deviation is defined as the square root of the variance.

=> Standard Deviation = $\sqrt{\text{Variance}}$

- The standard deviation is measured in the original units, such as dollars, and is easier to interpret.
- Both the variance and the standard deviation can be used as measures of risk.

Measures of Variability
Interpretation of the Standard Deviation: Empirical Rules
For a set of measurements having a mound-shaped symmetric histogram (normally distributed), the interval:

- $\bar{y} \pm 1s$ contains approximately 68% of the measurements;
- $\bar{y} \pm 2s$ contains approximately 95% of the measurements;
- $\bar{y} \pm 3s$ contains approximately all of the measurements.

The approximation may be poor if the data are severely skewed or bimodal, or contain outliers.
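One way to see the empirical rule at work is to simulate mound-shaped data and count how many values fall within one, two and three standard deviations of the mean. Everything here is an assumption for illustration: the data are simulated with `random.gauss` (mean 100, standard deviation 15, fixed seed), not drawn from any course dataset.

```python
import random
import statistics

random.seed(0)                                   # fixed seed for repeatability
data = [random.gauss(100, 15) for _ in range(10_000)]   # simulated normal data

y_bar = statistics.mean(data)
s = statistics.stdev(data)                       # sample standard deviation

def share_within(k):
    """Fraction of observations inside the interval y_bar +/- k*s."""
    lo, hi = y_bar - k * s, y_bar + k * s
    return sum(lo <= y <= hi for y in data) / len(data)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} s: {share:.1%}")
```

Repeating the experiment with strongly skewed data (for example, simulated waiting times) should show shares that drift away from 68% and 95%, which is the caveat stated above.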


Other Measures
- Minimum: Smallest value in the dataset
- Maximum: Largest value in the dataset
- Range (Max - Min): Difference between the largest value in the dataset and the smallest one
- p-th percentile: Value of x that exceeds p% of the measurements and is less than the remaining (100 - p)%. The quartiles are the 25th, 50th and 75th percentiles.

Describing the Data Sets with Boxplots


The boxplots can be used in two ways: either to describe a single variable in a data set or compare two (or more) variables.

[Figure: boxplot with labels for an outlier, the left-hand whisker, Q1 (25th percentile), the interquartile range, Q3 (75th percentile) and the right-hand whisker]

Note that the higher the IQR (interquartile range), the higher the variability.
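The quantities a boxplot is built from (minimum, maximum, range, quartiles and IQR) can be computed directly, as in the hedged sketch below. The nine data values are hypothetical. Quartile conventions differ between software packages; this example assumes the `method="inclusive"` option of Python's `statistics.quantiles`, which may not match what Excel or StatTools report for the same data.

```python
import statistics

data = [12, 15, 11, 19, 14, 13, 18, 16, 17]   # hypothetical measurements

minimum = min(data)
maximum = max(data)
data_range = maximum - minimum                # Range = Max - Min

# Quartiles: 25th, 50th and 75th percentiles (convention is an assumption).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                                 # interquartile range

print(f"min = {minimum}, max = {maximum}, range = {data_range}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

The IQR covers the middle half of the data, so unlike the range it is not affected by a single very large or very small value.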


Two variables summaries - Scatter plots


Both variables need to be quantitative (numerical, not categorical).
- Suppose one variable is labeled X and the other Y. A scatter plot is a graph of the (X,Y) pairs and is used to assess the simultaneous behavior of two quantitative variables.
- Example: A survey questions members of 100 households about their spending habits. The data contain salary, expenses for cultural activities, expenses for sports-related activities and expenses for dining out over the past year.

Are there any linear relationships between these variables?
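A scatter plot answers this question visually. A common numerical companion, which these slides do not introduce, is the Pearson correlation coefficient; the sketch below computes it by hand for a hypothetical salary/dining sample. The data values and the helper `pearson_r` are assumptions made for illustration only.

```python
import math

# Hypothetical household data (not the survey data from the lecture).
salary = [40_000, 50_000, 60_000, 70_000, 80_000]
dining = [1_000, 1_500, 1_800, 2_400, 3_000]

def pearson_r(x, y):
    """Pearson correlation: co-movement of x and y scaled to lie in [-1, 1]."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(salary, dining)
print(f"r(salary, dining) = {r:.3f}")   # positive => upward-sloping cloud
```

A value near +1 or -1 suggests a strong positive or negative linear relationship, while a value near 0 suggests little linear association; the plot should still be inspected, since a non-linear pattern can hide behind a small r.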


Two variables summaries - Scatter plots


Salary vs. the other expenses
[Figure: three scatter plots with Salary (40000 to 80000) on the vertical axis: Bivariate Fit of Salary By Culture, Bivariate Fit of Salary By Sports, Bivariate Fit of Salary By Dining]

Which pairs have positive or negative linear relationships?


Cross Sectional Data


Cross-sectional data are data based on observations taken at a particular point in time.
- Analysis of cross-sectional data usually consists of comparing the differences among the subjects.
- Example: Attributes of voters in District at the end of 2011
- Use of PivotTables in Excel to summarize data

Time Series Data


A time series is an ordered sequence of observations. The ordering is usually through time, particularly in terms of equally spaced time intervals.
It is used to track the changes in a variable over time.

- Time vs. a continuous variable such as monthly stock returns, quarterly sales, weekly interest rates, yearly earnings, etc.
- Example: The data contain monthly closing prices for the Dow Jones index from January 1947 through January 1993. The monthly returns from the index are also shown. Each return is the monthly percentage change in the index.
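The step from closing prices to returns can be sketched as follows. The four closing values are hypothetical, not actual Dow Jones figures, and the return is expressed as a fraction (0.05 means 5%), which is an assumption about the scale used in the dataset.

```python
# Hypothetical month-end closing values (not real Dow Jones data).
closing = [100.0, 105.0, 102.9, 108.045]

# Return for month t = (close_t - close_{t-1}) / close_{t-1}.
# The first month has no previous value, so there is one fewer return than prices.
returns = [
    (curr - prev) / prev
    for prev, curr in zip(closing, closing[1:])
]

for month, r in enumerate(returns, start=2):
    print(f"month {month}: return = {r:+.2%}")
```

The price series and the return series typically look very different when plotted: prices tend to trend, while returns fluctuate around a level close to zero.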


Time Series Data


[Figure: time series plot of ClosingIndex (vertical axis, 0 to 3500) against Row (horizontal axis, 0 to 600)]

Dow Jones Closing index vs. time (monthly data)

Time Series Data

[Figure: time series plot of Return (vertical axis, -0.1 to 0.1) against Row (horizontal axis, 0 to 600)]

Return vs. time (monthly data)

Summary

- Importance of Analytics
- Making sense of data
- Descriptive statistics vs. inferential statistics
- Graphing data
- Measures of central tendency
- Measures of variability
- Two variables summaries - Scatter plots
- Cross-sectional vs. time series data
- MS Excel & StatTools Demonstration


Lecture 2

Topics:

- Rules and definitions of probability
- Statistical Independence
- Probability Trees

Things to do:

- Go over these notes & check Ch. 1, 2 & 3 for clarifications
- Spend time on Excel and StatTools with the provided datasets
- Read Chapter 4 in the textbook
- Form your groups; you may attempt Questions 1 & 2 in Assignment 1
