
Business Statistics

Why statistics?
• Decision making is often based on
analysis of data.
• Statistics helps you to make sense of the
data by using tools that summarize,
present and analyze the data.
• The decision maker can also quantify the
confidence in those decisions.
Examples
• How many newspapers should the vendor stock
to maximize revenue?
– Depends on the probability distribution of demand and
expected profit
• Are two or more market segments significantly
different?
– Hypothesis testing
• What proportion of people are happy with the
Sixth Pay Commission report?
– Parameter estimation
Sample vs. Population
• Population is the entire group/collection of
individuals/objects/things that we want
information about.
• Sample is part of the population that we actually
examine to gather information.
• Example
– We wish to find the average dividend percentage of
all companies traded at NSE.
• All stocks traded at NSE make up the population
• The 10% of stocks selected for gathering information form the
sample
Subdivision within Statistics

Descriptive Statistics
• Collect
• Organize
• Summarize
• Display
• Analyze

Inferential Statistics
• Predict and forecast values of population parameters
• Test hypotheses about values of population parameters
• Make decisions
Descriptive statistics - data and frequency distribution
• The following are the departure delays (in minutes) of 42 flights
selected at random from a particular airport.
10 12 45
13 8 40
13 0 0
20 45 0
95 38 67
4 47 55
0 56 5
45 50 27
50 15 26
34 12 25
48 40 25
50 42 48
53 44 23
56 46 22
Frequency Distribution
• Table with two columns listing:
– Each and every group or class or interval of values
– Associated frequency of each group
  • Number of observations assigned to each group
  • Sum of frequencies is the number of observations
• Class midpoint is the middle value of a group or class or
interval
• Relative frequency is the percentage/proportion of total
observations in each class
– Sum of relative frequencies = 1
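Frequency distribution
Delay in minutes    Frequency    Relative frequency
0–15                12           0.286
15–30               8            0.190
30–45               6            0.143
45–60               14           0.333
60 or more          2            0.048
Total               42           1
• Each class includes its lower limit and excludes its upper limit
(e.g. a delay of 15 minutes is counted in 15–30).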
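The table can be reproduced with a few lines of code. The following is a minimal sketch, not the method used to prepare the slides: it assumes each class includes its lower limit and excludes its upper limit (a binning rule that reproduces the counts shown), and the names `delays`, `classes` and `frequency_distribution` are illustrative.

```python
# Departure delays (minutes) of the 42 flights, read row by row from the data above.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38,
    67, 4, 47, 55, 0, 56, 5, 45, 50, 27, 50, 15, 26, 34,
    12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22,
]

# Assumed binning rule: lower limit included, upper limit excluded.
classes = [(0, 15), (15, 30), (30, 45), (45, 60), (60, float("inf"))]


def frequency_distribution(values, class_limits):
    """Return (lower, upper, frequency, relative frequency) for each class."""
    n = len(values)
    rows = []
    for lower, upper in class_limits:
        count = sum(1 for v in values if lower <= v < upper)
        rows.append((lower, upper, count, count / n))
    return rows


table = frequency_distribution(delays, classes)
for lower, upper, count, rel in table:
    label = f"{lower}-{upper}" if upper != float("inf") else f"{lower} or more"
    print(f"{label:<12}{count:>4}{rel:>8.3f}")
```

The frequencies sum to the number of observations and the relative frequencies sum to 1, as stated for any frequency distribution.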
Frequency distribution - histogram
[Figure: histogram of the delay data, with Frequency (0 to 16) on the vertical axis and Delay in Minutes (classes 0–15, 15–30, 30–45, 45–60, 60 or more) on the horizontal axis]
Two variable frequency distribution - cross tabulation
• A joint frequency distribution of two variables (e.g. ownership of airline, delay
in minutes)

Delay in minutes    0–15    15–30    30–45    45–60    60 or more    Total
Govt.               5       2        5        9        0             21
Private             7       6        1        5        2             21
Total               12      8        6        14       2             42
Descriptive statistics - measures
• Measures of Location
• Measures of Variability
• Skewness and Kurtosis
• Association between two variables
Measures of Location
• Arithmetic Mean
• Median
• Mode
• Percentiles
• Quartiles
Arithmetic mean

• The mean of a data set is the average
of all the data values.
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
Population mean: $\mu = \frac{\sum x_i}{N}$
Mean – example
• Average delay in flight departure

$\bar{x}$ = 1354/42 = 32.2381 minutes
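As a quick check of the arithmetic, the mean can be computed directly from the 42 delays. This is only a sketch; the function name `mean` is an illustrative choice.

```python
# Departure delays (minutes) of the 42 flights, read row by row from the data slide.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38,
    67, 4, 47, 55, 0, 56, 5, 45, 50, 27, 50, 15, 26, 34,
    12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22,
]


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


print(sum(delays), len(delays))   # 1354 42
print(round(mean(delays), 4))     # 32.2381
```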


Median
• It is the middle item in a data set that is
arranged in ascending/descending order
• If there are n observations, the median is the
(n+1)/2-th observation.
Computation rule
• If n is odd, (n+1)/2 is an integer, so the median is that observation
• If n is even, use the average of the (n/2)-th and (n/2 + 1)-th
observations
Example
• Sorted 42 observations (read down each column):
0 22 45
0 23 46
0 25 47
0 25 48
4 26 48
5 27 50
8 34 50
10 38 50
12 40 53
12 40 55
13 42 56
13 44 56
15 45 67
20 45 95
• The median is the average of the 21st and 22nd observations
= (34+38)/2
= 36
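The odd/even computation rule can be sketched as follows. The function name `median` and the 0-based indexing details are illustrative; Python lists start at index 0, so the k-th observation is `ordered[k - 1]`.

```python
# Departure delays (minutes) of the 42 flights.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38,
    67, 4, 47, 55, 0, 56, 5, 45, 50, 27, 50, 15, 26, 34,
    12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22,
]


def median(values):
    """Middle observation (n odd) or average of the two middle ones (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # (n+1)/2-th observation in 1-based terms
        return ordered[mid]
    # average of the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(delays))  # 36.0
```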
Mode
• The mode is the most frequently occurring observation
– the mode in the example is 0 (it occurs 4 times)
• The greatest frequency can occur at two
or more different values.
• If the data have exactly two modes, the
data are bimodal.
• If the data have more than two modes, the
data are multimodal.
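Because the greatest frequency can occur at more than one value, a mode routine is easiest to write so that it returns a list. A minimal sketch, with `modes` as an illustrative name:

```python
from collections import Counter

# Departure delays (minutes) of the 42 flights.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38,
    67, 4, 47, 55, 0, 56, 5, 45, 50, 27, 50, 15, 26, 34,
    12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22,
]


def modes(values):
    """Return every value that attains the greatest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes(delays))           # [0]
print(modes([1, 1, 2, 2, 3]))  # [1, 2], a bimodal data set
```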
Percentiles and Quartiles
• Given any set of ordered numerical observations:
– The Pth percentile in the ordered set is that
value below which lie P% (P percent) of the
observations in the set.
– The position of the Pth percentile is given by (n +
1)P/100, where n is the number of observations in
the set.
Example
• Calculate the 45th percentile of the airline
delay data
– The position of the 45th percentile is
45*(42+1)/100 = 19.35
– Value of the 45th percentile
= 19th observation + 0.35 × (20th observation – 19th observation)
= 26 + 0.35(27 – 26)
= 26.35
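The position rule (n + 1)P/100 with linear interpolation between neighbouring observations can be sketched as below. The function name `percentile` is illustrative, and clamping to the smallest or largest observation when the position falls outside 1..n is an assumption the slides do not address.

```python
# Departure delays (minutes) of the 42 flights.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38,
    67, 4, 47, 55, 0, 56, 5, 45, 50, 27, 50, 15, 26, 34,
    12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22,
]


def percentile(values, p):
    """Pth percentile using the (n + 1)P/100 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p / 100      # 1-based, possibly fractional
    whole = int(position)             # e.g. 19 for position 19.35
    fraction = position - whole       # e.g. 0.35
    if whole < 1:                     # assumption: clamp at the smallest value
        return ordered[0]
    if whole >= n:                    # assumption: clamp at the largest value
        return ordered[-1]
    lower = ordered[whole - 1]        # the whole-th observation
    upper = ordered[whole]            # the next observation
    return lower + fraction * (upper - lower)


print(round(percentile(delays, 45), 2))  # 26.35
```

Statistical packages offer several percentile conventions, so other tools may return slightly different values for the same data.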
Quartiles
• Quartiles are special names for particular percentiles
• Q1 = 25th percentile
• Q2 = 50th percentile = median
• Q3 = 75th percentile
Measures of Variability
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation
Range
• The range of a data set is the difference
between the largest and smallest data values.
• It is the simplest measure of variability.
• It is very sensitive to the smallest and largest
data values.
• Example from airline delay data
Range = 95 – 0 = 95 minutes
Interquartile range
• The interquartile range of a data set is the
difference between the third quartile and the first
quartile.
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data
values.
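A sketch of the interquartile range for the airline delay data, assuming the quartiles are located with the same (n + 1)P/100 position rule used for percentiles. The helper `percentile` and the variable names are illustrative.

```python
# Departure delays (minutes) of the 42 flights.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38,
    67, 4, 47, 55, 0, 56, 5, 45, 50, 27, 50, 15, 26, 34,
    12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22,
]


def percentile(values, p):
    """Pth percentile using the (n + 1)P/100 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p / 100
    whole = int(position)
    fraction = position - whole
    if whole < 1:          # assumption: clamp at the smallest value
        return ordered[0]
    if whole >= n:         # assumption: clamp at the largest value
        return ordered[-1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


q1 = percentile(delays, 25)   # first quartile
q3 = percentile(delays, 75)   # third quartile
iqr = q3 - q1                 # spread of the middle 50% of the data
print(q1, q3, iqr)
```

The single 95-minute delay sets the range but has no effect on this figure, which is the sense in which the interquartile range is less sensitive to extreme values.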
Variance
• The variance is a measure of variability
that utilizes all the data.
• It is based on the difference between the
value of each observation ($x_i$) and the
mean ($\bar{x}$ for a sample, $\mu$ for a population).
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation
• The standard deviation of a data set is the
positive square root of the variance.
• It is measured in the same units as the
data, which makes it easier than the variance
to compare with the mean.
• If the data set is a sample, the standard
deviation is denoted $s$.
• If the data set is a population, the standard
deviation is denoted $\sigma$ (sigma).
Coefficient of Variation
• The coefficient of variation indicates how large the
standard deviation is in relation to the mean.
• If the data set is a sample, the coefficient of variation
is computed as follows:
$\frac{s}{\bar{x}} (100)$
• If the data set is a population, the coefficient of
variation is computed as follows:
$\frac{\sigma}{\mu} (100)$
Example
• Variance (sample)
= 465.89 minutes squared
• Standard Deviation
= 21.585 minutes
• Coefficient of Variation
= (21.585/32.2381)(100) = 66.95%
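These three figures can be reproduced by treating the 42 delays as a sample (divisor n - 1). A minimal sketch; the function names are illustrative.

```python
import math

# Departure delays (minutes) of the 42 flights.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38,
    67, 4, 47, 55, 0, 56, 5, 45, 50, 27, 50, 15, 26, 34,
    12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22,
]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((v - x_bar) ** 2 for v in values) / (n - 1)


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    x_bar = sum(values) / len(values)
    return math.sqrt(sample_variance(values)) / x_bar * 100


variance = sample_variance(delays)
std_dev = math.sqrt(variance)
cv = coefficient_of_variation(delays)
print(round(variance, 2))  # 465.89
print(round(std_dev, 3))   # 21.585
print(round(cv, 2))        # 66.95
```

Dividing by N instead of n - 1 would give the population variance, which is slightly smaller for the same data.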
Skewness
• Skewness characterizes the degree of
asymmetry of a distribution around its
mean
– Positively skewed
– Symmetric or unskewed
– Negatively skewed
[Figures: three example distributions, one negatively skewed, one symmetric, one positively skewed]
Skewness - measure
Skewness of a distribution is measured by
$\text{Skewness} = \frac{\sum (X - \mu)^3}{N \sigma^3}$
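The population formula above (sum of cubed deviations divided by N times the cubed standard deviation) can be sketched as follows. Treating the 42 delays as a whole population is an assumption made only for illustration, and `population_skewness` is an illustrative name.

```python
import math

# Departure delays (minutes) of the 42 flights, treated here as a population.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38,
    67, 4, 47, 55, 0, 56, 5, 45, 50, 27, 50, 15, 26, 34,
    12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22,
]


def population_skewness(values):
    """Sum of cubed deviations from the mean, divided by N * sigma**3."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return sum((v - mu) ** 3 for v in values) / (n * sigma ** 3)


print(population_skewness(delays))
print(population_skewness([1, 2, 3, 4, 5]))  # symmetric data give 0.0
```

A positive result indicates a longer right tail, a negative result a longer left tail.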
For a given data set you may use
Kurtosis
• Kurtosis characterizes the relative
peakedness or flatness of a symmetric
distribution compared to the normal
distribution
– Platykurtic (relatively flat)
– Mesokurtic (normal)
– Leptokurtic (relatively peaked)
[Figures: Platykurtic - flat distribution; Mesokurtic - not too flat and not too peaked; Leptokurtic - peaked distribution]
Kurtosis - measure
• Kurtosis for a distribution is measured by
$\beta_2 - 3$, where $\beta_2 = \frac{\sum (X - \mu)^4}{N \sigma^4}$
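A sketch of the kurtosis measure above: the ratio of the summed fourth-power deviations to N times the fourth power of the standard deviation, minus 3. The names `excess_kurtosis` and `beta2` are illustrative, and treating the delays as a population is an assumption.

```python
import math

# Departure delays (minutes) of the 42 flights, treated here as a population.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38,
    67, 4, 47, 55, 0, 56, 5, 45, 50, 27, 50, 15, 26, 34,
    12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22,
]


def excess_kurtosis(values):
    """Fourth-moment ratio minus 3, so that a normal distribution scores 0."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    beta2 = sum((v - mu) ** 4 for v in values) / (n * sigma ** 4)
    return beta2 - 3


print(excess_kurtosis(delays))
print(round(excess_kurtosis([1, 2, 3, 4, 5]), 4))  # -1.3, flatter than normal
```

Values below 0 correspond to platykurtic shapes, values near 0 to mesokurtic shapes, and values above 0 to leptokurtic shapes.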
For a given data set you may use
Association between two variables
Delay Passengers Delay Passengers Delay Passengers
53 65 56 51 50 68
40 61 42 50 0 72
46 53 25 57 38 74
0 65 13 57 55 68
22 45 40 54 45 73
5 58 8 54 15 63
44 68 27 65 48 68
12 65 67 57 0 55
12 56 48 62 10 45
25 50 4 50 50 71
13 70 45 61 56 64
50 73 0 59 26 60
45 63 34 63 47 61
23 56 95 49 20 48
Association between two variables
• Scatter plot
• Covariance
• Correlation Coefficient
Scatter Plot
• Scatter Plots are used to identify any
underlying relationships among pairs of
data sets.
• The plot consists of a scatter of points,
each point representing an observation.
Scatter Plot
[Figure: scatter plot "Delay vs Passengers", with Delay (0 to 100) on the vertical axis and Passengers (0 to 80) on the horizontal axis]
Covariance
• The covariance is a measure of the linear
association between two variables.
• Positive values indicate a positive
relationship.
• Negative values indicate a negative
relationship
Covariance
• If the data sets are samples, the covariance
is denoted by
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ (= 20.42 in the airline example)
• If the data sets are populations, the
covariance is denoted by
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
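Both covariance formulas can be sketched in a few lines. To keep the listing short, the example uses only the first five (delay, passengers) pairs from the leftmost columns of the data table, so its output is not the full-sample figure quoted for the airline example. The function names are illustrative.

```python
# First five (delay, passengers) pairs from the leftmost columns of the table.
delay = [53, 40, 46, 0, 22]
passengers = [65, 61, 53, 65, 45]


def sample_covariance(x, y):
    """Sum of products of deviations from the sample means, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def population_covariance(x, y):
    """Same sum of products, divided by N."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    return sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n


print(sample_covariance(delay, passengers))
print(population_covariance(delay, passengers))
```

The sign of the result shows the direction of the linear relationship; its size depends on the units of both variables, which is why the correlation coefficient is usually reported alongside it.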
Correlation Coefficient

• The coefficient can take on values between -1 and +1.


• Values near -1 indicate a strong negative linear relationship.
• Values near +1 indicate a strong positive linear relationship.
• If the data sets are samples, the coefficient is
$r_{xy} = \frac{s_{xy}}{s_x s_y}$ (= 0.121 in the airline example)
• If the data sets are populations, the coefficient is
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
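The sample correlation coefficient divides the sample covariance by the product of the two sample standard deviations. A minimal sketch using the first five (delay, passengers) pairs from the leftmost columns of the data table, chosen only to keep the listing short; the function names are illustrative.

```python
import math

# First five (delay, passengers) pairs from the leftmost columns of the table.
delay = [53, 40, 46, 0, 22]
passengers = [65, 61, 53, 65, 45]


def sample_std(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    return s_xy / (sample_std(x) * sample_std(y))


r = sample_correlation(delay, passengers)
print(r)  # always lies between -1 and +1
```

Unlike the covariance, the coefficient is unit-free, so it can be compared across pairs of variables measured on different scales.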
