Summary Measures
Describing Data Numerically
Quartiles
Shape Skewness
Coefficient of Variation
Arithmetic Mean
Median
Mode
Geometric Mean
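Geometric mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
Arithmetic mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$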
Arithmetic Mean
The arithmetic mean (mean) is the most common measure of central tendency
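$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
where n is the sample size and $X_1, X_2, \ldots, X_n$ are the observed values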
Arithmetic Mean
(continued)
The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3
Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10) / 5 = 20 / 5 = 4
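The two calculations above can be checked with a short Python sketch. It uses the standard-library `statistics.mean`; the variable names are illustrative only.

```python
from statistics import mean

values = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]  # same values, but 5 is replaced by the outlier 10

mean_plain = mean(values)          # 15 / 5
mean_outlier = mean(with_outlier)  # 20 / 5

# A single extreme value is enough to pull the mean upward.
print(mean_plain, mean_outlier)
```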
Median
In an ordered array, the median is the middle number (50% above, 50% below)
[Figure: two data sets plotted on a 0 to 10 axis, each with Median = 3]
The location of the median: Median position = (n + 1)/2 position in the ordered data
If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of the two middle numbers
Note that (n + 1)/2 is not the value of the median, only the position of the median in the ranked data
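The position rule can be sketched in Python as follows. The function name `median_by_position` and the four-value list are hypothetical, chosen only to exercise the even case.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    position = (n + 1) / 2  # 1-based position in the ranked data, not the median itself
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take the middle value.
        return ranked[int(position) - 1]
    # Even n: the position ends in .5, so average the two middle values.
    lower = ranked[n // 2 - 1]
    upper = ranked[n // 2]
    return (lower + upper) / 2

odd_median = median_by_position([1, 2, 3, 4, 10])  # position 3
even_median = median_by_position([1, 2, 3, 4])     # position 2.5 (hypothetical data)
print(odd_median, even_median)
```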
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical (nominal) data
There may be no mode
There may be several modes
[Figure: two data sets on number lines, one with Mode = 9 and one with No Mode]
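A minimal Python sketch of these properties follows. The helper `modes` and all three data sets are hypothetical; treating "every value occurs once" as "no mode" is an assumption about the convention used here, and the input is assumed to be non-empty.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when every value occurs only once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats: report no mode
    return sorted(v for v, c in counts.items() if c == top)

single_mode = modes([5, 9, 9, 9, 12])                  # one mode
no_mode = modes([0, 1, 2, 3, 4, 5, 6])                 # all values distinct
several = modes(["red", "blue", "red", "blue", "tan"])  # categorical data, two modes
print(single_mode, no_mode, several)
```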
Review Example
$100 K $100 K
Mean: $3,000,000 / 5 = $600,000
Median: middle value of ranked data = $300,000
Mode: most frequent value = $100,000
The mean is generally used unless extreme values (outliers) exist. In that case the median is often used, since the median is not sensitive to extreme values.
Example: median home prices may be reported for a region, because the median is less sensitive to outliers.
Quartiles
Quartiles split the ranked data into 4 segments with an equal number of values per segment
[Diagram: 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%]
The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile
Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4
Quartiles
Example (n = 9): Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so use the value halfway between the 2nd and 3rd values: Q1 = 12.5
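The position formulas can be sketched in Python as below. The nine-value sample is hypothetical, chosen so that the 2nd and 3rd ranked values are 12 and 13 and Q1 comes out at 12.5. Linear interpolation for fractional positions other than .5 is an assumption; the slide only describes the halfway case.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the k * (n + 1) / 4 position rule."""
    ranked = sorted(values)
    n = len(ranked)
    position = k * (n + 1) / 4       # 1-based position in the ranked data
    lower = int(position)            # whole part of the position
    fraction = position - lower      # 0.5 means "halfway between two values"
    if lower < 1:
        return ranked[0]
    if lower >= n:
        return ranked[-1]
    below = ranked[lower - 1]
    above = ranked[lower]
    # Interpolate between the neighbouring ranked values (assumed rule).
    return below + fraction * (above - below)

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]  # hypothetical data, n = 9
q1, q2, q3 = (quartile(sample, k) for k in (1, 2, 3))
print(q1, q2, q3)
```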
Geometric Mean
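Geometric mean:
$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
Geometric mean rate of return:
$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$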
Example
An investment of $100,000 declined to $50,000 at the end of year one and rebounded to $100,000 at end of year two:
$X_1$ = $100,000
$X_2$ = $50,000 (50% decrease)
$X_3$ = $100,000 (100% increase)
The overall two-year return is zero, since it started and ended at the same level.
Example
(continued)
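Use the 1-year returns to compute the arithmetic mean and the geometric mean:
Arithmetic mean rate of return: $\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$ (misleading result, since the overall return is zero)
Geometric mean rate of return: $\bar{R}_G = [(1 + (-0.5)) \times (1 + 1.0)]^{1/2} - 1 = [0.5 \times 2]^{1/2} - 1 = 0\%$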
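The comparison can be reproduced with a small Python sketch that uses the three portfolio values from the example. Variable names are illustrative, and `math.prod` requires Python 3.8 or later.

```python
import math

prices = [100_000, 50_000, 100_000]  # start, end of year one, end of year two

# One-year returns: -0.5 (50% decrease) and 1.0 (100% increase)
returns = [(later - earlier) / earlier for earlier, later in zip(prices, prices[1:])]

arithmetic_mean_return = sum(returns) / len(returns)

# Geometric mean rate of return: [(1 + R1) * ... * (1 + Rn)]^(1/n) - 1
growth = math.prod(1 + r for r in returns)
geometric_mean_return = growth ** (1 / len(returns)) - 1

print(arithmetic_mean_return, geometric_mean_return)
```

Only the geometric figure agrees with the zero overall two-year return stated above.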
Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center but different variation]
Range
Simplest measure of variation
Difference between the largest and the smallest values in a set of data:
Range = $X_{\text{largest}} - X_{\text{smallest}}$
Example: Range = 14 - 1 = 13
Two further data sets each have Range = 12 - 7 = 5
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
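The effect of the single outlier can be checked with a minimal Python sketch built from the two 25-value lists above. The helper name `data_range` is illustrative.

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# Eleven 1s, eight 2s, four 3s, then the last two values differ.
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

range_base = data_range(base)
range_outlier = data_range(with_outlier)
print(range_base, range_outlier)
```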
Interquartile Range
Can eliminate some outlier problems by using the interquartile range
Eliminate some high- and low-valued observations and calculate the range from the remaining values
Interquartile range = 3rd quartile - 1st quartile = Q3 - Q1
Interquartile Range
Example:
Minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, Maximum = 70 (25% of the values fall in each of the four segments)
Interquartile range = 57 - 30 = 27
Variance
Sample variance:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Where
$\bar{X}$ = mean
n = sample size
$X_i$ = ith value of the variable X
Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Is the square root of the variance
Has the same units as the original data
Sample standard deviation:
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
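Sample data ($X_i$): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{X}$ = 16
$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$
$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$
$= \sqrt{\frac{130}{7}} = 4.3095$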
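The calculation can be verified with a short Python sketch. It assumes the eight sample values are 10, 12, 14, 15, 17, 18, 18, 24, which is the set consistent with the mean of 16, the sum of squares of 130, and n = 8 shown in the example.

```python
from statistics import mean, stdev

sample = [10, 12, 14, 15, 17, 18, 18, 24]  # assumed sample values, n = 8

x_bar = mean(sample)

# Sum of squared deviations from the mean (the numerator of S^2)
squared_devs = sum((x - x_bar) ** 2 for x in sample)

# Divide by n - 1 for a sample, then take the square root
s_manual = (squared_devs / (len(sample) - 1)) ** 0.5

# statistics.stdev also uses the n - 1 denominator
s_library = stdev(sample)

print(x_bar, squared_devs, round(s_manual, 4))
```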
Measuring variation
Small standard deviation
[Figure: Data B and Data C, each plotted on an 11 to 21 axis]
Each value in the data set is used in the calculation
Values far from the mean are given extra weight (because deviations from the mean are squared)
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Can be used to compare two or more sets of data measured in different units
$CV = \left( \frac{S}{\bar{X}} \right) \times 100\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price
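A minimal Python sketch of this comparison follows. The prices and standard deviations are hypothetical figures chosen only so that both stocks share the same standard deviation; they are not taken from the original example.

```python
def coefficient_of_variation(std_dev, mean_value):
    """CV = (S / mean) * 100, expressed as a percentage."""
    return std_dev / mean_value * 100

# Hypothetical figures: same standard deviation, different average prices.
cv_a = coefficient_of_variation(std_dev=5.0, mean_value=50.0)
cv_b = coefficient_of_variation(std_dev=5.0, mean_value=100.0)

# The stock with the higher average price has the smaller relative variation.
print(cv_a, cv_b)
```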
Z Scores
A measure of distance from the mean (for example, a Z-score of 2.0 means that a value is 2.0 standard deviations from the mean)
The difference between a value and the mean, divided by the standard deviation
A Z score above 3.0 or below -3.0 is considered an outlier
$Z = \frac{X - \bar{X}}{S}$
Z Scores
(continued)
Example: If the mean is 14.0 and the standard deviation is 3.0, what is the Z score for the value 18.5?
$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$
The value 18.5 is 1.5 standard deviations above the mean. (A negative Z-score would mean that a value is less than the mean.)
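The same Z score can be computed with a small Python sketch. The function name and the outlier flag are illustrative; the 3.0 cutoff is the rule of thumb stated earlier.

```python
def z_score(value, mean_value, std_dev):
    """Distance of a value from the mean, in standard deviations."""
    return (value - mean_value) / std_dev

z = z_score(18.5, mean_value=14.0, std_dev=3.0)

# Rule of thumb from the slides: |Z| above 3.0 marks an outlier.
is_outlier = abs(z) > 3.0
print(z, is_outlier)
```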
Shape of a Distribution
Symmetric or skewed
Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
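Population summary measures are called parameters
The population mean is the sum of the values in the population divided by the population size, N:
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$
Where
$\mu$ = population mean
N = population size
$X_i$ = ith value of the variable X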
Population Variance
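Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Where
$\mu$ = population mean
N = population size
$X_i$ = ith value of the variable X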
Most commonly used measure of variation
Shows variation about the mean
Is the square root of the population variance
Has the same units as the original data
Population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
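The difference between the population formulas (divide by N) and the sample formulas (divide by n - 1) can be seen in a short Python sketch. The eight values are illustrative and are treated as a complete population purely for the sake of the comparison.

```python
from statistics import pstdev, pvariance, variance

values = [10, 12, 14, 15, 17, 18, 18, 24]  # illustrative data, N = 8

sigma_squared = pvariance(values)  # sum of squared deviations / N
sigma = pstdev(values)             # square root of the population variance
s_squared = variance(values)       # sum of squared deviations / (n - 1)

# Dividing by N gives a smaller figure than dividing by n - 1.
print(sigma_squared, sigma, s_squared)
```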
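For an approximately bell-shaped distribution:
$\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample
$\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample
$\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample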
Chebyshev Rule
Regardless of how the data are distributed, at least $(1 - 1/k^2) \times 100\%$ of the values will fall within k standard deviations of the mean (for k > 1)
Examples:
At least $(1 - 1/1^2) \times 100\% = 0\%$ within k = 1 ($\mu \pm 1\sigma$)
At least $(1 - 1/2^2) \times 100\% = 75\%$ within k = 2 ($\mu \pm 2\sigma$)
At least $(1 - 1/3^2) \times 100\% = 89\%$ within k = 3 ($\mu \pm 3\sigma$)
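The bound can be sketched in Python and compared against a deliberately skewed data set. The function names are illustrative, and the 25-value list with a single large outlier is used here only as a convenient non-bell-shaped example.

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only informative for k >= 1")
    return 1 - 1 / k ** 2

def share_within(values, k):
    """Observed share of values within k standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)

skewed = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]  # heavily right-skewed data

bound_k2 = chebyshev_lower_bound(2)
observed_k2 = share_within(skewed, 2)
print(bound_k2, observed_k2)
```

The observed share should never fall below the bound, whatever the shape of the data.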
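Sometimes only a frequency distribution is available, not the raw data
Use the midpoint of a class interval to approximate the values in that class
$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n}$
Where
n = sample size
c = number of classes in the frequency distribution
$m_j$ = midpoint of the jth class
$f_j$ = number of values in the jth class
Assume that all values within each class interval are located at the midpoint of the class
$S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n - 1}}$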
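The grouped-data approximation can be sketched in Python as follows. The frequency table is entirely hypothetical, and the variable names are illustrative.

```python
# Hypothetical frequency distribution: ((lower limit, upper limit), frequency)
classes = [
    ((10, 20), 3),
    ((20, 30), 6),
    ((30, 40), 5),
    ((40, 50), 4),
    ((50, 60), 2),
]

midpoints = [(low + high) / 2 for (low, high), _ in classes]
freqs = [f for _, f in classes]
n = sum(freqs)  # sample size is the total frequency

# Approximate mean: every value is assumed to sit at its class midpoint
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Approximate sample standard deviation with the n - 1 denominator
grouped_var = sum((m - grouped_mean) ** 2 * f for m, f in zip(midpoints, freqs)) / (n - 1)
grouped_sd = grouped_var ** 0.5

print(n, grouped_mean, round(grouped_sd, 3))
```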
[Figure: box-and-whisker plot marking Minimum, Quartile, Median, Quartile, Maximum, with 25% of the values in each of the four segments]
The Box and central line are centered between the endpoints if data are symmetric around the median
[Figure: box plot labeled Min, Q1, Median, Q3, Max, followed by three box plots each marking Q1, Q2, Q3]
The sample covariance measures the strength of the linear relationship between two variables (called bivariate data)
The sample covariance:
$cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
Only concerned with the strength of the relationship
No causal effect is implied
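Interpreting Covariance
cov(X, Y) > 0: X and Y tend to move in the same direction
cov(X, Y) < 0: X and Y tend to move in opposite directions
cov(X, Y) = 0: X and Y have no linear relationship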
Coefficient of Correlation
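Measures the relative strength of the linear relationship between two variables
Sample coefficient of correlation:
$r = \frac{cov(X, Y)}{S_X S_Y}$
where
$cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
$S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$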
Unit free
Ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
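The covariance and correlation formulas can be sketched directly in Python. The function names and the small paired data sets are hypothetical, and the lists are assumed to have equal length with at least two pairs.

```python
def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y); unit free, between -1 and 1."""
    s_x = sample_covariance(xs, xs) ** 0.5  # cov(X, X) is the sample variance of X
    s_y = sample_covariance(ys, ys) ** 0.5
    return sample_covariance(xs, ys) / (s_x * s_y)

xs = [1, 2, 3, 4, 5]                                # hypothetical data
r_perfect_pos = sample_correlation(xs, [2, 4, 6, 8, 10])
r_perfect_neg = sample_correlation(xs, [10, 8, 6, 4, 2])
r_partial = sample_correlation(xs, [2, 1, 4, 3, 5])
print(r_perfect_pos, r_perfect_neg, r_partial)
```

Rescaling either variable changes the covariance but leaves r unchanged, which is why r is the measure used to compare strength.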
[Figure: scatter plots of Y against X for r = -1, r = -0.6, r = 0, r = +1, r = +0.3, and r = 0]
Analysts should report the summary measures that best meet the assumptions about the data set