
DATA MINING

LECTURE 2
Statistical Methods
Statistical Methods
• Mean
• Median
• Mode
• Mean, median, and mode are three kinds of
"averages". There are many "averages" in
statistics, but these are, I think, the three most
common, and are certainly the three you are most
likely to encounter.
Statistical Methods
• The "mean" is the "average" you're used to,
where you add up all the numbers and then
divide by the number of numbers.
• The "median" is the "middle" value in the list of
numbers. To find the median, your numbers have
to be listed in numerical order from smallest to
largest, so you may have to rewrite your list
before you can find the median.
• The "mode" is the value that occurs most often. If
no number in the list is repeated, then there is no
mode for the list.
What is mean?
• Mean is the average of numbers.
Example:
3, 5, 6, 9, 8

Mean = sum of all values / total number of values

Mean = (3 + 5 + 6 + 9 + 8) / 5

Mean = 6.2
How to calculate the mean for data with
frequencies?
Age (X)   Frequency (F)   Age × Frequency (FX)
22        5               22 × 5 = 110
33        2               33 × 2 = 66
44        6               44 × 6 = 264
66        4               66 × 4 = 264
Total     ∑F = 17         ∑FX = 704

Mean = ∑FX / ∑F

Mean = 704 / 17

Mean ≈ 41.41
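The frequency-table calculation above can be sketched in Python. This is a minimal illustration, not part of the original slides; the function name `frequency_mean` is a hypothetical helper chosen for this example.

```python
# Ages and how often each one occurs (values from the table above).
ages = [22, 33, 44, 66]
frequencies = [5, 2, 6, 4]


def frequency_mean(values, freqs):
    """Return sum(F * X) / sum(F) for paired values and frequencies."""
    total_fx = sum(f * x for x, f in zip(values, freqs))  # sum of FX
    total_f = sum(freqs)                                   # sum of F
    return total_fx / total_f


mean_age = frequency_mean(ages, frequencies)
print(round(mean_age, 2))  # 704 / 17, about 41.41
```

Setting every frequency to 1 reduces this to the ordinary mean, so the same helper also reproduces the earlier example (3, 5, 6, 9, 8 gives 6.2).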
What is Median?
• Median is the middle value among all values.

How to calculate median for an odd number of values?
Example:
9, 8, 5, 6, 3

• Arrange the values in order:

3, 5, 6, 8, 9

Median = 6 (the middle value)
How to calculate median for an even
number of values?
Example:
9, 8, 5, 6, 3, 4

• Arrange the values in order:

3, 4, 5, 6, 8, 9

• Add the two middle values and calculate their mean.

Median = (5 + 6) / 2

Median = 5.5
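Both median cases (odd and even count) can be sketched in one short Python function. This is an illustrative sketch under the definitions given above; the function name `median` is chosen for this example, and Python's standard `statistics.median` offers the same behaviour.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)          # the list must be in numerical order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([9, 8, 5, 6, 3])       # 6
even_median = median([9, 8, 5, 6, 3, 4])   # 5.5
print(odd_median, even_median)
```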
What is Mode?
• The mode is the value that occurs most often.

How to calculate mode?

Example:
3, 6, 6, 8, 9

Mode = 6 (because 6 occurs 2 times and all other values occur only once).
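A minimal Python sketch of the counting behind the mode is shown below. It follows the rule stated earlier that a list with no repeated value has no mode. Returning a list (so that ties can be reported) is an assumption of this sketch, not something the slides specify, and `modes` is a hypothetical helper name.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)                 # value -> number of occurrences
    highest = max(counts.values())
    if highest == 1:                         # nothing is repeated: no mode
        return []
    return [v for v, c in counts.items() if c == highest]


mode_result = modes([3, 6, 6, 8, 9])   # [6]
no_mode = modes([3, 5, 6, 8, 9])       # []
print(mode_result, no_mode)
```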
How to calculate the estimated mean
and estimated median of grouped
data?
Groups         X     F    CF
85.5-90.5      88    6    6
90.5-95.5      93    4    10
95.5-100.5     98    10   20
100.5-105.5    103   6    26
105.5-110.5    108   3    29
110.5-115.5    113   1    30
Total                30

Mean = ∑FX / ∑F

Median = L + (h/f) * ((n/2) - cf)
       = 95.5 + (5/10) * (15 - 10)
       = 98

The same formula written with the alternative names:

Median = L + (((TV/2) - SBM) / FMG) * GW
       = 95.5 + (((30/2) - 10) / 10) * 5
       = 98

Median group: n/2 = 30/2 = 15th observation, which falls in 95.5-100.5

Quantity                                        Symbol   Value                 Alt. name
Lower boundary of the median group              L        (93 + 98)/2 = 95.5    L
Group width                                     h        100.5 - 95.5 = 5      GW
Frequency of the median group                   f        10                    FMG
Sum of frequencies                              n        30                    TV
Cumulative frequency before the median group    cf       10                    SBM
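The grouped mean and median estimates can be checked with a short Python sketch. It assumes equal-width groups described by their lower boundaries, as in the table above; the function names are hypothetical helpers for this example.

```python
lower_bounds = [85.5, 90.5, 95.5, 100.5, 105.5, 110.5]
midpoints = [88, 93, 98, 103, 108, 113]      # X column
freqs = [6, 4, 10, 6, 3, 1]                  # F column
width = 5                                    # h (group width)


def grouped_mean(mids, fs):
    """Estimated mean: sum(F * X) / sum(F) using group midpoints."""
    return sum(f * x for x, f in zip(mids, fs)) / sum(fs)


def grouped_median(lowers, h, fs):
    """Estimated median: L + (h / f) * (n/2 - cf)."""
    n = sum(fs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    target = n / 2
    cf = 0                                   # cumulative frequency so far
    for lower, f in zip(lowers, fs):
        if cf + f >= target:                 # this is the median group
            return lower + (h / f) * (target - cf)
        cf += f


est_mean = grouped_mean(midpoints, freqs)
est_median = grouped_median(lower_bounds, width, freqs)
print(round(est_mean, 2), est_median)
```

The median comes out as 98, matching the worked calculation; the mean is 2935 / 30, which the slide leaves as a formula.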
How to calculate the estimated mode
of the above-grouped data?
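Groups         X     F
85.5-90.5      88    6
90.5-95.5      93    4     f1 (FBMG)
95.5-100.5     98    10    fm (FMG)
100.5-105.5    103   6     f2 (FAMG)
105.5-110.5    108   3
110.5-115.5    113   1
Total                30

Here fm is the frequency of the modal group (the group with the highest frequency), f1 is the frequency of the group before it, and f2 is the frequency of the group after it.

Mode = L + ((fm - f1) / ((2 * fm) - f1 - f2)) * h
     = 95.5 + ((10 - 4) / ((2 * 10) - 4 - 6)) * 5
     = 98.5

Equivalently, with the alternative names:

Mode = L + ((FMG - FBMG) / ((FMG - FBMG) + (FMG - FAMG))) * GW
     = 95.5 + ((10 - 4) / ((10 - 4) + (10 - 6))) * 5
     = 98.5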
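The grouped-mode formula can be sketched in Python as follows. Treating a missing neighbour (when the modal group is first or last) as frequency 0 is an assumption of this sketch, and `grouped_mode` is a hypothetical helper name.

```python
def grouped_mode(lowers, h, fs):
    """Estimated mode: L + ((fm - f1) / (2*fm - f1 - f2)) * h."""
    i = fs.index(max(fs))                        # first group with the highest frequency
    fm = fs[i]
    f1 = fs[i - 1] if i > 0 else 0               # group before the modal group
    f2 = fs[i + 1] if i < len(fs) - 1 else 0     # group after the modal group
    return lowers[i] + ((fm - f1) / (2 * fm - f1 - f2)) * h


mode_lowers = [85.5, 90.5, 95.5, 100.5, 105.5, 110.5]
mode_freqs = [6, 4, 10, 6, 3, 1]
est_mode = grouped_mode(mode_lowers, 5, mode_freqs)
print(est_mode)  # 95.5 + (6 / 10) * 5
```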
How to calculate the Quartile Q1
Groups         X     F    CF
85.5-90.5      88    6    6
90.5-95.5      93    4    10    (contains the 7.5th observation)
95.5-100.5     98    10   20
100.5-105.5    103   6    26
105.5-110.5    108   3    29
110.5-115.5    113   1    30
Total                30

Q1 = L + (h/f) * ((n/4) - cf)
   = 90.5 + (5/4) * (7.5 - 6)
   = 92.375

Q1 group: n/4 = 30/4 = 7.5th observation, which falls in 90.5-95.5

Quantity                                    Symbol   Value                Alt. name
Lower boundary of the Q1 group              L        90.5                 L
Group width                                 h        95.5 - 90.5 = 5      GW
Frequency of the Q1 group                   f        4                    FMG
Sum of frequencies                          n        30                   TV
Cumulative frequency before the Q1 group    cf       6                    SBM
How to calculate the Quartile Q2
Groups         X     F    CF
85.5-90.5      88    6    6
90.5-95.5      93    4    10
95.5-100.5     98    10   20    (contains the 15th observation)
100.5-105.5    103   6    26
105.5-110.5    108   3    29
110.5-115.5    113   1    30
Total                30

Q2 = L + (h/f) * (2(n/4) - cf)
   = 95.5 + (5/10) * (2 * (7.5) - 10)
   = 98

Q2 group: 2 * (n/4) = 2 * (30/4) = 15th observation, which falls in 95.5-100.5

Quantity                                    Symbol   Value                 Alt. name
Lower boundary of the Q2 group              L        95.5                  L
Group width                                 h        100.5 - 95.5 = 5      GW
Frequency of the Q2 group                   f        10                    FMG
Sum of frequencies                          n        30                    TV
Cumulative frequency before the Q2 group    cf       10                    SBM
How to calculate the Quartile Q3
Groups         X     F    CF
85.5-90.5      88    6    6
90.5-95.5      93    4    10
95.5-100.5     98    10   20
100.5-105.5    103   6    26    (contains the 22.5th observation)
105.5-110.5    108   3    29
110.5-115.5    113   1    30
Total                30

Q3 = L + (h/f) * (3(n/4) - cf)
   = 100.5 + (5/6) * (3 * (30/4) - 20)
   = 102.5833

Q3 group: 3 * (n/4) = 3 * (30/4) = 22.5th observation, which falls in 100.5-105.5

Quantity                                    Symbol   Value                  Alt. name
Lower boundary of the Q3 group              L        100.5                  L
Group width                                 h        105.5 - 100.5 = 5      GW
Frequency of the Q3 group                   f        6                      FMG
Sum of frequencies                          n        30                     TV
Cumulative frequency before the Q3 group    cf       20                     SBM
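Q1, Q2 and Q3 all use the same formula with a different target position (n/4, 2n/4, 3n/4), so a single Python function can cover the three cases. This is a sketch assuming equal-width groups; `grouped_quartile` is a hypothetical helper name.

```python
def grouped_quartile(lowers, h, fs, k):
    """Estimated k-th quartile (k = 1, 2, 3): L + (h / f) * (k*n/4 - cf)."""
    n = sum(fs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    target = k * n / 4                  # position of the quartile observation
    cf = 0                              # cumulative frequency before the group
    for lower, f in zip(lowers, fs):
        if cf + f >= target:            # the quartile falls in this group
            return lower + (h / f) * (target - cf)
        cf += f


q_lowers = [85.5, 90.5, 95.5, 100.5, 105.5, 110.5]
q_freqs = [6, 4, 10, 6, 3, 1]
q1 = grouped_quartile(q_lowers, 5, q_freqs, 1)
q2 = grouped_quartile(q_lowers, 5, q_freqs, 2)
q3 = grouped_quartile(q_lowers, 5, q_freqs, 3)
print(q1, q2, round(q3, 4))
```

Q2 equals the grouped median, which is a useful consistency check on the table.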
How to calculate the Standard
Deviation & Variance
X          (X - Mean)   (X - Mean)²
3          -1           1
6          2            4
2          -2           4
1          -3           9
7          3            9
5          1            1
∑ = 24                  ∑ = 28

Mean = 24/6
     = 4

V = 28/6
  = 4.6666667

S.D = √(28/6)
    = 2.160246899
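The table above divides by the number of values (6), which is the population form of the variance. A minimal Python sketch of that calculation follows; `population_variance` is a hypothetical helper name.

```python
import math


def population_variance(values):
    """Mean of the squared deviations from the mean (divides by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


data = [3, 6, 2, 1, 7, 5]
variance = population_variance(data)   # 28 / 6
std_dev = math.sqrt(variance)          # square root of the variance
print(round(variance, 4), round(std_dev, 4))
```

Python's standard `statistics` module provides `pvariance` and `pstdev` for the same population formulas.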
How to calculate the Standard
Deviation & Variance for group data
Group    X      F     FX       (X - Mean)   (X - Mean)²   F(X - Mean)²
30-35    32.5   12    390      -12          144           1728
35-40    37.5   18    675      -7           49            882
40-45    42.5   29    1232.5   -2           4             116
45-50    47.5   32    1520     3            9             288
50-55    52.5   16    840      8            64            1024
55-60    57.5   8     460      13           169           1352
Total           115   5117.5                439           5390

Mean = ∑FX / ∑F
Mean = 5117.5/115
     = 44.5

V = 5390/115
  = 46.86956522

S.D = √(5390/115)
    = 6.846135057
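The grouped version weights each squared deviation by its frequency. The sketch below reproduces the table in Python, using the group midpoints as X; the function name is a hypothetical helper.

```python
import math


def grouped_variance(mids, fs):
    """Return (mean, variance) with variance = sum(F * (X - mean)^2) / sum(F)."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(mids, fs)) / n
    var = sum(f * (x - mean) ** 2 for x, f in zip(mids, fs)) / n
    return mean, var


group_mids = [32.5, 37.5, 42.5, 47.5, 52.5, 57.5]
group_freqs = [12, 18, 29, 32, 16, 8]
g_mean, g_var = grouped_variance(group_mids, group_freqs)
g_sd = math.sqrt(g_var)
print(g_mean, round(g_var, 4), round(g_sd, 4))
```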
How to calculate the Correlation
• When two sets of data are strongly linked
together we say they have a High Correlation.

Correlation can take a value between -1 and 1:

• 1 is a perfect positive correlation
• 0 is no correlation (the values don't seem linked at all)
• -1 is a perfect negative correlation
How to calculate the Correlation
X       Y      X-Mean(X)   Y-Mean(Y)    (X-Mean(X))*(Y-Mean(Y))   (X-Mean(X))²   (Y-Mean(Y))²
14.2    215    -4.475      -187.4167    838.6897325               20.025625      35125.019
16.4    325    -2.275      -77.4167     176.1229925               5.175625       5993.3454
11.9    185    -6.775      -217.4167    1472.998143               45.900625      47270.021
15.2    332    -3.475      -70.4167     244.6980325               12.075625      4958.5116
18.5    406    -0.175      3.5833       -0.6270775                0.030625       12.840039
22.1    522    3.425       119.5833     409.5728025               11.730625      14300.166
19.4    412    0.725       9.5833       6.9478925                 0.525625       91.839639
25.1    614    6.425       211.5833     1359.422703               41.280625      44767.493
23.4    544    4.725       141.5833     668.9810925               22.325625      20045.831
18.1    421    -0.575      18.5833      -10.6853975               0.330625       345.33904
22.6    445    3.925       42.5833      167.1394525               15.405625      1813.3374
17.2    408    -1.475      5.5833       -8.2353675                2.175625       31.173239
224.1   4829                            5325.025                  176.9825       174754.92

Mean (X) = 224.1/12 = 18.675

Mean (Y) = 4829/12 = 402.4166667

Cor = ∑((X-Mean(X))*(Y-Mean(Y))) / √(∑(X-Mean(X))² * ∑(Y-Mean(Y))²)
    = 5325.025/√(176.9825*174754.9)
    = 5325.025/5561.345344
    = 0.957506623
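The column-by-column calculation above can be sketched in Python. The data are the X and Y columns of the table; `correlation` is a hypothetical helper name, and the result differs from the slide only in the last decimals because the slide rounds the deviations.

```python
import math


def correlation(xs, ys):
    """Sum of products of deviations over sqrt of product of squared sums."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


x_vals = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
y_vals = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]
r = correlation(x_vals, y_vals)
print(round(r, 4))  # close to the 0.9575 computed above
```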
How to calculate the Correlation
• High Positive Correlation
Exercise
Find Standard Deviation & Variance:

Group    X      F
30-35    32.5   12
35-40    37.5   18
40-45    42.5   29
45-50    47.5   32
50-55    52.5   16
55-60    57.5   8
Total           115

Find Correlation:

X      Y
19.2   165
15.4   225
18.9   175
12.2   322
10.5   496
21.1   512
10.4   472
23.1   634
28.4   164
14.1   781
20.6   905
11.2   128
Least Squares Regression
• Regression analysis is a powerful
statistical method that allows you to examine the
relationship between two or more variables of
interest.

• Imagine you have some points and want a line that best fits them.

• Try to have the line as close as possible to all points, with a similar number of points above and below the line.
Least Squares Regression
• The Line
• Our aim is to calculate the values m (slope) and b (y-
intercept) in the equation of a line :
y = mx + b
Where:
• y = how far up
• x = how far along
• m = Slope or Gradient (how steep the line is)
• b = the Y Intercept (where the line crosses the Y axis)
Least Squares Regression
• To find the line of best fit for N points, work through the five steps shown in the example that follows.
Least Squares Regression
Let's have an example to see how to do it!

Example: Sam found how many hours of sunshine vs. how many ice creams were sold at the shop from Monday to Friday:

"x" Hours of Sunshine   "y" Ice Creams Sold
2                       4
3                       5
5                       7
7                       10
9                       15
Least Squares Regression

Step 1: For each (x, y) calculate x² and xy.

Step 2: Sum x, y, x² and xy (gives us Σx, Σy, Σx² and Σxy):

x        y        x²         xy
2        4        4          8
3        5        9          15
5        7        25         35
7        10       49         70
9        15       81         135
Σx: 26   Σy: 41   Σx²: 168   Σxy: 263

Also N (number of data values) = 5


Least Squares Regression
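• Step 3: Calculate Slope m:
m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = 249 / 164
  ≈ 1.518

• Step 4: Calculate Intercept b:
b = (Σy − m Σx) / N
  = (41 − 1.518 × 26) / 5
  ≈ 0.305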


Least Squares Regression
• Step 5: Assemble the equation of a line:
y = mx + b
y = 1.518x + 0.305

Let's see how it works out:

x    y     y = 1.518x + 0.305
2    4     3.34
3    5     4.86
5    7     7.89
7    10    10.93
9    15    13.97

Sam hears the weather forecast, which says "we expect 8 hours of sun tomorrow", so he uses the above equation to estimate that he will sell

y = 1.518 × 8 + 0.305 = 12.45 Ice Creams

Sam makes fresh waffle cone mixture for 14 ice creams just in case.
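The five steps can be sketched in Python using the standard least-squares formulas for slope and intercept, m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²) and b = (Σy − m Σx) / N. The function name `least_squares_line` is a hypothetical helper for this example.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the best-fit line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)               # sum of x squared
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of x times y
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


hours = [2, 3, 5, 7, 9]        # hours of sunshine
sold = [4, 5, 7, 10, 15]       # ice creams sold
slope, intercept = least_squares_line(hours, sold)
forecast = slope * 8 + intercept   # estimate for 8 hours of sun
print(round(slope, 3), round(intercept, 3), round(forecast, 2))
```

The slope is exactly 249 / 164, which rounds to the 1.518 used in the equation above.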
Chi-Square Test
• This test only works for categorical data (data in
categories), such as Gender {Men, Women} or
color {Red, Yellow, Green, Blue} etc, but not
numerical data such as height or weight.

• The numbers must be large enough. Each entry must be 5 or more. In our example we have values such as 207, 282, etc., so we are good to go.
Chi-Square Test
• Our first step is to state our hypotheses.

• The two hypotheses are:
• Gender and preference for cats or dogs are independent.
• Gender and preference for cats or dogs are not independent.
Chi-Square Test
Add up rows and columns:

  Cat Dog  
Men 207 282 489
Women 231 242 473
  438 524 962

Calculate the "Expected Value" for each entry:

Multiply each row total by each column total and divide by the overall total:

  Cat Dog  
Men 489×438/962 489×524/962 489
Women 473×438/962 473×524/962 473
  438 524 962
Chi-Square Test
Which gives us:

  Cat Dog  
Men 222.64 266.36 489
Women 215.36 257.64 473
  438 524 962

Subtract expected from actual, square it, then divide by expected, using this formula to calculate Chi-Sq:

Χ² = Σ (O − E)² / E

where O is the actual (observed) value and E is the expected value.
Chi-Square Test
        Cat                        Dog
Men     (207-222.64)² / 222.64     (282-266.36)² / 266.36     489
Women   (231-215.36)² / 215.36     (242-257.64)² / 257.64     473
        438                        524                        962

Which is:
  Cat Dog  
Men 1.099 0.918 489
Women 1.136 0.949 473
  438 524 962

Now add up those values:

1.099 + 0.918 + 1.136 + 0.949 = 4.102

Chi-Square is 4.102
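The whole calculation, from row and column totals to the Chi-Square value, can be sketched in Python. Because the code does not round the expected values to two decimals, it gives about 4.10 rather than exactly the 4.102 shown above; `chi_square_statistic` is a hypothetical helper name.

```python
def chi_square_statistic(table):
    """Sum of (actual - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, actual in enumerate(row):
            # expected = row total * column total / overall total
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (actual - expected) ** 2 / expected
    return statistic


# Rows: Men, Women; columns: Cat, Dog.
observed = [[207, 282], [231, 242]]
chi_sq = chi_square_statistic(observed)
print(round(chi_sq, 2))
```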
Chi-Square Test
• From Chi-Square to p
• But first you will need a "Degree of Freedom" (DF)
• Calculate Degrees of Freedom
• Multiply (rows − 1) by (columns − 1)
DF = (2 − 1)(2 − 1) = 1×1 = 1

The result is:

Critical value of Χ² for DF = 1 at the 0.05 significance level = 3.84
• Done!
Chi-Square Test
• The larger the Chi-Square (Χ²) value, the more likely the variables are related.
• Here Χ² = 4.102 is greater than 3.84, so the result is thought of as being "significant", meaning we think the variables are not independent.
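For DF = 1 the p-value can be computed with the standard library alone, because a Chi-Square variable with one degree of freedom is the square of a standard normal variable, so p = erfc(√(Χ²/2)). This shortcut is an addition for illustration and is not in the slides; it holds only for DF = 1, and `chi_square_p_value_df1` is a hypothetical helper name.

```python
import math


def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a Chi-Square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


chi_square_value = 4.102          # from the worked example
critical_value = 3.84             # DF = 1, 0.05 significance level
p_value = chi_square_p_value_df1(chi_square_value)
is_significant = chi_square_value > critical_value
print(round(p_value, 3), is_significant)
```

A p-value below 0.05 and a statistic above 3.84 are two views of the same decision; for other degrees of freedom a statistics table or library is needed.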
