Data Mining: Statistical Methods

DATA MINING
LECTURE 2
Statistical Methods
Statistical Methods
• Mean
• Median
• Mode
• Mean, median, and mode are three kinds of
"averages". There are many "averages" in
statistics, but these are, I think, the three most
common, and are certainly the three you are most
likely to encounter.
Statistical Methods
• The "mean" is the "average" you're used to,
where you add up all the numbers and then
divide by the number of numbers.
• The "median" is the "middle" value in the list of
numbers. To find the median, your numbers have
to be listed in numerical order from smallest to
largest, so you may have to rewrite your list
before you can find the median.
• The "mode" is the value that occurs most often. If
no number in the list is repeated, then there is no
mode for the list.
What is the mean?
• The mean is the arithmetic average of the numbers.
Example:
3, 5, 6, 9, 8
Mean = sum of all values / total number of values
Mean = (3 + 5 + 6 + 9 + 8) / 5
Mean = 31 / 5
Mean = 6.2
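The calculation above can be reproduced with a minimal Python sketch. The variable names are illustrative and not part of the lecture.

```python
# The five example values from the slide.
values = [3, 5, 6, 9, 8]

# Mean = sum of all values / total number of values
mean = sum(values) / len(values)
print(mean)  # 6.2
```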
How to calculate the mean for data with frequencies?
Age (X)      Frequency (F)   Age * Frequency (FX)
22           5               22 * 5 = 110
33           2               33 * 2 = 66
44           6               44 * 6 = 264
66           4               66 * 4 = 264
Total (∑)    ∑F = 17         ∑FX = 704
Mean = ∑FX / ∑F
Mean = 704 / 17
Mean ≈ 41.41
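As a rough check of the frequency-weighted mean, the table can be written as two parallel lists. This is only a sketch; the names `ages` and `freqs` are assumptions made for illustration.

```python
# Ages (X) and their frequencies (F) from the table.
ages = [22, 33, 44, 66]
freqs = [5, 2, 6, 4]

sum_f = sum(freqs)                                 # ∑F = 17
sum_fx = sum(x * f for x, f in zip(ages, freqs))   # ∑FX = 704

weighted_mean = sum_fx / sum_f
print(round(weighted_mean, 2))  # 41.41
```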
What is the median?
• The median is the middle value once all values are arranged in order.
How to calculate the median for an odd number of values?
Example:
9, 8, 5, 6, 3
• Arrange the values in order:
3, 5, 6, 8, 9
Median = 6 (the third of the five values)
How to calculate the median for an even number of values?
Example:
9, 8, 5, 6, 3, 4
• Arrange the values in order:
3, 4, 5, 6, 8, 9
• Add the two middle values and calculate their mean.
Median = (5 + 6) / 2
Median = 5.5
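Both median cases can be covered by one small function. The sketch below assumes a non-empty list of numbers; the function name `median` is chosen here for illustration.

```python
def median(values):
    """Return the middle value of a non-empty list of numbers."""
    ordered = sorted(values)          # arrange values in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[mid]
    # even count: mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([9, 8, 5, 6, 3]))     # 6
print(median([9, 8, 5, 6, 3, 4]))  # 5.5
```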
What is the mode?
• The mode is the value that occurs most often.
How to calculate the mode?
Example:
3, 6, 6, 8, 9
Mode = 6 (because 6 occurs twice and every other value occurs only once).
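One possible way to find the mode in Python is to count occurrences. The sketch below follows the lecture's convention that a list with no repeated value has no mode, and returns an empty list in that case; the function name `modes` is an assumption.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:                      # no value is repeated: no mode
        return []
    return [v for v, c in counts.items() if c == top]

print(modes([3, 6, 6, 8, 9]))  # [6]
print(modes([1, 2, 3]))        # []
```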
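How to calculate the estimated mean and estimated median of grouped data?
Groups         X      F     CF
85.5-90.5      88     6     6
90.5-95.5      93     4     10
95.5-100.5     98     10    20    <- median group (contains the 15th observation)
100.5-105.5    103    6     26
105.5-110.5    108    3     29
110.5-115.5    113    1     30
Total                 30
Mean = ∑FX / ∑F
     = 2935 / 30
     ≈ 97.83
Median = L + (h/f) * ((n/2) - cf)
       = 95.5 + (5/10) * (15 - 10)
       = 98
The same formula written with the alternative symbols:
Median = L + (((TV/2) - SBM) / FMG) * GW
       = 95.5 + (((30/2) - 10) / 10) * 5
       = 98
Values used for the median group 95.5-100.5:
Position of the median                          n/2 = 30/2 = 15th observation
Lower boundary                                  L = (93 + 98)/2 = 95.5      (L)
Group width                                     h = 100.5 - 95.5 = 5        (GW)
Frequency of the median group                   f = 10                      (FMG)
Sum of frequencies                              n = 30                      (TV)
Cumulative frequency before the median group    cf = 10                     (SBM)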
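The grouped mean and median can be sketched in Python as follows. Each group is stored as a hypothetical `(lower, upper, frequency)` tuple, and the midpoint X is derived from the boundaries; these representation choices are assumptions, not part of the lecture.

```python
# (lower boundary, upper boundary, frequency) for each group
groups = [
    (85.5, 90.5, 6), (90.5, 95.5, 4), (95.5, 100.5, 10),
    (100.5, 105.5, 6), (105.5, 110.5, 3), (110.5, 115.5, 1),
]

n = sum(f for _, _, f in groups)  # ∑F = 30

# Estimated mean = ∑FX / ∑F, with X the midpoint of each group
grouped_mean = sum(f * (lo + hi) / 2 for lo, hi, f in groups) / n

def grouped_median(groups):
    """Median = L + (h/f) * ((n/2) - cf) for the group holding position n/2."""
    n = sum(f for _, _, f in groups)
    target = n / 2
    cf = 0                            # cumulative frequency before the group
    for lo, hi, f in groups:
        if cf + f >= target:
            return lo + (hi - lo) / f * (target - cf)
        cf += f

print(round(grouped_mean, 2))   # 97.83
print(grouped_median(groups))   # 98.0
```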
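How to calculate the estimated mode of the above grouped data?
Groups         X      F
85.5-90.5      88     6
90.5-95.5      93     4     <- f1 (FBMG): group before the modal group
95.5-100.5     98     10    <- fm (FMG): modal group
100.5-105.5    103    6     <- f2 (FAMG): group after the modal group
105.5-110.5    108    3
110.5-115.5    113    1
Total                 30
Mode = L + ((fm - f1) / ((2 * fm) - f1 - f2)) * h
     = 95.5 + ((10 - 4) / ((2 * 10) - 4 - 6)) * 5
     = 98.5
The same formula written with the alternative symbols:
Mode = L + ((FMG - FBMG) / ((FMG - FBMG) + (FMG - FAMG))) * GW
     = 95.5 + ((10 - 4) / ((10 - 4) + (10 - 6))) * 5
     = 98.5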
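A short Python sketch of the grouped-mode formula is given below. It assumes equal group widths and a single modal group, and it treats a missing neighbouring group as frequency 0; that last choice is an assumption made here, not something stated in the lecture.

```python
lowers = [85.5, 90.5, 95.5, 100.5, 105.5, 110.5]  # lower boundary of each group
freqs = [6, 4, 10, 6, 3, 1]
h = 5                                             # group width

i = freqs.index(max(freqs))                       # index of the modal group
fm = freqs[i]
f1 = freqs[i - 1] if i > 0 else 0                 # group before (assumed 0 if none)
f2 = freqs[i + 1] if i < len(freqs) - 1 else 0    # group after (assumed 0 if none)

# Mode = L + ((fm - f1) / (2*fm - f1 - f2)) * h
mode = lowers[i] + (fm - f1) / (2 * fm - f1 - f2) * h
print(round(mode, 2))  # 98.5
```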
How to calculate the Quartile Q1
Groups         X      F     CF
85.5-90.5      88     6     6
90.5-95.5      93     4     10    <- Q1 group (contains the 7.5th observation)
95.5-100.5     98     10    20
100.5-105.5    103    6     26
105.5-110.5    108    3     29
110.5-115.5    113    1     30
Total                 30
Q1 = L + (h/f) * ((n/4) - c)
   = 90.5 + (5/4) * (7.5 - 6)
   = 92.375
Values used for the Q1 group 90.5-95.5:
Position of Q1                              n/4 = 30/4 = 7.5th observation
Lower boundary                              L = 90.5     (L)
Frequency of the Q1 group                   f = 4        (FMG)
Cumulative frequency before the Q1 group    c = 6        (SBM)
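Quartile Q2 (the same as the median):
Groups         X      F     CF
85.5-90.5      88     6     6
90.5-95.5      93     4     10
95.5-100.5     98     10    20    <- Q2 group (contains the 15th observation)
100.5-105.5    103    6     26
105.5-110.5    108    3     29
110.5-115.5    113    1     30
Total                 30
Q2 = L + (h/f) * (2(n/4) - c)
   = 95.5 + (5/10) * (2 * (7.5) - 10)
   = 98
Values used for the Q2 group 95.5-100.5:
Position of Q2                              2 * (n/4) = 2 * (30/4) = 15th observation
Frequency of the Q2 group                   f = 10       (FMG)
Cumulative frequency before the Q2 group    c = 10       (SBM)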
Quartile Q3:
Groups         X      F     CF
85.5-90.5      88     6     6
90.5-95.5      93     4     10
95.5-100.5     98     10    20
100.5-105.5    103    6     26    <- Q3 group (contains the 22.5th observation)
105.5-110.5    108    3     29
110.5-115.5    113    1     30
Total                 30
Q3 = L + (h/f) * (3(n/4) - c)
   = 100.5 + (5/6) * (3 * (30/4) - 20)
   = 102.5833
Values used for the Q3 group 100.5-105.5:
Position of Q3                              3 * (n/4) = 3 * (30/4) = 22.5th observation
Frequency of the Q3 group                   f = 6        (FMG)
Cumulative frequency before the Q3 group    c = 20       (SBM)
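The three quartile calculations share one pattern, which the following Python sketch captures. The function name `grouped_quartile` and the `(lower, upper, frequency)` tuples are illustrative assumptions.

```python
groups = [
    (85.5, 90.5, 6), (90.5, 95.5, 4), (95.5, 100.5, 10),
    (100.5, 105.5, 6), (105.5, 110.5, 3), (110.5, 115.5, 1),
]

def grouped_quartile(groups, k):
    """Qk = L + (h/f) * (k*(n/4) - c), for k = 1, 2 or 3."""
    n = sum(f for _, _, f in groups)
    target = k * n / 4                # position of the k-th quartile
    c = 0                             # cumulative frequency before the group
    for lo, hi, f in groups:
        if c + f >= target:
            return lo + (hi - lo) / f * (target - c)
        c += f

for k in (1, 2, 3):
    print(k, round(grouped_quartile(groups, k), 4))
# 1 92.375
# 2 98.0
# 3 102.5833
```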
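How to calculate the Standard Deviation & Variance
X        (X-Mean)    (X-Mean)²
3        -1          1
6        2           4
2        -2          4
1        -3          9
7        3           9
5        1           1
∑X = 24              ∑(X-Mean)² = 28
Mean = 24/6
     = 4
V = 28/6
  = 4.6666667
S.D = √(28/6)
    = 2.160246899
• Here the sum of squared deviations is divided by N = 6, which gives the population variance and standard deviation.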
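The table above translates directly into a few lines of Python. This sketch divides by N, as the slide does; the standard library functions `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.

```python
import math

data = [3, 6, 2, 1, 7, 5]

mean = sum(data) / len(data)                           # 24 / 6 = 4
squared_devs = [(x - mean) ** 2 for x in data]         # 1, 4, 4, 9, 9, 1
variance = sum(squared_devs) / len(data)               # 28 / 6
std_dev = math.sqrt(variance)

print(round(variance, 4))  # 4.6667
print(round(std_dev, 4))   # 2.1602
```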
Standard Deviation & Variance for grouped data
Group    X       F      FX        (X-Mean)   (X-Mean)²   F(X-Mean)²
30-35    32.5    12     390       -12        144         1728
35-40    37.5    18     675       -7         49          882
40-45    42.5    29     1232.5    -2         4           116
45-50    47.5    32     1520      3          9           288
50-55    52.5    16     840       8          64          1024
55-60    57.5    8      460       13         169         1352
Total            115    5117.5               439         5390
Mean = ∑FX / ∑F
Mean = 5117.5/115
     = 44.5
V = 5390/115
  = 46.86956522
S.D = √(5390/115)
    = 6.846135057
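For grouped data each squared deviation is weighted by its frequency. A minimal Python sketch, using the group midpoints as X (the list names are illustrative):

```python
import math

midpoints = [32.5, 37.5, 42.5, 47.5, 52.5, 57.5]   # X
freqs = [12, 18, 29, 32, 16, 8]                    # F

n = sum(freqs)                                                  # ∑F = 115
mean = sum(f * x for x, f in zip(midpoints, freqs)) / n         # 5117.5 / 115

# Variance = ∑F(X - Mean)² / ∑F
variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n
std_dev = math.sqrt(variance)

print(mean)                 # 44.5
print(round(variance, 4))   # 46.8696
print(round(std_dev, 4))    # 6.8461
```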
How to calculate the Correlation
• When two sets of data are strongly linked together, we say they have a high correlation.
• Correlation can take any value from -1 to 1:
• 1 is a perfect positive correlation
• 0 is no correlation (the values don't seem linked at all)
• -1 is a perfect negative correlation
X        Y       X-Mean(X)   Y-Mean(Y)    (X-Mean(X))*(Y-Mean(Y))   (X-Mean(X))²   (Y-Mean(Y))²
14.2     215     -4.475      -187.4167    838.6897325               20.025625      35125.019
16.4     325     -2.275      -77.4167     176.1229925               5.175625       5993.3454
11.9     185     -6.775      -217.4167    1472.998143               45.900625      47270.021
15.2     332     -3.475      -70.4167     244.6980325               12.075625      4958.5116
18.5     406     -0.175      3.5833       -0.6270775                0.030625       12.840039
22.1     522     3.425       119.5833     409.5728025               11.730625      14300.166
19.4     412     0.725       9.5833       6.9478925                 0.525625       91.839639
25.1     614     6.425       211.5833     1359.422703               41.280625      44767.493
23.4     544     4.725       141.5833     668.9810925               22.325625      20045.831
18.1     421     -0.575      18.5833      -10.6853975               0.330625       345.33904
22.6     445     3.925       42.5833      167.1394525               15.405625      1813.3374
17.2     408     -1.475      5.5833       -8.2353675                2.175625       31.173239
Total:
224.1    4829                             5325.025                  176.9825       174754.92
Mean (X) = 224.1/12 = 18.675
Mean (Y) = 4829/12 = 402.4166667
Cor = ∑((X-Mean(X))*(Y-Mean(Y))) / √(∑(X-Mean(X))² * ∑(Y-Mean(Y))²)
    = 5325.025 / √(176.9825 * 174754.9)
    = 5325.025 / 5561.345344
    = 0.957506623
• High Positive Correlation
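The correlation table can be checked with a short Python sketch that builds the same three sums. The variable names (`sum_dxdy`, `sum_dx2`, `sum_dy2`) are assumptions chosen to mirror the table columns.

```python
import math

xs = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
ys = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

mean_x = sum(xs) / len(xs)      # 18.675
mean_y = sum(ys) / len(ys)      # 402.4166...

# Column sums from the table
sum_dxdy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sum_dx2 = sum((x - mean_x) ** 2 for x in xs)
sum_dy2 = sum((y - mean_y) ** 2 for y in ys)

cor = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)
print(round(cor, 4))  # 0.9575
```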
Exercise
• Find the Standard Deviation & Variance:
Group    X       F
30-35    32.5    12
35-40    37.5    18
40-45    42.5    29
45-50    47.5    32
50-55    52.5    16
55-60    57.5    8
Total            115
• Find the Correlation:
X        Y
19.2     165
15.4     225
18.9     175
12.2     322
10.5     496
21.1     512
10.4     472
23.1     634
28.4     164
14.1     781
20.6     905
11.2     128
Least Squares Regression
• Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest.
• Imagine you have some points, and want to have a line that best fits them.
• Try to have the line as close as possible to all points, and a similar number of points above and below the line.
• The Line
• Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a line:
y = mx + b
Where:
• y = how far up
• x = how far along
• m = Slope or Gradient (how steep the line is)
• b = the Y Intercept (where the line crosses the Y axis)
• To find the line of best fit for N points, let's have an example to see how to do it!
Example: Sam found how many hours of sunshine vs. how many ice creams were sold at the shop from Monday to Friday:
"x"                  "y"
Hours of Sunshine    Ice Creams Sold
2                    4
3                    5
5                    7
7                    10
9                    15
Step 1: For each (x,y) calculate x² and xy:
x      y      x²     xy
2      4      4      8
3      5      9      15
5      7      25     35
7      10     49     70
9      15     81     135
Step 2: Sum x, y, x² and xy (gives us Σx, Σy, Σx² and Σxy):
Σx: 26    Σy: 41    Σx²: 168    Σxy: 263
Also N (number of data values) = 5

• Step 3: Calculate Slope m:
m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = 249 / 164
  ≈ 1.5183
• Step 4: Calculate Intercept b:
b = (Σy − m Σx) / N
  = (41 − 1.5183 × 26) / 5
  ≈ 0.305

• Step 5: Assemble the equation of a line:
y = mx + b
y = 1.518x + 0.305
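Steps 1 to 5 can be sketched in Python using the standard least-squares formulas for slope and intercept. The variable names are illustrative; the data are Sam's five observations.

```python
xs = [2, 3, 5, 7, 9]       # hours of sunshine
ys = [4, 5, 7, 10, 15]     # ice creams sold
n = len(xs)

# Steps 1 and 2: the four sums
sum_x = sum(xs)                                # 26
sum_y = sum(ys)                                # 41
sum_x2 = sum(x * x for x in xs)                # 168
sum_xy = sum(x * y for x, y in zip(xs, ys))    # 263

# Step 3: slope m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Step 4: intercept b = (Σy − m Σx) / N
b = (sum_y - m * sum_x) / n

# Step 5: assemble the equation
print(f"y = {m:.3f}x + {b:.3f}")  # y = 1.518x + 0.305
```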
Let's see how it works out:
x      y      y = 1.518x + 0.305
2      4      3.34
3      5      4.86
5      7      7.89
7      10     10.93
9      15     13.97
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the above equation to estimate that he will sell
y = 1.518 × 8 + 0.305 = 12.45 Ice Creams
Sam makes fresh waffle cone mixture for 14 ice creams just in case.
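Using the rounded coefficients from the fitted line, the forecast and the gap between each actual and fitted value can be sketched as below. The helper name `predict` is hypothetical.

```python
M, B = 1.518, 0.305        # rounded slope and intercept from the fitted line

def predict(hours):
    """Estimated ice creams sold for a given number of sunshine hours."""
    return M * hours + B

xs = [2, 3, 5, 7, 9]
ys = [4, 5, 7, 10, 15]

# Residual = actual y minus fitted y, for each observed day
residuals = [y - predict(x) for x, y in zip(xs, ys)]

forecast = predict(8)
print(round(forecast, 2))  # 12.45
```

The residuals show how far each day's sales fall above (positive) or below (negative) the line.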
Chi-Square Test
• This test only works for categorical data (data in categories), such as Gender {Men, Women} or color {Red, Yellow, Green, Blue} etc., but not numerical data such as height or weight.
• The numbers must be large enough. Each entry must be 5 or more. In our example we have values such as 207, 282, etc., so we are good to go.
Chi-Square Test
• Our first step is to state our hypotheses.
• The two hypotheses are:
• Null hypothesis: gender and preference for cats or dogs are independent.
• Alternative hypothesis: gender and preference for cats or dogs are not independent.
Chi-Square Test
Add up rows and columns:
         Cat     Dog     Total
Men      207     282     489
Women    231     242     473
Total    438     524     962
Calculate "Expected Value" for each entry:
Multiply each row total by each column total and divide by the overall total:
         Cat            Dog            Total
Men      489×438/962    489×524/962    489
Women    473×438/962    473×524/962    473
Total    438            524            962
Chi-Square Test
Which gives us:
         Cat       Dog       Total
Men      222.64    266.36    489
Women    215.36    257.64    473
Total    438       524       962
Subtract expected from actual, square it, then divide by expected, using this formula to calculate Chi-Sq:
Chi-Sq = Σ (Actual − Expected)² / Expected
         Cat                        Dog                        Total
Men      (207-222.64)² / 222.64     (282-266.36)² / 266.36     489
Women    (231-215.36)² / 215.36     (242-257.64)² / 257.64     473
Total    438                        524                        962
Which is:
         Cat      Dog      Total
Men      1.099    0.918    489
Women    1.136    0.949    473
Total    438      524      962
Now add up those values:
1.099 + 0.918 + 1.136 + 0.949 = 4.102
Chi-Square is 4.102
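The expected values and the Chi-Square statistic can be reproduced with a short Python sketch. Because the code keeps the expected values unrounded, it gives roughly 4.104 rather than the 4.102 obtained from the two-decimal table; the variable names are illustrative.

```python
# Observed counts: rows are Men, Women; columns are Cat, Dog
observed = [[207, 282],
            [231, 242]]

row_totals = [sum(row) for row in observed]            # [489, 473]
col_totals = [sum(col) for col in zip(*observed)]      # [438, 524]
total = sum(row_totals)                                # 962

# Expected value = row total × column total / overall total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-Square = Σ (actual − expected)² / expected over all four cells
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print([[round(e, 2) for e in row] for row in expected])
# [[222.64, 266.36], [215.36, 257.64]]
print(round(chi_sq, 3))
```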
Chi-Square Test
• From Chi-Square to p
• But first you will need the "Degrees of Freedom" (DF)
• Calculate Degrees of Freedom
• Multiply (rows − 1) by (columns − 1)
DF = (2 − 1)(2 − 1) = 1×1 = 1
The result is the critical Chi-Square value for DF = 1 at the 0.05 significance level:
= 3.84
• Done!
Chi-Square Test
• The larger the Chi-Square (χ²) value, the more likely the variables are related.
• Here χ² = 4.102 > 3.84, so the result is thought of as being "significant", meaning we think the variables are not independent.
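The comparison with 3.84 can also be expressed as a p-value. The sketch below relies on a shortcut that holds only for DF = 1, where the upper-tail Chi-Square probability equals erfc(√(χ²/2)); for other degrees of freedom a statistics library or table would be needed. The variable names are assumptions.

```python
import math

chi_sq = 4.102       # Chi-Square value from the worked example
critical = 3.84      # value quoted above for DF = 1
alpha = 0.05         # significance level matching that critical value

# Valid for DF = 1 only: P(Chi-Square > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_sq / 2))

significant = chi_sq > critical
print(round(p_value, 4))  # about 0.0428, below 0.05
print(significant)        # True
```

Both checks agree: the statistic exceeds the critical value and the p-value falls below 0.05.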
