New BASIC STATISTICS (2) Slides

BASIC STATISTICS
STATISTICS
Definition:
It is a branch of applied Mathematics concerned with the collection, organisation,
characterisation, analysis and interpretation of data, which are usually in the form of numbers.
According to Moore and McCabe (1999), statistics is the science of collecting, organising and
interpreting numerical facts, which we call data.

Terms used in descriptive statistics
• Data: numerical facts, measurements and observations.
A collection of measurements or observations is called a data set.
"Data" is plural while "datum" is singular. Data can be raw or arranged in an array.
• Raw data: collected data which have not been organised numerically, i.e. a set of unorganised, unarranged and unprocessed information.
Example: 20, 50, 15, 75, 35, 45, 80, 46, 51, 32.
• Array: data organised in ascending or descending order of magnitude. From the example above:
Array: 15, 20, 32, 35, 45, 46, 50, 51, 75, 80
Terms used in descriptive statistics Cont.
• Frequency: the number of times each value, or group of values, of a variable occurs.
• Frequency table: shows at a glance the number of times each event or score/value occurs in a given data set.
• Class intervals: the groups into which data are classified. A class interval is also the symbol which defines a class, such as 1 – 5. A class interval which has no upper or no lower class limit is called an open class interval. For example, a class for weights of 78 and above is an open class interval.
• Class limits: the minimum and maximum values that a class interval may contain, i.e. its end numbers. In the class interval 1 – 5, the smaller number 1 is the lower class limit and the larger number 5 is the upper class limit.
Terms used in descriptive statistics Cont.
• Class boundary: the actual (true) limit of a class interval. Consider the class interval 60kg – 62kg: theoretically, the interval includes all measurements from 59.5kg to 62.5kg. The smaller number 59.5 is the lower class boundary while the larger number 62.5 is the upper class boundary.
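The terms above can be illustrated with a minimal Python sketch. The raw data are the ten values from the example; the list `scores` and the helper `class_boundaries` are hypothetical names introduced only for illustration, and the boundary rule assumes measurements are recorded to the nearest whole unit (1 kg).

```python
from collections import Counter

# Raw data from the example, arranged into an array (ascending order).
raw_data = [20, 50, 15, 75, 35, 45, 80, 46, 51, 32]
array = sorted(raw_data)
print(array)

# Hypothetical scores, used only to show what a frequency table holds.
scores = [3, 5, 3, 4, 5, 3, 2]
frequency_table = Counter(scores)
for value in sorted(frequency_table):
    print(value, frequency_table[value])  # value, number of times it occurs


def class_boundaries(lower_limit, upper_limit, unit=1):
    """Return (lower boundary, upper boundary) of a class interval,
    assuming measurements are recorded to the nearest `unit`."""
    half_unit = unit / 2
    return lower_limit - half_unit, upper_limit + half_unit


print(class_boundaries(60, 62))  # class interval 60kg - 62kg
```

With a different recording precision (for example 0.1 kg) the `unit` argument would change, and so would the boundaries.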
GRAPHICAL REPRESENTATION OF DATA
• Pictorial representation
• Line graphs
• Bar graphs
• Histograms
• Frequency polygons
• Pie charts
• XY graphs, which show relationships between two sets of data, e.g. scattergrams.
MEASURES OF CENTRAL TENDENCY
They indicate how far the terms in the distribution tend towards the middle, i.e. the centre.
There are three Measures of Central Tendency namely
mode
median
mean
MODE
• It is the value with the highest frequency, i.e. it appears more often than any other value.
• A distribution can have no mode, one mode or more than one mode.
• a). If a distribution has one mode, we say it is unimodal.
• b). If it has two modes, we say it is bimodal.
• c). If it has three modes, it is called a trimodal distribution.
• d). A distribution which has many modes is called multimodal.

MODE continues
ADVANTAGES
• There is no need to arrange the numbers in order of size.
• It is not affected by extreme values (outliers), unlike the range.
• It is easy to calculate.
DISADVANTAGES
• It does not use all the terms in the distribution.
• It only focuses on those which appear more frequently than others.
• It is not normally used for further statistical calculations.
E.g. find the mode of each of the following distributions

a) 82, 69, 81, 82, 74, 81
b) 1000, 101, 500, 60
c) 0, 5, 0, 5, 0, 5, 4, 5
SOLUTIONS
a) The modes are 81 and 82 (each appears twice), hence the distribution is bimodal.
b) No mode (every value appears only once).
c) The mode is 5 (it appears four times), hence the distribution is unimodal.
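The three solutions above can be checked with a short Python sketch. The function name `modes` is hypothetical, and it follows the convention used in solution (b): when no value repeats, the distribution is treated as having no mode.

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or an empty list when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once: no mode
    return sorted(v for v, count in counts.items() if count == highest)


print(modes([82, 69, 81, 82, 74, 81]))     # a) two modes: bimodal
print(modes([1000, 101, 500, 60]))         # b) no mode
print(modes([0, 5, 0, 5, 0, 5, 4, 5]))     # c) one mode: unimodal
```

The length of the returned list tells whether the distribution is unimodal, bimodal, trimodal or multimodal.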

MEDIAN
It is the term which occupies the middle position when the terms/numbers are arranged in order of size.
The terms can be arranged in ascending or descending order.
When the number of terms in the distribution is odd, the position of the median is found using the formula:
Median = ½ (n + 1)th term
where n = number of terms in the distribution, e.g. if n = 15

Median = ½ (15 + 1)th term
= 8th term
MEDIAN continues
If the number of terms in the distribution is even, the median is half (½) of the sum of the two middle terms, e.g. if n = 20
Median = (10th term + 11th term) ÷ 2
E.g. find the median of each of the following distributions
a) 7, 1, 4, 9, 7, 8, 6, 5, 6, 3
b) 77, 29, 36, 24, 82, 100, 105, 19, 60, 50
c) 99, 81, 74, 65, 50, 28, 3

SOLUTION
(a) Rank the numbers, i.e. arrange them in ascending order
1 3 4 5 6 6 7 7 8 9    n = 10 (even)
Median = ½ (5th + 6th) term
= (6 + 6) ÷ 2
Therefore Median = 6
MEDIAN continues
(b) Ranking the numbers
19 24 29 36 50 60 77 82 100 105    n = 10 (even)
Median = ½ (5th + 6th) term
= (50 + 60) ÷ 2
= 110 ÷ 2
Therefore the median = 55

MEDIAN continues
(c) Ranking or arranging in ascending order
3 28 50 65 74 81 99    n = 7 (odd)
Position of the median = ½ (7 + 1) = 8 ÷ 2 = 4th term
Median = the fourth number, which is 65
ADVANTAGES
• Easy to calculate
• Not affected by extreme values
DISADVANTAGES
- Does not use all the terms in the distribution
- It is not always used for further statistical calculations
- It can be different from all the terms in the distribution (when n is even)
- Can be tedious when there is a large data set
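The odd and even rules for the median can be sketched in Python as follows. The function name `median` is a hypothetical helper written for these slides; it ranks the terms first, then picks the ½ (n + 1)th term when n is odd or averages the two middle terms when n is even.

```python
def median(values):
    """Median of a list of numbers, following the odd/even rules."""
    ranked = sorted(values)          # arrange in ascending order
    n = len(ranked)
    middle = n // 2                  # 0-based index of the upper middle term
    if n % 2 == 1:
        return ranked[middle]        # the 1/2 (n + 1)th term
    return (ranked[middle - 1] + ranked[middle]) / 2


print(median([7, 1, 4, 9, 7, 8, 6, 5, 6, 3]))               # distribution a)
print(median([77, 29, 36, 24, 82, 100, 105, 19, 60, 50]))   # distribution b)
print(median([99, 81, 74, 65, 50, 28, 3]))                  # distribution c)
```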
MEAN
• It is also called the arithmetic mean or average.
• It is obtained by adding all the terms in the distribution and dividing the sum by the number of terms in the distribution.
Mean = sum of terms ÷ number of terms
It is represented by x̅ (x bar)
∑ is the summation sign. It means that the terms are being added:
x₁ + x₂ + ... + xn = ∑x
MEAN continues
x̅ = (x₁ + x₂ + x₃ + ... + xn) ÷ n
n = number of terms in the distribution
x̅ = ∑x ÷ n
MEAN continues
ADVANTAGES OF MEAN
- It uses all the terms in the distribution
- It is used for further statistical calculations, e.g. in finding the variance and standard deviation.
- There is no need to rank the numbers.
DISADVANTAGES
- It is affected by outliers (extreme values)
- It can be different from all the terms in the distribution
Find the mean of each of the following
a) 60, 74, 88, 36, 54, 81, 93, 96, 50, 68
b) 1000, 4000, 3600, 7200, 1112
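One way to check answers to the exercise above is a short Python sketch of x̅ = ∑x ÷ n. The function name `mean` and the variable names are hypothetical; the data are the two distributions from the exercise.

```python
def mean(values):
    """Arithmetic mean: sum of the terms divided by the number of terms."""
    return sum(values) / len(values)


set_a = [60, 74, 88, 36, 54, 81, 93, 96, 50, 68]
set_b = [1000, 4000, 3600, 7200, 1112]

print(mean(set_a))
print(mean(set_b))
```

Note that the mean of set b is not itself one of the terms, which illustrates the second disadvantage listed above.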
Quantiles
• Quantiles are point values which divide a set of observations into equal groups with a known proportion in each group. Examples are quartiles, deciles and percentiles. Only quartiles are discussed briefly here.
• Quartiles
• Quartiles are the three point values (Q1, Q2, Q3) which divide an array of data into four equal parts.
• Q1 is the first (lower) quartile of the distribution. It is the point that divides the distribution in the ratio 1:3 (25% = ¼).
• Position of the lower quartile = ¼ (N + 1)
Quartiles Cont.
• Q2 is the median of the distribution. It is the point that divides the distribution into two equal parts (50%).
• Position of the median = ½ (N + 1)
• Q3 is the upper quartile of the distribution. It is the point that divides the distribution in the ratio 3:1 (75% = ¾).
• Position of the upper quartile = ¾ (N + 1)
• N is the number of values (scores) in the distribution.
Quartiles Cont.
• Method 1
• How to compute quartiles using an odd number of scores
• Compute Q1, Q2, Q3 and the Interquartile Range of the following scores: 3, 1, 5, 9, 8, 6, 7
• The number of scores is 7 (odd)
• Arrange the scores in ascending order (from the lowest to the highest score)
Array   1    3    5    6    7    8    9
Rank    1st  2nd  3rd  4th  5th  6th  7th
             Q1        Q2        Q3
Quartiles Cont.

Method 2
Computation of Q1, Q2 and Q3 using the position formula to locate the scores
Using the example from the first method, locate the positions of Q1, Q2 and Q3
• N is the number of scores = 7 (odd)
• Position of the lower quartile (Q1) = ¼ (N + 1) = ¼ (7 + 1) = ¼ (8) = 2nd
• Position of Q2 (median) = ½ (7 + 1) = ½ (8) = 4th
• Position of the upper quartile (Q3) = ¾ (7 + 1) = ¾ (8) = 6th
Quartiles Cont.
Q1 = 2nd position = 3
Q2 = 4th position (Median) = 6
Q3 = 6th position = 8
Interquartile Range = Q3 – Q1 = 8 – 3 = 5
How to compute quartiles using an even number of scores

• Compute Q1, Q2 and Q3 of the following scores: 2, 1, 5, 9, 4, 8, 6, 7
• The number of scores is 8 (even)
There are several approaches for computing Q1, Q2 and Q3 with an even number of scores, and they can give slightly different values.
Quartiles Cont.

Method 1
• Arrange the scores in ascending order (from the lowest score to the highest score)
Array   1    2    4    5    6    7    8    9
Rank    1st  2nd  3rd  4th  5th  6th  7th  8th
                Q1        Q2        Q3
Q1 is between the 2nd and 3rd scores = (2 + 4) ÷ 2 = 6 ÷ 2 = 3
Q2 is between the 4th and 5th scores = (5 + 6) ÷ 2 = 11 ÷ 2 = 5.5
Q3 is between the 6th and 7th scores = (7 + 8) ÷ 2 = 15 ÷ 2 = 7.5
Quartiles Cont.
• Method 2
• Computation of Q1, Q2 and Q3 using the position formula to locate the scores
• Using the same even-numbered example, locate the positions of Q1, Q2 and Q3
• N is the number of scores = 8 (even)
• Position of the lower quartile (Q1) = ¼ (N + 1) = ¼ (8 + 1) = ¼ (9) = 2.25th
• Position of Q2 (median) = ½ (8 + 1) = ½ (9) = 4.5th
• Position of the upper quartile (Q3) = ¾ (8 + 1) = ¾ (9) = 6.75th
Quartiles Cont.
• Method 2
• Q1 is between the 2nd and 3rd scores
• Position 2.25th
• Q1 = 2nd + 0.25 (3rd – 2nd)
• = 2 + 0.25 (4 – 2)
• = 2 + 0.25 (2)
• = 2 + 0.5
• = 2.5
Quartiles Cont.
• Q2 is between the 4th and 5th scores
• Position 4.5th
• Q2 = 4th + 0.5 (5th – 4th)
• = 5 + 0.5 (6 – 5)
• = 5 + 0.5 (1)
• = 5 + 0.5
• = 5.5
Quartiles Cont.
• Q3 is between the 6th and 7th scores
• Position 6.75th
• Q3 = 6th + 0.75 (7th – 6th)
• = 7 + 0.75 (8 – 7)
• = 7 + 0.75 (1)
• = 7 + 0.75
• = 7.75
MEASURES OF DISPERSION / VARIABILITY / SCATTER / SPREAD
They are also called measures of spread, variability or scatter.
They indicate how much the terms in the distribution are spread or scattered from the mean/average.
Dispersion can be measured relative to the mean (starting from the mean).
Measures of dispersion include the range, the variance and the standard deviation.

RANGE
It focuses on the difference between the greatest and the smallest values in the distribution. There
are two types of range namely the ordinary range and the inclusive range.
a). Ordinary range
It is the difference between the greatest and the smallest value in the distribution.
Ordinary range= x max – x min
= Greatest value – Smallest value

RANGE continues
b. Inclusive range
=(Greatest value –Smallest value) + 1
= (x max –x min) + 1
Find the range (ordinary and inclusive) of the following distribution
a) 85, 36, 14, 91, 99, 64, 50, 28, 12, 9, 3
Ordinary range = x max- x min
= 99 – 3
= 96
RANGE continues
Inclusive range =( x max – x min ) + 1
= (99 – 3 ) + 1
= 96 + 1
= 97
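The two kinds of range can be sketched in Python as follows. The function names `ordinary_range` and `inclusive_range` are hypothetical; the data are the eleven values from the worked example.

```python
def ordinary_range(values):
    """Greatest value minus smallest value."""
    return max(values) - min(values)


def inclusive_range(values):
    """Ordinary range plus 1, so that both end values are counted."""
    return ordinary_range(values) + 1


distribution = [85, 36, 14, 91, 99, 64, 50, 28, 12, 9, 3]
print(ordinary_range(distribution))
print(inclusive_range(distribution))
```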
Advantage of range
• Easy to calculate
Disadvantages
• It uses only the two extreme values and ignores all the other terms.
• It is greatly affected by outliers.
VARIANCE
Population variance is denoted by σ²

Sample variance is denoted by S². Variance is the mean of the squared deviations from the mean.
How to find the variance
1. Find the mean of the distribution
2. Subtract the mean from each term in the distribution to get the deviations, i.e.
(x₁ - x̅), (x₂ - x̅), ..., (xn - x̅)
3. Square each deviation to get (x₁ - x̅)², (x₂ - x̅)², ..., (xn - x̅)²

VARIANCE continues

4. Add the squared deviations

(x₁ - x̅)² + (x₂ - x̅)² + ... + (xn - x̅)² = ∑(x - x̅)²
• The formula for computing the sample variance (S²):
• S² = ∑(x - x̅)² ÷ (n - 1)
• which means the sum of the squared deviations divided by n - 1.

VARIANCE continues

• The sample variance formula uses a denominator of n - 1 instead of N. This adjustment is necessary to correct for the bias in sample variability.
• The population variance formula uses a denominator of N.
• Population variance (σ²) = ∑(x - μ)² ÷ N, where μ is the population mean.
CALCULATION OF VARIANCE
X      X̅      X - X̅    (X - X̅)²
60     71     -11      121
83     71      12      144
71     71       0        0
63     71      -8       64
89     71      18      324
90     71      19      361
40     71     -31      961
72     71       1        1
∑X = 568      ∑(X - X̅) = 0      ∑(X - X̅)² = 1976
VARIANCE continues
Using the sample variance formula (n = 8):
Variance = 1976 ÷ (8 - 1)
= 1976 ÷ 7
= 282.2857143
= 282.29 (2 decimal places)
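The four steps for finding the variance can be followed in a short Python sketch using the eight scores from the calculation table. The variable names are hypothetical; the cross-check uses `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
import statistics

scores = [60, 83, 71, 63, 89, 90, 40, 72]
n = len(scores)

mean = sum(scores) / n                          # step 1: the mean
deviations = [x - mean for x in scores]         # step 2: x - mean
squared_deviations = [d ** 2 for d in deviations]   # step 3: square each
sum_of_squares = sum(squared_deviations)        # step 4: add them up

sample_variance = sum_of_squares / (n - 1)      # denominator n - 1
population_variance = sum_of_squares / n        # denominator N

print(mean, sum_of_squares)
print(round(sample_variance, 2), population_variance)
print(statistics.variance(scores))              # library cross-check
```

The population figure is shown only for comparison; it would apply if the eight scores were the whole population rather than a sample.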

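Calculation of standard deviation

The standard deviation (SD) is the square root of the variance: SD = √variance
• Population standard deviation: σ = √[∑(x - μ)² ÷ N], where μ is the population mean
• Sample standard deviation: S = √[∑(x - x̅)² ÷ (n - 1)]
Using the sample standard deviation formula

S = √(1976 ÷ 7)
SD = √282.2857143
SD = 16.80136049
= 16.80 (2 d.p.)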
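As a hedged check of the figure above, the sketch below takes the square root of the sample variance and compares it with `statistics.stdev` (sample) and `statistics.pstdev` (population) from the Python standard library. The scores are the eight values used in the variance calculation.

```python
import math
import statistics

scores = [60, 83, 71, 63, 89, 90, 40, 72]

# Sample standard deviation: square root of 1976 / (8 - 1).
sample_sd = math.sqrt(1976 / 7)
print(round(sample_sd, 2))

# Library versions: stdev divides by n - 1, pstdev divides by N.
print(statistics.stdev(scores))
print(statistics.pstdev(scores))
```

The population value is slightly smaller than the sample value because it divides by N instead of n - 1.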
INTERPRETATION OF VARIANCE AND STANDARD DEVIATION
A large value of the standard deviation/variance shows that the values are widely scattered relative to the mean: the greater the variance/standard deviation, the more widely spaced the terms are above and below the mean. The smaller the variance, the more closely packed the values are around the mean.
MEASURES OF ASSOCIATION
They focus on the relationship between variables.
Correlation: it is the degree of association between 2 or more variables or factors.
There are two types of variables:
the independent variable (x) and the dependent variable (y).
The independent variable is the variable which is manipulated by the researcher during an experiment.
The dependent variable is the factor which is influenced by the manipulation of the independent variable.
MEASURES OF ASSOCIATION continues
CORRELATION CO-EFFICIENT
It is the number which shows the size and direction of the association between variables.
r normally represents a correlation co-efficient
The maximum value of r is +1
The minimum value of r is -1
This means r lies between -1 and +1
Value of r            Interpretation
+1                    perfect positive correlation
0.9 to 0.99           very strong positive correlation
0.7 to 0.89           strong positive correlation
0.4 to 0.69           moderate positive correlation
0.2 to 0.39           weak positive correlation
0.01 to 0.19          very weak positive correlation
CORRELATION CO-EFFICIENT continues
-1                    perfect negative correlation
-0.9 to -0.99         very strong negative correlation
-0.7 to -0.89         strong negative correlation
-0.4 to -0.69         moderate negative correlation
-0.2 to -0.39         weak negative correlation
-0.01 to -0.19        very weak negative correlation
The 2 popular correlation co-efficients are Pearson’s product-moment correlation co-efficient (r) and Spearman’s rank order correlation co-efficient (rho).
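The interpretation scale above can be turned into a small lookup, sketched here in Python. The function name `describe_correlation` is hypothetical. Two assumptions go beyond the slides: the bands are treated as continuous (so a value such as 0.895 falls in the "strong" band), and r = 0 is labelled "no correlation", a case the scale does not list.

```python
def describe_correlation(r):
    """Describe a correlation co-efficient using the bands on the slides."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no correlation"      # assumption: not listed on the slides
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    elif size >= 0.2:
        strength = "weak"
    else:
        strength = "very weak"
    return f"{strength} {direction} correlation"


print(describe_correlation(0.95))
print(describe_correlation(-0.5))
print(describe_correlation(1))
```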
PEARSON’S PRODUCT-MOMENT CORRELATION CO-EFFICIENT (r)
Its main strength is that it uses the actual values of the variables.
It is calculated using the following formula:
r = [n∑xy - (∑x)(∑y)] ÷ √{[n∑x² - (∑x)²] [n∑y² - (∑y)²]}
PEARSON’S PRODUCT-MOMENT CORRELATION continues
The following table can be used to obtain the values which are to be substituted into the formula
x      y      x²      y²      xy
x₁     y₁     x₁²     y₁²     x₁y₁
x₂     y₂     x₂²     y₂²     x₂y₂
x₃     y₃     x₃²     y₃²     x₃y₃
x₄     y₄     x₄²     y₄²     x₄y₄
∑x     ∑y     ∑x²     ∑y²     ∑xy
PEARSON’S PRODUCT-MOMENT WORKED EXAMPLE
Ten Form 4 pupils at a certain school wrote two tests, one in History and the other in Mathematics, and the results are as follows:
Pupil      A   B   C   D   E   F   G   H   I   J
History    80  74  56  52  78  90  73  65  40  75
Maths      40  52  75  74  50  54  59  60  71  48
PEARSON WORKED EXAMPLE continues
(x = History mark, y = Maths mark)
x      y      x²      y²      xy
80     40     6400    1600    3200
74     52     5476    2704    3848
56     75     3136    5625    4200
52     74     2704    5476    3848
78     50     6084    2500    3900
90     54     8100    2916    4860
73     59     5329    3481    4307
65     60     4225    3600    3900
40     71     1600    5041    2840
75     48     5625    2304    3600
∑x = 683   ∑y = 583   ∑x² = 48 679   ∑y² = 35 247   ∑xy = 38 503

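Pearson worked example continues

n = 10
r = [10(38 503) - (683)(583)] ÷ √{[10(48 679) - 683²] [10(35 247) - 583²]}
PEARSON WORKED EXAMPLE continues
r = (385 030 - 398 189) ÷ √(20 301 × 12 581)
= -13 159 ÷ 15 981.45
= -0.823 391 899

PEARSON continues
Therefore r = -0.823 to 3 decimal places
There is a strong negative correlation (r lies in the 0.7 to 0.89 band) between History marks and Mathematics marks: pupils with higher History marks tended to have lower Mathematics marks.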
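The worked example can be checked with a short Python sketch. It assumes the usual computational form of Pearson's formula, r = [n∑xy - ∑x∑y] ÷ √{[n∑x² - (∑x)²][n∑y² - (∑y)²]}, which uses exactly the column totals in the table. The function name `pearson_r` is hypothetical.

```python
from math import sqrt

history = [80, 74, 56, 52, 78, 90, 73, 65, 40, 75]   # x
maths = [40, 52, 75, 74, 50, 54, 59, 60, 71, 48]     # y


def pearson_r(xs, ys):
    """Pearson's r from the column totals n, sum x, sum y, sum x^2, sum y^2, sum xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(history, maths)
print(round(r, 3))
```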
SPEARMAN’S RANK ORDER CORRELATION CO-EFFICIENT (rho)

This correlation co-efficient does not use the actual scores of the variables. It uses the rank order of the scores. The values of x and y are ranked separately, both in ascending or both in descending order. The differences (d) between corresponding ranks are squared and finally added, giving ∑d².
SPEARMAN’S RANK ORDER CORRELATION CO-EFFICIENT continues
The following table can be used
x     y     Rank x (rx)     Rank y (ry)     d = rx – ry     d²
x = Maths mark
y = Physics mark

SPEARMAN’S RANK continues
Maths (x)      50  60  75  42  92  61
Physics (y)    52  58  80  47  95  60
SPEARMAN’S RANK ORDER continues
x      y      rx     ry     rx – ry     d²
50     52     2      2      0           0
60     58     3      3      0           0
75     80     5      5      0           0
42     47     1      1      0           0
92     95     6      6      0           0
61     60     4      4      0           0
∑x = 380   ∑y = 392   ∑d² = 0
SPEARMAN’S RANK ORDER continues

rho = 1 - 6∑d² ÷ [n(n² - 1)], with n = 6
= 1 - 6(0) ÷ [6(6² - 1)]
= 1 - 0
= 1
rho = 1. There is a perfect positive correlation between Maths and Physics marks.
SPEARMAN’S RANK ORDER EXAMPLE 2
Age (x)     61  71  72  74  83  54  74  67  57  61
Mass (y)    63  61  51  58  48  75  57  60  75  61
SPEARMAN RANK continues
x      y      rx      ry      d = rx – ry     d²
61     63     3.5     8       -4.5            20.25
71     61     6       6.5     -0.5            0.25
72     51     7       2       5               25
74     58     8.5     4       4.5             20.25
83     48     10      1       9               81
54     75     1       9.5     -8.5            72.25
74     57     8.5     3       5.5             30.25
67     60     5       5       0               0
57     75     2       9.5     -7.5            56.25
61     61     3.5     6.5     -3              9
∑d² = 314.5
SPEARMAN continues
n = 10
When ranking, if a value occurs more than once, add the positions which the tied values occupy and divide by the number of tied values. For example, 75 in the above table under (y) occupies positions 9 and 10, so its rank becomes (9 + 10) ÷ 2 = 19 ÷ 2 = 9.5
SPEARMAN continues

rho = 1 - 6∑d² ÷ [n(n² - 1)]
= 1 - 6(314.5) ÷ [10(10² - 1)]
= 1 - 1887 ÷ 990
= 1 - 1.906060
= -0.906060
= -0.91
There is a very strong negative correlation between age and mass, that is, as age increases the mass tends to decrease.
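Both Spearman examples can be reproduced with the Python sketch below. It assumes the formula rho = 1 - 6∑d² ÷ [n(n² - 1)], which matches the arithmetic of the second example (6 × 314.5 ÷ 990 = 1.906060), and it gives tied values the average of the positions they occupy. The function names `average_ranks` and `spearman_rho` are hypothetical. With tied ranks this simple formula is only an approximation.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the average
    of the positions they occupy (e.g. positions 9 and 10 give 9.5)."""
    ranked = sorted(values)
    ranks = []
    for value in values:
        first_position = ranked.index(value) + 1
        tied = ranked.count(value)
        ranks.append(first_position + (tied - 1) / 2)
    return ranks


def spearman_rho(xs, ys):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2
                 for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


maths = [50, 60, 75, 42, 92, 61]
physics = [52, 58, 80, 47, 95, 60]
age = [61, 71, 72, 74, 83, 54, 74, 67, 57, 61]
mass = [63, 61, 51, 58, 48, 75, 57, 60, 75, 61]

print(spearman_rho(maths, physics))
print(round(spearman_rho(age, mass), 2))
```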
NORMAL DISTRIBUTION CURVE
It is one of the most important distributions in statistics because it mirrors/reflects the distribution of many real-life measurements such as mass, height, weight etc.
Characteristics of a normal distribution curve
1. It resembles the cross-section of a bell (it is bell-shaped)
2. It is symmetrical
3. It is asymptotic to the horizontal axis, i.e. it approaches the horizontal axis but never touches it
4. The total area under the curve is one (1)
5. It begins with a low frequency, which rises to a peak at the middle and subsides evenly towards the end.
6. The mean, mode and median are equal and coincide on the line of symmetry.
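Several of the characteristics above can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The mean of 50 and standard deviation of 10 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical normal curve: mean 50, standard deviation 10.
curve = NormalDist(mu=50, sigma=10)

# Symmetry: equal heights at equal distances either side of the mean.
left_height = curve.pdf(50 - 15)
right_height = curve.pdf(50 + 15)

# Area under the curve within 6 standard deviations of the mean,
# which is already very close to the total area of 1.
area = curve.cdf(50 + 60) - curve.cdf(50 - 60)

# Asymptotic: far from the mean the height is tiny but still above zero.
tail_height = curve.pdf(50 + 60)

print(left_height, right_height)
print(area, tail_height)
print(curve.mean, curve.median, curve.mode)
```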
