
BASIC STATISTICS

STATISTICS

Definition:

Statistics is a branch of applied mathematics concerned with the collection, organization, characterization, analysis and interpretation of data, which are usually in the form of numbers.

According to Moore and McCabe (1999), statistics is the science of collecting, organising and interpreting numerical facts, which we call data.


Terms used in descriptive statistics

- Data: numerical facts, measurements and observations. A collection of measurements or observations is called a data set. "Data" is plural; "datum" is singular. Data can be raw or arranged in an array.
- Raw data: collected data that have not been organised numerically, i.e. a set of unorganised, unarranged and unprocessed information.
  Example: 20, 50, 15, 75, 35, 45, 80, 46, 51, 32.
- Array: data arranged in ascending or descending order of magnitude.
  From the example above: 15, 20, 32, 35, 45, 46, 50, 51, 75, 80.
Terms used in descriptive statistics Cont.

- Frequency: the number of times each value, or group of values, of a variable occurs.
- Frequency table: shows at a glance the number of times each event or score/value occurs in a given data set.
- Class interval: a group into which the data are classified, written as a symbol that defines the class, such as 1 – 5. A class interval which has no upper or no lower class limit is called an open class interval. For example, a class of weights "78 and above" is an open class interval.
- Class limits: the minimum and maximum values that a class interval may contain, i.e. the end numbers of the interval. In the class interval 1 – 5, the smaller number 1 is the lower class limit and the larger number 5 is the upper class limit.
Terms used in descriptive statistics Cont.

- Class boundary: the actual (true) limit of a class interval. Consider the class interval 60 kg – 62 kg: theoretically, the interval includes all measurements from 59.5 kg to 62.5 kg. The smaller number 59.5 is the lower class boundary and the larger number 62.5 is the upper class boundary.
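The step from class limits to class boundaries can be sketched in a few lines of Python. This is only an illustration: the function name `class_boundaries` and the `unit` parameter are not from the slides, and the sketch assumes measurements are recorded to the nearest whole unit, so each boundary lies half a unit beyond its class limit.

```python
def class_boundaries(lower_limit, upper_limit, unit=1):
    """Return (lower boundary, upper boundary) of a class interval.

    Assumption: values are recorded to the nearest `unit`, so the true
    limits sit half a unit outside the stated class limits.
    """
    half_unit = unit / 2
    return lower_limit - half_unit, upper_limit + half_unit


# The 60 kg - 62 kg class from the slide
print(class_boundaries(60, 62))  # (59.5, 62.5)
```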
GRAPHICAL REPRESENTATION OF DATA

- Pictorial representation
- Line graphs
- Bar graphs
- Histograms
- Frequency polygons
- Pie charts
- XY graphs, which show relationships between two sets of data, e.g. scattergrams
MEASURES OF CENTRAL TENDENCY

They indicate how far the terms in a distribution tend towards the middle, i.e. the centre.

There are three measures of central tendency, namely:

- mode
- median
- mean
MODE

- It is the value with the highest frequency, i.e. it appears more often than the other values.
- A distribution can have no mode, one mode or more than one mode.
  a) If a distribution has one mode, it is unimodal.
  b) If it has two modes, it is bimodal.
  c) If it has three modes, it is trimodal.
  d) A distribution with many modes is multimodal.


MODE continues
ADVANTAGES

- There is no need to arrange the numbers in order of size.
- It is not affected by extreme values (outliers), unlike the range.
- It is easy to calculate.
DISADVANTAGES
- It does not use all the terms in the distribution; it focuses only on those which appear most frequently.
- It is not normally used for further statistical calculations.

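Example: find the mode of each of the following distributions.

a) 82, 69, 81, 82, 74, 81
b) 1000, 101, 500, 60
c) 0, 5, 0, 5, 0, 5, 4, 5
SOLUTIONS

a) The modes are 81 and 82 (each appears twice), so the distribution is bimodal.

b) There is no mode, because every value appears only once.

c) The mode is 5 (it appears four times), so the distribution is unimodal.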
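The three mode examples above can be checked with a small tally. The sketch below is one possible approach, not the slides' own method; the function name `modes` is illustrative, and it assumes the convention used in solution (b), where a distribution in which every value occurs once has no mode.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes; an empty list means "no mode"."""
    if not values:
        return []
    counts = Counter(values)              # frequency of each value
    highest = max(counts.values())
    if highest == 1:
        return []                         # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([82, 69, 81, 82, 74, 81]))    # [81, 82]  -> bimodal
print(modes([1000, 101, 500, 60]))        # []        -> no mode
print(modes([0, 5, 0, 5, 0, 5, 4, 5]))    # [5]       -> unimodal
```

The length of the returned list says whether the distribution is unimodal, bimodal, trimodal or multimodal.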


MEDIAN
It is the term which occupies the middle position when the terms are arranged in order of size.
The terms can be arranged in ascending or descending order.
When the number of terms in the distribution is odd, the median is found using the formula:
Median = ½(n + 1)th term

where n = number of terms in the distribution, e.g. if n = 15:

Median = ½(15 + 1)th term
= 8th term
MEDIAN continues

If the number of terms in the distribution is even, the median is half of the sum of the two middle terms, e.g. if n = 20:

Median = (10th term + 11th term) ÷ 2

E.g. find the median of each of the following distributions

a) 7, 1, 4, 9, 7, 8, 6, 5, 6,3

b) 77, 29, 36, 24, 82, 100, 105, 19, 60, 50

c) 99, 81, 74, 65, 50, 28, 3


SOLUTION

(a) Rank the numbers, i.e. arrange them in ascending order:

1 3 4 5 6 6 7 7 8 9    n = 10

n is even, so Median = ½(5th + 6th) term

= (6 + 6) ÷ 2

Therefore the median = 6
MEDIAN continues
(b) Ranking the numbers:
19 24 29 36 50 60 77 82 100 105    n = 10

Median = ½(5th + 6th) term

= (50 + 60) ÷ 2

= 110 ÷ 2

Therefore the median = 55


MEDIAN continues
(c) Ranking, i.e. arranging in ascending order:
3 28 50 65 74 81 99    n = 7
Position of the median = (7 + 1) ÷ 2 = 8 ÷ 2 = 4th
Median = the fourth number, which is 65
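The odd and even rules for the median can be combined in one short function. This is a minimal sketch with an illustrative name (`median`); it ranks the data first and then applies whichever rule matches the number of terms.

```python
def median(values):
    """Median of a list of numbers, using the odd/even position rules."""
    ordered = sorted(values)      # rank the terms in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # odd n: the (n + 1)/2-th term, which is index n // 2 when counting from 0
        return ordered[middle]
    # even n: half of the sum of the two middle terms
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([7, 1, 4, 9, 7, 8, 6, 5, 6, 3]))                 # 6.0
print(median([77, 29, 36, 24, 82, 100, 105, 19, 60, 50]))     # 55.0
print(median([99, 81, 74, 65, 50, 28, 3]))                    # 65
```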
ADVANTAGES
- Easy to calculate
- Not affected by extreme values
DISADVANTAGES
- Does not use all the terms in the distribution
- It is not always used for further statistical calculations
- It may not be one of the terms in the distribution (for example, when n is even)
- Ranking can be tedious when the data set is large
MEAN

- It is also called the arithmetic mean or average.
- It is obtained by adding all the terms in the distribution and dividing the sum by the number of terms in the distribution.

Mean = sum of terms ÷ number of terms

It is represented by x̅ (x bar).

∑ is the summation sign. It means that terms are being added:

x₁ + x₂ + ... + xₙ = ∑x
MEAN continues

x̅ = ∑x ÷ n = (x₁ + x₂ + x₃ + ... + xₙ) ÷ n

where n = number of terms in the distribution.
MEAN continues

ADVANTAGES OF MEAN

- It uses all the terms in the distribution.
- It is used for further statistical calculations, e.g. in finding the variance and standard deviation.
- There is no need to rank the numbers.
DISADVANTAGES

- It is affected by outliers (extreme values).
- It can be different from all the terms in the distribution.
Find the mean of each of the following:
a) 60, 74, 88, 36, 54, 81, 93, 96, 50, 68
b) 1000, 4000, 3600, 7200, 1112
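One way to check answers to this exercise is to apply the definition directly: add the terms and divide by how many there are. The function name `mean` below is illustrative only.

```python
def mean(values):
    """Arithmetic mean: the sum of the terms divided by the number of terms."""
    return sum(values) / len(values)


a = [60, 74, 88, 36, 54, 81, 93, 96, 50, 68]
b = [1000, 4000, 3600, 7200, 1112]

print(mean(a))  # 70.0    (700 / 10)
print(mean(b))  # 3382.4  (16912 / 5)
```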
Quantiles
- Quantiles are point values which divide a set of observations into equal groups with a known proportion in each group. Examples are quartiles, deciles and percentiles. Only quartiles are discussed briefly here.

Quartiles
- Quartiles are the three point values (Q1, Q2, Q3) which divide an array of data into four equal parts.
- Q1 is the first (lower) quartile of the distribution. It is the point that divides the distribution in the ratio 1:3 (25% = ¼).
- Position of the lower quartile = ¼(N + 1)
Quartiles Cont.

- Q2 is the median of the distribution. It is the point that divides the distribution into two equal parts (50%).
- Position of the median = ½(N + 1)
- Q3 is the upper quartile of the distribution. It is the point that divides the distribution in the ratio 3:1 (75% = ¾).
- Position of the upper quartile = ¾(N + 1)
- N is the number of values (scores) in the distribution.
Quartiles Cont.
Method 1
How to compute quartiles using an odd number of scores
Compute Q1, Q2, Q3 and the interquartile range of the following scores: 3, 1, 5, 9, 8, 6, 7
- The number of scores is 7 (odd).
- Arrange the scores in ascending order (from the lowest to the highest score).

Array       1     3     5     6     7     8     9
Rank        1st   2nd   3rd   4th   5th   6th   7th
Quartile          Q1          Q2          Q3
Quartiles Cont.
Method 2
Computation of Q1, Q2 and Q3 using the position formula to locate the scores
Using the example from the first method, locate the positions of Q1, Q2 and Q3.
- N is the number of scores = 7 (odd)
- Position of the lower quartile (Q1) = ¼(N + 1) = ¼(7 + 1) = ¼(8) = 2nd
- Position of Q2 (median) = ½(N + 1) = ½(7 + 1) = 4th
- Position of the upper quartile (Q3) = ¾(N + 1) = ¾(7 + 1) = 6th
Quartiles Cont.
Q1 = 2nd position = 3
Q2 = 4th position(Median)= 6
Q3 = 6th position = 8
Interquartile Range= Q3 – Q1 = 8 – 3 =5

How to compute quartiles using an even number of scores

- Compute Q1, Q2 and Q3 of the following scores: 2, 1, 5, 9, 4, 8, 6, 7
- The number of scores is 8 (even).
There are several approaches to computing Q1, Q2 and Q3 for an even number of scores.
Quartiles Cont.
Method 1
- Arrange the scores in ascending order (from the lowest score to the highest score).

Array       1     2     4     5     6     7     8     9
Rank        1st   2nd   3rd   4th   5th   6th   7th   8th
Quartile             Q1          Q2          Q3

Q1 lies between the 2nd and 3rd scores = (2 + 4) ÷ 2 = 6 ÷ 2 = 3
Q2 lies between the 4th and 5th scores = (5 + 6) ÷ 2 = 11 ÷ 2 = 5.5
Q3 lies between the 6th and 7th scores = (7 + 8) ÷ 2 = 15 ÷ 2 = 7.5
Quartiles Cont.
Method 2
Computation of Q1, Q2 and Q3 using the position formula to locate the scores
Using the same eight scores as in Method 1 (1 2 4 5 6 7 8 9), locate the positions of Q1, Q2 and Q3.

- N is the number of scores = 8 (even)
- Position of the lower quartile (Q1) = ¼(N + 1) = ¼(8 + 1) = ¼(9) = 2.25th
- Position of Q2 (median) = ½(N + 1) = ½(8 + 1) = 4.5th
- Position of the upper quartile (Q3) = ¾(N + 1) = ¾(8 + 1) = 6.75th
Quartiles Cont.

Method 2
Q1 lies between the 2nd and 3rd scores (position 2.25th):
Q1 = 2nd + 0.25(3rd – 2nd)
= 2 + 0.25(4 – 2)
= 2 + 0.25(2)
= 2 + 0.5
= 2.5
Quartiles Cont.
Q2 lies between the 4th and 5th scores (position 4.5th):
Q2 = 4th + 0.5(5th – 4th)
= 5 + 0.5(6 – 5)
= 5 + 0.5(1)
= 5 + 0.5
= 5.5
Quartiles Cont.
Q3 lies between the 6th and 7th scores (position 6.75th):
Q3 = 6th + 0.75(7th – 6th)
= 7 + 0.75(8 – 7)
= 7 + 0.75(1)
= 7 + 0.75
= 7.75
MEASURES OF DISPERSION / VARIABILITY / SCATTER / SPREAD

They are also called measures of spread, variability or scatter.

They indicate how much the terms in the distribution are spread or scattered about the mean (average).

Dispersion can be measured relative to the mean, i.e. starting from the mean.

Measures of dispersion include the range, the variance and the standard deviation.


RANGE

It focuses on the difference between the greatest and the smallest values in the distribution. There

are two types of range namely the ordinary range and the inclusive range.

a). Ordinary range

It is the difference between the greatest and the smallest value in the distribution.

Ordinary range= x max – x min

= Greatest value – Smallest value


RANGE continues
b. Inclusive range
=(Greatest value –Smallest value) + 1
= (x max –x min) + 1
Find the range (ordinary and inclusive) of the following distribution
a) 85, 36, 14, 91, 99, 64, 50, 28, 12, 9, 3
Ordinary range = x max- x min
= 99 – 3
= 96
RANGE continues
Inclusive range =( x max – x min ) + 1
= (99 – 3 ) + 1
= 96 + 1
= 97
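Both kinds of range follow directly from the largest and smallest values, as this short sketch shows. The function names are illustrative.

```python
def ordinary_range(values):
    """Greatest value minus smallest value."""
    return max(values) - min(values)


def inclusive_range(values):
    """Ordinary range plus 1."""
    return ordinary_range(values) + 1


data = [85, 36, 14, 91, 99, 64, 50, 28, 12, 9, 3]
print(ordinary_range(data))   # 96
print(inclusive_range(data))  # 97
```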
Advantage of range
- Easy to calculate

Disadvantages
- It uses only the two extreme values and ignores all the other terms.
- It is greatly affected by outliers.
VARIANCE

Population variance is denoted by σ².

Sample variance is denoted by S². Variance is the mean of the squared deviations from the mean.
How to find the variance
1. Find the mean of the distribution.

2. Subtract the mean from each term in the distribution to get the deviations, i.e. (x₁ – x̅), (x₂ – x̅), ..., (xₙ – x̅).

3. Square each deviation to get (x₁ – x̅)², (x₂ – x̅)², ..., (xₙ – x̅)².

VARIANCE continues

4. Add the squared deviations:

(x₁ – x̅)² + (x₂ – x̅)² + ... + (xₙ – x̅)² = ∑(x – x̅)²

5. Divide the sum of the squared deviations by n – 1 to get the sample variance:

Sample variance: S² = ∑(x – x̅)² ÷ (n – 1)

VARIANCE continues

- The sample variance formula uses a denominator of n – 1 instead of N. This adjustment is necessary to correct for the bias in sample variability.
- The population variance formula uses a denominator of N.
- Population variance: σ² = ∑(x – x̅)² ÷ N
CALCULATION OF VARIANCE

X     X̅     X – X̅    (X – X̅)²
60    71    -11      121
83    71     12      144
71    71      0        0
63    71     -8       64
89    71     18      324
90    71     19      361
40    71    -31      961
72    71      1        1

∑(X – X̅)² = 1976
VARIANCE continues
Using the sample variance formula:

S² = 1976 ÷ (8 – 1)

= 1976 ÷ 7

= 282,2857143

= 282,29 (2 decimal places)


Calculation of standard deviation

The formula is: SD = √variance
- Population standard deviation: σ = √[∑(x – x̅)² ÷ N]
- Sample standard deviation: S = √[∑(x – x̅)² ÷ (n – 1)]

Using the sample standard deviation formula:

S = √282,2857143

SD = 16,80136049

= 16,80 (2 d.p.)
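The variance and standard deviation steps can be followed in code using the eight scores from the calculation table. This is a sketch with illustrative variable names; the final line cross-checks the result against `statistics.stdev` from Python's standard library, which also uses the n – 1 denominator.

```python
import statistics

scores = [60, 83, 71, 63, 89, 90, 40, 72]
n = len(scores)

mean = sum(scores) / n                               # step 1: the mean
deviations = [x - mean for x in scores]              # step 2: x minus the mean
squared_deviations = [d ** 2 for d in deviations]    # step 3: square each deviation
sum_of_squares = sum(squared_deviations)             # step 4: add them up

sample_variance = sum_of_squares / (n - 1)           # denominator n - 1
population_variance = sum_of_squares / n             # denominator N
sample_sd = sample_variance ** 0.5                   # square root of the variance

print(mean, sum_of_squares)                  # 71.0 1976.0
print(round(sample_variance, 2))             # 282.29
print(round(sample_sd, 2))                   # 16.8
print(round(statistics.stdev(scores), 2))    # 16.8
```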
INTERPRETATION OF VARIANCE AND
STANDARD DEVIATION
A large value of the standard deviation or variance shows that the values are widely scattered relative to the mean: the greater the variance or standard deviation, the more widely spaced the terms are above and below the mean. The smaller the variance, the more closely packed the values are around the mean.
MEASURES OF ASSOCIATION
Measures of association focus on the relationship between variables.
Correlation is the degree of association between two or more variables or factors.
There are two types of variables:
1) the independent variable (x)
2) the dependent variable (y)
The independent variable is the variable which is manipulated by the researcher during an experiment.
The dependent variable is the factor which is influenced by the manipulation of the independent variable.
MEASURES OF ASSOCIATION continues
CORRELATION CO-EFFICIENT

It is a number which shows the size and direction of the association between variables.
r normally represents a correlation co-efficient.
The maximum value of r is +1.
The minimum value of r is –1.
This means r lies between –1 and +1.
+1               perfect positive correlation
0,9 to 0,99      very strong positive correlation
0,7 to 0,89      strong positive correlation
0,4 to 0,69      moderate positive correlation
0,2 to 0,39      weak positive correlation
0,1 to 0,19      very weak positive correlation
CORRELATION CO-EFFICIENT continues

–1                 perfect negative correlation
–0,9 to –0,99      very strong negative correlation
–0,7 to –0,89      strong negative correlation
–0,4 to –0,69      moderate negative correlation
–0,2 to –0,39      weak negative correlation
–0,01 to –0,19     very weak negative correlation
The two most popular correlation co-efficients are Pearson’s product-moment correlation co-efficient (r) and Spearman’s rank order correlation co-efficient (rho).
PEARSON’S PRODUCT CORRELATION CO-EFFICIENT (r)
Its main strength is that it uses the actual values of the variables.

It is calculated using the following formula:

r = [n∑xy – (∑x)(∑y)] ÷ √{[n∑x² – (∑x)²][n∑y² – (∑y)²]}

where n is the number of pairs of values.

PEARSON’S PRODUCT CORRELATION continues
The following table can be used to obtain the values which are to be substituted into the formula:

x     y     x²     y²     xy
x₁    y₁    x₁²    y₁²    x₁y₁
x₂    y₂    x₂²    y₂²    x₂y₂
x₃    y₃    x₃²    y₃²    x₃y₃
x₄    y₄    x₄²    y₄²    x₄y₄
∑x    ∑y    ∑x²    ∑y²    ∑xy
PEARSON’S PRODUCT WORKED EXAMPLE

Ten Form 4 pupils at a certain school wrote two tests, one in History and the other in Mathematics. The results are as follows:

Pupil         A    B    C    D    E    F    G    H    I    J
History (x)   80   74   56   52   78   90   73   65   40   75
Maths (y)     40   52   75   74   50   54   59   60   71   48
PEARSON WORKED EXAMPLE continues
x     y     x²      y²      xy
80    40    6400    1600    3200
74    52    5476    2704    3848
56    75    3136    5625    4200
52    74    2704    5476    3848
78    50    6084    2500    3900
90    54    8100    2916    4860
73    59    5329    3481    4307
65    60    4225    3600    3900
40    71    1600    5041    2840
75    48    5625    2304    3600

∑x = 683    ∑y = 583    ∑x² = 48 679    ∑y² = 35 247    ∑xy = 38 503


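Pearson worked example continues

n = 10

r = [10(38 503) – (683)(583)] ÷ √{[10(48 679) – 683²][10(35 247) – 583²]}
PEARSON WORKED EXAMPLE continues

= (385 030 – 398 189) ÷ √[(486 790 – 466 489)(352 470 – 339 889)]

= –13 159 ÷ √(20 301 × 12 581)

= –0,823 391 899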


PEARSON continues

Therefore r = –0,823 to 3 decimal places.

Since r lies between –0,7 and –0,89, there is a strong negative correlation between History marks and Mathematics marks.
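The worked example can be reproduced with a short function built from the same column totals as the table (∑x, ∑y, ∑x², ∑y², ∑xy). The sketch below uses the standard raw-sums form of Pearson's r, which agrees with the value of r given in the worked example; the function name `pearson_r` is illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from the column totals of the working table."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


history = [80, 74, 56, 52, 78, 90, 73, 65, 40, 75]
maths = [40, 52, 75, 74, 50, 54, 59, 60, 71, 48]

r = pearson_r(history, maths)
print(round(r, 3))  # -0.823
```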
SPEARMAN’S RANK ORDER CORRELATION CO-EFFICIENT (rho)

This correlation co-efficient does not use the actual scores of the variables. It uses the rank order of the scores. The values of x and y are ranked separately, both in ascending order or both in descending order. For each pair, the corresponding ranks are subtracted to give the difference d; the differences are squared and finally added, giving ∑d².

rho = 1 – 6∑d² ÷ [n(n² – 1)]

where n is the number of pairs of scores.
SPEARMAN’S RANK ORDER CORRELATION CO-EFFICIENT continues

The following table can be used:

x    y    Rank x (rx)    Rank y (ry)    d = rx – ry    d²

In the example below, x is the Maths mark and y is the Physics mark.


SPEARMAN’S RANK continues
Maths (x)     50   60   75   42   92   61
Physics (y)   52   58   80   47   95   60
SPEARMAN’S RANK ORDER continues

x     y     rx    ry    d = rx – ry    d²
50    52    2     2     0              0
60    58    3     3     0              0
75    80    5     5     0              0
42    47    1     1     0              0
92    95    6     6     0              0
61    60    4     4     0              0

∑x = 380    ∑y = 392    ∑d² = 0
SPEARMAN’S RANK ORDER continues

n = 6

rho = 1 – 6∑d² ÷ [n(n² – 1)]

= 1 – 6(0) ÷ [6(6² – 1)]

= 1 – 0
= 1

rho = 1. There is a perfect positive correlation between Maths and Physics marks.
SPEARMAN’S RANK ORDER EXAMPLE 2
Age (x)    61   71   72   74   83   54   74   67   57   61
Mass (y)   63   61   51   58   48   75   57   60   75   61
SPEARMAN RANK continues
x     y     rx     ry     d = rx – ry    d²
61    63    3,5    8      -4,5           20,25
71    61    6      6,5    -0,5           0,25
72    51    7      2      5              25
74    58    8,5    4      4,5            20,25
83    48    10     1      9              81
54    75    1      9,5    -8,5           72,25
74    57    8,5    3      5,5            30,25
67    60    5      5      0              0
57    75    2      9,5    -7,5           56,25
61    61    3,5    6,5    -3             9

∑d² = 314,5
SPEARMAN continues
n = 10
When ranking, if a value occurs more than once, add the positions which the tied values occupy and divide by the number of tied values. For example, 75 appears twice under (y) in the table above and occupies positions 9 and 10, so each 75 is given the rank (9 + 10) ÷ 2 = 19 ÷ 2 = 9,5.
SPEARMAN continues

rho = 1 – 6∑d² ÷ [n(n² – 1)]

= 1 – 6(314,5) ÷ [10(10² – 1)]

= 1 – 1887 ÷ 990

= 1 – 1,906060

= –0,906060

= –0,91

There is a very strong negative correlation between age and mass, that is, as age increases, mass decreases.
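The ranking rule for tied values and the ∑d² calculation can be sketched together in Python. The function names `average_ranks` and `spearman_rho` are illustrative. Like the worked example, the sketch applies the ∑d² formula directly even when some ranks are tied, which should be read as a simplification rather than the only way to handle ties.

```python
def average_ranks(values):
    """Rank in ascending order; tied values share the mean of their positions."""
    ordered = sorted(values)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1        # first position occupied (1-based)
        count = ordered.count(value)            # number of tied values
        rank_of[value] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


def spearman_rho(xs, ys):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank differences."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


age = [61, 71, 72, 74, 83, 54, 74, 67, 57, 61]
mass = [63, 61, 51, 58, 48, 75, 57, 60, 75, 61]
print(round(spearman_rho(age, mass), 2))   # -0.91

maths = [50, 60, 75, 42, 92, 61]
physics = [52, 58, 80, 47, 95, 60]
print(spearman_rho(maths, physics))        # 1.0
```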
NORMAL DISTRIBUTION CURVE

It is one of the most important distributions in statistics because it mirrors the distribution of many real-life measurements such as mass, height and weight.
Characteristics of a normal distribution curve
1. It resembles the cross-section of a bell.
2. It is symmetrical.
3. It is asymptotic to the horizontal axis, i.e. it approaches the horizontal axis but never touches it.
4. The total area under the curve is one (1).
5. It begins with a low frequency which rises towards the middle and evenly subsides towards the end.
6. The mean, mode and median are equal and they coincide on the line of symmetry.
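Several of these characteristics can be checked numerically with `statistics.NormalDist` from Python's standard library. The mean of 170 and standard deviation of 10 below are hypothetical values chosen only for illustration; they do not come from the slides.

```python
from statistics import NormalDist

# Hypothetical curve: mean 170, standard deviation 10
curve = NormalDist(mu=170, sigma=10)

# Symmetry: equal heights at equal distances either side of the mean
left_height = curve.pdf(170 - 15)
right_height = curve.pdf(170 + 15)

# Asymptotic: far from the mean the curve is tiny but still above zero
tail_height = curve.pdf(170 + 60)

# Area: almost all of the total area of 1 lies within 6 standard deviations
area = curve.cdf(170 + 60) - curve.cdf(170 - 60)

print(curve.mean, curve.median, curve.mode)   # all three coincide
print(abs(left_height - right_height) < 1e-12)
print(tail_height > 0, round(area, 6))
```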
