
GTU 302 BIOSTATISTICS

DESCRIPTIVE STATISTICS

DR NORAINI ABDUL GHAFAR


norainiag@usm.my
Outline
Descriptive statistics
Descriptive statistics for qualitative data
Descriptive statistics for quantitative data

Measuring central tendency


Mean
Mode
Median
Measuring dispersion (variability)
Range
Interquartile range
Variance
Standard deviation
Normal distribution
Criteria for determining normal distribution
Presenting qualitative & quantitative data
Statistics
- Descriptive statistics
- Inferential statistics
What is Descriptive Statistics?
= A branch of statistics that summarizes a
collection of data and presents it in a way that can
be easily and clearly understood
Two basic methods to summarize data:
Two Basic Methods in Descriptive Statistics
Numerical Method
- Calculate the mean and SD
- Precise and objective
Graphical Method
- Create a frequency table, a bar chart, a line graph or a box plot
- Shows distribution patterns in the data
What is Descriptive Statistics?
- As compared to Inferential Statistics
Descriptive Statistics involve:
- Collection of data
- Organization of data
- Enumeration of frequency of characteristics
- Summarization and presentation of data
- Describing the characteristics of OUR OBSERVED data
- Numerical method
- Graphical method
Inferential Statistics involve:
- Hypothesis testing
- Confidence Interval
- Allows researchers to INFER (GENERALIZE) the characteristics of the sample (statistics) to the population (parameter)
Why use Descriptive Statistics?
Descriptive statistics are used to present quantitative descriptions in a
manageable form.

In research, we may have lots of measures (data) or we may measure
a large number of people on any measure.
Use descriptive statistics to SHOW A LARGE AMOUNT OF DATA IN A
SENSIBLE WAY.

wanamir@kb.usm.my 6
Why is it so important?
First stage of statistical analysis
Identifying outliers and keying errors
Checking for data symmetry, normality
Presenting the data
Descriptive Statistics for Qualitative Data

VARIABLE
- QUALITATIVE / CATEGORICAL (NOMINAL, ORDINAL): summarized by Frequency (%)
- QUANTITATIVE / NUMERICAL (DISCRETE, CONTINUOUS): summarized by a measure of centrality and a measure of dispersion, i.e. Mean (SD) or Median (IQR)
Descriptive Statistics for Qualitative Data:
Frequency Distribution
= A table (at the minimum) that displays how many times each score occurs in a
data set
Descriptive Statistics for Qualitative Data:
Frequency Distribution
Valid percent
= The percentage of respondents in a category, out of those who gave a valid
response to the question.
- If there are no missing cases, valid percent column = percent column

Cumulative percent
= The running total of the valid percentages from the first category to the
last valid category.
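
The three columns can be sketched in Python. The blood-group answers below are hypothetical, invented only for illustration, and `None` is assumed to mark a missing response.

```python
from collections import Counter

# Hypothetical survey answers; None marks a missing (invalid) response.
responses = ["A", "O", "B", "A", None, "O", "O", "AB", None, "A"]

valid = [r for r in responses if r is not None]
counts = Counter(valid)

rows = []
cumulative = 0.0
for category in sorted(counts):
    freq = counts[category]
    percent = 100 * freq / len(responses)    # out of all cases, missing included
    valid_percent = 100 * freq / len(valid)  # out of valid cases only
    cumulative += valid_percent              # running total of valid percent
    rows.append((category, freq, percent, valid_percent, cumulative))

print("Cat  Freq  Percent  Valid %  Cum. %")
for category, freq, percent, valid_percent, cum in rows:
    print(f"{category:<5}{freq:>4}{percent:>9.1f}{valid_percent:>9.1f}{cum:>8.1f}")
```

With two missing cases out of ten, the percent and valid percent columns differ; with no missing cases they would be identical.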

Descriptive Statistics for Quantitative Data:
measures of centrality/central tendency
1) Measures of centrality (central tendency)
- Mode
- Median
- Mean
Mode
The most frequently occurring value in a given dataset

120 114 116 117 114 121 124 114 (unimodal)
114 appears 3 times; every other measurement appears only once
The mode is 114 since it is the most frequently occurring value

117 120 114 116 115 114 121 117 124 (bimodal)
Both 114 and 117 appear twice, so there are two modes
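
As a quick check, the two data sets above can be passed to `multimode` from Python's standard `statistics` module (Python 3.8 or later), which returns every value tied for most frequent. This is only a sketch; the variable names are ours.

```python
from statistics import multimode

unimodal = [120, 114, 116, 117, 114, 121, 124, 114]
bimodal = [117, 120, 114, 116, 115, 114, 121, 117, 124]

# multimode returns a list, so one mode and two modes are handled the same way.
modes_uni = multimode(unimodal)
modes_bi = multimode(bimodal)

print("Unimodal data set, mode(s):", modes_uni)
print("Bimodal data set, mode(s):", sorted(modes_bi))
```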
Mean
Often called average
Sum of all data values in a dataset divided by the
number of data values

Mean = (sum of all data values) / (number of data values, n)

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Used for variables that are quantifiable


Mean calculation

20 18 16 22 27 11
The mean will be:

(20 + 18 + 16 + 22 + 27 + 11) / 6 = 114 / 6 = 19
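
The same calculation as a short Python sketch, done once by hand and once with the standard `statistics.mean` function (variable names are ours):

```python
from statistics import mean

values = [20, 18, 16, 22, 27, 11]

# Sum of all data values divided by the number of data values.
total = sum(values)
manual_mean = total / len(values)

print("Sum:", total)
print("Mean (manual):", manual_mean)
print("Mean (statistics.mean):", mean(values))
```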
Other examples:
Mean age of Malay teachers in a primary
school
Mean weight of primary school students
Median
Exact middle value in a distribution
Divides the data set into two exact halves

3500 3950 4200 4750 5200

The median is? 4200, the middle (third) of the five sorted values

Sort the data first (ascending or descending)
Determine the median (the middle value)
For an even number of data values?
Sort, sum the two middle numbers, and divide the
sum by 2

24 29 32 35 39 40
(32 + 35) / 2 = 33.5
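
A minimal Python sketch of both cases, using `statistics.median`, which sorts the data and averages the two middle values when the count is even (variable names are ours):

```python
from statistics import median

odd_count = [3500, 3950, 4200, 4750, 5200]
even_count = [24, 29, 32, 35, 39, 40]

# Odd count: the single middle value. Even count: mean of the two middle values.
median_odd = median(odd_count)
median_even = median(even_count)

print("Median of 5 values:", median_odd)
print("Median of 6 values:", median_even)
```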
Choosing an appropriate measure of central tendency
We have learned three measures and how they differ in
finding the centre of a data distribution
When do we use which measure?
Mode: the most frequently occurring value; useful for nominal
measurement
Median & mean: useful when the variables being
measured can be quantified
Important note:
Mean: extremely sensitive to unusual cases
Influenced by the presence of outliers

Data set 1: 108 112 116 120 124
Data set 2: 108 112 116 120 205 (205 is an outlier)
Mean of data set 1: 116; mean of data set 2: 132.2
Note the difference

The mean should not be used when unusual, or outlying, data
values are present in the data set
The median should be used instead
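
The effect can be checked in a few lines of Python. This sketch compares the mean and the median of the two data sets above; only the variable names are ours.

```python
from statistics import mean, median

data_set_1 = [108, 112, 116, 120, 124]
data_set_2 = [108, 112, 116, 120, 205]  # last value replaced by an outlier

mean_1, mean_2 = mean(data_set_1), mean(data_set_2)
median_1, median_2 = median(data_set_1), median(data_set_2)

# The outlier pulls the mean upwards, while the median stays where it was.
print("Means:  ", mean_1, mean_2)
print("Medians:", median_1, median_2)
```

The mean moves from 116 to 132.2, while the median is 116 in both data sets.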
2) Descriptive Statistics for Quantitative Data:
Measures of dispersion (variability)
Dispersion or variability refers to the spread
of the values around the central tendency.

Two common measures of dispersion:
(1) Range, and
(2) Standard Deviation (SD)
Range = The highest score minus the lowest score

Examples:
What is the range for the data set:
44, 46, 47, 52, 56, 58, 60, 63 and 65?
What is the range for the data set:
44, 46, 47, 52, 56, 58, 60, 90 and 98?
What is the range for the data set:
9, 15, 47, 52, 56, 58, 60, 63 and 65?
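
The three exercises can be worked through with a small Python sketch. The helper name `value_range` is ours, not part of any library.

```python
def value_range(values):
    """Range = the highest score minus the lowest score."""
    return max(values) - min(values)

original = [44, 46, 47, 52, 56, 58, 60, 63, 65]
high_outliers = [44, 46, 47, 52, 56, 58, 60, 90, 98]
low_outliers = [9, 15, 47, 52, 56, 58, 60, 63, 65]

ranges = [value_range(d) for d in (original, high_outliers, low_outliers)]
print(ranges)
```

Changing only two values at either end more than doubles the range (21 versus 54 and 56), which shows how strongly extreme scores affect it.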
Standard deviation
SD is a more accurate and detailed estimate of
dispersion/variability/variation because an outlier can greatly
exaggerate the range.

SD shows the relation of a set of scores to the mean of the sample.

SD takes into consideration the deviations of the individual scores
from the mean. Each individual deviation is squared so that
positive and negative deviations do not cancel each other out.
Standard Deviation (SD)

SD Formula:

$$SD = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Steps:
= Find the difference between each score and the mean
= Square each difference
= Sum all the squared differences
= Divide the sum of all the squared differences by the number
of scores (n) minus 1, and take the square root.
Manual calculation of SD:
E.g. The test scores of an Immunology Course:

Test score, x    x - mean         (x - mean)^2
23               23 - 25 = -2      4
22               22 - 25 = -3      9
26               26 - 25 = +1      1
21               21 - 25 = -4     16
30               30 - 25 = +5     25
24               24 - 25 = -1      1
20               20 - 25 = -5     25
27               27 - 25 = +2      4
25               25 - 25 =  0      0
32               32 - 25 = +7     49
Mean = 25, n = 10                 Sum = 134

$$SD = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{134}{10 - 1}} \approx 3.8586$$
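
The manual steps can be mirrored in Python as a check. This is a sketch of the n - 1 (sample) formula used in this example; the result is compared against `statistics.stdev`, which uses the same divisor.

```python
from math import sqrt
from statistics import stdev

scores = [23, 22, 26, 21, 30, 24, 20, 27, 25, 32]

n = len(scores)
mean_score = sum(scores) / n                        # mean of the scores
squared_diffs = [(x - mean_score) ** 2 for x in scores]  # squared deviations
sum_of_squares = sum(squared_diffs)                 # sum of squared deviations
sd = sqrt(sum_of_squares / (n - 1))                 # divide by n - 1, take the root

print("Mean:", mean_score)
print("Sum of squared differences:", sum_of_squares)
print("SD (manual):", round(sd, 4))
print("SD (statistics.stdev):", round(stdev(scores), 4))
```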
SD
SD shows an 'average' distance of the measurements from the
mean.
SD is especially valuable with the Normal distribution, where known proportions of the
measurements fall within 1SD, 2SD or 3SD of the mean.
E.g. If the average score is 56 in a normal distribution, with an SD of 6,
then
68.3% of scores will be 56 ± 6 (= between 50 and 62)
95.4% of scores will be 56 ± 12 (= between 44 and 68)
99.7% of scores will be 56 ± 18 (= between 38 and 74)

Or, if divided into 6 bands:

Band (scores)    % of scores
38 to 44          2.1%
44 to 50         13.6%
50 to 56         34.1%
56 to 62         34.1%
62 to 68         13.6%
68 to 74          2.1%
Thus, for a given normally distributed data set, the mean and
the standard deviation can be calculated, and from these can be
derived the probability of future measures falling into the three
bands (within 1SD, 2SD or 3SD of the mean)
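
These proportions can be reproduced with `NormalDist` from Python's standard `statistics` module. The sketch below assumes the example above (mean 56, SD 6); the helper `share_between` is our own name.

```python
from statistics import NormalDist

scores = NormalDist(mu=56, sigma=6)

def share_between(low, high):
    """Proportion of a normal distribution lying between low and high."""
    return scores.cdf(high) - scores.cdf(low)

within_1sd = share_between(50, 62)
within_2sd = share_between(44, 68)
within_3sd = share_between(38, 74)

for label, p in (("1SD", within_1sd), ("2SD", within_2sd), ("3SD", within_3sd)):
    print(f"Within {label} of the mean: {100 * p:.1f}%")

# The same function gives the share of any single band, e.g. scores from 62 to 68.
band_62_68 = share_between(62, 68)
print(f"Between 62 and 68: {100 * band_62_68:.1f}%")
```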
Variance:
Measures the amount of dispersion/variability/spread about the mean of a
sample
Denoted by s² for a sample (σ², sigma squared, for a population)
Total of squares of deviation of observations from the mean / number of
degrees of freedom

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
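
A short Python sketch of the sample variance, reusing the ten Immunology test scores from the SD example. `statistics.variance` divides by n - 1, and the standard deviation is its square root.

```python
from statistics import stdev, variance

scores = [23, 22, 26, 21, 30, 24, 20, 27, 25, 32]

s2 = variance(scores)  # sum of squared deviations / (n - 1) = 134 / 9
sd = stdev(scores)     # square root of the sample variance

print("Sample variance:", round(s2, 4))
print("Standard deviation:", round(sd, 4))
```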
Interquartile range (IQR)
What is an interquartile range (IQR) ?
The interquartile range (IQR) is the distance between
the 75th percentile and the 25th percentile. The IQR is
essentially the range of the middle 50% of the data.
Because it uses the middle 50%, the IQR is not
affected by outliers or extreme values
The IQR is also equal to the length of the box in a box
plot.

[Figure: box plot marking Min score, Q1, Q2, Q3 and Max score]
How to compute Inter-quartile Range?
Like the standard deviation, the interquartile range (IQR) is a
descriptive statistic used to summarize the extent of the
spread of your data.

The IQR is the distance between the 1st quartile (25th
percentile) and 3rd quartile (75th percentile).

Q3 - Q1 = IQR
To find these numbers, sort your data set, divide it in half,
and find the median of each half; these are your Q1 and
Q3.
How to compute Inter-quartile Range?
If you have an odd number of values, then EXCLUDE the median of the
entire set, as follows:
For example, take the following dataset:

3 5 7 8 9 21 40 90 120
We exclude the 9 as the median of the whole set. The 1st quartile is 6 (5 + 7
divided by 2) and the 3rd quartile is 65 (40 + 90 divided by 2), making the
IQR = 65 - 6 = 59.
OR

3 5 7 8 40 90 120
We exclude the 8 as the median of the whole set. The 1st quartile is 5 and
the 3rd quartile is 90 (IQR = 90 - 5 = 85).
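
The median-of-halves rule described above can be sketched as a small Python function. The name `quartiles` is our own; other software may use different quartile conventions and give slightly different answers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) by taking the median of each half of the sorted data.

    With an odd number of values, the overall median is excluded from both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(data), median(upper)

for dataset in ([3, 5, 7, 8, 9, 21, 40, 90, 120], [3, 5, 7, 8, 40, 90, 120]):
    q1, q2, q3 = quartiles(dataset)
    print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```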
Inter-quartile Range: Example

i     x[i]    Quartile
1     102
2     104
3     105     Q1
4     107
5     108
6     109     Q2 - median
7     110
8     112
9     115     Q3
10    116
11    118

For the data in this table the interquartile range is
IQR = 115 - 105 = 10
Inter-quartile Range
Median? Q1? Q3? IQR?

[Figure: box plot drawn above a number line running from 0 to 12, with two outlying points (x and o) to the left of the lower whisker]

For the data set in this box plot:
lower (first) quartile (Q1, x0.25) = 7
median (second quartile) (Median, x0.5) = 8.5
upper (third) quartile (Q3, x0.75) = 9
interquartile range, IQR = Q3 - Q1 = 9 - 7 = 2
Can also be presented in a box-and-whisker
diagram or plot

A convenient way of graphically depicting groups of
numerical data through their five-number
summaries:
- the smallest observation (sample minimum),
- lower quartile (Q1)
- median (Q2)
- upper quartile (Q3), and
- largest observation (sample maximum)

A boxplot may also indicate which observations, if
any, might be considered outliers.
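
A five-number summary can be sketched in Python with `statistics.quantiles`. The data are the 11 values from the inter-quartile range example table; for this data set the function's default "exclusive" method agrees with the Q1, median and Q3 marked in that table, although this is not guaranteed for every data set.

```python
from statistics import quantiles

data = [102, 104, 105, 107, 108, 109, 110, 112, 115, 116, 118]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)

summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1

print("Min, Q1, median, Q3, max:", summary)
print("IQR:", iqr)
```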
How to create a box-and-whisker plot:
1. Arrange your data in numerical order

2. Find the median (Q2 or Quartile 2) of your data. The median
divides the data into two halves.

3. To divide the data into quarters, you then find the medians
of these two halves.
(Note: With an odd number of values, remember to EXCLUDE the median
of your overall data before finding the two sub-medians)

4. The lower (or smaller) sub-median is called Q1; the larger (or
bigger) sub-median is called Q3.
How to create a box-and-whisker plot:
5. Draw a line and mark the positions of Q1, Q2 and Q3.
Make a box spanning the 3 points.

6. Mark the minimum point (lowest score from the data set) below
Q1 and the maximum point (highest score from the data set)
above Q3. Add whiskers from the box to these points.

[Figure: box plot marking Min score, Q1, Q2, Q3 and Max score]
Significance of IQR
Use
Unlike the (total) range, the interquartile range
is a robust statistic, having a breakdown
point of 25%, and is thus often preferred to
the total range

The IQR is used to build box plots, simple
graphical representations of a probability
distribution.
Significance of IQR
For a symmetric distribution (so the median equals
the midhinge, the average of the first and third
quartiles), half the IQR equals the median absolute
deviation (MAD).

The median is the corresponding measure of central
tendency.
NORMAL DISTRIBUTION
Distribution of the data
Why? To determine the appropriate statistical test
How? By checking a few criteria
- Normal curve
- Skewness (within ±1)
- Kurtosis (within ±1)
- Box & whisker plot
Histogram with overlaid normal curve
Normal curve = normal distribution, so use a parametric test

Box and whisker plot
Equal tails = normal distribution, so use a parametric test

Normal curve
[Figure: three curves showing a normal distribution, a positively skewed distribution and a negatively skewed distribution]
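
Skewness and kurtosis can be computed from the data directly. The sketch below uses the moment-based coefficients (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3), which is an assumption on our part since the slides do not say which formula is meant. Statistical packages often apply small-sample corrections, so their values may differ slightly. The two data sets are hypothetical, and the cut-off of 1 on the criteria slide is read here as "between -1 and +1".

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean)^k."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [2, 4, 4, 5, 5, 5, 6, 6, 8]           # hypothetical, symmetric
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 12]    # hypothetical, long right tail

for name, data in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    s, k = skewness(data), excess_kurtosis(data)
    verdict = "within" if abs(s) <= 1 else "outside"
    print(f"{name}: skewness = {s:.2f} ({verdict} +/-1), excess kurtosis = {k:.2f}")
```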
Characteristics of Distribution
Positively Skewed (right)
- Majority of data fall to the left of the mean
- Cluster at the lower end of the distribution
- Tail to the right
- Mode < Median < Mean
Characteristics of Distribution

Negatively Skewed (left)
- Majority of data fall to the right of the mean
- Cluster at the upper end of the distribution
- Tail to the left
- Mean < Median < Mode
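
The typical ordering of mode, median and mean in skewed data can be illustrated with a short Python sketch. Both data sets are hypothetical; the second is simply a mirror image of the first. The ordering is a common pattern rather than a strict rule for every data set.

```python
from statistics import mean, median, mode

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 12]   # hypothetical, tail to the right
left_skewed = [13 - x for x in right_skewed]      # mirror image, tail to the left

for name, data in (("Positively skewed", right_skewed),
                   ("Negatively skewed", left_skewed)):
    print(f"{name}: mode = {mode(data)}, median = {median(data)}, "
          f"mean = {mean(data):.1f}")
```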
Presenting the data
Descriptive Statistics: GRAPH (1): Bar Chart

A diagram showing the relation between typically two
variable quantities, each measured along one of a pair
of axes at right angles.

Bar Charts
X-axis for discrete categories
(e.g. Blood group; Gender)

Y-axis for frequencies (or
percentages)
Descriptive Statistics: GRAPH (2): Histogram
X-axis for continuous variable (e.g. Weight; Height)
Y-axis for percentages (or frequencies)
Descriptive Statistics: GRAPH (3): Line Graph

The line graph serves a similar
function to a histogram, and thus should be
used for continuous variables

The frequency of any value in a
histogram is represented by a single
column, whereas the frequency of any
value in a line graph is represented by a
single point on a line.
Presenting qualitative (categorical) data
Presented statistically in terms of
frequency or percentage (%)
In table form,
called a frequency table

Graphically in the form of a chart:
Pie Chart or Bar Chart
The data represents the education level for a sample of
seven districts in Kelantan

Frequency table (qualitative variable: education level)

Education level    Freq.      %
Primary              24      6.0
Secondary            58     14.5
Degree              230     57.5
Master               50     12.5
PhD                  38      9.5
Total               400    100.0

[Figure: bar chart and pie chart of the education-level data]
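
The percentage column can be rebuilt from the frequencies with a short Python sketch, which also prints a rough text bar chart. The counts are taken from the table above; the one-character-per-2% scale is our own choice.

```python
counts = {"Primary": 24, "Secondary": 58, "Degree": 230, "Master": 50, "PhD": 38}

total = sum(counts.values())
percentages = {level: round(100 * n / total, 1) for level, n in counts.items()}

for level, n in counts.items():
    bar = "#" * round(percentages[level] / 2)  # one '#' per 2 percentage points
    print(f"{level:<10}{n:>5}{percentages[level]:>7.1f}  {bar}")
print(f"{'Total':<10}{total:>5}{sum(percentages.values()):>7.1f}")
```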
SUMMARY
Categorical Data
- Statistics: Frequency (%)
- Graphs: Bar chart, Pie chart
Numerical Data
- Statistics: Mean (SD), Median (IQR)
- Graphs: Histogram, Box Plot
THANK YOU

NEXT LECTURE: INFERENTIAL STATISTICS
& HYPOTHESIS TESTING
(Sunday 10-11am, DKC)
