Gtu 302 Biostatistics: Descriptive Statistics

GTU 302 BIOSTATISTICS
DESCRIPTIVE STATISTICS
DR NORAINI ABDUL GHAFAR

norainiag@usm.my
Outline
Descriptive statistics
Descriptive statistics for qualitative data
Descriptive statistics for quantitative data
Measuring central tendency

Mean
Mode
Median
Measuring dispersion or variability
Range
Interquartile range
Variance
Standard deviation
Normal distribution
Criteria for determining normal distribution
Presenting qualitative & quantitative data
Statistics
- Descriptive statistics
- Inferential statistics
What is Descriptive Statistics?
= A discipline of statistics that summarizes a collection of data and presents it in a way that can be easily and clearly understood
Two basic methods to summarize data:
Two Basic Methods in Descriptive Statistics
Numerical Method
- Calculate the mean and SD
- Precise and objective
Graphical Method
- Create a frequency table, a bar chart, a line graph or a box plot
- Shows distribution patterns in the data
What is Descriptive Statistics?
- As compared to Inferential Statistics
Descriptive Statistics
Involve:
- Collection of data
- Organization of data
- Enumeration of frequency of characteristics
- Summarization and presentation of data
Describe the characteristics of OUR OBSERVED data
- Numerical method
- Graphical method
Inferential Statistics
Involve:
- Hypothesis testing
- Confidence Interval
Allows researchers to INFER (GENERALIZE) the characteristics of the sample (statistics) to the population (parameter)
Why use Descriptive Statistics?
Descriptive statistics are used to present quantitative descriptions in a
manageable form.
In research, we may have lots of measures (data) or we may measure a large number of people on any measure.
Use descriptive statistics to SHOW LARGE AMOUNT OF DATA IN A
SENSIBLE WAY.
wanamir@kb.usm.my
Why is it so important?
First stage of statistical analysis
Identifying outliers and keying errors
Checking for data symmetry and normality
Presenting the data
Descriptive Statistics for Qualitative Data
VARIABLE
QUALITATIVE / CATEGORICAL (NOMINAL, ORDINAL)
- Frequency (%)
QUANTITATIVE / NUMERICAL (DISCRETE, CONTINUOUS)
- Measure of centrality / measure of dispersion: Mean (SD) or Median (IQR)
Descriptive Statistics for Qualitative Data:
Frequency Distribution
= A table (at the minimum) that displays how many times in a
data set each score occurs
Descriptive Statistics for Qualitative Data:
Valid percent
= The percentage of those who gave a valid response to the question that
belongs to the category.
- If there are no missing cases, valid percent column = percent column
Cumulative percent
= Provides the rolling addition of
percentages from the first
category to the last valid
category.
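The three columns described above can be sketched in a few lines of Python. The blood-group responses below are hypothetical, `None` stands for a missing answer, and the cumulative column is assumed to accumulate the valid percent (the usual statistics-package convention), so treat this as an illustration and not a fixed recipe.

```python
from collections import Counter

# Hypothetical responses to a blood-group question; None marks a missing answer.
responses = ["A", "O", "B", "O", None, "AB", "O", "A", None, "B"]


def frequency_table(values):
    """Return rows of (category, frequency, percent, valid percent, cumulative percent)."""
    total = len(values)                      # all cases, including missing ones
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for category in sorted(counts):
        freq = counts[category]
        percent = 100 * freq / total             # share of all cases
        valid_percent = 100 * freq / len(valid)  # share of valid responses only
        cumulative += valid_percent              # assumption: accumulate valid percent
        rows.append((category, freq, percent, valid_percent, cumulative))
    return rows


table = frequency_table(responses)
for category, freq, pct, valid_pct, cum_pct in table:
    print(f"{category:<3} {freq:>3} {pct:6.1f} {valid_pct:6.1f} {cum_pct:6.1f}")
```

With two missing answers out of ten, the percent and valid percent columns differ; with no missing answers they would be identical.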
Descriptive Statistics for Quantitative Data:
1) Measures of centrality or central tendency
Mode
Median
Mean
Mode
The most frequently occurring value in a given dataset
120 114 116 117 114 121 124 114 (unimodal)
114 appears 3 times; every other measurement appears only once
Mode is 114 since it is the most frequently occurring value
117 120 114 116 115 114 121 117 124 (bimodal: 114 and 117 each appear twice)
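As a quick check of the two examples, Python's standard `statistics.multimode` returns every value tied for the highest frequency. The variable names are illustrative only.

```python
from statistics import multimode

unimodal = [120, 114, 116, 117, 114, 121, 124, 114]
bimodal = [117, 120, 114, 116, 115, 114, 121, 117, 124]

# multimode lists all values that share the highest count.
unimodal_modes = multimode(unimodal)
bimodal_modes = sorted(multimode(bimodal))

print(unimodal_modes)  # [114]
print(bimodal_modes)   # [114, 117]
```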
Mean
Often called the average
Sum of all data values in a dataset divided by the number of data values
\bar{x} = \frac{\sum x}{n}  (sum of all data values divided by the number of data values, n)
Used for variables that are quantifiable
Mean calculation
20 18 16 22 27 11
The mean will be:
\frac{20 + 18 + 16 + 22 + 27 + 11}{6} = \frac{114}{6} = 19
Other examples:
Mean age of Malay teachers in a primary
school
Mean weight of primary school students
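The mean calculation above can be reproduced as a small sketch, first by the definition (sum divided by count) and then with the standard-library `statistics.mean` as a cross-check.

```python
from statistics import mean

values = [20, 18, 16, 22, 27, 11]

# Definition: sum of all data values divided by the number of data values.
manual_mean = sum(values) / len(values)
library_mean = mean(values)

print(manual_mean)  # 19.0
```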
Median
Exact middle value in a distribution
Divides the data set into two exact halves
3500 3950 4200 4750 5200
The median is?
Sort the data first (ascending or descending)
Determine the median: the middle value, here 4200
For an even number of data values?
Sort, sum the two middle numbers, and divide the sum by 2
24 29 32 35 39 40
\frac{32 + 35}{2} = 33.5
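The sort-then-pick-the-middle procedure can be written out as a short sketch. `median_by_hand` is an illustrative helper name, and `statistics.median` is used only as a cross-check.

```python
from statistics import median


def median_by_hand(values):
    """Sort, then take the middle value (odd n) or average the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = [3500, 3950, 4200, 4750, 5200]
even_example = [24, 29, 32, 35, 39, 40]

print(median_by_hand(odd_example))   # 4200
print(median_by_hand(even_example))  # 33.5
```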
Choosing an appropriate measure of central tendency
We have learned 3 measures and how they differ in finding the centre of a data distribution
When do we use which measure?
Mode: the most frequently occurring value; useful for nominal measurement
Median & mean: useful when the variables being measured can be quantified
Important note:
Mean: extremely sensitive to unusual cases
Influenced by the presence of outliers
Data set 1: 108 112 116 120 124
Data set 2: 108 112 116 120 205 (205 is an outlier)
Mean of data set 1: 116; mean of data set 2: 132.2
Note the difference
The mean should not be used when unusual, or outlying, data values are present in the data set
The median should be used instead
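To see the effect of the outlier numerically, the sketch below computes the mean and the median of both data sets. The single value 205 moves the mean but leaves the median unchanged.

```python
from statistics import mean, median

data_set_1 = [108, 112, 116, 120, 124]
data_set_2 = [108, 112, 116, 120, 205]  # same values, but the last one is an outlier

summary = {
    "Data set 1": (mean(data_set_1), median(data_set_1)),
    "Data set 2": (mean(data_set_2), median(data_set_2)),
}

for name, (avg, mid) in summary.items():
    print(f"{name}: mean = {avg}, median = {mid}")
```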
2) Descriptive Statistics for Quantitative Data:
Measures of dispersion (variability)
Dispersion or variability refers to the spread of the values around the central tendency.
Two common measures of dispersion:
(1) Range, and
(2) Standard Deviation (SD)
Range = The highest score minus the lowest score
Examples:
What is the range for each data set?
44, 46, 47, 52, 56, 58, 60, 63 and 65
44, 46, 47, 52, 56, 58, 60, 90 and 98
9, 15, 47, 52, 56, 58, 60, 63 and 65
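The three range questions can be answered with a one-line helper. `value_range` is an illustrative name; the data are the three sets listed above.

```python
def value_range(values):
    """Range = highest score minus lowest score."""
    return max(values) - min(values)


datasets = [
    [44, 46, 47, 52, 56, 58, 60, 63, 65],
    [44, 46, 47, 52, 56, 58, 60, 90, 98],
    [9, 15, 47, 52, 56, 58, 60, 63, 65],
]

ranges = [value_range(d) for d in datasets]
print(ranges)  # [21, 54, 56]
```

The second and third sets differ from the first only in their extreme values, yet the range more than doubles, which shows how sensitive the range is to outliers.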
Standard deviation
SD is a more accurate and detailed estimate of dispersion/variability/variation than the range, because an outlier can greatly exaggerate the range.
SD shows the relation of a set of scores to the mean of the sample.
SD takes into consideration the deviations of the individual scores from the mean. Each individual deviation is squared to avoid the problem of plus and minus signs cancelling out.
Standard Deviation (SD)
SD Formula:
SD = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
Steps:
1. Find the difference between each score and the mean
2. Square each difference
3. Sum all the squared differences
4. Divide the sum of all the squared differences by the number of scores (n) minus 1, and take the square root
Manual calculation of SD:
E.g. The test scores of an Immunology Course:
Test score, x    x - \bar{x}        (x - \bar{x})^2
23               23 - 25 = -2       4
22               22 - 25 = -3       9
26               26 - 25 = +1       1
21               21 - 25 = -4       16
30               30 - 25 = +5       25
24               24 - 25 = -1       1
20               20 - 25 = -5       25
27               27 - 25 = +2       4
25               25 - 25 = 0        0
32               32 - 25 = +7       49
\bar{x} = 25     n = 10             \sum (x - \bar{x})^2 = 134
SD = \sqrt{\frac{134}{10 - 1}} = 3.8586
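The manual calculation can be followed step by step in code. This is a sketch of the same n - 1 formula applied to the ten test scores, with `statistics.stdev` (which also divides by n - 1) as a cross-check.

```python
from math import sqrt
from statistics import stdev

scores = [23, 22, 26, 21, 30, 24, 20, 27, 25, 32]

n = len(scores)
mean_score = sum(scores) / n                     # 25.0
deviations = [x - mean_score for x in scores]    # difference between each score and the mean
squared = [d ** 2 for d in deviations]           # squaring removes the plus/minus signs
sum_of_squares = sum(squared)                    # 134.0
sd = sqrt(sum_of_squares / (n - 1))              # divide by n - 1, then take the square root

print(round(sd, 4))  # 3.8586
```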
SD shows an 'average' distance of the measures from the mean.
SD is of particular value when used with the Normal distribution, where known proportions of the measurements fall within 1SD, 2SD or 3SD of the mean.
E.g. If the average score is 56 in a normal distribution, with a SD of 6, then 68.3% of scores will be 56 ± 6 (= between 50 and 62)
[Figure: normal curve with the score axis marked at 38, 44, 50, 56, 62, 68 and 74]
Or, if divided into 6 bands:
Scores between    Share of scores
38 & 44           2.1%
44 & 50           13.6%
50 & 56           34.1%
56 & 62           34.1%
62 & 68           13.6%
68 & 74           2.1%
Thus, for a given normally distributed data set, the mean and the standard deviation can be calculated, and from these can be derived the probability of future measures falling into the three bands (1SD, 2SD or 3SD)
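The band percentages can be recomputed from the normal distribution itself. The sketch below uses `statistics.NormalDist` (Python 3.8 or later) with the mean of 56 and SD of 6 from the example; `share_between` is an illustrative helper name.

```python
from statistics import NormalDist

scores = NormalDist(mu=56, sigma=6)


def share_between(low, high):
    """Percentage of a normal distribution lying between two scores."""
    return 100 * (scores.cdf(high) - scores.cdf(low))


edges = [38, 44, 50, 56, 62, 68, 74]  # mean plus or minus 1, 2 and 3 SD
bands = [(low, high, share_between(low, high)) for low, high in zip(edges, edges[1:])]

for low, high, share in bands:
    print(f"{low} to {high}: {share:.1f}%")

within_1sd = share_between(50, 62)
print(f"Within 1 SD: {within_1sd:.1f}%")
```

The six bands add up to about 99.7%, because a small share of scores lies more than 3 SD from the mean.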
Variance:
Measures the amount of dispersion/variability/spread about the mean of a sample
Denoted by s^2 for a sample (\sigma^2, sigma squared, for a population)
Total of squares of deviations of observations from the mean / number of degrees of freedom
s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}
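As a worked illustration, applying the formula to the ten Immunology test scores used in the SD example (sum of squared deviations 134, n = 10) gives the following; the variance is simply the SD squared.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
    = \frac{134}{10 - 1}
    \approx 14.89,
\qquad
SD = \sqrt{s^2} \approx 3.86
```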
Interquartile range (IQR)
What is an interquartile range (IQR) ?
The interquartile range (IQR) is the distance between
the 75th percentile and the 25th percentile. The IQR is
essentially the range of the middle 50% of the data.
Because it uses the middle 50%, the IQR is not
affected by outliers or extreme values
The IQR is also equal to the length of the box in a box
plot.
[Figure: box plot marking the min score, Q1, Q2, Q3 and the max score]
How to compute Inter-quartile Range?
Like the standard deviation, the interquartile range (IQR) is a
descriptive statistic used to summarize the extent of the
spread of your data.
The IQR is the distance between the 1st quartile (25th percentile) and 3rd quartile (75th percentile).
Q3 - Q1 = IQR
To find these numbers you must divide your data set in half, and find the median of each half and that will be your Q1 and Q3.
How to compute Inter-quartile Range?
If you have an odd number of values, then EXCLUDE the median of the entire set, as follows:
For example, take the following dataset:
3 5 7 8 9 21 40 90 120
We exclude the 9 as the median of the whole set. The 1st quartile is 6 (5+7 divided by 2) and the 3rd quartile is 65 (40+90 divided by 2), making the IQR = 65 - 6 = 59.
OR
3 5 7 8 40 90 120
We exclude the 8 as the median of the whole set. The 1st quartile is 5 and the 3rd quartile is 90. (IQR = 90 - 5 = 85.)
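The median-of-halves rule can be sketched as follows. `quartiles_by_halves` is an illustrative helper that implements only the rule described here (exclude the overall median when the number of values is odd); statistical packages often use other quartile conventions and may report slightly different values.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd number of values the overall median is excluded from both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return median(lower), median(upper)


def iqr(values):
    q1, q3 = quartiles_by_halves(values)
    return q3 - q1


first_example = [3, 5, 7, 8, 9, 21, 40, 90, 120]
second_example = [3, 5, 7, 8, 40, 90, 120]

print(quartiles_by_halves(first_example), iqr(first_example))    # (6.0, 65.0) 59.0
print(quartiles_by_halves(second_example), iqr(second_example))  # (5, 90) 85
```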
Inter-quartile Range: Example
i     x[i]    Quartile
1     102
2     104
3     105     Q1
4     107
5     108
6     109     Q2 - median
7     110
8     112
9     115     Q3
10    116
11    118
For the data in this table the interquartile range is IQR = 115 - 105 = 10
Inter-quartile Range: Median? Q1? Q3? IQR?
[Figure: box plot drawn above a number line running from 0 to 12]
For the data set in this box plot:
lower (first) quartile (Q1, x_{0.25}) = 7
median (second quartile) (Median, x_{0.5}) = 8.5
upper (third) quartile (Q3, x_{0.75}) = 9
interquartile range, IQR = Q3 - Q1 = 9 - 7 = 2
Can also be presented in a box-and-whisker diagram or plot
A convenient way of graphically depicting groups of numerical data through their five-number summaries:
- the smallest observation (sample minimum),
- lower quartile (Q1),
- median (Q2),
- upper quartile (Q3), and
- largest observation (sample maximum)
A boxplot may also indicate which observations, if any, might be considered outliers.
How to create a box-and-whisker plot:
1. Arrange your data in numerical order
2. Find the median (Q2 or Quartile 2) of your data. The median divides the data into two halves.
3. To divide the data into quarters, you then find the medians of these two halves.
(Note: Remember to EXCLUDE the median of your overall data before finding the two sub-medians)
4. The lower (or smaller sub-median) is called Q1; the larger (or bigger sub-median) is called Q3.
5. Draw a line and mark positions of Q1, Q2 and Q3. Make a box indicating the 3 points.
6. Mark the minimum point (lowest score from the data set) below Q1 and the maximum point (highest score from the data set) above Q3. Add whiskers to the box.
[Figure: box plot marking the min score, Q1, Q2, Q3 and the max score]
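Steps 1 to 4 and the two whisker ends in step 6 amount to computing a five-number summary. A minimal sketch, using the eleven values from the interquartile range example table (102 to 118) and an illustrative helper name:

```python
from statistics import median


def five_number_summary(values):
    """Min, Q1, median, Q3 and max, with quartiles taken as medians of the two halves."""
    ordered = sorted(values)      # step 1: arrange in numerical order
    half = len(ordered) // 2      # the overall median is excluded when n is odd
    return {
        "min": ordered[0],
        "Q1": median(ordered[:half]),
        "median": median(ordered),
        "Q3": median(ordered[-half:]),
        "max": ordered[-1],
    }


data = [102, 104, 105, 107, 108, 109, 110, 112, 115, 116, 118]
summary = five_number_summary(data)
print(summary)
```

The box spans Q1 to Q3 with a line at the median, and the whiskers run out to the minimum and maximum.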
Significance of IQR
Unlike the (total) range, the interquartile range is a robust statistic, having a breakdown point of 25%, and is thus often preferred to the total range
The IQR is used to build box plots, simple graphical representations of a probability distribution.
For a symmetric distribution (so the median equals the midhinge, the average of the first and third quartiles), half the IQR equals the median absolute deviation (MAD).
The median is the corresponding measure of central tendency.
NORMAL DISTRIBUTION
Distribution of data
Why? To determine the appropriate statistical test
How? By checking a few criteria:
Normal curve
Skewness (±1)
Kurtosis (±1)
Box & whisker plot
Histogram with overlaid normal curve
Histogram follows the normal curve = normal distribution, so a parametric test can be used
Box and whisker plot with equal tails = normal distribution, so a parametric test can be used
[Figure: curves showing a normal distribution, a positively skewed distribution and a negatively skewed distribution]
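The skewness and kurtosis criteria can be checked numerically. The sketch below rests on two assumptions that the slides do not spell out: the simple moment-based formulas (with kurtosis reported as excess kurtosis, which is 0 for a normal distribution), and a reading of the "(1)" beside each criterion as a cut-off of ±1. Statistical packages may apply small-sample corrections and report slightly different values.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis


def roughly_normal(values, limit=1.0):
    """Assumed rule of thumb: both statistics lie within plus or minus `limit`."""
    skewness, kurtosis = skewness_and_excess_kurtosis(values)
    return abs(skewness) <= limit and abs(kurtosis) <= limit


test_scores = [23, 22, 26, 21, 30, 24, 20, 27, 25, 32]
with_outlier = [108, 112, 116, 120, 205]

print(roughly_normal(test_scores))   # True
print(roughly_normal(with_outlier))  # False
```

A check like this complements, and does not replace, looking at the histogram and the box plot.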
Characteristics of Distribution
Positively Skewed (right)
Majority of data fall to the left of the mean
Cluster at the lower end of the distribution
Tail to the right
Mode < Median < Mean
Negatively Skewed (left)
Majority of data fall to the right of the mean
Cluster at the upper end of the distribution
Tail to the left
Mean < Median < Mode
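The ordering of mode, median and mean in skewed data can be illustrated with a small made-up sample. The values below are hypothetical, and the left-skewed set is just the mirror image of the right-skewed one.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample: most values are small, with a long tail to the right.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the same sample, giving a long tail to the left.
left_skewed = [16 - x for x in right_skewed]

right = (mode(right_skewed), median(right_skewed), mean(right_skewed))
left = (mean(left_skewed), median(left_skewed), mode(left_skewed))

print("Right skew: mode, median, mean =", right)
print("Left skew:  mean, median, mode =", left)
```

In both cases the mean is pulled furthest towards the tail, which is why the median is the safer summary for skewed data.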
Presenting the data
Descriptive Statistics: GRAPH (1): Bar Chart
A graph is a diagram showing the relation between typically two variable quantities, each measured along one of a pair of axes at right angles.
Bar Charts
X-axis for discrete categories (e.g. Blood group; Gender)
Y-axis for frequencies (or percentages)
Descriptive Statistics: GRAPH (2): Histogram
X-axis for continuous variable (e.g. Weight; Height)
Y-axis for percentages (or frequencies)
Descriptive Statistics: GRAPH (3): Line Graph
The line graph serves a similar function to a histogram. Thus it should be used for continuous variables
The frequency of any value in a histogram is represented by a single column, whereas the frequency of any value in a line graph is represented by a single point on a line.
Presenting qualitative (categorical) data
Presented statistically in terms of frequency or percentage (%)
In table form, called a frequency table
Graphically in the form of a chart: Pie Chart or Bar Chart
The data represents the education level for a sample of seven districts in Kelantan
Frequency table (qualitative variable: education level)
Education level    Freq.    %
Primary            24       6.0
Secondary          58       14.5
Degree             230      57.5
Master             50       12.5
PhD                38       9.5
Total              400      100.0
Bar & pie chart
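The percentages in the frequency table, and a rough text version of the bar chart, can be generated from the frequencies alone. This is only a sketch: the scale of one `#` per 2.5 percentage points is an arbitrary choice for display.

```python
education = {"Primary": 24, "Secondary": 58, "Degree": 230, "Master": 50, "PhD": 38}

total = sum(education.values())
percentages = {level: 100 * freq / total for level, freq in education.items()}

for level, pct in percentages.items():
    bar = "#" * round(pct / 2.5)  # arbitrary scale: one '#' per 2.5 percentage points
    print(f"{level:<10} {education[level]:>4} {pct:5.1f}%  {bar}")

print(f"{'Total':<10} {total:>4} {sum(percentages.values()):5.1f}%")
```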
SUMMARY
Categorical Data
- Statistics: Frequency (%)
- Graphs: Bar chart, Pie chart
Numerical Data
- Statistics: Mean (SD), Median (IQR)
- Graphs: Histogram, Box Plot
THANK YOU
NEXT LECTURE: INFERENTIAL STATISTICS & HYPOTHESIS TESTING
(Sunday 10-11am, DKC)
