DESCRIPTIVE STATISTICS
Two branches of statistics: descriptive statistics and inferential statistics
What is Descriptive Statistics?
= A discipline of statistics that summarizes a
collection of data and presents it in a way that can
be easily and clearly understood
Two Basic Methods in Descriptive Statistics
Two basic methods to summarize data:
Numerical Method
- Calculate the mean and SD
- Precise and objective
Graphical Method
- Create:
  - a frequency table
  - a bar-chart
  - a line graph
  - a box plot
- Shows distribution patterns in the data
What is Descriptive Statistics?
- As compared to Inferential Statistics
Descriptive Statistics
Involve:
- Collection of data
- Organization of data
- Enumeration of frequency of characteristics
- Summarization and presentation of data
Describe the characteristics of OUR OBSERVED data
- Numerical method
- Graphical method
Inferential Statistics
Involve:
- Hypothesis testing
- Confidence Interval
Allows researchers to INFER (GENERALIZE) the characteristics of the sample
(statistics) to the population (parameter)
Why use Descriptive Statistics?
Descriptive statistics are used to present quantitative descriptions in a
manageable form.
wanamir@kb.usm.my 6
Why is it so important?
First stage of statistical analysis
Identifying outliers and keying errors
Checking for data symmetry, normality
Presenting the data
Descriptive Statistics for Qualitative Data
Descriptive Statistics for Qualitative Data:
Frequency Distribution
Valid percent
= The percentage of valid (non-missing) responses to the question that
fall in the category.
- If there are no missing cases, valid percent column = percent column
Cumulative percent
= The running total of valid percentages from the first
category to the last valid category.
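The three percentage columns can be illustrated with a minimal Python sketch. The blood-group responses and the function name `frequency_table` are hypothetical, and `None` is assumed to mark a missing response; statistical packages may lay out the table differently.

```python
from collections import Counter

def frequency_table(responses):
    """Return rows of (category, frequency, percent, valid %, cumulative %).

    None is treated as a missing response (an assumption of this sketch).
    """
    total = len(responses)
    valid = [r for r in responses if r is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for category in sorted(counts):
        freq = counts[category]
        percent = 100 * freq / total             # share of all cases
        valid_percent = 100 * freq / len(valid)  # share of valid cases only
        cumulative += valid_percent              # running total of valid %
        rows.append((category, freq, percent, valid_percent, cumulative))
    return rows

# Hypothetical data: 10 cases, 2 of them missing
blood_groups = ["A", "B", "O", "O", "A", None, "AB", "O", None, "B"]
rows = frequency_table(blood_groups)
for row in rows:
    print("{:<3} {:>2} {:>6.1f} {:>6.1f} {:>6.1f}".format(*row))
```

With two missing cases, the valid percent of every category is larger than its percent, and the cumulative percent reaches 100 at the last valid category.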
1) Descriptive Statistics for Quantitative Data:
Measures of centrality (central tendency)
Mode
Median
Mean
Mode
The most frequently occurring value in a given dataset
117 120 114 116 115 114 121 117 124: bimodal (modes 114 and 117)
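As a quick check of the bimodal example, here is a small sketch using `statistics.multimode` from the Python standard library (Python 3.8 or later). The variable names are illustrative only.

```python
from statistics import multimode

scores = [117, 120, 114, 116, 115, 114, 121, 117, 124]

# multimode returns every value tied for the highest frequency
modes = sorted(multimode(scores))
print(modes)  # [114, 117]
```

Two values share the highest frequency, which is why the dataset is called bimodal.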
Mean
Often called average
Sum of all data values in a dataset divided by the
number of data values
20 18 16 22 27 11
The mean will be:
(20 + 18 + 16 + 22 + 27 + 11) / 6 = 114 / 6 = 19
Other examples:
Mean age of Malay teachers in a primary
school
Mean weight of primary school students
Median
Exact middle value in a distribution
Divides the data set into two exact halves
24 29 32 35 39 40
(32 + 35) / 2 = 33.5
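The mean and median examples can be reproduced with the standard `statistics` module. This is only a sketch; the variable names are not from the slides.

```python
from statistics import mean, median

mean_data = [20, 18, 16, 22, 27, 11]
median_data = [24, 29, 32, 35, 39, 40]

# Mean: sum of the values divided by how many there are
print(sum(mean_data), len(mean_data), mean(mean_data))  # 114 6 19

# Median with an even count: average of the two middle values (32 and 35)
print(median(median_data))  # 33.5
```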
Choosing an appropriate measure of central tendency
We have learned three measures and how they differ in
finding the centre of a data distribution
When do we use which measure?
Mode: the most frequently occurring value; useful for nominal
measurement
Median & mean: useful when the variable being
measured can be quantified
Important note:
Mean: extremely sensitive to unusual cases
Influenced by the presence of outliers
Should not be used when unusual, or outlying, data
values are present in the data set
Median should be used instead
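A short sketch of why the median is preferred when outliers are present. It reuses the six values from the mean example and appends one hypothetical extreme value (200), which is an assumption made purely for illustration.

```python
from statistics import mean, median

values = [20, 18, 16, 22, 27, 11]
with_outlier = values + [200]  # hypothetical outlying value

print(mean(values), median(values))              # both are 19
print(round(mean(with_outlier), 1), median(with_outlier))
```

One extreme value pulls the mean from 19 to roughly 44.9, while the median only moves from 19 to 20.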
2) Descriptive Statistics for Quantitative Data:
Measures of dispersion (variability)
Dispersion or variability refers to the spread
of the values around the central tendency.
Examples:
What is the range for the data set:
44, 46, 47, 52, 56, 58, 60, 63 and 65?
What is the range for the data set:
44, 46, 47, 52, 56, 58, 60, 90 and 98?
What is the range for the data set:
9, 15, 47, 52, 56, 58, 60, 63 and 65?
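The three questions can be worked through in Python, assuming the usual definition of the range as the largest value minus the smallest (the slides do not spell the definition out). The helper name `value_range` is hypothetical.

```python
def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

set_a = [44, 46, 47, 52, 56, 58, 60, 63, 65]
set_b = [44, 46, 47, 52, 56, 58, 60, 90, 98]
set_c = [9, 15, 47, 52, 56, 58, 60, 63, 65]

for name, data in (("A", set_a), ("B", set_b), ("C", set_c)):
    print(name, value_range(data))
```

The second and third sets differ from the first only in their two largest or two smallest values, yet the range more than doubles (21 versus 54 and 56), which shows how strongly extreme values drive this measure.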
Standard deviation
SD is a more accurate and detailed estimate of
dispersion/variability/variation because an outlier can greatly
exaggerate the range.
SD Formula:
\[ SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
Steps:
1. Find the difference between the mean and each score
2. Square each difference
3. Sum all the squared differences
4. Divide the sum of all the squared differences by the number
of scores (n) minus 1, and take the square root.
Manual calculation of SD:
E.g. The test scores of an Immunology Course:

Test score, x    x - mean          (x - mean)^2
23               23 - 25 = -2      4
22               22 - 25 = -3      9
26               26 - 25 = +1      1
21               21 - 25 = -4      16
30               30 - 25 = +5      25
24               24 - 25 = -1      1
20               20 - 25 = -5      25
27               27 - 25 = +2      4
25               25 - 25 = 0       0
32               32 - 25 = +7      49
Mean = 25        n = 10            Sum = 134

\[ SD = \sqrt{\frac{134}{10 - 1}} = \sqrt{14.89} = 3.8586 \]
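The manual calculation can be checked step by step in Python. This is a sketch that mirrors the four steps and then compares the result with `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
from statistics import stdev

scores = [23, 22, 26, 21, 30, 24, 20, 27, 25, 32]

n = len(scores)
mean_score = sum(scores) / n                      # 25.0
differences = [x - mean_score for x in scores]    # step 1
squared = [d ** 2 for d in differences]           # step 2
sum_sq = sum(squared)                             # step 3: 134.0
sd = sqrt(sum_sq / (n - 1))                       # step 4

print(mean_score, sum_sq, round(sd, 4))  # 25.0 134.0 3.8586
print(round(stdev(scores), 4))           # same value from the library
```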
SD gives a typical ('average') distance of the measures from the
mean.
SD is especially useful with the Normal distribution, where known proportions of the
measurements fall within 1SD, 2SD or 3SD of the mean.
E.g. If the average score is 56 in a normal distribution, with a SD of 6,
then
68.3% of scores will be 56 ± 6 (= between 50 and 62)
95.4% of scores will be 56 ± 12 (= between 44 and 68)
99.7% of scores will be 56 ± 18 (= between 38 and 74)
Or, if divided into 6 bands:
Scores between    % of scores
38 & 44           2.1%
44 & 50           13.6%
50 & 56           34.1%
56 & 62           34.1%
62 & 68           13.6%
68 & 74           2.1%
Thus, for a normally distributed data set, the mean and
the standard deviation can be calculated, and from these can be
derived the probability of future measures falling into the three
bands (1SD, 2SD or 3SD).
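The band percentages can be derived from the normal curve itself. The sketch below uses `statistics.NormalDist` from the standard library with the mean of 56 and SD of 6 from the example; the helper `proportion_between` is a hypothetical name.

```python
from statistics import NormalDist

scores = NormalDist(mu=56, sigma=6)

def proportion_between(low, high):
    """Share of a normal distribution lying between low and high."""
    return scores.cdf(high) - scores.cdf(low)

# Within 1, 2 and 3 SD of the mean
for k in (1, 2, 3):
    low, high = 56 - k * 6, 56 + k * 6
    print(low, high, round(100 * proportion_between(low, high), 1))

# The six bands, each one SD wide
edges = [38, 44, 50, 56, 62, 68, 74]
bands = [proportion_between(a, b) for a, b in zip(edges, edges[1:])]
print([round(100 * p, 1) for p in bands])
```

The computed values agree with the percentages quoted above to within rounding.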
Variance:
Measures the amount of dispersion/variability/spread about the mean of a
sample
Denoted by the symbol s^2 for a sample (σ^2, sigma squared, for a population)
Sum of the squared deviations of the observations from the mean / number of
degrees of freedom
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
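To connect the variance with the standard deviation, here is a brief sketch on the Immunology test scores used earlier. `statistics.variance` computes the sample variance with n - 1 in the denominator.

```python
from statistics import stdev, variance

scores = [23, 22, 26, 21, 30, 24, 20, 27, 25, 32]

s2 = variance(scores)        # sum of squared deviations / (n - 1) = 134 / 9
print(round(s2, 2))          # 14.89
print(round(s2 ** 0.5, 4))   # square root of the variance is the SD: 3.8586
```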
Interquartile range (IQR)
What is an interquartile range (IQR) ?
The interquartile range (IQR) is the distance between
the 75th percentile and the 25th percentile. The IQR is
essentially the range of the middle 50% of the data.
Because it uses the middle 50%, the IQR is not
affected by outliers or extreme values
The IQR is also equal to the length of the box in a box
plot.
How to compute Inter-quartile Range?
Like the standard deviation, the interquartile range (IQR) is a
descriptive statistic used to summarize the extent of the
spread of your data.
Q3 - Q1 = IQR
To find these numbers you must divide your data set in half,
and find the median of each half and that will be your Q1 and
Q3.
If you have an odd number of values, then EXCLUDE the median of the
entire set, as follows.
For example, take the following dataset:
3 5 7 8 9 21 40 90 120
We exclude the 9 as the median of the whole set. The 1st quartile is 6 (5+7
divided by 2) and the 3rd quartile is 65 (40+90 divided by 2), making the
IQR = 65 - 6 = 59.
OR
3 5 7 8 40 90 120
We exclude the 8 as the median of the whole set. The 1st quartile is 5 and
the 3rd quartile is 90. (IQR = 90 - 5 = 85.)
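The median-of-halves procedure can be sketched in Python as follows. The function names are hypothetical, and the handling of an even number of values (split cleanly into two halves, nothing excluded) is an assumption based on the instruction to divide the data set in half. Note that statistical software often uses interpolation rules that can give slightly different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    With an odd number of values the overall median is excluded
    from both halves; with an even number the data split cleanly.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(data), median(upper)

def iqr(values):
    q1, _, q3 = quartiles_by_halves(values)
    return q3 - q1

first = [3, 5, 7, 8, 9, 21, 40, 90, 120]
second = [3, 5, 7, 8, 40, 90, 120]
print(quartiles_by_halves(first), iqr(first))
print(quartiles_by_halves(second), iqr(second))
```

Both datasets reproduce the quartiles worked out by hand: Q1 = 6 and Q3 = 65 for the first, Q1 = 5 and Q3 = 90 for the second.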
Inter-quartile Range: Example

i     x[i]    Quartile
1     102
2     104
3     105     Q1
4     107
5     108
6     109     Q2 - median
7     110
8     112
9     115     Q3
10    116
11    118

For the data in this table the interquartile range is
IQR = 115 - 105 = 10
Inter-quartile Range
From the box plot below, find: Median? Q1? Q3? IQR?
+-----+-+
x o |-------| | |---|
+-----+-+
+---+---+---+---+---+---+---+---+---+---+---+---+ number line
0 1 2 3 4 5 6 7 8 9 10 11 12
3. To divide the data into quarters, you then find the medians
of these two halves.
(Note: Remember to EXCLUDE the median of your overall
data before finding the two sub-medians)
4. The lower (smaller) sub-median is called Q1; the upper (larger)
sub-median is called Q3.
How to create a box-and-whisker plot:
5. Draw a line and mark positions of Q1, Q2 and Q3.
Make a box indicating the 3 points.
Q1 Q2 Q3
6. Mark the minimum point (lowest score from the data set) below
Q1 and the maximum point (highest score from the data set)
above Q3. Add whiskers from the box to these two points.
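The five values needed to draw the box and its whiskers can be collected in one place. This sketch uses the eleven values from the inter-quartile range example table; the function name `five_number_summary` is hypothetical, and the quartiles follow the median-of-halves rule described in the steps (overall median excluded when the count is odd).

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, Q2 (median), Q3 and maximum for a box-and-whisker plot."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # Skip the overall median when the number of values is odd
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return {
        "min": data[0],
        "Q1": median(lower),
        "Q2": median(data),
        "Q3": median(upper),
        "max": data[-1],
    }

readings = [102, 104, 105, 107, 108, 109, 110, 112, 115, 116, 118]
summary = five_number_summary(readings)
print(summary)
```

The box spans Q1 to Q3 with a line at Q2, and the whiskers extend to the minimum and maximum.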
Significance of IQR
Unlike (total) range, the interquartile range
is a robust statistic, having a breakdown
point of 25%, and is thus often preferred to
the total range
For a symmetric distribution (so the median equals
the midhinge, the average of the first and third
quartiles), half the IQR equals the median absolute
deviation (MAD).
NORMAL DISTRIBUTION
Distribution of data
Why? Determining appropriate statistical test
How? Checking a few criteria
Normal curve
Skewness (±1)
Kurtosis (±1)
Box & whisker plot
Histogram with overlaid normal curve
Normal Distribution
Positively Skewed
Negatively Skewed
Characteristics of Distribution
Positively Skewed (right)
The majority of the data falls to the left
of the mean
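A positively skewed sample can be recognised numerically as well as visually. The sketch below uses a hypothetical dataset and one common definition of skewness, the moment coefficient (third central moment divided by the 1.5 power of the second); statistical packages may apply a small-sample correction and report slightly different values.

```python
from statistics import mean, median

def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (no bias correction)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12, 20]  # hypothetical data
symmetric = [1, 2, 3, 4, 5]

print(round(mean(right_skewed), 2), median(right_skewed))
print(skewness(right_skewed) > 0)
print(skewness(symmetric))
```

In the right-skewed sample the long tail of large values pulls the mean above the median, so most observations lie to the left of the mean and the skewness is positive; the symmetric sample has a skewness of zero.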
Presenting the data
Descriptive Statistics: GRAPH (1): Bar Chart
Bar Charts
X-axis for discrete categories
(e.g. Blood group; Gender)
Descriptive Statistics: GRAPH (2): Histogram
X-axis for continuous variable (e.g. Weight; Height)
Y-axis for percentages (or frequencies)
Descriptive Statistics: GRAPH (3): Line Graph
Presenting qualitative (categorical) data
Presented statistically in terms of
frequency or percentage (%)
In table form,
called a frequency table
THANK YOU