
BIOL 4243 IUG

Assessing normality

Not all continuous random variables are normally distributed. It is therefore important to
evaluate how well a given data set is approximated by a normal
distribution. In this section some statistical tools are presented for checking whether a
given set of data is normally distributed.

1. Previous knowledge of the nature of the distribution

Problem: A researcher working with sea stars needs to know if sea star size (length of
radii) is normally distributed.
What do we know about the size distributions of sea star populations?
1. Has previous work with this species of sea star shown them to be normally
distributed?
2. Has previous work with a closely related species of sea star shown them to be
normally distributed?
3. Has previous work with sea stars in general shown them to be normally
distributed?
If you can answer yes to any of the above questions and you do not have a reason to
think your population should be different, you could reasonably assume that your
population is also normally distributed and stop here.
However, if any previous work has shown non-normal distribution of sea stars you
had probably better use other techniques.

2. Construct charts

For small- or moderate-sized data sets, the stem-and-leaf display and box-and-
whisker plot will look symmetric.
For large data sets, construct a histogram or polygon and see whether the distribution
is bell-shaped or deviates grossly from a bell-shaped normal distribution. Look
for skewness and asymmetry, and for gaps in the distribution (intervals with
no observations). However, remember that normality requires more than just
symmetry; the fact that the histogram is symmetric does not mean that the data
come from a normal distribution. Also, data sampled from a normal distribution
will sometimes look distinctly different from the parent distribution. So, we
need to develop some techniques that allow us to determine if data are
significantly different from a normal distribution.

3. Normal Counts method

Count the number of observations within 1, 2, and 3 standard deviations of the mean
and compare the results with what is expected for a normal distribution in the 68-95-
99.7 rule. According to the rule,
68% of the observations lie within one standard deviation of the mean.
95% of observations within two standard deviations of the mean.
99.7% of observations within three standard deviations of the mean.

Example: As part of a demonstration one semester, I collected data on the heights of a
sample of 24 IUG biostatistics students. These data are presented in the table below.
Could this sample have been drawn from a normally distributed population?

Table. Heights, in inches, of 24 IUG biostatistics students.

71.0 69.0 70.0 72.5 73.0
70.0 71.5 70.5 72.0 71.0
68.5 69.0 69.0 68.5 74.0
67.0 69.0 71.5 66.0 70.0
68.5 74.0 74.5 74.0
Solution:
For the Normal Counts method, first tabulate the frequencies:

Height (inches)   Frequency
66.0              1
67.0              1
68.5              3
69.0              4
70.0              3
70.5              1
71.0              2
71.5              2
72.0              1
72.5              1
73.0              1
74.0              3
74.5              1
Total             24

(The values from 68.5 through 72.5 account for 17 of the 24 observations.)


x̄ = 70.6; s = 2.3
x̄ ± s runs from 68.3 to 72.9.
17 of the 24 observations, i.e. 17/24 ≈ 0.71 (about 70%), fall within x̄ ± s, i.e. between
68.3 and 72.9, which is approximately equal to the expected 68%. There is no reason to doubt that
the sample is drawn from a normal population.
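The count above is easy to reproduce in Python (a standard-library sketch, not part of the SPSS workflow described in these notes; variable names are my own):

```python
import statistics

# Heights, in inches, of the 24 students from the table above
heights = [71.0, 69.0, 70.0, 72.5, 73.0,
           70.0, 71.5, 70.5, 72.0, 71.0,
           68.5, 69.0, 69.0, 68.5, 74.0,
           67.0, 69.0, 71.5, 66.0, 70.0,
           68.5, 74.0, 74.5, 74.0]

mean = statistics.mean(heights)   # sample mean, ~70.6
s = statistics.stdev(heights)     # sample standard deviation, ~2.3

# Count observations within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    inside = sum(1 for x in heights if mean - k * s <= x <= mean + k * s)
    print(f"within {k} sd: {inside}/{len(heights)} = {inside / len(heights):.0%}")
```

Run against the table, this reports 17 of 24 (about 71%) within one standard deviation, close to the 68% expected under normality.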

4. Compute descriptive summary measures

a. The mean, median and mode will have similar values.
b. The interquartile range will be approximately equal to 1.33 s.
c. The range will be approximately equal to 6 s.
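These checks can also be computed directly; a standard-library sketch using the height data from the example above (the quartile method is Python's default and closely matches the SPSS output shown later in these notes):

```python
import statistics

heights = [71.0, 69.0, 70.0, 72.5, 73.0, 70.0, 71.5, 70.5, 72.0, 71.0,
           68.5, 69.0, 69.0, 68.5, 74.0, 67.0, 69.0, 71.5, 66.0, 70.0,
           68.5, 74.0, 74.5, 74.0]

mean = statistics.mean(heights)                  # ~70.58
median = statistics.median(heights)              # 70.25
mode = statistics.mode(heights)                  # 69.0, the most frequent value
s = statistics.stdev(heights)

q1, q2, q3 = statistics.quantiles(heights, n=4)  # default 'exclusive' method
iqr = q3 - q1
rng = max(heights) - min(heights)

print(f"mean={mean:.2f}  median={median}  mode={mode}")
print(f"IQR={iqr}   vs 1.33*s={1.33 * s:.2f}")
print(f"range={rng} vs 6*s={6 * s:.2f}")
```

Here the mean, median and mode are similar (70.58, 70.25, 69.0) and the IQR (3.375) is roughly 1.33 s (3.06); the range rule is a large-sample guide and undershoots badly for a sample of only 24.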

5. Evaluate normal probability plot

If the data come from a normal or approximately normal distribution, the plotted
points will fall approximately along a straight line (a 45 degree line). However, if your
sample departs from normality, the points on the graph will deviate from that line.
If the points trail off from a straight line in a curve at the top end (observed values
bigger than expected), the distribution is right skewed (see below). If the observed values trail off
at the bottom end, the distribution is left skewed.

Note that any worthwhile statistics package will
construct these graphs for you (see below).
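One simple way to quantify how straight the probability plot is: correlate the ordered data with theoretical normal scores; the closer the correlation is to 1, the straighter the plot. This sketch uses Blom's rankit formula as an assumption (it is not SPSS's only option) and only the Python standard library:

```python
import statistics

heights = [71.0, 69.0, 70.0, 72.5, 73.0, 70.0, 71.5, 70.5, 72.0, 71.0,
           68.5, 69.0, 69.0, 68.5, 74.0, 67.0, 69.0, 71.5, 66.0, 70.0,
           68.5, 74.0, 74.5, 74.0]
n = len(heights)
nd = statistics.NormalDist()

# Normal scores (rankits) via Blom's approximation (i - 3/8) / (n + 1/4)
scores = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Correlation between ordered data and normal scores: near 1 for normal data
r = statistics.correlation(sorted(heights), scores)
print(f"probability-plot correlation r = {r:.3f}")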
6. Measures of Skewness and Kurtosis
Skewness: The normal distribution is symmetrical. Asymmetrical distributions are
sometimes called skewed. Skewness is calculated as follows:

    skewness = [ n / ((n − 1)(n − 2)) ] × Σ(i=1 to n) [ (xᵢ − x̄) / s ]³

where x̄ is the mean, s is the standard deviation, and n is the number of data points.
A perfectly normal distribution will have a skewness statistic of zero. If this statistic
departs significantly from 0, then we lose confidence that our sample comes from a
normally distributed population.
If it is negative, then the distribution is skewed to the left or negatively
skewed distribution.
If it is positive, then the distribution is skewed right or positively skewed
distribution.


Negatively skewed distribution
or Skewed to the left
Skewness <0

Normal distribution
Symmetrical
Skewness = 0

Positively skewed distribution
or Skewed to the right
Skewness > 0

Kurtosis: A bell curve will also depart from normality if the tails fail to fall off at
the proper rate. If they decrease too fast, the distribution ends up too peaked. If
they don't decrease fast enough, the distribution is too flat in the middle and too fat in
the tails. One statistic commonly used to measure kurtosis is calculated as follows:

    kurtosis = [ n(n + 1) / ((n − 1)(n − 2)(n − 3)) ] × Σ(i=1 to n) [ (xᵢ − x̄) / s ]⁴  −  3(n − 1)² / ((n − 2)(n − 3))

where x̄ is the mean, s is the standard deviation, and n is the number of data points.
A perfectly normal distribution will also have a kurtosis statistic of zero.
If kurtosis is significantly less than zero, then our distribution is flat, it is said to
be platykurtic.
If kurtosis is significantly greater than 0, the distribution is pointed or peaked, it
is called leptokurtic.

Platykurtic distribution
Low degree of peakedness
Kurtosis <0

Normal distribution
Mesokurtic distribution
Kurtosis = 0

Leptokurtic distribution
High degree of peakedness
Kurtosis > 0
You won't have to calculate these by hand. The calculations themselves are sensitive to
rounding errors because the deviations are raised to the third and fourth powers.
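The two formulas above translate directly into code. This standard-library sketch (not part of the SPSS workflow; function names are my own) reproduces the statistics SPSS reports for the height data, 0.076 and −0.647:

```python
import statistics

heights = [71.0, 69.0, 70.0, 72.5, 73.0, 70.0, 71.5, 70.5, 72.0, 71.0,
           68.5, 69.0, 69.0, 68.5, 74.0, 67.0, 69.0, 71.5, 66.0, 70.0,
           68.5, 74.0, 74.5, 74.0]

def skewness(data):
    """Adjusted Fisher-Pearson skewness, as in the formula above."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in data)

def kurtosis(data):
    """Sample excess kurtosis, as in the formula above."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)
    term = sum(((x - xbar) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * term
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

print(round(skewness(heights), 3))   # ~0.076
print(round(kurtosis(heights), 3))   # ~-0.647
```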

Using SPSS to Evaluate Data for Normality

Before the advent of good computers and statistical programs, users could be forgiven
for trying to avoid any surplus calculations. Now that both are available and much
easier to use, tests for normality should always be carried out as a best practice in
statistics. SPSS offers a variety of methods for evaluating normality.

Normal probability plot (P-P plot)
The P-P plot graphs the expected cumulative probability against the observed
cumulative probability.

1. Open the SPSS file containing your data.
2. From the main menu, select Graph and then P-P. From the list of available
variables, move the variables you wish to analyze to the variable window. If you
select multiple variables, SPSS will create separate plots for each.
3. In the box for Test Distribution be sure that the pop-up menu is set for a Normal
distribution. In addition, be sure that the Estimate from data box is checked.
4. In the box for Proportion Estimation Formula, select the radio button for the
Rankit method.
5. Finally, in the Ranks Assigned to Ties box, select the radio button for High.
6. Click on OK to obtain the plot and complete your analysis.

[Figure: Normal P-P Plot of Student's heights (inches) — Expected Cum Prob (0.00 to 1.00) plotted against Observed Cum Prob (0.00 to 1.00).]

Q-Q plot

Repeat the above steps, but this time select Q-Q (main menu, Graph and then Q-Q).
[Figure: Normal Q-Q Plot of Student's heights (inches) — Expected Normal value (−2.0 to 2.0) plotted against Observed Value (64 to 76).]


Limitation of visual method

One limitation to any visual approach for evaluating normality is that your conclusion is
open to some uncertainty. How, for example, can you put a quantitative statement on the
confidence of your conclusion? How linear is linear and how much deviation from
linearity is acceptable? One approach to obtaining a more quantitative determination of
whether a data set is normally distributed is the Kolmogorov-Smirnov test or the
Shapiro-Wilk test.

Using a data set to run tests of normality

1. Open the SPSS file containing your data and from the main menu select Analyze
2. Descriptive Statistics
3. Explore
4. Move your variable from the Variable List window to the Dependent List
window.
5. Under Display click Both
6. Click Plots
7. Under Boxplots check Factor levels together
8. Under Descriptive, check Histogram and Stem-and-leaf
9. Check Normality plots with tests
10. Click Continue
11. Click OK
12. Evaluate the plot for evidence of normality.

Descriptives: Student's heights (inches)

                                       Statistic   Std. Error
Mean                                   70.583      .4698
95% Confidence     Lower Bound         69.611
Interval for Mean  Upper Bound         71.555
5% Trimmed Mean                        70.616
Median                                 70.250
Variance                               5.297
Std. Deviation                         2.3015
Minimum                                66.0
Maximum                                74.5
Range                                  8.5
Interquartile Range                    3.375
Skewness                               .076        .472
Kurtosis                               -.647       .918

Assessment of skewness and kurtosis

In practice (as with all estimates), we are unlikely ever to see values of exactly zero
for either the skewness or the kurtosis statistic. The real question is whether the estimates differ
significantly from zero. For this question we need the standard error of skewness and,
similarly, the standard error of kurtosis.

In SPSS, the Explore command provides skewness and kurtosis statistics at once in
addition to the standard errors of skewness and kurtosis.
The key question is whether the value of zero falls within the approximate 95%
confidence interval, computed as the estimate ± 2 standard errors.
For assessing skewness:
0.076 + 2(0.472) = 1.020
0.076 - 2(0.472) = -0.868
For assessing kurtosis:
-0.647 + 2(0.918) = 1.189
-0.647 - 2(0.918) = -2.483

Thus the approximate 95% confidence interval for the skewness statistic ranges from
-0.868 to 1.020, and that for the kurtosis statistic ranges from -2.483 to 1.189. In
both cases zero lies within the bounds, so we can accept that neither statistic is
significantly different from zero. Therefore the distribution can be treated as
normal.
Kolmogorov-Smirnov / Shapiro-Wilk tests for normality
The first is the Kolmogorov-Smirnov test for normality, sometimes termed the K-S
Lilliefors test. The second is the Shapiro-Wilk test. The advice from SPSS is to use
the latter (Shapiro-Wilk) test when sample sizes are small (n < 50). The output,
using the data from above, is presented below.
Tests of Normality: Student's heights (inches)

            Kolmogorov-Smirnov(a)       Shapiro-Wilk
            Statistic  df  Sig.         Statistic  df  Sig.
Heights     .129       24  .200*        .961       24  .469

* This is a lower bound of the true significance.
a. Lilliefors Significance Correction

The value listed as Sig. is a probability lying between 0 and 1. In general, a Sig.
value ≤ 0.05 is considered good evidence that the data set is not normally
distributed. A value greater than 0.05 implies that there is insufficient evidence to
suggest that the data set is not normally distributed.
In our example, the significance of 0.469 (Shapiro-Wilk) means that our distribution is
not significantly different from a normal distribution.
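The Kolmogorov-Smirnov statistic with the Lilliefors correction is simply the largest gap between the empirical cumulative distribution and a normal CDF fitted with the sample mean and standard deviation. A standard-library sketch, using the height data (the 0.886/√n critical value is a common large-sample approximation, assumed here, not SPSS's exact tabled value):

```python
import statistics

heights = [71.0, 69.0, 70.0, 72.5, 73.0, 70.0, 71.5, 70.5, 72.0, 71.0,
           68.5, 69.0, 69.0, 68.5, 74.0, 67.0, 69.0, 71.5, 66.0, 70.0,
           68.5, 74.0, 74.5, 74.0]

def lilliefors_D(data):
    """Largest distance between the empirical CDF and the fitted normal CDF."""
    n = len(data)
    nd = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))
    d = 0.0
    for i, x in enumerate(sorted(data), start=1):
        phi = nd.cdf(x)
        # Check the gap just above and just below each ECDF step
        d = max(d, i / n - phi, phi - (i - 1) / n)
    return d

D = lilliefors_D(heights)
crit = 0.886 / len(heights) ** 0.5   # approximate 5% Lilliefors critical value
print(f"D = {D:.3f}, critical value ~ {crit:.3f}")
```

For the heights, D ≈ .129, matching the SPSS table above, and it falls below the approximate critical value, so the test gives no evidence of non-normality.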
Boxplots

It is very hard to detect normality using a box plot. But at the very least, look for
symmetry and the presence of outliers. Severe skewness and/or outliers are indication
of non-normality.

Normal distribution: If there are only a few outliers and the median line evenly
divides the box, then the data are consistent with a sample from a normal or
near-normal distribution.

Skewness: If there are numerous outliers to one side or the other of the box, or the
median line does not evenly divide the box, then the population distribution from
which the data were sampled may be skewed.

Skewness to the right: If the boxplot shows outliers at the upper range of the data
(above the box), the median line does not evenly divide the box, and the upper tail of
the boxplot is longer than the lower tail, then the population distribution from which
the data were sampled may be skewed to the right. Here is a hypothetical example of
a boxplot for data sampled from a distribution that is skewed to the right:


Skewness to the left: If the boxplot shows outliers at the lower range of the data
(below the box), the median line does not evenly divide the box, and the lower tail of
the boxplot is longer than the upper tail, then the population distribution from which
the data were sampled may be skewed to the left. Here is a hypothetical example of a
boxplot for data sampled from a distribution that is skewed to the left.















From the boxplot of student heights in our example, we see that the distribution
appears to be reasonably symmetric and approximately normal.

[Figure: Boxplot of Student's heights (inches), N = 24, vertical scale 64 to 76. Accompanying sketches contrast negatively skewed, symmetric (not skewed), and positively skewed boxplots.]
Normalizing Transformations
Many of the statistical tests (parametric tests) are based on the assumption that the
data are normally distributed. However, if we actually plot the data from a study, we
rarely see perfectly normal distributions. Most often, the data will be skewed to some
degree or show some deviation from mesokurtosis. Two questions immediately arise:
A) Can we analyze these data with parametric tests and, if not,
B) Is there something we can do to the data to make them more normal?

What to do if Not Normal?

According to some researchers, violations of normality are sometimes not problematic
for running parametric tests. When a variable is not normally distributed (a
distributional requirement for many different analyses), we can create a transformed
variable and test it for normality. If the transformed variable is normally distributed,
we can substitute it in our analysis.

Data transformation involves performing a mathematical
operation on each of the scores in a set of data, thereby converting the data into a
new set of scores which are then employed to analyze the results of an experiment.

To solve for Positive Skew

Square-root, logarithmic, and inverse (1/X) transforms "pull in" the right side of the
distribution toward the middle and normalize right (positive) skew. Inverse
transforms are stronger than logarithmic transforms, which are stronger than roots.

Square root transformation

The square-root transformation can be effective in normalizing distributions that have
a slight to moderate positive skew.
Data taken from a Poisson distribution are sometimes effectively normalized with a
square-root transformation.
The square-root transformation is obtained through use of the equation Y = √X,
where X is the original score (observation) and Y represents the transformed score.
Cube roots, fourth roots, etc., will be increasingly effective for data that are
increasingly skewed.
When you use the square-root transformation, be careful: don't have any zeros or
negative numbers among your raw data.
If, for example, there are any zero values, add a constant C, where C is some small
positive value such as 0.5, and replace each observation by Y = √(X + 0.5).
If there are negative numbers with positive numbers, add a constant to each number to
make all values greater than 0.
Although this transformation is not used as frequently in medicine as the log
transformation, it can be very useful when a log transformation overcorrects.
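The mechanics, including the zero-handling constant, are one line each; a minimal sketch (the counts are made-up illustration data):

```python
import math

# Hypothetical right-skewed counts, including zeros
counts = [0, 1, 1, 2, 3, 3, 5, 8, 14]

# Zeros are present, so use Y = sqrt(X + 0.5) rather than plain sqrt(X)
transformed = [math.sqrt(x + 0.5) for x in counts]
print([round(y, 3) for y in transformed])
```

Because the square root is monotone, the transformation preserves the ordering of the scores while compressing the long right tail.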

Logarithmic transformation

A logarithmic transformation may be useful in normalizing distributions that have
more severe positive skew than a square-root transformation can handle. Such a distribution is
termed lognormal because it can be made normal by log-transforming the values.
BIOL 4243 IUG

68
When log transforming data, we can choose to take logs either to base 10 (the
'common' log) or to base e (the 'natural' log, abbreviated ln). The log transformation is
similar to the square root transformation in that zeros and negative numbers are taboo.
Use the same technique to eliminate them. Some people use the smallest possible
value for their variable as a constant, others use an arbitrarily small number, such as
0.001 or, most commonly, 1.
The back-transformation of a log is called the antilog; the antilog of the natural log is
the exponential function, eˣ (where e ≈ 2.71828).
In medicine, the log transformation is frequently used because many variables have
right-skewed distributions.

Inverse or reciprocal transformation

A reciprocal transformation exerts the most extreme adjustment with regard to
normality. It is used to normalize severely skewed data. Accordingly, the
reciprocal transformation is often able to normalize data that the square-root and
logarithmic transformations are unable to normalize.
The reciprocal transformation is obtained through use of the equation Y = 1/X. If any
of the scores are equal to zero, the equation Y = 1/(X + 1) should be employed. When
inverted, large numbers become small, and small numbers become large.
It's possible that you chose a transformation that overcorrected and turned a moderate
right skew into a moderate left one. This gains you nothing except heartache. If
this has happened, go back and try a less "powerful" transformation: perhaps square
root rather than log, or log rather than reciprocal.
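The "strength" ordering (root < log < reciprocal) can be seen by applying each transform to the same right-skewed numbers and recomputing the skewness statistic from section 6 (a sketch; the sample data are made up for illustration):

```python
import math
import statistics

def skewness(data):
    """Adjusted Fisher-Pearson skewness (formula from section 6)."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in data)

raw = [1, 1, 2, 2, 3, 5, 8, 20, 55]   # strongly right-skewed, no zeros

for name, f in [("raw", lambda x: x),
                ("sqrt", math.sqrt),
                ("log", math.log),
                ("reciprocal", lambda x: 1 / x)]:
    print(f"{name:10s} skewness = {skewness([f(x) for x in raw]):+.3f}")
```

Each step down the ladder pulls the skewness statistic further toward zero for these data; with a stronger transform than the data need, it can keep going and come out negative, which is the overcorrection the paragraph above warns about.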

To solve for Negative Skew

If skewness is negative, "flip" the curve over so that left-skewed curves become
right-skewed, allowing us to use the transformation procedures for positively skewed
distributions. Flipping the curve over requires reflection of the variable before
transforming. Reflection simply involves the following:
Before the data are transformed, find the maximum value (here, 9), add 1 to it (to
avoid zeros when we're done), and subtract each raw value from this
number. For example, if we started out with the numbers:
1 1 2 4 8 9
then we would subtract each number from 10 (the maximum, 9, plus 1), yielding:
9 9 8 6 2 1
We would then use the transformations for right-skewed data, rather than left-skewed.
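The reflection step above, in code:

```python
# Reflection for negatively skewed data: subtract each value from (max + 1)
data = [1, 1, 2, 4, 8, 9]
k = max(data) + 1              # 10
reflected = [k - x for x in data]
print(reflected)               # [9, 9, 8, 6, 2, 1]
```

The reflected values can then be passed to any of the right-skew transformations; remember that the ordering of scores is reversed, so interpret signs of effects accordingly.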

More transformations

Power transformation Y = X^p:

The most common type of transformation useful for biological data is the power
transformation, which transforms X to X^p, where p is a power greater than zero.
Values of p less than 1 correct right skew, which is the common situation (a
power of 2/3 is common when attempting to normalize). Values of p greater than 1
correct left skew. The square transformation (p = 2), for example, achieves the reverse
of the square-root transformation. If X is skewed to the left (negatively skewed), the
distribution of Y = X^p is often approximately normal.

For right skew, decreasing p decreases right skew. Too great a reduction of p will
overcorrect and cause left skew.
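A minimal sketch of the power transformation (illustration data made up; the specific powers are the ones mentioned above):

```python
# Power transformation Y = X**p; p < 1 eases right skew, p > 1 eases left skew
data = [1.0, 2.0, 3.0, 5.0, 9.0, 16.0]

right_fix = [x ** (2 / 3) for x in data]   # a common choice for right skew
left_fix = [x ** 2 for x in data]          # square: reverse of the square root
print([round(y, 3) for y in right_fix])
```

Note that the square-root and reciprocal transformations are themselves power transformations (p = 1/2 and p = −1), so this family unifies the earlier options.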
The arcsine (arcsin) transformation:

The arcsine of a number is the angle whose sine is that number.
The arcsine transformation (also referred to as an angular or inverse sine
transformation) is used to normalize data that are proportions between 0 and 1
or percentages between 0% and 100%. The arcsine transformation, which expresses
the value of Y in degrees, is obtained through use of the equation
Y = arcsin(√X),
where X is a proportion between 0 and 1. This means that the arcsine
transformation requires two steps.
First, obtain the square root of X.
Second, using a calculator, find the angle whose sine equals this value.

To get the arcsine value for a percentage (e.g. 50%), divide it by 100 (= 0.5), take
the square root (= 0.7071), then press the "sin-1" key on your calculator to get the
arcsine value (= 45°). To get the arcsine value for a proportion (e.g. 0.4), take the
square root (= 0.6325), then press "sin-1" to get the arcsine value (= 39.23°).
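The two calculator steps above collapse into one line of code (a standard-library sketch; the function name is my own):

```python
import math

def arcsine_degrees(p):
    """Y = arcsin(sqrt(p)), reported in degrees, for a proportion 0 <= p <= 1."""
    return math.degrees(math.asin(math.sqrt(p)))

print(round(arcsine_degrees(0.5), 2))   # 45.0, the 50% example above
print(round(arcsine_degrees(0.4), 2))   # 39.23
```

Percentages are first divided by 100 so the input is always a proportion; the output runs from 0° (proportion 0) to 90° (proportion 1).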

In Excel, after obtaining the square root of X, apply the ASIN function
and then multiply by 57.3 (i.e. 180/π) to convert the arcsine to degrees.
The range of the arcsine transformation, computed in degrees, is 0° (for a
proportion of zero) to 90° (for a proportion of 1).

After any transformation, you must re-check your data to ensure the transformation
improved the distribution of the data (or at least didn't make it any worse!).
Sometimes, log or square-root transformations can skew data just as severely in the
opposite direction.
If transformation does not bring data to a normal distribution, the investigators might
well choose a nonparametric procedure that does not make any assumptions about
the shape of the distribution.

SPSS/PC
You can use the COMPUTE command to transform the data. Select Transform -
Compute - Target Variable (input a new variable name) - Numeric Expression (input
transform formula). It's usually a good idea to make up a new variable to hold the
transformed data, rather than over-writing the original values; in this way, you can
easily undo your mistakes.
