
Skewness

Skewness is asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. Skewness can be quantified to define the extent to which a distribution differs from a normal distribution.

In a normal distribution, the graph appears as a classical, symmetrical "bell-shaped curve." The mean, or average, and the mode, or maximum point on the curve, are equal.

In a perfect normal distribution (green solid curve in the illustration below), the tails on either side of the curve are exact mirror images of each other.

When a distribution is skewed to the left (red dashed curve), the tail on the
curve's left-hand side is longer than the tail on the right-hand side, and the
mean is less than the mode. This situation is also called negative skewness.

When a distribution is skewed to the right (blue dotted curve), the tail on the
curve's right-hand side is longer than the tail on the left-hand side, and the
mean is greater than the mode. This situation is also called positive skewness.

Literature Review
Since Karl Pearson (1895), statisticians have studied the properties of various statistics of
skewness, and have discussed their utility and limitations. This research stream covers more than
a century.
For an overview, see Arnold and Groeneveld (1995), Groeneveld and Meeden (1984), and Rayner,
Best and Matthews (1995). Empirical studies have examined bias, mean squared error, Type I
error, and power for samples of various sizes drawn from various populations.
A recent study by Tabor (2010) ranked 11 different statistics in terms of their power for detecting
skewness in samples from populations with varying degrees of skewness.
MacGillivray (1986) concludes that the relative importance of the different orderings and
measures depends on circumstances, and it is unlikely that any one could be described as most
important. He notes that describing skewness is really a special case of comparing
distributions. This key point is perhaps a bit subtle for students. Students (and instructors) merely
need to bear in mind that we are not testing for symmetry in general. Rather, the (often implicit)
null hypothesis must refer to a specific symmetric population. Because the most common
reference point is the normal distribution (especially in an introductory statistics class) we will
limit our discussion accordingly.
A few software packages (e.g., Stata, Visual Statistics, early versions of Minitab) report the
traditional Fisher-Pearson coefficient of skewness:

    g1 = m3 / m2^(3/2),  where  m2 = (1/n) Σ (xi − x̄)²  and  m3 = (1/n) Σ (xi − x̄)³

Following Pearson's notation, this statistic is sometimes referred to as √b1, which is awkward
because g1 can be negative. Pearson and Hartley (1970) provide tables for g1 as a test for
departure from normality (i.e., testing the sample against one particular symmetric distribution).
Although well documented and widely referenced in the literature, this formula does not
correspond to what students will see in most software packages nowadays. Major software
packages available to educators (e.g., Minitab, Excel, SPSS, SAS) include an adjustment for
sample size, and provide the adjusted Fisher-Pearson standardized moment coefficient G1:

    G1 = [√(n(n − 1)) / (n − 2)] · g1

In large samples, g1 and G1 will be similar. This statistic is included in Excel's Data Analysis >
Descriptive Statistics and is calculated by the Excel function =SKEW(Array).
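As a sketch of how g1 and its sample-size-adjusted counterpart G1 relate, using plain Python with no libraries assumed (the sample values are purely illustrative):

```python
import math

def skewness(x):
    """Return (g1, G1): the Fisher-Pearson coefficient and its
    sample-size-adjusted version (the one Excel's =SKEW() reports)."""
    n = len(x)
    xbar = sum(x) / n
    m2 = sum((v - xbar) ** 2 for v in x) / n   # second central moment
    m3 = sum((v - xbar) ** 3 for v in x) / n   # third central moment
    g1 = m3 / m2 ** 1.5
    G1 = g1 * math.sqrt(n * (n - 1)) / (n - 2)  # adjustment -> 1 as n grows
    return g1, G1

g1, G1 = skewness([2, 3, 5, 8, 13, 21])  # a right-skewed sample: both positive
```

For a symmetric sample both statistics are zero; for small n the adjustment noticeably inflates the magnitude, which is why g1 and G1 differ most in small samples.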

An alternative form of the same statistic, shown in Wikipedia, is

    G1 = [n / ((n − 1)(n − 2))] Σ ((xi − x̄) / s)³

where s is the sample standard deviation. The adjustment for sample size approaches unity as n increases.

Joanes and Gill (1998) compare bias and mean squared error (MSE) of different measures of
skewness in samples of various sizes from normal and skewed populations. G1 is shown to
perform well, for example, having small MSE in samples from skewed populations.

A minimalist might say that:

- Its sign reflects the direction of skewness.
- It compares the sample with a normal (symmetric) distribution.
- Values far from zero suggest a non-normal (skewed) population.
- The statistic has an adjustment for sample size.
- The adjustment is of little consequence in large samples.

Horswell and Looney (1993, p. 437) note that "the performance of skewness tests is shown to be
very sensitive to the kurtosis of the underlying distribution." Few instructors say much about
kurtosis, partly because it is difficult to explain, but also because it is difficult to judge from
histograms.

Kurtosis is essentially a property of symmetric distributions (Balanda and MacGillivray 1988). Data
sets containing extreme values will not only be skewed, but also generally will be leptokurtic. We
cannot therefore speak of non-normal skewness as if it were separable from non-normal kurtosis.
The best we can do is to focus on the skewness statistic simply as one test for departure from the
symmetric normal distribution.

Older mathematical statistics textbooks (e.g., Yule and Kendall 1950; Kenney and Keeping 1954;
Clark and Schkade 1974) refer to skewness measures that directly compare either the mean and
mode, or the mean and median.

For empirical calculations, Yule and Kendall (1950, p. 161) recommend using a statistic that
compares the mean (x̄) and median (m):

    Sk2 = 3(x̄ − m) / s

The attraction of this statistic (henceforward the Pearson 2 skewness coefficient) is that it is
consistent with the intuitive approaches developed earlier. You can see its sign at a glance. It
shows how many standard deviations apart the two measures of center are.
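A minimal sketch of the Pearson 2 coefficient in Python, using only the standard library (the sample data are illustrative):

```python
import statistics

def sk2(x):
    """Pearson 2 skewness coefficient: 3 * (mean - median) / s."""
    s = statistics.stdev(x)  # sample standard deviation
    return 3 * (statistics.mean(x) - statistics.median(x)) / s

# A right-skewed sample pulls the mean above the median, so Sk2 > 0;
# a symmetric sample gives Sk2 = 0.
print(sk2([2, 3, 5, 8, 13, 21]))
```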

Hotelling and Solomons (1932) first showed that the statistic (x̄ − m)/s will lie between −1 and +1, so Sk2
will lie between −3 and +3 (although, in practice, it rarely approaches these limits). This statistic
is no longer seen in textbooks.

Arnold and Groeneveld (1995) note that Sk2 has some desirable properties (it is zero for
symmetric distributions, it is unaffected by scale shift, and it reveals either left- or right-skewness
equally well).
But two issues must be examined before suggesting a revival of this intuitively attractive statistic:
(1) we need a table of critical values for Sk2, and
(2) we should compare the power of Sk2 and G1.

H0: The population is normal with μ = 5 and σ = 1, vs. Ha: The population is skewed right with
μ = 5 and σ = 1.

Next we generated samples of size 10 from this skewed population and computed the ratio
mean/median. If the ratio is greater than 1.07 (the 95th percentile of this ratio under the null
hypothesis), then we reject the null hypothesis and correctly conclude that the population is
skewed. The proportion of times this occurred is an estimate of the power of this statistic to
detect skewness, since power is the probability of rejecting a null hypothesis, assuming that an
alternative hypothesis is true. That is, power is the probability that we decide the population is
skewed to the right when in fact it actually is skewed to the right.
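The simulation can be sketched as follows. The text does not specify which right-skewed population was used, so this example assumes a gamma distribution scaled to mean 5 and standard deviation 1; the 1.07 cutoff is the one quoted above, and the resulting power estimate applies only to this assumed population:

```python
import random
import statistics

random.seed(42)

def simulate_power(n_trials=5000, n=10, cutoff=1.07):
    """Estimate the power of the mean/median ratio test against a
    right-skewed alternative. The population here is an assumption:
    Gamma(shape=25, scale=0.2), which has mean 5 and sd 1."""
    # Gamma(k, theta): mean = k*theta, var = k*theta^2.
    # Solving mean = 5, sd = 1 gives theta = 0.2, k = 25.
    rejections = 0
    for _ in range(n_trials):
        sample = [random.gammavariate(25, 0.2) for _ in range(n)]
        if statistics.mean(sample) / statistics.median(sample) > cutoff:
            rejections += 1
    return rejections / n_trials  # proportion of rejections = estimated power

print(simulate_power())
```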
One of the more popular choices was to compare the distance from Q3 to the maximum with the
distance from Q1 to the minimum. Visually, this would be comparing the right whisker on a
boxplot to the left whisker. This could be expressed as a ratio, (max − Q3)/(Q1 − min), where values
greater than 1 would indicate right skewness.
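A sketch of this whisker-ratio statistic, using the standard library's quantiles (exclusive method; the sample data are illustrative):

```python
import statistics

def whisker_ratio(x):
    """(max - Q3) / (Q1 - min): the right boxplot whisker over the left.
    Values greater than 1 suggest right skewness."""
    # quantiles(..., n=4) returns [Q1, Q2, Q3] (exclusive method by default)
    q1, _, q3 = statistics.quantiles(x, n=4)
    return (max(x) - q3) / (q1 - min(x))
```

Note that, unlike G1 or Sk2, this ratio depends only on the two extremes and the quartiles, so a single outlier can dominate it.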

Variability can dramatically reduce your statistical power during hypothesis testing.
Statistical power is the probability that a test will detect a difference (or effect) that actually
exists.
For random samples, increasing the sample size is like increasing the resolution of a picture of the
populations. With just a few samples, the picture is so fuzzy that we'd only be able to see differences
between the most distinct of populations. However, if we collect a very large sample, the picture
becomes sharp enough to determine the difference between even very similar populations.

- It is much easier to obtain higher power values when you have lower variability.
- The sample size that you need to have an 80% chance of detecting the same difference
between the means drops dramatically with less variability.
- You should always calculate power and sample size before a study to avoid conducting a
low-power analysis, and Minitab makes this very easy to do. But you should also assess
power after a study that produced insignificant results. Use the standard deviation estimate
from the sample data, which may be more accurate than the pre-study estimate.

Statisticians have derived formulas for calculating the power of many experimental designs. They can be
useful as a back-of-the-envelope calculation of how large a sample you'll need. Be careful, though, because
the assumptions behind the formulas can sometimes be obscure, and worse, they can be wrong.

Here is a common formula used to calculate power:

    β = Φ( (|μt − μc| √N) / (2σ) − Φ⁻¹(1 − α/2) )

β is our measure of power. Because it's the probability of getting a statistically significant
result, β will be a number between 0 and 1. Φ is the CDF of the normal distribution, and Φ⁻¹ is its inverse.
Everything else in this formula, we have to plug in:

μt is the average outcome in the treatment group. Suppose it's 65.

μc is the average outcome in the control group. Suppose it's 60.

Together, assumptions about μt and μc define our assumption about the size of the treatment effect:
65 − 60 = 5.

σ is the standard deviation of outcomes. This is how we make assumptions about how noisy our
experiment will be. One of the assumptions we're making is that σ is the same for both the treatment
and control groups. Suppose σ = 20.

α is our significance level; the convention in many disciplines is that α should be equal to 0.05.

N is the total number of subjects. This is the only variable that is under the direct control of the
researcher. This formula assumes that every subject had a 50/50 chance of being in control. Suppose that
N = 500.
Working through the formula, we find that under this set of assumptions, β = 0.80, meaning that we have
an 80% chance of recovering a statistically significant result with this design. Click here for a Google
spreadsheet that includes this formula. You can copy these formulas directly into Excel. If you're
comfortable in R, here is code that will accomplish the same calculation.
power_calculator <- function(mu_t, mu_c, sigma, alpha = 0.05, N) {
  lowertail <- (abs(mu_t - mu_c) * sqrt(N)) / (2 * sigma)
  uppertail <- -1 * lowertail
  beta <- pnorm(lowertail - qnorm(1 - alpha / 2), lower.tail = TRUE) +
    1 - pnorm(uppertail - qnorm(1 - alpha / 2), lower.tail = FALSE)
  return(beta)
}
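For readers more comfortable in Python, here is a sketch of the same calculation as a cross-check of the worked example, using the standard library's NormalDist (the second term covers rejections in the wrong direction and is negligible here):

```python
from statistics import NormalDist

def power(mu_t, mu_c, sigma, N, alpha=0.05):
    """Power of a two-arm experiment with 50/50 assignment,
    mirroring the normal-approximation formula in the text."""
    nd = NormalDist()
    lower = abs(mu_t - mu_c) * N ** 0.5 / (2 * sigma)
    z = nd.inv_cdf(1 - alpha / 2)  # two-sided critical value, 1.96 for alpha = 0.05
    # Second term: rejections in the "wrong" direction; essentially zero here.
    return nd.cdf(lower - z) + nd.cdf(-lower - z)

print(round(power(65, 60, 20, 500), 2))  # roughly 0.80, matching the text
```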

Split Sample Statistics

    SSS = ln( IQR_R / IQR_L )

where IQR_R = 87.5th − 62.5th percentiles and IQR_L = 37.5th − 12.5th percentiles.
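A sketch of SSS in Python: statistics.quantiles with n=8 returns the seven octile cut points, which include the four percentiles the formula needs (the sample data are illustrative):

```python
import math
import statistics

def split_sample_skewness(x):
    """SSS = ln(IQR_R / IQR_L), where IQR_R spans the 62.5th-87.5th
    percentiles and IQR_L spans the 12.5th-37.5th percentiles.
    Positive values indicate right skewness; symmetric data give 0."""
    # quantiles(n=8) returns the 7 octile cut points:
    # 12.5th, 25th, 37.5th, 50th, 62.5th, 75th, 87.5th percentiles
    q = statistics.quantiles(x, n=8)
    iqr_r = q[6] - q[4]   # 87.5th - 62.5th percentile
    iqr_l = q[2] - q[0]   # 37.5th - 12.5th percentile
    return math.log(iqr_r / iqr_l)
```

The log transform makes the statistic symmetric around zero: mirroring a sample flips the sign of SSS.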
