GV207 Political Analysis, Week 03

Department of Government, University of Essex

Descriptive statistics 2: Variability


Looking at our variables
Whatever we do, we should always 'know' our variables. Last week we thought about their levels of
measurement, what they look like, and what their typical or average values are. We are now moving
on to think a bit more rigorously about them. So, this week we're looking at how to measure the
variability or dispersion of our variables.
Briefly, some notation before we start our journey:

Measure                                     Symbol
Population mean                             $\mu$
Population variance                         $\sigma^2$
Population standard deviation               $\sigma$
Population size (number of observations)    $N$
Sample mean (of x)                          $\bar{x}$
Sample variance                             $s^2$
Sample standard deviation                   $s$
Sample size (number of observations)        $n$

Variability with continuous variables


The standard measures of variability we use are the variance and the standard deviation. In many
ways they are the same thing: the variance is the standard deviation squared and the standard
deviation is the square root of the variance. The standard deviation is more commonly used, however,
because it measures variability in the original units of the variable.

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

The above is the equation for the sample variance of a variable. Let's decompose some of the
(potential) mystique around it:

$(x_i - \bar{x})^2$: Here we take an observation $x_i$ and subtract the sample mean $\bar{x}$. So now we know how
much this observation deviates from the variable's mean. Why do we then square this?

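What's the $\sum_{i=1}^{n}$ doing? This is the summation operator. We want to add up all of the individual
observations' squared deviations from the mean of the variable. Basically it saves us from having to write:

$$ (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2 $$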

$n - 1$ is the final piece. Here, we divide the sum of squared deviations by the number of
observations minus 1. Effectively we're taking the mean of our squared deviations.¹

If that makes sense, then you have an idea of how the variance works. And if that's the case, then the
standard deviation is easy:

$$ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$
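As a small worked illustration (the four values below are made up for this example and do not come from the course data), take $x = 2, 4, 6, 8$, so that $n = 4$:

```latex
\begin{align*}
\bar{x} &= \frac{2 + 4 + 6 + 8}{4} = 5 \\
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= (-3)^2 + (-1)^2 + 1^2 + 3^2 = 9 + 1 + 1 + 9 = 20 \\
s^2 &= \frac{20}{4 - 1} \approx 6.67 \\
s &= \sqrt{6.67} \approx 2.58
\end{align*}
```

Note that $s$ is back in the same units as $x$, whereas $s^2$ is in squared units.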

¹ Subtracting 1 makes this calculation slightly different from the usual calculation of a mean. It is
a correction needed because we are calculating the variance from a sample rather than from the whole
population; without it, the estimate of the population variance is biased downwards. For details, see
Wikipedia: http://en.wikipedia.org/wiki/Bessel%27s_correction.

Another way to think about how this works is to imagine changing one of the values in the top half of
the equation, whilst holding everything else constant, and to think about what happens to the variance ($s^2$)
and standard deviation ($s$). What happens if we increase $x_i$ whilst $\bar{x}$ and $n$ stay constant? Why?

How do we do this in Stata?


What took quite a while to do in the past (and which countless maths teachers still enjoy inflicting
upon students) is a breeze in Stata. In fact we have seen the command before:
summarize varname, detail

Doing so gives us a whole bunch of output:


          human development index 2000 (undp 2004)
-------------------------------------------------------------
      Percentiles      Smallest
 1%         .323           .279
 5%         .354           .323
10%         .406           .325       Obs                 119
25%         .521           .342       Sum of Wgt.         119

50%         .751                      Mean           .7004874
                        Largest       Std. Dev.      .1879502
75%         .854            .94
90%         .932           .942       Variance       .0353253
95%         .939           .943       Skewness      -.4881419
99%         .943           .954       Kurtosis       2.059273

As well as the variance and standard deviation of our variable, it gives us the mean, the
quartiles (if we want the interquartile range), the maximum and minimum (if we want the range) and
some other things that may be interesting later on in the course. It's that simple to find out the
variability of our variables.
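The numbers in this output can also be reused rather than read off the screen. The following is a minimal sketch, assuming Stata's built-in auto data set and its price variable (neither is part of the course data); it relies on the results that summarize stores in r():

```stata
* Sketch only: uses Stata's shipped example data, not the course data set
sysuse auto, clear
summarize price, detail

* The variance is the standard deviation squared
display "Variance:            " r(Var)
display "Std. dev. squared:   " r(sd)^2

* Interquartile range and range from the stored percentiles and extremes
display "Interquartile range: " r(p75) - r(p25)
display "Range:               " r(max) - r(min)

* Show everything summarize has stored
return list
```

Replacing price with one of the variables from the course data set should work in the same way.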

Graphical visualisations of variability


Numbers are dry. We do get a pretty decent impression of what our variable looks like using these
measures. But I bet if you saw a graph instead you'd have a much better idea of how the variable is
distributed.

Histograms - nominal, ordinal, interval


histogram varname

A histogram is a way of trying to approximate the distribution of a variable. It shows how many
observations fall within each range of the variable's values. To do so it classes observations of the
variable into bins, i.e. categories. If we have a nominal or ordinal level variable, it's pretty much
equivalent to a bar chart, with the bins typically ending up just being the different categories that
the variable takes on.² For an interval level variable Stata will choose bins for us, although we can
set their widths for ourselves.
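The bin choices described above can be sketched as follows. This assumes Stata's built-in auto data set (not the course data); the width of 1000 and the 10 bins are arbitrary values chosen for illustration:

```stata
* Sketch only: uses Stata's shipped example data, not the course data set
sysuse auto, clear

* Interval-level variable: first let Stata choose the bins
histogram price, frequency

* Then set the bin width, or the number of bins, ourselves
histogram price, width(1000) frequency
histogram price, bin(10) frequency

* Categorical variable: one bar per category instead of arbitrary bins
histogram rep78, discrete frequency
```

Comparing the first three graphs shows how much the apparent shape of a distribution can depend on the bins.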

² Beware! Mindlessly running histograms in Stata without checking that the bins match the variable's
categories can lead to trouble.

kdensity varname

Kernel density estimation is effectively the same thing but, instead of bins and bars, a smooth curve is
fitted to the distribution of the variable.
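A few variations are sketched below, again assuming Stata's built-in auto data set rather than the course data. The bandwidth of 500 is an arbitrary illustrative value; it plays a role similar to the bin width of a histogram:

```stata
* Sketch only: uses Stata's shipped example data, not the course data set
sysuse auto, clear

* Kernel density with Stata's default bandwidth
kdensity price

* A hand-picked bandwidth: smaller values give a wigglier curve
kdensity price, bwidth(500)

* Overlay a normal density for comparison
kdensity price, normal

* Histogram with a kernel density curve drawn on top
histogram price, kdensity
```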

Boxplot - ordinal, continuous


graph box varname

Boxplots plot the quartiles of our variable. The box encompasses the interquartile range (Q3 - Q1, i.e.
the difference between the third quartile and the first quartile) and contains an indicator for the
median (Q2, i.e. second quartile or 50th percentile). The whiskers extend to the adjacent values: the
most extreme observations that still lie within the upper limit (Q3 + 1.5 × (Q3 - Q1)) and the
lower limit (Q1 - 1.5 × (Q3 - Q1)). Observations beyond the whiskers are plotted individually. We can
also easily compare boxplots for different groups. For example, we could compare the distribution of
GDP per capita across different Freedom House regime types.
graph box gdppc2000, over(fhcat2000)
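The limits can also be computed by hand to see which observations fall outside the whiskers. This is a minimal sketch, assuming Stata's built-in auto data set (not the course data); the scalar names fence_lo and fence_hi are made up for this example, and the quartiles from summarize are assumed to match those used by graph box:

```stata
* Sketch only: uses Stata's shipped example data, not the course data set
sysuse auto, clear
graph box price, over(foreign)

* Quartiles of the variable, stored by summarize in r()
summarize price, detail

* Limits: 1.5 times the interquartile range beyond the quartiles
scalar iqr      = r(p75) - r(p25)
scalar fence_hi = r(p75) + 1.5*iqr
scalar fence_lo = r(p25) - 1.5*iqr

* List observations outside the limits (missing values are excluded,
* because Stata treats missing as larger than any number)
list make price if (price > fence_hi | price < fence_lo) & !missing(price)
```

Note that these limits are for the variable as a whole; the grouped boxplot computes separate quartiles within each category.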

Stata exercise
We will use the data set Democracy small.dta and summarise some variables using the techniques
discussed in the lecture and this handout.
1. Load up the data set and open a do-file for you to write the commands in. (Don't forget to save
this file at the end!)
2. First find out what the variables in our data set are measuring by using the describe command.
3. Summarise all of the variables in the data set using summarize.
4. Find a continuous variable. Then calculate the mean, median, variance and standard deviation of
this variable using the summarize command with the detail option.
5. Do the mean and the median differ? If so, what could be the reason?
6. Using the same variable, create a histogram and/or a kernel density plot. Does the variable look
normally distributed? How does this relate to the mean and the median?
7. Also create a boxplot. Are there any outliers outside of the whiskers? If so, use the list command
in combination with the if-condition to find out which countries they are.
8. Find a dummy, nominal or ordinal variable (but without too many categories). Create the boxplot
for your continuous variable again for each category of the new variable chosen, using the over
option.
9. Find an ordinal variable. Find out the 1st, 2nd and 3rd quartile using the summarize command
with the detail option.
10. Plot a histogram of this variable to get an idea of its variability.
