Symbols used: N; s² (sample variance); s (sample standard deviation); n (number of observations in the sample).
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The above is the equation for the sample variance of a variable. Let's decompose some of the
(potential) mystique around it:

$(x_i - \bar{x})^2$: Here we take an observation $x_i$ and subtract the sample mean $\bar{x}$. So now we
know how much this observation deviates from the variable's mean. Why do we then square this?

What's the $\sum$ doing? This is the summation operator. We want to add up all of the individual
observations' squared deviations from the mean of the variable. Basically it saves us from having to write:

$$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2$$

$n - 1$ is the final piece. Here, we divide the sum of squared deviations by the number of
observations minus 1. Effectively we're taking the mean of our squared deviations. [1]

If that makes sense then you have an idea of how the variance works. And if that's the case then the
standard deviation is easy:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
[1] Subtracting 1 makes this calculation slightly different from the way we usually calculate a mean. This is
a correction for the fact that we are calculating the variance (and hence the standard deviation) from a
sample; without it the estimate is biased. For details, see Wikipedia:
http://en.wikipedia.org/wiki/Bessel%27s_correction.
Another way to think about how this works is to imagine changing one of the values in the top half of
the equation, whilst holding everything else constant, and think of what happens to the variance ($s^2$)
and standard deviation ($s$). What happens if we increase $x_i$ whilst $\bar{x}$ and $n$ stay constant? Why?
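The pieces of the variance calculation can also be checked by hand in Stata. The following is a minimal sketch for a do-file, assuming a made-up variable x with the four values 2, 4, 4 and 6. These numbers are purely illustrative and do not come from the course data, and sqdev is likewise a hypothetical variable name.

```stata
* Illustrative data only: four made-up observations
clear
input x
2
4
4
6
end

* Step 1: squared deviation of each observation from the sample mean
quietly summarize x
generate sqdev = (x - r(mean))^2

* Step 2: add up the squared deviations and divide by n - 1
quietly summarize sqdev
display "variance by hand:  " r(sum) / (r(N) - 1)

* Step 3: compare with the values Stata stores after summarize
quietly summarize x
display "variance (Stata):  " r(Var)
display "std. dev. (Stata): " r(sd)
```

With these numbers the mean is 4, the squared deviations sum to 8 and n - 1 = 3, so both variance lines should show roughly 2.67 and the standard deviation roughly 1.63.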
      Percentiles      Smallest
 1%         .323           .279
 5%         .354           .323
10%         .406           .325       Obs                 119
25%         .521           .342       Sum of Wgt.         119

50%         .751                      Mean           .7004874
                        Largest       Std. Dev.      .1879502
75%         .854            .94
90%         .932           .942       Variance       .0353253
95%         .939           .943       Skewness      -.4881419
99%         .943           .954       Kurtosis       2.059273
As well as telling us the variance and standard deviation of our variable, the output also gives us the
mean, the quartiles (if we want the interquartile range), the maximum and minimum (if we want the range)
and some other things that may be interesting later on in the course. It's that simple to find out the
variability of our variables.
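As a hedged illustration of how such output can be produced and reused, the sketch below runs summarize with the detail option and then pulls a few of the stored results back out. It assumes Stata's bundled auto example data and its mpg variable rather than the course data set, so the numbers will differ from those shown above.

```stata
* Assumption: Stata's bundled example data, not the course data set
sysuse auto, clear

* Detailed summary: percentiles, mean, std. dev., variance, skewness, kurtosis
summarize mpg, detail

* summarize leaves its results behind in r(); a few of them:
display "variance            = " r(Var)
display "standard deviation  = " r(sd)
display "interquartile range = " r(p75) - r(p25)
display "range               = " r(max) - r(min)
```

The interquartile range and the range are not printed directly, which is why they are computed here from the stored quartiles and extremes.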
A histogram is a way of trying to approximate the distribution of a variable. It shows how frequently
observations fall within each range of values of the variable. To do so it classes observations of the
variable into bins, i.e. categories. If we have a nominal or ordinal level variable, it's pretty much
equivalent to a bar chart, with the bins typically ending up just being the different categories that the
variable takes on. [2] For an interval level variable Stata will choose bins for us, although we can set
their widths for ourselves.
[2] Beware! Mindlessly running histograms in Stata without checking how the bins have been chosen can lead
to trouble.
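The bin choices described above can be sketched as follows. This is only an illustration, assuming Stata's bundled auto example data (mpg as an interval level variable, rep78 as a variable with only a few distinct values) and an arbitrary bin width of 5.

```stata
* Assumption: Stata's bundled example data, not the course data set
sysuse auto, clear

histogram mpg                    // Stata chooses the bins for us
histogram mpg, width(5)          // we set the bin width ourselves (5 is arbitrary)
histogram mpg, frequency         // counts on the y-axis instead of densities
histogram rep78, discrete        // one bar per distinct value of the variable
```

Comparing the first two graphs is a quick way to see how much the apparent shape of the distribution depends on the bins.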
kdensity varname
Kernel density estimation does effectively the same thing but, instead of bins and bars, a smooth curve is
fitted to the distribution of the variable.
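To see how the two approaches relate, a kernel density curve can be drawn on its own or laid over a histogram. The sketch below again assumes Stata's bundled auto example data and its mpg variable, not the course data.

```stata
* Assumption: Stata's bundled example data, not the course data set
sysuse auto, clear

kdensity mpg                     // kernel density estimate on its own
histogram mpg, kdensity          // histogram with the density curve overlaid
```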
Boxplots plot the quartiles of our variable. The box encompasses the interquartile range (Q3 − Q1, i.e.
the difference between the third quartile and the first quartile) and contains an indicator for the median
(Q2, i.e. the second quartile or 50th percentile). The whiskers extend to the most extreme observations
within the upper (Q3 + 1.5 × (Q3 − Q1)) and lower (Q1 − 1.5 × (Q3 − Q1)) adjacent limits. We can also
easily compare boxplots for different groups. For example, we could compare the distribution of GDP per
capita across different Freedom House regime types.
graph box gdppc2000, over(fhcat2000)
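The whisker limits can also be worked out numerically, which helps when identifying the observations plotted beyond them. The following is a minimal sketch, assuming Stata's bundled auto example data (variables make and price) rather than the course data; the scalar names iqr, hi_fence and lo_fence are hypothetical.

```stata
* Assumption: Stata's bundled example data, not the course data set
sysuse auto, clear

* Quartiles are stored in r(p25) and r(p75) after summarize, detail
quietly summarize price, detail
scalar iqr      = r(p75) - r(p25)
scalar hi_fence = r(p75) + 1.5 * iqr
scalar lo_fence = r(p25) - 1.5 * iqr

display "upper limit = " hi_fence
display "lower limit = " lo_fence

* List the observations lying outside the limits (ignoring missing values)
list make price if (price > hi_fence | price < lo_fence) & !missing(price)
```

The listed observations should correspond to the points drawn beyond the whiskers in graph box price, although it is worth checking the two against each other rather than taking this for granted.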
Stata exercise
We will use the data set Democracy small.dta and summarise some variables using the techniques
discussed in the lecture and this handout.
1. Load up the data set and open a do-file for you to write the commands in. (Don't forget to save
this file at the end!)
2. First find out what the variables in our data set are measuring by using the describe command.
3. Summarise all of the variables in the data set using summarize.
4. Find a continuous variable. Then calculate the mean, median, variance and standard deviation of
this variable using the summarize command with the detail option.
5. Do the mean and the median differ? If so, what could be the reason why?
6. Using the same variable, create a histogram and/or a kernel density plot. Does the variable look
normally distributed? How does this relate to the mean and the median?
7. Also create a boxplot. Are there any outliers outside of the whiskers? If so, use the list command
in combination with the if-condition to find out which countries they are.
8. Find a dummy, nominal or ordinal variable (but with not too many categories). Create the boxplot
for your continuous variable again for each category of the new variable chosen, using the over
option.
9. Find an ordinal variable. Find out the 1st, 2nd and 3rd quartile using the summarize command
with the detail option.
10. Plot a histogram of this variable to get an idea of its variability.