Descriptive statistics are used to describe the basic features of the data in a study. They
provide simple summaries about the sample and the measures. Together with simple
graphics analysis, they form the basis of virtually every quantitative analysis of data.
Every time you try to describe a large set of observations with a single indicator you run
the risk of distorting the original data or losing important detail. The batting average
doesn't tell you whether the batter is hitting home runs or singles. It doesn't tell whether
she's been in a slump or on a streak. The GPA doesn't tell you whether the student was in
difficult courses or easy ones, or whether they were courses in their major field or in
other disciplines. Even given these limitations, descriptive statistics provide a powerful
summary that may enable comparisons across people or other units.
Univariate Analysis
Univariate analysis involves the examination across cases of one variable at a time. There
are three major characteristics of a single variable that we tend to look at:
• the distribution
• the central tendency
• the dispersion
In most situations, we would describe all three of these characteristics for each of the
variables in our study.
One of the most common ways to describe a single variable is with a frequency
distribution. Depending on the particular variable, all of the data values may be
represented, or you may group the values into categories first (e.g., with age, price, or
temperature variables, it would usually not be sensible to determine the frequencies for
each value; rather, the values are grouped into ranges and the frequencies determined for
each range).
Frequency distributions can be depicted in two ways, as a table or as a graph. Table 1
shows an age frequency distribution with five categories of age ranges defined. The same
frequency distribution can be depicted in a graph as shown in Figure 2. This type of
graph is often referred to as a histogram or bar chart.
Figure 2. Frequency distribution bar chart.
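Building a grouped frequency distribution like the one described above is easy to sketch in code. The ages below are hypothetical, made up purely for illustration (they are not the data behind the table):

```python
from collections import Counter

# Hypothetical ages for illustration only
ages = [23, 45, 36, 52, 41, 29, 33, 47, 58, 39, 26, 44]

def age_range(age):
    # Group each age into a ten-year range, e.g. 36 -> "30-39"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

# Count how many cases fall into each range
frequencies = Counter(age_range(a) for a in ages)
for rng in sorted(frequencies):
    print(rng, frequencies[rng])
```

Printing the sorted counts gives the tabular form of the frequency distribution; a bar chart of the same counts would give the graphical form.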
Distributions may also be displayed using percentages. For example, you could use
percentages to describe the percentage of cases that fall within each category or range.
Central Tendency. The central tendency of a distribution is an estimate of the "center"
of a distribution of values. There are three major types of estimates of central tendency:
• Mean
• Median
• Mode
The Mean or average is probably the most commonly used method of describing central
tendency. To compute the mean all you do is add up all the values and divide by the
number of values. For example, the mean or average quiz score is determined by
summing all the scores and dividing by the number of students taking the exam. For
example, consider the test score values:
15,20,21,20,36,15,25,15
The sum of these 8 values is 167, so the mean is 167 / 8 = 20.875.
The Median is the score found at the exact middle of the set of values. One way to
compute the median is to list all scores in numerical order, and then locate the score in
the center of the sample. For example, if there are 500 scores in the list, score #250 would
be the median. If we order the 8 scores shown above, we would get:
15,15,15,20,20,21,25,36
There are 8 scores and score #4 and #5 represent the halfway point. Since both of these
scores are 20, the median is 20. If the two middle scores had different values, you would
have to interpolate to determine the median.
The mode is the most frequently occurring value in the set of scores. To determine the
mode, you might again order the scores as shown above, and then count each one. The
most frequently occurring value is the mode. In our example, the value 15 occurs three
times and is the mode. In some distributions there is more than one modal value. For
instance, in a bimodal distribution there are two values that occur most frequently.
Notice that for the same set of 8 scores we got three different values -- 20.875, 20, and 15
-- for the mean, median and mode respectively. If the distribution is truly normal (i.e.,
bell-shaped), the mean, median and mode are all equal to each other.
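The three measures of central tendency can be checked directly with Python's standard library, using the same eight scores:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)      # sum of values / number of values
median = statistics.median(scores)  # middle of the ordered scores
mode = statistics.mode(scores)      # most frequently occurring value

print(mean, median, mode)  # 20.875 20.0 15
```

Note that `statistics.median` performs the interpolation step automatically when there is an even number of scores, averaging the two middle values.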
Dispersion. Dispersion refers to the spread of the values around the central tendency.
There are two common measures of dispersion, the range and the standard deviation. The
range is simply the highest value minus the lowest value. In our example distribution, the
high value is 36 and the low is 15, so the range is 36 - 15 = 21.
The Standard Deviation is a more accurate and detailed estimate of dispersion because
an outlier can greatly exaggerate the range (as was true in this example, where the single
outlier value of 36 stands apart from the rest of the values). The Standard Deviation shows
the relation the set of scores has to the mean of the sample. Again, let's take the set of
scores:
15,20,21,20,36,15,25,15
to compute the standard deviation, we first find the distance between each value and the
mean. We know from above that the mean is 20.875. So, the differences from the mean
are:
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = 15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875
Notice that values that are below the mean have negative discrepancies and values above
it have positive ones. Next, we square each discrepancy:
(-5.875)² = 34.515625
(-0.875)² = 0.765625
(+0.125)² = 0.015625
(-0.875)² = 0.765625
(+15.125)² = 228.765625
(-5.875)² = 34.515625
(+4.125)² = 17.015625
(-5.875)² = 34.515625
Now, we take these "squares" and sum them to get the Sum of Squares (SS) value. Here,
the sum is 350.875. Next, we divide this sum by the number of scores minus 1. Here, the
result is 350.875 / 7 = 50.125. This value is known as the variance. To get the standard
deviation, we take the square root of the variance (remember that we squared the
deviations earlier). This would be SQRT(50.125) = 7.079901129253.
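The same step-by-step computation can be written out in code, mirroring each stage described above:

```python
import math

scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean = sum(scores) / len(scores)  # 20.875

# Distance between each value and the mean
deviations = [x - mean for x in scores]

# Square each discrepancy, then sum them to get the Sum of Squares (SS)
squares = [d ** 2 for d in deviations]
ss = sum(squares)                  # 350.875

# Divide by the number of scores minus 1 to get the variance
variance = ss / (len(scores) - 1)  # 50.125

# The standard deviation is the square root of the variance
std_dev = math.sqrt(variance)      # ≈ 7.0799
```

The result agrees with `statistics.stdev(scores)`, which performs the same n − 1 (sample) calculation in one call.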
Although this computation may seem convoluted, it's actually quite simple. To see this,
consider the formula for the standard deviation:
s = √[ Σ(X − X̄)² / (n − 1) ]
In the top part of the ratio, the numerator, we see that each score has the mean
subtracted from it, the difference is squared, and the squares are summed. In the bottom
part, we take the number of scores minus 1. The ratio is the variance and the square root
is the standard deviation. In English, we can describe the standard deviation as:
the square root of the sum of the squared deviations from the mean divided by the
number of scores minus one
Although we can calculate these univariate statistics by hand, it gets quite tedious when
you have more than a few values and variables. Every statistics program is capable of
calculating them easily for you. For instance, I put the eight scores into SPSS and got the
following table as a result:
N 8
Mean 20.8750
Median 20.0000
Mode 15.00
Std. Deviation 7.0799
Variance 50.1250
Range 21.00
The standard deviation allows us to reach some conclusions about specific scores in our
distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to
it!), the following conclusions can be reached:
• approximately 68% of the scores in the sample fall within one standard deviation
of the mean
• approximately 95% of the scores in the sample fall within two standard deviations
of the mean
• approximately 99% of the scores in the sample fall within three standard
deviations of the mean
For instance, since the mean in our example is 20.875 and the standard deviation is
7.0799, we can estimate from the statements above that approximately 95% of the scores
will fall in the range of 20.875 - (2*7.0799) to 20.875 + (2*7.0799), or between 6.7152 and
35.0348. This kind of information is a critical stepping stone to enabling us to compare
the performance of an individual on one variable with their performance on another, even
when the variables are measured on entirely different scales.
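The 95% range quoted above can be reproduced directly:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)

# Under a roughly normal distribution, about 95% of scores fall
# within two standard deviations of the mean
low, high = mean - 2 * sd, mean + 2 * sd
print(round(low, 4), round(high, 4))  # 6.7152 35.0348
```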
Correlation
The correlation is one of the most common and most useful statistics. A correlation is a
single number that describes the degree of relationship between two variables. Let's work
through an example to show you how this statistic is computed.
Correlation Example
Let's assume that we want to look at the relationship between two variables, height (in
inches) and self esteem. Perhaps we have a hypothesis that how tall you are affects your
self esteem (incidentally, I don't think we have to worry about the direction of causality
here -- it's not likely that self esteem causes your height!). Let's say we collect some
information on twenty individuals (all male -- we know that the average height differs for
males and females so, to keep this example simple we'll just use males). Height is
measured in inches. Self esteem is measured based on the average of 10 1-to-5 rating
items (where higher scores mean higher self esteem). Here's the data for the 20 cases
(don't take this too seriously -- I made this data up to illustrate what a correlation is):
Now, let's take a quick look at the histogram for each variable:
And, here are the descriptive statistics:
What does a "positive relationship" mean in this context? It means that, in general, higher
scores on one variable tend to be paired with higher scores on the other and that lower
scores on one variable tend to be paired with lower scores on the other. You should
confirm visually that this is generally true in the plot above.
We use the symbol r to stand for the correlation. Through the magic of mathematics it
turns out that r will always be between -1.0 and +1.0. If the correlation is negative, we
have a negative relationship; if it's positive, the relationship is positive. You don't need to
know how we came up with this formula unless you want to be a statistician. But you
probably will need to know how the formula relates to real data -- how you can use the
formula to compute the correlation. Let's look at the data we need for the formula. Here's
the original data with the other necessary columns:
Person  Height (x)  Self Esteem (y)  x*y  x*x  y*y
1 68 4.1 278.8 4624 16.81
2 71 4.6 326.6 5041 21.16
3 62 3.8 235.6 3844 14.44
4 75 4.4 330 5625 19.36
5 58 3.2 185.6 3364 10.24
6 60 3.1 186 3600 9.61
7 67 3.8 254.6 4489 14.44
8 68 4.1 278.8 4624 16.81
9 71 4.3 305.3 5041 18.49
10 69 3.7 255.3 4761 13.69
11 68 3.5 238 4624 12.25
12 67 3.2 214.4 4489 10.24
13 63 3.7 233.1 3969 13.69
14 62 3.3 204.6 3844 10.89
15 60 3.4 204 3600 11.56
16 63 4 252 3969 16
17 65 4.1 266.5 4225 16.81
18 67 3.8 254.6 4489 14.44
19 63 3.4 214.2 3969 11.56
20 61 3.6 219.6 3721 12.96
Sum = 1308 75.1 4937.6 85912 285.45
The first three columns are the same as in the table above. The next three columns are
simple computations based on the height and self esteem data. The bottom row consists
of the sum of each column. This is all the information we need to compute the
correlation. Here are the values from the bottom row of the table (where N is 20 people)
as they are related to the symbols in the formula:
N = 20
Σxy = 4937.6
Σx = 1308
Σy = 75.1
Σx² = 85912
Σy² = 285.45
The raw-score formula for the correlation is:
r = (N·Σxy − Σx·Σy) / √[(N·Σx² − (Σx)²) · (N·Σy² − (Σy)²)]
Now, when we plug these values into the formula, we get the following (I
show it here tediously, one step at a time):
r = (20 × 4937.6 − 1308 × 75.1) / √[(20 × 85912 − 1308²) × (20 × 285.45 − 75.1²)]
r = (98752 − 98230.8) / √[(1718240 − 1710864) × (5709 − 5640.01)]
r = 521.2 / √(7376 × 68.99)
r = 521.2 / √508870.24
r = 521.2 / 713.35
r = .73
So, the correlation for our twenty cases is .73, which is a fairly strong positive
relationship. I guess there is a relationship between height and self esteem, at least in this
made up data!
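The same raw-score formula, applied to the twenty cases from the table, can be written out in code:

```python
import math

# Height (x) and self esteem (y) for the 20 cases in the table above
x = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
     68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
y = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
     3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))   # Σxy
sum_xx = sum(a * a for a in x)              # Σx²
sum_yy = sum(b * b for b in y)              # Σy²

# Raw-score formula for the Pearson product-moment correlation
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
r = numerator / denominator
print(round(r, 2))  # 0.73
```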
Once you've computed a correlation, you can determine the probability that the observed
correlation occurred by chance. That is, you can conduct a significance test, where the
null hypothesis is that there is no relationship (r = 0) and the alternative is that there is
one (r ≠ 0). The easiest way to test this hypothesis is to find a statistics book that has a table of
critical values of r. Most introductory statistics texts would have a table like this. As in all
hypothesis testing, you need to first determine the significance level. Here, I'll use the
common significance level of alpha = .05. This means that I am conducting a test where
the odds that the correlation is a chance occurrence are no more than 5 out of 100. Before I
look up the critical value in a table I also have to compute the degrees of freedom or df.
The df is simply equal to N-2 or, in this example, is 20-2 = 18. Finally, I have to decide
whether I am doing a one-tailed or two-tailed test. In this example, since I have no strong
prior theory to suggest whether the relationship between height and self esteem would be
positive or negative, I'll opt for the two-tailed test. With these three pieces of information
-- the significance level (alpha = .05)), degrees of freedom (df = 18), and type of test
(two-tailed) -- I can now test the significance of the correlation I found. When I look up
this value in the handy little table at the back of my statistics book I find that the critical
value is .4438. This means that if my correlation is greater than .4438 or less than -.4438
(remember, this is a two-tailed test) I can conclude that the odds are less than 5 out of 100
that this is a chance occurrence. Since my correlation of .73 is actually quite a bit higher,
I conclude that it is not a chance finding and that the correlation is "statistically
significant" (given the parameters of the test). I can reject the null hypothesis and accept
the alternative.
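The decision rule itself is simple enough to sketch; the critical value .4438 is the one quoted above for alpha = .05, df = 18, two-tailed:

```python
# Values from the example above
r = 0.73
critical_value = 0.4438  # alpha = .05, df = 18, two-tailed

# Two-tailed test: significant if r is greater than the critical value
# or less than its negative
significant = r > critical_value or r < -critical_value
print(significant)  # True
```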
The Correlation Matrix
To see what happens when we have more than two variables, I used a simple statistics
program to generate random data for 10 variables with 20 cases
(i.e., persons) for each variable. Then, I told the program to compute the correlations
among these variables. Here's the result:
     C1      C2      C3      C4      C5      C6      C7      C8      C9      C10
C1   1.000
C2   0.274   1.000
C3  -0.134  -0.269   1.000
C4   0.201  -0.153   0.075   1.000
C5  -0.129  -0.166   0.278  -0.011   1.000
C6  -0.095   0.280  -0.348  -0.378  -0.009   1.000
C7   0.171  -0.122   0.288   0.086   0.193   0.002   1.000
C8   0.219   0.242  -0.380  -0.227  -0.551   0.324  -0.082   1.000
C9   0.518   0.238   0.002   0.082  -0.015   0.304   0.347  -0.013   1.000
C10  0.299   0.568   0.165  -0.122  -0.106  -0.169   0.243   0.014   0.352   1.000
This type of table is called a correlation matrix. It lists the variable names (C1-C10)
down the first column and across the first row. The diagonal of a correlation matrix (i.e.,
the numbers that go from the upper left corner to the lower right) always consists of ones.
That's because these are the correlations between each variable and itself (and a variable
is always perfectly correlated with itself). This statistical program only shows the lower
triangle of the correlation matrix. In every correlation matrix there are two triangles that
are the values below and to the left of the diagonal (lower triangle) and above and to the
right of the diagonal (upper triangle). There is no reason to print both triangles because
the two triangles of a correlation matrix are always mirror images of each other (the
correlation of variable x with variable y is always equal to the correlation of variable y
with variable x). When a matrix has this mirror-image quality above and below the
diagonal we refer to it as a symmetric matrix. A correlation matrix is always a symmetric
matrix.
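Because the two triangles mirror each other, the lower triangle alone determines the whole matrix. As a small sketch, here is how the full symmetric matrix can be rebuilt from the lower triangle for just the first three variables (C1-C3) above:

```python
# Lower triangle for C1-C3, taken from the matrix above
lower = [
    [1.000],
    [0.274, 1.000],
    [-0.134, -0.269, 1.000],
]

# Mirror the lower triangle into a full symmetric matrix:
# the correlation of x with y equals the correlation of y with x
n = len(lower)
full = [[lower[i][j] if j <= i else lower[j][i] for j in range(n)]
        for i in range(n)]

# Every entry equals its mirror image across the diagonal
assert all(full[i][j] == full[j][i] for i in range(n) for j in range(n))
```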
To locate the correlation for any pair of variables, find the value in the table for the row
and column intersection for those two variables. For instance, to find the correlation
between variables C5 and C2, I look for where row C2 and column C5 is (in this case it's
blank because it falls in the upper triangle area) and where row C5 and column C2 is and,
in the second case, I find that the correlation is -.166.
OK, so how did I know that there are 45 unique correlations when we have 10 variables?
There's a handy simple little formula that tells how many pairs (e.g., correlations) there
are for any number of variables:
Number of pairs = N(N − 1) / 2
where N is the number of variables. In the example, I had 10 variables, so I know I have
(10 * 9)/2 = 90/2 = 45 pairs.
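The pairs formula is a one-liner:

```python
def number_of_pairs(n_variables):
    # N * (N - 1) / 2 unique pairs (correlations) among N variables
    return n_variables * (n_variables - 1) // 2

print(number_of_pairs(10))  # 45
```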
Other Correlations
The specific type of correlation I've illustrated here is known as the Pearson Product
Moment Correlation. It is appropriate when both variables are measured at an interval
level. However, there are a wide variety of other types of correlations for other
circumstances. For instance, if you have two ordinal variables, you could use the
Spearman Rank Order Correlation (rho) or the Kendall Rank Order Correlation (tau). When
one measure is a continuous interval-level one and the other is dichotomous (i.e., two-
category) you can use the Point-Biserial Correlation.
Inferential Statistics
With inferential statistics, you are trying to reach conclusions that extend beyond the
immediate data alone. For instance, we use inferential statistics to try to infer from the
sample data what the population might think. Or, we use inferential statistics to make
judgments of the probability that an observed difference between groups is a dependable
one or one that might have happened by chance in this study. Thus, we use inferential
statistics to make inferences from our data to more general conditions; we use descriptive
statistics simply to describe what's going on in our data.
Here, I concentrate on inferential statistics that are useful in experimental and quasi-
experimental research design or in program outcome evaluation. Perhaps one of the
simplest inferential tests is used when you want to compare the average performance of
two groups on a single measure to see if there is a difference. You might want to know
whether eighth-grade boys and girls differ in math test scores or whether a program group
differs on the outcome measure from a control group. Whenever you wish to compare the
average performance between two groups you should consider the t-test for differences
between groups.
Most of the major inferential statistics come from a general family of statistical models
known as the General Linear Model. This includes the t-test, Analysis of Variance
(ANOVA), Analysis of Covariance (ANCOVA), regression analysis, and many of the
multivariate methods like factor analysis, multidimensional scaling, cluster analysis,
discriminant function analysis, and so on. Given the importance of the General Linear
Model, it's a good idea for any serious social researcher to become familiar with its
workings. The discussion of the General Linear Model here is very elementary and only
considers the simplest straight-line model. However, it will get you familiar with the idea
of the linear model and help prepare you for the more complex analyses described below.
One of the keys to understanding how groups are compared is embodied in the notion of
the "dummy" variable. The name doesn't suggest that we are using variables that aren't
very smart or, even worse, that the analyst who uses them is a "dummy"! Perhaps these
variables would be better described as "proxy" variables. Essentially a dummy variable is
one that uses discrete numbers, usually 0 and 1, to represent different groups in your
study. Dummy variables are a simple idea that enable some pretty complicated things to
happen. For instance, by including a simple dummy variable in a model, I can model
two separate lines (one for each treatment group) with a single equation. To see how this
works, check out the discussion on dummy variables.
One of the most important analyses in program outcome evaluations involves comparing
the program and non-program group on the outcome variable or variables. How we do
this depends on the research design we use. Research designs are divided into two major
types of designs: experimental and quasi-experimental. Because the analyses differ for
each, they are presented separately.
When you've investigated these various analytic models, you'll see that they all come
from the same family -- the General Linear Model. An understanding of that model will
go a long way to introducing you to the intricacies of data analysis in applied and social
research contexts.
The T-Test
The t-test assesses whether the means of two groups are statistically different from each
other. This analysis is appropriate whenever you want to compare the means of two
groups, and especially appropriate as the analysis for the posttest-only two-group
randomized experimental design.
Figure 1. Idealized distributions for treated and comparison group posttest values.
Figure 1 shows the distributions for the treated (blue) and control (green) groups in a
study. Actually, the figure shows the idealized distribution -- the actual distribution
would usually be depicted with a histogram or bar graph. The figure indicates where the
control and treatment group means are located. The question the t-test addresses is
whether the means are statistically different.
What does it mean to say that the averages for two groups are statistically different?
Consider the three situations shown in Figure 2. The first thing to notice about the three
situations is that the difference between the means is the same in all three. But, you
should also notice that the three situations don't look the same -- they tell very different
stories. The top example shows a case with moderate variability of scores within each
group. The second situation shows the high variability case. The third shows the case with
low variability. Clearly, we would conclude that the two groups appear most different or
distinct in the bottom or low-variability case. Why? Because there is relatively little
overlap between the two bell-shaped curves. In the high variability case, the group
difference appears least striking because the two bell-shaped distributions overlap so
much.
This leads us to a very important conclusion: when we are looking at the differences
between scores for two groups, we have to judge the difference between their means
relative to the spread or variability of their scores. The t-test does just this.
The formula for the t-test is a ratio: the top part is the difference between the two group
means, and the bottom part is a measure of the variability or dispersion of the scores:
t = (treatment group mean − control group mean) / standard error of the difference
The top part of the formula is easy to compute -- just find the difference between the
means. The bottom part is called the standard error of the difference. To compute it,
we take the variance for each group and divide it by the number of people in that group.
We add these two values and then take their square root. The specific formula is given in
Figure 4:
SE(difference) = √(var_T / n_T + var_C / n_C)
Figure 4. Formula for the standard error of the difference between the means.
Remember that the variance is simply the square of the standard deviation.
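The whole t-value calculation can be sketched in a few lines. The two sets of posttest scores below are hypothetical, invented purely to illustrate the computation:

```python
import math
import statistics

# Hypothetical posttest scores for a treatment and a control group
treatment = [82, 75, 88, 90, 78, 85, 80, 86]
control = [70, 74, 68, 77, 72, 65, 73, 69]

mean_t, mean_c = statistics.mean(treatment), statistics.mean(control)
var_t, var_c = statistics.variance(treatment), statistics.variance(control)

# Standard error of the difference: each group's variance divided by its
# group size, summed, then square-rooted
se_diff = math.sqrt(var_t / len(treatment) + var_c / len(control))

# The t-value: difference between the means over the standard error
t = (mean_t - mean_c) / se_diff
```

A larger t means the difference between the means is large relative to the variability of the scores, which is exactly the comparison Figure 2 illustrates.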
The t-test, one-way Analysis of Variance (ANOVA) and a form of regression analysis are
mathematically equivalent (see the statistical analysis of the posttest-only randomized
experimental design) and would yield identical results.
Dummy Variables
A dummy variable is a numerical variable used in regression analysis to represent
subgroups of the sample in your study. In research design, a dummy variable is often
used to distinguish different treatment groups. In the simplest case, we would use a 0,1
dummy variable where a person is given a value of 0 if they are in the control group or a
1 if they are in the treated group. Dummy variables are useful because they enable us to
use a single regression equation to represent multiple groups. This means that we don't
need to write out separate equation models for each subgroup. The dummy variables act
like 'switches' that turn various parameters on and off in an equation. Another advantage
of a 0,1 dummy-coded variable is that even though it is a nominal-level variable you can
treat it statistically like an interval-level variable (if this made no sense to you, you
probably should refresh your memory on levels of measurement). For instance, if you
take an average of a 0,1 variable, the result is the proportion of 1s in the distribution.
To illustrate dummy variables, consider the simple regression model for a posttest-only
two-group randomized experiment:
yi = β0 + β1·Zi + ei
where yi is the posttest score for case i, Zi is the dummy variable (0 for the control
group, 1 for the treatment group), and ei is the error term. This model is essentially the
same as conducting a t-test on the posttest means for two groups or conducting a one-way
Analysis of Variance (ANOVA). The key term in the model is β1, the estimate of the
difference between the groups. To see how dummy variables work, we'll use this simple
model to show you how
to use them to pull out the separate sub-equations for each subgroup. Then we'll show
how you estimate the difference between the subgroups by subtracting their respective
equations. You'll see that we can pack an enormous amount of information into a single
equation using dummy variables. All I want to show you here is that β1 is the difference
between the treatment and control groups.
To see this, the first step is to compute what the equation would be for each of our two
groups separately. For the control group, Z = 0. When we substitute that into the
equation, and recognize that by assumption the error term averages to 0, we find that the
predicted value for the control group is β0, the intercept. Now, to figure out the
treatment group line, we substitute the value of 1 for Z, again recognizing that by
assumption the error term averages to 0. The equation for the treatment group indicates
that the treatment group value is the sum of the two beta values, β0 + β1.
Now, we're ready to move on to the second step -- computing the difference between the
groups. How do we determine that? Well, the difference must be the difference between
the equations for the two groups that we worked out above. In other words, to find the
difference between the groups we just find the difference between the equations for the
two groups! Subtracting the control group equation (β0) from the treatment group
equation (β0 + β1) leaves β1. Think about what this means. The difference between the
groups is β1. OK, one more time just for the sheer heck of it. The difference between the
groups in this model is β1!
Whenever you have a regression model with dummy variables, you can always see how
the variables are being used to represent multiple subgroup equations by following the
two steps described above:
• create separate equations for each subgroup by substituting the dummy values
• find the difference between groups by finding the difference between their
equations
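The two steps above can be sketched numerically. The posttest scores below are hypothetical; for a single 0,1 dummy variable, the least-squares estimates reduce to the group means, which makes the role of β0 and β1 easy to see:

```python
import statistics

# Hypothetical posttest scores; Z = 0 for control, Z = 1 for treatment
z = [0, 0, 0, 0, 1, 1, 1, 1]
y = [70, 74, 68, 72, 82, 75, 88, 85]

control = [yi for zi, yi in zip(z, y) if zi == 0]
treated = [yi for zi, yi in zip(z, y) if zi == 1]

# For a 0,1 dummy, the least-squares estimates are simply:
# beta0 = control-group mean, beta1 = treatment mean minus control mean
beta0 = statistics.mean(control)
beta1 = statistics.mean(treated) - beta0

# Step 1: substitute the dummy values to get each subgroup's equation
predicted_control = beta0 + beta1 * 0   # = beta0
predicted_treated = beta0 + beta1 * 1   # = beta0 + beta1

# Step 2: the difference between the groups is beta1
difference = predicted_treated - predicted_control
```

The single equation with the dummy variable thus packs both subgroup lines, and their difference, into one model, just as described above.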