Frequency distributions
A table reporting the number of observations/cases falling into each category of the variable
Can be used for all levels of measurement, but are better for variables with a small number of categories
Content of table depends on level of measurement
o Percent (%) makes it easy to compare across groups of different sizes
o Valid % (computed on the non-missing cases only) if there are missing data
o Cumulative % for ordinal, interval, and ratio variables
Syntax:
freq vars=satfin.
Unedited output from SPSS (fine for data screening, but not for presenting your research to others):
SATFIN  SATISFACTION WITH FINANCIAL SITUATION

                                Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1  SATISFIED                834      29.6            29.8                 29.8
          2  MORE OR LESS            1261      44.8            45.0                 74.7
          3  NOT AT ALL SAT           708      25.1            25.3                100.0
          Total                      2803      99.5           100.0
Missing   8  DK                         9        .3
          9  NA                         5        .2
          Total                        14        .5
Total                                2817     100.0
Edited output:
Table 1. Satisfaction with Financial Situation^a (2000; N=2,803).

                                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1  Satisfied                      834      29.6            29.8                 29.8
          2  More or less satisfied        1261      44.8            45.0                 74.7
          3  Not satisfied at all           708      25.1            25.3                100.0
          Total                            2803      99.5           100.0
Missing   8  Don’t know                       9        .3
          9  Not available                    5        .2
          Total                              14        .5
Total                                      2817     100.0
Source: Davis and Smith (2007).
a. Question wording: “We are interested in how people are getting along financially these days. So far as you and your family are
concerned, would you say that you are pretty well satisfied with your present financial situation, more or less satisfied, or not satisfied
at all?”
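The three percent columns in Table 1 can be reproduced from the frequencies alone. The following is a minimal Python sketch (not SPSS syntax); the counts are copied from Table 1, and the variable names are illustrative assumptions.

```python
# Frequencies copied from Table 1 (satfin); names are illustrative.
valid_counts = {
    "Satisfied": 834,
    "More or less satisfied": 1261,
    "Not satisfied at all": 708,
}
missing_counts = {"Don't know": 9, "Not available": 5}

n_valid = sum(valid_counts.values())                # 2803
n_total = n_valid + sum(missing_counts.values())    # 2817

rows = []
cumulative = 0
for label, count in valid_counts.items():
    cumulative += count
    rows.append({
        "label": label,
        "frequency": count,
        # Percent uses all cases, including missing ones.
        "percent": round(100 * count / n_total, 1),
        # Valid percent uses only the non-missing cases.
        "valid_percent": round(100 * count / n_valid, 1),
        # Cumulative percent accumulates the valid cases in category order.
        "cumulative_percent": round(100 * cumulative / n_valid, 1),
    })

for row in rows:
    print(row)
```

The percent and valid percent columns differ only in the denominator (2,817 versus 2,803), which is why they are so close here: only 14 cases are missing.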
Graphs
Generating Simple Graphs in SPSS
1. Bar graphs, pie charts, and histograms: Analyze → Descriptive Statistics →
Frequencies; Select the ‘Charts’ box
2. Line charts: Graphs → Legacy Dialogs
Bar graph
Figure 1. Marital Status (2000; N=2,816).
[Bar graph; vertical axis: Percent; categories: MARRIED, WIDOWED, DIVORCED, SEPARATED, NEVER MARRIED.]
Source: Davis and Smith (2007).
Pie chart
Figure 2. Satisfaction with Financial Situation (2000; N=2,803).
[Pie chart; slices: Satisfied 29.8%, More or less 45.0%, Not at all 25.3%.]
Source: Davis and Smith (2007).
Histogram
Figure 3. The Number of Work Hours Last Week (2000; N=1,818).
[Histogram; vertical axis: Frequency; horizontal axis: hours worked, with bins labeled 5.0 to 90.0.]
Source: Davis and Smith (2007).
Note: SPSS collapses the data into ‘bins’; that is, scores from 0 to 5 are summarized by the first bar. You can change this in the chart editor.
Line Chart
Figure 4. The Number of Work Hours Last Week (2000; N=1,818).
[Line chart; vertical axis: Count; horizontal axis: hours worked, 3 to 84.]
Source: Davis and Smith (2007).
References
Davis, James Allan and Smith, Tom W. General social surveys, 1972-2006 [machine-readable data file] / Principal Investigator, James A. Davis; Director and Co-Principal Investigator, Tom W. Smith; Co-Principal Investigator, Peter V. Marsden; Sponsored by National Science Foundation. NORC ed. Chicago: National Opinion Research Center [producer]; Storrs, CT: The Roper Center for Public Opinion Research, University of Connecticut [distributor], 2007.
Central tendency (See the Excel file for examples)
Mode
The mode is the category (nominal/ordinal) or score (interval-ratio) with the largest frequency
The mode is always the category or score on the variable, not the frequency or percent
Bimodal and essentially bimodal distributions
Mean
The mean is the average, which is obtained by adding up all of the scores and dividing by the number of scores:
\[ \bar{y} = \frac{\sum_{i=1}^{N} y_i}{N} \]
Dichotomous variables (i.e., those with only two categories) are special.
Sex (1) Male: frequency = 3
    (2) Female: frequency = 7
If sex is recoded so that 1 becomes 0 and 2 becomes 1 (i.e., into a ‘dummy variable’), then the mean is (3*0 + 7*1)/10 = .7, which is the proportion female!
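The dummy-variable result can be checked with a few lines of Python. This is only an illustrative sketch of the arithmetic, using the ten cases from the example; the variable names are assumptions.

```python
from statistics import mean

# Sex codes from the example: 3 males coded 1 and 7 females coded 2.
sex = [1] * 3 + [2] * 7

# Recode into a dummy variable: 1 (male) becomes 0, 2 (female) becomes 1.
female = [0 if code == 1 else 1 for code in sex]

# The mean of a 0/1 variable is the proportion of cases coded 1.
proportion_female = mean(female)
print(proportion_female)  # 0.7
```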
Variability (See the Excel file for examples)
The Importance of Measuring Variability
If we just look at central tendency, we may be misled because two distributions can have the exact same mean,
median, and/or mode, but different degrees of variability.
Range
The range is the difference between the maximum observed value and the minimum observed value.
The drawback is that the range depends only on the two most extreme scores, so a single outlier can inflate it. In the presence of outliers, you should use the inter-quartile range.
Inter-quartile Range
The inter-quartile range is the difference between the values at the lower and upper quartiles. It is similar to the
range except that it focuses on two less extreme scores (i.e., the lower and upper quartiles instead of the
minimum and maximum values).
Once you identify the cases at the lower and upper quartiles, subtract the score for the case at the lower quartile
from that of the case at the upper quartile. The result is the inter-quartile range.
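The contrast between the range and the inter-quartile range can be sketched in Python. The scores below are hypothetical, chosen so that one case is an obvious outlier. Quartile conventions differ across software, so the `method="inclusive"` setting used here is an assumption and other packages (including SPSS) may return slightly different quartiles on other data.

```python
from statistics import quantiles

# Hypothetical scores (not GSS data); 50 is an obvious outlier.
scores = [1, 3, 4, 5, 6, 7, 9, 10, 50]

# Range: maximum observed value minus minimum observed value.
data_range = max(scores) - min(scores)

# Quartiles: with 9 sorted cases and the inclusive method, the lower and
# upper quartiles fall exactly on the 3rd and 7th cases (4 and 9).
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(data_range)  # 49
print(iqr)         # 5.0
```

The single outlier drives the range up to 49, while the inter-quartile range stays at 5 because it ignores the most extreme scores.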
Variance and Standard Deviation
Variance ($S_Y^2$): a measure of variation for interval-ratio variables; it is the average of the squared deviations from the mean (note: with sample data the divisor is usually N-1 rather than N).
\[ S_Y^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N} \]
Standard deviation ($S_Y$): a measure of variation for interval-ratio variables; it is equal to the square root of the variance and can be read, roughly, as the typical distance of the scores from the mean.
\[ S_Y = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N}} \]
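The variance and standard deviation can be computed step by step to see what the divisor does. This is a minimal Python sketch on hypothetical scores; it compares the hand calculation with the standard library's `pvariance`/`pstdev` (divisor N) and `variance`/`stdev` (divisor N-1).

```python
import math
from statistics import pvariance, pstdev, variance, stdev

# Hypothetical scores (not GSS data).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(scores)
y_bar = sum(scores) / n                            # mean = 5.0
ss = sum((y - y_bar) ** 2 for y in scores)         # sum of squared deviations = 32.0

var_n = ss / n                 # divisor N:   4.0
var_n1 = ss / (n - 1)          # divisor N-1: about 4.57
sd_n = math.sqrt(var_n)        # 2.0
sd_n1 = math.sqrt(var_n1)      # about 2.14

# The library functions agree with the hand calculation.
assert math.isclose(pvariance(scores), var_n)
assert math.isclose(pstdev(scores), sd_n)
assert math.isclose(variance(scores), var_n1)
assert math.isclose(stdev(scores), sd_n1)

print(var_n, sd_n)
print(round(var_n1, 2), round(sd_n1, 2))
```

The N-1 version is always slightly larger; the difference shrinks as N grows.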
Skewed distributions are characterized by extreme values on one side of the distribution
Those that have extremely high values (compared to the rest of the distribution) are positively skewed
Those that have extremely low values (compared to the rest of the distribution) are negatively skewed
The mean is pulled toward the side with the extreme values; the median, however, is largely unaffected
The easiest way to distinguish positive from negative skew is to compare the mean to the median
o If the mean is higher than the median then the variable is positively skewed
o If the mean is lower than the median then the variable is negatively skewed
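The mean-versus-median comparison can be written as a small Python helper. This is a rough sketch of the rule of thumb only, not a formal skewness statistic; the function name and both data sets are hypothetical.

```python
from statistics import mean, median

def skew_direction(scores):
    """Compare the mean to the median as a rough indicator of skew direction."""
    m, md = mean(scores), median(scores)
    if m > md:
        return "positive"
    if m < md:
        return "negative"
    return "no skew indicated"

# Hypothetical scores with one extremely high value (mean 4.67, median 3).
high_tail = [1, 2, 2, 3, 3, 3, 4, 4, 20]
# Hypothetical scores with one extremely low value (mean 16.78, median 19).
low_tail = [1, 17, 18, 18, 19, 19, 19, 20, 20]

print(skew_direction(high_tail))  # positive
print(skew_direction(low_tail))   # negative
```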
[Figure: two bar charts titled “Positive Skew” and “Negative Skew”; horizontal axis: Score on X (1 to 11); vertical axis: Percent (0 to 30).]
How do you choose a measure of central tendency or variability?
1. Level of measurement:
                               Nominal    Ordinal    Interval and ratio
Central tendency   Preferred   Mode       Mode       Mean
                                          Median     Median
                                                     Mode
Variability        Preferred                         Standard deviation
                                                     Variance
                                                     Range
                                                     Inter-quartile range
2. Shape of the distribution – for skewed interval-ratio variables, use the median and inter-quartile range
3. Research objective – for example, do you want the most typical value, the value in the middle of the
distribution, or the average of all scores?
Measures of central tendency and variability in SPSS
1. Analyze → Descriptive Statistics → Frequencies; Select the ‘Statistics’ box
Data Screening
Missing Data
What are some of its sources?
The use of contingency questions in questionnaires
The use of multiple ballots
Non-response and refusals
Interviewer errors
Why does it matter?
Missing data reduces sample size, which reduces statistical power
Missing data can make a representative sample non-representative
Missing data can influence our estimates
There are a variety of methods available to deal with missing data – the most advanced are far beyond the scope
of this course. Regardless, you need to examine how many cases have missing data on each variable and why.
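A per-variable tally of missing cases, broken down by reason, is a simple first step. The Python sketch below uses the missing-value counts reported in this handout's frequency tables for satfin and tvhours (2,817 cases in total); the dictionary layout and names are illustrative assumptions.

```python
N_TOTAL = 2817  # total number of cases in the data file

# Missing-value counts by reason, copied from the satfin and tvhours
# frequency tables (DK = don't know, NA = no answer, NAP = not applicable).
missing_by_variable = {
    "satfin": {"DK": 9, "NA": 5},
    "tvhours": {"NAP": 940, "DK": 3, "NA": 45},
}

summary = {}
for variable, reasons in missing_by_variable.items():
    n_missing = sum(reasons.values())
    summary[variable] = {
        "n_missing": n_missing,
        "n_valid": N_TOTAL - n_missing,
        "percent_missing": round(100 * n_missing / N_TOTAL, 1),
    }

for variable, stats in summary.items():
    print(variable, stats)
```

The two variables look very different: satfin loses 0.5% of cases, while tvhours loses 35.1%, almost all of it coded NAP rather than refused or unknown.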
Outliers
What are they?
Definitions vary, but cases more than 3 standard deviations from the mean can be considered outliers
What causes them and why do they matter?
They may be data errors
They may result from the inclusion of a case from a different population
They may influence our estimates because they can have great leverage – we’ll discuss this later
You can search for outliers using your descriptive statistics, frequency distributions, and graphs.
The Explore command in SPSS produces an ‘Extreme Values’ table that lists the five highest and the five lowest values on the variable.
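The 3-standard-deviation rule of thumb for outliers can be sketched in Python. The work-hours values below are hypothetical (they are not the GSS data), and the sample standard deviation (divisor N-1) is an assumption about which version to use.

```python
from statistics import mean, stdev

# Hypothetical weekly work hours for 20 cases (not GSS data).
hours = [38, 40, 40, 40, 42, 35, 45, 40, 39, 41,
         40, 37, 43, 40, 40, 36, 44, 40, 40, 95]

y_bar = mean(hours)   # 42.75
s = stdev(hours)      # sample standard deviation, about 12.53

# Flag cases whose distance from the mean exceeds 3 standard deviations.
outliers = [y for y in hours if abs(y - y_bar) > 3 * s]
print(outliers)  # [95]
```

Note that the outlier itself pulls the mean up and inflates the standard deviation, which is one reason to check frequency distributions and graphs as well.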
Box Plot
Figure 5. The Number of Work Hours Last Week (2000; N=1,818).
[Box plot; vertical axis: NUMBER OF HOURS WORK, scaled from 0 to 100.]
Source: Davis and Smith (2007).
A box plot is a summary plot based on the median, quartiles, and extreme values.
A line across the box indicates the median [marked in red in Figure 5; 40 in this example].
The whiskers are lines that extend from the box to the highest and lowest non-outlier values. Non-outlier values are those within 1.5 times the inter-quartile range (IQR) of the edges of the box. In this example, the box runs from 40 to 49, so IQR=9 and 1.5*9=13.5; the whiskers can extend up from 49 as far as 62.5 and down from 40 as far as 26.5.
Outliers (i.e., cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box) are represented by circles. Since IQR=9, 1.5*9=13.5 and 3*9=27, so outliers lie between 62.5 and 76 or between 13 and 26.5.
Extreme values (i.e., cases with values more than 3 box lengths from the upper or lower edge of the box) are represented by asterisks. Extreme values lie above 76 or below 13 in this example.
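The cut-offs in the box plot example follow directly from the quartiles. The Python sketch below reproduces them from the lower and upper quartiles (40 and 49); the `classify` helper and its treatment of values that fall exactly on a cut-off are illustrative assumptions.

```python
# Quartiles taken from the work-hours example.
q1, q3 = 40, 49
iqr = q3 - q1                      # 9 (one "box length")

inner_upper = q3 + 1.5 * iqr       # 62.5
inner_lower = q1 - 1.5 * iqr       # 26.5
outer_upper = q3 + 3 * iqr         # 76
outer_lower = q1 - 3 * iqr         # 13

def classify(hours):
    """Label a value as within the whiskers, an outlier, or an extreme value."""
    if hours > outer_upper or hours < outer_lower:
        return "extreme"
    if hours > inner_upper or hours < inner_lower:
        return "outlier"
    return "within whiskers"

for value in (40, 70, 80, 20, 5):
    print(value, classify(value))
```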
Normality
Why does it matter?
Some statistical procedures assume normality; their results may be invalid if this assumption is violated
Even when it is not assumed, skewed distributions can influence estimation and hypothesis testing by
causing other statistical problems
You can examine normality using any of the charts listed above (e.g., a histogram) as well as quantile-normal
plots: Analyze → Descriptive Statistics → Explore; Select the ‘Plots’ box and check the
‘Normality plots with tests’ box
Transforming Variables
Recoding Variables in SPSS
Using the Menu:
Transform → Recode → Into Different Variables
SPSS Syntax:
recode tvhours (0=0) (1=1) (2=2) (3 4=3) (5 thru 24=4) (else=sysmis) into tvhrcat.
Before recoding (tvhours):

                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0               107       3.8             5.9                  5.9
          1               380      13.5            20.8                 26.6
          2               510      18.1            27.9                 54.5
          3               310      11.0            16.9                 71.5
          4               233       8.3            12.7                 84.2
          5                95       3.4             5.2                 89.4
          6                64       2.3             3.5                 92.9
          7                18        .6             1.0                 93.9
          8                47       1.7             2.6                 96.4
          10               24        .9             1.3                 97.8
          11                4        .1              .2                 98.0
          12               21        .7             1.1                 99.1
          13                1        .0              .1                 99.2
          14                2        .1              .1                 99.3
          15                7        .2              .4                 99.7
          20                3        .1              .2                 99.8
          21                1        .0              .1                 99.9
          24                2        .1              .1                100.0
          Total          1829      64.9           100.0
Missing   -1 NAP          940      33.4
          98 DK             3        .1
          99 NA            45       1.6
          Total           988      35.1
Total                    2817     100.0

After recoding (tvhrcat):

                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid     .00             107       3.8             5.9                  5.9
          1.00            380      13.5            20.8                 26.6
          2.00            510      18.1            27.9                 54.5
          3.00            543      19.3            29.7                 84.2
          4.00            289      10.3            15.8                100.0
          Total          1829      64.9           100.0
Missing   System          988      35.1
Total                    2817     100.0

Note: Be sure to add value labels to the tvhrcat variable so that you will remember that 3 now means 3 or 4 and 4 now means 5 or more.
SPSS Syntax for adding value labels:
add value labels tvhrcat
  0 '0'
  1 '1'
  2 '2'
  3 '3 or 4'
  4 '5 or more'.
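The recode rules can be checked against the frequency tables. The Python sketch below mirrors the logic of the recode command (it is not SPSS syntax); the function name is hypothetical, and `None` stands in for SPSS system-missing. Applying it to the tvhours frequencies reproduces the tvhrcat counts.

```python
from collections import Counter

def recode_tvhours(hours):
    """Mirror the recode rules: 0, 1, 2 unchanged; 3-4 -> 3; 5-24 -> 4; else missing."""
    if hours is None:
        return None
    if hours in (0, 1, 2):
        return hours
    if hours in (3, 4):
        return 3
    if 5 <= hours <= 24:
        return 4
    return None  # stands in for SPSS system-missing (else=sysmis)

# Frequencies from the tvhours table, including the missing codes -1, 98, 99.
tvhours_freq = {
    0: 107, 1: 380, 2: 510, 3: 310, 4: 233, 5: 95, 6: 64, 7: 18, 8: 47,
    10: 24, 11: 4, 12: 21, 13: 1, 14: 2, 15: 7, 20: 3, 21: 1, 24: 2,
    -1: 940, 98: 3, 99: 45,
}

tvhrcat_freq = Counter()
for hours, count in tvhours_freq.items():
    tvhrcat_freq[recode_tvhours(hours)] += count

print(dict(tvhrcat_freq))
```

Category 3 collects 310 + 233 = 543 cases and category 4 collects the 289 cases reporting 5 or more hours, matching the recoded table.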
Computing Variables in SPSS
Using the Menu:
Transform → Compute
SPSS Syntax:
compute wktvdiff=hrs1-tvhours.
Here is the result:
Statistics: WKTVDIFF

N                         Valid           1181
                          Missing         1636
Mean                                   39.6842
Median                                 39.0000
Mode                                     38.00
Std. Deviation                        13.83398
Variance                             191.37898
Skewness                                  .184
Std. Error of Skewness                    .071
Kurtosis                                 1.690
Std. Error of Kurtosis                    .142
Range                                    98.00
Minimum                                 -10.00
Maximum                                  88.00
Percentiles               25           35.0000
                          50           39.0000
                          75           46.0000

[Histogram of WKTVDIFF; vertical axis: Frequency (0 to 400); horizontal axis: -10.0 to 90.0.]
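The large number of missing cases on wktvdiff arises because an arithmetic expression in SPSS returns a missing value whenever any of its inputs is missing, so a case needs valid scores on both hrs1 and tvhours. The Python sketch below illustrates that behavior; the helper name and the four cases are hypothetical, and `None` stands in for a missing value.

```python
def compute_diff(hrs1, tvhours):
    """Return hrs1 - tvhours, or None when either input is missing."""
    if hrs1 is None or tvhours is None:
        return None
    return hrs1 - tvhours

# Hypothetical cases as (hrs1, tvhours) pairs; None marks a missing value.
cases = [(40, 2), (None, 3), (50, None), (20, 30)]

wktvdiff = [compute_diff(hrs1, tvhours) for hrs1, tvhours in cases]
n_valid = sum(value is not None for value in wktvdiff)

print(wktvdiff)  # [38, None, None, -10]
print(n_valid)   # 2
```

The last case also shows how a negative difference (such as the minimum of -10 above) can occur: the respondent reports more TV hours than work hours.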
Transformations to reduce skewness and to pull in outliers
It is also possible to transform variables to reduce skew and to pull in outliers; tvhours is positively skewed:
TVHOURS  HOURS PER DAY WATCHING TV

                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0               107       3.8             5.9                  5.9
          1               380      13.5            20.8                 26.6
          2               510      18.1            27.9                 54.5
          3               310      11.0            16.9                 71.5
          4               233       8.3            12.7                 84.2
          5                95       3.4             5.2                 89.4
          6                64       2.3             3.5                 92.9
          7                18        .6             1.0                 93.9
          8                47       1.7             2.6                 96.4
          10               24        .9             1.3                 97.8
          11                4        .1              .2                 98.0
          12               21        .7             1.1                 99.1
          13                1        .0              .1                 99.2
          14                2        .1              .1                 99.3
          15                7        .2              .4                 99.7
          20                3        .1              .2                 99.8
          21                1        .0              .1                 99.9
          24                2        .1              .1                100.0
          Total          1829      64.9           100.0
Missing   -1 NAP          940      33.4
          98 DK             3        .1
          99 NA            45       1.6
          Total           988      35.1
Total                    2817     100.0

[Histogram of TVHOURS; vertical axis: Frequency (0 to 600).]
SPSS Syntax:
* Natural log - you have to add 1 because the natural log of 0 is undefined.
compute tvhr_ln=ln(tvhours+1).
TVHR_LN

                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid     .00             107       3.8             5.9                  5.9
          .69             380      13.5            20.8                 26.6
          1.10            510      18.1            27.9                 54.5
          1.39            310      11.0            16.9                 71.5
          1.61            233       8.3            12.7                 84.2
          1.79             95       3.4             5.2                 89.4
          1.95             64       2.3             3.5                 92.9
          2.08             18        .6             1.0                 93.9
          2.20             47       1.7             2.6                 96.4
          2.40             24        .9             1.3                 97.8
          2.48              4        .1              .2                 98.0
          2.56             21        .7             1.1                 99.1
          2.64              1        .0              .1                 99.2
          2.71              2        .1              .1                 99.3
          2.77              7        .2              .4                 99.7
          3.04              3        .1              .2                 99.8
          3.09              1        .0              .1                 99.9
          3.22              2        .1              .1                100.0
          Total          1829      64.9           100.0
Missing   System          988      35.1
Total                    2817     100.0

[Histogram of TVHR_LN; vertical axis: Frequency (0 to 600).]
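The values in the TVHR_LN table can be reproduced with the same formula outside SPSS. The Python sketch below applies ln(tvhours + 1) to a few of the observed scores; the function name is hypothetical, and `None` stands in for a missing value.

```python
import math

def tvhr_ln(tvhours):
    """Natural log of (tvhours + 1); the +1 keeps a score of 0 defined."""
    if tvhours is None:
        return None
    return math.log(tvhours + 1)

# A few observed tvhours scores, from the lowest to the highest.
scores = [0, 1, 2, 3, 4, 24]
transformed = {hours: round(tvhr_ln(hours), 2) for hours in scores}

print(transformed)
# {0: 0.0, 1: 0.69, 2: 1.1, 3: 1.39, 4: 1.61, 24: 3.22}
```

The transformation compresses the upper tail: the original scores span 0 to 24, while the logged scores span only 0 to 3.22, so the highest values sit much closer to the bulk of the distribution.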