
Topic 2.

Descriptive Statistics, Data Screening, and Transforming Variables

Univariate Descriptive Statistics


• Procedures that help us organize, screen, and describe data from a sample or a population
• These allow us to identify patterns that would be impossible to see if we looked at the data all at once
• Tables
   o Frequency distribution
• Graphs
   o For nominal and ordinal variables
      ▪ Pie chart
      ▪ Bar graph
   o For interval and ratio variables
      ▪ Histogram
      ▪ Line chart
      ▪ Box plot
      ▪ Quantile-normal plots
• Statistics
   o Central tendency
      ▪ Mode, median, mean
   o Variability
      ▪ Range, inter-quartile range, variance and standard deviation
   o Other statistics
      ▪ Quantiles and percentiles, quartiles, skewness, kurtosis

First things first


You should use all of these to examine your data. Statistics are the most efficient (a single number vs. a whole
table or chart), but they are often misleading because they omit many details to achieve this efficiency. You
should begin every analysis by examining each variable (i.e., before you do anything else that is more
complicated).

The Uses of frequency distributions and graphs


• Summarizing vast quantities of data
• Obtaining information about the sample – how representative of the population are the sample data?
• Data screening – searching for unusual cases/outliers, coding errors, missing data
• Testing assumptions of statistical tests – for example, normality

Some guidelines for creating good frequency distributions and graphs


• Choose the design based on the level of measurement of the variable
• Categories should be mutually exclusive and exhaustive
• The table/graph should stand on its own (i.e., it should pass the “lost in the parking lot” test)
• Always include a descriptive title, labels, the data source, and the total number of cases

Frequency distributions
• A table reporting the number of observations/cases falling into each category of the variable
• Can be used for all levels of measurement, but works best for variables with a small number of categories
• Content of the table depends on the level of measurement
   o Percent (%) allows us to make comparisons across groups easily
   o Valid percent (%) excludes missing cases; report it if there are missing data
   o Cumulative percent (%) is meaningful for ordinal, interval, and ratio variables

Frequency Distributions in SPSS


Analyze → Descriptive Statistics → Frequencies

Syntax:
freq vars=satfin.

Unedited output from SPSS (fine for data screening, but not for presenting your research to others):
SATFIN  SATISFACTION WITH FINANCIAL SITUATION

                              Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1 SATISFIED              834      29.6         29.8              29.8
          2 MORE OR LESS          1261      44.8         45.0              74.7
          3 NOT AT ALL SAT         708      25.1         25.3             100.0
          Total                   2803      99.5        100.0
Missing   8 DK                       9        .3
          9 NA                       5        .2
          Total                     14        .5
Total                             2817     100.0

Edited output:
Table 1. Satisfaction with Financial Situation (a) (2000; N=2,803).

                                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1 Satisfied                   834      29.6         29.8              29.8
          2 More or less satisfied     1261      44.8         45.0              74.7
          3 Not satisfied at all        708      25.1         25.3             100.0
          Total                        2803      99.5        100.0
Missing   8 Don’t know                    9        .3
          9 Not available                 5        .2
          Total                          14        .5
Total                                  2817     100.0

Source: Davis and Smith (2007).

a. Question wording: “We are interested in how people are getting along financially these days. So far as you and your family are
concerned, would you say that you are pretty well satisfied with your present financial situation, more or less satisfied, or not satisfied
at all?”
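
The percent, valid percent, and cumulative percent columns can be recomputed from the frequencies alone. The following is a minimal Python sketch, not part of the original handout; the counts come from Table 1, while the function and variable names are illustrative assumptions.

```python
# Frequencies taken from Table 1 (satfin); names below are illustrative only.
valid_counts = {
    "Satisfied": 834,
    "More or less satisfied": 1261,
    "Not satisfied at all": 708,
}
missing_count = 9 + 5  # "don't know" plus "not available"


def frequency_table(counts, n_missing):
    """Return rows of (category, frequency, percent, valid %, cumulative %)."""
    n_valid = sum(counts.values())
    n_total = n_valid + n_missing
    rows = []
    cumulative = 0
    for category, freq in counts.items():
        cumulative += freq
        rows.append((
            category,
            freq,
            round(100 * freq / n_total, 1),        # share of all cases
            round(100 * freq / n_valid, 1),        # share of non-missing cases
            round(100 * cumulative / n_valid, 1),  # running valid percent
        ))
    return rows


for row in frequency_table(valid_counts, missing_count):
    print(row)
```

Percent uses all 2,817 cases as the denominator, whereas valid percent and cumulative percent use only the 2,803 non-missing cases.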

Graphs
Generating Simple Graphs in SPSS
1. Bar graphs, pie charts, and histograms: Analyze → Descriptive Statistics →
Frequencies; Select the ‘Charts’ box
2. Line charts: Graphs → Legacy Dialogs
Bar graph
Figure 1. Marital Status (2000; N=2,816).
[Bar graph. Y-axis: Percent (0 to 50). Categories: Married, Widowed, Divorced, Separated, Never married.]
Source: Davis and Smith (2007).

Pie chart
Figure 2. Satisfaction with Financial Situation (2000; N=2,803).
[Pie chart. Satisfied: 29.8%; More or less: 45.0%; Not at all: 25.3%.]
Source: Davis and Smith (2007).
Histogram
Figure 3. The Number of Work Hours Last Week (2000; N=1,818).
[Histogram. Y-axis: Frequency. X-axis: hours worked, in bins labeled from 5.0 to 90.0.]
Source: Davis and Smith (2007).
Note: SPSS collapses the data into ‘bins’ – that is, scores from 0 to 5 are summarized by the first bar; you can change this in the chart editor.

Line chart
Figure 4. The Number of Work Hours Last Week (2000; N=1,818).
[Line chart. Y-axis: Count. X-axis: hours worked, from 3 to 84.]
Source: Davis and Smith (2007).
References
Davis, James Allan and Smith, Tom W. General social surveys, 1972-2006 [machine-readable data file]
/Principal Investigator, James A. Davis; Director and Co-Principal Investigator, Tom W. Smith; Co-
Principal Investigator, Peter V. Marsden; Sponsored by National Science Foundation. --NORC ed.--
Chicago: National Opinion Research Center [producer]; Storrs, CT: The Roper Center for Public
Opinion Research, University of Connecticut [distributor], 2007.
Central tendency (See the Excel file for examples)
Mode
• The mode is the category (nominal/ordinal) or score (interval-ratio) with the largest frequency
• The mode is always the category or score on the variable, not the frequency or percent
• Bimodal and essentially bimodal distributions

Median (an order statistic)


• The median is the score that divides the distribution into two equal parts so that half the cases are above it and half are below it
• Order is implied in the definition of the median. You have to be able to order the categories of the variable from least to most or most to least to calculate the median, so you can only calculate the median for ordinal and interval-ratio variables
• Note – if you are finding the median for an ordinal variable with an even number of cases, it is inappropriate to calculate the average of the middle two cases
• The median is not affected by the scores assigned to the categories

Quantiles, percentiles, and quartiles (order statistics)


• A percentile is a score at or below which a specific percentage of cases fall; quantiles differ only in that the value is a proportion rather than a percent
• The lower quartile is equal to the 25th percentile and the .25 quantile. The middle quartile is equal to the median, the 50th percentile, and the .50 quantile. The upper quartile is equal to the 75th percentile and the .75 quantile

Mean
The mean is the average, which is obtained by adding up all of the scores and dividing by the number of scores:

\[ \bar{y} = \frac{\sum_{i=1}^{N} y_i}{N} \]

Some important properties of the mean:


1. You can only calculate the mean for interval-ratio variables
2. The mean is the center of gravity of the distribution because it incorporates all scores in the distribution
(unlike the median and mode). This means that if we subtract the mean from each case and add up the
differences, the result will always equal zero
3. Unlike the mode and the median, the mean is sensitive to extreme scores in the distribution. This is because
it takes into account every score
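
Properties 2 and 3 can be checked numerically. The sketch below is an illustration in Python rather than part of the original handout, and the scores are hypothetical values chosen so the arithmetic is easy to follow.

```python
from statistics import mean, median

scores = [2, 4, 4, 5, 10]  # hypothetical interval-ratio scores

y_bar = mean(scores)                      # (2 + 4 + 4 + 5 + 10) / 5 = 5
deviations = [y - y_bar for y in scores]  # each score minus the mean
total_deviation = sum(deviations)         # property 2: this sum is zero

# Property 3: replace the largest score with an extreme one.
with_extreme = [2, 4, 4, 5, 100]

print("sum of deviations:", total_deviation)
print("mean:", mean(scores), "->", mean(with_extreme))
print("median:", median(scores), "->", median(with_extreme))
```

The single extreme score moves the mean from 5 to 23 while the median stays at 4.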

Dichotomous variables (i.e., those with only two categories) are special.
Sex   (1) Male – frequency = 3
      (2) Female – frequency = 7

If we recode sex so that 1 becomes 0 and 2 becomes 1 (i.e., into a ‘dummy variable’), then the mean is .7, which is the proportion female!
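
A short Python sketch of the dummy-variable example, using the frequencies given above (3 males, 7 females); the variable names are illustrative assumptions.

```python
from statistics import mean

sex = [1] * 3 + [2] * 7  # 3 males coded 1, 7 females coded 2

# Recode into a dummy variable: male (1) becomes 0, female (2) becomes 1.
female = [0 if code == 1 else 1 for code in sex]

proportion_female = mean(female)  # 7 ones out of 10 cases
print(proportion_female)
```

Because every female contributes 1 and every male contributes 0, the sum of the dummy variable is the number of females, and dividing by N gives their proportion.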

Variability (See the Excel file for examples)
The Importance of Measuring Variability
If we just look at central tendency, we may be misled because two distributions can have the exact same mean,
median, and/or mode, but different degrees of variability.

Commuting times for two different routes:


                  Freeway      Downtown
                  (minutes)    (minutes)
Week 1
   Monday            21           31
   Tuesday           19           27
   Wednesday         20           29
   Thursday          28           27
   Friday            53           31
Week 2
   Monday            24           30
   Tuesday           18           28
   Wednesday         19           27
   Thursday          28           27
   Friday            60           33

Mean                29.0         29.0

• On average, it takes 29.0 minutes to get to work regardless of the route
• Notice that there is more variability day-to-day if you take the freeway
• If you value being on-time, then you should take the downtown route and leave the house about 33 minutes before work
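
The comparison can be reproduced with a few lines of Python. This is a sketch for illustration only; the minutes are copied from the commuting table, and the `summarize` helper is a hypothetical name.

```python
from statistics import mean

# Commuting times in minutes, Week 1 then Week 2, from the table above.
freeway = [21, 19, 20, 28, 53, 24, 18, 19, 28, 60]
downtown = [31, 27, 29, 27, 31, 30, 28, 27, 27, 33]


def summarize(minutes):
    """Mean plus the shortest and longest trip and the gap between them."""
    return {
        "mean": mean(minutes),
        "min": min(minutes),
        "max": max(minutes),
        "range": max(minutes) - min(minutes),
    }


print("Freeway: ", summarize(freeway))
print("Downtown:", summarize(downtown))
```

Both routes average 29 minutes, but freeway trips run from 18 to 60 minutes while downtown trips stay between 27 and 33.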

Range
The range is the difference between the maximum observed value and the minimum observed value.
Its drawback is that it depends only on the two most extreme scores, so a single outlier can inflate it. In the presence of outliers, you should use the inter-quartile range.

Inter-quartile Range
The inter-quartile range is the difference between the values at the lower and upper quartiles. It is similar to the
range except that it focuses on two less extreme scores (i.e., the lower and upper quartiles instead of the
minimum and maximum values).

First order the cases from the lowest score to the highest.

To find the case at the lower quartile: N*.25
To find the case at the upper quartile: N*.75

Once you identify the cases at the lower and upper quartiles, subtract the score for the case at the lower quartile
from that of the case at the upper quartile. The result is the inter-quartile range.
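
The N*.25 and N*.75 rule can be sketched in Python as follows. The handout does not say what to do when N*.25 is not a whole number, so rounding up to the next case is an assumption made here, and statistical packages (including SPSS) may interpolate differently. The data are the commuting times from the earlier table.

```python
import math


def case_at(ordered_scores, proportion):
    """Score of the case at position N * proportion (counting from 1).

    Rounding a fractional position up is an assumption of this sketch.
    """
    position = math.ceil(len(ordered_scores) * proportion)
    return ordered_scores[position - 1]


def interquartile_range(scores):
    ordered = sorted(scores)  # cases must be ordered from low to high
    return case_at(ordered, 0.75) - case_at(ordered, 0.25)


freeway = [21, 19, 20, 28, 53, 24, 18, 19, 28, 60]
downtown = [31, 27, 29, 27, 31, 30, 28, 27, 27, 33]

print(interquartile_range(freeway))   # case 8 minus case 3: 28 - 19
print(interquartile_range(downtown))  # case 8 minus case 3: 31 - 27
```

Under this rule the two freeway trips of 53 and 60 minutes do not enter the calculation at all, which is why the inter-quartile range is less sensitive to outliers than the range.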

Variance and Standard Deviation
Variance (S_y^2): a measure of variation for interval-ratio variables; it is the average of the squared deviations
from the mean (note: with sample data the denominator is usually N-1 rather than N).

\[ S_y^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N} \]

Standard deviation (S_y): a measure of variation for interval-ratio variables; it is equal to the square root of the
variance and can be read, roughly, as the typical deviation of a score from the mean.

\[ S_y = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N}} \]
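
As a worked illustration, the Python sketch below applies the variance and standard deviation definitions to the commuting times from the earlier table. The function names and the `sample` switch (dividing by N-1 instead of N) are assumptions of this sketch, not SPSS behavior documented in the handout.

```python
import math


def variance(scores, sample=False):
    """Average squared deviation from the mean (divide by N-1 if sample=True)."""
    n = len(scores)
    y_bar = sum(scores) / n
    sum_of_squares = sum((y - y_bar) ** 2 for y in scores)
    return sum_of_squares / (n - 1 if sample else n)


def standard_deviation(scores, sample=False):
    """Square root of the variance."""
    return math.sqrt(variance(scores, sample))


freeway = [21, 19, 20, 28, 53, 24, 18, 19, 28, 60]
downtown = [31, 27, 29, 27, 31, 30, 28, 27, 27, 33]

for name, route in (("Freeway", freeway), ("Downtown", downtown)):
    print(name, variance(route), round(standard_deviation(route), 2))
```

The squared deviations sum to 2,030 for the freeway and 42 for downtown, so the freeway variance (203.0 with N in the denominator) is far larger than the downtown variance (4.2) even though the means are identical.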

The shape of the distribution


Distributions can be either symmetric or skewed
• A symmetric distribution is a mirror image of itself if you divide the distribution in half
• In a unimodal symmetric distribution the mean, median, and mode are equal

Skewed distributions are characterized by extreme values on one side of the distribution
• Those that have extremely high values (compared to the rest of the distribution) are positively skewed
• Those that have extremely low values (compared to the rest of the distribution) are negatively skewed
• The mean is pulled toward the side with the extreme values; the median, however, is largely unaffected
• The easiest way to distinguish positive from negative skew is to compare the mean to the median
   o If the mean is higher than the median, then the variable is positively skewed
   o If the mean is lower than the median, then the variable is negatively skewed
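
The mean-versus-median comparison is simple enough to write as a rule. The Python sketch below is illustrative only; the function name is hypothetical, and the example values are the mean (41.90) and median (40.00) that this handout reports for hours worked last week.

```python
def skew_direction(mean, median):
    """Rule of thumb: the mean is pulled toward the tail with extreme values."""
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "none indicated"


# Hours worked last week: mean 41.90, median 40.00.
print(skew_direction(41.90, 40.00))
```

This is only a quick check; the skewness statistic and a histogram give a fuller picture of the shape of the distribution.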
[Two bar charts, titled “Positive Skew” and “Negative Skew”. X-axis: Score on X (1 to 11). Y-axis: Percent (0 to 30).]

How do you choose a measure of central tendency or variability?
1. Level of measurement:

                                Nominal    Ordinal    Interval and ratio
   Central tendency  Preferred  Mode       Mode       Mean
                                           Median     Median
                                                      Mode
   Variability       Preferred                        Standard deviation
                                                      Variance
                                                      Range
                                                      Inter-quartile range

2. Shape of the distribution – for skewed interval-ratio variables, use the median and inter-quartile range
3. Research objective – for example, do you want the most typical value, the value in the middle of the
distribution, or the average of all scores?
Measures of central tendency and variability in SPSS
1. Analyze → Descriptive Statistics → Frequencies; Select the ‘Statistics’ box

2. Analyze → Descriptive Statistics → Explore; Select the ‘Statistics’ box


3. Analyze → Descriptive Statistics → Descriptives; Select the ‘Options' box
Statistics

MARITAL  MARITAL STATUS
N        Valid       2,816
         Missing         1
Mode                     1   ‘Married’

SATFIN  SATISFACTION WITH FINANCIAL SITUATION
N        Valid       2,803
         Missing        14
Median                2.00   ‘More or less’
Mode                     2   ‘More or less’

HRS1  NUMBER OF HOURS WORKED LAST WEEK
N        Valid                    1818
         Missing                   999
Mean                             41.90
Median                           40.00
Mode                                40
Std. Deviation                  13.395
Variance                       179.430
Skewness                          .212
Std. Error of Skewness            .057
Kurtosis                         1.668
Std. Error of Kurtosis            .115
Range                               86
Minimum                              3
Maximum                             89
Percentiles   25                 40.00
              50                 40.00
              75                 49.00
Data Screening
Missing Data
What are some of its sources?
• The use of contingency questions in questionnaires
• The use of multiple ballots
• Non-response and refusals
• Interviewer errors
Why does it matter?
• Missing data reduces sample size, which reduces statistical power
• Missing data can make a representative sample non-representative
• Missing data can influence our estimates

There are a variety of methods available to deal with missing data – the most advanced are far beyond the scope
of this course. Regardless, you need to examine how many cases have missing data on each variable and why.

Outliers
What are they?
• Definitions vary, but cases more than 3 standard deviations from the mean can be considered outliers
What causes them and why do they matter?
• They may be data errors
• They may result from the inclusion of a case from a different population
• They may influence our estimates because they can have great leverage – we’ll discuss this later
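
The 3-standard-deviation rule of thumb can be turned into explicit cutoffs. The Python sketch below is an illustration, with hypothetical function names; the mean (41.90) and standard deviation (13.395) are the values this handout reports for hours worked last week.

```python
def three_sd_cutoffs(mean, sd):
    """Lower and upper cutoffs at 3 standard deviations from the mean."""
    return mean - 3 * sd, mean + 3 * sd


def is_outlier(value, mean, sd):
    """True if the value lies more than 3 standard deviations from the mean."""
    low, high = three_sd_cutoffs(mean, sd)
    return value < low or value > high


hrs1_mean, hrs1_sd = 41.90, 13.395

print(three_sd_cutoffs(hrs1_mean, hrs1_sd))  # roughly 1.7 and 82.1 hours
print(is_outlier(89, hrs1_mean, hrs1_sd))    # the observed maximum
print(is_outlier(3, hrs1_mean, hrs1_sd))     # the observed minimum
```

By this definition the maximum of 89 hours is an outlier, while the minimum of 3 hours is not, because it lies just inside the lower cutoff.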

You can search for outliers using your descriptive statistics, frequency distributions, and graphs.
In the Explore command, SPSS lists the five highest and five lowest values of a variable, which is where any outliers will appear:
Extreme Values

                                          Case Number   Value
HRS1 NUMBER OF HOURS    Highest   1           1089        89
WORKED LAST WEEK                  2           1986        89
                                  3           2494        89
                                  4           2223        89
                                  5           2298        .a
                        Lowest    1            787         3
                                  2            914         3
                                  3           1245         3
                                  4           2288         4
                                  5           1785        .b
a. Only a partial list of cases with the value 89 are shown in the table of upper extremes.
b. Only a partial list of cases with the value 4 are shown in the table of lower extremes.

You can also identify outliers by using a box plot:


Analyze → Descriptive Statistics → Explore; Select the ‘Plots’ box

Box plot
Figure 5. The Number of Work Hours Last Week (2000; N=1,818).
[Box plot of the number of hours worked last week; axis from 0 to 100.]
Source: Davis and Smith (2007).

A box plot is a summary plot based on the median, quartiles, and extreme values.

A line across the box indicates the median [marked in red in Figure 5; 40 in this example].

The box represents the inter-quartile range, which contains the middle 50% of values [from 40 to 49 in this example].

The whiskers are lines that extend from the box to the highest and lowest non-outlier values, that is, the most
extreme values that lie within 1.5 box lengths (1.5 times the inter-quartile range) of the edges of the box. In this
example, IQR=9 and 1.5*9=13.5, so the whiskers can reach at most from 49 up to 62.5 and from 40 down to 26.5.

Outliers (i.e., cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box) are
represented by circles; IQR=9, so 1.5*9=13.5 and 3*9=27; outliers lie between 62.5 and 76 and between 13 and 26.5.

Extreme values (i.e., cases with values more than 3 box lengths from the upper or lower edge of the box) are
represented by asterisks. Extreme values lie above 76 and below 13 in this example.
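
The box plot cutoffs in this example follow directly from the quartiles. The Python sketch below reproduces the arithmetic; the function names are hypothetical, and the quartiles (40 and 49) are the ones reported for hours worked last week.

```python
def box_plot_fences(q1, q3):
    """Cutoffs at 1.5 and 3 box lengths (IQRs) beyond the edges of the box."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),  # whiskers stay inside these
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),      # beyond these: extreme values
    }


def classify(value, q1, q3):
    """Label a value the way the box plot would draw it."""
    fences = box_plot_fences(q1, q3)
    inner_low, inner_high = fences["inner"]
    outer_low, outer_high = fences["outer"]
    if value < outer_low or value > outer_high:
        return "extreme value"   # drawn as an asterisk
    if value < inner_low or value > inner_high:
        return "outlier"         # drawn as a circle
    return "not an outlier"


print(box_plot_fences(40, 49))
for hours in (3, 20, 45, 70, 89):
    print(hours, classify(hours, 40, 49))
```

With these quartiles, both the minimum (3 hours) and the maximum (89 hours) fall beyond the outer cutoffs of 13 and 76 and would be drawn as extreme values.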
Normality
Why does it matter?
• Some statistical procedures assume normality – their results may be invalid if this assumption is violated
• Even when it is not assumed, skewed distributions can influence estimation and hypothesis testing by causing other statistical problems

You can examine normality using any of the charts listed above (e.g., a histogram) as well as quantile-normal
plots: Analyze → Descriptive Statistics → Explore; Select the ‘Plots’ box and check the
‘Normality plots with tests’ box

Figure 6. A Normal Q-Q Plot of Work Hours.
[Normal Q-Q plot. X-axis: Observed Value (0 to 100). Y-axis: from -4 to 4.]

The Q refers to quantiles. This plot displays the quantiles of our variable against the expected quantiles (i.e.,
the quantiles it would have if it were normal).

If the variable is normally distributed, the dots will all fall on the line.

This plot suggests that the distribution of work hours deviates from normality; for example, there are:
• More cases between 4 and 24 hours per week than expected
• Fewer cases between 27 and 39 hours per week than expected
• Etc.
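
To make the idea of "expected quantiles" concrete, the Python sketch below pairs each ordered observation with the standard normal quantile expected at its rank. It is an illustration under stated assumptions: the plotting position (rank - 0.5)/N is one common convention and is not necessarily the formula SPSS uses, and the sample values are hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pair each ordered value with its expected standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for rank, observed in enumerate(ordered, start=1):
        proportion = (rank - 0.5) / n              # assumed plotting position
        expected_z = NormalDist().inv_cdf(proportion)
        points.append((observed, expected_z))
    return points


hours = [40, 25, 60, 38, 45]  # hypothetical work hours
for observed, expected_z in normal_qq_points(hours):
    print(observed, round(expected_z, 2))
```

Plotting the observed values against the expected quantiles gives the Q-Q plot; for normally distributed data the points fall close to a straight line.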

Transforming Variables
Recoding Variables in SPSS
Using the Menu:
Transform → Recode → Into Different Variables

SPSS Syntax:
recode tvhours (0=0) (1=1) (2=2) (3 4=3) (5 thru 24=4) (else=sysmis) into tvhrcat.

The original variable:

TVHOURS  HOURS PER DAY WATCHING TV

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid      0              107       3.8          5.9               5.9
           1              380      13.5         20.8              26.6
           2              510      18.1         27.9              54.5
           3              310      11.0         16.9              71.5
           4              233       8.3         12.7              84.2
           5               95       3.4          5.2              89.4
           6               64       2.3          3.5              92.9
           7               18        .6          1.0              93.9
           8               47       1.7          2.6              96.4
           10              24        .9          1.3              97.8
           11               4        .1           .2              98.0
           12              21        .7          1.1              99.1
           13               1        .0           .1              99.2
           14               2        .1           .1              99.3
           15               7        .2           .4              99.7
           20               3        .1           .2              99.8
           21               1        .0           .1              99.9
           24               2        .1           .1             100.0
           Total         1829      64.9        100.0
Missing    -1 NAP         940      33.4
           98 DK            3        .1
           99 NA           45       1.6
           Total          988      35.1
Total                    2817     100.0

The recoded variable:

TVHRCAT

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid      .00            107       3.8          5.9               5.9
           1.00           380      13.5         20.8              26.6
           2.00           510      18.1         27.9              54.5
           3.00           543      19.3         29.7              84.2
           4.00           289      10.3         15.8             100.0
           Total         1829      64.9        100.0
Missing    System         988      35.1
Total                    2817     100.0

Note – Be sure to add value labels to the tvhrcat variable so that you will remember that 3 now means 3 or 4 and 4 now means 5 or more.

SPSS Syntax for adding value labels:
add value labels tvhrcat
 0 '0'
 1 '1'
 2 '2'
 3 '3 or 4'
 4 '5 or more'.
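
The logic of the recode can be mirrored in a few lines of Python to check the recoded frequencies. This is a sketch for illustration, not SPSS itself: `None` stands in for SPSS system-missing, the function name is hypothetical, and the frequencies are copied from the tvhours table.

```python
from collections import Counter


def recode_tvhours(hours):
    """Mirror of: (0=0) (1=1) (2=2) (3 4=3) (5 thru 24=4) (else=sysmis)."""
    if hours is None:
        return None
    if hours in (0, 1, 2):
        return hours
    if hours in (3, 4):
        return 3
    if 5 <= hours <= 24:
        return 4
    return None  # everything else, including the codes -1, 98 and 99


# value -> frequency, from the tvhours table (-1, 98, 99 are missing codes)
tvhours_freq = {0: 107, 1: 380, 2: 510, 3: 310, 4: 233, 5: 95, 6: 64, 7: 18,
                8: 47, 10: 24, 11: 4, 12: 21, 13: 1, 14: 2, 15: 7, 20: 3,
                21: 1, 24: 2, -1: 940, 98: 3, 99: 45}

tvhrcat_freq = Counter()
for value, freq in tvhours_freq.items():
    tvhrcat_freq[recode_tvhours(value)] += freq

print(dict(tvhrcat_freq))
```

Category 3 combines 310 + 233 = 543 cases and category 4 collects the 289 cases reporting 5 or more hours, matching the tvhrcat table; the 988 cases with missing codes end up as missing.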

Computing Variables in SPSS
Using the Menu:
Transform → Compute

SPSS Syntax:
compute wktvdiff=hrs1-tvhours.
Here is the result:
Statistics

WKTVDIFF
N        Valid                     1181
         Missing                   1636
Mean                            39.6842
Median                          39.0000
Mode                              38.00
Std. Deviation                 13.83398
Variance                      191.37898
Skewness                           .184
Std. Error of Skewness             .071
Kurtosis                          1.690
Std. Error of Kurtosis             .142
Range                             98.00
Minimum                          -10.00
Maximum                           88.00
Percentiles   25                35.0000
              50                39.0000
              75                46.0000

[Histogram of WKTVDIFF. Y-axis: Frequency (0 to 400). X-axis: from -10.0 to 90.0.]
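
Note that the computed variable has only 1,181 valid cases, fewer than either hrs1 (1,818) or tvhours (1,829), which is consistent with a case being missing on the result whenever it is missing on either input. The Python sketch below illustrates that behavior; `None` stands in for a missing value, and the function name and the example cases are hypothetical.

```python
def compute_difference(hrs1, tvhours):
    """Return hrs1 - tvhours, or None if either input is missing."""
    if hrs1 is None or tvhours is None:
        return None
    return hrs1 - tvhours


# (hrs1, tvhours) pairs for four hypothetical respondents
cases = [(40, 2), (None, 3), (45, None), (5, 15)]

wktvdiff = [compute_difference(h, t) for h, t in cases]
n_valid = sum(value is not None for value in wktvdiff)

print(wktvdiff)
print("valid cases:", n_valid)
```

Only respondents with valid answers on both variables contribute to the statistics of the new variable.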

Transformations to reduce skewness and to pull in outliers
It is also possible to transform variables to reduce skew and to pull in outliers; tvhours is positively skewed:
TVHOURS  HOURS PER DAY WATCHING TV

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid      0              107       3.8          5.9               5.9
           1              380      13.5         20.8              26.6
           2              510      18.1         27.9              54.5
           3              310      11.0         16.9              71.5
           4              233       8.3         12.7              84.2
           5               95       3.4          5.2              89.4
           6               64       2.3          3.5              92.9
           7               18        .6          1.0              93.9
           8               47       1.7          2.6              96.4
           10              24        .9          1.3              97.8
           11               4        .1           .2              98.0
           12              21        .7          1.1              99.1
           13               1        .0           .1              99.2
           14               2        .1           .1              99.3
           15               7        .2           .4              99.7
           20               3        .1           .2              99.8
           21               1        .0           .1              99.9
           24               2        .1           .1             100.0
           Total         1829      64.9        100.0
Missing    -1 NAP         940      33.4
           98 DK            3        .1
           99 NA           45       1.6
           Total          988      35.1
Total                    2817     100.0

[Histogram of TVHOURS. Y-axis: Frequency (0 to 600).]

SPSS Syntax:
* Natural log - you have to add 1 because the natural log of 0 is undefined.
compute tvhr_ln=ln(tvhours+1).
TVHR_LN

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid      .00            107       3.8          5.9               5.9
           .69            380      13.5         20.8              26.6
           1.10           510      18.1         27.9              54.5
           1.39           310      11.0         16.9              71.5
           1.61           233       8.3         12.7              84.2
           1.79            95       3.4          5.2              89.4
           1.95            64       2.3          3.5              92.9
           2.08            18        .6          1.0              93.9
           2.20            47       1.7          2.6              96.4
           2.40            24        .9          1.3              97.8
           2.48             4        .1           .2              98.0
           2.56            21        .7          1.1              99.1
           2.64             1        .0           .1              99.2
           2.71             2        .1           .1              99.3
           2.77             7        .2           .4              99.7
           3.04             3        .1           .2              99.8
           3.09             1        .0           .1              99.9
           3.22             2        .1           .1             100.0
           Total         1829      64.9        100.0
Missing    System         988      35.1
Total                    2817     100.0

[Histogram of TVHR_LN. Y-axis: Frequency (0 to 600).]
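
The transformed values in the table can be reproduced outside SPSS. The Python sketch below applies the same natural-log-plus-one transformation to each observed value of tvhours; the function and variable names are illustrative assumptions.

```python
import math


def log_plus_one(hours):
    """Natural log of (hours + 1); adding 1 keeps a score of 0 defined."""
    return math.log(hours + 1)


# The distinct valid values of tvhours from the frequency table.
tvhours_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 20, 21, 24]

transformed = [round(log_plus_one(h), 2) for h in tvhours_values]
print(transformed)
```

The transformation compresses the upper tail: on the original scale the highest score (24) is 22 hours above the modal score (2), but on the log scale the gap is only 3.22 - 1.10 = 2.12, which is how it reduces positive skew and pulls in outliers.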
