
Statistics Spring 2008

Lab #2 Descriptives
Descriptive analysis involves examining the characteristics of individual variables, as compared to inferential
statistics, which examines the relationships between variables.
There are two types of variables -- categorical and continuous -- and the characteristics of interest for each variable
are different. For categorical variables, you are interested in the count, such as demographic characteristics of your
study (e.g., 50 males and 56 females). For continuous variables, there are many different characteristics to examine,
such as mean, median, mode, range, variability, etc., but the mean is typically the most useful descriptor.
This document explains how to examine characteristics of categorical and continuous variables. Descriptive analysis
involves the same SPSS commands as for Data Screening (e.g., Explore, Frequencies), so you are already intimately
familiar with how to conduct descriptive analysis.
This document also explains how to create composites by averaging together individual variables into new
composited variables. Compositing can involve a few different tasks, such as reverse coding items, averaging items
with different scale ranges, and conducting reliability analysis to determine if from a statistical point of view the
individual items should be averaged together. All those tasks are described below.
This document also explains new skills related to descriptive analysis that you may want to learn, such as how to
transform a continuous variable into a categorical variable, how to create a new variable based upon the combination
of two or more variables, and how to use syntax.
1. Descriptive Statistics
Your two options for descriptive analysis are the Frequencies command and the Explore command.
Both provide much of the same information, except:
a. Frequencies -- groups together the descriptive information into one grid; and displays histograms with
a normal curve (whereas Explore displays histograms but without normal curves)
b. Explore -- displays descriptive information for each variable separately; and displays boxplots
Frequencies:
1. Select Analyze --> Descriptive Statistics --> Frequencies
2. Move all variables into the Variable(s) window.
3. Click Statistics and put a checkmark next to every descriptive statistic you are interested in viewing.
4. Click Charts and put a checkmark next to the chart type you are interested in viewing.
5. Click OK.
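As a cross-check, here is roughly the syntax SPSS generates for these steps if you click Paste instead of OK (the variable names are illustrative stand-ins for your own demographic variables, and the statistics list depends on which boxes you checked):

```spss
* Frequencies with summary statistics and histograms overlaid with a normal curve.
* The variable names (sex, age, edu, income) are examples; substitute your own.
FREQUENCIES VARIABLES=sex age edu income
  /STATISTICS=MEAN MEDIAN MODE STDDEV MINIMUM MAXIMUM
  /HISTOGRAM=NORMAL.
```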
Output below is for the first four demographic questions.
Statistics box provides a grid format of the descriptive statistics for each variable.

After the Statistics box, the frequency distribution for each variable is displayed. Below is the frequency
distribution for gender:

Next come the histograms. Below are the histograms for age and gender. I chose these two
histograms because they illustrate how the Frequencies histogram is useful for displaying both categorical and
continuous variables. Also, notice that neither histogram is normally distributed. Not every variable needs
to be normally distributed. Plus, categorical variables with few answer choices (e.g., 2, 3, 4, 5, 6) will rarely
conform to a normal curve. Finally, in the age histogram notice the sharp drop-off below the 20 line. This is
because we restricted participation in the study to people who were aged 18 or older.

If you double-click on the histogram in the SPSS output viewer, it opens a new window containing the
histogram with many new drop-down options to manipulate the histogram. There are too many options to
explain them all, so feel free to try each one, and if you have specific questions, please let me know. However,
one option I wanted to present to you is the ability to change the scale range on a histogram axis. For
example, if you double-click on the age histogram, it opens a new window. Then, double-click on the
horizontal axis, which opens the Properties window. Then, click on Scale, change the scale range to run from
18 to 72 (the minimum and maximum in our sample), and change the increment value to 1. Click
Apply. The new histogram for age is displayed below. Notice how much more information is displayed.

Now let's use the Explore command:


1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the Variable(s) window.
3. Click Plots and uncheck Stem-and-leaf.
4. Click Options and click Exclude cases pairwise.
5. Click OK.
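The same steps expressed as syntax (via Paste) look roughly like this; system1 through system4 match the four system variables in our dataset:

```spss
* Explore: descriptives and boxplots, no stem-and-leaf, pairwise missing values.
EXAMINE VARIABLES=system1 system2 system3 system4
  /PLOT=BOXPLOT
  /STATISTICS=DESCRIPTIVES
  /MISSING=PAIRWISE.
```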
Output below is for only the four system variables in our dataset because copy/pasting the output for all
variables in our dataset would take up too much space in this document.
Case Processing Summary shows the number of cases that are valid, missing, and total.

Descriptives shows the same information as the Frequencies command, but now each variable is
displayed separately.

Next, the boxplot for each variable is displayed. Below is the boxplot for edu because I want to show you a
boxplot that contains both mild outliers (round dots) and extreme outliers (stars).

What if you want descriptive statistics within groups? For example, imagine a study that manipulated the
presence or absence of a weapon during a crime, and the Dependent Variable was measuring the level of
emotional reaction to the crime. In addition to looking for descriptive statistics of your DV within the entire
study (so collapsing across both groups), you may also want descriptive statistics for your DV within each
group. Another example of when you would want descriptive statistics within groups is when your study
involves a verdict choice. Typically, you not only report the percentage of guilty/not-guilty verdicts across the
entire study, but you also want to report the percentage of guilty/not-guilty verdicts within each group in your
study. I present an example of this situation on the next page, and how to present this data in a Figure.
How to conduct descriptive statistics within groups:
In our dataset about Legal Beliefs, let's treat gender as the grouping variable, because sometimes you also
want to present the gender split among your variables:
1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the Variable(s) window. Move sex into the Factor List.
3. Click Statistics, and click Outliers.
4. Click Plots, and uncheck Stem-and-leaf.
5. Click OK.
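In syntax form, the grouped version of Explore simply adds a BY clause naming the factor; this sketch uses system1 and sex from our dataset:

```spss
* Explore system1 separately for each level of sex, with the outlier (extreme values) table.
EXAMINE VARIABLES=system1 BY sex
  /PLOT=BOXPLOT
  /STATISTICS=DESCRIPTIVES EXTREME(5)
  /MISSING=PAIRWISE.
```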
Output on next page is for system1

Descriptives box tells you descriptive statistics about the variable. Notice that information for males and
females is displayed separately.

WRITE-UP: You typically discuss the characteristics of demographics in the beginning of the Method section,
not the Results section, and you also typically only present data for gender (see below). If you want to discuss
more than just gender, such as age, education, political affiliation, income, etc., then you would create a
Figure to display all the data. For descriptive statistics other than demographics, you would present that data in
the Results section. If there are only a few descriptive statistics, you discuss them in the text of the Results
section (see below). If there are many descriptive statistics, you present them in a Figure, and then discuss
only the most pertinent information from the Figure when you are writing the Results/Discussion section.
a. Here is a sample write-up for gender: The sample consisted of 327 participants, with many more
females (n = 248) than males (n = 76), and three participants who did not indicate gender.
b. Here is a sample write-up for how you would discuss descriptive statistics in the Results section:
When asked what percentage of people brought to trial did in fact commit the crime, the average
response was 78%.
c. Here is a Figure (from another paper I wrote a few years ago):
(FYI see http://www.docstyles.com/apa15.htm for how to format Figures and Tables in APA format)

EVALUATION: Since evaluating descriptive statistics in a Results section or Figure is simply a matter of reading the
descriptive statistics that are reported, I don't have any advice for evaluating them other than to
pay attention to whether there are any descriptive statistics that were not reported that you may find helpful or
would want the author to include in the paper.

2. Other graphs
SPSS has the ability to create other types of graphs beyond histograms and boxplots, but they provide little
information beyond the information provided by histograms and boxplots. The other charts are:
a. Bar charts
b. 3D bar charts
c. Line charts
d. Area charts
e. Pie charts
To access these charts:
1. Select Graphs --> choose either Chart Builder or Legacy Charts
2. Move chosen variables into the appropriate open spaces.
3. Click OK.
Legacy Charts are the old way that SPSS builds charts. Each chart has a separate command window, each
with its own unique options and characteristics. The options and characteristics are very straightforward and
easy to use.
Chart Builder is new to SPSS. It reportedly has more functionality, but it is also complex and sometimes
difficult to manipulate. I would suggest first using the Legacy Charts to get a better understanding of each type
of chart.
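For completeness, these charts also have syntax equivalents; for example, a simple bar chart of counts by gender (using the sex variable from our dataset) can be produced with:

```spss
* Simple bar chart of counts for each level of sex.
GRAPH
  /BAR(SIMPLE)=COUNT BY sex.
```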
3. Composites: averaging items together
Why do we create composites? The rule of thumb in statistics is "the more, the better." In terms of measuring
constructs, this means that you typically want to ask many questions about the same construct in order to
adequately tap into the entire construct of interest. For example, in a study about happiness, asking "how
happy are you right now?" perfectly maps onto the construct of how happy you are right now. But if your
intended construct is happiness in general, you need to ask more questions to tap the entire theoretical construct, such
as "how happy do you feel?", "how happy are you with your life in general?", etc. Thus, for every construct,
researchers ask many questions, either by using established scales on the topic or by creating their own measures
to tap all the facets of the construct. When you analyze the data, you start by conducting descriptive analysis
of each individual question. Then, supposing you asked 10 questions, you composite them into 1 variable by averaging
all 10 questions together. Researchers are typically more interested in that 1 composite variable than in the 10
individual items (unless the 10 questions are uniquely tapping different sub-parts of the entire construct, and the
researchers are interested in each sub-part). So, after first conducting descriptive analysis of each item, you
then conduct descriptive analysis of the 1 composite variable.
How do you create a composite?
1. Select Transform --> Compute Variable
2. Type a new name for your composite in the Target Variable box.
3. Drag Mean from the Function group into the open box above.
4. Replace the question marks (?) with each item to be composited, separated by commas (,).
5. Click OK.
The newly created composite will appear at the end of the data file.
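For the record, the equivalent Compute syntax is a one-liner; the happy1-happy3 item names and the happy_comp composite name are made up for illustration:

```spss
* Average three (hypothetical) happiness items into one composite variable.
* A variant, MEAN.2(...), would require at least 2 valid answers before computing the mean.
COMPUTE happy_comp = MEAN(happy1, happy2, happy3).
EXECUTE.
```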
Is it appropriate to create a composite with my questions? We described above how to create a composite, but
another question is whether it's appropriate to create the composite given the questions and data in your study.

You can answer that question from a theoretical point of view, and a statistical point of view. I describe below
both points of view:
From a theoretical point of view
a. From a theoretical point of view, it is possible your questions do not measure the same construct, and
thus it is inappropriate to average them together. For example, the face content of each item may
measure different concepts. Imagine questions about your political group orientation. A question about
whether you think of yourself as a Republican or Democrat may tap a different construct than asking
whether you feel like a Republican or Democrat. You need to examine your questions and
determine whether you feel it's appropriate to average the items together.
b. Another option is to create separate composites, one for each concept that is measured. For example,
maybe you composite together all the questions about how you feel about your political group
membership, and create another composite of the questions about how you think of your political
group membership. After creating the separate composites, you can then also merge all the questions
together (i.e., merge all the separate composites together) into 1 big composite. In this case, you would
call the separate composites you merged together the "sub-parts" or "sub-factors" of the 1 big
composite. Also, from a theoretical point of view you need to decide how to label or characterize this
big composite.
c. It is acceptable to create composites from a theoretical point of view even if it is not appropriate from a
statistical point of view. I discuss next the benchmarks for deciding whether or not it's statistically
appropriate to merge items together into a composite; but assuming those benchmarks are not met in
your data, it is still appropriate to merge items together on purely theoretical grounds.
However, you must state in your paper that the statistical benchmarks were not met, and then explain
the theoretical basis for why you are still merging the items together. (FYI, if the statistical
benchmarks are met, you rarely see researchers explain the theoretical basis for why the items
were merged together.)
From a statistical point of view
a. From a statistical point of view, it is possible your questions do not measure the same construct, and
thus it is inappropriate to average them together. For example, you can use Factor Analysis to
determine if the items fall into 1 big composite (called a factor), or if they fall into separate
sub-factors. I will explain Factor Analysis at the end of the semester, but only if you request it. Factor
Analysis is not one of the more typical statistical tests. Instead, researchers decide how the items group
together from a theoretical point of view, and then test their judgment by conducting
Reliability Analysis, which provides a benchmark for determining whether or not the items group
together. In other words, Reliability Analysis is called a "confirmatory" test because it confirms
your decisions, whereas Factor Analysis is considered an "exploratory" test because it is used to explore
which, if any, of the items group together into which set of factors or sub-factors.
b. Reliability Analysis is rather straightforward to conduct:
1. Select Analyze --> Scale --> Reliability Analysis
2. Move all variables into the Variable(s) window.
3. Click Statistics and put a checkmark next to Item and Scale if item deleted.
4. Click OK.
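In syntax form (again via Paste), the reliability analysis for the three prosecutor items looks roughly like this; the scale label in quotes is arbitrary:

```spss
* Cronbach's alpha for the three prosecutor items,
* with item descriptives and alpha-if-item-deleted.
RELIABILITY
  /VARIABLES=prosecutor1 prosecutor2 prosecutor3
  /SCALE('Prosecutors') ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE
  /SUMMARY=TOTAL.
```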
Reliability Statistics gives you the Alpha value, which is the benchmark for whether or not the items
group together from a statistical point of view. Alpha ranges from 0 to 1, and the higher the number, the
more strongly the items group together statistically. Output below is for the three prosecutor questions.
Alphas above .9 are great, above .8 are good, above .7 are ok, and above .6 are borderline.
In this case, Alpha = .68, which is acceptable for merging the three items together into a composite. Also, the
smaller the sample, the more likely you are to find smaller Alpha levels, because there is less data to identify
intercorrelations. In smaller samples, smaller Alpha levels are acceptable for creating composites.

The other output from the analysis is helpful for interpreting your data. Case Processing Summary tells you the
number of valid cases included in the analysis. Notice that only listwise deletion is possible. Item Statistics
gives you descriptive information about each item. Item-Total Statistics tells you the Alpha level if each
item is removed. Notice that Alpha improves to .78 if we remove prosecutor3. In this case, because there
are so few items (3), I would suggest not removing prosecutor3, even though it improves Alpha,
because only 2 items is not much of a composite. If we were analyzing more items (4+), then it would be
more appropriate to exclude items.

WRITE-UP: The three items measuring attitudes toward prosecutors formed a reliable composite (α = .68).
EVALUATION: For each composite in the paper, the author(s) need to report the alpha level, which is the
statistic that tells you whether or not the items group together statistically. Alpha is determined by the strength
of the bivariate relationships amongst all the items in the composite. The higher the internal consistency
amongst items, the higher the Alpha level. Alphas above .9 are great, above .8 are good, above .7 are ok, above
.6 are borderline. Also, the smaller the sample, the more likely you will find smaller Alpha levels because
there is less data to identify intercorrelations.

4. Items with different scale ranges


If you are going to composite together multiple items, all the items need to have the same scale range.
For example, let's say we ask two happiness questions: (1) "How happy are you right now?" on a 1-7 scale,
and (2) "How happy do you feel?" on a -3 to 3 scale. Notice that the two questions are about the same
construct (so theoretically you can merge them together), and also notice that the total range of the scales for
both items is 7 points, BUT the scale ranges lie along different dimensions. Compositing involves averaging
items together. If we average these two items together, the resulting average will not be interpretable because
of the different scale ranges. For example, a 1 on the first item is the lowest possible answer choice, but a
1 on the second item is one of the highest possible choices. The solution is to transform both scale ranges
into a common metric. This is accomplished by first standardizing both items. Then, we composite the
newly standardized items.
Before we get to how to standardize items, I want to point out why I included in the example a scale that
ranged from a negative number (-3) to a positive number (3). Sometimes when you are measuring constructs,
there is a natural mid-point or neutral point, such as with happiness, where you could have "0" happiness at
the moment. In this situation, it can be beneficial to include an answer choice that is neutral or 0. Notice that
if we asked the same question but with a 1-7 scale, and you wanted to indicate you are feeling zero happiness at
the moment, your only answer choice would be a 1, which you may not feel indicates your absence of
happiness. Another reason to include a scale that ranges from negative to positive is that your construct also
ranges from negative to positive. For example, imagine a question that asked about your feelings about the
death penalty. You could have a negative view or a positive view of the death penalty, so in order to tap that
construct you need to include answer choices in the scale range that reflect both positive and negative. Another way
to reflect both positive and negative in a scale is with the labels. For example, you could ask the same question
about your feelings toward the death penalty on a 1-7 scale, but have the label for 1 be "strongly oppose,"
for 4 be "neutral," and for 7 be "strongly support."
I also want to point out that standardizing items to transform them to a common metric is necessary
whenever any of the scale ranges differ, not just with negative versus positive scales, as in the example above. For
example, you may ask questions about the death penalty that are so similar that you want to vary the scale
ranges of the items so that you tap into more information (and also force the subjects to pay more attention to
the items, because giving all items the same scale range may allow lazy subjects to answer the same way on
similar questions without thinking carefully about their answers).
To Standardize items:
1. Select Analyze --> Descriptive Statistics --> Descriptives
2. Move all variables into the Variable(s) window.
3. Put a checkmark next to Save standardized values as variables.
4. Click OK.
The newly standardized variables are listed at the end of the data file. Each standardized variable is listed in a
separate column. You can then analyze the new standardized variables as you would any other variable in your
data set, including averaging them together to create a composite.
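The Paste version of the standardizing steps is short; the two happiness items below are hypothetical, and SPSS names each saved z-score by prefixing Z to the original variable name:

```spss
* Save standardized (z-score) versions of two hypothetical items.
* SPSS appends Zhappy1 and Zhappy2 to the end of the data file.
DESCRIPTIVES VARIABLES=happy1 happy2
  /SAVE.
```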

5. Reverse coding items


If you are going to composite together multiple items, all the items need to be in the same direction. This
means that answering higher (or lower) on each scale must correspond conceptually to answering
higher (or lower) on the other items you want to composite together.
For example, let's say we ask two happiness questions: (1) "How happy are you right now?" on a 1-7 scale,
and (2) "How unhappy are you right now?" on a 1-7 scale. Notice that the two questions are about the same
construct (so theoretically you can merge them together), and also notice that the total range of the scales for
both items is 7 points, BUT conceptually answering higher (or lower) on one item is the same as answering
lower (or higher) on the other item. Before we can composite them together, we need to transform the items
so that they are in the same direction. Thus, we could reverse code the scale range for either the first item or
the second item (but obviously not both). Composites typically contain multiple items, so you typically
have to reverse code multiple items. Also, when choosing which set of items to reverse code (e.g., either the
items that are in the positive direction, or the items that are in the negative direction), you should think ahead to
the statistical analyses you want to conduct and how you want the output from those analyses (or the
relationship between those variables) to be conceptualized. For example, imagine a study testing the
relationship between happiness and income. If your hypothesis is that more income is correlated with more
happiness, then conceptually we want our happiness composite coded in the positive direction (so that
higher on the scale means more happiness) so that the outcome is easier to interpret. Notice that if we code
the happiness composite in the opposite direction (so that lower means more happiness), we will still get the
same conceptual outcome as with the positively coded composite -- that more happiness is correlated with
more income -- but the interpretation of the outcome will be more difficult because we will get a negative
correlation between the variables (because lower on the happiness scale means more happiness, and more
happiness is correlated with higher income). Thus, think ahead to your intended results and code all the items
in the appropriate direction.

To reverse code items:


1. Select Transform --> Recode into different variables
2. Move one item into the Input Window.
3. Type a name for the new variable.
(I like to use the same name as the original variable, but labeled with _rev, such as system1_rev.)
4. Click Change.
5. Click Old and New Values.
6. Enter the old value and the new value and click Add.
(If reverse coding a 1-7 scale, then old=1, new=7; old=2, new=6; old=3, new=5; etc.)
7. Click Continue.
8. Click OK.
The newly reverse coded variable is listed at the end of the data file.
Notice that instead of Recode into different variables, there is an option for Recode into same variables. I
do not use this function because I like to leave the old variable intact: it keeps a permanent record
of each variable; you may forget you reverse coded a variable and reverse code it again; and you may make a
mistake in the recoding that can't be undone if the old variable has been replaced.

6. SYNTAX
Up to this point, we have learned that SPSS has two windows: the Data Editor (grid of data) and the Viewer (output).
SPSS has a third window: Syntax.
What is syntax? When you point-and-click in the Data Editor for SPSS to calculate a mean, or outlier, or
correlation, or whatever, SPSS is calculating the statistical formulas for those tests. SPSS is basically a big
calculator that can perform many different calculations. When you point-and-click in the Data Editor for
SPSS, you are telling SPSS how to perform those calculations, such as "include kurtosis," or "exclude cases
pairwise," or "run correlations on these three specific variables, and not the other variables." Another way to
tell SPSS to perform those same operations is to use a programming language. In the Syntax window, you can
type programming language, then hit the run button, and SPSS will perform the calculations. This process is
analogous to how a website designer writes computer code to design a website, but you don't see the code,
only the website design. Similarly, the point-and-click functionality in SPSS is analogous to the website design
you see, and the syntax functionality in SPSS is analogous to the background computer code that you typically
don't see.
Why use syntax? The point-and-click interface is very easy to use, and you don't need to learn the syntax
programming language, which can sometimes get overwhelming and difficult to understand. However, there
are some advantages to syntax. For one, you can perform multiple operations more easily than with the
point-and-click interface. For example, in the previous section about reverse coding items, you can only reverse code 1
item at a time. If you want to reverse code multiple items, you have to repeat the same steps over and over.
Syntax makes that repetition less time-consuming. I present an example below about reverse coding, but I
want to point out that you can use syntax for any point-and-click command. For example, for every command
in SPSS, instead of clicking OK as the last step, you can click PASTE instead as the last step, and it will
display the syntax.
To reverse code items:
1. Select Transform --> Recode into different variables
2. Move one item into the Input Window.
3. Type a name for the new variable.
(I like to use the same name as the original variable, but labeled with _rev, such as system1_rev.)
4. Click Change.
5. Click Old and New Values.
6. Enter the old value and the new value and click Add.
(If reverse coding a 1-7 scale, then old=1, new=7; old=2, new=6; old=3, new=5; etc.)
7. Click Continue.
8. Click PASTE
Notice that the last action is PASTE, not OK.
The syntax window will open, and the command you just initiated is displayed using syntax code.
I have pasted below the syntax for our example. RECODE is the command to perform. Notice that the old
variable name and the new variable name are in the command line, and that the command ends with EXECUTE. If we
wanted to run this command, we would highlight the entire syntax and click the arrow (Run) button:
RECODE system1 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system1_rev.
EXECUTE.

We are using this example to show how syntax can speed up repetitive actions. If we copy/paste the
syntax over and over, we can then type in the other variables we need to reverse code. Then, we highlight all the
syntax and click the arrow button to run it.
RECODE system1 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system1_rev.
EXECUTE.
RECODE system2 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system2_rev.
EXECUTE.
RECODE system3 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system3_rev.
EXECUTE.
RECODE system4 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) INTO system4_rev.
EXECUTE.

Another way to use syntax is to keep a record of your statistical analyses, because the syntax indicates not only
which statistical analyses were performed, but also how you performed those analyses
and which options you chose to use. The Output window provides that record by displaying the
syntax for every analysis that is conducted.

7. Transforming continuous variables into categorical variables


(and categorical variables into different categorical variables)
It is possible to transform continuous variables into categorical variables. For example, imagine a study about
happiness where your happiness item (or composite) ranges from 1 to 7. You might be interested in
categorizing the subjects as either high happiness (4 through 7 on the scale) or low happiness (1 through 4 on
the scale). This is called "dichotomizing" the variable because you are creating a new variable that has only
two options.
Another example of why you would want to transform a continuous variable into a categorical variable is if
there are only a few responses on some of the answer choices of the continuous variable. For example,
imagine a scale range from 1-11 in which answer choice 4 and/or answer choice 9 received only 1 response
each. One response is not enough data for meaningful interpretation. You may want to collapse the 11-point scale
into 3 or 4 categories. As another example, look at rel_category in our dataset, which measures the
religious category memberships of the subjects. The frequency distribution is listed on the next page. Hindu
received only 6 responses, and Jewish received only 9 responses. You may want to merge those responses into
"other" and/or merge all the data into "Christian" versus "other." Notice that creating the new categorical
variable answers a different research question than the original categorical variable.


Transforming variables in this way uses the same SPSS command as for reverse coding items.
To transform the variables:
1. Select Transform --> Recode into different variables
2. Move one item into the Input Window.
3. Type a name for the new variable.
(I like to use the same name as the original variable, but labeled with _cat, such as system1_cat.)
4. Click Change.
5. Click Old and New Values.
6. Click Range, enter the range of values of the old variable, and assign a number for the new variable.
(e.g., 1-3.999 becomes a 1, and 4.001-7 becomes a 2)
7. Click Continue.
8. Click OK.
The newly transformed variable is listed at the end of the data file. I would suggest then going into the
Variable View and assigning value labels in the Values column that reflect how you cut the variable. For
example, if you just created a new categorical variable where 1-3.999 becomes a 1 and 4.001-7 becomes a
2, then assign 1 = "1-3.999" and 2 = "4.001-7". That way, you keep a record of what the 1 and 2 mean.
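The corresponding syntax, including the value labels suggested above, might look like this (happy_comp and happy_cat are illustrative names for a composite and its categorical version):

```spss
* Dichotomize a hypothetical 1-7 composite into low (1) vs. high (2),
* leaving a small gap around the midpoint of 4.
RECODE happy_comp (1 thru 3.999=1) (4.001 thru 7=2) INTO happy_cat.
VALUE LABELS happy_cat 1 '1-3.999' 2 '4.001-7'.
EXECUTE.
```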
How do I decide where to split up the variable? This is a complex question with a complex answer:
If you are dichotomizing a variable, you can split at the midpoint of the scale from a theoretical point of view
because that is conceptually the middle response. Plus, sometimes you choose an odd scale range precisely
because it is designed to have a true mid-point. However, what if in your dataset more subjects fall at
the high or low end of the scale? Splitting at the mid-point of the scale might create a vastly unequal
distribution when you dichotomize the variable. What if, for example, splitting at the midpoint of the scale puts
70-80% of the subjects in one end, and 10-20% in the other? You are already losing valuable information by
reducing a continuous variable to a categorical variable, and if you have unbalanced categories, you are
losing even more information. In this case, you could choose to split at the median, even if the median is not
the midpoint of the scale. From a theoretical point of view, the median is a good choice for splitting the
variable because it is the mid-point of that sample. Samples are not always normally distributed. Research is
about discovering empirical reality, so sometimes reality dictates how subjects respond to the question, and
assuming the midpoint of the scale is the true midpoint of the construct may be inaccurate. Plus, from a
statistical point of view, the median truly splits the sample into halves. However, what if in your dataset the
median is a very high or low number on the scale range? For example, what if on a 1-7 point scale the median
is a 2 or a 6? In this situation half of the scores are bunched into a small range (e.g., 2 points in this example),
whereas the other half are more evenly distributed across a larger range (e.g., 5 points in this example). Once
again, you are losing valuable information by dichotomizing in this way. In summary, weigh both theoretical and
statistical considerations when dichotomizing variables. One solution is to dichotomize in both ways and
analyze the data using both variables.
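The midpoint-versus-median tradeoff described above can be made concrete with a small Python sketch (not SPSS). The data are invented and deliberately skewed toward the low end of a 1-7 scale.

```python
# Illustrative comparison on hypothetical, skewed 1-7 data: splitting at the
# scale midpoint (4) versus at the sample median.
import statistics

scores = [1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 6]  # most subjects low

midpoint = 4
low_mid = sum(s < midpoint for s in scores)
high_mid = sum(s > midpoint for s in scores)

median = statistics.median(scores)
low_med = sum(s < median for s in scores)
high_med = sum(s > median for s in scores)

print(f"midpoint split: {low_mid} low vs {high_mid} high")    # -> 11 low vs 2 high
print(f"median ({median}) split: {low_med} low vs {high_med} high")  # -> 7 low vs 7 high
```

The midpoint split produces badly unbalanced groups (11 vs 2), while the median split yields equal halves, at the cost of a "high" group whose scores start well below the conceptual middle of the scale.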
The same theoretical and practical considerations come into play when you are deciding to split the variables
in other ways. You may decide, for example, to cut the continuous variable into thirds, or fourths, or fifths.
Sometimes when you cut the variable into thirds, your new categorical variable only includes the top and
bottom third. Sometimes you are only interested in the more polarized judgments. Sometimes you can
strengthen the relationship between your variables by only including the polarized judgments. From a
theoretical point of view it can make sense to drop the middle third because those are the subjects who are
somewhat undecided about the construct. Plus, think about why dichotomizing continuous variables results in
reduced information and reduced statistical power: subjects in the continuous variable who are near the middle
are treated the same as subjects near the top/bottom after you dichotomize the variable. On a 100-point scale, for
example, the subjects who respond 49 and 51 are treated the same as the subjects who respond 0 and 100,
respectively. Thus, you are reducing your ability to detect true relationships in the study because the subjects
close to the middle may be masking relationships amongst your variables by diluting the strength of the
high/low categories in the variable. Eliminating the middle third when you cut the continuous variable into
thirds is one way to create a categorical variable while minimizing your loss of power.
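Cutting into thirds and keeping only the extremes can be sketched in Python (not SPSS), using percentile-based cut points. The data and category codes (1 = bottom third, 3 = top third) are invented for illustration.

```python
# Hypothetical sketch: cut a continuous variable into thirds by percentile
# and keep only the top and bottom third, dropping the undecided middle.
import statistics

scores = [10, 25, 33, 41, 48, 52, 55, 63, 70, 81, 88, 95]  # invented data

cuts = statistics.quantiles(scores, n=3)  # the two tercile cut points
lower, upper = cuts[0], cuts[1]

def trichotomize(score):
    if score <= lower:
        return 1          # bottom third
    if score >= upper:
        return 3          # top third
    return None           # middle third: dropped from analysis

kept = [(s, trichotomize(s)) for s in scores if trichotomize(s) is not None]
print(len(kept), "of", len(scores), "subjects kept")
```

With these data, 4 subjects land in each extreme category and the 4 middle subjects are excluded, preserving the polarized judgments the text describes.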
From a practical point of view, if you are dichotomizing a variable, you don't truly cut it in half, because if you
cut a 1-7 point scale into 1-4 and 4-7, for example, a subject who answered 4 is technically in both
categories. Thus, you typically create a small degree of separation between the two categories, such as 1-3.999 and 4.001-7.
When splitting a continuous identification variable into two groups, another question is whether you want to
have equal N for just that variable, or equal N across that variable AND another variable. For
example, I conducted a study about how Republicans and Democrats identify with their political party. Let's say
I want to dichotomize my measure of identification. When splitting the continuous identification variable
into two groups, the question is whether you want equal N for just the identification variable, or
equal N across both identification and the Republican v. Democrat variable. For example, if you split the
identification variable down the middle, you might have many more Republicans in the low or high
identification condition, and vice versa for Democrats. On the other hand, you could split the identification
variable separately for Republicans and then again for Democrats, and then combine the groups, so that you have
equal N across both variables. I believe both are defensible options. My opinion is that the first option
(grand median or midpoint) is best, because then the high and low groups will have equivalent psychological
meaning across party affiliation. In other words, "high" and "low" mean the same thing for both Republicans
and Democrats, even if cell size is unequal.
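The two options, one grand cut point versus a separate cut point within each party, can be contrasted in a short Python sketch (not SPSS). The (party, identification) pairs are invented, with Republicans deliberately skewing high.

```python
# Hypothetical sketch: grand-median split vs. within-party median splits.
import statistics

# (party, identification) pairs -- invented data; Republicans skew high
data = [("rep", 4), ("rep", 5), ("rep", 6), ("rep", 7),
        ("dem", 1), ("dem", 2), ("dem", 3), ("dem", 6)]

def split(pairs, cut):
    return [(party, "high" if v > cut else "low") for party, v in pairs]

def counts(labeled):
    c = {}
    for key in labeled:
        c[key] = c.get(key, 0) + 1
    return c

# Option 1: one grand cut point -- "high" means the same thing for everyone
grand_med = statistics.median(v for _, v in data)
grand = counts(split(data, grand_med))

# Option 2: a separate cut point per party -- equal N within each party
by_party = {}
for party in ("rep", "dem"):
    sub = [(p, v) for p, v in data if p == party]
    med = statistics.median(v for _, v in sub)
    by_party.update(counts(split(sub, med)))

print(grand)     # Republicans pile up in "high", Democrats in "low"
print(by_party)  # 2 high and 2 low within each party
```

Option 1 leaves the cells unbalanced (3 high Republicans vs. 1 high Democrat here) but keeps one shared meaning of "high"; Option 2 equalizes N within each party at the cost of party-specific cut points.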

6. Creating new variables based upon the combination of two or more variables
Sometimes you want to create a new variable that is a combination of two or more other variables. For
example, I conducted a study about how Republicans and Democrats identify with their political party. For each
subject, I asked what their political party affiliation is and how much they identify with that political party.
Let's say I want a new variable of only highly identified Republicans but lowly identified Democrats. In this
case I want to create a new variable that is a combination of my two questions. Here is another example:
assume that when I asked my first question about political party affiliation, there were four options --
Republican, Democrat, other, none. If I wanted to create a new variable of only highly identified Republicans
and Democrats, I can't simply cut the identification question in half, because the top half will contain more
than just Democrats and Republicans; it will also contain those who responded "none" or "other". In this situation, I
need a way to create a new variable that takes into account different option choices.
How do I create a new variable based upon the combination of two or more variables? Below, I explain the
steps for using the Compute Variable command. However, I want to first explain conceptually what the task
entails. In essence, we are going to tell SPSS to create a new variable that is labeled 1 if it satisfies certain
criteria (such as high on variable 1, but low on variable 2), and then labeled 2 if it satisfies other criteria
(such as high on variable 2, but low on variable 1). In other words, we can specify a long combination of
criteria, and have subjects who meet those criteria labeled 1 or 2 (or 3 or 4, depending on how many
categories you want in your new variable). As an example, we could create a new variable that has subjects
listed as 1 if they are Republican and high identifiers, and subjects listed as 2 if they are Democrat and high
identifiers. Thus, the new categorical variable will have two categories that are a combination of my two
questions about political party affiliation and identification with that political party. You create each new
category separately. Thus, in our example about creating a new variable that contains only highly identified
Republicans and highly identified Democrats, we first use the Compute Variable command to assign a 1 to
highly identified Republicans. Then, we repeat the process by using the Compute Variable command to
assign a 2 to highly identified Democrats.
To transform the variables:
1. Select Transform --> Compute Variable
2. Type a new name for your new variable in the Target Variable box.
3. In the Numeric Expression box, type the number of a category
(e.g., let's start by assigning category 1)
4. Click the If button, and select Include if case satisfies condition.
5. Move the old variable into the open box, and specify the restriction.
(e.g., if identification is the identify variable, and political party affiliation is the party variable, then I need
to specify only those subjects who are highly identified (e.g., greater than 4 on the identify variable) and
who are simultaneously Republicans (e.g., Republicans are labeled 1 on the party variable). So, I would
type the following into the box -- identify>4 & party=1.)
6. Click Continue
7. Click OK.
THEN WE REPEAT TO CREATE THE SECOND CATEGORY
1. Select Transform --> Compute Variable
2. Type the SAME name in the Target Variable box as you did the first time.
3. In the Numeric Expression box, type the number of the next category
(e.g., 2)
4. Click the If button, and select Include if case satisfies condition.
5. Move the old variable into the open box, and specify the restriction.
(e.g., if identification is the identify variable, and political party affiliation is the party variable, then I need
to specify only those subjects who are highly identified (e.g., greater than 4 on the identify variable) and
who are simultaneously DEMOCRATS (e.g., Democrats are labeled 2 on the party variable). So, I would
type the following into the box -- identify>4 & party=2.)
6. Click Continue
7. Click OK.
In summary, the Numeric Expression box holds the number we want to assign to the new category (1 or 2).
The criteria for who is assigned that number are specified in the If box.
And the name of the new categorical variable is entered in the Target Variable box.
The new variable will appear at the end of the data file.
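The logic of the two Compute Variable passes above can be sketched in Python (not SPSS) as a sanity check. The subjects, variable names (party, identify), and cutoffs are invented to match the running example; unmatched cases are left as None, just as SPSS leaves them blank (system-missing).

```python
# Hypothetical sketch of the two-pass Compute Variable logic:
# 1 = highly identified Republicans, 2 = highly identified Democrats,
# None = everyone else (left missing, as SPSS would leave them blank).
subjects = [
    {"party": 1, "identify": 6},  # Republican, high identifier
    {"party": 1, "identify": 3},  # Republican, low identifier
    {"party": 2, "identify": 7},  # Democrat, high identifier
    {"party": 3, "identify": 6},  # "other", high identifier
]

for s in subjects:
    if s["identify"] > 4 and s["party"] == 1:    # identify>4 & party=1
        s["high_id_party"] = 1
    elif s["identify"] > 4 and s["party"] == 2:  # identify>4 & party=2
        s["high_id_party"] = 2
    else:
        s["high_id_party"] = None

print([s["high_id_party"] for s in subjects])  # -> [1, None, 2, None]
```

Notice that the "other" subject with high identification is correctly excluded, which is exactly why a simple split on the identification question alone would not work here.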
