CGN 6655: Regional Transportation Design and Development NAVEEN ELURU

SPSS Familiarization: A Self-Instructing Tutorial

The intent of this exercise is to introduce you to the SPSS environment and the most
common applications of interest in the context of this course. The dataset NPTS1990.sav
provided along with this document should be used for this exercise.

Contents
1. Components of the SPSS Environment
2. Reading in Data
3. Exploratory Analyses
   a. Frequency Distributions
   b. Descriptive Statistics
   c. Cross Tabulations
4. Creating Variables
   a. New Variables
   b. Recoding
5. Linear Regression Model
6. Analyses on Subsets of Data

Acknowledgements: This tutorial was prepared by Prof. Siva Srinivasan of the University
of Florida.

SPSS Familiarization Page 1 of 19


1. Components of the SPSS Environment

The SPSS environment comprises three major components: (1) The Data Editor,
(2) The Syntax File, and (3) The Output Viewer.

1.1 The Data Editor

This is the primary window of the SPSS program. The data are displayed in this window
in the format of a typical spreadsheet. There are two “views” of this window:
• In the Data View, the data values are displayed. Each column typically represents
a variable. Each row of data represents a case (i.e., values of all variables for a
particular household or person for our travel modeling applications).
• In the Variable View, details of the variables are listed. Each row represents
details for one variable (hence there are as many rows in the Variable View as
there are columns in the Data View). Some of the useful variable attributes include
variable labels (a lengthy, meaningful description of the variable), format
(numeric, character, number of decimal places, etc.), and value labels (see Section
4.2 for more on value labels; this is important).

1.2 The Syntax File

The processing of data in SPSS can be performed using the menu items & dialog boxes
(i.e., the Graphical User Interface or the GUI) or by directly providing the appropriate
commands in a Syntax File. SPSS has its own scripting language and the command
syntaxes are provided in the Help files. Further, it is also possible to generate the syntax
for any analysis using the GUI and “paste” it to the syntax file.

The use of syntax files is highly recommended for the following reasons:
1. It helps you maintain a log of all the processing that you have done on the
data. You can also add comments to the syntax file, and so you can maintain a
very good documentation of the data processing.
2. In case you lose your results, you can re-create them by simply running the
syntax file (instead of working through the GUI all over again). So make sure
that you save your syntax file.
3. It makes data processing faster in the long run. For example, if you have just
run a model and now want to run it again after changing a few variables, you
can simply copy-paste the syntax from the first run and change only the
relevant variables instead of re-specifying everything through the GUI.

1.3 The Output Viewer

The results of SPSS analysis are displayed on a separate window called the Output
Viewer. These can be directly saved (as .spo files). Alternatively, specific results from
the output file can be copied to commonly used applications such as MS Word, Excel,
and PowerPoint (simply right-click on the result to copy).

2. Reading in Data

SPSS is capable of handling input data in various formats. In this course, you will be
provided all data in the SPSS format (.sav files).

Open the SPSS program. Click on FILE->OPEN->DATA…. Then, navigate to the
folder containing the data file NPTS1990.sav, select this file, and click on the PASTE
button. A new syntax file opens up and the command for opening the file is pasted as
shown in the figure at the end of this page. Note: Each command ends with a period (“.”).

In the syntax file, add a comment before the command indicating that you are opening the
required file. All lines of code which represent comments should begin with “/* ”. It is
preferable to have a blank line between comments and commands.
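
As a rough sketch of what the syntax file might contain at this point (the folder in the path is a hypothetical placeholder, and the exact text that SPSS pastes can vary by version):

```spss
/* Open the NPTS 1990 household data file (the folder below is a hypothetical path) */

GET FILE='C:\CGN6655\NPTS1990.sav'.
```

The comment line and the blank line follow the conventions described above, and the command itself ends with a period.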

Now highlight the command, right click and select RUN CURRENT to run this
command.

The file is opened in the Data Editor window and you should see the following Data and
Variable Views.


Save this syntax file and keep it open throughout this familiarization exercise. As you
do more analysis, you will be pasting all syntax to this file. Save this file
periodically.


The data file comprises a sample of 2000 households drawn from the 1990 US National
Personal Transportation Survey (NPTS). The following variables are included:

Variable Name    Description
-------------    -----------
houseid          Household-identifying ID number
ntrip            Number of trips made by the household
hhsize           Total number of persons in the household
num0to4          Number of persons aged 0-4 in the household
num5to21         Number of persons aged 5-21 in the household
numadult         Number of adults in the household
numwork          Number of workers in the household
numdrive         Number of drivers in the household
income           Household income
numcars          Number of automobiles in the household

Note (for your information):

• For additional details on the NPTS 1990 survey, visit:
http://npts.ornl.gov/npts/1990/
• This national survey is now called the National Household Travel Survey
(NHTS). The most recent survey for which data are already publicly available
was conducted in 2001; visit: http://nhts.ornl.gov/2001/index.shtml
• The most recent survey in this series was conducted in 2008-2009. This survey is of
particular importance to Florida, as the state Department of Transportation (DOT)
“bought” additional samples (about 14,000 additional households covering the
entire state). See http://nhts.ornl.gov/nhts2008.shtml


3. Exploratory Analysis

As a precursor to any statistical-modeling exercise, it is always a good idea to perform
exploratory analysis of your data (averages, range, variances, cross-tabulations, etc.).
This helps you to understand your data better (Does it look reasonable? Are there
outliers? Is there internal consistency? Are there missing values?). Further, you can assess
what you can and cannot do with the available data (for example, the ability to estimate a
regression coefficient on a variable is related to the variance of that variable in the
sample).

3.1 Frequency Distributions

Frequencies are a good way to learn about categorical and integer data when the range of
data values is not very large. In this exercise we will generate the frequency distributions
for two variables in the data file.

In the Data Editor window (or in the Syntax File Window), click on
ANALYZE->DESCRIPTIVE STATISTICS->FREQUENCIES... A new “Frequencies” dialog box
opens up. Select the two variables of interest (ntrip and numcars) by highlighting each of
the variables from the list and clicking on the “>” button.

Once the two variables are selected, click on the PASTE button. The syntax for running
the frequency distributions on the two variables is added to the syntax file already open.
Add comments as appropriate (see figure; zoom and see). Highlight the command,
right-click, and select RUN CURRENT. The frequencies are displayed in an Output Viewer
window. (Note: You can also simply click on the OK button without clicking on the PASTE button to
run the frequency analysis, but then the syntax is not saved. The recommended
practice is to use the syntax file for data analysis/processing.)
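
The pasted command is shown only in the figure. A minimal sketch of what it typically looks like, assuming the default dialog options are left unchanged (the subcommands SPSS pastes can differ slightly between versions), is:

```spss
/* Frequency distributions of household trips and car ownership */

FREQUENCIES VARIABLES=ntrip numcars
  /ORDER=ANALYSIS.
```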

To copy the results to an EXCEL document, simply right click on the result (the
frequency table in this case) and select COPY. Open an EXCEL document and paste.


Why do you think there are so few households that make only one trip during the day?
(Answer: People generally come back home on the same day, making at least 2 trips)

3.2 Descriptive Statistics

Descriptive Statistics include summary measures such as average, variance, range,
skewness, etc. We can use these for analyzing continuous variables and for variables whose
range of data values is too large for Frequency analysis.

In the Data Editor window (or in the Syntax File Window), click on
ANALYZE->DESCRIPTIVE STATISTICS->DESCRIPTIVES... A new “Descriptives” dialog
box opens up. Select the two variables of interest (ntrip and income) by highlighting each
of the variables from the list and clicking on the “>” button. One can use the OPTIONS
button to specify the statistics of interest. Mean, standard deviation, minimum, and
maximum are the statistics provided by default and these are adequate for our purposes.

Once the two variables are selected, click PASTE. The syntax for generating the
descriptive statistics for the two variables is added to the syntax file already open. Add
comments as appropriate. Highlight the command, right-click, and select RUN
CURRENT. The results are displayed in an Output Viewer window.
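
For reference, a sketch of the kind of command this produces, assuming only the four default statistics are requested (the pasted text may differ slightly by SPSS version):

```spss
/* Descriptive statistics for household trips and household income */

DESCRIPTIVES VARIABLES=ntrip income
  /STATISTICS=MEAN STDDEV MIN MAX.
```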

3.3 Cross Tabulations

Cross Tabulations are a useful tool to explore internal consistency of data in the file. For
example, if we have data on both total number of people and number of children in the
household, we would expect that number of people >= number of children for each
household. This can be explored by cross tabulating number of people against number of
children.

Alternatively, Cross Tabulations are also useful as a simple bivariate-analysis tool. That
is, we can explore whether there is a systematic relationship between two variables. In
this exercise, we will examine whether the size of a household is related to the
automobile holdings of the household.

In the Data Editor window (or in the Syntax File Window), click on
ANALYZE->DESCRIPTIVE STATISTICS->CROSSTABS... A new “Crosstabs” dialog box
opens up. Select the variable numcars for the “Rows” and the variable hhsize for the
“Columns” (again, highlight the variable of interest from the list and click on the
appropriate “>” button).

Once the two variables are selected, click PASTE. The syntax for cross tabulating hhsize
(in columns) against number of cars (in rows) is added to the Syntax File already open.
Add comments as appropriate. Highlight the command, right click, and select RUN
CURRENT. The results are displayed on an Output Viewer window.
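
A sketch of the cross-tabulation command, assuming default dialog settings (newer SPSS versions may paste additional subcommands):

```spss
/* Cross tabulation: number of cars (rows) by household size (columns) */

CROSSTABS
  /TABLES=numcars BY hhsize
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT.
```

The variable named before BY defines the rows and the one after BY defines the columns.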


The results are interpreted as follows: There are 87 households in the sample with one
person and zero cars, 190 households with 2 persons and one car, and so on.

We see that there are 61 two-person households with three cars and 32 five-person
households with three cars. Does this mean that two-person households are more likely
than five-person households to own three cars?

Let us now examine the same relationship in terms of percentages. In the Syntax File,
make another copy of the cross-tabulation syntax and replace the “COUNT” (following
/CELLS =) with “COLUMN”. Note that the same can also be accomplished from the GUI:
in the Crosstabs dialog box, click CELLS and check “Column Percentages”.

Running this new syntax gives the following output. In this case, the results are COLUMN
percentages, i.e., 20.3% of one-person households own no cars, 28.9% of two-person
households have one car, and so on.
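
As a sketch, the modified copy of the command would look roughly as follows (same assumptions about default subcommands as for the count table):

```spss
/* Cross tabulation with column percentages instead of counts */

CROSSTABS
  /TABLES=numcars BY hhsize
  /FORMAT=AVALUE TABLES
  /CELLS=COLUMN.
```

Because hhsize is the column variable, each column of percentages sums to 100% within one household-size group.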

Now look at the numbers for two-person and five-person households with three cars.
What do you conclude?
What can you conclude about the auto ownership levels of 10-person households?
What broad conclusions would you draw about the “impact” of household size on car
ownership?
Which of the two cross tabulations you have developed is necessary for making these
conclusions?


4. Creating Variables

4.1 Creating New Variables

First we will look at creating new variables (adding columns). We will do this by directly
typing in the command.

Type in the following to the Syntax File:

/* create a new variable: number of non workers in the household

COMPUTE numnonwork = hhsize - numwork.
VARIABLE LABELS numnonwork 'number of non workers in household'.
EXECUTE.

Note that the above can also be accomplished using the GUI. Click on
TRANSFORM->COMPUTE VARIABLES and provide the necessary inputs in the dialog box that pops
up. Click PASTE to get the above syntax pasted on to the syntax file.

Run the above command. A new data column gets appended to the file (in the Data
View). In the Variable View, an additional row gets added. Since this is an integer
variable, you can set the number of decimal places for this variable to 0 using the
Variable View.
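
The number of decimal places can also be set from the syntax file rather than the Variable View. A minimal sketch, assuming a display width of 8 is acceptable (the width is an arbitrary choice for illustration):

```spss
/* Display numnonwork as an integer (width 8, 0 decimal places) */

FORMATS numnonwork (F8.0).
```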

4.2 Recoding Variables

Recoding an existing variable is another approach to creating new variables. Such an
exercise may be required for many reasons. For example, a categorical variable can be
aggregated to fewer categories for simple exploratory analysis. A continuous variable
may be recoded into discrete categories for the purposes of exploring non-linearities in
the empirical specifications.


As an example, we will recode the continuous income variable into the following 3
categories (arbitrarily chosen for demonstration purposes): low income (less than 30K),
medium income (30-50K), and high income (higher than 50K).

In the Data Editor window (or in the Syntax File Window), click on
TRANSFORM->RECODE INTO DIFFERENT VARIABLES. The “Recode into Different Variables”
dialog box opens up. Select income as the variable to be recoded. Enter inccats as the
name of the output variable and provide a label for this variable (income in categories).
Click on CHANGE.

Now click on the OLD AND NEW VALUES button to define the transformation.
• Check “Range: Lowest through _______” and enter the value 30000 in the box.
Enter 1 under New Values and click ADD.
• Now check “Range ______ through _______” and enter the values 30000 and
50000 as the range in the appropriate boxes. Enter 2 under New Values and click
ADD.
• Check “Range: _______ through Highest” and enter the value 50000 in the box.
Enter 3 under New Values and click ADD.
• Click the CONTINUE button.
• You get back to the “Recode into Different Variables” window. Click PASTE.


The Syntax for the recoding gets pasted on to the syntax file.
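
The pasted command appears only in the figure. A sketch of what it should resemble, assuming the three ranges were added in the order described above:

```spss
/* Recode continuous income into three categories */

RECODE income (Lowest thru 30000=1) (30000 thru 50000=2) (50000 thru Highest=3) INTO inccats.
VARIABLE LABELS inccats 'income in categories'.
EXECUTE.
```

Note that the boundary values 30000 and 50000 appear in two ranges each. RECODE applies the first specification that matches a value, so with this ordering an income of exactly 30000 would be coded 1 and an income of exactly 50000 would be coded 2. Whether this matters depends on how income is recorded in the data file.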

Now, to provide more meaningful descriptions of the categories (1, 2, and 3) we have
created, enter the following in the Syntax File:

VALUE LABELS inccats
1 'less than 30K'
2 '30K - 50K'
3 'more than 50K'.
EXECUTE.

Highlight the RECODE and VALUE LABELS commands and run them. The new variable
with the appropriate labels is created. Since this is an integer variable, you can set the
number of decimal places for this variable to 0 using the Variable View.


Run a frequency distribution on the newly created variable. You should see the following
distribution:

Run a cross tabulation of the continuous income variable against the categorical income
variable to see whether the variable has been correctly recoded.
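
Both checks can be run from the syntax file. A sketch, reusing the same command forms as in Section 3 (the subcommands shown are assumed defaults):

```spss
/* Check the recoded variable: frequencies, then income against inccats */

FREQUENCIES VARIABLES=inccats
  /ORDER=ANALYSIS.

CROSSTABS
  /TABLES=income BY inccats
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT.
```

In the cross tabulation, every income value should fall in exactly one inccats column if the recoding worked as intended.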


5. Linear Regression Model

As an example, we will estimate the following simple regression model:

$$NTRIP_i = \beta_0 + \beta_1 (HHSIZE_i) + \beta_2 (NUMCARS_i) + \varepsilon_i$$

In the Data Editor window, click on ANALYZE->REGRESSION->LINEAR... The
“Linear Regression” dialog box opens up. Select ntrip as the dependent variable. Select
hhsize and numcars as the independent variables. Leave the METHOD as “Enter”. Click
PASTE. The syntax for this model is pasted on to the Syntax File.
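
A sketch of the regression command, assuming all other dialog options are left at their defaults (the subcommands other than /DEPENDENT and /METHOD are the usual defaults and may vary by version):

```spss
/* Linear regression of household trips on household size and number of cars */

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT ntrip
  /METHOD=ENTER hhsize numcars.
```

The /NOORIGIN subcommand is what keeps the constant term in the model.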

NOTE: By default a constant is always added to the regression model. There is no need
to include a column of ones in the data file.

Run the command for regression from the syntax file. The results are displayed on the
Output Viewer.


Under the model summary, we have the $R^2$ and the adjusted $R^2$ values. The standard error
of estimate is the standard deviation of the error term (i.e., $s$).

Under the ANOVA, we have the values for SST (total sum of squares), SSE (residual
sum of squares), and SSR (regression sum of squares). The value under the column “df”
for the row Total, would be N-1, where N= sample size=2000. The value under the
column “df” for the row Regression, would be the number of explanatory variables
(K=2).

Note that (1) $SST = SSE + SSR$, (2) $R^2 = SSR/SST$, and (3) $s^2 = SSE/(N-K-1)$ [N =
sample size = 2000, K = number of explanatory variables = 2]
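
As a worked check of these quantities, assuming all N = 2000 households enter the estimation (i.e., no cases are dropped for missing values), the degrees of freedom and the identities above give:

```latex
\begin{align*}
df_{\text{Total}} &= N - 1 = 2000 - 1 = 1999 \\
df_{\text{Regression}} &= K = 2 \\
df_{\text{Residual}} &= N - K - 1 = 2000 - 2 - 1 = 1997 \\
s^2 &= \frac{SSE}{N-K-1} = \frac{SSE}{1997}, \qquad
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
\end{align*}
```

The last equality follows directly from $SST = SSE + SSR$.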

Under the Coefficients, we have the estimates of the model coefficients/parameters, the
standard errors, and the t statistics. Important Note: Although we call the parameters
“betas” in class, SPSS provides these under the column “B”. Do NOT use the values
provided in the column “Beta” by SPSS. The estimates of the model parameters are
$\beta_0 = 0.232$; $\beta_1 = 2.184$; $\beta_2 = 0.826$. Note also that each t value equals
B / Std. Error(B).

6. Analyses on Subsets of Data


This section of the exercise is focused on performing analysis on a subset of the data file
rather than the whole, without having to physically split the file. For example, one might
be interested in estimating different models for different subgroups of the population
(this is called market segmentation). In this exercise, we are going to estimate a model
specifically for the non-low-income households (i.e., income >= 30K).

In the Data Editor window, Click on DATA->SELECT CASES… The Select Cases
Window opens up. Check “If Condition is satisfied” and click on the IF button. A new
window, “Select Cases If” opens. Enter the selection criterion (inccats >= 2) and click
CONTINUE.

You will be returned to the previous window. Make sure that the option “Filtered” is
chosen for “Unselected cases are” and click PASTE. Syntax for selecting the
appropriate subset of data for further analysis is generated and pasted on to the syntax
file. Once this syntax is run (don’t run it just yet), all further analysis will be done on the
data subset although the data file continues to physically have all the records.

Since we want to estimate the same specification as before for the regression model,
simply copy-paste the code for running the regression model.


Once the model is estimated, we want to restore the dataset to its original status.
In the Data Editor window, click on DATA->SELECT CASES… The Select Cases
Window opens up. Check “All Cases” and click PASTE. Syntax for selecting the entire
data for further analysis is generated and pasted on to the syntax file.

Now, highlight the entire command syntax (selecting only the subset, regression model,
and selecting all the data again) and run.
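
Put together, the three pieces might look like the following sketch. The filter block follows the form SPSS usually pastes for Select Cases with the “Filtered” option, and the regression subcommands are assumed defaults; the exact pasted text may differ by version:

```spss
/* Select only the non-low-income households (inccats >= 2) */

USE ALL.
COMPUTE filter_$=(inccats >= 2).
VARIABLE LABELS filter_$ 'inccats >= 2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

/* Same regression specification as before, now estimated on the filtered cases */

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT ntrip
  /METHOD=ENTER hhsize numcars.

/* Restore the full dataset for any further analysis */

FILTER OFF.
USE ALL.
EXECUTE.
```

The filter only flags cases as selected or not selected; no records are deleted, which is why FILTER OFF is enough to return to the full sample.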

As always, the results are displayed in the Output Viewer.


You will see that this model was estimated using only the 1312 households with income
>= 30K. As already discussed, the value under the column “df” for the row Total is
N-1, where N = sample size (here, 1312 - 1 = 1311). Further, from the frequency
distribution results on the inccats variable, we know that there are 1312 households in
the middle- and high-income categories.
