
Praise for Larry Hatcher

The writing is exceptionally clear and easy to follow, and precise definitions are
provided to avoid confusion. Examples are used to illustrate each concept, and
those examples are, like everything in this book, clear and logically presented.
Sample SAS output is provided for every analysis, with each part labeled and
thoroughly explained so the reader understands the results.
Sheri Bauman, Ph.D.
Assistant Professor
Department of Educational Psychology
University of Arizona, Tucson

[Larry Hatcher] once again manages to provide clear, concise, and detailed
explanations of the SAS program and procedures, including appropriate examples
and sample write-ups.
Frank Pajares
Winship Distinguished Research Professor
Emory University

The Student Guide and the Exercises books are excellent choices for use in
quantitative courses in psychology and education.

Bert W. Westbrook, Ph.D.
Professor of Psychology
Alumni Distinguished Undergraduate Professor
North Carolina State University

Step-by-Step Basic Statistics Using SAS: Student Guide

Larry Hatcher, Ph.D.

The correct bibliographic citation for this manual is as follows: Hatcher, Larry. 2003. Step-by-Step Basic Statistics
Using SAS: Student Guide. Cary, NC: SAS Institute Inc.

Step-by-Step Basic Statistics Using SAS: Student Guide


Copyright © 2003 by SAS Institute Inc., Cary, NC, USA
ISBN 1-59047-148-2
All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise,
without the prior written permission of the publisher, SAS Institute Inc.
U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related
documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in
FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
1st printing, April 2003
SAS Publishing provides a complete selection of books and electronic products to help customers use SAS software
to its fullest potential. For more information about our e-books, e-learning products, CDs, and hardcopy books, visit
the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

Dedication

To my friends at Saginaw Valley State University.


Contents
Acknowledgments .............................................................................................ix
Chapter 1: Using This Student Guide ..............................................................1
Introduction ........................................................................................................................... 3
Introduction to the SAS System ............................................................................................ 4
Contents of This Student Guide ............................................................................................ 6
Conclusion .......................................................................................................................... 11

Chapter 2: Terms and Concepts Used in This Guide ..................................13


Introduction ......................................................................................................................... 15
Research Hypotheses and Statistical Hypotheses ............................................................. 16
Data, Variables, Values, and Observations ........................................................................ 21
Classifying Variables According to Their Scales of Measurement...................................... 24
Classifying Variables According to the Number of Values They Display ............................ 27
Basic Approaches to Research........................................................................................... 29
Using Type-of-Variable Figures to Represent Dependent and
Independent Variables ..................................................................................................... 32
The Three Types of SAS Files ............................................................................................ 37
Conclusion .......................................................................................................................... 45

Chapter 3: Tutorial: Writing and Submitting SAS Programs .......................47


Introduction ......................................................................................................................... 48
Tutorial Part I: Basics of Using the SAS Windowing Environment..................................... 50
Tutorial Part II: Opening and Editing an Existing SAS Program ......................................... 75
Tutorial Part III: Submitting a Program with an Error ......................................................... 94
Tutorial Part IV: Practicing What You Have Learned ....................................................... 102
Summary of Steps for Frequently Performed Activities .................................................... 105
Controlling the Size of the Output Page with the OPTIONS Statement............................ 109
For More Information......................................................................................................... 110
Conclusion ........................................................................................................................ 110

Chapter 4: Data Input .....................................................................................111


Introduction ....................................................................................................................... 113
Example 4.1: Creating a Simple SAS Data Set ............................................................... 117
Example 4.2: A More Complex Data Set ......................................................................... 122
Using PROC MEANS and PROC FREQ to Identify Obvious Problems
with the Data Set........................................................................................................... 131
Using PROC PRINT to Create a Printout of Raw Data..................................................... 139
The Complete SAS Program............................................................................................. 142
Conclusion ........................................................................................................................ 144

Chapter 5: Creating Frequency Tables ........................................................145


Introduction ....................................................................................................................... 146
Example 5.1: A Political Donation Study.......................................................................... 147
Using PROC FREQ to Create a Frequency Table............................................................ 152

Examples of Questions That Can Be Answered by Interpreting
a Frequency Table ........................................................................................................ 155
Conclusion ........................................................................................................................ 157

Chapter 6: Creating Graphs ..........................................................................159


Introduction ....................................................................................................................... 160
Reprise of Example 5.1: The Political Donation Study....................................... 161
Using PROC CHART to Create a Frequency Bar Chart ................................................... 162
Using PROC CHART to Plot Means for Subgroups.......................................................... 174
Conclusion ........................................................................................................................ 177

Chapter 7: Measures of Central Tendency and Variability ........................179


Introduction ....................................................................................................................... 181
Reprise of Example 5.1: The Political Donation Study...................................................... 181
Measures of Central Tendency: The Mode, Median, and Mean ...................................... 183
Interpreting a Stem-and-Leaf Plot Created by PROC UNIVARIATE ................................ 187
Using PROC UNIVARIATE to Determine the Shape of Distributions ............................... 190
Simple Measures of Variability: The Range, the Interquartile Range,
and the Semi-Interquartile Range ................................................................................. 200
More Complex Measures of Central Tendency: The Variance and
Standard Deviation........................................................................................................ 204
Variance and Standard Deviation: Three Formulas ......................................................... 207
Using PROC MEANS to Compute the Variance and Standard Deviation ........................ 210
Conclusion ........................................................................................................................ 214

Chapter 8: Creating and Modifying Variables and Data Sets ....................215


Introduction ....................................................................................................................... 217
Example 8.1: An Achievement Motivation Study ............................................................. 218
Using PROC PRINT to Create a Printout of Raw Data..................................................... 222
Where to Place Data Manipulation and Data Subsetting Statements............................... 225
Basic Data Manipulation ................................................................................................... 228
Recoding a Reversed Item and Creating a New Variable for the
Achievement Motivation Study...................................................................................... 235
Using IF-THEN Control Statements .................................................................................. 239
Data Subsetting................................................................................................................. 248
Combining a Large Number of Data Manipulation and
Data Subsetting Statements in a Single Program......................................................... 256
Conclusion ........................................................................................................................ 260

Chapter 9: z Scores........................................................................................261
Introduction ....................................................................................................................... 262
Example 9.1: Comparing Mid-Term Test Scores for Two Courses................................. 266
Converting a Single Raw-Score Variable into a z-Score Variable .................................... 268
Converting Two Raw-Score Variables into z-Score Variables .......................................... 278
Standardizing Variables with PROC STANDARD............................................................. 285
Conclusion ........................................................................................................................ 286


Chapter 10: Bivariate Correlation .................................................................287


Introduction ....................................................................................................................... 290
Situations Appropriate for the Pearson Correlation Coefficient......................................... 290
Interpreting the Sign and Size of a Correlation Coefficient ............................................... 293
Interpreting the Statistical Significance of a Correlation Coefficient ................................. 297
Problems with Using Correlations to Investigate Causal Relationships............................ 299
Example 10.1: Correlating Weight Loss with a Variety of Predictor Variables................. 303
Using PROC PLOT to Create a Scattergram.................................................................... 307
Using PROC CORR to Compute the Pearson Correlation
between Two Variables................................................................................................. 313
Using PROC CORR to Compute All Possible Correlations
for a Group of Variables ................................................................................................ 320
Summarizing Results Involving a Nonsignificant Correlation............................................ 324
Using the VAR and WITH Statements to Suppress the Printing of
Some Correlations ........................................................................................................ 329
Computing the Spearman Rank-Order Correlation Coefficient for
Ordinal-Level Variables................................................................................................. 332
Some Options Available with PROC CORR ..................................................................... 333
Problems with Seeking Significant Results ....................................................................... 335
Conclusion ........................................................................................................................ 338

Chapter 11: Bivariate Regression.................................................................339


Introduction ....................................................................................................................... 341
Choosing between the Terms Predictor Variable, Criterion Variable,
Independent Variable, and Dependent Variable ............................................................... 341
Situations Appropriate for Bivariate Linear Regression .................................................... 344
Example 11.1: Predicting Weight Loss from a Variety of Predictor Variables.................. 346
Using PROC REG: Example with a Significant Positive
Regression Coefficient .................................................................................................. 350
Using PROC REG: Example with a Significant Negative Regression Coefficient ........... 371
Using PROC REG: Example with a Nonsignificant Regression Coefficient..................... 379
Conclusion ........................................................................................................................ 383

Chapter 12: Single-Sample t Test .................................................................385


Introduction ....................................................................................................................... 387
Situations Appropriate for the Single-Sample t Test ......................................................... 387
Results Produced in a Single-Sample t Test..................................................................... 388
Example 12.1: Assessing Spatial Recall in a Reading Comprehension
Task (Significant Results) ............................................................................................. 393
One-Tailed Tests versus Two-Tailed Tests ...................................................................... 406
Example 12.2: An Illustration of Nonsignificant Results................................................... 407
Conclusion ........................................................................................................................ 412

Chapter 13: Independent-Samples t Test ....................................................413


Introduction ....................................................................................................................... 415
Situations Appropriate for the Independent-Samples t Test ............................................. 417
Results Produced in an Independent-Samples t Test....................................................... 420

Example 13.1: Observed Consequences for Modeled Aggression:
Effects on Subsequent Subject Aggression (Significant Differences)........................... 428
Example 13.2: An Illustration of Results Showing Nonsignificant Differences................. 446
Conclusion ........................................................................................................................ 450

Chapter 14: Paired-Samples t Test...............................................................451


Introduction ....................................................................................................................... 453
Situations Appropriate for the Paired-Samples t Test ....................................................... 453
Similarities between the Paired-Samples t Test and the Single-Sample t Test ................ 457
Results Produced in a Paired-Samples t Test .................................................................. 461
Example 14.1: Women's Responses to Emotional versus Sexual Infidelity .................... 463
Example 14.2: An Illustration of Results Showing Nonsignificant Differences................. 483
Conclusion ........................................................................................................................ 487

Chapter 15: One-Way ANOVA with One Between-Subjects Factor ..........489


Introduction ....................................................................................................................... 491
Situations Appropriate for One-Way ANOVA with One Between-Subjects Factor ........... 491
A Study Investigating Aggression ..................................................................................... 494
Treatment Effects, Multiple Comparison Procedures, and a New Index of Effect Size ....... 497
Some Possible Results from a One-Way ANOVA ............................................................ 500
Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect ................... 505
Example 15.2: One-Way ANOVA Revealing a Nonsignificant Treatment Effect ............. 529
Conclusion ........................................................................................................................ 537

Chapter 16: Factorial ANOVA with Two Between-Subjects Factors.........539


Introduction ....................................................................................................................... 542
Situations Appropriate for Factorial ANOVA with Two Between-Subjects Factors ........... 542
Using Factorial Designs in Research ................................................................................ 546
A Different Study Investigating Aggression....................................................................... 546
Understanding Figures That Illustrate the Results of a Factorial ANOVA......................... 550
Some Possible Results from a Factorial ANOVA.............................................................. 553
Example of a Factorial ANOVA Revealing Two Significant Main Effects
and a Nonsignificant Interaction.................................................................................... 565
Example of a Factorial ANOVA Revealing Nonsignificant Main Effects
and a Nonsignificant Interaction.................................................................................... 607
Example of a Factorial ANOVA Revealing a Significant Interaction ................................. 617
Using the LSMEANS Statement to Analyze Data from Unbalanced Designs................... 625
Learning More about Using SAS for Factorial ANOVA ..................................................... 627
Conclusion ........................................................................................................................ 628

Chapter 17: Chi-Square Test of Independence ............................................629


Introduction ....................................................................................................................... 631
Situations That Are Appropriate for the Chi-Square Test of Independence...................... 631
Using Two-Way Classification Tables............................................................................... 634
Results Produced in a Chi-Square Test of Independence ................................................ 637
A Study Investigating Computer Preferences ................................................................... 640
Computing Chi-Square from Raw Data versus Tabular Data ........................................... 642

Example of a Chi-Square Test That Reveals a Significant Relationship .......................... 643
Example of a Chi-Square Test That Reveals a Nonsignificant Relationship .................... 661
Computing Chi-Square from Raw Data............................................................................. 668
Conclusion ........................................................................................................................ 671

References .......................................................................................................673
Index..................................................................................................................675


Acknowledgments

During the development of these books, Caroline Brickley, Gretchen Rorie Harwood,
Stephenie Joyner, Sue Kocher, Patsy Poole, and Hanna Schoenrock served as editors. All
were positive, supportive, and helpful. They made the books stronger, and I thank them
for their guidance.
A number of other people at SAS made valuable contributions in a variety of areas. My
sincere thanks go to those who reviewed the books for technical accuracy and readability:
Jim Ashton, Jim Ford, Marty Hultgren, Catherine Lumsden, Elizabeth Maldonado, Paul
Marovich, Ted Meleky, Annette Sanders, Kevin Scott, Ron Statt, and Morris Vaughan. I
also thank Candy Farrell and Karen Perkins for production and design; Joan Stout for
indexing; Cindy Puryear and Patricia Spain for marketing; and Cate Parrish for the cover
designs.
Special thanks to my wife Ellen, who was loving and supportive throughout.


Chapter 1: Using This Student Guide
Introduction............................................................................................ 3
Overview...................................................................................................................3
Intended Audience and Level of Proficiency .............................................................3
Platform and Version ................................................................................................3
Materials Needed......................................................................................................4
Introduction to the SAS System ............................................................ 4
Why Do You Need This Student Guide?...................................................................4
What Is the SAS System?.........................................................................................5
Who Uses SAS? .......................................................................................................5
Using the SAS System for Statistical Analyses.........................................................5
Contents of This Student Guide............................................................. 6
Overview...................................................................................................................6
Chapter 2: Terms and Concepts Used in This Guide...............................................7
Chapter 3: Tutorial: Using the SAS Windowing Environment to Write and
Submit SAS Programs ..........................................................................................7
Chapter 4: Data Input...............................................................................................7
Chapter 5: Creating Frequency Tables ....................................................................7
Chapter 6: Creating Graphs.....................................................................................8
Chapter 7: Measures of Central Tendency and Variability.......................................8
Chapter 8: Creating and Modifying Variables and Data Sets...................................8
Chapter 9: Standardized Scores (z Scores).............................................................8
Chapter 10: Bivariate Correlation.............................................................................9
Chapter 11: Bivariate Regression ............................................................................9
Chapter 12: Single-Sample t Test ............................................................................9


Chapter 13: Independent-Samples t Test ................................................................9


Chapter 14: Paired-Samples t Test..........................................................................9
Chapter 15: One-Way ANOVA with One Between-Subjects Factor.......................10
Chapter 16: Factorial ANOVA with Two Between-Subjects Factors ......................10
Chapter 17: Chi-Square Test of Independence .....................................................10
References .............................................................................................................10
Conclusion.............................................................................................11


Introduction
Overview
This chapter introduces you to the SAS System, a computer application that can be used to
perform statistical analyses. It explains what SAS is and where it is installed, and describes
some of the advantages associated with using SAS for data analysis. Finally, it briefly
summarizes what you will learn in each of the chapters that make up this Student Guide.
Intended Audience and Level of Proficiency
This guide is intended for those who want to learn how to use SAS to perform elementary
statistical analyses. The guide assumes that many students using it have not already taken a
course on elementary statistics. To assist these students, this guide briefly reviews basic
terms and concepts in statistics at an elementary level. It was designed to be easily
understood by first- and second-year college students.
This book was also designed to be user-friendly to those who may have little or no
experience with personal computers. The beginning of Chapter 3, "Tutorial: Using the SAS
Windowing Environment to Write and Submit SAS Programs," reviews basic concepts in
using Microsoft Windows, such as selecting menus, double-clicking icons, and so forth.
Those who already have experience in using Windows will be able to quickly skim through
this elementary material.
Platform and Version
This guide shows how to use the SAS System for Windows, as opposed to other operating
environments. This is most apparent in Chapter 3, "Using the SAS Windowing Environment
to Write and Submit SAS Programs." However, the remaining chapters show how to write
SAS code to perform statistical analyses, and most of this material will be useful to all SAS
users, regardless of the operating environment. This is because, for the most part, the same
SAS code can be used on a wide variety of operating environments to obtain the same
results.
This book was designed for those using the SAS System Version 8 and later versions. It may
also be helpful to those using earlier versions of SAS (such as V6 or V7). However, if you
are using one of these earlier versions, it is likely that some of the SAS system options
described here are not available with your version. It is also likely that some of the SAS
output that you obtain will be arranged differently than the output that is presented here.


Materials Needed
To complete the activities described in this book, you will need

access to a personal computer on which the SAS System for Windows has been
installed,

one (and preferably two) 3.5-inch disks, formatted for IBM PCs (or some other type of
storage media).

Some students using this book will also use its companion volume, Step-by-Step Basic
Statistics Using SAS: Exercises. The chapters in the Exercises book parallel most of the
chapters contained in this Student Guide. Each chapter in the Exercises book contains two
assignments for students to complete. Complete solutions are provided for the odd-numbered exercises, but not for the even-numbered ones. The Exercises book can give you
useful practice in learning how to use SAS, but it is not absolutely required.

Introduction to the SAS System


Why Do You Need This Student Guide?
This Student Guide shows you how to use a computer application called the SAS System to
perform elementary statistical analyses. Until recently, students in elementary statistics
courses typically performed statistical computations by hand or with a pocket calculator. In
recent years, however, the increased availability of computers has made it possible for
students to also use statistical software packages such as SPSS and the SAS System to
perform these analyses. This latter approach allows students to focus more on conceptual
issues in statistics, and spend less time on the mechanics of performing mathematical
operations by hand. Step by step, this Student Guide will introduce you to the SAS System,
and will show you how to use it to perform a variety of statistical analyses that are
commonly used in the social and behavioral sciences and in education.


What Is the SAS System?


The SAS System is a modular, integrated, and hardware-independent application. It is used
as an information delivery system by business organizations, governments, and universities
worldwide.
SAS is used for virtually every aspect of information management in organizations,
including decision support, project management, financial analysis, quality improvement,
data warehousing, report writing, and presentations. However, this guide will focus on just
one aspect of SAS: its ability to perform the types of statistical analyses that are appropriate
for research in the social sciences and education.
By the time you have completed this text, you will have accomplished two objectives: you
will have learned how to perform elementary statistical analyses using SAS, and you will
have become familiar with a widely used information delivery system.
Who Uses SAS?
The SAS System is widely used in business organizations and universities. Consider the
following statistics from July 2002:

SAS supports over 40 operating environments, including Windows, OS/2, and UNIX.

SAS Institute's computer software products are installed at over 38,400 sites in 115
countries.

Approximately 71% of SAS installations are in business locations, 18% are education
sites, and 11% are government sites. It is used for teaching and research at about 3,000
university locations.

It is estimated that SAS software products are used by more than 3.5 million people
worldwide.

90% of all Fortune 500 companies are SAS clients.

Using the SAS System for Statistical Analyses


SAS is a particularly powerful tool for social scientists and educators because it allows them
to easily perform virtually any type of statistical analysis that may be required in their
research. SAS is comprehensive enough to perform the most sophisticated multivariate
analyses, but is so easy to use that undergraduates can perform simple analyses after only a
short period of instruction.
In a sense, the SAS System may be viewed as a library of prewritten statistical algorithms.
By submitting a brief SAS program, you can access a procedure from the library


and use it to analyze a set of data. For example, below are the SAS statements used to call
up the algorithm that calculates Pearson correlation coefficients:
PROC CORR   DATA=D1;
RUN;

The preceding statements will cause SAS to compute the Pearson correlation between every
possible pair of numeric variables in your data set. Being able to call up complex
procedures with such a simple statement is what makes SAS so powerful. By contrast, if
you had to prepare your own programs to compute Pearson correlations by using a
programming language such as FORTRAN or BASIC, it would require many statements,
and there would be many opportunities for error. By using SAS instead, most of the work
has already been completed, and you are able to focus on the results of the analysis rather
than on the mechanics of obtaining those results.
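To make this concrete, here is a minimal sketch of a complete program built around PROC CORR. The data set name D1 matches the statements above, but the variable names and data values are hypothetical, invented for illustration; the DATA and INPUT statements used to read the data are covered in Chapter 4.

DATA D1;
   INPUT  TRAINING  MOTIVATION  SALES;
DATALINES;
55  62  190
48  71  220
61  50  175
;
PROC CORR   DATA=D1;
RUN;

Submitting this program would print the Pearson correlation for each pair of the three numeric variables, along with a p value for each coefficient.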

Contents of This Student Guide


Overview
This guide has two objectives: to teach the basics of using SAS in general and, more
specifically, to show how to use SAS procedures to perform elementary statistical analyses.
Chapters 1–4 provide an overview of the basics of using SAS. The remaining chapters cover
statistical concepts in a sequence that is representative of the sequence followed in most
elementary statistics textbooks.
Chapters 10–17 introduce you to inferential statistical procedures (the type of procedures
that are most often used to analyze data from research). Each chapter shows you how to
conduct the analysis from beginning to end. Each chapter also provides an example of how
the analysis might be summarized for publication in an academic journal in the social
sciences or education. For the most part, these summaries are written according to the
guidelines provided in the Publication Manual of the American Psychological Association
(1994).
Many students using this book will also use its companion volume, Step-by-Step Basic
Statistics Using SAS: Exercises. For Chapters 3–17 in this student guide, the corresponding
chapter in the exercise book provides you with a hands-on exercise that enables you to
practice the data analysis skills that you are learning.
The following sections provide a summary of the contents of the remaining chapters in this
guide.


Chapter 2: Terms and Concepts Used in This Guide


Chapter 2 defines some important terms related to research and statistics that will be used
throughout this guide. It also introduces you to the three types of files that you will work
with during a typical session with SAS: the SAS program, the SAS log, and the SAS output
file.
Chapter 3: Tutorial: Using the SAS Windowing Environment to
Write and Submit SAS Programs
The SAS windowing environment is a powerful application that you will use to create,
edit, and submit SAS programs. You will also use it to review your SAS logs and output.
Chapter 3 provides a tutorial that teaches you how to use this application. Step by step, it
shows you how to write simple SAS programs and interpret their results. By the end of this
chapter, you should be ready to use the SAS windowing environment to write and submit
SAS programs on your own.
Chapter 4: Data Input
Chapter 4 shows you how to use the DATA and INPUT statements to create SAS data sets.
You will learn how to read both numeric and character variables by using a simple list style
for data input. By the end of the chapter, you will be prepared to input the data sets that will
be presented throughout the remainder of this guide.
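As a preview, a data-input step of the kind taught in this chapter might look like the following sketch. The data set and variable names are hypothetical; the $ after a variable name marks it as a character variable in list-style input.

DATA D1;
   INPUT  NAME $  SEX $  AGE;
DATALINES;
Marsha   F  32
Charles  M  28
;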
Chapter 5: Creating Frequency Tables
Chapter 5 shows you how to create frequency tables that are useful for understanding your
data and answering some types of research questions. For example, imagine that you ask a
sample of 150 people to tell you their age. If you then used SAS to create a frequency table
for this age variable, you would be able to easily answer questions such as

How many people are age 30?

How many people are age 30 or younger?

What percent of people are age 45?

What percent of people are age 45 or younger?
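As a preview of the SAS statements involved, a frequency table of this kind could be requested as follows. This is only a sketch; the data set name SURVEY and the variable name AGE are hypothetical.

PROC FREQ   DATA=SURVEY;
   TABLES  AGE;
RUN;

The resulting table lists each observed value of AGE along with its frequency, percent, cumulative frequency, and cumulative percent, which is what makes questions like those above easy to answer.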


Chapter 6: Creating Graphs


Chapter 6 shows you how to use SAS to create frequency bar charts: bar charts that
indicate the number of people who displayed a given value on a variable. For example,
imagine that you asked 150 people to indicate their political party. If you used SAS to create
a frequency bar chart, the resulting chart would indicate the number of people who are
Democrats, the number who are Republicans, and the number who are independents.
Chapter 6 also shows how to create bar charts that plot subgroup means. For example,
assume that, in the political party study described above, you asked the 150 subjects to
indicate both their political party and their age. You could then use SAS to create a bar chart
that plots the mean age for people in each party. For instance, the resulting bar chart might
show that the average age for Democrats was 32.12, the average age for Republicans was
41.56, and the average age for independents was 37.33.
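As a sketch of the statements involved (again with hypothetical data set and variable names), both kinds of charts described above could be requested in a single PROC CHART step:

PROC CHART   DATA=SURVEY;
   VBAR  PARTY;                           /* frequency bar chart   */
   VBAR  PARTY / SUMVAR=AGE  TYPE=MEAN;   /* bar chart of mean age */
RUN;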
Chapter 7: Measures of Central Tendency and Variability
Chapter 7 shows you how to compute measures of variability (e.g., the interquartile range,
standard deviation, and variance) as well as measures of central tendency (e.g., the mean,
median, and mode) for numeric variables. It also shows how to use stem-and-leaf plots to
determine whether a distribution is skewed or approximately normal in shape.
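As a rough sketch, assuming a hypothetical data set SURVEY with a numeric variable AGE, these statistics and plots could be requested this way:

PROC MEANS   DATA=SURVEY   MEAN  MEDIAN  STD  VAR;
   VAR  AGE;
RUN;
PROC UNIVARIATE   DATA=SURVEY   PLOT;
   VAR  AGE;
RUN;

The PLOT option of PROC UNIVARIATE produces the stem-and-leaf plot used to judge the shape of the distribution.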
Chapter 8: Creating and Modifying Variables and Data Sets
Chapter 8 shows how to use subsetting IF statements to create new data sets that contain a
specified subgroup from the original sample. It also shows how to use mathematical
operators and IF-THEN statements to recode variables and to create new variables from
existing variables.
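For instance, the following data step (a sketch with hypothetical variable names) combines a subsetting IF statement with an IF-THEN recode:

DATA WOMEN;
   SET SURVEY;
   IF SEX = 'F';                    /* keep only the female subjects */
   IF AGE GE 65 THEN AGEGRP = 2;    /* recode age into two groups    */
   ELSE AGEGRP = 1;
RUN;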
Chapter 9: Standardized Scores (z Scores)
Chapter 9 shows how to transform raw scores into standardized variables (z score variables)
with a mean of 0 and a standard deviation of 1. You will learn how to do this by using the
data manipulation statements that you learned about in Chapter 8. Chapter 9 also illustrates
how you can review the sign and absolute magnitude of a z score to understand where a
particular observation stands on the variable in question.
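In data-step terms, the transformation subtracts the sample mean from each raw score and divides by the sample standard deviation. A minimal sketch, with a hypothetical variable name and hypothetical values for the mean and standard deviation:

DATA D2;
   SET D1;
   ZTEST = (TEST - 71.40) / 8.20;   /* z = (raw score - mean) / SD */
RUN;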


Chapter 10: Bivariate Correlation


Bivariate correlation coefficients allow you to determine the nature of the relationship
between two numeric variables. Chapter 10 shows you how to use the CORR procedure to
compute Pearson correlation coefficients for interval- and ratio-level variables. You will
also learn to interpret the p values (probability values) that are produced by PROC CORR to
determine whether a given correlation coefficient is significantly different from zero.
Chapter 10 also shows how to use PROC PLOT to create a two-dimensional scattergram
that illustrates the relationship between two variables.
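As a sketch of the statements involved (the variable names WTLOSS and EXERCISE are hypothetical stand-ins for the chapter's variables):

PROC PLOT   DATA=D1;
   PLOT  WTLOSS*EXERCISE;   /* scattergram of the two variables */
RUN;
PROC CORR   DATA=D1;
   VAR  WTLOSS  EXERCISE;
RUN;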
Chapter 11: Bivariate Regression
Bivariate regression is used when you want to predict scores on an interval- or ratio-level
criterion variable from an interval- or ratio-level predictor variable. Chapter 11 shows you
how to use the REG procedure to compute the slope and intercept for the regression
equation, along with predicted values and residuals of prediction.
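A minimal sketch of such an analysis, with hypothetical variable names (the P option asks PROC REG to print predicted values and residuals):

PROC REG   DATA=D1;
   MODEL  WTLOSS = EXERCISE / P;
RUN;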
Chapter 12: Single-Sample t Test
Chapter 12 shows how to use the TTEST procedure to perform a single-sample t test. This is
an inferential procedure that is useful for determining whether a sample mean is
significantly different from a specified population mean. You will learn how to interpret the
t statistic, and the p value associated with that t statistic.
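For example, to test whether a sample mean differs from a hypothesized population mean of 40, statements like the following could be used. This is a sketch with hypothetical names; the H0= option is available in Version 8 and later.

PROC TTEST   DATA=D1   H0=40;
   VAR  RECALL;
RUN;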
Chapter 13: Independent-Samples t Test
You use an independent-samples t test to determine whether there is a significant difference
between two groups of subjects with respect to their mean scores on the dependent variable.
Chapter 13 explains when to use the equal-variance t statistic versus the unequal-variance t
statistic, and shows how to use the TTEST procedure to conduct this analysis.
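A sketch of the statements involved, assuming a hypothetical data set in which GROUP identifies the two groups and AGGRESS is the dependent variable:

PROC TTEST   DATA=D1;
   CLASS  GROUP;
   VAR  AGGRESS;
RUN;

The output reports both the equal-variance and the unequal-variance t statistics, along with a test for equality of variances to help you choose between them.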
Chapter 14: Paired-Samples t Test
The paired-samples t test is also appropriate when you want to determine whether there is a
significant difference between two sample means. The paired-samples approach is indicated
when each score in one sample is dependent upon a corresponding score in the second
sample. This will be the case in studies in which the same subjects provide repeated
measures on the same dependent variable under different conditions, or when matching
procedures are used. Chapter 14 shows how to perform this analysis using the TTEST
procedure.
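With Version 8 and later, the PAIRED statement of PROC TTEST handles this design. A sketch with hypothetical variable names for the two sets of scores:

PROC TTEST   DATA=D1;
   PAIRED  PRETEST*POSTTEST;
RUN;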


Chapter 15: One-Way ANOVA with One Between-Subjects Factor


One-way analysis of variance (ANOVA) is an inferential procedure similar to the
independent-samples t test, with one important difference: while the t test allows you to test
the significance of the difference between two sample means, a one-way ANOVA allows
you to test the significance of the difference between more than two sample means. Chapter
15 shows how to use the GLM procedure to perform a one-way ANOVA and then follow up
with multiple comparison (post hoc) tests.
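A sketch of the statements involved, with hypothetical names (GROUP is the between-subjects factor, AGGRESS the dependent variable; TUKEY is one of several multiple comparison options that the MEANS statement accepts):

PROC GLM   DATA=D1;
   CLASS  GROUP;
   MODEL  AGGRESS = GROUP;
   MEANS  GROUP / TUKEY;
RUN;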
Chapter 16: Factorial ANOVA with Two Between-Subjects Factors
A one-way ANOVA, as described in Chapter 15, may be appropriate for analyzing data
from an experiment in which the researcher manipulates only one independent variable. In
contrast, a factorial ANOVA with two between-subjects factors may be appropriate for
analyzing data from an experiment in which the researcher manipulates two independent
variables simultaneously. Chapter 16 shows how to perform this type of analysis. It provides
examples of results in which the main effects are significant, as well as results in which the
interaction is significant.
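In PROC GLM terms, the second factor and the interaction term are simply added to the CLASS and MODEL statements. A sketch with hypothetical factor names:

PROC GLM   DATA=D1;
   CLASS  SEX  REWARD;
   MODEL  AGGRESS = SEX  REWARD  SEX*REWARD;
RUN;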
Chapter 17: Chi-Square Test of Independence
Nonparametric statistical procedures are procedures that do not require stringent
assumptions about the nature of the populations under study. Chapter 17 illustrates one of
the most common nonparametric procedures: the chi-square test of independence. This test
is appropriate when you want to study the relationship between two variables that assume a
limited number of values. Chapter 17 shows how to conduct the test of significance and
interpret the results presented in the two-way classification table created by the FREQ
procedure.
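As a sketch, with hypothetical variable names for a study of this kind:

PROC FREQ   DATA=D1;
   TABLES  SEX*PREFER / CHISQ;   /* two-way table plus chi-square test */
RUN;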
References
Many statistical procedures are illustrated in this guide by showing you how to analyze
fictitious data from an empirical study. Many of these studies are loosely based on actual
investigations reported in the research literature. These studies were chosen to help
introduce you to the types of empirical investigations that are often conducted in the social
and behavioral sciences and in education. The References section at the end of this guide
provides complete references for the actual studies that inspired the fictitious studies
reported here.


Conclusion
This guide assumes that some of the students using it have not yet completed a course on
elementary statistics. This means that some readers will be unfamiliar with terms used in
data analysis, such as "observations," "null hypothesis," "dichotomous variables," and so
on. To remedy this, the following chapter, "Terms and Concepts Used in This Guide,"
provides a brief primer on basic terms and concepts in statistics. This chapter should lay a
foundation that will make it easier to understand the chapters to follow.


Chapter 2: Terms and Concepts Used in This Guide
Introduction...........................................................................................15
Overview.................................................................................................................15
A Common Language for Researchers...................................................................15
Why This Chapter Is Important ...............................................................................15
Research Hypotheses and Statistical Hypotheses ..............................16
Example: A Goal-Setting Study..............................................................................16
The Research Question ..........................................................................................16
The Research Hypothesis.......................................................................................16
The Statistical Null Hypothesis................................................................................18
The Statistical Alternative Hypothesis.....................................................................19
Directional versus Nondirectional Alternative Hypotheses ......................................19
Summary ................................................................................................................21
Data, Variables, Values, and Observations ..........................................21
Defining the Instrument, Gathering Data, Analyzing Data, and
Drawing Conclusions...........................................................................................21
Variables, Values, and Observations ......................................................................22
Classifying Variables According to Their Scales of Measurement......24
Introduction .............................................................................................................24
Nominal Scales .......................................................................................................25
Ordinal Scales.........................................................................................................25
Interval Scales ........................................................................................................26
Ratio Scales............................................................................................................27


Classifying Variables According to the Number of Values


They Display .....................................................................................27
Overview.................................................................................................................27
Dichotomous Variables ...........................................................................................27
Limited-Value Variables ..........................................................................................28
Multi-Value Variables ..............................................................................................28
Basic Approaches to Research ............................................................29
Nonexperimental Research ....................................................................................29
Experimental Research...........................................................................................31
Using Type-of-Variable Figures to Represent Dependent and
Independent Variables .....................................................................32
Overview.................................................................................................................32
Figures to Represent Types of Variables................................................................33
Using Figures to Represent the Types of Variables Assessed
in a Specific Study...............................................................................................34
The Three Types of SAS Files...............................................................37
Overview.................................................................................................................37
The SAS Program...................................................................................................37
The SAS Log...........................................................................................................42
The SAS Output File ...............................................................................................44
Conclusion.............................................................................................45


Introduction
Overview
This chapter has two objectives. The first is to introduce you to basic terms and concepts
related to research design and data analysis. This chapter describes the different types of
variables that might be analyzed when conducting research, the classification of these
variables according to their scale of measurement or other characteristics, and the
differences between nonexperimental and experimental research.
The chapter's second objective is to introduce you to the three types of files that you will
work with when you perform statistical analyses with SAS. These include the SAS program
file, the SAS log file, and the SAS output file.
After completing this chapter, you should be familiar with the fundamental terms and
concepts that are relevant to data analysis, and you will have a foundation to begin learning
about the SAS System in detail in subsequent chapters.
A Common Language for Researchers
Research in the behavioral sciences and in education is extremely diverse. In part, this is
because the behavioral sciences represent a wide variety of disciplines, including
psychology, sociology, anthropology, political science, management, and other fields.
Further complicating matters is the fact that, within each discipline, a wide variety of
methods are used to conduct research. These methods can include unobtrusive observation,
participant observation, case studies, interviews, focus groups, surveys, ex post facto
studies, laboratory experiments, and field experiments.
Despite this diversity in methods used and topics investigated, most scientific investigations
still share a number of characteristics. Regardless of field, most research involves an
investigator who gathers data and performs analyses to determine what the data mean. In
addition, most researchers in the behavioral sciences and education use a common language
in reporting their research; researchers from all fields typically speak of testing null
hypotheses and obtaining significant p values.
Why This Chapter Is Important
The purpose of this chapter is to review some fundamental concepts and terms that are
shared in the behavioral sciences and in education. You should familiarize (or refamiliarize)
yourself with this material before proceeding to the subsequent chapters, as most of the
terms introduced here will be referred to again and again throughout the text. If you have
not yet taken a course in statistics, this chapter will provide an elementary introduction; if
you have already completed a course in statistics, it will provide a quick review.


Research Hypotheses and Statistical Hypotheses


Example: A Goal-Setting Study
Imagine that you have been hired by a large insurance company to find ways of improving
the productivity of its insurance agents. Specifically, the company would like you to find
ways to increase the number of insurance policies sold by the average agent. You will
therefore begin a program of research to identify the determinants of agent productivity. In
the course of this program, you will work with research questions, research hypotheses, and
statistical hypotheses.
The Research Question
The process of research often begins by developing a clear statement of the research
question (or questions). The research question is a statement of what you hope to have
learned by the time the research has been completed. It is good practice to revise and refine
the research question several times to ensure that you are very clear about what it is you
really want to know.
For example, in the current case, you might begin with the question "What is the
difference between agents who sell much insurance versus agents who sell little insurance?"
A more specific question might be "What variables have a causal effect on the amount of
insurance sold by agents?" Upon reflection, you might realize that the insurance company
really only wants to know what things management can do to cause the agents to sell more
insurance. This might eliminate from consideration those variables that are not under
management's control, and can substantially narrow the focus of the research program. This
narrowing, in turn, leads to a more specific statement of the research question, such as "What
variables under the control of management have a causal effect on the amount of insurance
sold by agents?" Once the research question has been more clearly defined in this way, you
are in a better position to develop a good hypothesis that provides a possible answer to the
question.
The Research Hypothesis
An hypothesis is a statement about the predicted relationships among events or variables. A
good hypothesis in the present case might identify a specific variable that is expected to
have a causal effect on the amount of insurance sold by agents. For example, a research
hypothesis might predict that the agents' level of training will have a positive effect on the
amount of insurance sold. Or it might predict that the agents' level of achievement
motivation will positively affect sales.


In developing the hypothesis, you might be influenced by any of a number of sources: an


existing theory, some related research, or even personal experience. Let's assume that in the
present situation, for example, you have been influenced by goal-setting theory. This theory
states, among other things, that higher levels of work performance are achieved when
difficult goals are set for employees. Drawing on goal-setting theory, you now state the
following hypothesis: "The difficulty of the goals that agents set for themselves is
positively related to the amount of insurance they sell." Notice how this statement satisfies
our definition of a research hypothesis, as it is a statement about the predicted relationship
between two variables. The first variable can be labeled "goal difficulty," and the second
can be labeled "amount of insurance sold."
The predicted relationship between goal difficulty and amount of insurance sold is
illustrated in Figure 2.1. Notice that there is an arrow extending from goal difficulty to
amount of insurance sold. This arrow reflects the prediction that goal difficulty is the causal
variable, and amount of insurance sold is the variable being affected.

Figure 2.1. Causal relationship between goal difficulty and amount of insurance sold, as predicted by the research hypothesis.

In Figure 2.1, you can see that the variable being affected (insurance sold) appears on the
left side of the figure, and that the causal variable (goal difficulty) appears on the right. This
arrangement might seem a bit unusual to you, since most figures that portray causal
relationships have the order reversed (with the causal variable on the left and the variable
being affected on the right). However, this guide will always use the arrangement that
appears in Figure 2.1, for reasons that will become clear later.
You can see that the research hypothesis stated above is quite broad in nature. In many
research situations, however, it is helpful to state hypotheses that are more specific in the
predictions they make. For example, assume that there is an instrument called the Smith
Goal Difficulty Scale. Scores on this fictitious instrument can range from zero to 100, with
higher scores indicating more difficult goals. If you administered this scale to a sample of
agents, you could develop a more specific research hypothesis along the following lines:
"Agents who score 60 or above on the Smith Goal Difficulty Scale will sell greater amounts
of insurance than agents who score below 60."


The Statistical Null Hypothesis


Beginning in Chapter 10, "Bivariate Correlation," this guide will show you how to use the
SAS System to perform tests of null hypotheses. The way that you state a specific null
hypothesis will vary depending on the nature of your research question and the type of data
analysis that you are performing. Generally speaking, however, a statistical null hypothesis
is typically a prediction that there is no difference between groups in the population, or that
there is no relationship between variables in the population.
For example, consider the research hypothesis stated in the preceding section: "Agents who
score 60 or above on the Smith Goal Difficulty Scale will sell greater amounts of insurance
than agents who score below 60." Assume that you conduct a study to investigate this
research hypothesis. You identify two groups of subjects:

• 50 agents who score 60 or above on the Smith Goal Difficulty Scale (the "high goal-difficulty" group).

• 50 agents who score below 60 on the Smith Goal Difficulty Scale (the "low goal-difficulty" group).

You observe these agents over a 12-month period, and record the amount of insurance that
they sell. You want to investigate the following (fairly specific) research hypothesis:
Research hypothesis: The average amount of insurance sold by the high goal-difficulty
group will be greater than the average amount sold by the low goal-difficulty group.
You plan to analyze the data using a statistical procedure such as a t test (which will be
discussed in Chapter 13, Independent-Samples t Test). One way to structure this analysis
is to begin with the following statistical null hypothesis:
Statistical null hypothesis: In the population, there is no difference between the high
goal-difficulty group and the low goal-difficulty group with respect to their mean scores
on the amount of insurance sold.
Notice that this is a prediction of no difference between the groups. You will analyze the
data from your sample, and if the observed difference is large enough, you will reject this
null hypothesis of no difference. Rejecting this statistical null hypothesis means that you
have obtained some support for your original research hypothesis (the hypothesis that there
is a difference between the groups).
Statistical null hypotheses are often represented symbolically. For example, this is how you
could have symbolically represented the preceding statistical null hypothesis:

   H0: µ1 = µ2

where

   H0 is the symbol used to represent the null hypothesis

   µ1 is the symbol used to represent the mean amount of insurance sold by Group 1 (the
   high goal-difficulty group) in the population

   µ2 is the symbol used to represent the mean amount of insurance sold by Group 2 (the
   low goal-difficulty group) in the population.

The Statistical Alternative Hypothesis


A statistical alternative hypothesis is typically a prediction that there is a difference
between groups in the population, or that there is a relationship between variables in the
population. The alternative hypothesis is the counterpart to the null hypothesis; if you reject
the null hypothesis, you tentatively accept the alternative hypothesis.
There are different ways that you can state alternative hypotheses. One way is simply to
predict that there is a difference between the population means, without predicting which
population mean is higher. Here is one way of stating that type of alternative hypothesis for
the current study:
Statistical alternative hypothesis: In the population, there is a difference between the
high goal-difficulty group and the low goal-difficulty group with respect to their mean
scores on the amount of insurance sold.
The alternative hypothesis also can be stated symbolically:

   H1: µ1 ≠ µ2

The H1 symbol above is the symbol for an alternative hypothesis. Notice that the "not equal"
symbol (≠) is used to represent the prediction that the means will not be equal.
Directional versus Nondirectional Alternative Hypotheses
Nondirectional hypotheses. The preceding section illustrated a nondirectional alternative
hypothesis, also known as a two-sided or two-tailed alternative hypothesis. With the type of
study described here (a study in which group means are being compared), a nondirectional
alternative hypothesis simply predicts that one population mean differs from the other
population mean; it does not predict which population mean will be higher. You would
obtain support for this nondirectional alternative hypothesis if the high goal-difficulty group
sold significantly more insurance, on the average, than the low goal-difficulty group. You
would also obtain support for this nondirectional alternative hypothesis if the low goal-difficulty
group sold significantly more insurance than the high goal-difficulty group. With
a nondirectional alternative hypothesis, you are predicting some type of difference, but you
are not predicting the specific nature, or direction, of the difference.

Directional hypotheses. In some situations it might be appropriate to use a directional
alternative hypothesis. With the type of study described above, a directional alternative
hypothesis (also known as a one-sided or one-tailed alternative hypothesis) not only
predicts that there will be a difference, but also makes a specific prediction about which
population will display the higher mean.
For example, in the present study, previous research might lead you to predict that the
population of high goal-difficulty employees will sell more insurance, on the average, than
the population of low goal-difficulty employees. If this were the case, you might state the
following directional alternative hypothesis:
Statistical alternative hypothesis: In the population, mean amount of insurance sold
by the high goal-difficulty group is greater than the mean amount of insurance sold by
the low goal-difficulty group.
This alternative hypothesis can also be stated symbolically:

   H1: µ1 > µ2

where

   µ1 represents the mean amount of insurance sold by Group 1 (the high goal-difficulty
   group) in the population

   µ2 represents the mean amount of insurance sold by Group 2 (the low goal-difficulty
   group) in the population.

Notice that the "greater than" symbol (>) is used to represent the prediction that the mean
for the high goal-difficulty population is greater than the mean for the low goal-difficulty
population.
Choosing directional versus nondirectional tests. Which type of alternative hypothesis
should you use in your research? Most statistics textbooks recommend using a
nondirectional, or two-sided, alternative hypothesis in most cases. The problem with a
directional hypothesis is that if your obtained sample means fall in the direction opposite
the one you predicted, you can fail to reject the null hypothesis even when there are very
large differences between the sample means.
For example, assume that you state the directional alternative hypothesis presented above
(i.e., "In the population, the mean amount of insurance sold by the high goal-difficulty group
is greater than the mean amount of insurance sold by the low goal-difficulty group"). Because
your alternative hypothesis is a directional hypothesis, the null hypothesis you are testing is
as follows:

   H0: µ1 ≤ µ2

which means, "In the population, the mean amount of insurance sold by the high goal-difficulty
group (Group 1) is less than or equal to the mean amount of insurance sold by the
low goal-difficulty group (Group 2)."


Clearly, to reject the null hypothesis, the high goal-difficulty group (Group 1) must display a
mean that is greater than the low goal-difficulty group (Group 2). If Group 2 displays the
higher mean, then you might not reject the null hypothesis, no matter how great that
difference might be. This presents a problem because the finding that Group 2 scored higher
than Group 1 may be of great interest to other researchers (particularly because it is not what
many would have expected). This is why, in many situations, nondirectional tests are
preferred over directional tests.
Summary
In summary, research projects often begin with a statement of a research hypothesis. This
allows you to develop a specific, testable statistical null hypothesis and an alternative
hypothesis. The analysis of your data will lead you to one of two results:

• If the results are significant, you can reject the null hypothesis and tentatively accept the
alternative hypothesis. Assuming the means are in the predicted direction, this type of
result provides some support for your initial research hypothesis.

• If the results are nonsignificant, you fail to reject the null hypothesis. This type of result
fails to provide support for your initial research hypothesis.

Data, Variables, Values, and Observations


Defining the Instrument, Gathering Data, Analyzing Data, and
Drawing Conclusions
With the null hypothesis stated, you can now test it by conducting a study in which you
gather and analyze relevant data. Data is defined as a collection of scores that are obtained
when subject characteristics and/or performance are observed and recorded. For example,
you can choose to test your hypothesis by conducting a simple correlational study: You
identify a group of 100 agents and determine

• the difficulty of the goals that have been set for each agent

• the amount of insurance sold by each.

Different types of instruments can be used to obtain different types of data. For example,
you might use a questionnaire to assess goal difficulty, but rely on company records for
measures of insurance sold. Once the data are gathered, each agent will have one score
indicating the difficulty of his or her goals, and a second score indicating the amount of
insurance he or she has sold.
You would then analyze the data to see if the agents with the more difficult goals did, in
fact, sell more insurance. If so, the study results would lend some support to your research
hypothesis; if not, the results would fail to provide support. In either case, you would be


able to draw conclusions regarding the tenability of your hypotheses, and would have made
some progress toward answering your research question. The information learned in the
current study might stimulate new questions or new hypotheses for subsequent studies, and
the cycle would repeat. For example, if you obtained support for your hypothesis with a
correlational study, you might choose to follow it up with a study using a different research
method, perhaps an experimental study (the difference between these methods will be
described below). Over time, a body of research evidence would accumulate, and
researchers would be able to review this body to draw general conclusions about the
determinants of insurance sales.
Variables, Values, and Observations
Definitions. When discussing data, one often speaks in terms of variables, values, and
observations. Further complicating matters is the fact that researchers make distinctions
between different types of variables (such as quantitative variables versus classification
variables). This section discusses the distinctions between these terms.

Variables. For the type of research discussed in this book, a variable refers to some
specific characteristic of a subject that can assume one or more different values. For the
subjects in the study described above, amount of insurance sold is an example of a
variable: Some subjects had sold a large amount of insurance, and others had sold less.
A different variable was goal difficulty: Some subjects had more difficult goals,
while others had less difficult goals. Subject age was a third variable, while subject sex
(male versus female) was yet another.

Values. A value, on the other hand, refers to either a particular subject's relative
standing on a quantitative variable, or a subject's classification within a classification
variable. For example, the amount of insurance sold is a quantitative variable that can
assume a large number of values: One agent might sell $2,000,000 worth of insurance
in one year, one might sell $100,000 worth, and another might sell $0 worth. Subject
age is another quantitative variable that can assume a wide variety of values. In the
sample studied, these values ranged from a low of 22 years to a high of 64 years.

Quantitative variables. You can see that, in both of these examples, a particular value
is a type of score that indicates where the subject stands on the variable. The word
"score" is an appropriate substitute for the word "value" in these cases because both
"amount of insurance sold" and "age" are quantitative variables: variables that
represent the quantity, or amount, of the construct that is being assessed. With
quantitative variables, numbers typically serve as values.

Classification variables. A different type of variable is a classification variable or,
alternatively, a qualitative variable or categorical variable. With classification
variables, different values represent different groups to which the subject might belong.
Sex is a good example of a classification variable, as it might assume only one of two
values: A particular subject is classified as being either a male or a female. Political
party is an example of a classification variable that can assume a larger number of
values: A subject might be classified as being a republican, a democrat, or an
independent. These variables are classification variables and not quantitative variables
because the values merely represent membership in a specific group; this membership
cannot be represented meaningfully with a numeric value.

Observational units. In discussing data, researchers often make references to


observational units, which can be defined as the individual subjects (or other objects) that
serve as the source of the data. Within the behavioral sciences and education, an
individual person usually serves as the observational unit under study (although it is also
possible to use some other entity, such as an individual school or organization, as the
observational unit). In this text, the individual person is used as the observational unit in
most examples. Researchers will often refer to the number of observations or
number of cases included in their data set, and this typically refers to the number of
subjects who were studied.

An example. For a more concrete illustration of the concepts discussed so far, consider the
data set displayed in Table 2.1:
Table 2.1
Insurance Sales Data
_____________________________________________________________________
                                      Goal
                                      difficulty   Overall
Observation   Name     Sex    Age     scores       ranking    Sales
_____________________________________________________________________
     1        Bob       M      34        97           2       $598,243
     2        Walt      M      56        80           1       $367,342
     3        Jane      F      36        67           4       $254,998
     4        Susan     F      24        40           3        $80,344
     5        Jim       M      22        37           5        $40,172
     6        Mack      M      44        24           6             $0
_____________________________________________________________________

The preceding table reports information regarding six research subjects: Bob, Walt, Jane,
Susan, Jim, and Mack; therefore, we would say that the data set includes six observations.
Information about a particular observation (subject) is displayed as a row running
horizontally from left to right across the table.
The first column of the data set (running vertically from top to bottom) is headed
Observation, and it simply provides an observation number for each subject.
The second column (headed Name) provides a name for each subject.
The remaining five columns report information about the five research variables that are
being studied.
The column headed Sex reports subject sex, which might assume one of two values: M
for male and F for female.


The column headed Age reports the subject's age in years.


The Goal Difficulty Scores column reports the subject's score on a fictitious goal
difficulty scale. In this example, each participant has a score on a 20-item questionnaire
about the difficulty of his or her work goals. Depending on how they respond to the
questionnaire, subjects receive a score ranging from a low of zero (meaning that the subject
views the work goals as extremely easy) to a high of 100 (meaning that the goals are viewed
as extremely difficult).
The column headed Overall Ranking shows how the subjects were ranked by their
supervisor according to their overall effectiveness as agents. A rank of 1 represents the
most effective agent, and a rank of 6 represents the least effective.
The column headed Sales reveals the amount of insurance sold by each agent (in dollars)
during the most recent year.
Table 2.1 provides a very small data set with six observations and five research variables
(sex, age, goal difficulty, overall ranking, and sales). One of the variables was a
classification variable (sex), while the remainder were quantitative variables. The numbers
or letters that appear within a particular column represent some of the values that could be
assumed by that variable.
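
As a concrete illustration, here is a minimal sketch of how the data in Table 2.1 might be
entered as a SAS data set. The data set name (D1) and the variable names (OBS, NAME, SEX,
AGE, GOALDIFF, RANK, and SALES) are invented for this example, and the dollar signs
and commas are omitted from the sales figures so that SAS can read them as simple numbers:

   DATA D1;
   INPUT OBS NAME $ SEX $ AGE GOALDIFF RANK SALES;
   DATALINES;
   1 Bob   M 34 97 2 598243
   2 Walt  M 56 80 1 367342
   3 Jane  F 36 67 4 254998
   4 Susan F 24 40 3  80344
   5 Jim   M 22 37 5  40172
   6 Mack  M 44 24 6      0
   ;
   PROC PRINT DATA=D1;
   RUN;

The dollar sign ($) following NAME and SEX in the INPUT statement tells SAS that these
are character variables; the PROC PRINT step simply lists the data set so that you can verify
that it was read correctly.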

Classifying Variables According to Their Scales of Measurement
Introduction
One of the most important schemes for classifying a variable involves its scale of
measurement. Researchers generally discuss four different scales of measurement:
nominal, ordinal, interval, and ratio. Before analyzing a data set, it is important to
determine which scales of measurement were used because certain types of statistical
procedures require specific scales of measurement. For example, a one-way analysis of
variance generally requires that the dependent variable be an interval-level or ratio-level
variable; the chi-square test of independence allows you to analyze nominal-level variables;
other statistics make other assumptions about the scale of measurement used with the
variables that are being studied.
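
For instance, here is a minimal sketch of how a chi-square test of independence for two
nominal-level variables might be requested in SAS. The data set name D1 and the variable
names SEX and PARTY are hypothetical, and the details of this procedure are beyond the
scope of this chapter:

   PROC FREQ DATA=D1;
   TABLES SEX*PARTY / CHISQ;
   RUN;

The CHISQ option on the TABLES statement asks PROC FREQ to compute chi-square
statistics for the crosstabulation of the two variables.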


Nominal Scales
A nominal scale is a classification system that places people, objects, or other entities into
mutually exclusive categories. A variable that is measured using a nominal scale is a
classification variable: It simply indicates the name of the group to which each subject
belongs. The examples of classification variables provided earlier (e.g., sex and political
party) also serve as examples of nominal-level variables: They tell you which group a
subject belongs to, but they do not provide any quantitative information about the subjects.
That is, the sex variable might tell you that some subjects are males and others are females,
but it does not tell you that some subjects possess more of a specific characteristic relative to
others. With the remaining three scales of measurement, however, some quantitative
information is provided.
Ordinal Scales
Values on an ordinal scale represent the rank order of the subjects with respect to the
variable that is being assessed. For example, Table 2.1 includes one variable called Overall
Ranking, which represents the rank-ordering of the subjects according to their overall
effectiveness as agents. The values on this ordinal scale represent a hierarchy of levels with
respect to the construct of effectiveness: We know that the agent ranked 1 was
perceived as being more effective than the agent ranked 2, that the agent ranked 2 was
more effective than the one ranked 3, and so forth.
However, an ordinal scale has a serious limitation in that equal differences in scale values
do not necessarily have equal quantitative meaning. For example, notice the rankings
reproduced here:
Overall
ranking     Name
_______     _______
   1        Walt
   2        Bob
   3        Susan
   4        Jane
   5        Jim
   6        Mack

Notice that Walt was ranked #1 while Bob was ranked #2. The difference between these
two rankings is 1 (because 2 - 1 = 1), so we might say that there is one unit of difference
between Walt and Bob. Now notice that Jim was ranked #5 while Mack was ranked #6.
The difference between these two rankings is also 1 (because 6 - 5 = 1), so we might say
that there is also 1 unit of difference between Jim and Mack. Putting the two together, we
can see that the difference in ranking between Walt and Bob is equal to the difference in
ranking between Jim and Mack. But does this mean that the difference in overall
effectiveness between Walt and Bob is equal to the difference in overall effectiveness
between Jim and Mack? Not necessarily. It is possible that Walt was just barely superior to


Bob in effectiveness, while Jim was substantially superior to Mack. These rankings tell us
very little about the quantitative differences between the subjects with regard to the
underlying construct (effectiveness, in this case). An ordinal scale simply provides a rank
order of who is better than whom.
Interval Scales
With an interval scale, equal differences between scale values do have equal quantitative
meaning. For this reason, you can see that the interval scale provides more quantitative
information than the ordinal scale. A good example of an interval scale is the Fahrenheit
scale used to measure temperature. With the Fahrenheit scale, the difference between 70
degrees and 75 degrees is equal to the difference between 80 degrees and 85 degrees: the
units of measurement are equal throughout the full range of the scale.
However, the interval scale also has an important limitation: it does not have a true zero
point. A true zero point means that a value of zero on the scale represents zero quantity of
the construct being assessed. It should be obvious that the Fahrenheit scale does not have a
true zero point. When the thermometer reads zero degrees, that does not mean that there is
absolutely no heat present in the environment; it is still possible for the temperature to go
lower (into the negative numbers).
Researchers in the social sciences often assume that many of their man-made variables are
measured on an interval scale. For example, in the preceding study involving insurance
agents, you would probably assume that scores from the goal difficulty questionnaire
constitute an interval-level scale; that is, you would likely assume that the difference
between a score of 50 and 60 is approximately equal to the difference between a score of 70
and 80. Many researchers would also assume that scores from an instrument such as an
intelligence test are also measured at the interval level of measurement.
On the other hand, some researchers are skeptical that instruments such as these have true
equal-interval properties, and prefer to refer to them as quasi-interval scales.
Disagreements concerning the level of measurement achieved with such paper-and-pencil
instruments continue to be a controversial topic within many disciplines.
In any case, it is clear that there is no true zero point with either of the preceding
instruments: a score of zero on the goal difficulty scale does not indicate the complete
absence of goal difficulty, and a score of zero on an intelligence test does not indicate the
complete absence of intelligence. A true zero point can be found only with variables
measured on a ratio scale.


Ratio Scales
Ratio scales are similar to interval scales in that equal differences between scale values do
have equal quantitative meaning. However, ratio scales also have a true zero point, which
gives them an additional property: with ratio scales, it is possible to make meaningful
statements about the ratios between scale values.
For example, the system of inches used with a common ruler is an example of a ratio scale.
There is a true zero point with this system, in that zero inches does in fact indicate a
complete absence of length. With this scale, it is possible to make meaningful statements
about ratios. It is appropriate to say that an object four inches long is twice as long as an
object two inches long. Age, as measured in years, is also on a ratio scale: a 10-year-old
house is twice as old as a 5-year-old house. Notice that it is not possible to make these
statements about ratios with the interval-level variables discussed above. One would not say
that a person with an IQ of 160 is twice as intelligent as a person with an IQ of 80, as there
is no true zero point with that scale.
Although ratio-level scales are most commonly used for reporting the physical properties of
objects (e.g., height, weight), they are also common in the type of research that is discussed
in this manual. For example, the study discussed above included the variables age and
amount of insurance sold (in dollars). Both of these have true zero points, and are
measured as ratio scales.

Classifying Variables According to the Number of Values They Display
Overview
The preceding section showed that variables can be classified according to their scale of
measurement. Sometimes it is also useful to classify variables according to the number of
values they display. There might be any number of approaches for doing this, but this guide
uses a simple division of variables into three groups according to the number of possible
values: dichotomous variables, limited-value variables, and multi-value variables.
Dichotomous Variables
A dichotomous variable is a variable that assumes just two values. These variables are
sometimes called binary variables. Here are some examples of dichotomous variables:

• Suppose that you obtain Smith Anxiety Test scores from 50 male subjects and 50 female
subjects. In this study, subject sex is a dichotomous variable, because it can assume
just two values, male versus female.

• Suppose that you conduct an experiment to determine whether the herbal supplement
ginkgo biloba causes improvement in a rat's ability to learn. You begin with 20 rats, and
randomly assign them to two groups. Ten rats are assigned to the 100 mg group (they
receive 100 mg of ginkgo), and the other ten rats are assigned to the 0 mg group (they
receive no ginkgo). In this study, the independent variable that you are manipulating is
"amount of ginkgo administered." This is a dichotomous variable because it assumes
just two values: 0 mg versus 100 mg.

Limited-Value Variables
A limited-value variable is a variable that assumes just two to six values in your sample.
Here are some examples of limited-value variables:

• Suppose that you obtain Smith Anxiety Test scores from 50 Caucasian subjects, 50
African-American subjects, and 50 Asian-American subjects. In this study, subject
race is a limited-value variable because it assumes just three values: Caucasian
versus African-American versus Asian-American.

• Suppose that you again conduct an experiment to determine whether ginkgo biloba
causes improvements in a rat's ability to learn. You begin with 100 rats, and randomly
assign them to four groups: Twenty-five rats are assigned to the 150 mg group, 25 rats
are assigned to the 100 mg group, 25 rats are assigned to the 50 mg group, and 25 rats
are assigned to the 0 mg group. In this study, the independent variable that you are
manipulating is still "amount of ginkgo administered." You know that this is a
limited-value variable because it assumes just four values: 0 mg versus 50 mg versus
100 mg versus 150 mg.

Multi-Value Variables
Finally, this book defines a multi-value variable as a variable that assumes more than six
values in your sample. Here are some examples of multi-value variables:

• Assume that you obtain Smith Anxiety Test scores from 100 subjects. With the Smith
Anxiety Test, scores (values) may range from 0 to 99, with higher scores indicating
greater anxiety. In analyzing the data, you see that your subjects displayed a wide
variety of scores, for example:

   One subject received a score of 2.
   One subject received a score of 5.
   Two subjects received a score of 10.
   Five subjects received a score of 21.
   Seven subjects received a score of 33.
   Eight subjects received a score of 45.
   Nine subjects received a score of 53.
   Seven subjects received a score of 68.
   Six subjects received a score of 72.
   Six subjects received a score of 81.
   One subject received a score of 89.
   One subject received a score of 91.

Other subjects received yet other scores. Clearly, scores on the Smith Anxiety Test
constitute a multi-value variable in your sample because your subjects displayed more
than six values on this variable.

• Assume that, in the ginkgo biloba study just described, you assess your dependent
variable (learning) in the rats by having them work at a maze-solving task. First, you
teach each rat that, if it can correctly find its way through a maze, it will be rewarded
with food at the end. You then allow each rat to try to find its way through a series of
mazes. Each rat is allowed 30 trials, that is, 30 opportunities to get through a maze. Your
measure of learning, therefore, is the number of mazes that each rat correctly negotiates.
This score can range from zero (if the rat is not successful on any of the trials) to 30 (if
the rat is successful on all of the trials). A rat also can score anywhere in between these
extremes. In analyzing the data, you find that the rats displayed a wide variety of scores
on this "successful trials" dependent variable, for example:

   One rat displayed zero successful trials.
   Two rats displayed three successful trials.
   Three rats displayed eight successful trials.
   Four rats displayed 10 successful trials.
   Five rats displayed 14 successful trials.
   Six rats displayed 15 successful trials.
   Six rats displayed 19 successful trials.
   Two rats displayed 21 successful trials.
   One rat displayed 27 successful trials.
   One rat displayed 28 successful trials.

Other rats displayed yet other scores. Clearly, scores on the "successful trials" variable
constitute a multi-value variable in your sample because the rats displayed more than six
values on this variable.
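
If you want SAS to show you how many distinct values a variable actually displays in your
sample, one simple approach is to request a frequency table for it. Here is a minimal sketch,
assuming a data set named D1 that contains a numeric variable named TRIALS (both names
are hypothetical); each row of the resulting table corresponds to one observed value:

   PROC FREQ DATA=D1;
   TABLES TRIALS;
   RUN;

Counting the rows of the resulting frequency table tells you whether the variable is
dichotomous (two values), a limited-value variable (two to six values), or a multi-value
variable (more than six values) in your sample.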

Basic Approaches to Research


Nonexperimental Research
Naturally-occurring variables. Much research can be described as being either
nonexperimental or experimental in nature. In nonexperimental research (also called
correlational, nonmanipulative, or observational research), the researcher simply studies
the naturally-occurring relationship between two or more naturally-occurring variables. A
naturally-occurring variable is a variable that is not manipulated or controlled by the
researcher; it is simply measured as it normally exists.
The insurance study described previously is a good example of nonexperimental research, in
that you simply measured two naturally-occurring variables (goal difficulty and amount of
insurance sold) to determine whether they were related. If, in a different study, you
investigated the relationship between IQ and college grade point average (GPA), this would
also be an example of nonexperimental research.
Criterion versus predictor variables. With nonexperimental designs, researchers often
refer to criterion variables and predictor variables. A criterion variable is an outcome
variable that can be predicted from one or more predictor variables. The criterion variable is
often the main focus of the study in that it is the outcome variable mentioned in the
statement of the research problem. With our insurance example, the criterion variable is the
amount of insurance sold.
The predictor variable, on the other hand, is the variable that is used to predict values on
the criterion. In some studies, you might even believe that the predictor variable has a
causal effect on the criterion. In the insurance study, for example, the predictor variable was
goal difficulty. Because you believed that goal difficulty can positively affect insurance
sales, you conducted a study in which goal difficulty was the predictor and insurance sales
was the criterion. You do not necessarily have to believe that there is a causal relationship
between two variables to conduct a study such as this, however; you might simply be
interested in determining whether it is possible to predict one variable from the other.
Cause-and-effect relationships. It should be noted here that nonexperimental research that
investigates the relationship between just two variables generally provides very weak
evidence concerning cause-and-effect relationships. The reasons for this can be seen by
reviewing our study on insurance sales. If the psychologist conducts this study and finds
that the agents with the more difficult goals also tend to sell more insurance, does that mean
that having difficult goals caused them to sell more insurance? Not necessarily. It can also
be argued that selling a lot of insurance increases the agents' self-confidence, and that this
causes them to set higher work goals for themselves. Under this second scenario, it was
actually the insurance sales that had a causal effect on goal difficulty.
As this example shows, with nonexperimental research it is often possible to obtain a single
finding that is consistent with a number of different, contradictory causal explanations.
Hence, a strong inference that variable A had a causal effect on variable B is generally not
possible when one conducts simple correlational research with just two variables. To obtain
stronger evidence of cause and effect, researchers generally either analyze the relationships
among a larger number of variables using sophisticated statistical procedures that are
beyond the scope of this text (such as structural equation modeling), or drop the
nonexperimental approach entirely and instead use experimental research methods. The
nature of experimental research is discussed in the following section.


Experimental Research
General characteristics. Most experimental research can be identified by three important
characteristics:

• subjects are randomly assigned to experimental conditions

• the researcher manipulates an independent variable

• subjects in different experimental conditions are treated similarly with regard to all
variables except the independent variable.

To illustrate these concepts, let's describe a possible experiment in which you test the
hypothesis that goal difficulty positively affects insurance sales. First you identify a group
of 100 agents who will serve as subjects. Then you randomly assign 50 agents to a
"difficult goal" condition. Subjects in this group are told by their superiors to make at least
25 cold calls (sales calls) to potential policyholders per week. Assume that this is a
relatively difficult goal. The other 50 agents have been randomly assigned to the "easy goal"
condition. They have been told to make just 5 cold calls to potential policyholders per
week. To the extent possible, you see to it that agents in both groups are treated similarly
with respect to everything except for the difficulty of the goals that are set for them.
After one year, you determine how much new insurance each agent has sold that year. You
find that the average agent in the difficult goal condition sold new policies totaling
$156,000, while the average agent in the easy goal condition sold policies totaling only
$121,000.
Independent versus dependent variables. It is possible to use some of the terminology
associated with nonexperimental research when discussing this experiment. For example, it
would be appropriate to continue to refer to the amount of insurance sold as being a criterion
variable because this is the outcome variable of central interest. You also could continue to
refer to goal difficulty as the predictor variable because you believe that this variable will
predict sales to some extent.
Notice that goal difficulty is now a somewhat different variable, however. In the
nonexperimental study, goal difficulty was a naturally-occurring variable that could take on
a wide variety of values (whatever score the subject received on the goal difficulty
questionnaire). In the present experiment, however, goal difficulty is a manipulated
variable, which means that you (as the researcher) determined what value of the variable
would be assigned to each subject. In the experiment, the goal difficulty variable could
assume only one of two values: Subjects were either in the difficult goal group or the easy
goal group. Therefore, goal difficulty is now a classification variable that codes group
membership.
Although it is acceptable to speak of predictor and criterion variables within the context of
experimental research, it is more common to speak in terms of independent variables and
dependent variables. The independent variable is that variable whose values (or levels) are


selected by the experimenter to determine what effect the independent variable has on the
dependent variable. The independent variable is the experimental counterpart to a predictor
variable. A dependent variable, on the other hand, is some aspect of the subject's behavior
that is assessed to determine whether it has been affected by the independent variable. The
dependent variable is the experimental counterpart to a criterion variable. In the present
example experiment, goal difficulty is the independent variable, and the amount of
insurance sold is the dependent variable.
Remember that the terms predictor variable and criterion variable can be used with
almost any type of research, experimental or nonexperimental. However, the terms
independent variable and dependent variable should be used only with experimental
research; that is, research conducted under controlled conditions with a true manipulated
independent variable.
Levels of the independent variable. Researchers often speak in terms of the different
levels of the independent variable. These levels are also referred to as experimental
conditions or treatment conditions, and correspond to the different groups to which a subject
might be assigned. The present example included two experimental conditions: a difficult
goal condition, and an easy goal condition.
With respect to the independent variable, it is common to speak of the experimental group
versus the control group. Generally speaking, the experimental group is the group that
receives the experimental treatment of interest, while the control group is an equivalent
group of subjects that does not receive this treatment. The simplest type of experiment
consists of one experimental group and one control group. For example, the present study
could have been redesigned so that it simply consisted of an experimental group that was
assigned the goal of making 25 cold calls (the difficult goal condition), as well as a control
group in which no goals were assigned (the no-goal condition). Obviously, it is possible to
expand the study by creating more than one experimental group. This could be
accomplished in the present case by assigning one experimental group the difficult goal of
25 cold calls and the second experimental group the easy goal of 5 cold calls. The control
group could still be assigned zero goals.

Using Type-of-Variable Figures to Represent Dependent and Independent Variables
Overview
Many studies in the social sciences and education are designed to investigate the
relationship between just two variables. In an experiment, researchers generally refer to
these as the independent and dependent variables; in a nonexperimental study, researchers
often call them the predictor and criterion variables.


Some chapters in this guide will describe studies in which a researcher investigates the
relationship between predictor and criterion variables. To help you better visualize the
nature of the variables being analyzed, most of these chapters will provide a type-of-variable
figure: a figure that graphically illustrates the number of values that are assumed
by the two variables in the study.
This section begins by presenting the symbols that will represent three types of variables:
dichotomous variables, limited-value variables, and multi-value variables. It then provides a
few examples of the type-of-variable figures that you will see in subsequent chapters of this
book.
Figures to Represent Types of Variables
Dichotomous variables. A dichotomous variable is one that assumes just two values. For
example, the variable "sex" is a dichotomous variable that can assume just the values of
"male" versus "female."

Below is the type-of-variable symbol that will represent a dichotomous variable:

   [Figure: two boxes, each labeled "Di"]

The "Di" that appears inside the boxes is an abbreviation for "Dichotomous." The figure
includes two boxes to help you remember that a dichotomous variable is one that assumes
only two values.
Limited-value variables. A limited-value variable is one that assumes only two to six
values. For example, the variable political party would be a limited-value variable if it
assumed only the values of democrat versus republican versus independent.
Below is the type-of-variable symbol that will represent a limited-value variable:

   [Figure: three boxes, each labeled "Lmt"]

The "Lmt" that appears inside the boxes is an abbreviation for "Limited." The figure
includes three boxes to remind you that a limited-value variable is one that can have only
two to six values.
Multi-value variables. A multi-value variable is one that assumes more than six values. For
example, if you administered an IQ test to a sample of 300 subjects, then IQ scores would
be a multi-value variable if more than six different IQ scores appeared in your sample.
Below is the type-of-variable symbol that will represent a multi-value variable:

   [Figure: seven boxes, each labeled "Multi"]

This figure consists of seven boxes to help you remember that a multi-value variable is one
that assumes more than six values in your sample.


Using Figures to Represent the Types of Variables Assessed in a Specific Study
As was stated earlier, when a study is a true experiment, the two variables that are being
investigated are typically referred to as a dependent variable and an independent variable. It
is possible to construct a type-of-variable figure that illustrates the nature of the dependent
variable, as well as the nature of the independent variable, in a single figure.
The research hypothesis. For example, earlier this chapter developed the research
hypothesis that goal difficulty will have a positive causal effect on the amount of insurance
sold by insurance agents. This hypothesis was illustrated by the causal figure presented in
Figure 2.1. That figure is again reproduced here as Figure 2.2. Notice that, in this figure, the
dependent variable (amount of insurance sold) appears on the left, and the independent
variable (goal difficulty) appears on the right.

Figure 2.2. Predicted causal relationship between goal difficulty (the independent variable) and amount of insurance sold (the dependent variable).

An experiment with two conditions. In this example, you conduct a simple experiment to
investigate this research hypothesis. You begin with 100 insurance agents, and randomly
assign each agent to either an experimental group or a control group. The 50 agents in the
experimental group (the difficult-goal condition) are told to make 25 cold calls each week.
The 50 agents in the control group (the easy-goal condition) are told to make 5 cold calls
each week. After one year, you measure your dependent variable: The amount of insurance
(in dollars) sold by each agent. When you review the data, you find that the agents displayed
a wide variety of scores on this dependent variable: some agents sold $0 worth of insurance,
some agents sold $5,000,000 worth of insurance, and most sold somewhere in between these
two extremes. As a group, they displayed far more than six values on this dependent
variable.
The type-of-variable figure for the preceding study is shown below:

   Multi = Di

When illustrating an experiment with a type-of-variable figure, this guide will use the
convention of placing the symbol for the dependent variable on the left side of the equals
sign (=), and placing the symbol for the independent variable on the right side of the equals
sign. You can see that this convention was followed in the preceding figure: the word
"Multi" on the left of the equals sign represents the fact that the dependent variable in your
study (amount of insurance sold) was a multi-value variable. You knew this, because the
agents displayed more than six values on this variable. In addition, the letters "Di" on the
right side of the equals sign represent the fact that the independent variable (goal difficulty)
was a dichotomous variable. You knew this, because this independent variable consisted of
just two values (conditions): a difficult-goal condition and an easy-goal condition.
Because the dependent variable is on the left and the independent variable is on the right, the
preceding type-of-variable figure is similar to Figure 2.2, which illustrated the research
hypothesis. In that figure, the dependent variable was also on the left, and the independent
variable was also on the right.
The preceding type-of-variable figure could be used to illustrate any experiment in which
the dependent variable was a multi-value variable and the independent variable was a
dichotomous variable. In Chapter 13, "Independent-Samples t Test," you will learn that data
from this type of experiment are often analyzed using a statistical procedure called a t test.
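
As a rough preview of that analysis, the sketch below shows how an independent-samples
t test might be requested in SAS. The data set name D1 and the variable names GROUP
(the dichotomous independent variable) and SALES (the multi-value dependent variable)
are hypothetical; Chapter 13 explains the procedure and its assumptions:

   PROC TTEST DATA=D1;
   CLASS GROUP;
   VAR SALES;
   RUN;

The CLASS statement names the variable that defines the two groups, and the VAR
statement names the dependent variable whose means are compared.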
A warning about statistical assumptions. Please note that, when you are deciding whether
it is appropriate to analyze a data set with a t test, it is not sufficient to simply verify that the
dependent variable is a multi-value variable and that the independent variable is a
dichotomous variable. There are many statistical assumptions that must be satisfied for a t
test to be appropriate, and those assumptions will not be discussed in this chapter (they will
be discussed in the chapters on t tests). The type-of-variable figure was presented above to
help you visualize the type of situation in which a t test is often performed. Each chapter of
this guide that discusses an inferential statistical procedure (such as a t test) also will
describe the assumptions that must be met in order for the test to be valid.
An experiment with three conditions. Now let's modify the experiment somewhat, and
observe how it changes the type-of-variable figure. Assume that you now have 150 subjects,
and your independent variable now consists of three conditions, rather than just two:

• The 50 agents in experimental group #1 (the difficult-goal condition) are told to make
25 cold calls each week.

• The 50 agents in experimental group #2 (the easy-goal condition) are told to make 5
cold calls each week.

• The 50 agents in the control group (the no-goal condition) are not given any specific
goals about the number of cold calls to make each week.

Assume that everything else about the study remains the same. That is, you use the same
dependent variable, the number of values observed on the dependent variable still exceeds
six, and so forth. If this were the case, you would illustrate this revised study with the
following figure:

   Multi = Lmt

Notice that "Multi" still appears to the left of the equals sign because your dependent
variable has not changed. However, "Lmt" now appears to the right of the equals sign
because the independent variable now has three values rather than two. This means that the
independent variable is now a limited-value variable, not a dichotomous variable.
The preceding figure could be used to illustrate any experiment in which the dependent
variable was a multi-value variable, and the independent variable was a limited-value
variable. In Chapter 15, "One-Way ANOVA with One Between-Subjects Factor," you will
learn that data from this type of experiment can be analyzed using a statistical procedure
called a one-way ANOVA (assuming that other assumptions are met; those assumptions will
be discussed in Chapter 15).
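
As with the t test above, here is a minimal sketch of how such a one-way ANOVA might be
requested, again using the hypothetical names D1, GROUP, and SALES; GROUP now has
three levels rather than two:

   PROC GLM DATA=D1;
   CLASS GROUP;
   MODEL SALES = GROUP;
   RUN;

Notice that the MODEL statement places the dependent variable on the left of its equals
sign and the independent variable on the right, just as the type-of-variable figures do.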
A correlational study. Finally, let's modify the experiment one more time, and observe
how it changes the type-of-variable figure. This time you are interested in the same research
hypothesis, but you are doing a nonexperimental study rather than an experiment. In this
study, you will not manipulate an independent variable. Instead, you will simply measure
two naturally occurring variables and will determine whether they are correlated in a sample
of 200 insurance agents. The two variables are

• Goal difficulty. Each agent completes a scale that assesses the difficulty of the goals
that the agent sets for himself/herself. With this scale, scores can range from 0 to 99,
with higher scores representing more difficult goals. When you analyze the data, you
find that this variable displays more than six values in this sample (i.e., you find that the
agents get a wide variety of different scores).

• Amount of insurance sold. For each agent, you review records to determine how much
insurance the agent has sold during the previous year. Assume that this variable also
displays more than six observed values in your sample.

By analyzing your data, you want to determine whether there is a significant correlation
between goal difficulty and the amount of insurance sold. You hope to find that agents who
had high scores on goal difficulty also tended to have high scores on insurance sold.
Because this is nonexperimental research, it is not appropriate to speak in terms of an
independent variable and a dependent variable. Instead, you will refer to goal difficulty as
a predictor variable, and insurance sold as a criterion variable. When preparing a
type-of-variable figure for this type of study, the criterion variable should appear to the left of the
equals sign, and the predictor variable should appear to the right.


The correlational study that was described above can be represented with the following
type-of-variable figure:

   Multi = Multi

The "Multi" appearing to the left of the equals sign represents the criterion variable in your
study: amount of insurance sold. You knew that it was a multi-value variable, because it
displayed more than six values in your sample. The "Multi" appearing on the right of the
equals sign represents the predictor variable in your study: scores on the goal difficulty
scale.
The preceding figure could be used to illustrate any correlational study in which the
criterion variable and predictor variable were both multi-value variables. In Chapter 10 of
this guide ("Bivariate Correlation"), you will learn that data from this type of study are often
analyzed by computing a Pearson correlation coefficient (assuming that other assumptions
are met).
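
Here, too, a minimal sketch may help you visualize the analysis. Assuming a hypothetical
data set D1 with numeric variables GOALDIFF and SALES, a Pearson correlation could be
requested as follows (Chapter 10 covers the details):

   PROC CORR DATA=D1;
   VAR GOALDIFF SALES;
   RUN;

The VAR statement lists the variables to be correlated; by default, PROC CORR computes
Pearson correlations for each pair of variables listed.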

The Three Types of SAS Files


Overview
The purpose of this section is to provide a very general overview of the procedure that you
will follow when you submit a SAS program and then interpret the results. To do this, the
current section will present a short SAS program and briefly describe the output that it
creates.
Generally speaking, you will work with three types of files when you use the SAS System:
one file will contain the SAS program, one will contain the SAS log, and one will contain
the SAS output. The differences between these three types of files are discussed next.
The SAS Program
A SAS program consists of a set of statements written by the user. These statements
provide the SAS System with the data to be analyzed, tell SAS about the nature of the data,
and indicate which statistical analyses should be performed on the data. These statements
are usually typed as lines in a file in the computer's memory.
Some fictitious data. This section will illustrate a simple SAS program that analyzes some
fictitious data. Suppose that you have administered two tests (Test 1 and Test 2) to a group
of eight people. Scores on a particular test can range from 0 to 9. Table 2.2 presents the
scores that the eight subjects earned on Test 1 and Test 2.



Table 2.2
Scores Earned on Test 1 and Test 2
____________________________
Subject      Test 1   Test 2
____________________________
Marsha          2        3
Charles         2        2
Jack            3        3
Cathy           3        4
Emmett          4        3
Marie           4        4
Cindy           5        3
Susan           5        4
____________________________

The way that the information is arranged in Table 2.2 is representative of the way that
information is arranged in most SAS data sets. Each vertical column (running from the top
to the bottom) provides information about a different variable. The headings in Table 2.2 tell
us that:

• The first column provides information about the Subject variable: It provides the first
name for each subject.

• The second column provides information about the Test 1 variable: It provides each
subject's score on Test 1.

• The third column provides information about the Test 2 variable: It provides each
subject's score on Test 2.

In contrast, each horizontal row in the table (running from left to right) provides
information about a different subject. For example,

• The first row provides information about the subject named Marsha. Where the row for
Marsha intersects with the column headed Test 1, you can see that she obtained a score
of 2 on Test 1. Where the row for Marsha intersects with the column headed Test 2,
you can see that she obtained a score of 3 on Test 2.

• The second row provides information about the subject named Charles. Where the row
for Charles intersects with the column headed Test 1, you can see that he obtained a
score of 2 on Test 1. Where the row for Charles intersects with the column headed
Test 2, you can see that he also obtained a score of 2 on Test 2.

The rows for the remaining subjects can be interpreted in the same way.


The SAS program. Suppose that you now want to analyze subject scores on the two tests.
Specifically, you want to compute the means and some other descriptive statistics for Test 1
and Test 2.
Following is a complete SAS program that enters the data presented in Table 2.2. It also
computes means and some other descriptive statistics for Test 1 and Test 2.
OPTIONS LS=80 PS=60;
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
2 3
2 2
3 3
3 4
4 3
4 4
5 3
5 4
;
PROC MEANS DATA=D1;
TITLE1 'JANE DOE';
RUN;
It will be easier to refer to the different components of the preceding program if we assign
line numbers to each line. We will then be able to use these line numbers to refer to specific
statements. Therefore, the program is reproduced again below, this time with line numbers
added. (Remember that you would not actually type these line numbers if you were writing a
program to be analyzed by the SAS System; the line numbers should already appear on
your computer screen if you use the SAS windowing environment and follow the directions
provided in Chapter 3 of this guide.)
 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3   INPUT TEST1 TEST2;
 4   DATALINES;
 5   2 3
 6   2 2
 7   3 3
 8   3 4
 9   4 3
10   4 4
11   5 3
12   5 4
13   ;
14   PROC MEANS DATA=D1;
15   TITLE1 'JANE DOE';
16   RUN;


This chapter does not discuss SAS programming statements in detail. However, the
preceding program will make more sense to you if the functions of its various parts are
briefly explained:

• Line 1 of the preceding program contains the OPTIONS statement. This is a global
statement that can be used to modify how the SAS System operates. In this example, the
OPTIONS statement is used to specify how large each page of SAS output should be
when it is printed.

• Line 2 contains the DATA statement. You use this statement to start the DATA step
(explained below) and assign a name to the data set that you are creating.

• Line 3 contains the INPUT statement. You use this statement to assign names to the
variables that SAS will work with.

• Line 4 contains the DATALINES statement. This statement tells SAS that the data lines
will begin with the next line of the program.

• Lines 5–12 are the data lines that will be read by SAS. You can see that these data lines
were taken directly from Table 2.2: line 5 contains scores on Test 1 and Test 2 from
Marsha; line 6 contains scores on Test 1 and Test 2 from Charles, and so on. There are
eight data lines because there were eight subjects. Obviously, the subjects' names have
not been included as part of the data set (although they can be included, if you choose;
see the sketch that follows this list).

• Line 13 is the null statement. It is very short, consisting of a single semicolon. This
null statement tells SAS that the data lines have ended.

• Line 14 contains the PROC MEANS statement. It tells SAS to compute means and other
descriptive statistics for all numeric variables in the data set.

• Line 15 contains the TITLE1 statement. You use this statement to assign a title, or
heading, that will appear on each page of output. Here, the title will be "JANE DOE".

• Line 16 contains the RUN statement, which signals the end of the program.
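
As noted above, the subjects' names can be included in the data set if you choose. Here is a
minimal sketch of how that might look; the data set name D2 and the variable name NAME
are hypothetical (not part of the original program), and the $ after NAME tells SAS to treat
it as a character variable rather than a numeric one:

DATA D2;
   INPUT NAME $ TEST1 TEST2;   /* the $ marks NAME as a character variable */
DATALINES;
MARSHA  2 3
CHARLES 2 2
;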

Subsequent chapters will discuss the use of the preceding statements in much more detail.
What is the single most common programming error? For new SAS users, the most
common programming error usually involves omitting a required semicolon (;). Remember
that every SAS statement must end with a semicolon (in the preceding program, notice that
the DATA statement ends with a semicolon, as does the INPUT statement and the PROC
MEANS statement). When you obtain an error in running a SAS program, one of the first
things that you should do is inspect the program for missing semicolons.
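
To make this concrete, here is a hedged example of the error. The program below is
identical to the earlier one (with some data lines omitted), except that the semicolon has
been left off the end of the INPUT statement, so SAS runs the DATALINES statement into
the INPUT statement and the program fails:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT TEST1 TEST2     /* ERROR: the required semicolon is missing here */
DATALINES;
2 3
2 2
;
PROC MEANS DATA=D1;
RUN;

Restoring the semicolon at the end of the INPUT statement fixes the program.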


The DATA step versus the PROC step. There is another, more fundamental way to divide
a SAS program into its constituent components: it is possible to think of each SAS program
as consisting of a DATA step and a PROC step. Below, we show how the preceding
program can be divided in this way:

DATA step:

   OPTIONS LS=80 PS=60;
   DATA D1;
      INPUT TEST1 TEST2;
   DATALINES;
   2 3
   2 2
   3 3
   3 4
   4 3
   4 4
   5 3
   5 4
   ;

PROC step:

   PROC MEANS DATA=D1;
      TITLE1 'JANE DOE';
   RUN;

The differences between these steps are described below.

In the DATA step, programming statements create and/or modify a SAS data set. Among
other things, statements in the DATA step may

• assign a name to the data set

• assign names to the variables to be included in the data set

• provide the actual data to be analyzed

• recode existing variables

• create new variables from existing variables (a brief sketch of the last two activities
follows this list).
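
As a brief sketch of the last two activities (the data set name D3 and the variable names
TOTAL and GROUP are hypothetical, not part of the preceding program), a DATA step
might create a new variable and recode an existing one like this:

DATA D3;
   INPUT TEST1 TEST2;
   TOTAL = TEST1 + TEST2;          /* create a new variable from TEST1 and TEST2 */
   IF TEST1 GE 4 THEN GROUP = 2;   /* recode TEST1 into a new                    */
   ELSE GROUP = 1;                 /* two-level grouping variable                */
DATALINES;
2 3
5 4
;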

In contrast to the DATA step, the PROC step includes statements that request specific
statistical analyses of the data. For example, the PROC step might request that correlations
be computed for all pairs of numeric variables, or might request that a t test be performed.
In the preceding example, the PROC step requested that means be computed.
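
As a hedged illustration (these statements are not part of the preceding program), PROC
steps requesting those analyses might look like the following; the PAIRED statement shown
assumes that a paired-samples t test comparing TEST1 and TEST2 is wanted:

PROC CORR DATA=D1;      /* correlations for all pairs of numeric variables */
RUN;

PROC TTEST DATA=D1;     /* paired-samples t test comparing TEST1 and TEST2 */
   PAIRED TEST1*TEST2;
RUN;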
What text editor will I use to write my SAS program? An editor is a computer application
that allows you to create lines of text, such as the lines that constitute a SAS program. If you
are working on a mainframe or midrange computer system, you might have a variety of
editors that can be used to write your SAS programs; just ask the staff at your computer
facility.


For many users, it is best to use the SAS windowing environment to write SAS programs.
The SAS windowing environment is an integrated application that allows users to create
and edit SAS programs, submit them for interactive analysis, view the results on their
screens, manage files, and perform other activities. This application is available at most
locations where the SAS System is installed (including personal computers). Chapter 3 of
this guide provides a tutorial that shows you how to use the SAS windowing environment.
After submitting the SAS program. Once the preceding program has been submitted for
analysis, SAS will create two types of files reporting the results of the analysis. One file is
called the SAS log or log file, and the other file is the SAS output file. The following
sections explain the purpose of these files.
The SAS Log
The SAS log is generated by SAS after you submit your program. It is a summary of notes
and messages generated by SAS as your program executes. These notes and messages will
help you verify that your SAS program ran correctly. Specifically, the SAS log provides

• a reprinting of the SAS program that was submitted (minus the data lines)

• a listing of notes indicating how many variables and observations are contained in the
data set

• a listing of any notes, warnings, or error messages generated during the execution of the
SAS program.


Log 2.1 provides a reproduction of the SAS log generated for the preceding program:
NOTE: SAS initialization used:
      real time           14.54 seconds

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3       INPUT TEST1 TEST2;
4    DATALINES;

NOTE: The data set WORK.D1 has 8 observations and 2 variables.
NOTE: DATA statement used:
      real time           1.59 seconds

13   ;
14   PROC MEANS DATA=D1;
15      TITLE1 'JANE DOE';
16   RUN;

NOTE: There were 8 observations read from the data set WORK.D1.
NOTE: PROCEDURE MEANS used:
      real time           1.64 seconds

Log 2.1. Log file created by the current SAS program.

Notice that the statements constituting the SAS program have been assigned line numbers,
which are reproduced in the SAS log. The data lines are not normally reproduced as part of
the SAS log unless they are specifically requested.
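
If you did want the data lines echoed in the log, one approach (a sketch, not something the
preceding program does) is to add a LIST statement to the DATA step, which writes each
input data line to the SAS log as it is read:

DATA D1;
   INPUT TEST1 TEST2;
   LIST;                /* echo each input data line to the SAS log */
DATALINES;
2 3
2 2
;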
About halfway down the log, the following note appears:
NOTE: The data set WORK.D1 has 8 observations and 2 variables.

This note indicates that the data set that you created (named D1) contains 8 observations and
2 variables. You would normally check this note to verify that the data set contains all of
the variables that you intended to input (in this case 2), and that it contains data from all of
your subjects (in this case 8). So far, everything appears to be correct.
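
If you want a fuller check than this note provides, one option (again a sketch, not something
the preceding program requests) is to run PROC CONTENTS, which lists the variables in a
data set along with the number of observations:

PROC CONTENTS DATA=D1;   /* lists the variables in D1 and the number of observations */
RUN;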
If you had made any errors in writing the SAS program, there would also have been ERROR
messages in the SAS log. Often, these error messages provide you with some help in
determining what was wrong with the program. For example, a message can indicate that
SAS was expecting a program statement that was not included. Chapter 3, "Tutorial:
Writing and Submitting SAS Programs," will discuss error messages in more detail, and
will provide you with some practice in debugging a program with an error.
Once the error or errors have been identified, you must revise the original SAS program and
resubmit it for analysis. After processing is complete, again review the new SAS log to see
if the errors have been eliminated. If the log indicates that the program ran correctly, you
are free to review the results of the analyses in the SAS output file.
Very often you will submit a SAS program and, after a few seconds, the SAS output
window will appear on your computer screen. Some users mistakenly assume that this
means that their program ran without errors. That is not necessarily the case: some parts of
your program may have run correctly while other parts had errors. The only way to be sure
is to carefully review all of the SAS log before reviewing the SAS output. Chapter 3 will
lead you through these steps.
The SAS Output File
The SAS output file contains the results of the statistical analyses requested in the SAS
program. An output file is sometimes called a "listing file," because it contains a listing of
the results of the analyses that were requested.

Because the program above requested the MEANS procedure, the output file produced by
this program will contain means, standard deviations, and some other descriptive statistics
for the two variables. Output 2.1 presents the SAS output that would be produced by the
preceding SAS program.
                                JANE DOE

                          The MEANS Procedure

Variable   N            Mean         Std Dev         Minimum         Maximum
-----------------------------------------------------------------------------
TEST1      8       3.5000000       1.1952286       2.0000000       5.0000000
TEST2      8       3.2500000       0.7071068       2.0000000       4.0000000
-----------------------------------------------------------------------------

Output 2.1. SAS output produced by PROC MEANS.

At the top of the output page is the name JANE DOE. This name appears here because
JANE DOE was included in the TITLE1 statement of the program. Later, this guide will
show you how to insert your name in the TITLE1 statement, so that your name will appear
at the top of each of your output pages.

Below the heading "Variable," SAS prints the names of the variables being analyzed. In
this case, the variables are named TEST1 and TEST2. Descriptive statistics for Test 1
appear to the right of the heading "TEST1"; statistics for Test 2 appear to the right of
"TEST2."

Below the heading "N," the number of valid observations being analyzed is reported. You
can see that the SAS System analyzed eight observations for TEST1 and eight observations
for TEST2.
The average score on each variable is reported under "Mean." Standard deviations appear
in the column labeled "Std Dev." You can see that, for Test 1, the mean was 3.5 and the
standard deviation was 1.1952. For Test 2, the corresponding figures were 3.25 and 0.7071.

Below the headings "Minimum" and "Maximum," you will find the lowest and highest
scores observed for the two variables, respectively.
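
By default, PROC MEANS analyzes every numeric variable in the data set. If you wanted
statistics for only some of the variables, you could add a VAR statement, as in this minimal
sketch:

PROC MEANS DATA=D1;
   VAR TEST1;           /* request statistics for TEST1 only */
RUN;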
Once you have obtained this output file from your analysis, you can review it on your
computer monitor, or print it out at a printer. Chapter 3 will show you how to interpret your
output.

Conclusion
This chapter has introduced you (or reintroduced you) to the terminology that is used by
researchers in the behavioral sciences and education. With this foundation, you are now
ready to learn about performing data analyses with SAS.
The preceding section indicated that you must use some type of text editor to write SAS
programs. For most users, it is advantageous to use the SAS windowing environment for
this purpose. With the SAS windowing environment, you can write and submit SAS
programs, view the results on your monitor, print the results, and save your SAS programs
on a diskette, all from within one application. Chapter 3 provides a hands-on tutorial that
shows you how to perform these activities within the SAS windowing environment.


Chapter 3: Tutorial: Writing and Submitting SAS Programs

Introduction
   Overview
   Materials You Will Need for This Tutorial
   Conventions and Definitions
Tutorial Part I: Basics of Using the SAS Windowing Environment
Tutorial Part II: Opening and Editing an Existing SAS Program
Tutorial Part III: Submitting a Program with an Error
Tutorial Part IV: Practicing What You Have Learned
Summary of Steps for Frequently Performed Activities
   Overview
   Starting the SAS Windowing Environment
   Opening an Existing SAS Program from a Floppy Disk
   Finding and Correcting an Error in a SAS Program
   Controlling the Size of the Output Page with the OPTIONS Statement
For More Information
Conclusion


Introduction
Overview
This chapter shows you how to use the SAS windowing environment, an application that
enables you to create and edit SAS programs in a text editor, submit programs for execution,
review and print the results of the analysis, and perform related activities. This chapter
assumes that you are using the SAS System for Windows on an IBM-compatible computer.
The tutorial in this chapter is based on Version 8 of SAS. If you are using Version 7 of SAS,
you can still use the tutorial presented here (with some minor adjustments), because the
interfaces for Version 7 and Version 8 are very similar. However, if you are using Version 6
of SAS, the interface that you are using is substantially different from the Version 8
interface.
The majority of this chapter consists of a tutorial that is divided into four parts. Part I shows
you how to start the SAS windowing environment, create a short SAS program, save it on a
3.5-inch floppy disk, submit it for execution, and print the resulting SAS log and SAS
output files. Part II shows you how to open an existing SAS file and edit it. Part III takes
you through the steps involved in debugging a program with an error. Finally, Part IV gives
you the opportunity to practice what you have learned. In addition, two short sections at the
end of the chapter summarize the steps that are involved in frequently performed activities,
and show you how to use the OPTIONS statement to control the size of your output page.
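
As a preview, the OPTIONS statement takes the following form; the values shown are
simply the ones used throughout this guide's programs:

OPTIONS LS=80 PS=60;   /* LS (line size): characters per output line */
                       /* PS (page size): lines per output page      */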
Materials You Will Need for This Tutorial
To complete this tutorial, you will need access to a computer on which the SAS System for
Windows has been installed. You will also need at least one (and preferably two) 3.5-inch
diskettes formatted for IBM-compatible computers (as opposed to Macintosh computers).
Conventions and Definitions
Here is a brief explanation of the computer-related terms that are used in this chapter:

• The ENTER key. Depending on the computer you are using, this key is identified by
"Enter," "Return," "CR," "New Line," the symbol of an arrow looping backward, or
some other identifier. This key is equivalent to the return key on a typewriter.

• The backspace key. This is the key that allows you to delete text one letter at a time.
The key is identified by the word "Backspace" or "Delete," or possibly by an arrow
pointing backward.

• Menus. This book uses the abbreviation "menu" for "pull-down menu." A menu is a
list of commands that you can access by clicking a word on the menu bar at the top of a
window. For example, if you click the word File on the menu bar of the Editor window,
the File pull-down menu appears (this menu contains commands for working with files,
as you will see later in this chapter).

• The mouse pointer. The mouse pointer is a small icon that you move around the
screen by moving your mouse around on its mouse pad. Different icons serve as the
mouse pointer in different contexts, depending on where you are in the SAS windowing
environment. Sometimes the mouse pointer is an arrow, sometimes it is an I-beam, and
sometimes it is a small hand.

• The I-beam. The I-beam is a special type of mouse pointer. It is the small icon that
looks like the letter I and appears on the screen when you are working in the Editor
window. You move the I-beam around the screen by moving the mouse around on its
mouse pad. If the I-beam appears at a particular location in a SAS program and you
click the left button on the mouse, that point becomes the insertion point in the
program; whatever you type will be inserted at that point.

• The cursor. The cursor is the flashing bar that appears in your SAS program when
you are working in the Editor window. Anything you type will appear at the location of
the cursor.

• Insert versus overtype mode. The Insert key toggles between insert mode and
overtype mode. When you are in insert mode, the text to the right of the cursor will be
pushed over to the right as you type. If you are in overtype mode, the text to the right of
the cursor will disappear as you type over it.

• Pointing. When this tutorial tells you to point at an icon on your screen, it means to
position the mouse pointer over that icon.

• Clicking. When this tutorial tells you to click something on your screen, it means to put
the mouse pointer on that word or icon and click the button on your mouse one time. If
your mouse has more than one button, click the button on the left.

• Double-clicking. When this tutorial tells you to double-click something on your screen,
it means to put the mouse pointer on that word or icon and click the left button on your
mouse twice in rapid succession. Make sure that your mouse does not move when you
are clicking.


Tutorial Part I: Basics of Using the SAS Windowing Environment
Overview
This section introduces you to the basic features of the SAS windowing environment. You
will learn how to start the SAS System and how to cycle between the three windows that
you use in a typical session: the Editor window, the Log window, and the Output window.
You will type a simple SAS program, save the program on a 3.5-inch floppy disk, and
submit it for execution. Finally, you will learn how to review the SAS log and SAS output
created by your program and how to print these files on a printer.
Starting the SAS System
Turn on your computer and monitor if they are not already on. If you are working in a
computer lab at a university and your computer screen is blank, your computer might be in
sleep mode. To activate it, press any key.
After the computer has finished booting up (or waking up), the monitor displays its normal
start-up screen.
Figure 3.1 shows the start-up screen for computers at Saginaw Valley State University
(where this book was written). Your screen will not look exactly like Figure 3.1, although it
should have a gray bar at the bottom, similar to the gray bar at the bottom of Figure 3.1. On
the left side of this bar is a button labeled Start.

Figure 3.1. The initial computer screen.

This is how you start the SAS System (try this now):

1. Use your mouse to move the mouse pointer on your screen to the Start button. Click
this Start button once (click it with the left button on your mouse, if your mouse
has more than one button). A menu of options appears.

2. In this list of options is the word Programs. Put the mouse pointer on Programs.
This reveals a list of programs on the computer. One of these programs is The SAS
System for Windows V8.

3. Put the mouse pointer on The SAS System for Windows V8 and select it (click it
and release). This starts the SAS System. This process takes several seconds.
At your facility, the actual sequence for starting SAS might be different from the sequence
described here. For example, it is possible that there is no item on your Programs menu
labeled The SAS System for Windows V8. If this is the case, you should ask the lab
assistant or your professor for guidance regarding the correct way to start SAS at your
location.


When SAS is running, three windows appear on your screen: the Explorer window, the Log
window, and the Editor window. Your screen should look something like Figure 3.2.
Figure 3.2. The initial SAS screen (closing the SAS Explorer window).

The Five Basic SAS System Windows


After you start SAS, you have access to five SAS windows: the Editor, Log, Output,
Explorer, and Results windows. Not all of these windows are visible when you first start
SAS.
Of these five windows, you will use only three of them to perform the types of analyses
described in this book. The three windows that you will use are briefly described here:

• The Editor window. The Editor is a SAS program editor. It enables you to create, edit,
submit, save, open, and print SAS programs. In a typical session, you will spend most of
your time working within this window. When you first start SAS, the words
"Editor - Untitled1" appear in the title bar for this window (the title bar is the bar that
appears at the top of a window). After you save your SAS program and give it a name,
that name appears in the title bar for the Editor window.

• The Log window. The Log window displays your SAS log after you submit a SAS
program. The SAS log is a file generated by SAS that contains your SAS program
(minus the data lines), along with a listing of notes, warnings, error messages, and other
information pertaining to the execution of your program. In Figure 3.2, you can see that
the Log window appears in the top half of the initial screen. The word "Log" appears in
the title bar for this window.

• The Output window. The Output window contains the results of the analyses
requested by your SAS program. Although the Output window does not appear in
Figure 3.2, a later section shows you how to make it appear.

This book does not show you how to use the two remaining windows (the Explorer window
and the Results window). In fact, for this tutorial, the first thing you should do each time
you start SAS is to close these windows. This is not because these windows are not useful; it
is because this book is designed to be an elementary introduction to SAS, and these two
windows enable you to perform more advanced activities that are beyond the scope of this
book. For guidance in using these more advanced features of the SAS windowing
environment, see Delwiche and Slaughter (1998) and Gilmore (1997).
The two windows that you will close each time you start SAS are briefly described here:

• The Explorer window. The Explorer window appears on the left side of your computer
screen when you first start SAS (the word "Explorer" appears in its title bar; see Figure
3.2). It enables you to open files, move files, copy files, and perform other file
management tasks. You can use it to create libraries of SAS files and to create shortcuts
to files other than SAS files. The Explorer window is helpful when you are managing a
large number of files or libraries.

• The Results window. The Results window also appears on the left side of your screen
when you start SAS. It is hidden beneath the Explorer window, but you can see it after
you close that window. The Results window lists each section of your SAS output in
outline form. When you request many different statistical procedures, it provides a
concise, easy-to-navigate listing of results. You can use the Results window to view,
print, and save individual sections of output. The Results window is useful when you
write a SAS program that contains a large number of procedures.

What If My Computer Screen Does Not Look Like Figure 3.2?


Your computer screen might not look exactly like the computer screen illustrated in Figure
3.2. For example, your computer screen might not contain one of the windows (such as the
Editor window) that appears in Figure 3.2. There are a number of possible reasons for this,
and it is not necessarily a cause for alarm.
This chapter was prepared using Version 8 of SAS, so if you are using a later version of
SAS, your screen might differ from the one shown in Figure 3.2. Also, the computer
services staff at your university might have customized the SAS System, which would make
it look different at startup.
The only important consideration is this: Your SAS interface must be set up so that you can
use the Editor window, Log window, and Output window. There is more than one way to
achieve this. After you read the following sections, you will have a good idea of how to
accomplish this, even if your screen does not look exactly like Figure 3.2.


The following sections show you how to close the two windows that you will not use, how
to maximize the Editor window, and how to perform other activities that will help prepare
the SAS windowing environment for writing and submitting simple SAS programs.
Closing the SAS Explorer Window
The SAS Explorer window appears on the left side of your computer screen (see Figure 3.2).
At the top of this window is a title bar that contains the word Explorer.
Your first task is to close this Explorer window to create more room for the SAS Editor,
Log, and Output windows. In the upper-right corner of the Explorer window (on the title
bar), there is a small box with an X in it. This is the close button for the Explorer window.
At this time, complete the following step:

Put your mouse pointer on the close button for the Explorer window and click
once (see Figure 3.2 for guidance; make sure that you click the close button for the
Explorer window, and not for any other window).

The Explorer window will close.
Closing the SAS Results Window

When the Explorer window closes, it reveals another window beneath it: the Results
window. Your screen should look like Figure 3.3.

Figure 3.3. Closing the SAS Results window.

Your next task is to close this Results window to create more room for the Editor, Log, and
Output windows. In the upper-right corner of the Results window (on its title bar), there is a
small box with an X in it. This is the close button for the Results window.

Put your mouse pointer on the close button for the Results window and click once
(see Figure 3.3).

The Results window will close.
Maximizing the Editor Window
After you close the Results window, the Log window and the Editor window expand to the
left to fill the SAS screen. The Log window appears on the upper part of the screen, and the
Editor window appears on the lower part. Your screen should look like Figure 3.4.

Figure 3.4. Maximizing the Editor window.

As said earlier, the title bar for the Editor window is the bar that appears at the top of the
window. In Figure 3.4, the title "Editor - Untitled1" appears on the left side of the title bar.
On the right side of this title bar are three buttons (don't click them yet):

• The minimize window button. If you click this button, the window will shrink and
become hidden.

• The maximize window button (the button that contains a square). If you click this
button, the window will become larger and fill the screen.

• The close window button (the button that contains an X). If you click this button, the
window will close.

At this point, the Editor window and the Log window are both visible on your screen. A
possible drawback to this arrangement is that both windows are so small that it is difficult to see
very much in either window. For some SAS users it is easier to view one window at a time,
allowing the active window to expand so that it fills the screen. With this arrangement, it is as
if you have stacked the windows on top of one another, but you can see only the window that is
in the foreground, on the top of the stack. This book shows you how to set up this
arrangement.

In order to rearrange your windows this way, complete the following step:

Using your mouse pointer, click the maximize window button for the Editor window.
This is the middle button, the one that contains the square (see Figure 3.4). Be
sure that you do this for the Editor window, not for the Log window or any other
window.

When clicking the maximize button in the Editor window, take care that you do not click
the close button (the button on the far right that contains an X). If you close this
window by accident, you can reopen it by completing the following steps (do not do this
now unless you have closed your Editor window by mistake):

1. On the menu bar, put your mouse pointer on the word View and click. The View
menu appears.

2. Select Enhanced Editor.

Your Editor window should return to the SAS windowing environment. You can select it by
using the Window menu (a later section shows you how). You can use this same procedure
to bring back your Log and Output windows if you close them by accident.
Requesting Line Numbers and Other Options
There are a number of options that you can select to make SAS easier to use. One of the most
important options is the line numbers option. If you request this option, SAS will
automatically generate line numbers for the lines of the SAS programs that you write in the
Editor. Having line numbers is useful because it helps you to know where you are in the
program, and it can make it easier to copy lines of a program, move lines, and so forth.
To request line numbers, you must first open the Enhanced Editor Options dialog box.
Figure 3.5 shows you how to do this. Complete the following steps:
1. On the menu bar, select Tools. The Tools menu appears.

2. Select Options (and continue to hold the mouse button down). A pop-up menu
appears.

3. Select Enhanced Editor and release the mouse button.



Figure 3.5. Requesting the Enhanced Editor Options dialog box.

The Enhanced Editor Options dialog box appears. This dialog box should be similar to the
one in Figure 3.6.
Figure 3.6. Selecting appropriate options in the Enhanced Editor Options dialog box.


There are two tabs for Enhanced Editor options: "General" options and "Appearance"
options. In Figure 3.6, you can see that the General options tab has been selected. If General
options has not been selected for the dialog box on your screen, click the General tab now
to bring General options to the front.

A number of options are listed in this Enhanced Editor Options dialog box. For example,
in the upper-left corner of the dialog box in Figure 3.6, you can see that two of the possible
options are "Allow cursor movement past end of line" and "Drag and drop text editing." If a
check mark appears in the small box to the left of an option, it means that the option has
been selected. If no check mark appears, the option has not been selected. If a box for an
option is empty, you can click inside the box to select that option. If a box already has a
check mark, you can click inside the box to deselect it and make the check mark disappear.

There are three settings that you should always review at the beginning of a SAS session. If
you do not set these options as described here, SAS will still work, but your screen might
not look like the screens displayed in this chapter. If your screen does not look correct or if
you are having other problems with SAS, you should go to this dialog box and verify that
your options are set correctly.

Set your options as follows at the beginning of your SAS session:

• Verify that "Show line numbers" is selected (that is, make sure that a check mark appears
in the box for this option).

• In the box labeled "Indentation," verify that "None" is selected.

Here is the one option that should not be selected:

• Verify that "Clear text on submit" is not selected (that is, make sure that a check mark
does not appear in the box for this option).

Figure 3.6 shows the proper settings for the Enhanced Editor Options dialog box. If
necessary, click inside the appropriate boxes so that these three options are set properly (you
can disregard the other options). When all options are correct, complete this step:

Click the OK button at the bottom of the dialog box.
This returns you to the Editor window. A single line number (the number 1) appears in the
upper-left corner of the Editor window. As you begin typing the lines of a SAS program,
SAS will automatically generate new line numbers. A later section of this tutorial provides
you with a specific SAS program to type. First, however, you will learn more about the
menu bar and the Window menu.


The Menu Bar

Figure 3.7 illustrates what your screen should look like at this time. The Editor window should
now be enlarged and fill your screen. Toward the top of this window is the menu bar. The menu
bar lists all of the menus that you can access while in this window: the File menu, the Edit
menu, the View menu, the Tools menu, the Run menu, the Solutions menu, the Window
menu, and the Help menu. You will use these menus to access commands that enable you to
edit SAS programs, save them on diskettes, submit them for execution, and perform other
activities.

Figure 3.7. The Editor window with line numbers.

Using the Window Menu

At this point, the Editor window should be enlarged and in the foreground. During a typical
session, you will jump back and forth frequently between the Editor window, the Log
window, and the Output window. To do this, you will use the Window menu.

In order to bring the Log window to the front of your stack, perform the following steps:

1. Go to the menu bar (at the top of your screen).

2. Put your mouse pointer on the word Window and click; this pulls down the Window
menu and lists the different windows that you can select.

3. In this menu, put your mouse pointer on the word Log and then release the button on
the mouse.

When you release the button, the (empty) Log window comes to the foreground. Notice that
the words "SAS - [Log - (Untitled)]" now appear in the title bar at the top of your screen.

To bring the Output window to the foreground, complete the following steps:

1. Go to the menu bar at the top of your screen.

2. Pull down the Window menu.

3. Select Output.

The (empty) Output window comes to the front of your stack. Notice that the words
"SAS - [Output - (Untitled)]" now appear on the title bar at the top of your screen.

To go back to the Editor window, complete these steps:

1. Go to the menu bar at the top of your screen.

2. Pull down the Window menu.

3. Select Editor.

The Editor window comes to the foreground.
If your Editor window is not as large as you would like, you can enlarge it by clicking the
bottom right corner of the window and dragging it down and to the right. When you put
your mouse pointer in this corner, make sure that the double-headed arrow appears before
you click and drag.
A More Concise Way of Illustrating Menu Paths
The preceding section showed you how to follow a number of different menu paths. A
menu path is a sequence in which you pull down a menu and select one or more commands
from that menu. For example, you are following a menu path when you go to the menu bar,
pull down the Window menu, and select Editor.


In the preceding section, menu paths were illustrated by listing each step on a separate line.
This was done for clarity. However, to conserve space, the remainder of this book will often
list an entire menu path on a single line. Here is an example:

Window → Editor

The preceding menu path instructs you to go to the menu bar at the top of the screen, pull
down the Window menu, and select Editor. Obviously, this is the same sequence that was
described earlier, but it is now being presented in a more concise way. When possible, the
remainder of this book will use this abbreviated form for specifying menu paths.
Typing a Simple SAS Program
In this section, you will prepare and submit a short SAS program. Before doing this, make
sure that the Editor window is the active window (that is, it is in front of other windows).
You know that the Editor is the active window if the title bar at the top of your screen
includes the words "SAS - [Editor - Untitled1]". If it is not the active window, use the
Window menu to bring it to the front, as described earlier.

Your cursor should now be in position to begin typing your SAS program (if it is not, use
your mouse or the arrow keys on your keyboard to move the cursor down to the now-empty
lines where your program will be typed).
Keep these points in mind as you type your program:

• Do not type the line numbers that appear to the left of the program (that is, the numbers
1, 2, 3, and so forth, that appear on the left side of the SAS program). These line
numbers are automatically generated by the SAS System as you type your SAS
program.

• The lines of your SAS program should be left-justified (that is, begin at the left side of
the window). If your cursor is in the wrong location, use your arrow keys to move it to
the correct location.

• You can type SAS statements in uppercase letters or in lowercase letters; either is
acceptable.

• If you make an error, use the backspace key to correct it. This key is identified by the
word "Backspace" or "Delete," or possibly by an arrow pointing backward.

• Be sure to press ENTER at the end of each line in the program. This moves you down to
the next line.


Type this program:

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3       INPUT TEST1 TEST2;
4    DATALINES;
5    2 3
6    2 2
7    3 3
8    3 4
9    4 3
10   4 4
11   5 3
12   5 4
13   ;
14   PROC MEANS DATA=D1;
15      TITLE1 'type your name here';
16   RUN;

Some notes about the preceding program:

• The line numbers on the far left side of the program (for example, 1, 2, and 3) are
generated by the SAS Editor. Do not type these line numbers. When you get to the
bottom of the screen, the Editor continues to generate new lines for you each time you
press ENTER.

• In this book some lines in the programs are indented (such as line 3 in the preceding
program). You are encouraged to use indention in the same way. This is not required by
the SAS System, but many programmers use indention in this way to keep sections of
the program organized meaningfully.

• Line 15 contains the TITLE1 statement. You should type your first and last names
between the single quotation marks in this statement. By doing this, your name will
appear at the top of your printout when you print the results of the analysis. Be sure that
your single quotation marks are balanced (that is, you have one to the left of your name
and one to the right of your name, before the semicolon). If you leave off one of the
single quotation marks, it will cause an error, as illustrated below.
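
For example (using a hypothetical name):

TITLE1 'JANE DOE';        (correct: the quotation marks are balanced)
TITLE1 'JANE DOE;         (incorrect: the closing quotation mark is missing)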

Scrolling through Your Program


The program that you have just typed is so short that you can probably see all of it on your
screen at one time. However, very often you will work on longer programs that extend
beyond the bottom of your screen. In these situations, it is necessary to scroll (move)
through your program so that you can see the hidden parts. There are a variety of approaches
that you can use to scroll through a file, and Figure 3.8 illustrates some of these approaches.



Figure 3.8. How to scroll up and down.

Here is a brief description of ways to scroll through a file:

• You can press the Page Up and Page Down keys on your keyboard.

• You can click and drag the scroll bar that appears on the right side of your Editor
window. Drag it down to see the lower sections of your program; drag it up to see the
earlier sections (see Figure 3.8).

• You can click the area that appears just above or below the scroll bar to move up or
down one screen at a time.

• You can click the up arrow and down arrow in the scroll bar area to move up or down
one line at a time.

You can use these same techniques to scroll through any type of SAS file, whether it is a
SAS program, a SAS log, or a SAS output file.
SAS File Types
After your program has been typed correctly, you should save it. This is done with the File
menu and the Save As command (see the next section). The current section discusses some
conventions that are followed when naming SAS files.
Most files, including SAS files, have two-part names consisting of a root and a suffix. The
root is the basic file name that you can often make up. For example, if you are analyzing
data from a sample of subjects who are Republicans, you might want to give the file a root
name such as REPUB.

If you are using a version of Windows prior to Windows 95 (such as Windows 3.1), the root
part of the file name must begin with a letter, must be no more than eight characters in
length, and must not contain any spaces or special characters (for example, # @ : ? > ; *).

If you are using Windows 95 or later, the root part of the file name can be up to 255
characters in length, and it can contain spaces and some special characters. It cannot contain
the following characters: * | \ / ; : ? < >.

The file name extension indicates what type of file you are working with. The extension
immediately follows the root, begins with a period, and is three letters long, such as .LST,
.SAS, .LOG, or .DAT. Therefore, a complete file name might appear as REPUB.LST or
REPUB.SAS. The extensions are described here:
root.SAS   This file contains the SAS program that you write. Remember that a SAS
           program is a set of statements that causes the SAS System to read data
           and perform analyses on the data. If you want to include the data as part of
           the program in this file, you can.

root.LOG   This file contains your SAS log: a file generated by SAS that includes
           notes, warnings, error messages, and other information pertaining to the
           execution of your program.

root.LST   This file contains your SAS output: the results of your analyses as
           generated by SAS. The "LST" is an abbreviation for "Listing."

root.DAT   This file is a raw data file, a file containing raw data that are to be read
           and analyzed by SAS. You use a file like this only if the data are not
           already included in the .SAS file that contains the SAS program. This
           book does not illustrate the use of the .DAT file, because with each of the
           SAS programs illustrated here, the data are always included as part of the
           .SAS file (but see the sketch that follows).
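
Although this guide does not illustrate .DAT files, here is a minimal sketch of how a
program might read one; the file name and location A:\REPUB.DAT are hypothetical. The
INFILE statement takes the place of the DATALINES statement and the in-program data
lines:

DATA D1;
   INFILE 'A:\REPUB.DAT';   /* read the raw data from an external file */
   INPUT TEST1 TEST2;
RUN;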

Saving Your SAS Program on Floppy Disks versus Other Media


This book shows you how to save your SAS programs on 3.5-inch floppy disks. This is
because it is assumed that most readers of this book are university students and that floppy
disks are the media that are most readily available to students.
It is also possible to save your SAS programs in a variety of other ways. For example, in
some courses, students are instructed to save their programs on their computers' hard drives.
In other courses, students are told to save their programs on Zip disks or some other
removable media. If you decide to save your programs on a storage medium other than a
3.5-inch floppy disk, ask your lab assistant or professor for guidance.


Saving Your SAS Program for the First Time on a Floppy Disk

To save your SAS program on a floppy disk, make sure that a 3.5-inch high-density IBM
PC-formatted disk has been inserted in drive A on your CPU (this book assumes that drive
A is the floppy drive). Also make sure that the Editor window containing your program is
the active window (the one currently in the foreground). Then make the following
selections:

File → Save As

You will see a Save As dialog box on your screen. This dialog box contains smaller boxes
with labels such as "Save in," "File name," and so forth. The dialog box should resemble
Figure 3.9.
Figure 3.9. The initial Save As dialog box.

The first time you save a program, you must tell the computer which drive your diskette is
in and provide a name for your file. In this example, suppose that your default destination is
a folder named V8 and that you do not want to save your file in this folder (on your
computer, the default folder might have a different name). Suppose you want to save it on a
floppy disk instead. With most computers, this is done in the computer's A drive. It is
therefore necessary to change your computer's drive, as illustrated here.


To change the location where your file will be saved, complete these steps:

1. Click the down arrow on the right side of the "Save in" box; from there you can
navigate to the location where you want to save your file.

2. Scroll up and down until you see "3 1/2 Floppy (A:)".

3. Select "3 1/2 Floppy (A:)". "3 1/2 Floppy (A:)" appears in the "Save in" box.

Now you must name your file. Complete the following steps:

1. Click inside the box labeled "File name." Your cursor appears inside this box (see
Figure 3.9).

2. Type the following name inside the "File name" box: DEMO.SAS.

With this done, your Save As dialog box should resemble the completed Save As dialog box
that appears in Figure 3.10.

Figure 3.10. The completed Save As dialog box.

After you have completed these tasks, you are ready to save the SAS program on the floppy
disk. To do this, complete the following step:

Click the Save button (see Figure 3.10).

After you click Save, a small light next to the 3.5-inch drive will light up for a few seconds,
indicating that the computer is saving your program on the diskette. When the light goes off,
your program has been saved under the name DEMO.SAS.


Where the Name of the SAS Program Will Appear


After you use the Save As command to name a SAS program within the Editor, the name of
that file will appear on the left side of the title bar for the Editor window (remember that the
title bar is the bar that appears at the top of a window). For example, if you look on the left
side of the title bar for the current Editor window, you will see that it no longer contains the
words "SAS - [Editor - Untitled1]" as it did before. Instead, this location now contains the
words "SAS - [DEMO]". Similarly, whenever you pull down the Window menu during this
session, you will see that it still contains items labeled Log and Output, but that it no
longer contains "Editor - Untitled1". Instead, this label has been replaced with "Demo," the
name that you gave to your SAS file.
Saving a File Each Subsequent Time on a Floppy Disk
The first time you save a new file, you should use the Save As command and give it a
specific name, as you did earlier. Each subsequent time you save that file (during the same
session), you can use the Save command rather than the Save As command. To use the Save
command, verify that your Editor window is active and make the following selections:

File → Save

Notice that when you save the file a second time using the Save command in this way, you
do not get a dialog box. Instead, the file is saved again under the same name and in the same
location that you selected earlier.
Save Your Work Often!
Sometimes you will work on a single SAS program for a long period of time. On these
occasions, you should save the file once every 10 minutes or so. If you do not do this, you
might lose all of the work you have done if the computer loses power or becomes
inoperative during your session. However, if you have saved your work frequently, you will
be able to reopen the file, and it will appear the way that it did the last time you saved it.
Submitting the SAS Program for Execution
So far you have created your SAS program and saved it as a file on your diskette. It is now
time to submit it for execution.


There are at least two ways to submit SAS programs. The first way is to use the Run menu.
The Run menu is identified in Figure 3.11.

Figure 3.11. How to submit a SAS program for execution.

To submit a SAS program by using the Run menu, make sure that the Editor is in the
foreground, and then make the following selections (go ahead and try this now):

Run → Submit

The second way to submit a SAS program is to click the Submit button on the toolbar. The
toolbar is a row of buttons below the Editor window. These buttons provide shortcuts for
performing a number of activities. One of these buttons is identified with the icon of a
running person (see Figure 3.11). This is the Submit button. To submit your program using
the toolbar, you would do the following (do not do this now): put your mouse pointer on the
running person icon, and click it once. This submits your program for execution (note that
this button is identified with a running person because after you click it, your program will
be running).


What Happens after Submitting a Program


In the preceding section, you submitted your SAS program for execution. When you submit
a SAS program, it disappears from the Editor window. While the program is executing, a
message appears above the menu bar for the Editor, and this message indicates which PROC
(SAS procedure) is running. It usually takes only a few seconds for a typical SAS program
to execute. Some programs might take longer, however, depending on the size of the data
set being analyzed, the number of procedures being requested, the speed of the computer's
processor, and other factors.
After you submit a SAS program and it finishes executing, typically you will experience one
of three possible outcomes.

Outcome 1: Your program runs perfectly and without any errors. If this happens, the
Editor window will disappear and SAS will automatically bring the Output window to
the foreground. The results of your analysis will appear in this Output window.

Outcome 2: Part of your program runs correctly, and part of it has errors. If this
happens, it is still possible that your Editor window will disappear and the Output
window will come to the foreground. In the Output window, you will see the results
from those sections of the program that ran correctly. This outcome can be misleading,
however, because if you are not careful, you might never realize that part of your
program had errors and did not run.

Outcome 3: Your program has errors and no results are produced. If this happens, the
Output window will never appear; you will see only the Editor window.

Outcome 2 can mislead you into believing that there were no problems with your SAS
program when there may, in fact, have been problems. The point is this: after you submit a
SAS program, even if SAS brings up the Output window, you should always review all of
the log file prior to reviewing the output file. This is the only way to be sure you have no
errors or other problems in your program.
Reviewing the Contents of Your Log File
The SAS log is a file generated by SAS that contains the program you submitted (minus the
data lines), along with notes, warnings, error messages, and other information pertaining to
the execution of your program. It is important to always review your log file prior to
reviewing your output file to verify that the program ran as planned.
If your program ran, the Output window is probably in the foreground now. If your program
did not run, the Editor window is probably in the foreground. In either case, bring the Log
window to the foreground by making the following selections:

Window → Log


Your Log window is now your active window. In many cases, your SAS log will be fairly
long and only the last part of it will be visible in the Log window. This is a problem,
because it is best to begin at the beginning of your log and review it from beginning to end.
If you are currently at the end of your log file, click the scroll bar on the right side of the
Log window, drag it up, and release it. The beginning of your log file is now at the top of
your screen. Scroll through the log and verify that you have no error messages.
If your program executed correctly, your log file should look something like Log 3.1 (notice
that the SAS log contains your SAS program minus the data lines).
NOTE: SAS initialization used:
      real time           14.54 seconds

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3       INPUT TEST1 TEST2;
4    DATALINES;

NOTE: The data set WORK.D1 has 8 observations and 2 variables.
NOTE: DATA statement used:
      real time           1.59 seconds

13   ;
14   PROC MEANS DATA=D1;
15      TITLE1 'JANE DOE';
16   RUN;

NOTE: There were 8 observations read from the data set WORK.D1.
NOTE: PROCEDURE MEANS used:
      real time           1.64 seconds

Log 3.1. Log file created by the current SAS program.

What Do I Do If My Program Did Not Run Correctly?


If your program ran correctly and there are no error messages in your SAS log, you should
skim this section and then skip to the following section (titled "Printing the Log File on a
Printer"). If your program did not run correctly, you should follow the instructions provided
here.

If there is an error message in your log file, or if for any other reason your program did not
run correctly, you have two options:

• If there is a professor or a lab assistant available, ask this person to help you debug the
program and resubmit it. After the program runs correctly, continue with the next
section.

• If there is no professor or lab assistant available, go to the section titled "Tutorial Part
III: Submitting a Program with an Error," which is in the second half of this chapter.
That section shows you how to correct and resubmit a program. Use the guidelines
provided there to debug your program and resubmit it. When your program is running
correctly, continue with the next section below.
Printing the Log File on a Printer

To get a hardcopy (paper copy) of your log file, the Log window must be in the foreground.
If necessary, make the following selections to bring your Log window to the front:

Window → Log

Then select

File → Print

This gives you the Print dialog box, which should look something like Figure 3.12.

Figure 3.12. The Print dialog box.

Assuming that all of the settings in this dialog box are acceptable, you can print your file by
completing the following step (do this now):

Click the OK button at the bottom of this dialog box.

Your log file should print. Go to the printer and pick it up. If other people are using the SAS
System at this time, make sure that you get your own personal log file and not a log file
created by someone else (your log file should have your name in the TITLE1 statement,
toward the bottom).
Reviewing the Contents of Your Output File

If your program ran correctly, the results of the analysis can be viewed in the Output
window. To review your output, bring the Output window to the foreground by making the
following selections:

Window → Output

Your Output window is now in the foreground. If you cannot see all of your output, it is
probably because you are at the bottom of the output file. If this is the case, scroll up to see
all of the output page. If your program ran correctly (and you keyed your data correctly), the
output should look something like Output 3.1.
                                JANE DOE

                          The MEANS Procedure

Variable   N            Mean         Std Dev         Minimum         Maximum
-----------------------------------------------------------------------------
TEST1      8       3.5000000       1.1952286       2.0000000       5.0000000
TEST2      8       3.2500000       0.7071068       2.0000000       4.0000000
-----------------------------------------------------------------------------

Output 3.1. SAS output produced by PROC MEANS.

Printing Your SAS Output

Now you should print your output file on a printer. You do this by following the same
procedure used to print your log file. Make sure that your Output window is in the
foreground and make the following selections:

File → Print

This gives you the Print dialog box. If all the settings are appropriate, complete this step:

Click the OK button at the bottom of the Print dialog box.

Your output file should print. When you pick it up, verify that you have your own output
and not the output file created by someone else (your name should appear at the top of the
output if you typed your name in the TITLE1 statement, as you were directed).


Clearing the Log and Output Windows


After you finish an analysis, it is a good idea to clear the Log and Output windows prior to
doing any subsequent analyses. If you perform subsequent analyses, the new log and output
files will be appended to the bottom of the log and output files created by earlier analyses.
To clear the contents of your Output window, make sure that the Output window is in the
foreground and make the following selections:
Edit → Clear All
The contents of the Output window should disappear from the screen. Now bring the Log
window to the front:
Window → Log
Clear its contents by clicking
Edit → Clear All
Returning to the Editor Window
Suppose that you now want to modify your SAS program by adding new data to the data set.
Before doing this, you must bring the Editor to the foreground.
An earlier section warned you that after you save a SAS program, the word Editor will no
longer appear on the Window menu. In its place, you will see the name that you gave to
your SAS program. In this session, you gave the name DEMO.SAS to your SAS program.
This means that, in the Window menu, you will now find the word Demo where Editor
used to be. To bring the Editor to the foreground, you should select Demo.
Window → Demo
The Editor window containing your SAS program now appears on your screen.
What If the Editor Window Is Empty?
When you bring the Editor window to the foreground, it is possible that your SAS program
has disappeared. If this is the case, it might be because you did not set the Enhanced Editor
Options in the way described earlier in this chapter. Figure 3.6 showed how these options
should be set. Toward the bottom of the Enhanced Editor Options dialog box, one of the
options is Clear text on submit. The directions provided earlier indicated that this option
should not be selected (that is, the box for this option should not be checked). If your SAS
program disappeared after you submitted it, it might be because this box was checked. If
this is the case, go to the Enhanced Editor Options dialog box now and verify that this option is not selected. You should deselect it if it is selected (see the previous section titled "Requesting Line Numbers and Other Options" for directions on how to do this).
If your SAS program has disappeared from the Editor window, you can retrieve it easily by
using the Recall Last Submit command. If this is necessary, verify that your Editor window
is in the foreground and make the following selections (do this only if your SAS program
has disappeared):
Run → Recall Last Submit
Your SAS program reappears in the Editor window.
Saving Your SAS Program on a Diskette (Again)
At the end of a SAS session, you will save the most recent version of your SAS program on
a diskette. If you do this, you will be able to open up this most recent version of the program
the next time you want to do some additional analyses.
You will now save the program on the diskette in drive A. Because this is not the first
time you have saved it this session, you can use the Save command rather than the Save As
command. Verify that your Editor window is the active window (and that your program is
actually in this window), and make the following selections:
File → Save
Ending Your SAS Session
You can now end your SAS session by selecting
File → Exit
A dialog box appears with the message, "Are you sure you want to end the SAS session?"
Click OK to end the SAS session.


Tutorial Part II: Opening and Editing an Existing SAS Program
Overview
This section shows you how to open the SAS program that you have saved on a floppy disk.
It also shows you how to edit an existing program: how to insert new lines, delete lines,
copy lines, and perform other activities that are necessary to modify a SAS program.
Restarting SAS
Often you will want to open a file (on a diskette) that contains a SAS program that you
created earlier. This section and the three to follow show you how to do this.
Verify that your computer and monitor are turned on and are not in sleep mode. Then
complete the following steps:
Click the Start button that appears at the bottom of your initial screen. This displays
a list of options, including the word Programs.
Select Programs. This reveals a list of programs on the computer. One of them is
The SAS System for Windows V8.
Select (click) The SAS System for Windows V8.
This starts the SAS System. After a few seconds, you will see the initial SAS screen, which
contains the Explorer window, the Log window, and the Editor window. Your screen should
look something like Figure 3.13.

Figure 3.13. Modifying the initial SAS screen. (The figure's callouts read: "First, click the close button for the Explorer window, as well as for the Results window, which lies beneath it," and "Next, click the maximize button for the Editor window.")

Modifying the Initial SAS System Screen


Before opening an existing SAS file, you need to modify the initial SAS System screen so
that it is easier to work with. Complete the following steps.
Close the Explorer window:
Click the close window button for the Explorer window (the button in the upper-right corner of the Explorer window; see Figure 3.13).
This reveals the Results window, which was hidden beneath the Explorer window. Now
close the Results window:
Click the close window button for the Results window (the button in the upper-right corner of the Results window).
The remaining visible windows now expand to fill the screen. Your screen should contain
only the Log window (at the top of the screen) and the Editor window (at the bottom).
Remember that the Editor window is identified by the words Editor - Untitled1 in its title
bar. To maximize the Editor window, complete the following step:
Click the maximize window button for the Editor window (the middle button, which contains a square).
The Editor window expands and fills your screen.


Setting Line Numbers and Other Options


To change the settings for line numbers and other options, use the Enhanced Editor
Options dialog box. From the Editor's menu bar, make the following selections:
Tools → Options → Enhanced Editor
This opens the Enhanced Editor Options dialog box (see Figure 3.14).
Figure 3.14. Verifying that appropriate options are selected in the Enhanced Editor Options dialog box. (The figure's callouts read: "Verify that Show line numbers is selected," "In the Indentation section, verify that None is selected," and "Verify that Clear text on submit is not selected.")
If you began Part II of this tutorial immediately after completing Part I (and if you are
working at the same computer), the options that you selected in Part I should still be
selected. However, if you have changed computers, or if someone else has used SAS on
your computer since you used it, your options might have been changed. For that reason, it
is always a good idea to check at the beginning of each SAS session to ensure that your
Editor options are correct.
As explained in Part I, the Enhanced Editor Options dialog box consists of two
components: General options and Appearance options. The upper-left corner of the
Enhanced Editor Options dialog box contains one tab labeled General and one tab labeled
Appearance. You should verify that General options is at the front of the stack (that is,
General options is visible), and you should click the tab labeled General if it is not visible.


The Enhanced Editor Options dialog box contains a variety of different options, but we
will focus on three of them. Here are the two options that should be selected at the
beginning of a SAS session:
• Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).

• In the box labeled Indentation, verify that None is selected.

This option should not be selected:

• Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).

Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If
necessary, click inside the appropriate boxes so that the three options described earlier are
set properly (you can disregard the other options). When all are correct,
Click the OK button at the bottom of the dialog box.
This returns you to the Editor window. A single line number (the number 1) appears in the
upper left corner of the Editor window.
Reviewing the Names of Files on a Floppy Disk and Opening an
Existing SAS Program
Earlier, you saved your SAS program on a 3.5-inch floppy disk in drive A. This section
shows you how to open this program in the Editor.
To begin this process, verify that your floppy disk is in drive A and that the Editor window
is the active window. Then make the following selections:
File → Open
The Open dialog box appears on your screen. The Open dialog box contains a number of
smaller boxes with labels such as Look in, File name, and so forth. It should look
something like Figure 3.15.

Figure 3.15. The Open dialog box. (The figure's callouts read: "Click this down arrow to display a list of other possible locations where SAS can look for your file," and "After you have selected the correct Look in location, the names of your files should appear in this window.")

Toward the top of the Open dialog box is a box labeled Look in (see Figure 3.15). This box
tells SAS where it should look to find a file. The default is to look in a folder named V8
on your hard drive. You know that this is the case if the Look in box contains the icon of a
folder and the words V8 (although the default location might be different on your
computer). You have to change this default so that SAS will look on your 3.5-inch floppy
disk to find your program file. To accomplish this, complete the following steps:
On the right side of the Look in box is a down arrow. Click this down arrow to get
other possible locations where SAS can look (see Figure 3.15).
Scroll up and down this list of possible locations (if necessary) until you see an
entry that reads 3 1/2 Floppy (A:).
Click the entry that reads 3 1/2 Floppy (A:).
3 1/2 Floppy (A:) appears in the Look in box. When SAS searches for files, it will look on
your floppy disk.


The contents of your disk now appear below the Look in box. One of these files is
DEMO, the file that you need to open. Remember that when you first saved your file, you
gave it the full name DEMO.SAS. However, this .SAS extension does not appear on the
SAS program files that appear in the Open dialog box.
To open this file, complete the following steps:
Click the file named DEMO. The name DEMO appears in the File name box below
the large box.
Click the Open button.
The SAS program that you saved under the name DEMO.SAS appears in the Editor
window. You can now modify it and submit it for execution.
What If I Don't See the Name of My File on My Disk?
In the Open dialog box, if you don't see the name of a file that you know is on your
diskette, there are several possible reasons. The first thing you should do is verify that you
are looking in drive A, and that your floppy disk has been inserted in the drive.
If everything appears to be correct with the Look in box, the second thing you should do is
review the box labeled Files of type, which also appears in the Open dialog box. Verify that
the entry inside this box is SAS Files (*.sas) (see Figure 3.15). If this entry does not appear
in this box, it means that SAS is looking for the wrong types of files on your disk. Click the
down arrow on the right side of the Files of type box to reveal other options. Select SAS
Files (*.sas), and then check the window to see if DEMO now appears there.
Another possible solution is to instruct SAS to list the names of all files on your disk,
regardless of type. To do this, go to the box labeled Files of type. On the right side of this
Files of type box is a down arrow. Click the down arrow to reveal different options. Select
the entry that says All Files (*.*). This reveals the names of all the files on your disk,
regardless of the format in which they were saved.


General Comments about Editing SAS Programs


The following sections show you how to edit an existing SAS program. Editing a SAS
program involves modifying it in some way: inserting new lines, copying lines, moving
lines, and so forth. Keep the following points in mind as you edit files:
• The Undo command. The Undo command allows you to undo (reverse) your most recent editing action. For example, assume that you select (highlight) a large section of your SAS program, and then you accidentally delete it. It is possible to use the Undo command and return your program to its prior state.

When you make a mistake, you can undo your most recent editing action (do not select this
now; this is for illustration only):
Edit → Undo
This returns your program to the state it was in prior to the most recent editing action,
whether that action involved deleting, copying, cutting, or some other activity.
You can select the Undo command multiple times in a row. This allows you to undo a
sequence of changes that you have made since the last time the Save command was used.
• Using the arrow keys. Somewhere on your keyboard are keys marked with directional arrows (up, down, left, and right). They enable you to move your cursor around the SAS program.

When you want to move your cursor to a lower line on a program that you have already
written, you generally use the down arrow key rather than the ENTER key. This is
because, when you press the ENTER key, it creates a new blank line in the SAS program as
it moves your cursor down. Thus, you should use the ENTER key only when you want to
create new lines; otherwise, rely on the arrow keys.
The following sections show you how to perform a number of editing activities. It is
important that you perform these editing functions as you move through the tutorial. As you
read each section, you should modify your SAS program in the same way that the SAS
program in the book is being modified.
Inserting a Single Line in an Existing SAS Program
When editing a SAS program in the Editor, you might want to insert a single line between
two existing lines as follows:
• Place the cursor at the end of the line that is to precede the new line.

• Press the ENTER key once.


For example, suppose that you want to insert a new line after line 4 in the following
program. To do this, you must first place the cursor at the end of that line (that is, at the end
of the DATALINES statement that appears on line 4). Complete the following steps:
Use the mouse to place the I-beam at the end of the DATALINES statement on line 4.
Click once.
The flashing cursor appears at the point where you clicked (if you missed and the cursor is
not in the correct location, use your arrow keys to move it to the correct location).
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 3
6   2 2
7   3 3
Press ENTER. A new blank line is inserted between existing lines 4 and 5, as
shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5
6   2 3
7   2 2
8   3 3

Your cursor (the insertion point) is now in column 1 of the new line you have created. You
can now type a new data line in the blank line. Complete the following step:
Type the numbers 6 and 7 on line 5, as shown here:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   2 3
7   2 2
8   3 3


Let's do this one more time: Again, insert a new blank line after the DATALINES
statement:
Use the mouse to place the I-beam at the end of the DATALINES statement on line 4.
Click once to place the insertion point there.
Press the ENTER key once. This gives you a new blank line.
Now type the numbers 8 and 9 in the new line that you have created, as shown here:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   8 9
6   6 7
7   2 3
8   2 2
9   3 3

Inserting Multiple Lines


You follow essentially the same procedure to insert multiple lines, with one exception: After
you have positioned your insertion point, you press ENTER multiple times rather than one
time.
For example, complete these steps to insert three new lines between existing lines 4 and 5:
Use the mouse to place the I-beam at the end of the DATALINES statement on line 4.
Click once to place the insertion point there.
Press ENTER three times.
This gives you three new blank lines, as shown here:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5
6
7
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3

Next, you will type some new data on the three new lines you have created, starting with
line 5.
Use your arrow keys to move the cursor up to line 5.
Now type the following values on lines 5 through 7, so that your data set looks like the following program (after you have typed the values on a particular line, use the arrow keys to move the cursor down to the next line):
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 2
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3

Deleting a Single Line


There are at least two ways to delete a single line in a SAS program. One way is obvious:
place the cursor at the end of the line and press the backspace key to delete the line, one
character at a time.
A second way (and the way that you will use here) is to click and drag to highlight the line,
and then delete the entire line at once. This introduces you to the concept of clicking and
dragging, which is a very important technique to use when editing a SAS program.
For example, suppose that you want to delete line 5 of the following program (the line with "2 2" on it):
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 2
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3


Complete the following steps:


Place your I-beam cursor at the beginning of the data on line 5. (This means place the I-beam to the immediate left of the first "2" on line 5; do not go too far to the left of the "2" or your I-beam will turn into an arrow; if this happens you have gone too far. This might take a little practice.)
Click once and hold the button down (do not release it yet).
While holding the button down, drag your mouse to the right so that the data on line 5 are highlighted in black. (This means that you should drag to the right until the "2 2" is highlighted in black. Do not drag your mouse up or down, or you might accidentally highlight additional lines of data.)
After the data are highlighted in black, release the button. The data should remain highlighted.
Your program should look something like this (with the "2 2" on line 5 highlighted):
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 2
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3

To delete the line you have highlighted, complete the following step:
Press the Backspace (DELETE) key.
The highlighted data disappear, leaving only a blank line, as shown here:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3


Your cursor is in column 1 of the newly blank line. To make the blank line disappear,
complete the following step:
Press the Backspace (DELETE) key again.
Your program now appears as shown here:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   3 3
6   4 4
7   8 9
8   6 7
9   2 3
10  2 2
11  3 3

Deleting a Range of Lines


You follow a similar procedure to delete a range of lines, with one exception: When you
click and drag, you will drag down (as well as to the right) so that you highlight more than
one line. When you press the backspace key, all of the highlighted lines will be deleted.
For example, suppose that you want to delete lines 5, 6, and 7 in your program:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   3 3
6   4 4
7   8 9
8   6 7
9   2 3
10  2 2
11  3 3

Complete the following steps:


Place your I-beam at the beginning of the data on line 5. (Again, this means place the I-beam to the immediate left of the first "3" on line 5; do not go too far to the left of the "3" or your I-beam will turn into an arrow; if this happens you have gone too far.)
Click once and hold the button down (do not release it yet).


While holding the button down, drag your mouse down and to the right so that the
data on lines 5, 6, and 7 are highlighted in black.
After the data are highlighted in black, release the button. The lines remain
highlighted.
Your program should look something like this (with lines 5 through 7 highlighted):
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   3 3
6   4 4
7   8 9
8   6 7
9   2 3
10  2 2
11  3 3

To delete the lines you have highlighted, complete this step:


Press the Backspace (DELETE) key once.
The highlighted lines disappear. After deleting these lines, it is possible that one blank line
will be left, as shown on line 5 here:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5
6   6 7
7   2 3
8   2 2
9   3 3

To delete the blank line,


Press the Backspace (DELETE) key again.
Your program now appears as shown here:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   2 3
7   2 2
8   3 3


Copying a Single Line into Your Program


Copying a single line involves these steps (do not do this yet):
1. Create a new blank line where the copied line is to be inserted.
2. Click and drag to highlight the line to be copied.
3. Pull down the Edit menu and select Copy.
4. Place the cursor at the point where the line is to be pasted.
5. Pull down the Edit menu and select Paste.
For example, suppose that you want to make a copy of line 5 and place the copy before line
6 in the following program:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   2 3
7   2 2
8   3 3

First, you must create a new blank line where the copied line is to be inserted. Complete the
following steps (try this now):
Place the I-beam at the end of the data on line 5 (that is, after the "6 7" on line 5) and click once. This places your cursor to the right of the numbers 6 7.
Press ENTER once.
This creates a new blank line after line 5, as shown here:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6
7   2 3
8   2 2
9   3 3

Next you highlight the data to be copied. Complete the following steps:
Place your I-beam at the beginning of the data on line 5. (This means place the I-beam to the immediate left of the "6" on the "6 7" line.)
Click once and hold the button down (do not release it yet).


While holding the button down, drag your mouse to the right so that the data on line
5 are highlighted in black.
After the data are highlighted in black, release the button. The data remain
highlighted.
With this done, your program should look something like this:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6
7   2 3
8   2 2
9   3 3

Now you must select the Copy command:
Edit → Copy
Nothing appears to happen, but don't worry; the highlighted text has been copied to an invisible clipboard. Now place your cursor (the insertion point) at the beginning of line 6.
Complete this step:
Place the I-beam in column 1 of line 6 and click.
Your program should look like this:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6
7   2 3
8   2 2
9   3 3

Finally, you can take the material that you copied to the clipboard and paste it at the
insertion point in your program.
Make the following selections:
Edit → Paste


The copied data now appear on line 6. Your program should look something like this:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   2 2
9   3 3

Copying a Range of Lines


To copy a range of lines, you follow the same procedure that you use to copy a single line,
with one exception: When you click and drag, you drag down (as well as to the right) so that
you highlight more than one line. When you select Paste, all of the highlighted lines will be
copied.
For example, assume that you want to copy lines 5-7 and place the copied lines before line 8
in the following program:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   2 2
9   3 3

First, you must create a new blank line where the copied lines are to be inserted. Complete
the following steps:
Place the I-beam at the end of the data on line 7 (that is, after the "2 3" on line 7) and click once. This places your cursor to the right of the "2 3".
Press ENTER once.


This creates a new blank line after line 7, as shown here:


2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8
9   2 2
10  3 3

Next you highlight the data to be copied. Complete these steps:


Place your I-beam at the beginning of the data on line 5. (This means place the I-beam to the immediate left of the "6" on the "6 7" line.)
Click once and hold the button down.
While holding the button down, drag your mouse down and to the right so that the
data on lines 5 through 7 are highlighted in black.
After the lines are highlighted in black, release the button. The lines remain
highlighted.
With this done, your program should look something like this:
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8
9   2 2
10  3 3

Now you must select the Copy command:
Edit → Copy
Nothing appears to happen, but don't worry; the highlighted text has been copied to an invisible clipboard. Now you need to place your cursor (the insertion point) at the beginning of line 8:
Place the I-beam in column 1 of line 8 and click.


Your program should look like this:


2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8
9   2 2
10  3 3

Finally, take the material that you copied to the clipboard and paste your selection at the
insertion point in your program.
Edit → Paste
The copied data now appear on (and following) line 8. Your program should look something
like this:
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   6 7
9   6 7
10  2 3
11  2 2
12  3 3

Moving Lines
To move lines, you follow the same procedure that you use to copy lines, with one
exception: When you initially pull down the Edit menu, select Cut rather than Copy.
For example, you follow these steps to move a range of lines (do not actually do this now):
1. Create a new blank line where the moved lines are to be inserted.
2. Click and drag to highlight the lines to be moved.
3. Pull down the Edit menu and select Cut.
4. Place the cursor at the point where the lines are to be pasted.
5. Pull down the Edit menu and select Paste.


When you are finished, there will be one blank line at the location where the moved lines
used to be. You can delete this line in the usual way: Use your mouse to place the cursor in
column 1 of that blank line, and press the backspace (delete) key.
Saving Your SAS Program and Ending the SAS Session
Now it is time to save the program on your floppy disk in drive A. Because you opened
DEMO.SAS from drive A, drive A is now the default drive, so it is not necessary to assign it
as the default drive. You can save your file using the Save command, rather than the
Save As command. Verify that your Editor window is the active window, and select:
File → Save
Now end your SAS session by selecting:
File → Exit
This produces a dialog box that asks if you are sure you want to end the SAS session. Click
OK in this dialog box, and the SAS session ends.


Tutorial Part III: Submitting a Program with an Error
Overview
In this section, you modify an existing SAS program so that it will produce an error when
you submit it. This gives you the opportunity to learn the procedure that you will follow
when debugging SAS programs with errors.
Restarting SAS
Complete the following steps:
Click the Start button that appears at the bottom of your initial screen. This displays
a list of options, including the word Programs.
Select Programs. This reveals a list of programs on the computer. One of them is
The SAS System for Windows V8.
Select The SAS System for Windows V8 and release the button.
Modifying the Initial SAS System Screen
Before opening an existing SAS file, modify the initial SAS screen so that it is easier to
work with. Complete the following steps.
Close the Explorer window:
Click the close window button for the Explorer window (the button in the upper-right corner of the Explorer window; see Figure 3.13).
This reveals the Results window, which was hidden beneath the Explorer window. Now
close the Results window:
Click the close window button for the Results window (the button in the upper-right corner of the Results window).
Your screen now contains the Log window and the Editor window. To maximize the Editor
window, complete this step:


Click the maximize window button for the Editor window (the middle button, which contains a square; see Figure 3.4).
The Editor expands and fills your screen.
Finally, you need to review the Enhanced Editor Options dialog box to verify that the
appropriate options have been selected. To request this dialog box, make the following
selections:
Tools → Options → Enhanced Editor
The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled
General and one tab labeled Appearance. Verify that the General options tab is in the
foreground (that is, General options is visible), and click the tab labeled General if it is not
visible.
The Enhanced Editor Options dialog box contains a variety of different options, but we
will focus on three of them. Here are the two options that should be selected at the
beginning of a SAS session:
• Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).

• In the box labeled Indentation, verify that None is selected.

Here is the one option that should not be selected:

• Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).

Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If
necessary, click inside the appropriate boxes so that the three options described earlier are
set properly (you can disregard the other options). When all are correct, complete the
following step:
Click the OK button at the bottom of the dialog box.
This returns you to the Editor window.


Opening an Existing SAS Program from Your Floppy Disk


Verify that your 3.5-inch floppy disk is in drive A and that the Editor window is the active
window. Then make the following selections:
File → Open
The Open dialog box appears on your screen. Toward the top of the Open dialog box is a
box labeled Look in. This box tells the SAS System where it should look to find a file. If
this box does not contain 3 1/2 Floppy (A:), you will have to change it. If this is necessary,
complete these steps:
On the right side of the Look in box is a down arrow. Click this down arrow to get
other possible locations where the SAS System can look (see Figure 3.15).
Scroll up and down this list of possible locations (if necessary) until you see an
entry that reads 3 1/2 Floppy (A:).
Click the entry that reads 3 1/2 Floppy (A:).
3 1/2 Floppy (A:) appears in the Look in box. The contents of your diskette now appear in
the larger box below the Look in box. One of these files should be DEMO, the file that you
need to open. To open this file, complete the following steps:
Click DEMO. The name DEMO appears in the File name box.
Click the Open button.
The SAS program that you saved in the preceding section appears in the Editor window.
You are now free to modify it and submit it for execution.
Submitting a Program with an Error
You will now submit a program with an error in order to see how errors are identified and
corrected. With the file DEMO.SAS opened in the Editor window, change the third line
from the bottom so that it requests PROC MEENS instead of PROC MEANS. This will
produce an error message when SAS attempts to execute the program, because there is no
procedure named PROC MEENS.


Here is the modified program; notice that the third line from the bottom now requests
PROC MEENS:
OPTIONS LS=80 PS=60;
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
6 7
6 7
2 3
6 7
6 7
2 3
2 2
3 3
3 4
4 3
4 4
5 3
5 4
;
PROC MEENS DATA=D1;
TITLE1 'JANE DOE';
RUN;
After you make this change, submit the program for execution in the usual way:
Run → Submit
This submits your program for execution.
Reviewing the SAS Log and Correcting the Error
After you submit the SAS program, it takes SAS a few seconds to finish processing it.
However, after this processing is complete the Output window will not become the active
window. The fact that your Output window does not appear indicates that your program did
not run correctly.
To determine what was wrong with the program, you need to review the Log window. Make
the following selections:
Window → Log


The log file that is created by your program should look similar to Log 3.2.
NOTE: SAS initialization used:
      real time           15.44 seconds

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3    INPUT TEST1 TEST2;
4    DATALINES;

NOTE: The data set WORK.D1 has 13 observations and 2 variables.
NOTE: DATA statement used:
      real time           2.86 seconds

18   ;
19   PROC MEENS DATA=D1;
ERROR: Procedure MEENS not found.
20   TITLE1 'JANE DOE';
21   RUN;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE MEENS used:
      real time           0.22 seconds

Log 3.2. Log file created by a SAS program with an error.

When you go to the SAS Log window, you will usually see the last part of the log file. You
should begin reviewing at the beginning of a log file when looking for errors or other
problems. Scroll to the top of the log file by completing this step (try this now):
Scroll up to the top of the log file either by clicking and dragging the scroll bar or
by clicking the up arrow in the scroll bar area of the Log window.
Starting at the top of the log file, begin looking for warning messages, error messages, or
other signs of problems. Begin at the top of the log file and work your way down. It is
important to always begin at the top, because a single error message early in a program can
sometimes cause dozens of additional error messages later in the program; if you correct the
first error, the remaining error messages will often disappear.
Toward the bottom half of Log 3.2, you can see an error message that reads "ERROR:
Procedure MEENS not found." The SAS program statement causing this error is the
statement that immediately precedes the error message in the SAS log. That SAS program
statement, along with the resulting error message, is reproduced here:
19   PROC MEENS DATA=D1;
ERROR: Procedure MEENS not found.
This error message indicates that SAS does not have a procedure named MEENS. A
subsequent statement in the log indicates that SAS stopped processing this step because of
the error.


Obviously, the error is that PROC MEANS was incorrectly spelled as PROC MEENS.
When you see an error like this in a log file, your first impulse might be to correct the error in the log file itself. But this will not work; you must correct the error in the SAS program, not in the log file. Before doing this, however, clear the text of the current log file so that you can resubmit the corrected program with a clean log. If you do not clear the text of this log file, the next time you submit the SAS program, SAS will append a new log file to the bottom of the existing log file, and it will be difficult to review your current log.
To correct your error, first clear your Log window by selecting
Edit → Clear All
Now return to your SAS program by making the Editor the active window:
Window → Demo
The SAS program that you submitted reappears in the Editor:
OPTIONS LS=80 PS=60;
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
6 7
6 7
2 3
6 7
6 7
2 3
2 2
3 3
3 4
4 3
4 4
5 3
5 4
;
PROC MEENS DATA=D1;
TITLE1 'JANE DOE';
RUN;
Now correct the error in the program:
If necessary, scroll down to the bottom of the program. Move your cursor down to
the line that contains PROC MEENS and change this statement so that it reads
PROC MEANS.
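After the correction, the final statements of the program should read as follows:
PROC MEANS DATA=D1;
TITLE1 'JANE DOE';
RUN;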
Now submit the program again:
Run → Submit


If the error has been corrected (and if the program contains no other errors), the Output
window appears after processing is completed. The results of PROC MEANS appear in this
Output window, and these results should look similar to Output 3.2.
JANE DOE

The MEANS Procedure

Variable     N          Mean       Std Dev       Minimum       Maximum
-----------------------------------------------------------------------
TEST1       13     4.1538462     1.6251233     2.0000000     6.0000000
TEST2       13     4.3846154     1.8946619     2.0000000     7.0000000
-----------------------------------------------------------------------

Output 3.2. SAS output produced by PROC MEANS after correcting the error.

If SAS does not take you to the Output window, it means that your program still contains an
error. If this is the case, repeat the process described earlier:
1. Go to the Log window and identify the error.
2. Clear the Log window of text.
3. Go to the Editor window, which contains the SAS program.
4. Correct the error.
5. Resubmit the program.
Saving Your SAS Program and Ending This Session
If the SAS program ran correctly, the Output window should be in the foreground. Now you
must make the Editor the active window before you can save your program. Make the
following selections:
Window → Demo
Your SAS program is visible in the Editor window. Now save the program on the disk in
drive A:
File → Save
You can end your session with the SAS System by selecting
File → Exit
The computer asks if you are sure you want to end the SAS session. Click OK, and the SAS
session ends.


To Learn More about Debugging SAS Programs


This section summarizes the steps that you follow when debugging a SAS program with an
error. A more concise summary of these steps appears in "Finding and Correcting an Error
in a SAS Program" later in this chapter.
To learn more about debugging SAS programs, see "Common SAS Programming Errors
That Beginners Make" on this book's companion Web site
(support.sas.com/companionsites). Delwiche and Slaughter (1998) also provide guidance on
how to find and correct errors in SAS programs.


Tutorial Part IV: Practicing What You Have Learned
Overview
In this section, you practice the skills that you have developed in the preceding sections.
You open your existing SAS program, edit it, submit it, print the resulting log and output
files, and perform other activities.
Restarting the SAS System and Modifying the Initial
SAS System Screen
Complete the following steps to restart the SAS System:
Click the Start button that appears at the bottom of your initial screen. This displays
a list of options, including the word Programs.
Select Programs. This reveals a list of programs on the computer. One of them is
The SAS System for Windows V8.
Select The SAS System for Windows V8 and release the button.
This produces the initial SAS System screen. Next, close the Explorer window and the
Results window and maximize the Editor window:
Click the close window button for the Explorer window (the button in the upper-right corner of the Explorer window; see Figure 3.13).
Click the close window button for the Results window (the button in the upper-right corner of the Results window).
Click the maximize window button for the Editor window (the middle button, which contains a square; see Figure 3.13).
Finally, if there is any possibility that someone has changed the options in the Enhanced
Editor Options dialog box, you should review this dialog box. If this is necessary, make the
following selections:
Tools → Options → Enhanced Editor
The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled
General and one tab labeled Appearance. Verify that the General options tab is in the foreground (that is, General options is visible), and click the tab labeled General if it is not visible.
visible.
The Enhanced Editor Options dialog box contains a variety of different options, but we
will focus on three of them. Here are the two options that should be selected at the
beginning of a SAS session:
• Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).

• In the box labeled Indentation, verify that None is selected.

This option should not be selected:

• Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).

Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If
necessary, click inside the appropriate boxes so that the three options described earlier are
set properly (you can disregard the other options). When all are correct,
Click the OK button at the bottom of the dialog box.
This returns you to the Editor window.
Reviewing the Names of Files on a Floppy Disk and Opening an
Existing SAS Program
Verify that your 3.5-inch floppy disk is in drive A and that the Editor window is the active
window. Then make the following selections:
File → Open
The Open dialog box appears on your screen. Toward the top of the Open dialog box is a
box labeled Look in. This box tells the SAS System where it should look to find a file. If
this box does not contain 3 1/2 Floppy (A:), you will have to change it. If this is necessary,
complete the following steps:
On the right side of the Look in box is a down arrow. Click this down arrow to get
other possible locations where the SAS System can look (see Figure 3.15).
Scroll up and down this list of possible locations (if necessary) until you see an
entry that reads 3 1/2 Floppy (A:).
Click the entry that reads 3 1/2 Floppy (A:).


3 1/2 Floppy (A:) appears in the Look in box. The contents of your diskette should now
appear in the larger box below the Look in box. One of these files should be DEMO, the file
that you need to open. Complete the following steps:
Click the file named DEMO. The name DEMO appears in the File name box.
Click the Open button.
The SAS program that you saved in the preceding section appears in the Editor window.
You are now free to modify it and submit it for execution.
Practicing What You Have Learned
Now that the file DEMO.SAS is open in the Editor, you can practice what you have learned
in this tutorial. When you are not sure about how to complete certain tasks, refer to earlier
sections.
Within the Editor window, complete the following steps to modify your file named DEMO:
1) Insert three new lines of data into the middle of your data set (somewhere after the
DATALINES statement). Make up the numbers.
2) Delete two existing lines of data (choose any two lines of data).
3) Copy four lines of data.
4) Move three lines of data.
5) Save your program on the 3.5-inch diskette using the File → Save As command. Give it a
new name: DEMO2.SAS.
6) Submit the program for execution.
7) Review the contents of your log file on the screen.
8) Print your log file.
9) Clear your log file from the screen by using the Edit → Clear All command.
10) Review the contents of your output file on the screen.
11) Print your output.
12) Clear your output file from the screen by using the Edit → Clear All command.
13) Go back to the Editor window.
14) Add one more line of data to your program.
15) Save your program again, this time using the File → Save command (not the Save As
command).


Ending the Tutorial


You can end your SAS session by making the following selections:
File → Exit
The computer asks if you are sure you want to end the SAS session. Click OK, and the SAS
session ends.
This completes the tutorial sections of this chapter.

Summary of Steps for Frequently Performed Activities


Overview
This section summarizes the steps that you follow when performing three common
activities:
• starting the SAS windowing environment
• opening an existing SAS program from a floppy disk
• finding and correcting an error in a SAS program.
Starting the SAS Windowing Environment
Verify that your computer and monitor are turned on and not in sleep mode. Make sure that
your initial Windows screen appears (with the Start button at the bottom).
Click the Start button that appears at the bottom left of your initial screen. This
displays a list of options.
Select Programs. This reveals a list of programs on the computer.
Select The SAS System for Windows V8.
This produces the initial SAS System screen. Next, close the Explorer window and the
Results window, and maximize the Editor window:
Click the close button in the upper-right corner of the Explorer window (see Figure 3.13).
Click the close button in the upper-right corner of the Results window.
Click the maximize button to maximize the Editor window.


Finally, if there is any possibility that someone has changed the options in the Enhanced
Editor Options dialog box, you should review this dialog box. If this is necessary, make the
following selections:
Tools → Options → Enhanced Editor
The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled
General and one tab labeled Appearance. Verify that General options is in the foreground
(that is, General options is visible), and click the tab labeled General if it is not visible.
The Enhanced Editor Options dialog box contains a variety of different options, but we
will focus on three of them. Here are the two options that should be selected at the
beginning of a SAS session:
• Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).

• In the box labeled Indentation, verify that None is selected.

Here is the one option that should not be selected:

• Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).

Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If
necessary, click the appropriate boxes so that the three options described earlier are set
properly (you can disregard the other options). When all are correct, complete the following
step:
Click the OK button at the bottom of the dialog box.
This returns you to the Editor window. You are now ready to begin typing a SAS program
or open an existing SAS program from a diskette.
Opening an Existing SAS Program from a Floppy Disk
Verify that your floppy disk is in drive A and that the Editor window is the active window.
Then make the following selections:
File → Open
The Open dialog box appears on your screen. Toward the top of the Open dialog box is a
box labeled Look in. This box tells SAS where it should look to find a file. If this box does
not contain 3 1/2 Floppy (A:), you will have to change it by completing the following
steps:


On the right side of the Look in box is a down arrow. Click this down arrow to get
other possible locations where SAS can look (see Figure 3.15).
Scroll up and down this list of possible locations (if necessary) until you see an entry
that reads 3 1/2 Floppy (A:).
Click the entry that reads 3 1/2 Floppy (A:).
3 1/2 Floppy (A:) appears in the Look in box. The contents of your disk should now appear
in the larger box below the Look in box. One of these files is the file that you need to open.
Remember that the file names in this box might not contain the .SAS suffix, even if you
included this suffix in the file name when you first saved the file. This does not mean that
there is a problem.
To open your file, complete the following steps:
Click the name of the file that you want to open. This file name appears in the File
name box.
Click the Open button.
The SAS program saved under that file name appears in the Editor window. You are now
free to modify it and submit it for execution.
Finding and Correcting an Error in a SAS Program
After you submit a SAS program, one of two things will happen:
• If SAS does not take you to the Output window (that is, if you remain at the Editor window), it means that your SAS program did not run correctly. You need to go to the Log window and locate the error (or errors) in your program.

• If SAS does take you to the Output window, it still does not mean that the entire SAS program ran correctly. It is always a good idea to review your SAS log for warnings, errors, and other messages prior to reviewing the SAS output.

The point is this: After submitting a SAS program, you should always review the log file
prior to reviewing the SAS output, either to verify that there are no errors or to identify the
nature of those errors. This section summarizes the steps in this process.
After you have submitted the SAS program and processing has stopped, make the following
selections to go to the Log window:
Window → Log
When you go to the SAS Log window, what you will usually see is the last part of the log
file. You must scroll to the top of the log file to see the entire log:


Scroll up to the top of the log file by either clicking and dragging the scroll bar or
by clicking the up arrow in the scroll bar area of the Log window.
Now that you are at the top of the log file, begin reviewing it for warning messages, error
messages, or other signs of problems. Begin at the top of the log file and work your way
down. It is important to begin at the top of the log because a single error message early in a
program can sometimes cause dozens of additional error messages later in the program; if
you correct the first error, the dozens of remaining error messages often disappear.
If there are no warnings, errors, or other signs of problems, go to your output file:
Window → Output
If your SAS log does contain error messages, try to find the cause of these errors. Begin
with the first (earliest) error message in the SAS log. Remember that your SAS log always
contains the statements that make up the SAS program (minus the data lines). Review the
line of your SAS program that immediately precedes the first error message. Are there any
problems with this line (for example, a missing semicolon or a misspelled word)? If that line
appears to be correct, review the line above it. Are there any problems with that line?
Continue working backward, one line at a time, until you find the error.
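As a hypothetical illustration (this snippet is not part of the tutorial program), a missing semicolon at the end of a PROC statement is a common cause of errors on the lines that follow it, because SAS reads the next statement as a continuation of the first:
PROC MEANS DATA=D1    /* a semicolon is missing at the end of this line,      */
TITLE1 'JANE DOE';    /* so SAS reads this line as part of the PROC statement */
RUN;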
After you have found the error, remember that you cannot correct the error in the SAS log.
Instead, you must correct the error in the SAS program. First, clear all text in the existing
SAS Log and SAS Output windows. This ensures that when you resubmit your SAS
program, the new log and output will not be appended to the old. Complete the following
steps to delete the existing log and output files:
Window → Log, then Edit → Clear All
Window → Output, then Edit → Clear All
You will now go to the Editor window (remember that the Window menu might contain the
name that you gave to your SAS program, rather than the word Editor, which appears
here):
Window → Editor
You should now edit your SAS program to correct the error that you identified earlier. After
you have corrected the error, save the modified program:
File → Save
Now submit the revised SAS program:
Run → Submit
At this point, the process repeats itself. If the program runs, you should still go to the log file
to verify that there are no errors or other signs of problems. If the program did not run, you will go to the log file to look for the error (or errors) in your program. Continue this process
until your program runs without errors.

Controlling the Size of the Output Page with the OPTIONS Statement
In completing this tutorial, all of the programs that you submitted contained the following
OPTIONS statement:
OPTIONS LS=80 PS=60;

The OPTIONS statement is a global statement that can be used to change the value of
system options and change how the SAS System operates. For example, you can use the
OPTIONS statement to suppress the printing of page numbers, to suppress the printing of
dates, and to perform other tasks.
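For example, here is a sketch of an OPTIONS statement that suppresses both dates and page numbers (NODATE and NONUMBER are standard SAS system options, although they are not used elsewhere in this tutorial):
OPTIONS NODATE NONUMBER LS=80 PS=60;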
In this tutorial, the OPTIONS statement was used for one purpose: to specify the size of the
printed page. The OPTIONS statement presented earlier requests a small-page format for
output. The LS=80 section of this statement requests that your output have a line size of 80
characters per line (the LS stands for line size). This setting makes your output easy to
view on a narrow computer screen. The PS=60 section of this statement requests that your
output have a page size of 60 lines per page (the PS stands for page size). These
specifications are fairly standard.
Specifying a line size of 80 and a page size of 60 is fine for most programs, but it is not
optimal for SAS programs that provide a great deal of information on each page. For
example, when performing a factor analysis or principal component analysis, it is better to
use a larger format so that each page can contain more information. The printed output from
these more sophisticated analyses is easier to read if the line size is 120 characters per line
rather than 80 (of course, this assumes that you have access to a large-format printer that can
print 120 characters per line).
To print your output in the larger format, change the OPTIONS statement on the first line of
the program. Specifically, set LS=120 (you can leave the PS=60). This revised OPTIONS
statement is illustrated here:
OPTIONS LS=120 PS=60;

In the OPTIONS statements presented in this section, LS is used as an abbreviation for the
keyword LINESIZE, and PS is used as an abbreviation for the keyword PAGESIZE. If
you prefer, you can also write your OPTIONS statement with the full-length keywords, as
shown here:
OPTIONS LINESIZE=80 PAGESIZE=60;


For More Information


The reference section at the end of this book lists a number of books that provide additional
information about using the SAS windowing environment. Delwiche and Slaughter (1998)
provide a concise but comprehensive introduction to using SAS Version 7. Two books by
Jodie Gilmore (Gilmore, 1997; Gilmore, 1999) provide detailed instructions for using SAS
in the Windows environment. These books can be ordered from SAS at (800) 727-3228 or
(919) 677-8000.

Conclusion
For students who are using SAS for the first time, learning to use the SAS windowing
environment is often the most challenging task. The tutorial in this chapter has introduced
you to this application. When you are performing analyses with the SAS System, you should
continue to refer to this chapter to refresh your memory on how to perform specific
activities. For most students, using the SAS windowing environment becomes second nature
within a matter of weeks.
The next chapter in this book, Chapter 4, Data Input, describes how to get your data into a
format that can be analyzed by SAS. Before you can perform statistical analyses on your
data, you must first provide SAS with information about how many variables it has to read,
what names should be given to those variables, whether the variables are numeric or
character, along with other information. Chapter 4 shows you the basics for creating the
types of data sets that are most frequently encountered when conducting research.
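As a preview, note that the tutorial program you have been editing already contains a minimal example of this kind of data set definition:
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
6 7
2 3
;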

Data Input
Introduction.........................................................................................113
Overview...............................................................................................................113
The Rows and Columns that Constitute a Data Set..............................................113
Overview of Three Options for Writing the INPUT Statement ...............................115
Example 4.1: Creating a Simple SAS Data Set ..................................117
Overview...............................................................................................................117
The OPTIONS Statement .....................................................................................117
The DATA Statement............................................................................................118
The INPUT Statement...........................................................................................119
The DATALINES Statement .................................................................................120
The Data Lines......................................................................................................120
The Null Statement ...............................................................................................121
The PROC Statement ...........................................................................................121
Example 4.2: A More Complex Data Set............................................122
Overview...............................................................................................................122
The Study .............................................................................................................122
Data Set to Be Analyzed.......................................................................................123
The SAS DATA Step.............................................................................................125
Some Rules for List Input......................................................................................126
Using PROC MEANS and PROC FREQ to Identify Obvious Problems with the Data Set ........131
Overview...............................................................................................................131
Adding PROC MEANS and PROC FREQ to the SAS Program............................131


The SAS Log.........................................................................................................134


Interpreting the Results Produced by PROC MEANS...........................................135
Interpreting the Results Produced by PROC FREQ..............................................137
Summary ..............................................................................................................138
Using PROC PRINT to Create a Printout of Raw Data .......................139
Overview...............................................................................................................139
Using PROC PRINT to Print Raw Data for All of the Variables in the Data Set ..............139
Using PROC PRINT to Print Raw Data for a Subset of Variables in the Data Set .............141
A Common Misunderstanding Regarding PROC PRINT ......................................142
The Complete SAS Program................................................................142
Conclusion...........................................................................................144


Introduction
Overview
Raw data must be converted into a SAS data set before you can analyze it with SAS
statistical procedures. In this chapter you learn how to create simple SAS data sets.
Most of the chapter uses the format-free approach to data input because this is the simplest
approach, and will be adequate for the types of data sets that you will encounter in this
guide. You will learn how to write an INPUT statement that reads both numeric and character
variables. You will also learn how to include missing data in the data set that you want to
analyze.
After you have typed your data, you should always analyze it with a few simple procedures
to verify that SAS read your data set as you intended. This chapter shows you how to use
PROC MEANS, PROC FREQ, and PROC PRINT to verify that your data set has been
created correctly.
The Rows and Columns that Constitute a Data Set
Suppose that you administer a short questionnaire to nine subjects. The questionnaire asks
the subjects to indicate their height (in inches), weight (in pounds), and age (in years). The
results that you obtain for the nine subjects are summarized in Table 4.1.
Table 4.1
Data from the Height and Weight Study
________________________________________
Subject           Height   Weight   Age
________________________________________
1. Marsha           64       140     20
2. Charles          68       170     28
3. Jack             74       210     20
4. Cathy            60       110     32
5. Emmett           64       130     22
6. Marie            68       170     23
7. Cindy            65       140     22
8. Susan            65       140     22
9. Fred             68       160     22
________________________________________

Table 4.1 is a data set: a collection of variables and observations that could be analyzed
using a statistical package such as SAS. The table is organized pretty much the way a SAS
data set is organized: each row in the data set (running horizontally from left to right)
represents a different observation. Because you are doing research in which the individual
person is the unit of analysis, each observation in your data set is a different person. You can
see that the first row of the data set presents data from a subject named "Marsha," the next
row presents data from a subject named "Charles," and so on.
In contrast, each column in the data set (running vertically from top to bottom) represents a
different variable. The first column ("Subject") provides each subject's name and number;
the second column ("Height") provides each subject's height in inches; the third column
("Weight") provides each subject's weight in pounds, and so on.
By reading across a given row, you can see where each subject scored on each variable. For
example, by reading across the row for subject #1 (Marsha), you can see that she stands 64
inches in height, weighs 140 pounds, and is 20 years old.
After you have entered the data in Table 4.1 into a SAS program, you could analyze it with
any number of statistical procedures. For example, you could find the mean score for the
three quantitative variables (height, weight, and age). Below is an example of a SAS
program that will do this (remember that you would not type the line numbers appearing on
the left; these numbers are used here simply to identify the lines in the program):
 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT   SUB_NUM
 4              HEIGHT
 5              WEIGHT
 6              AGE ;
 7   DATALINES;
 8   1 64 140 20
 9   2 68 170 28
10   3 74 210 20
11   4 60 110 32
12   5 64 130 22
13   6 68 170 23
14   7 65 140 22
15   8 65 140 22
16   9 68 160 22
17   ;
18   PROC MEANS DATA=D1;
19      VAR HEIGHT WEIGHT AGE;
20      TITLE1 'JANE DOE';
21   RUN;
Later sections of this chapter will discuss various parts of the preceding SAS program: the
DATA statement on line 2, the INPUT statement on lines 3–6, and so on. For now, just
focus on the data set itself, which appears on lines 8–16. Notice that this is identical to the
data set of Table 4.1 except that the subjects' first names have been removed, and the
columns have been moved closer to one another so that there is less space between the
variables. You can see that line 8 still presents data for subject #1 (Marsha), line 9 still
presents data for subject #2, and so on.


The point is this: in this guide, all data sets will be arranged in the same fashion. The rows
(running horizontally from left to right) will represent different observations (typically
different people) and the columns (running vertically from top to bottom) will represent
different variables.
Overview of Three Options for Writing the INPUT Statement
The first step in analyzing variables with SAS involves reading them as part of a DATA
step, and the heart of the DATA step is the INPUT statement. The INPUT statement is the
statement in which you assign names to the variables that you will analyze. There are many
ways to write an INPUT statement, and some ways are much more complex than others. A
later section of this chapter will provide detailed guidelines for using one specific approach.
However, the following is a quick overview of the three most commonly used options.
List input. List input (also called free-formatted input) is probably the simplest way to
write an INPUT statement. This is the approach that will be taught in this guide. With list
input, you simply give a name to each variable, and tell SAS the order in which they appear
on a data line (i.e., which variable comes first, which comes second, and so on). This is a
good approach to use when you are first learning SAS, and when you have data sets with a
small number of variables.
List input is also called free-formatted input because you do not have to put a variable into
any particular column on the data line. You simply have to be sure that you leave at least
one blank space between each variable, so that SAS can tell one variable from another.
Additional guidelines on the use of free-formatted input will be presented in the section
"Example 4.2: A More Complex Data Set."
Here is an example of how you could write an INPUT statement that will read the preceding
data set using the free-formatted approach:
INPUT   SUB_NUM
        HEIGHT
        WEIGHT
        AGE;

The preceding INPUT statement tells SAS that it will read four variables for each subject.
On each data line, it will first read the subject's score on SUB_NUM (the participant's
subject number). On the same data line, it will then read the subject's score on HEIGHT, the
subject's score on WEIGHT, and finally the subject's score on AGE.
Column input. With column input, you assign a name to each variable, and tell SAS the
exact columns in which the variable will appear. For example, you might indicate that the
variable SUB_NUM will appear in column 1, the variable HEIGHT will appear in columns
3 through 4, the variable WEIGHT will appear in columns 6 through 8, and the variable
AGE will appear in columns 10 through 11.


Column input is a useful approach when you are working with larger data sets that contain a
large number of variables. Although column input will not be covered in detail here, you
can learn more about it in Schlotzhauer and Littell (1997, pp. 41–44).
Here is an example of column input:
INPUT   SUB_NUM   1
        HEIGHT    3-4
        WEIGHT    6-8
        AGE       10-11 ;

Formatted input. Formatted input is a more complex type of column input in which you
again assign names to your variables and indicate the exact columns in which they will
appear. Formatted input has the advantage of making it easy to input string variables:
variables whose names begin with the same root and end with a series of numbers. For
example, imagine that you administered a 50-item questionnaire to a large sample of people,
and wanted to use the SAS variable name V1 to represent responses to the first question, the
variable name V2 for responses to the second question, and so on. It would be very time-
consuming if you listed each of these variable names individually in the INPUT statement.
However, if you used formatted input, you could create all 50 variables very easily with the
following statement:
INPUT   @1   (V1-V50)   (1.);

To learn about formatted input, see Cody and Smith (1997), and Hatcher and Stepanski
(1994).
Here is an example of how the current data set could be input using the formatted input
approach:
INPUT   @1    (SUB_NUM)   (1.)
        @3    (HEIGHT)    (2.)
        @6    (WEIGHT)    (3.)
        @10   (AGE)       (2.) ;


Example 4.1: Creating a Simple SAS Data Set


Overview
This section shows you how to create a simple data set that contains just three quantitative
variables (i.e., the data set from Table 4.1). You will learn to use the various components
that constitute the DATA step:
•  OPTIONS statement
•  DATA statement
•  INPUT statement
•  DATALINES statement
•  data lines
•  null statement.

For reference, here is the DATA step that was presented earlier in this chapter:
OPTIONS LS=80 PS=60;
DATA D1;
   INPUT   SUB_NUM
           HEIGHT
           WEIGHT
           AGE ;
DATALINES;
1 64 140 20
2 68 170 28
3 74 210 20
4 60 110 32
5 64 130 22
6 68 170 23
7 65 140 22
8 65 140 22
9 68 160 22
;

The OPTIONS Statement


The OPTIONS statement is not really a formal part of the DATA step; it is actually a global
command that can be used to set a variety of system options. In this section you will learn
how to use the OPTIONS statement to control the size of your output page when it is
printed. The syntax for the OPTIONS statement is as follows:
OPTIONS   LS=n1   PS=n2 ;


LS in the preceding OPTIONS statement is an abbreviation for LINESIZE. This option
enables you to control the maximum number of characters that will appear on each line of
output. In this OPTIONS statement, n1 = the maximum number of characters that you want
to appear on each printed line.

PS in the OPTIONS statement is an abbreviation for PAGESIZE. This option enables you
to control the maximum number of lines that will appear on each page of output. In this
OPTIONS statement, n2 = the maximum number of lines that you want to appear on
each page.
For example, suppose that you want your SAS output to have a maximum of 80 characters
(letters and numbers) per line, and you want to have a maximum of 60 lines per page. The
following OPTIONS statement would request this:
OPTIONS   LS=80   PS=60 ;

The preceding is a good format to request when your output will be printed on standard
letter-size paper. However, if your output will be printed on a large-format printer with a
long carriage, you may want to have 120 characters per line. The following statement would
request this:
OPTIONS   LS=120   PS=60 ;

The DATA Statement


You use the DATA statement to begin the DATA step and assign a name to the data set that
you are creating. The syntax for the DATA statement is as follows:
DATA   data-set-name ;

For example, if you want to assign the name D1 to the data set that you are creating, you
would use the following statement:
DATA   D1;

You can assign just about any name you like to a data set, as long as the name conforms to
the following rules for a SAS data set name:

•  The name must begin with a letter or an underscore (_).
•  The remainder of the name can include either letters or numbers.
•  If you are using SAS System Version 6 (or earlier), the name can be a maximum of
   eight characters; if you are using Version 7 (or later), the name can be a maximum of 32
   characters.
•  The name cannot include any embedded blanks. For example, "POL PRTY" is not an
   acceptable name, as it includes a blank space. However, "POL_PRTY" is acceptable,
   because an underscore (_) connects the first part of the name ("POL") to the second
   part of the name ("PRTY").
•  The name cannot contain any special characters (e.g., *, #) or hyphens (-).
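As a quick sketch of how these rules apply (the data set names below are hypothetical and are not used elsewhere in this guide):

   DATA SURVEY_1;    * acceptable -- begins with a letter;
   DATA _TEMP;       * acceptable -- begins with an underscore;
   * DATA POL PRTY -- not acceptable (contains an embedded blank);
   * DATA AGE-YRS  -- not acceptable (contains a hyphen);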

This guide typically uses the name D1 for a SAS data set simply because it is short and easy
to remember.
The INPUT Statement
You use the INPUT statement to assign names to the variables that you will analyze, and to
indicate the order in which the variables will appear on the data lines. Using free-formatted
input, the syntax for the INPUT statement is as follows:
INPUT   first-variable
        second-variable
        third-variable
           .
           .
           .
        last-variable ;

In the INPUT statement, the first variable that you name (first-variable above) should be the
first variable that SAS will encounter when reading a data line from left to right. The second
variable you name (second-variable above) should be the second variable that SAS will
encounter when reading a data line, and so on.
The INPUT statement from the preceding height and weight study is reproduced here:
INPUT   SUB_NUM
        HEIGHT
        WEIGHT
        AGE ;

You can assign almost any name to a SAS variable, provided that you adhere to the rules for
creating a SAS variable name. The rules for creating a SAS variable name are identical to
the rules for creating a SAS data set name, and these rules were discussed in the section
"The DATA Statement." That is, a SAS variable name must begin with a letter, it must not
contain any embedded blanks, and so on.


The DATALINES Statement


The DATALINES statement tells SAS that the data set will begin on the next line. Here is
the DATALINES statement from the preceding program, along with the first two lines of
data:
DATALINES;
1 64 140 20
2 68 170 28
You use the DATALINES statement when you want to include the data as a part of the SAS
program. However, this is not the only way to input data with SAS. For example, it is also
possible to keep your data in a separate file, and refer to that file within your SAS program
by using the INFILE statement. This approach has the advantage of allowing your SAS
program to remain relatively short.
The data sets used in this guide are fairly short, however. Therefore, to keep things simple,
this guide includes the data set as part of the example SAS program, using the DATALINES
statement. To learn how to use the INFILE statement, see Cody and Smith (1997), and
Hatcher and Stepanski (1994, pp. 56–58).
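For illustration only, here is a minimal sketch of the INFILE approach (the file name used here is hypothetical; see the references just cited for the details):

   DATA D1;
      INFILE 'HTWT.DAT';                 * hypothetical external file holding the data lines;
      INPUT SUB_NUM HEIGHT WEIGHT AGE;   * same list input as before;
   RUN;

Because the data lines live in the external file, the program itself contains no DATALINES statement and no data lines.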
The Data Lines
The data lines should be placed between the DATALINES statement (described above) and
the null statement (to be described in the following section). Below are the data lines from
the height and weight study, preceded by the DATALINES statement, and followed by the
null statement (the semicolon at the end):
DATALINES;
1 64 140 20
2 68 170 28
3 74 210 20
4 60 110 32
5 64 130 22
6 68 170 23
7 65 140 22
8 65 140 22
9 68 160 22
;
The data sets in this guide are short and simple, with only one line of data for each subject.
This should be adequate when you have collected data on only a few variables. When you
collect data on a large number of variables, however, it will be necessary to use more than
one line of data for each subject. This will require a more sophisticated approach to data
input than the format-free approach used here. To learn about these more advanced
approaches, see Hatcher and Stepanski (1994, pp. 31–51).


The Null Statement


The null statement is the shortest statement in SAS programming: It consists simply of a
line with a semicolon, as shown here:
;
The null statement appears on the line following the end of the data set. It tells SAS that the
data lines have ended. Here are the last two lines of the preceding data set, followed by the
null statement:
8 65 140 22
9 68 160 22
;
Make sure that you place this semicolon by itself on the first line following the end of the
data lines. A mistake that is often made by new SAS users is to instead place it at the end of
the last line of data, as shown here:
8 65 140 22
9 68 160 22;
Do not do this; placing the semicolon at the end of the last line of data will usually result in
an error. Make sure that you always place it alone on the first line following the last line of
data.
The PROC Statement
The null statement, described in the previous section, tells SAS that the DATA step has
ended. When the DATA step is complete, you can then request statistical procedures that
will analyze your data set. You request these statistical procedures using PROC statements.
For example, below is a reproduction of (a) the last two data lines for the height and weight
study, (b) the null statement, and (c) the PROC MEANS statement that tells SAS to compute
the means and other descriptive statistics for three variables in the data set:
8 65 140 22
9 68 160 22
;
PROC MEANS DATA=D1;
   VAR HEIGHT WEIGHT AGE;
   TITLE1 'JANE DOE';
RUN;

This guide shows you how to use a variety of PROC statements to request descriptive
statistics, correlations, t tests, analysis of variance, and other statistical procedures.


Example 4.2: A More Complex Data Set


Overview
The preceding section was designed to provide the "big picture" regarding the SAS DATA
step. Now that you understand the fundamentals, you are ready to learn some of the details.
This section describes a fictitious study from the discipline of political science, and shows
you how to input the data that might be obtained from such a study. It shows you how to
write a simple SAS program that will handle numeric variables, character variables, and
missing data.
The Study
Suppose that you are interested in identifying variables that predict the size of financial
donations that people make to political parties. You develop the following questionnaire:
1. Would you describe yourself as being generally conservative, generally
   liberal, or somewhere between these two extremes? (Please circle the
   number that represents your orientation)

   Generally                                       Generally
   Conservative    1   2   3   4   5   6   7       Liberal

2. Would you like to see the size of the federal government increased or
   decreased?

   Greatly                                         Greatly
   Decreased       1   2   3   4   5   6   7       Increased

3. Would you like to see the federal government assume an increased role
   or a decreased role in providing health care to our citizens?

   Decreased                                       Increased
   Role            1   2   3   4   5   6   7       Role

4. What is your political party? (Check one)

   ____ Democrat
   ____ Republican
   ____ Other

5. What is your sex?

   ____ Female     ____ Male

6. What is your age?

   ___________ years old

7. During the past year, how much money have you donated to your political
   party?

   $ ________________


Data Set to Be Analyzed


The table of data. You administer this questionnaire to 11 people. Their responses are
reproduced in Table 4.2.
Table 4.2
Data from the Political Donation Study
_________________________________________________________________
                 Responses
               to questions:
              _______________     Political
Subject        Q1    Q2    Q3     party       Sex   Age   Donation
_________________________________________________________________
01. Marsha      7     6     5       D          F     32      1000
02. Charles     2     2     3       R          M      .         0
03. Jack        3     4     3       .          M     45       100
04. Cindy       6     6     5       .          F     20         .
05. Cathy       5     4     5       D          F     31         0
06. Emmett      2     3     1       R          M     54         0
07. Edward      2     1     3       .          M     21       250
08. Eric        3     3     3       R          M     43         .
09. Susan       5     4     5       D          F     32       100
10. Freda       3     2     .       R          F     18         0
11. Richard     3     6     4       R          M     21        50
_________________________________________________________________

As was the case with the height and weight data set presented earlier, the horizontal rows of
Table 4.2 represent individual subjects, and the vertical columns represent different
variables. The first column is headed "Subject," and below this heading you will find a
subject number (e.g., "01") and a first name (e.g., "Marsha") for each subject. Notice that
the subject numbers are now two-digit numbers ranging from 01 to 11. Numbers that
normally would be single-digit numbers (such as "1") have been converted to two-digit
numbers such as "01." This will make it easier to keep columns of numbers lined up
properly when you are typing these subject numbers as part of a SAS data set.
The first row presents questionnaire responses from subject #1, Marsha. Reading from left
to right, you can see that Marsha
•  circled a "7" in response to Question 1
•  circled a "6" in response to Question 2
•  circled a "5" in response to Question 3
•  indicated that she is a democrat (this is reflected by the "D" in the column "Political
   party")
•  indicated that she is a female (reflected by the "F" in the column headed "Sex")
•  is 32 years old
•  donated $1000 to her party (reflected by the "1000" in the "Donation" column).


The remaining rows of the table can be interpreted in the same way.

Using periods to represent missing data. The next-to-last column in Table 4.2 is "Age,"
and this column indicates each subject's age in years. For example, you can see that subject
#1 (Marsha) is 32 years old.

Where the row for subject #2 (Charles) intersects with the column headed "Age," you can
see a period (.). In this book, periods will be used to represent missing data. In this case,
the period for subject #2 means that you do not have data on the Age variable for this
subject. In conducting questionnaire research, you will often obtain missing data when
subjects fail to complete certain questionnaire items.

When you review Table 4.2, you can see that there are missing data on other variables in
addition to Age. For example,

•  Subject #3 (Jack) and subject #4 (Cindy) each have missing data for the Political party
   variable.
•  Subject #4 (Cindy) also has missing data on Donation.
•  Subject #10 (Freda) has missing data on Q3.

There are a few other periods in Table 4.2, and each of these periods similarly represents
missing data.

Later, when you write your SAS program, you will again use periods to represent missing
data within the DATA step. When SAS reads a data line and encounters a period as a value,
it interprets it as missing data.


The SAS DATA Step


Following is the DATA step of a SAS program that contains information from Table 4.2.
You can see that all of the variables in Table 4.2 are included in this SAS data set, except for
the subjects' first names (such as "Marsha"). However, subject numbers such as "01" and
"02" have been included as the first variable in the data set.
 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT   SUB_NUM
 4              Q1
 5              Q2
 6              Q3
 7              POL_PRTY $
 8              SEX $
 9              AGE
10              DONATION ;
11   DATALINES;
12   01 7 6 5 D F 32 1000
13   02 2 2 3 R M  .    0
14   03 3 4 3 . M 45  100
15   04 6 6 5 . F 20    .
16   05 5 4 5 D F 31    0
17   06 2 3 1 R M 54    0
18   07 2 1 3 . M 21  250
19   08 3 3 3 R M 43    .
20   09 5 4 5 D F 32  100
21   10 3 2 . R F 18    0
22   11 3 6 4 R M 21   50
23   ;

The INPUT statement appears on lines 3–10 of the preceding program. It assigns the
following SAS variable names:

•  The SAS variable name SUB_NUM will be used to represent each participant's subject
   number.
•  The SAS variable name Q1 will be used to represent subjects' responses to Question 1.
•  The SAS variable name Q2 will be used to represent subjects' responses to Question 2.
•  The SAS variable name Q3 will be used to represent subjects' responses to Question 3.
•  The SAS variable name POL_PRTY will be used to represent subjects' political party.
•  The SAS variable name SEX will be used to represent subjects' sex.
•  The SAS variable name AGE will be used to represent subjects' age.
•  The SAS variable name DONATION will be used to represent the size of subjects'
   donations to political parties.


The data set appears on lines 12–22 of the preceding program. You can see that it is
identical to the data presented in Table 4.2, except that subject names have been omitted,
and the columns of data have been moved together so that only one blank space separates
each variable.
Some Rules for List Input
The list approach to data input is probably the easiest way to write an INPUT statement.
However, there are a number of rules that you must observe to ensure that your data are read
correctly by SAS. The most important of these rules are presented here.
The variables must appear on the data lines in the same sequence that they are listed in
the INPUT statement. In the INPUT statement of the preceding program, the SAS
variables were listed in this order: SUB_NUM Q1 Q2 Q3 POL_PRTY SEX AGE
DONATION. This means that the variables must appear in exactly the same order on the
data lines following the DATALINES statement.
Each variable on the data lines must be separated by at least one blank space. Below
are the first three data lines from the preceding program:
DATALINES;
01 7 6 5 D F 32 1000
02 2 2 3 R M  .    0
03 3 4 3 . M 45  100
The first data line was for subject #1 (Marsha), and so the first value on her line is "01" (her
subject number).

Marsha circled a "7" for Question 1, a "6" for Question 2, and so forth. It was necessary to
leave one blank space between the "7" and the "6" so that SAS would read them as two
separate values, rather than a single value of "76."
You will notice that the variables in the data lines of the preceding program were lined up so
that each variable formed a neat, orderly column. Technically, this is not necessary with list
input, but it is recommended as it increases the likelihood that you will have at least one
blank space between each variable and you will not make any other errors in typing your
data.
When you have a large number of variables, it becomes awkward to leave one blank space
between each variable. In these cases, it is better to enter each variable without the blank
spaces and to use either the column input approach or the formatted input approach instead
of the list approach to entering data. See Cody and Smith (1997) or Hatcher and Stepanski
(1994) for details.


Each missing value must be represented by a single period. Data from the first three
subjects in Table 4.2 are again reproduced here:
_________________________________________________________________
                 Responses
               to questions:
              _______________     Political
Subject        Q1    Q2    Q3     party       Sex   Age   Donation
_________________________________________________________________
01. Marsha      7     6     5       D          F     32      1000
02. Charles     2     2     3       R          M      .         0
03. Jack        3     4     3       .          M     45       100
_________________________________________________________________

You can see that there are some missing data in this table. For example, consider the second
line of the table, which presents questionnaire responses for subject #2, Charles: in the
column for "Age," there is a single period (.) where you would expect to find Charles' age.
Similarly, the third line of the table presents responses for subject #3, Jack. You can see that
there is a period for Jack in the "Political party" column.

If you are using the list approach to input, it is very important that you use a single period (.)
to represent each instance of missing data when typing your data lines. As was mentioned
earlier, SAS recognizes a single period as the symbol for missing data.
For example, here again are the first three lines of data from the preceding SAS program:
DATALINES;
01 7 6 5 D F 32 1000
02 2 2 3 R M  .    0
03 3 4 3 . M 45  100
The seventh variable (from the left) in this data set is the AGE variable. You can see that, on
the second line of data, a period appears in the column for the AGE variable. The second line
of data is for subject #2, Charles, and this period tells SAS that you have missing data on the
AGE variable for Charles. Other periods in the data set may be interpreted in the same way.
It is important that you use only one period for each instance of missing data; do not, for
example, use two periods simply because the relevant variable occupies two columns. As an
illustration, the following lines show an incorrect way to indicate missing data for subject
#2 on the age variable (the next-to-last variable):

DATALINES;
01 7 6 5 D F 32 1000
02 2 2 3 R M ..    0
03 3 4 3 . M 45  100

In the above incorrect example, the programmer keyed two periods in the place where the
second subject's age would normally be typed. But this will cause problems: because there
are two periods, SAS will assume that there is missing data on two variables: AGE, as well


as DONATION (the variable next to AGE). The point is simple: use a single period to
represent a single instance of missing data, regardless of how many columns the variable
occupies.
In the INPUT statement, use the $ symbol to identify character variables. All of the
variables discussed in this guide are either numeric variables or character variables.
Numeric variables consist exclusively of numbers; they do not contain any letters of the
alphabet or any special characters (symbols such as *, %, #). In the preceding data set, AGE
was an example of a numeric variable because it could assume only numeric values such as
32, 45, and 20.

In contrast, character variables may consist of letters of the alphabet, special characters, or
numbers. In the preceding data set, POL_PRTY was an example of a character variable
because it could assume the values "D" (for democrats) or "R" (for republicans). SEX was
also a character variable because it could assume the values "F" and "M."

By default, SAS assumes that all of your variables will be numeric variables. If a particular
variable is a character variable, you must indicate this in your INPUT statement. You do this
by placing the dollar symbol ($) after the name of the variable in the INPUT statement.
Leave at least one blank space between the name of the variable and the $ symbol.
For example, the INPUT statement from the preceding program is again reproduced here:
INPUT   SUB_NUM
        Q1
        Q2
        Q3
        POL_PRTY $
        SEX $
        AGE
        DONATION ;

In this program, the SAS variables Q1, Q2, and Q3 are numeric variables, so the $ symbol is
not placed next to them. However, POL_PRTY is a character variable, and so the $ appears
next to it. The same is true for the SEX variable.
If you are using the column input approach, you should type the $ symbol before indicating
the columns in which the variable will appear. For example, here is the way the preceding
INPUT statement would be typed if you were using the column input approach:
INPUT   SUB_NUM        1
        Q1             4
        Q2             6
        Q3             8
        POL_PRTY   $   10
        SEX        $   12
        AGE            14-15
        DONATION       17-20 ;


The preceding statement tells SAS that SUB_NUM appears in column 1, Q1 appears in
column 4, Q2 appears in column 6, Q3 appears in column 8, POL_PRTY appears in column
10, and so on. The $ symbols next to POL_PRTY and SEX inform SAS that these variables
are character variables.
Limit the values of character variables to eight characters. When using the format-free
approach to inputting data, a value of a character variable can be no more than eight
characters in length. Remember that the values are the actual entries that appear in the data
lines. With a numeric variable, a value is usually the score that the subject displays on the
variable. For example, the numeric variable AGE could assume values such as "32," "45,"
"20," and so on. With a character variable, a value is usually a name or an abbreviation
consisting of letters or symbols. For example, the character variable POL_PRTY could
assume the values "D" or "R." The character variable SEX could assume the values "F" or
"M."

Suppose that you wanted to create a new character variable called NAME to include your
subjects' first names. The values of this variable would be the subjects' first names (such as
"Marsha"), and you would have to ensure that no name was over eight letters in length.

Now, suppose that you drop the SUB_NUM variable, which assigns numeric subject
numbers to each subject (such as "01," "02," and so on). Then you decide to replace
SUB_NUM with your new character variable called NAME, which will consist of your
subjects' first names. This NAME variable would be the first variable on each data line.
Here is the INPUT statement for this revised program, along with the first few data lines:
INPUT   NAME $
        Q1
        Q2
        Q3
        POL_PRTY $
        SEX $
        AGE
        DONATION ;
DATALINES;
Marsha   7 6 5 D F 32 1000
Charles  2 2 3 R M  .    0
Jack     3 4 3 . M 45  100
Notice that each value of NAME in the preceding program (such as "Marsha") is eight
characters in length or shorter. This is acceptable.


However, the following data lines would not be acceptable, because the values of NAME
are over eight characters in length:
DATALINES;
Elizabeth    7 6 5 D F 32 1000
Christopher  2 2 3 R M  .    0
Francisco    3 4 3 . M 45  100
Remember also that the value of a character variable must not contain any embedded blanks.
This means, for example, that you cannot have a blank space in the middle of a name, as is
done with the following unacceptable data line:
Betty Lou   7 6 5 D F 32 1000

Avoid using hyphens in variable names. When listing SAS variable names in the INPUT
statement, you should avoid creating any SAS variable names that include a hyphen, such as
AGE-YRS. This is because SAS usually reads a variable name containing a hyphen as a
string variable (string variables were discussed in the section "Overview of Three Options
for Writing the INPUT Statement"). Students learning SAS programming for the first time
will sometimes write a SAS variable name that includes a hyphen, not realizing that this will
cause SAS to search for a string variable. The result is often an error message and
confusion.
Instead of using hyphens, it is good practice to use an underscore (_) in SAS variable
names. If you use an underscore, SAS will assume that the variable is a regular SAS
variable, and not a string variable.
For example, suppose that one of your variables is age in years. You should not use the
following SAS variable name to represent this variable, because SAS will interpret it as a
string variable:
AGE-YRS
Instead, you can use an underscore in the variable name, like this:
AGE_YRS


Using PROC MEANS and PROC FREQ to Identify Obvious Problems with the Data Set
Overview
The DATA step is now complete, and you are finally ready to analyze the data set you have
entered. This section shows you how to use two SAS procedures to analyze the data set:
•  PROC MEANS, which requests that the means and other descriptive statistics be
   computed for the numeric variables
•  PROC FREQ, which creates frequency tables for either numeric or character variables.
This section will show you the basic information that you need to know in order to use these
two procedures. PROC MEANS and PROC FREQ are illustrated here so that you can
perform some simple analyses to help verify that you created your data set correctly. A more
detailed treatment of PROC MEANS and PROC FREQ will appear in the chapters to
follow.
Adding PROC MEANS and PROC FREQ to the SAS Program
The syntax. Here is the syntax for requesting PROC MEANS and PROC FREQ:
PROC MEANS DATA=data-set-name ;
VAR variable-list ;
TITLE1 ' your-name ' ;
PROC FREQ DATA=data-set-name ;
TABLES variable-list ;
RUN;
In this guide, syntax is a template for a section of a SAS program. When you use syntax for
guidance in writing a SAS program, you should adhere to the following guidelines:
•  If certain words are presented in uppercase type (capital letters) in the syntax, you
   should type those same words in your SAS program.
•  If certain words are presented in lowercase type in the syntax, you should not type those
   words in your SAS program. Instead, you should substitute the data set names, variable
   names, or key words that are appropriate for your specific analysis.


For example:

PROC MEANS   DATA=data-set-name ;

In the preceding line, PROC MEANS and DATA= are printed in uppercase type. Therefore,
you should type these words in your program just as they appear in the syntax. However, the
words data-set-name appear in lowercase italics. Therefore, you will not type the words
data-set-name. Instead, you will type the name of the data set that you wish to analyze in
your specific analysis. For example, if you wish to analyze a data set that is named D1, you
would write the PROC MEANS statement this way in your SAS program:

PROC MEANS   DATA=D1;

Most of the chapters in this guide will include syntax for performing different tasks with
SAS. In each instance, you should follow the guidelines presented above for using the
syntax.
Variables that you will analyze. With the preceding syntax, the entry variable-list that
appears with the VAR and TABLES statements refers to the list of variables that you want
to analyze. In analyzing data from the political donation study, suppose that you will use
PROC MEANS to analyze your numeric variables (such as Q1 and AGE), and PROC FREQ
to analyze your character variables (POL_PRTY and SEX).


The SAS program. Following is the entire program for analyzing data from the political
donation study. This time, statements have been appended to the end of the program to
request PROC MEANS and PROC FREQ. Notice how the names of actual variables have
been inserted in the locations where variable-list had appeared in the syntax that was
presented above.
 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT   SUB_NUM
 4              Q1
 5              Q2
 6              Q3
 7              POL_PRTY $
 8              SEX $
 9              AGE
10              DONATION ;
11   DATALINES;
12   01 7 6 5 D F 32 1000
13   02 2 2 3 R M  .    0
14   03 3 4 3 . M 45  100
15   04 6 6 5 . F 20    .
16   05 5 4 5 D F 31    0
17   06 2 3 1 R M 54    0
18   07 2 1 3 . M 21  250
19   08 3 3 3 R M 43    .
20   09 5 4 5 D F 32  100
21   10 3 2 . R F 18    0
22   11 3 6 4 R M 21   50
23   ;
24   PROC MEANS DATA=D1;
25      VAR Q1 Q2 Q3 AGE DONATION;
26      TITLE1 'JOHN DOE';
27   RUN;
28   PROC FREQ DATA=D1;
29      TABLES POL_PRTY SEX;
30   RUN;

Lines 24–27 of this program contain the statements that request the MEANS procedure.
Line 25 contains the VAR statement for PROC MEANS. You use the VAR statement to list
the variables to be analyzed. You can see that this statement requests that PROC MEANS be
performed on Q1, Q2, Q3, AGE, and DONATION. Remember that you may list only
numeric variables in the VAR statement for PROC MEANS; you may not list character
variables (such as POL_PRTY or SEX).
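As an aside, and not part of the example program above, PROC MEANS also accepts statistic keywords that limit the output to just the statistics you want; a sketch:

   PROC MEANS DATA=D1 N MEAN MIN MAX;
      VAR Q1 Q2 Q3 AGE DONATION;
   RUN;

This variation would print only the number of valid observations, the mean, the minimum, and the maximum for each variable listed.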


Lines 28–30 of this program contain the statements that request the FREQ procedure. Line
29 contains the TABLES statement for PROC FREQ. You use this statement to list the
variables for which frequency tables will be produced. You can see that PROC FREQ will
be performed on POL_PRTY and SEX. In the TABLES statement for PROC FREQ, you
may list either character variables or numeric variables.
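Incidentally, the TABLES statement can also request a two-way (crosstabulation) table by joining two variable names with an asterisk; this goes beyond the present example, but the syntax is standard (a sketch):

   PROC FREQ DATA=D1;
      TABLES POL_PRTY*SEX;
   RUN;

This would count the subjects in each combination of political party and sex.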
The SAS Log
After the preceding program has been submitted and executed, you should first review the
SAS log file to verify that it ran without error. The log file for the preceding program is
reproduced as Log 4.1.
NOTE: SAS initialization used:
      real time           18.56 seconds

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3       INPUT   SUB_NUM
4               Q1
5               Q2
6               Q3
7               POL_PRTY $
8               SEX $
9               AGE
10              DONATION ;
11   DATALINES;

NOTE: The data set WORK.D1 has 11 observations and 8 variables.
NOTE: DATA statement used:
      real time           1.43 seconds

23   ;
24   PROC MEANS DATA=D1;
25      VAR Q1 Q2 Q3 AGE DONATION;
26      TITLE1 'JOHN DOE';
27   RUN;

NOTE: There were 11 observations read from the dataset WORK.D1.
NOTE: PROCEDURE MEANS used:
      real time           1.63 seconds

28   PROC FREQ DATA=D1;
29      TABLES POL_PRTY SEX;
30   RUN;

NOTE: There were 11 observations read from the dataset WORK.D1.
NOTE: PROCEDURE FREQ used:
      real time           0.61 seconds

Log 4.1. Log file from the political donation study.

Remember that the SAS log consists of your SAS program (minus the data), along with
notes, warnings, and error messages generated by SAS as it executes your program. Lines
1–11 in Log 4.1 reproduce the DATA step of your SAS program. Immediately after this, the
following note appeared in the log window:

NOTE: The data set WORK.D1 has 11 observations and 8 variables.

This note indicates that your SAS data set (named D1) has 11 observations and 8 variables.
This is a good sign, because you intended to input data from 11 subjects on 8 variables.
The remainder of the SAS log reveals no evidence of any problems in the SAS program, and
so you can proceed to the SAS output file.
Interpreting the Results Produced by PROC MEANS
The SAS Output. The output file for the current analysis consists of two pages. Page 1
contains the results of PROC MEANS, and page 2 contains the results of PROC FREQ. The
results of PROC MEANS are reproduced in Output 4.1.
                                   JOHN DOE
                              The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
Q1          11       3.7272727       1.7372915       2.0000000       7.0000000
Q2          11       3.7272727       1.7372915       1.0000000       6.0000000
Q3          10       3.7000000       1.3374935       1.0000000       5.0000000
AGE         10      31.7000000      12.2750877      18.0000000      54.0000000
DONATION     9     166.6666667     323.0711996               0         1000.00
-------------------------------------------------------------------------------
Output 4.1. Results of PROC MEANS, political donation study.

Once you have created a data set, it is a good idea to perform PROC MEANS on all numeric
variables, and review the results for evidence of possible errors. It is especially important to
review the information in the columns headed "N," "Minimum," and "Maximum."
Reviewing the number of valid observations. The first column in Output 4.1 is headed
"Variable." In this column you will find the names of the variables that were analyzed. You
can see that, as expected, PROC MEANS was performed on Q1, Q2, Q3, AGE, and
DONATION.

The second column in Output 4.1 is headed "N." This column indicates the number of valid
observations that were found for each variable. Where the row for Q1 intersects with the
column headed "N," you will find the number 11. This indicates that PROC MEANS
analyzed 11 valid cases for the variable Q1, as expected. Where the row for Q3 intersects
with the column headed "N," you find the number 10, meaning that there were only 10
usable observations for the Q3 variable. However, this does not necessarily mean that there
was an error: if you review the actual data set (reproduced earlier), you will note that there is
one instance of missing data for Q3 (indicated by the single period in the column for Q3 for
the next-to-last subject). Similarly, although Output 4.1 indicates only 9 valid observations
for DONATION, this is no cause for concern because the data set itself shows that you had
missing data for two people on this variable.
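If you want an explicit count of the missing values, one option is the NMISS statistic keyword (a standard PROC MEANS keyword, though it is not used elsewhere in this chapter); a sketch:

   PROC MEANS DATA=D1 N NMISS;
      VAR Q1 Q2 Q3 AGE DONATION;
   RUN;

This prints, for each variable, the number of valid observations alongside the number of missing observations.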
Reviewing the Minimum and Maximum columns. The fifth column in Output 4.1 is
headed "Minimum." This column indicates the smallest value that was observed for each
variable.

The last column in the output is headed "Maximum," and this column indicates the largest
value that was observed for each variable. The "Minimum" and "Maximum" columns are
useful for determining whether any values are out of bounds. Out-of-bounds values are
scores that are either too large or too small to be possible, given the type of variable that you
are analyzing. If you find any out-of-bounds values, it probably means that you made an
error, either in writing your INPUT statement or in typing your data.

For example, consider the variable Q1 in Output 4.1. Where the row for Q1 intersects with
the column headed "Minimum," you see the value of 2. This means that the smallest value
that the SAS System observed for Q1 was 2. Where the row for Q1 intersects with the
column headed "Maximum," you see the value of 7. This means that the largest value that
SAS read for Q1 was 7.

Remember that the variable Q1 represents responses to a questionnaire item where the
possible responses ranged from a low of 1 (Generally Conservative) to a maximum of 7
(Generally Liberal). If Output 4.1 showed a Minimum score of 0 for Q1, this would be an
invalid, out-of-bounds score (because Q1 is not supposed to go any lower than 1). Such a
result might mean that you made an error in keying your data. Similarly, if the output
showed a Maximum score of 9 for Q1, this would also be an invalid score (because Q1 is
not supposed to go any higher than 7).

A review of minimum and maximum values in Output 4.1 does not reveal any out-of-bounds
scores for any of the variables.
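If an out-of-bounds value did turn up, one quick way to locate the offending observation is the WHERE statement (a sketch; PROC PRINT itself is described later in this chapter):

   PROC PRINT DATA=D1;
      WHERE Q1 < 1 OR Q1 > 7;
   RUN;

The WHERE statement restricts the procedure to the observations that satisfy the condition, so only the suspect data lines are printed.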


Interpreting the Results Produced by PROC FREQ


The SAS Output. Because the results of PROC MEANS did not reveal any obvious
problems with the data set, you can proceed to the results of PROC FREQ. These results are
reproduced in Output 4.2.
                              JOHN DOE
                         The FREQ Procedure

                                         Cumulative    Cumulative
POL_PRTY    Frequency    Percent          Frequency       Percent
------------------------------------------------------------------
D                  3       37.50                  3         37.50
R                  5       62.50                  8        100.00

                     Frequency Missing = 3

                                         Cumulative    Cumulative
SEX         Frequency    Percent          Frequency       Percent
------------------------------------------------------------------
F                  5       45.45                  5         45.45
M                  6       54.55                 11        100.00

Output 4.2. Results of PROC FREQ, political donation study.

Reviewing the frequency tables. Two tables appear in Output 4.2: a frequency table for
the variable POL_PRTY, and a frequency table for the variable SEX.

In the first table, the variable name POL_PRTY appears in the upper left corner, meaning
that this is the frequency table for POL_PRTY (political party). Beneath this variable name
are the two possible values that the variable could assume: "D" (for democrats) and "R"
(for republicans). You should always review this list of values to verify that no invalid
values appear there. For example, if the values for this table had included "T" along with
"D" and "R," it probably would indicate that you made an error in keying your data because
"T" doesn't stand for anything meaningful in this study.

When typing character variables in a data set, case is important, so you must be consistent in
using uppercase and lowercase letters. For example, when keying POL_PRTY, if you
initially use an uppercase "D" to represent democrats, you should never switch to using a
lowercase "d" within that data set. If you do, SAS will treat the uppercase "D" and the
lowercase "d" as two completely different values. When you perform a PROC FREQ on the
POL_PRTY variable, you will obtain one row of frequency information for subjects
identified with the uppercase "D," and a different row of frequency information for subjects
identified with the lowercase "d." In most cases, this will not be desirable.
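If inconsistent case has already crept into a data set, one possible remedy is to standardize the values in a DATA step (a sketch; UPCASE is a standard SAS function, and the SET statement simply reads in an existing data set):

   DATA D1;
      SET D1;
      POL_PRTY = UPCASE(POL_PRTY);   * converts a lowercase d to D, r to R, and so on;
   RUN;

Of course, it is simpler to key the values consistently in the first place.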


The second column in the frequency table in Output 4.2 is headed "Frequency." This
column indicates the number of subjects who were observed in each of the categories of the
variable being analyzed. For example, where the row for the value "D" intersects with the
column headed "Frequency," you can see the number 3. This means that 3 subjects were
coded with a "D" in the data set. In other words, it means that 3 subjects were democrats.
Where the row for the value "R" intersects with the column headed "Frequency," you see
the number 5. This means that 5 subjects were coded with an "R" in the data set (i.e., 5
subjects were republicans).

Below the frequency table for the POL_PRTY variable, you can see the entry "Frequency
Missing = 3." This section of the results produced by PROC FREQ indicates the number of
observations with missing data for the variable being analyzed. This frequency missing
entry for POL_PRTY indicates that there were three subjects with missing data for the
political party variable.

Whenever you create a new data set, you should always perform PROC FREQ on all
character variables in this manner, to verify that the results seem reasonable. A warning
sign, for example, would be a very large value for "Frequency Missing." For POL_PRTY,
all of the results from PROC FREQ seem reasonable, indicating no obvious problems.

The second frequency table in Output 4.2 provides results for the SEX variable. It shows
that 5 subjects were coded with an "F" (5 subjects were female), and 6 subjects were coded
with an "M" (6 subjects were male). There is no "Frequency Missing" entry for the SEX
table, which indicates that there were no missing data for this variable. These results, too,
seem reasonable, and do not indicate any obvious problems with the DATA step so far.
Summary

In summary, whenever you create a new data set, you should perform a few simple
descriptive analyses to verify that there were no obvious errors in writing the INPUT
statement or in typing the data. At a minimum, this should include performing PROC
MEANS on your numeric variables, and performing PROC FREQ on your character
variables.

PROC UNIVARIATE is also useful for performing descriptive analysis on numeric
variables. The results produced by PROC UNIVARIATE are somewhat more complex than
those produced by PROC MEANS; for this reason, it will be covered in Chapter 7,
"Measures of Central Tendency and Variability."
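For reference, a minimal sketch of how PROC UNIVARIATE could be run on this data set (the statement syntax parallels that of PROC MEANS; the output itself is discussed in Chapter 7):

   PROC UNIVARIATE DATA=D1;
      VAR Q1 Q2 Q3 AGE DONATION;
   RUN;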
If the results produced by PROC MEANS and PROC FREQ do not reveal any obvious
problems, it does not necessarily mean that your data set is free of typos or other errors. An
even more thorough approach to checking your data set involves using PROC PRINT to
print out the raw data, so that you can proof every subject's value on every variable. The
following section shows how to do this.


Using PROC PRINT to Create a Printout of Raw Data


Overview

The PRINT procedure (PROC PRINT) is useful for generating a printout of your raw data
(i.e., a printout of your data as they appear in a SAS data set). You can use PROC PRINT to
review each subject's score on each variable in your data set.

Whenever you create a new data set, you should always use PROC PRINT to print out the
raw data before doing any other, more sophisticated analyses. You should check the output
created by PROC PRINT against your original data records to verify that SAS has read your
data in the way that you intended.

The first part of this section shows you how to use PROC PRINT to print raw data for all
variables in a data set. Later, this section shows how you can use the VAR statement to print
raw data for a subset of variables.
Using PROC PRINT to Print Raw Data for All of the Variables in the Data Set

The Syntax. Here is the syntax for the PROC step that will cause the PRINT procedure to
print the raw data for all variables in your data set:

PROC PRINT   DATA=data-set-name ;
   TITLE1    ' your-name ' ;
RUN;

Here are the actual statements that you use with the PRINT procedure to print the raw data
for the political donation study described above (a later section will show where these
statements should go in your program):

PROC PRINT   DATA=D1;
   TITLE1    'JOHN DOE';
RUN;

Output 4.3 shows the results that are generated by the preceding statements.


                                JOHN DOE

Obs   SUB_NUM   Q1   Q2   Q3   POL_PRTY   SEX   AGE   DONATION

  1       1      7    6    5       D       F     32       1000
  2       2      2    2    3       R       M      .          0
  3       3      3    4    3               M     45        100
  4       4      6    6    5               F     20          .
  5       5      5    4    5       D       F     31          0
  6       6      2    3    1       R       M     54          0
  7       7      2    1    3               M     21        250
  8       8      3    3    3       R       M     43          .
  9       9      5    4    5       D       F     32        100
 10      10      3    2    .       R       F     18          0
 11      11      3    6    4       R       M     21         50

Output 4.3. Results of PROC PRINT performed on data from the political
donation study (see Table 4.2).

Output created by PROC PRINT. For the most part, Output 4.3 presents a duplication of
the data that appeared in Table 4.2, which was presented earlier in this chapter. The most
obvious difference is the fact that the subject names that appeared in Table 4.2 do not appear
in Output 4.3.

The first column, Obs (Observation number), lists a unique observation number for each
subject in the study. When the observations in a data set are individual subjects (as is the
case with the current political donation study), the observation numbers are essentially
subject numbers. This means that, in the row for observation #1, you will find data for your
first subject (Marsha from Table 4.2); in the row for observation #2, you will find data for
your second subject (Charles from Table 4.2), and so on.

You probably remember that you did not include this Obs variable in the data set that you
created. Instead, this Obs variable is automatically generated by SAS whenever you create a
SAS data set.

The column headed SUB_NUM shows the subject number variable that was input as part of
your SAS data set.

The column headed Q1 contains subject responses to question #1 from the political donation
questionnaire that was presented earlier. Question #1 asked, "Would you describe yourself as
being generally conservative, generally liberal, or somewhere between these two extremes?"
Subjects could circle any number from 1 to 7 to indicate their response, where 1 = Generally
Conservative and 7 = Generally Liberal. Under the heading of Q1 in Output 4.3, you can see
that subject #1 circled a "7," subject #2 circled a "2," subject #3 circled a "3," and so on.

In the columns headed Q2 and Q3, you will find subject responses to question #2 and
question #3 from the political donation questionnaire. These questions also used a 7-point
response format. The output shows that subject #10 has a period (.) listed under the heading
Q3. This means that this subject has missing data for question #3.


Under POL_PRTY, you will find subject values for the political party variable. You will
remember that this was a character variable in which the value "D" represents democrats
and "R" represents republicans. You can see that subject #3, subject #4, and subject #7 do
not have any values for POL_PRTY. This is because they had missing data on the political
party variable.

The column headed SEX indicates subject sex. This was a character variable in which "F"
represented females and "M" represented males.

The column headed AGE indicates subject age.

The column headed DONATION indicates the amount of the financial donation made to a
political party by each subject.
Using PROC PRINT to Print Raw Data for a Subset of Variables in the Data Set
Statements for the SAS Program. In some cases, you may wish to print raw data for only a
few variables in a data set. When this is the case, you should use the VAR statement in
conjunction with the PROC PRINT statement. In the VAR statement, list only the names of
the variables that you want to print. Below is the syntax:
PROC PRINT DATA=data-set-name ;
VAR variable-list ;
TITLE1 ' your-name ' ;
RUN;
For example, the following will cause PROC PRINT to print raw values only for the SEX
and AGE variables:
PROC PRINT DATA=D1;
VAR SEX AGE;
TITLE1 'JOHN DOE';
RUN;


Output created by PROC PRINT. Output 4.4 shows the results that are generated by the
preceding statements.
JOHN DOE

Obs   SEX   AGE

  1    F     32
  2    M      .
  3    M     45
  4    F     20
  5    F     31
  6    M     54
  7    M     21
  8    M     43
  9    F     32
 10    F     18
 11    M     21

Output 4.4. Results of PROC PRINT in which only the SEX and
AGE variables were listed in the VAR statement.

You can see that Output 4.4 is similar to Output 4.3, with the exception that Output 4.4
includes only three variables: Obs, SEX, and AGE. As was stated earlier, Obs is not entered
by the SAS user as a part of the data set; instead, it is automatically generated by SAS.
A Common Misunderstanding Regarding PROC PRINT
Students learning the SAS System for the first time often misunderstand PROC PRINT:
they sometimes assume that a SAS program must contain PROC PRINT in order to generate
a paper printout of their results. This is not the case. PROC PRINT simply generates a
printout of your raw data (i.e., subjects' individual scores for the variables in your data set).
If you have performed some other SAS procedure such as PROC MEANS or PROC FREQ,
you do not have to include PROC PRINT in your program to create a paper printout of the
results generated by these procedures. Use PROC PRINT only when you want to generate a
listing of each subject's values on the variables in your SAS data set.
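For example, a program whose PROC step contains only the following statements (with no
PROC PRINT anywhere) would still generate printable output. This is a minimal sketch that
uses two of the numeric variables from the political donation data set:

PROC MEANS DATA=D1;
VAR AGE DONATION;
TITLE1 'JOHN DOE';
RUN;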

The Complete SAS Program


To review: when you first create a SAS data set, it is very important to perform a few
simple SAS procedures to verify that SAS read your data set as you intended. In most cases,
this means that you should

perform PROC MEANS on all numeric variables in the data set (if any).

perform PROC FREQ on all character variables in the data set (if any).


perform PROC PRINT to print out the complete raw data set, including numeric and
character variables.

These three procedures have been discussed separately in previous sections. However, it is
often best to request all three procedures in the same SAS program when you have created a
new data set. An example of such a program appears below. The program does the
following:
1. inputs the political donation data set described earlier in this chapter
2. requests that PROC MEANS be performed on one subset of variables
3. requests that PROC FREQ be performed on a different subset of variables
4. includes a PROC PRINT statement that will cause the entire raw data set to be printed
out (notice that the VAR statement has been omitted from the PROC PRINT section of
the program):
OPTIONS LS=80 PS=60;
DATA D1;
   INPUT   SUB_NUM
           Q1
           Q2
           Q3
           POL_PRTY $
           SEX      $
           AGE
           DONATION ;
DATALINES;
01 7 6 5 D F 32 1000
02 2 2 3 R M  .    0
03 3 4 3 . M 45  100
04 6 6 5 . F 20    .
05 5 4 5 D F 31    0
06 2 3 1 R M 54    0
07 2 1 3 . M 21  250
08 3 3 3 R M 43    .
09 5 4 5 D F 32  100
10 3 2 . R F 18    0
11 3 6 4 R M 21   50
;
PROC MEANS DATA=D1;
VAR Q1 Q2 Q3 AGE DONATION;
TITLE1 'JOHN DOE';
RUN;
PROC FREQ DATA=D1;
TABLES POL_PRTY SEX;
RUN;
PROC PRINT DATA=D1;
RUN;


Conclusion
This chapter focused on the list input approach to writing the INPUT statement. This is a
relatively simple approach, and it will be adequate for the types of data sets that you will
encounter in this Student Guide. For more complex data sets (e.g., data sets that include
more than one line of data for each observation), you might want to learn more about
formatted input. This approach is described and illustrated in Cody and Smith (1997),
Hatcher (2001), and Hatcher and Stepanski (1994).
After you have prepared the DATA step of your SAS program, it is good practice to analyze
the resulting data set with PROC FREQ (along with other procedures) to verify that there
were no obvious errors in the INPUT statement. This chapter provided a quick introduction
to PROC FREQ; next, Chapter 5, "Creating Frequency Tables," discusses the FREQ
procedure in greater detail.

Creating Frequency Tables
Introduction.........................................................................................146
Overview...............................................................................................................146
Why It Is Important to Use PROC FREQ ..............................................................146
Example 5.1: A Political Donation Study ...........................................147
The Study .............................................................................................................147
Data Set to Be Analyzed.......................................................................................148
The DATA Step of the SAS Program ....................................................................150
Using PROC FREQ to Create a Frequency Table................................152
Writing the PROC FREQ Statement .....................................................................152
Output Produced by the SAS Program .................................................................152
Examples of Questions That Can Be Answered by
Interpreting a Frequency Table......................................................155
The Frequency Table............................................................................................155
The Questions.......................................................................................................156
Conclusion...........................................................................................157


Introduction
Overview
In this chapter you learn how to use the FREQ procedure to create simple, one-way
frequency tables. When you use PROC FREQ to analyze a specific variable, the resulting
frequency table displays

values for that variable that were observed in the sample that you analyzed

frequency (number) of observations appearing at each value

percent of observations appearing at each value

cumulative frequency of observations appearing at each value

cumulative percent of observations appearing at each value.

Some of the preceding statistics terms (e.g., cumulative frequency) may be new to you.
Later sections of this chapter will explain these terms, and will show you how to interpret a
frequency table created by the FREQ procedure.
Why It Is Important to Use PROC FREQ
After you have created a SAS data set, it is often a good idea to analyze it with PROC FREQ
before going on to perform more sophisticated statistical analyses (such as analysis of
variance). At a minimum, this will help you find errors in your data or program. In addition,
with some types of investigations it is necessary to create a frequency table in order to
answer research questions. For example, performing PROC FREQ on the correct data set
can help you answer the research question "What percentage of the adult U.S. population
favors the death penalty?"
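As a rough sketch of what such an analysis might look like, suppose you had a data set
named SURVEY containing a character variable named DEATHPEN that records each
respondent's answer (both names are hypothetical, used here only for illustration):

PROC FREQ DATA=SURVEY;
TABLES DEATHPEN;
TITLE1 'JANE DOE';
RUN;

The resulting frequency table would report the percent of respondents at each value of
DEATHPEN.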


Example 5.1: A Political Donation Study


The Study
Suppose that you are a political scientist conducting research on campaign finance. With
your current study, you wish to identify the variables that predict the size of the financial
donations that people make to political causes. You develop the following questionnaire:
What is your political party (please check one):
   ____ Democrat     ____ Republican     ____ Independent

What is your sex?
   ____ Female     ____ Male

What is your age?   ______ years old

During the past year, how much money have you donated to political causes?
   $ ____________

Below are a number of statements with which you may agree or disagree. For
each, please circle the number that indicates the extent to which you either
agree or disagree with the statement. Please use the following format in
making your responses:

   7 = Agree Very Strongly
   6 = Agree Strongly
   5 = Agree
   4 = Neither Agree nor Disagree
   3 = Disagree
   2 = Disagree Strongly
   1 = Disagree Very Strongly

                                                          Circle your
                                                          response

1. I believe that our federal government is generally
   doing a good job.                                      1 2 3 4 5 6 7

2. The federal government should raise taxes.             1 2 3 4 5 6 7

3. The federal government should do a better job of
   maintaining our interstate highway system.             1 2 3 4 5 6 7

4. The federal government should increase social
   security benefits to the elderly.                      1 2 3 4 5 6 7


Data Set to Be Analyzed


Responses to the questionnaire. You administer this questionnaire to 22 individuals
between the ages of 33 and 59. Table 5.1 contains subject responses to the questionnaire.
Table 5.1
Subject Responses to the Political Donation Questionnaire
_______________________________________________________________
                                               Responses to
                                               statements(b)
          Political                          __________________
Subject   party(a)    Sex   Age   Donation   Q1   Q2   Q3   Q4
_______________________________________________________________
  01         D         M     47       400     4    3    6    2
  02         R         M     36       800     4    6    6    6
  03         I         F     52       200     1    3    7    2
  04         R         M     47       300     3    2    5    3
  05         D         F     42       300     4    4    5    6
  06         R         F     44      1200     2    2    5    5
  07         D         M     44       200     6    2    3    6
  08         D         M     50       400     4    3    6    2
  09         R         F     49      2000     3    1    6    2
  10         D         F     33       500     3    4    7    1
  11         R         M     49       700     7    2    6    7
  12         D         F     59       600     4    2    5    6
  13         D         M     38       300     4    1    2    6
  14         I         M     55       100     5    5    6    5
  15         I         F     52         0     5    2    6    5
  16         D         F     48       100     6    3    4    6
  17         R         F     47      1500     2    1    6    2
  18         D         M     49       500     4    1    6    2
  19         D         F     43      1000     5    2    7    3
  20         D         F     44       300     4    3    5    7
  21         I         F     38       100     5    2    4    1
  22         D         F     47       200     3    7    1    4
_______________________________________________________________
(a) For the political party variable, "D" represents democrats,
    "R" represents republicans, and "I" represents independents.
(b) Responses to the four agree-disagree statements at the end
    of the questionnaire.


Understanding Table 5.1. In Table 5.1, the rows (running horizontally) represent different
subjects, and the columns (running vertically) represent different variables. The first column
is headed "Subject." This variable simply assigns a unique subject number to each person
who responded to the questionnaire. These subject numbers run from 01 to 22. The
second column is headed "Political party." With this variable, the value "D" is used to
represent democrats, "R" is used to represent republicans, and "I" is used to represent
independents. The third column is headed "Sex." With this variable, the value "M" is used to
represent male subjects, and the value "F" is used to represent female subjects.
The fourth and fifth columns are headed "Age" and "Donation." These columns provide
each subject's age and the size of the political donation that each subject made, respectively.
For example, you can see that Subject 01 was 47 years old and made a donation of $400,
Subject 02 was 36 years old and made a donation of $800, and so forth.
The last four columns of Table 5.1 appear under the major heading "Responses to
statements." These columns contain subject responses to the four agree-disagree
statements that appear in the questionnaire presented earlier:

Column Q1 indicates the number that each subject circled in response to the statement "I
believe that our federal government is generally doing a good job." You can see that
Subject 01 circled 4 (which stands for "Neither Agree nor Disagree"), Subject 02 also
circled 4, Subject 03 circled 1 (which stands for "Disagree Very Strongly"), and so on.

Column Q2 contains responses to the statement "The federal government should raise
taxes."

Column Q3 contains responses to the statement "The federal government should do a
better job of maintaining our interstate highway system."

Column Q4 contains responses to the statement "The federal government should increase
social security benefits to the elderly."


The DATA Step of the SAS Program


Keying the DATA step. Now you include the data that appear in Table 5.1 as part of the
DATA step of a SAS program. In doing this, you arrange the data in a way that is similar to
the preceding table (i.e., the first column contains a unique subject number for each
participant, the second column indicates the political party to which each subject belongs,
and so on).
Below is the DATA step for the SAS program that contains these data:
 1    OPTIONS LS=80 PS=60;
 2    DATA D1;
 3       INPUT   SUB_NUM
 4               POL_PRTY $
 5               SEX      $
 6               AGE
 7               DONATION
 8               Q1
 9               Q2
10               Q3
11               Q4 ;
12    DATALINES;
13    01 D M 47  400 4 3 6 2
14    02 R M 36  800 4 6 6 6
15    03 I F 52  200 1 3 7 2
16    04 R M 47  300 3 2 5 3
17    05 D F 42  300 4 4 5 6
18    06 R F 44 1200 2 2 5 5
19    07 D M 44  200 6 2 3 6
20    08 D M 50  400 4 3 6 2
21    09 R F 49 2000 3 1 6 2
22    10 D F 33  500 3 4 7 1
23    11 R M 49  700 7 2 6 7
24    12 D F 59  600 4 2 5 6
25    13 D M 38  300 4 1 2 6
26    14 I M 55  100 5 5 6 5
27    15 I F 52    0 5 2 6 5
28    16 D F 48  100 6 3 4 6
29    17 R F 47 1500 2 1 6 2
30    18 D M 49  500 4 1 6 2
31    19 D F 43 1000 5 2 7 3
32    20 D F 44  300 4 3 5 7
33    21 I F 38  100 5 2 4 1
34    22 D F 47  200 3 7 1 4
35    ;


Understanding the DATA step. Remember that if you were typing the preceding program,
you would not actually type the line numbers (in italic) that appear on the left; they are
provided here for reference.
In the preceding program, the INPUT statement appears on lines 3 through 11. This INPUT
statement assigns the following SAS variable names to your variables:

The SAS variable name SUB_NUM is used to represent each subject's unique subject
number (i.e., 01, 02, 03, and so on).

The SAS variable name POL_PRTY represents the political party to which the subject
belongs. In typing your data, you used the value "D" to represent democrats, "R" to
represent republicans, and "I" to represent independents. The dollar sign ($) to the right of
this variable name indicates that it is a character variable.

The SAS variable name SEX represents each subject's sex, with the value "F"
representing females and "M" representing males. Again, the dollar sign ($) to the right of
this variable name indicates that it is also a character variable.

The SAS variable name AGE indicates the subject's age in years.

The SAS variable name DONATION indicates the size of the political donation (in
dollars) that each subject made in the past year.

The SAS variable name Q1 indicates subject responses to the first question using the
agree-disagree format. You keyed a "1" if the subject circled a 1 (for "Disagree Very
Strongly"), you keyed a "2" if the subject circled a 2 (for "Disagree Strongly"), and so
on.

In the same way, the SAS variable names Q2, Q3, and Q4 represent subject responses
to the second, third, and fourth questions using the agree-disagree format.

Notice that at least one blank space was left between each variable in the data set. This is
required when using the list, or free-formatted, approach to data input.
With the data set typed in, you can now append PROC statements below the null statement
(the lone semicolon that appears on line 35). You can use these PROC statements to create
frequency tables that will help you understand more clearly the variables in the data set. The
following section shows you how to do this.
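For example, here is a minimal sketch of how a PROC step might be appended after the
null statement (this example requests a frequency table for the POL_PRTY variable, and
the next section discusses the FREQ procedure in detail):

[Lines 1 through 34 of the DATA step shown above would appear here]
;
PROC FREQ DATA=D1;
TABLES POL_PRTY;
TITLE1 'JANE DOE';
RUN;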


Using PROC FREQ to Create a Frequency Table


Writing the PROC FREQ Statement
The syntax. Following is the syntax for the PROC FREQ statement (and related statements)
that will create a simple frequency table:
PROC FREQ DATA=data-set-name ;
TABLES variable-list ;
TITLE1 ' your-name ' ;
RUN;

The second line of the preceding code is the TABLES statement. In this statement you list
the names of the variables for which frequency tables should be created. If you list more
than one variable, you should separate each variable name by at least one space. Listing
more than one variable in the TABLES statement will cause SAS to create a separate
frequency table for each variable.
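To illustrate, the following sketch would create two separate frequency tables, one for
POL_PRTY and one for SEX (both variables are from the current data set):

PROC FREQ DATA=D1;
TABLES POL_PRTY SEX;
TITLE1 'JANE DOE';
RUN;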
For example, the following statements create a frequency table for the variable AGE that
appears in the preceding data set:
PROC FREQ DATA=D1;
TABLES AGE;
TITLE1 'JANE DOE';
RUN;

Output Produced by the SAS Program


The frequency table. The frequency table created by the preceding statements is
reproduced as Output 5.1. The remainder of this section shows how to interpret the various
parts of the table.


JANE DOE

The FREQ Procedure

                                  Cumulative    Cumulative
AGE    Frequency    Percent        Frequency       Percent
-----------------------------------------------------------
 33            1       4.55                1          4.55
 36            1       4.55                2          9.09
 38            2       9.09                4         18.18
 42            1       4.55                5         22.73
 43            1       4.55                6         27.27
 44            3      13.64                9         40.91
 47            4      18.18               13         59.09
 48            1       4.55               14         63.64
 49            3      13.64               17         77.27
 50            1       4.55               18         81.82
 52            2       9.09               20         90.91
 55            1       4.55               21         95.45
 59            1       4.55               22        100.00
Output 5.1. Results from the FREQ procedure performed on the variable AGE.

You can see that the frequency table consists of five vertical columns, headed AGE,
Frequency, Percent, Cumulative Frequency, and Cumulative Percent. The following
sections describe the meaning of the information contained in each column.
The column headed with the variable name. The first column in a SAS frequency table is
headed with the name of the variable that is being analyzed.
You can see that the first column in Output 5.1 is labeled AGE, the variable being
analyzed here. The various values assumed by AGE appear in the first column under the
heading AGE. Reading from the top down in this column, you can see that, in this data
set, the observed values of AGE were 33, 36, 38, and so on, through 59. This means that the
youngest person in your data set was 33, and the oldest was 59.
The second column in the frequency table is headed Frequency. This column reports the
number of observations that appear at each value of the variable being analyzed. In the
present case, it will tell us how many subjects were at age 33, how many were at age 36,
and so on.
For example, the first value in the AGE column is 33. If you read to the right of this value,
you will find information about how many people were at age 33. Where the row for 33
intersects with the column for Frequency, you see the number 1. This means that just
one person was at age 33. Now skip down two rows to the row for the age 38. Where the
row for 38 intersects with the column headed Frequency, you see the number 2. This
means that two people were at age 38.


Reviewing various parts of the Frequency column reveals the following:

There were 3 people at age 44.

There were 4 people at age 47.

There was 1 person at age 59.

The next column is the Percent column. This column indicates the percent of observations
appearing at each value. In the present case, it will reveal the percent of people at age 33, the
percent at age 36, and so on.
A particular entry in the Percent column is equal to the corresponding value in the
Frequency column, divided by the total number of usable observations in the data set. For
example, where the row for the age of 33 intersects with the column headed Percent, you
see the entry 4.55. This means that 4.55% of the subjects were at age 33. This was computed
by dividing the frequency of people at age 33 (which was 1) by the total number of usable
observations (which was 22). 1 divided by 22 is equal to .0455, or 4.55%.
Now go down to the row for the age of 44. Where the row for 44 intersects with the column
headed Percent, you see the entry 13.64, meaning that 13.64% of the subjects were at age
44. This was computed by dividing the frequency of people at age 44 (which was 3) by
the total number of usable observations (which was 22). 3 divided by 22 is equal to .1364,
or 13.64%.
The next column is the Cumulative Frequency column. A particular entry in the
Cumulative Frequency column indicates the sum of

the number of observations scoring at the current value in the "Frequency" column plus ...

the number of observations scoring at each of the preceding (lower) values in the
"Frequency" column.

For example, look at the point where the row for AGE = 44 intersects with the column
headed Cumulative Frequency. At that intersection, you see the number 9. This means
that a total of 9 people were at age 44 or younger.
Next, look at the point where the row for AGE = 55 intersects with the column headed
Cumulative Frequency. At that intersection, you see the number 21. This means that a
total of 21 people were at age 55 or younger.
Finally, the last entry in the Cumulative Frequency column is 22, meaning that 22
people were at age 59 or younger. It also means that a total of 22 people provided valid data
on this AGE variable (the last entry in the Cumulative Frequency column always indicates
the total number of usable observations for the variable being analyzed).


The last column is the Cumulative Percent column. A particular entry in the Cumulative
Percent column indicates the sum of

the percent of observations scoring at the current value in the Percent column plus...

the percent of observations scoring at each of the preceding (lower) values in the Percent
column.

For example, look at the point where the row for AGE = 44 intersects with the column
headed Cumulative Percent. At that intersection, you see the number 40.91. This means
that 40.91% of the subjects were at age 44 or younger.
Next, look at the point where the row for AGE = 55 intersects with the column headed
Cumulative Percent. At that intersection, you see the number 95.45. This means that
95.45% of the subjects were at age 55 or younger.
Finally, the last entry in the Cumulative Percent column is 100.00, meaning that 100%
of the people were at age 59 or younger. The last figure in the Cumulative Percent column
will always be 100%.
Frequency missing. In some instances, you will see the entry "Frequency Missing = n"
below the frequency table that was created by PROC FREQ (this entry does not appear in
Output 5.1). This entry appears when the variable being analyzed has missing data for at
least some observations; the number n indicates how many observations with missing data
SAS encountered. For example, the following entry, appearing after the frequency table for
the variable AGE, would indicate that SAS encountered five subjects with missing data on
the AGE variable:

Frequency Missing = 5

Examples of Questions That Can Be Answered by Interpreting a Frequency Table

The Frequency Table
The Frequency Table
The frequency table for the AGE variable is reproduced again as Output 5.2. This output is
identical to Output 5.1; it is repeated here so that the questions and answers below can refer
to it.


JANE DOE

The FREQ Procedure

                                  Cumulative    Cumulative
AGE    Frequency    Percent        Frequency       Percent
-----------------------------------------------------------
 33            1       4.55                1          4.55
 36            1       4.55                2          9.09
 38            2       9.09                4         18.18
 42            1       4.55                5         22.73
 43            1       4.55                6         27.27
 44            3      13.64                9         40.91
 47            4      18.18               13         59.09
 48            1       4.55               14         63.64
 49            3      13.64               17         77.27
 50            1       4.55               18         81.82
 52            2       9.09               20         90.91
 55            1       4.55               21         95.45
 59            1       4.55               22        100.00
Output 5.2. Results from the FREQ procedure, reproduced for purposes of
answering questions.

The Questions
The companion volume to this book, Step-by-Step Basic Statistics Using SAS: Exercises,
provides exercises that enable you to review what you learned in this chapter by

entering a new data set

performing PROC FREQ on one of the variables in that data set

answering a series of questions about the frequency table created by PROC FREQ.

Here are examples of the types of questions that you will be asked to answer.
Read each of the questions presented below, review the answer provided, and verify that
you understand where the answer is found in Output 5.2. Also verify that you understand
why that answer is correct. If you are confused by any of the following questions and
answers, go back to the relevant section of this chapter and reread that section.

Question: What is the lowest observed value for the AGE variable?
Answer: 33. This value appears at the top of the AGE column in Output 5.2.

Question: What is the highest observed value for the AGE variable?
Answer: 59. This value appears at the bottom of the AGE column in Output 5.2.

Question: How many people are 49 years old? (i.e., What is the frequency for people who
displayed a value of 49 on the AGE variable?)
Answer: Three. This number appears in the Frequency column, in the row for AGE = 49,
in Output 5.2.

Question: What percent of people are 38 years old?
Answer: 9.09%. This number appears in the Percent column, in the row for AGE = 38, in
Output 5.2.

Question: How many people are 50 years old or younger?
Answer: 18. This number appears in the Cumulative Frequency column, in the row for
AGE = 50, in Output 5.2.

Question: What percent of people are 52 years old or younger?
Answer: 90.91%. This number appears in the Cumulative Percent column, in the row for
AGE = 52, in Output 5.2.

Question: What is the total number of valid observations for the AGE variable in this data
set?
Answer: 22. This number is the last entry in the Cumulative Frequency column in
Output 5.2.

Conclusion
This chapter has shown you how to use PROC FREQ to create simple frequency tables.
These tables provide the numbers that enable you to verbally describe the nature of your
data; they allow you to make statements such as "Nine percent of the sample were 38 years
of age" or "95% of the sample were age 55 or younger."
In some cases, it is more effective to use a graph to illustrate the nature of your data. For
example, you might use a bar graph to indicate the frequency of subjects at various ages. Or
you might use a bar graph to illustrate the mean age for male subjects versus female
subjects.
SAS provides a number of procedures that enable you to create bar graphs of this sort, as
well as other types of graphs and charts. The following chapter introduces you to some of
these procedures.


Creating Graphs
Introduction.........................................................................................160
Overview...............................................................................................................160
High-Resolution versus Low-Resolution Graphics ................................................160
What to Do If Your Graphics Do Not Fit on the Page............................................161
Reprise of Example 5.1: the Political Donation Study .......................161
The Study .............................................................................................................161
SAS Variable Names ............................................................................................161
Using PROC CHART to Create a Frequency Bar Chart.......................162
What Is a Frequency Bar Chart? ..........................................................................162
Syntax for the PROC Step ....................................................................................163
Creating a Frequency Bar Chart for a Character Variable ....................................163
Creating a Frequency Bar Chart for a Numeric Variable.......................................165
Creating a Frequency Bar Chart Using the LEVELS Option .................................168
Creating a Frequency Bar Chart Using the MIDPOINTS Option...........................170
Creating a Frequency Bar Chart Using the DISCRETE Option.............................172
Using PROC CHART to Plot Means for Subgroups .............................174
Plotting Means versus Frequencies ......................................................................174
The PROC Step ....................................................................................................174
Output Produced by the SAS Program .................................................................176
Conclusion...........................................................................................177


Introduction
Overview
In this chapter you learn how to use the SAS System's CHART procedure to create bar
charts. Most of the chapter focuses on creating frequency bar charts: figures in which the
horizontal axis plots values for a variable, and the vertical axis plots frequencies. A bar for a
particular value in a frequency bar chart indicates the number of observations that display a
particular value in the data set. Frequency bar charts are useful for quickly determining
which values are relatively common in a data set, and which values are less common. You'll
learn how to modify your bar charts by using the LEVELS, MIDPOINTS, and DISCRETE
options.
The final section of this chapter shows you how to use PROC CHART to create subgroup-mean bar charts. These are figures in which the points on the horizontal axis represent
different subgroups of subjects, and the vertical axis plots values on a selected quantitative
variable. A bar for a particular group illustrates the mean score displayed by that group for
the quantitative variable. Subgroup-mean bar charts are useful for quickly determining
which groups scored relatively high on a quantitative variable, and which groups scored
relatively low.
High-Resolution versus Low-Resolution Graphics
This chapter shows you how to create low-resolution graphics, as opposed to high-resolution
graphics. The difference between low-resolution and high-resolution graphics is one of
appearance: high-resolution graphics have a higher-quality, more professional look, and
therefore are more appropriate for publication in a research journal. Low-resolution
graphics are fine for helping you review and understand your data, but the quality of their
appearance is generally not good enough for publication.
This chapter presents only low-resolution graphics because the SAS programs requesting
them are simpler, and they require only base SAS and SAS/STAT software, which most
SAS users have access to. If you need to prepare high-resolution graphics, you need
SAS/GRAPH software. For more information on producing high-quality figures with
SAS/GRAPH, see SAS Institute Inc. (2000), and Carpenter and Shipp (1995).


What to Do If Your Graphics Do Not Fit on the Page


Most of the chapters in this book advise that you begin each SAS program with the following
OPTIONS statement to control the size of your output page:

OPTIONS LS=80 PS=60;

The option PS=60 is an abbreviation for PAGESIZE=60. This requests that output be
printed with 60 lines per page. With some computers and printers, however, these
specifications will cause some figures (e.g., bar charts) to be too large to be printed on a
single page. If you have charts that are broken across two pages, try reducing the page size
to 50 lines by using the following OPTIONS statement:
OPTIONS LS=80 PS=50;

Reprise of Example 5.1: the Political Donation Study


The Study
This chapter will demonstrate how to use PROC CHART to analyze data from the fictitious
political donation study that was presented in Chapter 5, "Creating Frequency Tables." In
that chapter, the example involved research on campaign finance, with a questionnaire that
was administered to 22 subjects.
The results of the questionnaire provided demographic information about the subjects (e.g.,
sex, age), the size of political donations they had made recently, and some information
regarding their political beliefs (sample item: "I believe that our federal government is
generally doing a good job"). Subjects responded to these items using a seven-point
response format in which 1 = "Disagree Very Strongly" and 7 = "Agree Very Strongly."
SAS Variable Names
When you typed your data, you used the following SAS variable names:

POL_PRTY represents the political party to which the subject belongs. In keying your
data, you used the value D to represent democrats, R to represent republicans, and
I to represent independents.

SEX represents the subjects sex, with the value F representing females and M
representing males.

AGE represents the subject's age in years.

DONATION represents the size (in dollars) of the political donation that each subject
made in the past year.


Q1 represents subject responses to the first question using the agree-disagree format.
You typed a "1" if the subject circled a 1 (for "Disagree Very Strongly"), you typed a
"2" if the subject circled a 2 (for "Disagree Strongly"), and so on.

Q2, Q3, and Q4 represent subject responses to the second, third, and fourth questions
using the agree-disagree format.

Chapter 5, "Creating Frequency Tables," includes a copy of the questionnaire that was used
to obtain the data. It also presents the complete SAS DATA step that inputs the data as a
SAS data set. If you need to refamiliarize yourself with the questionnaire or the data set,
review Example 5.1 in Chapter 5.

Using PROC CHART to Create a Frequency Bar Chart


What Is a Frequency Bar Chart?
Chapter 5 showed you how to use PROC FREQ to create a simple frequency table. These
tables are useful for determining the number of people whose scores lie at each value of a
given variable. In some cases, however, it is easier to get a sense of these frequencies by
plotting them in a bar chart, rather than summarizing them in numerical form in a table. The
SAS System's PROC CHART makes it easy to create this type of bar chart.
A frequency bar chart is a figure in which the horizontal axis plots values for a variable,
and the vertical axis plots frequencies. A bar for a particular value in a frequency bar chart
indicates the number of observations displaying that value in the data set. Frequency bar
charts are useful for quickly determining which values are relatively common in a data set,
and which values are less common. The following sections illustrate a variety of approaches
for creating these charts.


Syntax for the PROC Step


Following is the syntax for the PROC step of a SAS program that requests a frequency bar
chart with vertical bars:

PROC CHART DATA=data-set-name;
VBAR variable-list / options ;
TITLE1 ' your-name ';
RUN;

The second line of the preceding syntax presents the VBAR statement, which requests a
vertical bar chart (use the HBAR statement for a horizontal bar chart). It is in the
variable-list portion of the VBAR statement that you list the variables for which you want
frequency bar charts. The VBAR statement ends with /options, the section in which you
list particular options that you want for the charts. Some of these options will be discussed
in later sections.
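If you prefer horizontal bars, the corresponding sketch simply substitutes HBAR for VBAR
(this particular example does not appear elsewhere in this chapter; note also that, by
default, HBAR output lists frequency statistics to the right of each bar):

PROC CHART DATA=D1;
HBAR POL_PRTY;
TITLE1 'JANE DOE';
RUN;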
Creating a Frequency Bar Chart for a Character Variable
The PROC step. In this example, you will create a frequency bar chart for the variable
POL_PRTY. You will recall that this is the variable for the subject's political party:
democrat, republican, or independent. Following are the statements that will request a
vertical bar chart plotting frequencies for POL_PRTY:
PROC CHART DATA=D1;
VBAR POL_PRTY;
TITLE1 'JANE DOE';
RUN;
Where these statements should appear in the SAS program. Remember that the PROC
step of a SAS program should generally come after the DATA step. Chapter 5 provided the
DATA step for the political donation data that will be analyzed here. To give you a sense of
where the PROC CHART statement should go, the following code shows the last few data
lines from the data set, followed by the statements in the PROC step:
[Lines 1 through 30 of the DATA step presented in Chapter 5 would appear here]
19 D F 43 1000 5 2 7 3
20 D F 44 300 4 3 5 7
21 I F 38 100 5 2 4 1
22 D F 47 200 3 7 1 4
;
PROC CHART DATA=D1;
VBAR POL_PRTY;
TITLE1 'JANE DOE';
RUN;


Output produced by the SAS program. Output 6.1 shows the bar chart that is created by
the preceding statements.
JANE DOE

[Vertical frequency bar chart. Horizontal axis: POL_PRTY, with one bar for
each value (D, I, and R). Vertical axis: Frequency. Bar heights: D = 12,
I = 4, R = 6.]
Output 6.1. Results of PROC CHART performed on POL_PRTY.

The name of the variable that is being analyzed appears below the horizontal axis of the bar
chart. You can see that POL_PRTY is the variable being analyzed in this case.
The values that this variable assumed are used as labels for the bars in the bar chart. In
Output 6.1, the value D (democrats) labels the first bar, I (independents) labels the second
bar, and R (republicans) labels the last bar.
Frequencies are plotted on the vertical axis of the bar chart. The height of a bar on the
frequency axis indicates the frequencies associated with each value of the variable being
analyzed. For example, Output 6.1 shows a frequency of 12 for subjects coded with a D, a
frequency of 4 for subjects coded with an I, and a frequency of 6 for subjects coded with an
R. In other words, this bar chart shows that there were 12 democrats, 4 independents, and 6
republicans in your data set.


Creating a Frequency Bar Chart for a Numeric Variable


When you create a bar chart for a character variable, SAS will create a separate bar for each
value that your character variable includes. You can see this in Output 6.1, where separate
bars were created for the values D, I, and R.
However, if you create a bar chart for a numeric variable (and the numeric variable assumes
a relatively large number of values), SAS will typically group your data, and create a bar
chart in which the various bars are labeled with the midpoint for each group.
The PROC step. Here are the statements necessary to create a frequency bar chart for the
numeric variable AGE:
PROC CHART DATA=D1;
VBAR AGE;
TITLE1 'JANE DOE';
RUN;


Output produced by the SAS program. Output 6.2 shows the bar chart that is created by
the preceding statements.
JANE DOE

[Vertical frequency bar chart. Horizontal axis: AGE Midpoint, with bars at
the midpoints 33, 39, 45, 51, and 57. Vertical axis: Frequency. Bar
heights: 33 = 1, 39 = 3, 45 = 9, 51 = 7, 57 = 2.]
Output 6.2. Results of PROC CHART performed on AGE.

Notice that the horizontal axis is now labeled AGE Midpoint. The various bars in the
chart are labeled with the midpoints for the groups that they represent. Table 6.1
summarizes the way that the AGE values were grouped:



Table 6.1
Criteria Used for Grouping Values of the AGE Variable
_____________________________________________
               An observation is placed in
Interval       this interval if AGE scores
midpoint       fell in this range
_____________________________________________
   33          30 <= AGE score < 36
   39          36 <= AGE score < 42
   45          42 <= AGE score < 48
   51          48 <= AGE score < 54
   57          54 <= AGE score < 60
_____________________________________________

In Table 6.1, the first value under "Interval midpoint" is 33. To the right of this midpoint is
the entry "30 <= AGE score < 36." This means that if a given value of AGE is greater than
or equal to 30 and is also less than 36, it is placed into the interval that is identified with a
midpoint of 33. The remainder of Table 6.1 can be interpreted in the same way.
The first bar in Output 6.2 shows that there is a frequency of 1 for the group identified with
the midpoint of 33. This means that there was only one person whose age was in the interval
from 30 to 35. The remaining bars in Output 6.2 show that

there was a frequency of 3 for the group identified with the midpoint of 39

there was a frequency of 9 for the group identified with the midpoint of 45

there was a frequency of 7 for the group identified with the midpoint of 51

there was a frequency of 2 for the group identified with the midpoint of 57.


Creating a Frequency Bar Chart Using the LEVELS Option


The preceding section showed that when you analyze a numeric variable with PROC
CHART, SAS may group the values on that variable and identify each bar in the bar chart
with the interval midpoint. But what if SAS does not group these values into the number of
bars that you want? Is there a way to override the SAS System's default approach to
grouping these values? Fortunately, there is. PROC CHART provides a number of options
that enable you to control the number and nature of the bars that appear in the bar chart. For
example, the LEVELS option provides one of the easiest approaches for controlling the
number of bars that will appear.
The syntax. Here is the syntax for the PROC step in which the LEVELS option is used to
control the number of bars:
PROC CHART DATA=data-set-name;
VBAR variable-list / LEVELS=desired-number-of-bars ;
TITLE1 ' your-name ';
RUN;
For example, suppose that you want to have exactly six bars in your chart. The following
statements would request this:
PROC CHART DATA=D1;
VBAR AGE / LEVELS=6;
TITLE1 'JANE DOE';
RUN;


Output produced by the SAS program. Output 6.3 shows the chart that is created by the
preceding statements.
JANE DOE

[Vertical frequency bar chart. Horizontal axis: AGE Midpoint, with six bars
at the midpoints 32.5, 37.5, 42.5, 47.5, 52.5, and 57.5. Vertical axis:
Frequency.]

Output 6.3. Results of PROC CHART using the LEVELS option.

Notice that there are now six bars in the chart, as requested in the PROC step. Notice also
that the midpoints in the chart have been changed to accommodate the fact that there are
now bars for six groups, rather than five.


Creating a Frequency Bar Chart Using the MIDPOINTS Option


PROC CHART also enables you to specify exactly what you want the midpoints to be for
the various bars in the figure. You can do this by using the MIDPOINTS option in the
VBAR statement. With this approach, you can either list the exact values that each midpoint
should assume, or provide a range and an interval. Both approaches are illustrated below.
Listing the exact midpoints. If your bar chart will have a small number of bars, you might
want to specify the exact value for each midpoint. Here is the syntax for the PROC step that
specifies midpoint values:
PROC CHART DATA=data-set-name;
VBAR variable-list / MIDPOINTS=desired-midpoints ;
TITLE1 ' your-name ';
RUN;
For example, suppose that you want to use the midpoints 20, 30, 40, 50, 60, 70, and 80 for
the bars in your chart. The following statements would request this:
PROC CHART DATA=D1;
VBAR AGE / MIDPOINTS=20 30 40 50 60 70 80;
TITLE1 'JANE DOE';
RUN;
Providing a range and an interval. Writing the MIDPOINTS option in the manner
illustrated can be tedious if you want to have a large number of bars on your chart. In these
situations, it may be easier to use the key words TO and BY with the MIDPOINTS option.
This allows you to specify the lowest midpoint, the highest midpoint, and the interval that
separates each midpoint.
For example, the following MIDPOINTS option asks SAS to create midpoints that range
from 20 to 80, with each midpoint separated by an interval of 10 units:
PROC CHART DATA=D1;
VBAR AGE / MIDPOINTS=20 TO 80 BY 10;
TITLE1 'JANE DOE';
RUN;


Output produced by the SAS program. Output 6.4 shows the bar chart that is created by
the preceding statements.
JANE DOE

[Vertical frequency bar chart. Horizontal axis: AGE Midpoint, with the
requested midpoints 20, 30, 40, 50, 60, 70, and 80. Vertical axis:
Frequency. Midpoints at which no observations fall (such as 80) appear
with no bar.]
Output 6.4. Results of PROC CHART using the MIDPOINTS option.

You can see that all of the requested midpoints appear on the horizontal axis of Output 6.4.
This is the case despite the fact that some of the midpoints have no bar at all, indicating a
frequency of zero. For example, the last midpoint on the horizontal axis is 80, but there is no
bar for this group. This is because, as you may remember, the oldest subject in your data set
was 59 years old; thus there were no subjects in the group with a midpoint of 80.


Creating a Frequency Bar Chart Using the DISCRETE Option


Earlier sections have shown that when you use PROC CHART to create a frequency bar chart
for a character variable (such as POL_PRTY or SEX), it will automatically create a separate
bar for each value that the variable includes. However, it will not typically do this with a
numeric variable; when numeric variables assume many values, PROC CHART will
normally group the data, and label the axis with the midpoints for each group.
But what if you want to create a separate bar for each observed value of your numeric
variable? In this case, you simply specify the DISCRETE option in the VBAR statement.
The syntax. Here is the syntax for the PROC step that will cause a separate bar to be printed
for each observed value of a numeric variable:
PROC CHART DATA=data-set-name;
VBAR variable-list / DISCRETE ;
TITLE1 ' your-name ';
RUN;
The following statements again create a frequency bar chart for AGE. This time, however,
the DISCRETE option is used to create a separate bar for each observed value of AGE.
PROC CHART DATA=D1;
VBAR AGE / DISCRETE;
TITLE1 'JANE DOE';
RUN;


Output produced by the SAS program. Output 6.5 shows the bar chart that is created by
the preceding statements.
JANE DOE

[Vertical frequency bar chart. Horizontal axis: AGE, with one bar for each
observed value: 33, 36, 38, 42, 43, 44, 47, 48, 49, 50, 52, 55, and 59.
Vertical axis: Frequency.]

Output 6.5. Results of PROC CHART using the DISCRETE option.

In Output 6.5, the bars are narrower than in the previous examples because there are more of
them. There is now one bar for each value that AGE actually assumed in the data set.
You can see that the first bar indicates the number of people at age 33, the second bar
indicates the number of people at age 36, and so on. Notice that there are no bars labeled 34
or 35, as there were no people at these ages in your data set.


Using PROC CHART to Plot Means for Subgroups


Plotting Means versus Frequencies
Earlier sections of this chapter have shown how to use PROC CHART to create frequency
bar charts. However, it is also possible to use PROC CHART to create subgroup-mean bar
charts. These are figures in which the bars represent the means that various subgroups
display on a particular quantitative criterion variable.
For example, consider the political donation questionnaire that was presented in Chapter 5.
One of the items on the questionnaire was I believe that our federal government is
generally doing a good job. Subjects were asked to indicate the extent to which they agreed
or disagreed with this statement by circling a number from 1 (Disagree Very Strongly) to
7 (Agree Very Strongly). Responses to this question were given the SAS variable name
Q1 in the SAS program.
A separate item on the questionnaire asked subjects to indicate whether they were
democrats, republicans, or independents. Responses to this question were given the SAS
variable name POL_PRTY.
It would be interesting to see if there are any differences between the ways that democrats,
republicans, and independents responded to the question about whether the federal
government is doing a good job (variable Q1). In fact, it is possible to compute the mean
score on Q1 for each of these subgroups, and to plot these subgroup means on a chart. The
following section shows how to do this with PROC CHART.
The PROC Step
Following is the syntax for the PROC CHART statements that create a subgroup-mean bar
chart:
PROC CHART DATA=data-set-name;
VBAR group-variable / SUMVAR=criterion-variable TYPE=MEAN;
TITLE1 ' your-name ';
RUN;

The second line of the preceding syntax includes the following:

VBAR group-variable

Here, group-variable refers to the SAS variable that codes group membership. In your
program, you list POL_PRTY as the group variable because POL_PRTY indicates whether
a given subject is a democrat, a republican, or an independent (the three groups that you
want to compare).


The second line of the syntax also includes the following:


/ SUMVAR=criterion-variable
The slash (/) indicates that options will follow. You use the SUMVAR= option to identify
the criterion variable in your analysis. This criterion variable is the variable on which means
will be computed. In the present example, you want to compute mean scores on Q1, the item
asking whether the federal government is doing a good job. Therefore, you will include
SUMVAR=Q1 in your final program.
Finally, the second line of the syntax ends with
TYPE=MEAN;
This option specifies that PROC CHART should compute group means on the criterion
variable, as opposed to computing group sums on the criterion variable. If you wanted to
compute group sums on the criterion variable, this option would be typed as:
TYPE=SUM;
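For instance, here is a minimal sketch of a TYPE=SUM request. It assumes, purely for
illustration, that you wanted each bar to display the total (rather than mean) donations made
by the subjects in each political party:

PROC CHART DATA=D1;
VBAR POL_PRTY / SUMVAR=DONATION TYPE=SUM;
TITLE1 'JANE DOE';
RUN;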
Following are the statements that request that PROC CHART create a bar chart to plot
means for the three groups on the variable Q1:
PROC CHART DATA=D1;
VBAR POL_PRTY / SUMVAR=Q1 TYPE=MEAN;
TITLE1 'JANE DOE';
RUN;


Output Produced by the SAS Program


Output 6.6 shows the bar chart that is created by the preceding PROC step.
JANE DOE

[Vertical bar chart. Horizontal axis: POL_PRTY, with one bar for each
value (D, I, and R). Vertical axis: Q1 Mean. The bar for D is just above
4.0, the bar for I is at about 4.0, and the bar for R is just below 4.0.]
Output 6.6. Results of PROC CHART in which the criterion variable is Q1 and
the grouping variable is POL_PRTY.

When this type of analysis is performed, the name of the grouping variable appears as the
label for the horizontal axis.
Output 6.6 shows that POL_PRTY is the grouping variable for the current analysis.
The values that POL_PRTY can assume appear as labels for the individual bars in the chart.
The labels for the three bars are D (democrats), I (independents), and R (republicans).
In the frequency bar charts that were presented earlier in this chapter, the vertical axis
plotted frequencies. However, with the type of analysis reported here, the vertical axis
reports mean scores for the various groups on the criterion variable.
The heading for the vertical axis is now Q1 Mean, meaning that you use this axis to
determine each group's mean score on the criterion variable, Q1.


The height of a particular bar on the vertical axis indicates the mean score for that group on
Q1 (the question about whether the federal government is doing a good job). Output 6.6
shows that

The democrats (the bar labeled D) had a mean score that was just above 4.0 on the
criterion variable, Q1.

The independents (the bar labeled I) had a mean score that was about 4.0.

The republicans (the bar labeled R) had a mean score that was just below 4.0.

Remember that 1 represented Disagree Very Strongly and 7 represented Agree Very
Strongly. A score of 4 represented Neither Agree nor Disagree. The mean scores
presented in Output 6.6 show that all three groups had a mean score close to 4.0, meaning
that their mean scores were all close to the response Neither Agree nor Disagree. The
mean score for the democrats was a bit higher (a bit closer to Agree), and the mean score
for the republicans was a bit lower (a bit closer to Disagree), although we have no way of
knowing whether these differences are statistically significant at this point.

Conclusion
This chapter has shown you how to use PROC CHART to create frequency bar charts and
subgroup-mean bar charts. Summarizing results graphically in charts such as these can make
it easier for you to identify trends in your data at a glance. These figures can also make it
easier to communicate your findings to others.
When presenting your findings, it is also common to report measures of central tendency
(such as the mean or the median) and measures of variability (such as the standard deviation
or the interquartile range). Chapter 7, "Measures of Central Tendency and Variability,"
shows you how to use PROC MEANS and PROC UNIVARIATE to compute these
measures, along with other measures of central tendency and variability. Chapter 7 also
shows you how to create stem-and-leaf plots that can be reviewed to determine the general
shape of a particular distribution of scores.


Measures of Central Tendency and Variability
Introduction.........................................................................................181
Overview...............................................................................................................181
Why It Is Important to Compute These Measures.................................................181
Reprise of Example 5.1: The Political Donation Study.......................181
The Study .............................................................................................................181
SAS Variable Names ............................................................................................182
Measures of Central Tendency: The Mode, Median, and Mean.........183
Overview...............................................................................................................183
Writing the SAS Program......................................................................................183
Output Produced by the SAS Program .................................................................184
Interpreting the Mode Computed by PROC UNIVARIATE....................................185
Interpreting the Median Computed by PROC UNIVARIATE .................................186
Interpreting the Mean Computed by PROC UNIVARIATE....................................186
Interpreting a Stem-and-Leaf Plot Created by PROC UNIVARIATE ...187
Overview...............................................................................................................187
Output Produced by the SAS Program .................................................................187
Using PROC UNIVARIATE to Determine the Shape of
Distributions ...................................................................................190
Overview...............................................................................................................190
Variables Analyzed ...............................................................................................190
An Approximately Normal Distribution ..................................................................191
A Positively Skewed Distribution...........................................................................194


A Negatively Skewed Distribution .........................................................................196


A Bimodal Distribution...........................................................................................198
Simple Measures of Variability: The Range, the Interquartile
Range, and the Semi-Interquartile Range ......................................200
Overview...............................................................................................................200
The Range ............................................................................................................200
The Interquartile Range ........................................................................................202
The Semi-Interquartile Range ...............................................................................203
More Complex Measures of Variability: The Variance and
Standard Deviation .........................................204
Overview...............................................................................................................204
Relevant Terms and Concepts..............................................................................204
Conceptual Formula for the Population Variance..................................................205
Conceptual Formula for the Population Standard Deviation .................................206
Variance and Standard Deviation: Three Formulas ..........................207
Overview...............................................................................................................207
The Population Variance and Standard Deviation ................................................207
The Sample Variance and Standard Deviation .....................................................208
The Estimated Population Variance and Standard Deviation................................209
Using PROC MEANS to Compute the Variance and
Standard Deviation .........................................................................210
Overview...............................................................................................................210
Computing the Sample Variance and Standard Deviation ....................................210
Computing the Population Variance and Standard Deviation ...............................212
Computing the Estimated Population Variance and Standard Deviation ..............212
Conclusion...........................................................................................214


Introduction
Overview
This chapter shows you how to perform simple procedures that help describe and summarize
data. You will learn how to use PROC UNIVARIATE to compute the measures of central
tendency that are most frequently used in research: the mode, the median, and the mean.
You will also learn how to create and interpret a stem-and-leaf plot, which can be helpful in
understanding the shape of a variable's distribution. Finally, you will learn how to use
PROC MEANS to compute the variance and standard deviation of quantitative variables.
You will learn how to compute the sample standard deviation and variance, as well as the
estimated population standard deviation and variance.
Why It Is Important to Compute These Measures
There are a number of reasons why you need to be able to perform these procedures. At a
minimum, the output produced by PROC MEANS and PROC UNIVARIATE will help you
verify that you made no obvious errors in entering data or writing your program. It is always
important to verify that your data are correct before going on to analyze them with more
sophisticated procedures. In addition, most research journals require that you report simple
descriptive statistics (e.g., means and standard deviations) for the variables you analyze.
Finally, many of the later chapters in this guide build on the more basic concepts presented
here. For example, in Chapter 9 you will learn how to create a type of standardized variable
called a z score by using standard deviations, a concept taught in this chapter.

Reprise of Example 5.1: The Political Donation Study


The Study
This chapter illustrates PROC UNIVARIATE and PROC MEANS by analyzing data from
the fictitious political donation study presented in Chapter 5, "Creating Frequency Tables."
In that chapter, you were asked to suppose that you are a political scientist conducting
research on campaign finance. You developed a questionnaire and administered it to 22
individuals.
With the questionnaire, subjects provided demographic information about themselves (e.g., sex,
age), indicated the size of political donations they had made recently, and responded to four
items designed to assess some of their political beliefs (sample item: "I believe that our federal
government is generally doing a good job"). Subjects responded to these items using a 7-point
response format in which 1 = "Disagree Very Strongly" and 7 = "Agree Very Strongly."


SAS Variable Names


In entering your data, you used the following SAS variable names to represent the variables
measured:

The SAS variable SUB_NUM contains unique subject numbers assigned to each
subject.

The SAS variable POL_PRTY represents the political party to which the subject
belongs. In entering your data, you used the value "D" to represent democrats, "R" to
represent republicans, and "I" to represent independents.

The SAS variable SEX represents the subject's sex, with the value "F" representing
females and "M" representing males.

The SAS variable AGE represents the subject's age in years.

The SAS variable DONATION represents the size of the political donation (in dollars)
that each subject made in the past year.

The SAS variable Q1 represents subject responses to the first question using the
Agree-Disagree format. You typed a 1 if the subject circled a 1 (for "Disagree Very
Strongly"), you typed a 2 if the subject circled a 2 (for "Disagree Strongly"), and so
forth.

In the same way, the SAS variables Q2, Q3, and Q4 represent subject responses to the
second, third, and fourth questions using the Agree-Disagree format.

Chapter 5, "Creating Frequency Tables," provides a copy of the questionnaire that was used
to obtain the preceding data. It also presents the complete SAS DATA step that read in the
data as a SAS data set.


Measures of Central Tendency: The Mode, Median, and Mean
Overview
When you assess some numeric variable (such as subject age) in the course of conducting a
research study, you will typically obtain a variety of different scores: a distribution of
scores. If you write an article about your study for a research journal, you will need some
mechanism for describing your obtained distribution of scores. Readers will want to know
the most typical or representative score on your variable; in the present case, they would
want to know the most typical or representative age in your sample. To convey this, you
will probably report one or more measures of central tendency.
A measure of central tendency is a value or number that represents the location of a
sample of data on a continuum by revealing where the center of the distribution is located in
that sample. There are a variety of different measures of central tendency, and each uses a
somewhat different approach for determining just what the center of the distribution is.
The measures of central tendency that are used most frequently in the behavioral sciences
and education are the mode, the median, and the mean. This section discusses the differences
among these measures, and shows how to compute them using PROC UNIVARIATE.
Writing the SAS Program
The PROC step. The UNIVARIATE procedure in SAS can produce a wide variety of
indices for describing a distribution. These include the sample size, the standard deviation,
the skewness, the kurtosis, percentiles, and other indices. This section, however, focuses on
just three: the mode, the median, and the mean.
Below is the syntax for the PROC step that requests PROC UNIVARIATE:

PROC UNIVARIATE   DATA=data-set-name   options ;
   VAR  variable-list ;
   TITLE1  'your-name';
RUN;

This chapter illustrates how to use a PROC UNIVARIATE statement that requests the usual
default statistics, along with two options: the PLOT option (which requests a stem-and-leaf
plot) and the NORMAL option (which requests statistics that test the null hypothesis that
the sample data were drawn from a normally distributed population). The current section
discusses only the default output (which includes the mode, median, and mean). Later
sections cover the stem-and-leaf plot and the tests for normality.


Here are the actual statements requesting that PROC UNIVARIATE be performed on the
variable AGE:

PROC UNIVARIATE   DATA=D1   PLOT   NORMAL;
   VAR AGE;
   TITLE1 'JANE DOE';
RUN;

Where these statements should appear in the SAS program. Remember that the PROC
step of a SAS program should generally come after the DATA step. Chapter 5, "Creating
Frequency Tables," provided the DATA step for the political donation data that will be
analyzed here. To give you a sense of where the PROC UNIVARIATE statements should
go, here is a reproduction of the last few data lines from the data set, followed by the
statements in the current PROC step:
[Lines 1-30 of the DATA step presented in Chapter 5 would appear here]
20 D F 44 300 4 3 5 7
21 I F 38 100 5 2 4 1
22 D F 47 200 3 7 1 4
;
PROC UNIVARIATE DATA=D1 PLOT NORMAL;
VAR AGE;
TITLE1 'JANE DOE';
RUN;

Output Produced by the SAS Program


The preceding statements produced two pages of output that provide plenty of information
about the variable being analyzed. The output provides:

a moments table that includes the sample size, mean, standard deviation, variance,
skewness, and kurtosis, along with other statistics

a basic statistical measures table that includes measures of location (the mean, median,
and mode), as well as measures of variability (the standard deviation, variance, range,
and interquartile range)

a tests for normality table that includes statistical tests of the null hypothesis that the
sample was drawn from a normally distributed population

a quantiles table that provides the median, 25th percentile, 75th percentile, and related
information

an extremes table that provides the five highest values and five lowest values on the
variable being analyzed

a stem-and-leaf plot, box plot, and normal probability plot.
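
If you want SAS to print only some of these tables, one approach is to precede the PROC
step with an ODS SELECT statement. The sketch below assumes that the relevant ODS
table names are BasicMeasures and TestsForNormality (these are our assumption of the
standard names used by PROC UNIVARIATE; verify them against the SAS documentation
for your release):

ODS SELECT BasicMeasures TestsForNormality;
PROC UNIVARIATE DATA=D1 NORMAL;
   VAR AGE;
   TITLE1 'JANE DOE';
RUN;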


Interpreting the Mode Computed by PROC UNIVARIATE


The mode (also called the modal score) is the most frequently occurring value or score in a
sample. Technically, the mode can be assessed with either quantitative or nonquantitative
(nominal-level) variables, although this chapter focuses on quantitative variables because
the UNIVARIATE procedure is designed for quantitative variables only. When you are
working with numeric variables, the mode is a useful measure of central tendency to report
when the distribution has more than one mode, is skewed, or is markedly nonnormal in
some other way.
PROC UNIVARIATE prints the mode as part of the Basic Statistical Measures table in its
output. The Basic Statistical Measures table from the current analysis of the variable AGE is
reproduced here as Output 7.1.
                 Basic Statistical Measures

       Location                  Variability

   Mean     46.04545    Std Deviation            6.19890
   Median   47.00000    Variance                38.42641
   Mode     47.00000    Range                   26.00000
                        Interquartile Range      6.00000

Output 7.1. The mode as it appears in the Basic Statistical Measures table
from PROC UNIVARIATE; the variable analyzed is AGE.

The mode appears on the last line of the "Location" section of the table. You can see that,
for this data set, the mode is 47. This means that the most frequently occurring score on
AGE was 47. You can verify that this is the case by reviewing output reproduced earlier in
this guide, in Chapter 6, "Creating Graphs." Output 6.5 contains the results of running
PROC CHART on AGE; it shows that the value 47 has the highest frequency.
A word of warning about the mode as computed by PROC UNIVARIATE: when there is
more than one mode for a given variable, PROC UNIVARIATE prints only the mode with
the lowest numerical value. For example, suppose that there were two modes for the current
data set: imagine that 10 people were at age 25, and 10 additional people were at age 35.
This means that the two most common scores on AGE would be 25 and 35. PROC
UNIVARIATE would report only one mode for this variable: it would report 25 (because
25 was the mode with the lowest numerical value). When you have more than one mode, a
note at the bottom of the Basic Statistical Measures table indicates the number of modes that
were observed. This situation is discussed in the section, "A Bimodal Distribution," that
appears later in this chapter. See Output 7.12.
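
If you suspect that a variable has more than one mode, a simple way to identify all of the
modes is to request a frequency table for the variable and look for the values with the
highest (possibly tied) frequencies. Here is a sketch using PROC FREQ, the procedure
covered in Chapter 5:

PROC FREQ DATA=D1;
   TABLES AGE;
   TITLE1 'JANE DOE';
RUN;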


Interpreting the Median Computed by PROC UNIVARIATE


The median (also called the median score) is the score located at the 50th percentile. This
means that the median is the score below which 50% of all data appear. For example,
suppose that you administer a test worth 100 points to a very large sample of students. If
50% of the students obtain a score below 71 points, then the median is 71.
The median can be computed only from some type of quantitative data. It is a particularly
useful measure of central tendency when you are working with an ordinal (ranked) variable.
It is also useful when you are working with interval- or ratio-level variables that display a
skewed distribution.
Along with the mode, PROC UNIVARIATE also prints the median as part of the Basic
Statistical Measures table in its output. The table from the present analysis of the variable
AGE is reproduced as Output 7.2.
                 Basic Statistical Measures

       Location                  Variability

   Mean     46.04545    Std Deviation            6.19890
   Median   47.00000    Variance                38.42641
   Mode     47.00000    Range                   26.00000
                        Interquartile Range      6.00000

Output 7.2. The median as it appears in the Basic Statistical Measures table
from PROC UNIVARIATE; the variable analyzed is AGE.

Output 7.2 shows that the median for the current data set is 47. You can see that, in this data
set, the median and the mode happen to be the same number: 47. This result is not unusual,
especially when the data set has a symmetrical distribution.
Interpreting the Mean Computed by PROC UNIVARIATE
The mean is the score that is located at the mathematical center of a distribution. It is
computed by (a) summing the scores and (b) dividing by the number of observations. The
mean is useful as a measure of central tendency for numeric variables that are assessed at
the interval- or ratio-level, particularly when they display fairly symmetrical, unimodal
distributions (later, you will see that the mean can be dramatically affected when a
distribution is skewed).
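In the notation that is used later in this chapter, this computation can be written as

$$\bar{X} = \frac{\Sigma X}{N}$$

where $\bar{X}$ is the sample mean, $\Sigma X$ is the sum of the individual scores, and $N$ is the number of observations.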
You may have noticed that the mean was printed as part of the Basic Statistical Measures
table in Output 7.1 and 7.2. The same mean is also printed as part of the Moments table
produced by PROC UNIVARIATE. The moments table from the current PROC
UNIVARIATE analysis of AGE appears here in Output 7.3.


                             Moments

 N                        22    Sum Weights                22
 Mean             46.0454545    Sum Observations         1013
 Std Deviation    6.19890369    Variance           38.4264069
 Skewness         -0.2047376    Kurtosis           0.19131126
 Uncorrected SS        47451    Corrected SS       806.954545
 Coeff Variation  13.4625746    Std Error Mean     1.32161071

Output 7.3. The N (sample size) and the mean as they appear in the Moments
table from PROC UNIVARIATE; the variable analyzed is AGE.

Output 7.3 provides many statistics from the analysis of AGE, but this section focuses on
just two.
First, to the right of the heading "N" you will find the number of valid (usable) observations
on which these analyses were based. Here, you can see that N = 22, which means that scores
on AGE were analyzed for 22 subjects.
Second, to the right of the heading "Mean" you will find the mean score for AGE. You can
see that the mean score on AGE is 46.045 for the current sample. Again, this is fairly close
to the mode and median of 47, which is fairly common for distributions that are largely
symmetrical.

Interpreting a Stem-and-Leaf Plot Created by PROC UNIVARIATE
Overview
A stem-and-leaf plot is a special type of chart for plotting frequencies. It is particularly
useful for understanding the shape of the distribution; i.e., for determining whether the
distribution is approximately normal, as opposed to being skewed, bimodal, or in some other
way nonnormal. The current section shows you how to interpret the stem-and-leaf plot
generated by PROC UNIVARIATE. The section that follows provides examples of
variables with normal and nonnormal distributions.
Output Produced by the SAS Program
Earlier this chapter indicated that, when you include the PLOT option in the PROC
UNIVARIATE statement, SAS produces three figures. These figures are a stem-and-leaf
plot, a box plot, and a normal probability plot. This section focuses only on the stem-andleaf plot, which is reproduced here as Output 7.4.


    Stem Leaf                     #  Boxplot
       5 59                       2     0
       5 022                      3     |
       4 77778999                 8  +--+--+
       4 23444                    5  +-----+
       3 688                      3     |
       3 3                        1     0
         ----+----+----+----+
     Multiply Stem.Leaf by 10**+1

Output 7.4. Stem-and-leaf plot for the variable AGE produced by
PROC UNIVARIATE.

Remember that the variable being analyzed in this case is AGE. In essence, the stem-and-leaf
plot indicates what values appear in the data set for the variable AGE, and how many
occurrences of each value appear.
Interpreting the stems and leaves. Each potential value of AGE is separated into a
"stem" and a "leaf."
The stem for a given value appears under the heading "Stem."
The leaf for each value appears under the heading "Leaf."
This concept is easier to understand one score at a time. For example, immediately under the
heading "Stem," you can see that the first stem is 5. Immediately under the heading
"Leaf" you see a 5 and a 9. This excerpt from Output 7.4 is reproduced here:
    Stem Leaf
       5 59
Connecting the stem (5) to the first leaf (also a 5) gives you the first potential value that
AGE took on: 55. This means that the data set included one subject at age 55. Similarly,
connecting the stem (5) to the second leaf (the 9) gives you the second potential value
that AGE took on: 59. In short, the plot tells you that one subject had a score on AGE of
55, and one had a score of 59.
Now move down one line. The stem on the second line is again a 5, but you now have
different leaves: a 0, a 2, and another 2 (see below). Connecting the stem to these
leaves tells you that, in your data set, one subject had a score on AGE of 50, one had a
score of 52, and another had a score of 52.
    Stem Leaf
       5 59
       5 022


One last example: Move down to the third line. The stem on the third line is now a 4.
Further, you now have eight leaves: four leaves are a 7, one leaf is an 8, and three
leaves are a 9, as follows:
    Stem Leaf
       5 59
       5 022
       4 77778999

If you connect the stem (4) to these individual leaves, you learn that, in your data set,
there were four subjects who had a score on AGE of 47, one subject who had a score of
48, and three subjects who had a score of 49. The remainder of the stem-and-leaf plot
can be interpreted in the same way.
In summary, you can see that the stem-and-leaf plot is similar to a frequency bar chart,
except that it is set on its side: the values that the variable took on appear on the vertical
axis, and the frequencies are plotted along the horizontal axis. This is the reverse of what
you saw with the vertical bar charts created by PROC CHART in the previous chapter.
Interpreting the note at the bottom of the plot. There is one more feature in the stem-and-leaf plot that requires explanation. Output 7.4 shows that the following note appears at the
bottom of the plot:
Multiply Stem.Leaf by 10**+1
To understand the meaning of this note, you need to mentally insert a decimal point into the
stem-leaf values you have just reviewed. Those values are again reproduced here:
    Stem Leaf
       5 59
       5 022
       4 77778999
       4 23444
       3 688
       3 3

Notice the blank space that separates each stem from its leaves. For example, in the first
line, there is a blank space that separates the stem 5 from the leaves 5 and 9.
Technically, you are supposed to read this blank space as a decimal point (.). This means
that the values in the first line are actually 5.5 (for the subject whose age was 55) and 5.9
(for the subject whose age was 59).
The note at the bottom of the page tells you how to move this decimal point so that the
values will return to their original metric. For this plot, the note says "Multiply Stem.Leaf
by 10**+1." This means "multiply the stem-leaf by 10 raised to the first power." The
number 10 raised to the first power is, of course, 10. So what happens to the stem-leaf 5.5
when it is multiplied by 10? It becomes 55, the subject's actual score on AGE. And what
happens to the stem-leaf 5.9 when it is multiplied by 10? It becomes 59, another subject's
actual score on AGE.


And that is how you interpret a stem-and-leaf plot. Whenever you type in a new data set,
you should routinely create stem-and-leaf plots for each of your numeric variables. This will
help you identify obvious errors in data entry, and will help you visualize the shape of your
distribution (i.e., help you determine whether it is skewed, bimodal, or in any other way
nonnormal). The next section of this chapter shows you how to do this.
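
Before moving on, here is a sketch of a program that requests stem-and-leaf plots for every
numeric variable at once. It relies on _NUMERIC_, the SAS shorthand that refers to all
numeric variables in the data set; listing the variable names individually works just as well:

PROC UNIVARIATE DATA=D1 PLOT;
   VAR _NUMERIC_;
   TITLE1 'JANE DOE';
RUN;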

Using PROC UNIVARIATE to Determine the Shape of Distributions
Overview
This guide often describes a sample of scores as displaying an approximately normal
distribution. This means that the distribution of scores more or less follows the bell-shaped,
symmetrical pattern of the normal curve. It is generally wise to review the shape of a sample
of data prior to analyzing it with more sophisticated inferential statistics. This is because
some inferential statistics require that your sample be drawn from a normally distributed
population.
When the values that you have obtained in a sample display a marked departure from
normality (such as a strong skew), then it becomes doubtful that your sample was drawn
from a population with a normal distribution. In some cases, this will mean that you should
not analyze the data with certain types of inferential statistics.
This section illustrates several different shapes that a distribution may display. Using
stem-and-leaf plots, it shows how a sample may appear (a) when it is approximately normal, (b)
when it is positively skewed, (c) when it is negatively skewed, and (d) when it may have
multiple modes. It also shows how each type of distribution affects the mode, median, and
mean computed by PROC UNIVARIATE.
Variables Analyzed
As was discussed earlier, Chapter 5, "Creating Frequency Tables," provided a fictitious
political donation questionnaire. The last four items on the questionnaire presented subjects
with statements, and asked the subjects to indicate the extent to which they agreed or
disagreed with each statement. They responded by using a 7-point scale in which 1
represented "Disagree Very Strongly" and 7 represented "Agree Very Strongly."
Responses to these four items were given the SAS variable names Q1, Q2, Q3, and Q4,
respectively. This section shows some of the results produced when PROC UNIVARIATE
was used to analyze responses to these items.
Here are the SAS statements requesting that PROC UNIVARIATE be performed on Q1,
Q2, Q3, and Q4. Notice that the PROC UNIVARIATE statement itself contains the PLOT


option (which will cause stem-and-leaf plots to be created), as well as the NORMAL option
(which requests tests of the null hypothesis that the data were sampled from a normally
distributed population).
PROC UNIVARIATE   DATA=D1   PLOT   NORMAL;
   VAR Q1 Q2 Q3 Q4;
   TITLE1 'JANE DOE';
RUN;

These statements resulted in four sets of output: one set of PROC UNIVARIATE output for
each of the four variables.
An Approximately Normal Distribution
The stem-and-leaf plot. The SAS variable Q1 represents responses to the statement, "I
believe that our federal government is generally doing a good job." Output 7.5 presents the
stem-and-leaf plot of fictitious responses to this question.
    Stem Leaf                     #  Boxplot
       7 0                        1     |
       6 00                       2     |
       5 0000                     4  +-----+
       4 00000000                 8  *--+--*
       3 0000                     4  +-----+
       2 00                       2     |
       1 0                        1     |
         ----+----+----+----+

Output 7.5. Stem-and-leaf plot of an approximately normal distribution
produced by the PROC UNIVARIATE analysis of Q1.

The stem-and-leaf plot appears on the left side of Output 7.5. The box plot for the same
variable appears on the right (for guidelines on how to interpret a box plot, see Schlotzhauer
and Littell [1997], or Tukey [1977]).
The first line of the stem-and-leaf plot shows a stem-leaf combination of "7 0," which
means that the variable included one value of 7.0. The second line shows the stem 6,
along with two leaves of 0 and 0. This means that the sample included two subjects with
a score of 6.0. The remainder of the plot can be interpreted in the same fashion.
Notice that there is no note at the bottom of the plot saying anything such as "Multiply
Stem.Leaf by 10**+1." This means that these stem-leaf values already display the
appropriate unit of measurement; it is not necessary to multiply them by 10 (or any other
value).
The stem-and-leaf plot shows that Q1 has a symmetrical distribution centered around the
score of 4.0 (in this context, the word "symmetrical" means that the tail extending above a
score of 4.0 is the same length as the tail extending below a score of 4.0). A response of 4
on the questionnaire corresponds to the answer "Neither Agree nor Disagree," so it appears
that the most common response was for subjects to neither agree nor disagree with the
statement, "I believe that our federal government is generally doing a good job."
Based on the physical appearance of the stem-and-leaf plot, Q1 appears to have an
approximately normal shape.
The mean, median, and mode. Output 7.6 provides the Basic Statistical Measures table
and the Tests for Normality table produced by PROC UNIVARIATE in its analysis of Q1.
                 Basic Statistical Measures

       Location                  Variability

   Mean     4.000000    Std Deviation            1.41421
   Median   4.000000    Variance                 2.00000
   Mode     4.000000    Range                    6.00000
                        Interquartile Range      2.00000


                 Tests for Location: Mu0=0

     Test           -Statistic-    -----p Value------

     Student's t    t   13.2665    Pr > |t|     <.0001
     Sign           M        11    Pr >= |M|    <.0001
     Signed Rank    S     126.5    Pr >= |S|    <.0001


                      Tests for Normality

   Test                  --Statistic---    -----p Value------

   Shapiro-Wilk          W      0.957881   Pr < W      0.4477
   Kolmogorov-Smirnov    D      0.181818   Pr > D      0.0573
   Cramer-von Mises      W-Sq   0.115205   Pr > W-Sq   0.0686
   Anderson-Darling      A-Sq    0.55516   Pr > A-Sq   0.1388

Output 7.6. Basic Statistical Measures table and Tests for Normality table
produced by the PROC UNIVARIATE analysis of Q1.

Output 7.6 shows that, for the variable Q1, the mean is 4, the median is 4, and the
mode is also 4. You may remember that the mean, median, and mode are expected to be
the same number when the distribution being analyzed is symmetrical (i.e., when neither tail
of the distribution is longer than the other). The stem-and-leaf plot of Output 7.5 has already
shown that the distribution is symmetrical, so it makes sense that the mean, median, and
mode of Q1 would all be the same value.


The test for normality. In Output 7.6, the section headed "Tests for Normality" provides
the results from four statistics. Each of these statistics tests the null hypothesis that the
sample was drawn from a normally distributed population. This section focuses on just one
of these tests: the Shapiro-Wilk W statistic, whose results appear in the row headed
"Shapiro-Wilk."
In the Tests for Normality section, one of the columns is headed "Statistic." This column
provides the obtained value for the statistic. In the present analysis, you can see that the
obtained value for the Shapiro-Wilk W statistic is 0.957881, which rounds to .96.
The next column in the Tests for Normality section is headed "p Value," which stands for
"probability value." Where the row headed "Shapiro-Wilk" intersects with the column
headed "p Value," you can see that the obtained p value for the current statistic is 0.4477
(this appears to the immediate right of the heading "Pr < W").
In general, a p value represents the probability that you would obtain the present statistic if
the null hypothesis were true. The smaller the p value is, the less likely it is that the null
hypothesis is true. A standard rule of thumb is that, if a p value is less than .05, you should
assume that it is very unlikely that the null hypothesis is true, and you should reject the null
hypothesis.
In the present analysis, the p value of the Shapiro-Wilk statistic represents the probability
that you would obtain a W statistic of this size if your sample were drawn from a normally
distributed population. In general, when this p value is less than .05, you may reject the null
hypothesis, and conclude that your sample was probably not drawn from a normally
distributed population. When the p value is greater than .05, you should fail to reject the null
hypothesis, and tentatively conclude that your sample probably was drawn from a normally
distributed population.
In the current analysis, you can see that this p value is 0.4477 (look below the heading p
Value). Because this p value is greater than the criterion of .05, you fail to reject the null
hypothesis of normality. In other words, you tentatively conclude that your sample probably
did come from a normally distributed population. In most cases, this is good news because
many statistical procedures require that your data be drawn from normally distributed
populations.
As is the case with many statistics, this statistic is sensitive to sample size in that it tends to
be very powerful (sensitive) with large samples. This means that the W statistic may imply that
your sample did not come from a normal distribution even when the sample shows a very
minor departure from normality. So keep the sample size in mind, and use caution in
interpreting the results of this test, particularly when your sample size is large.


A Positively Skewed Distribution


What is a skewed distribution? A distribution is skewed if one tail is longer than the other.
It shows positive skewness if the longer tail of the distribution points in the direction of
higher values. In contrast, negative skewness means that the longer tail points in the
direction of lower values. The present section illustrates a positively skewed distribution,
and the following section covers negative skewness.
The stem-and-leaf plot. In the political donation study, the second agree-disagree item
stated, "The federal government should raise taxes." Responses to this question were
represented by the SAS variable Q2. The stem-and-leaf plot created when PROC
UNIVARIATE analyzed Q2 is reproduced here as Output 7.7.
    Stem Leaf                     #  Boxplot
       7 0                        1     *
       6 0                        1     0
       5 0                        1     0
       4 00                       2     |
       3 00000                    5  +-----+
       2 00000000                 8  *--+--*
       1 0000                     4     |
         ----+----+----+----+

Output 7.7. Stem-and-leaf plot of a positively skewed distribution produced by
the PROC UNIVARIATE analysis of Q2.

The stem-and-leaf plot of Output 7.7 shows that most responses to item Q2 appear around
the number 2, meaning that most subjects circled the number 2 or some nearby number.
This seems reasonable, because, on the questionnaire, the response number 2 represented
"Disagree Strongly," and it makes sense that many people would disagree with the
statement "The federal government should raise taxes."
However, the stem-and-leaf plot shows that a small number of people actually agreed with
the statement. It shows that two people circled 4 (for "Neither Agree nor Disagree"), one
person circled 5 (for "Agree"), one person circled 6 (for "Agree Strongly"), and one
person circled 7 (for "Agree Very Strongly"). These responses created a long tail for the
distribution, a tail that stretches out in the direction of higher numbers (such as 6 and
7). This means that the distribution for Q2 is a positively skewed distribution.
The mean, median, and mode. You can also determine whether a distribution is skewed by
comparing the mean to the median. These statistics are presented in Output 7.8.


                 Basic Statistical Measures

       Location                  Variability

   Mean     2.772727    Std Deviation            1.60154
   Median   2.000000    Variance                 2.56494
   Mode     2.000000    Range                    6.00000
                        Interquartile Range      1.00000


                 Tests for Location: Mu0=0

     Test           -Statistic-    -----p Value------

     Student's t    t  8.120454    Pr > |t|     <.0001
     Sign           M        11    Pr >= |M|    <.0001
     Signed Rank    S     126.5    Pr >= |S|    <.0001


                      Tests for Normality

   Test                  --Statistic---    -----p Value------

   Shapiro-Wilk          W       0.85838   Pr < W      0.0048
   Kolmogorov-Smirnov    D      0.230725   Pr > D     <0.0100
   Cramer-von Mises      W-Sq   0.208576   Pr > W-Sq  <0.0050
   Anderson-Darling      A-Sq   1.173185   Pr > A-Sq  <0.0050

Output 7.8. Basic Statistical Measures table and Tests for Normality table
produced by the PROC UNIVARIATE analysis of Q2.

Output 7.8 shows that, for Q2, the mean is 2.77, the median is 2.0, and the mode is
also 2.0. Even if you had never seen the stem-and-leaf plot of Output 7.7, you would
still know that the distribution is positively skewed, because the mean (at 2.77) is higher
than the median (at 2.0). The reasons for this are explained here.
Generally speaking, the mean tends to be more strongly influenced by a long tail, compared
to the median. For example, if a distribution has a long tail in the direction of higher values,
the mean tends to be "pulled upward" in the direction of those higher values; it tends to
take on a higher value. However, the median generally remains unaffected by the longer tail.
Taken together, this means that when a distribution has a long tail in the direction of higher
values (i.e., when a distribution is positively skewed), the mean should be higher than the
median. Output 7.8 bears this out: the mean (at 2.77) is higher than the median (at 2.0),
indicating a positively skewed distribution.
The test for normality. Further evidence of a departure from normality can be seen in the
results of the test for normality, which appears toward the bottom of Output 7.8.


The p value for the W statistic is quite small at 0.0048. This is below the standard criterion
of .05, meaning that you may reject the null hypothesis that this sample of scores was drawn
from a normally distributed population. This is the result you would generally expect when a
sample displays a dramatic skew, as is the case here.
A Negatively Skewed Distribution
The stem-and-leaf plot. In the political donation study, the third agree-disagree item stated,
"The federal government should do a better job of maintaining our interstate highway
system." Responses to this question were represented by the SAS variable Q3. The
stem-and-leaf plot created when PROC UNIVARIATE analyzed Q3 is reproduced here as
Output 7.9.
    Stem Leaf                     #  Boxplot
       7 000                      3     |
       6 000000000                9  +-----+
       5 00000                    5  +--+--+
       4 00                       2     |
       3 0                        1     0
       2 0                        1     0
       1 0                        1     *
         ----+----+----+----+

Output 7.9. Stem-and-leaf plot of a negatively skewed distribution produced
by the PROC UNIVARIATE analysis of Q3.

The stem-and-leaf plot of Output 7.9 shows that most responses to item Q3 appear around
the number 6, meaning that most subjects circled the number 6 or some nearby number.
This seems reasonable, because, on the questionnaire, the response number 6 represented
"Agree Strongly," and it makes sense that many people would agree with the statement "The
federal government should do a better job of maintaining our interstate highway system."
However, the stem-and-leaf plot shows that a small number of people actually disagreed
with the statement. It shows that two people circled 4 (for "Neither Agree nor Disagree"),
one person circled 3 (for "Disagree"), one person circled 2 (for "Disagree Strongly"),
and one person circled 1 (for "Disagree Very Strongly"). These responses created a long
tail for the distribution, a tail that stretches out in the direction of lower numbers (such as
1 and 2). This means that the distribution for Q3 is a negatively skewed distribution.
The mean, median, and mode. Output 7.10 presents the basic statistical measures table
from the analysis of Q3.


                 Basic Statistical Measures

       Location                  Variability

   Mean     5.181818    Std Deviation            1.56255
   Median   6.000000    Variance                 2.44156
   Mode     6.000000    Range                    6.00000
                        Interquartile Range      1.00000


                 Tests for Location: Mu0=0

     Test           -Statistic-    -----p Value------

     Student's t    t  15.55464    Pr > |t|     <.0001
     Sign           M        11    Pr >= |M|    <.0001
     Signed Rank    S     126.5    Pr >= |S|    <.0001


                      Tests for Normality

   Test                  --Statistic---    -----p Value------

   Shapiro-Wilk          W      0.846403   Pr < W      0.0029
   Kolmogorov-Smirnov    D      0.245183   Pr > D     <0.0100
   Cramer-von Mises      W-Sq   0.248146   Pr > W-Sq  <0.0050
   Anderson-Darling      A-Sq   1.354406   Pr > A-Sq  <0.0050

Output 7.10. Basic Statistical Measures table and Tests for Normality table
produced by the PROC UNIVARIATE analysis of Q3.

Output 7.10 shows that, for Q3, the mean is 5.18, the median is 6.0, and the mode
is also 6.0. The preceding section indicated that the mean tends to be pulled in the
direction of the longer tail in a distribution. With a negatively skewed distribution, the
longer tail is in the direction of the lower numbers. This means that, with a negatively
skewed distribution, the mean should be pulled downward, so that it is lower than the
median. As expected, Output 7.10 shows that this is exactly the case: the mean for Q3 (at
5.18) is lower than the median (at 6.0).
The test for normality. The test for normality further attests that Q3 shows a marked
departure from normality.
The p value from the Shapiro-Wilk W Statistic is .0029. This is below the standard criterion
of .05, and so you may reject the null hypothesis that this sample was drawn from a
normally distributed population. In other words, you may tentatively conclude that the
sample was probably drawn from a population that was not normally distributed.


A Bimodal Distribution
What is a bimodal distribution? A bimodal distribution is a frequency distribution that
has two peaks or humps. With a bimodal distribution, the scores at the very center of
the two peaks have relatively high frequencies. Many textbooks say that, technically, the
two scores at the center of each peak should have exactly the same frequency for the
distribution to be considered bimodal. In practice, however, researchers often say that they
have a bimodal distribution simply because it has two peaks; they often say this even when
the two scores at the center of each peak do not have exactly the same frequency (i.e., even
when one peak is a little taller than the other).
The stem-and-leaf plot. In the political donation study, the fourth agree-disagree item
stated, "The federal government should increase social security benefits to the elderly."
Responses to this question were represented by the SAS variable Q4. The stem-and-leaf plot
created when PROC UNIVARIATE analyzed Q4 is reproduced here as Output 7.11.
    Stem Leaf                     #  Boxplot
       7 00                       2     |
       6 000000                   6  +-----+
       5 000                      3  |     |
       4 0                        1  *--+--*
       3 00                       2  |     |
       2 000000                   6  +-----+
       1 00                       2     |
         ----+----+----+----+

Output 7.11. Stem-and-leaf plot of a bimodal distribution produced by the
PROC UNIVARIATE analysis of Q4.

The stem-and-leaf plot of Output 7.11 reveals two peaks, or modes. It shows that six people
circled response number 6 (for "Agree Strongly"), and another six people circled response
number 2 (for "Disagree Strongly"). Clearly, responses to Q4 form a bimodal distribution.
Although stem-and-leaf plots are useful for identifying the general shape of distributions,
they can be misleading when it comes to determining whether a distribution is technically
bimodal. According to some textbooks, a distribution is technically bimodal only if the
scores at the center of each peak have exactly the same frequencies. With some data sets
(especially with large data sets), SAS must perform some rounding of numbers prior to
constructing its stem-and-leaf plot. In those situations, you should view the stem-and-leaf
plot as providing only a rough approximation of the shape of the distribution. Further, even
if the plot appears to suggest that the two scores at the center of each peak have exactly the
same frequencies, you should not assume that this is the case. You should instead look for a
note at the bottom of the Basic Statistical Measures table that tells you whether the sample
contains more than one mode. This note is discussed in the following section.


The mean, median, and mode. The Basic Statistical Measures table and Tests for
Normality table from the analysis of Q4 are presented here as Output 7.12.
                 Basic Statistical Measures

       Location                  Variability

   Mean     4.045455    Std Deviation            2.05814
   Median   4.500000    Variance                 4.23593
   Mode     2.000000    Range                    6.00000
                        Interquartile Range      4.00000

 NOTE: The mode displayed is the smallest of 2 modes with a count of 6.


                 Tests for Location: Mu0=0

     Test           -Statistic-    -----p Value------

     Student's t    t  9.219434    Pr > |t|     <.0001
     Sign           M        11    Pr >= |M|    <.0001
     Signed Rank    S     126.5    Pr >= |S|    <.0001


                      Tests for Normality

   Test                  --Statistic---    -----p Value------

   Shapiro-Wilk          W      0.875992   Pr < W      0.0101
   Kolmogorov-Smirnov    D      0.203485   Pr > D      0.0186
   Cramer-von Mises      W-Sq   0.190371   Pr > W-Sq   0.0064
   Anderson-Darling      A-Sq   1.147371   Pr > A-Sq  <0.0050

Output 7.12. Basic Statistical Measures table and Tests for Normality table
produced by the PROC UNIVARIATE analysis of Q4.

Output 7.12 shows that the mean score on Q4 is 4.05, and the median is 4.5. The output
indicates that the mode is equal to 2.0, and this may surprise you because, as established
previously, Q4 actually has two modes: 2.0 and 6.0. However, an earlier section of this chapter
mentioned that, when a distribution has more than one mode, PROC UNIVARIATE reports only
the mode with the lowest numerical value. With Q4, the lower mode is 2.0.
Just below the Basic Statistical Measures table, the following note appears:
NOTE: The mode displayed is the smallest of 2 modes
with a count of 6.
A note similar to this will tell you if you have more than one mode. As you can see, it will
also indicate the number of modes observed in your distribution, although it will not tell you
the scores associated with those modes.


The test for normality.


The Shapiro-Wilk W statistic appears as part of the Tests for Normality table. You can see
that the p value for this test is .0101. This is below the criterion of .05, and so you may
reject the null hypothesis that this sample came from a normally distributed population. You
may instead tentatively conclude that the sample was probably drawn from a population
with a nonnormal distribution.

Simple Measures of Variability: The Range, the Interquartile Range, and the Semi-Interquartile Range
Overview
A measure of variability is a number or set of numbers that represents the extent of
dispersion in a set of scores. It is the extent to which the scores differ from one another, or
from some measure of central tendency such as the mean. This section shows how to use
PROC UNIVARIATE to compute three relatively simple measures of variability: the range,
the interquartile range, and the semi-interquartile range.
The Range
The range is defined as the difference between the highest and lowest scores in a
distribution. The formula for the range is

$$X_{max} - X_{min}$$

where $X_{max}$ represents the highest observed score in the distribution, and $X_{min}$ represents the lowest observed score in the distribution.
When you run PROC UNIVARIATE on a variable, information relevant to the range
appears in the Basic Statistical Measures table and the Quantiles table. These tables appear
in Output 7.13 (again, the variable being analyzed is AGE).


                 Basic Statistical Measures

       Location                  Variability

   Mean     46.04545    Std Deviation            6.19890
   Median   47.00000    Variance                38.42641
   Mode     47.00000    Range                   26.00000
                        Interquartile Range      6.00000


              Quantiles (Definition 5)

              Quantile      Estimate

              100% Max            59
              99%                 59
              95%                 55
              90%                 52
              75% Q3              49
              50% Median          47
              25% Q1              43
              10%                 38
              5%                  36
              1%                  33
              0% Min              33

Output 7.13. The range, maximum score, and minimum score as they appear
in the Basic Statistical Measures table and Quantiles table produced by the
PROC UNIVARIATE analysis of AGE.

Output 7.13 shows that the range for the AGE variable was 26. This makes sense, because
the highest score on AGE was 59 and the lowest score was 33. Using the formula for the
range presented above, the range can be computed as

$$X_{max} - X_{min} = 59 - 33 = 26$$

If you wish to compute the range by hand, you can easily find the highest and lowest
observed scores in your sample in the Quantiles table produced by PROC UNIVARIATE.
In Output 7.13, the Quantiles table provides the highest observed score to the right of the
heading "100% Max." Here, you can see that the highest score was indeed 59. The
Quantiles table provides the lowest score to the right of the heading "0% Min."
You can see that the lowest score in the present sample was indeed 33.
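
If you prefer a more compact report of these same quantities, PROC MEANS can print them
directly. The following sketch uses the standard MAX, MIN, and RANGE statistic keywords:

PROC MEANS DATA=D1 MAX MIN RANGE;
   VAR AGE;
   TITLE1 'JANE DOE';
RUN;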


The Interquartile Range


The interquartile range is a useful measure of variability to use when you are working with
a data set that demonstrates a dramatic departure from normality. For example, it is a useful
measure of variability to use when your sample is skewed or has more than one mode.
The formula for the interquartile range is

$$Q_3 - Q_1$$

where $Q_3$ represents the third quartile: the score at the 75th percentile of the distribution
(the score below which 75% of all scores fall), and $Q_1$ represents the first quartile: the
score at the 25th percentile of the distribution (the score below which 25% of all scores fall).

Q1, Q3, and the interquartile range are reported in the Basic Statistical Measures table and
Quantiles table produced by PROC UNIVARIATE. Output 7.14 again presents these tables,
as produced when PROC UNIVARIATE was performed on the AGE variable.
                 Basic Statistical Measures

       Location                  Variability

   Mean     46.04545    Std Deviation            6.19890
   Median   47.00000    Variance                38.42641
   Mode     47.00000    Range                   26.00000
                        Interquartile Range      6.00000


              Quantiles (Definition 5)

              Quantile      Estimate

              100% Max            59
              99%                 59
              95%                 55
              90%                 52
              75% Q3              49
              50% Median          47
              25% Q1              43
              10%                 38
              5%                  36
              1%                  33
              0% Min              33

Output 7.14. The interquartile range, third quartile, and first quartile as they
appear in the Basic Statistical Measures table and Quantiles Table produced
by the PROC UNIVARIATE analysis of AGE.


In the Quantiles table, the score at the third quartile appears to the right of the heading
"75% Q3." Output 7.14 shows that the score at the third quartile for AGE is 49. The score
at the first quartile appears to the right of the heading "25% Q1." The output shows that the
score at the first quartile for AGE is 43.
In the Basic Statistical Measures table, the interquartile range itself appears to the right of
the heading "Interquartile Range." Output 7.14 shows that the interquartile range for AGE
is 6. This makes sense because, for the current data set, the interquartile range was
computed in this way:

$$Q_3 - Q_1 = 49 - 43 = 6$$
The Semi-Interquartile Range
The semi-interquartile range is simply the interquartile range divided by 2. The formula is

$$\frac{Q_3 - Q_1}{2}$$

PROC UNIVARIATE does not compute the semi-interquartile range, but it is easily
computed from output that PROC UNIVARIATE does provide: simply divide the
interquartile range by 2. For the current output, the semi-interquartile range is computed as
follows:

$$\frac{Q_3 - Q_1}{2} = \frac{49 - 43}{2} = \frac{6}{2} = 3$$
So, the semi-interquartile range for AGE is 3.
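
Alternatively, PROC MEANS can print the interquartile range directly through the QRANGE
statistic keyword; you would then divide the printed value by 2 yourself to obtain the
semi-interquartile range. A sketch:

PROC MEANS DATA=D1 QRANGE;
   VAR AGE;
   TITLE1 'JANE DOE';
RUN;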


More Complex Measures of Variability: The Variance and Standard Deviation
Overview
The variance is defined as the average of the squared deviations of scores around their
mean. The standard deviation is the square root of the variance. This section provides
formulas for computing these two statistics.
The variance and standard deviation are more complex measures of variability, compared to
the range, interquartile range, and semi-interquartile range. In part, this is because the
variance and standard deviation are influenced by every observation in a sample, whereas
the range and its kindred measures are not. The fact that the variance and standard deviation
are influenced by every observation in the sample is one of the reasons that these two
measures are widely used in research.
This section begins by reviewing some basic terms that are relevant to the concepts to be
taught here. It discusses the difference between populations and samples because the
formula for computing a population variance is different from the formula for computing a
sample variance, and both types of formulas are presented later. This section also discusses
the difference between descriptive and inferential statistics for the same reason. Finally, it
shows how the population variance and standard deviation are computed, using conceptual
formulas.
Relevant Terms and Concepts
Populations versus samples. A population can be defined as a complete set of scores
obtained from a specified group in a particular situation. For example, suppose that you are
conducting research on intelligence tests (IQ), and you are interested in using adults in the
U.S. as your subjects. If you were able to obtain IQ scores for every adult in the U.S., you
would have obtained the entire population of scores. Needless to say, it would probably not
be possible for you to obtain scores for this entire population when conducting research. In
fact, it is assumed that it is seldom possible to study entire populations.
In contrast, a sample is a subset of scores drawn from a population. Suppose that, in order to
conduct your research, you put together a group of 200 adults from the U.S. and obtain their
IQ scores. When you work with this set of 200 IQ scores, you are working with a sample.
Researchers in the behavioral sciences almost always conduct their research with samples. A
later section shows that the formula for computing the population variance is somewhat
different from the formula for computing the sample variance.
Descriptive versus inferential statistics. Descriptive statistics are procedures for
summarizing and conveying some particular characteristic of a specific set of data. For
example, suppose that you assess the mean IQ score for your sample of 200 adults, and find
that this mean score is 105. If you use this number merely as an index of the average IQ of
this specific group of 200 subjects, and do not use it to draw inferences about the average
IQ score in the larger population, then you are using it as a descriptive statistic.
In contrast, inferential statistics are procedures that involve gathering data from a sample
of subjects, and then using the data to draw inferences about the likely characteristics of the
larger population from which the sample was drawn. For example, again suppose that you
assess the mean IQ in your sample of 200 U.S. adults, and find that this mean IQ score is
105. You conclude the following: "Based on these results from the sample, I estimate that
the average IQ score for the entire population of U.S. adults is 105." In this case, you are
using sample data to draw inferences about the likely characteristics of the population from
which they were drawn. This means that you were using the obtained IQ score as an
inferential statistic. A later section shows that the formula for computing the variance as a
descriptive statistic is somewhat different from the formula for computing the variance as an
inferential statistic.
Conceptual Formula for the Population Variance
The preceding section indicated that populations are generally assumed to be so large that it
is not possible to obtain data from every member of a population. For the moment, however,
suspend disbelief and suppose that you have conducted a study in which you did obtain data
from every member of a population.
If your data were on an interval or ratio scale, and if they formed a symmetrical distribution,
you would probably want to use the variance and standard deviation as your measures of
variability. The formula for computing the variance for a population of scores is as follows:

$$\sigma_X^2 = \frac{\Sigma(X - \mu)^2}{N}$$

where

$\sigma_X^2$ = the population variance
$X$ = the individual scores
$\mu$ = the population mean
$N$ = the number of scores.


An earlier section defined the variance as the average of the squared deviations of scores
around their mean. You can see that the preceding formula is consistent with this definition.
This formula provides the following instructions for computing the population variance:
1. For a given subject, find the deviation of that subject's observed score from the
population mean. This is represented by the subtraction term that appears in the
formula, and is again reproduced below:

$$(X - \mu)$$

2. Next, square that deviation. This is represented by the fact that the deviation term is
squared:

$$(X - \mu)^2$$

3. Next, sum these squared deviations. This is represented by the symbol $\Sigma$ to the left of the
deviation term:

$$\Sigma(X - \mu)^2$$

4. Finally, divide the sum of the squared deviations by the number of observations. This is
represented by the fact that the entire quantity is divided by $N$:

$$\sigma_X^2 = \frac{\Sigma(X - \mu)^2}{N}$$
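
To make these steps concrete, here is a small worked example with hypothetical scores
chosen only for easy arithmetic. Suppose an entire population consists of the three scores
1, 2, and 3, so that $\mu = 2$ and $N = 3$:

$$\sigma_X^2 = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = \frac{1 + 0 + 1}{3} = \frac{2}{3} \approx .67$$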

Conceptual Formula for the Population Standard Deviation

So far, this section has focused on the variance as a measure of variability. However,
researchers in the social and behavioral sciences also make heavy use of the standard
deviation as a measure of variability. Earlier, the standard deviation was defined as the
square root of the variance. Therefore, the formula for computing the standard deviation for
a population of scores is as follows:

$$\sigma_X = \sqrt{\frac{\Sigma(X - \mu)^2}{N}}$$

You can see that the preceding formula is identical to the formula for the population
variance, with two exceptions. First, note that the square root symbol has now been placed
around everything to the right of the equals sign. This conveys the fact that the standard
deviation is simply the square root of the variance. Second, you can see that the symbol for
the standard deviation is $\sigma_X$, whereas the symbol for the variance had been $\sigma_X^2$ (notice that
only the symbol for the variance has the squared sign).


In reporting the results of investigations, researchers often prefer to report the standard
deviation rather than the variance. This is because the variance is a measure of variability
expressed in squared units. As the square root of the variance, however, the standard
deviation is a measure of variability expressed in the original unit of measurement. For
example, if scores are ages measured in years, the variance is expressed in squared years,
whereas the standard deviation is again expressed in years. This generally makes the
standard deviation easier to interpret.

Variance and Standard Deviation: Three Formulas


Overview
Students in elementary statistics courses are sometimes confused when they learn about
variance because there are essentially three types of variance estimates that they must learn.
These variance estimates differ with respect to the formulas that are used to compute them
and whether they are descriptive versus inferential in nature.
Adding to the confusion is the fact that different textbooks sometimes use different names
when discussing the same type of variance estimate. Therefore, this section uses the names,
statistical symbols, and formulas that are fairly typical of most statistics textbooks in the
social and behavioral sciences.
The Population Variance and Standard Deviation
The population variance. The population variance is a parameter that is descriptive
(rather than inferential). It is a number that describes the actual variance in a population of
scores. It is appropriate to compute the population variance when you have obtained scores
from every member of the relevant population (remember that this seldom occurs in real
life).
The formula for the population variance was presented in the preceding section, and is
presented again here for purposes of comparison:
     σX² = Σ(X − μ)² / N
The population standard deviation. The population standard deviation is simply the
square root of the population variance. The formula for the population standard deviation
was also presented earlier. It, too, is presented again here for purposes of comparison:
     σX = √[ Σ(X − μ)² / N ]

The Sample Variance and Standard Deviation


The sample variance. Suppose that you find yourself in the following situation:

You have obtained data from a sample of subjects drawn from a larger population.
Suppose that this is the more typical situation in which you have data from the sample,
but not from each member of the larger population.

You wish to compute the variance in this sample.

Further, you wish this variance to describe the actual variance in the sample; you do not
wish it to estimate what the variance probably is in the larger population. In other words,
you wish to use this variance as a descriptive statistic (describing the actual variance in
the sample), and not as an inferential statistic (making inferences about what the
variance might be in the population).

In a situation such as this, you wish to compute what this book calls the sample variance. In
this book, the sample variance is defined as a descriptive statistic that describes the actual
variance in a sample; it is not an inferential statistic that estimates what the variance
probably is in the population.
The definitional formula for the sample variance is as follows:
     SX² = Σ(X − X̄)² / N

where

     SX² = the sample variance
     X   = the individual scores
     X̄   = the sample mean
     N   = the number of scores.
Notice the differences between the formula for the population variance (presented earlier)
versus the formula for the sample variance presented here. First, the symbols for the
variance statistics themselves are different (σX² versus SX²). Also, notice that the symbol
for the mean in the population formula was μ, whereas the symbol for the mean in the sample
formula is X̄.
Despite these differences, you can see that the formula for the population variance is
essentially equivalent to the formula for the sample variance. Both formulas involve the
steps of (a) taking deviations from the mean, (b) squaring and summing these deviations,
and (c) dividing by N.


The sample standard deviation. Again, the sample standard deviation is simply the square
root of the sample variance. The formula is as follows:

     SX = √[ Σ(X − X̄)² / N ]

The Estimated Population Variance and Standard Deviation


The estimated population variance. It is possible to use data from a sample to estimate the
variance and standard deviation in the population from which the sample was drawn.
However, to do this you should not use the formula for the sample variance presented in the
preceding section. This is because the formula for the sample variance tends to
underestimate the actual population variance.
To obtain a more accurate estimate of the population variance, it is necessary to change the
divisor in this formula (the part below the division line).
Below is the formula for the sample variance that was presented earlier:
     SX² = Σ(X − X̄)² / N
The divisor in this formula is N, the sample size. To compute the estimated population
variance, however, it is necessary to replace N with N − 1. The quantity N − 1 is referred to
as the degrees of freedom.
Below is the revised formula, the formula for the estimated population variance:
     sX² = Σ(X − X̄)² / (N − 1)

Notice that the divisor is now N − 1 rather than N.


The above formula also contains one additional change. The symbol for the estimated
population variance is sX². The "s" in this symbol is a lowercase "s," rather than the
uppercase "S" that was used as the symbol for the sample variance.


The estimated population standard deviation. The relationship between the estimated
population standard deviation and estimated population variance is similar to the
relationship between the sample standard deviation and sample variance. The estimated
population standard deviation is computed by taking the square root of the estimated
population variance. The formula is as follows:

     sX = √[ Σ(X − X̄)² / (N − 1) ]

Using PROC MEANS to Compute the Variance and Standard Deviation
Overview
The variance and standard deviation can be computed using either PROC UNIVARIATE or
PROC MEANS. Since the output of PROC MEANS is somewhat easier to read, the
following sections show how to use PROC MEANS to compute the sample variance,
population variance, and estimated population variance, respectively.
Computing the Sample Variance and Standard Deviation
Use the directions in this section for situations in which you are using the variance as a
descriptive statistic. That is, they are appropriate for situations in which you wish to
describe the variance as it actually is in the sample, and you are not interested in estimating
what the variance probably is in the larger population.
Below is the syntax for the PROC step that uses PROC MEANS to compute the sample
variance and sample standard deviation (along with some additional statistics):
PROC MEANS DATA=data-set-name VARDEF=N
           N MEAN STD VAR MIN MAX;
   VAR variable-list;
   TITLE1 'your-name';
RUN;

The PROC MEANS statement is the first statement in the preceding PROC step. It includes
the following keywords and options:

VARDEF=N
   requests that the divisor in the computation of the variance and standard deviation be
   equal to N, the sample size. This ensures that PROC MEANS will compute the sample
   variance and standard deviation, as opposed to the estimated population variance and
   standard deviation (which will be discussed in a later section).
N
prints the sample size (the number of valid observations).
MEAN
prints the sample mean.
STD
prints the sample standard deviation.
VAR
prints the sample variance.
MIN
prints the lowest observed score in the sample.
MAX
prints the highest observed score in the sample.
Here are the statements used to compute the sample variance and standard deviation for the
AGE variable from the political donation study:
PROC MEANS DATA=D1 VARDEF=N
           N MEAN STD VAR MIN MAX;
   VAR AGE;
   TITLE1 'JANE DOE... DIVISOR = N';
RUN;

Output 7.15 presents the results generated by the preceding statements.


                         JANE DOE... DIVISOR = N

                           The MEANS Procedure

                         Analysis Variable : AGE

  N          Mean       Std Dev      Variance       Minimum       Maximum
-----------------------------------------------------------------------------
 22    46.0454545     6.0563811    36.6797521    33.0000000    59.0000000
-----------------------------------------------------------------------------

Output 7.15. The sample standard deviation and variance as produced by PROC MEANS with the
option VARDEF=N.

You can see that the sample size, mean, minimum value, and maximum value for the AGE
variable reported in Output 7.15 are identical to the values reported when AGE was
analyzed using PROC UNIVARIATE earlier in this chapter.


Below the heading Std Dev is the sample standard deviation for AGE, which in this case
is 6.056.
Below the heading Variance is the sample variance: 36.680. You can see that the standard
deviation and variance in Output 7.15 are not identical to those contained in the output from
PROC UNIVARIATE, presented earlier. This is because, by default, PROC UNIVARIATE
prints the estimated population standard deviation and variance, rather than the sample
standard deviation and variance. Later, this section shows how to modify the PROC
MEANS statement so that it prints the estimated population standard deviation and variance.
Computing the Population Variance and Standard Deviation
To compute the population variance and standard deviation, you should write your PROC
MEANS statement according to the directions provided in the preceding section. That is,
you should write your PROC MEANS statement the same way that you would in order to
compute the sample variance and standard deviation.
This is because the formula for computing the population variance is essentially equivalent
to the formula for the sample variance: both formulas have N as the divisor (rather than
N − 1). In the previous section, you learned that you can request that N be used as the
divisor by including the option VARDEF=N in the PROC MEANS statement. This means
that you should include the VARDEF=N option in the PROC MEANS statement regardless
of whether you wish to compute the sample variance or the population variance.
Remember that these directions are appropriate only if every member of the population is
included in the data set that you are going to analyze. In this situation, you are computing
the variance as a descriptive statistic because you are simply describing the population
variance as it is. These directions do not apply if you are analyzing data from a sample
and are using those data to estimate what the variance probably is in the larger
population. In this second situation, you are using the variance as an inferential statistic.
In this inferential situation, you should follow the directions provided in the following
section.
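For example, here is a minimal sketch, assuming (hypothetically) that the data set D1
contained AGE scores for every member of the relevant population; the title text is likewise
hypothetical. Note that the statements are identical to those used for the sample variance;
only the interpretation of the results differs:

PROC MEANS DATA=D1 VARDEF=N
           N MEAN STD VAR MIN MAX;
   VAR AGE;
   TITLE1 'JANE DOE... POPULATION VARIANCE';
RUN;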
Computing the Estimated Population Variance and Standard Deviation
The directions in this section are appropriate for situations in which you are using the
variance as an inferential statistic. That is, they are appropriate for situations in which you
have obtained data from a sample, and you wish to use those data to estimate what the
variance probably is in the larger population from which the sample was drawn.


Below is the syntax for the SAS statements that request the estimated population variance
and standard deviation:

PROC MEANS DATA=data-set-name VARDEF=DF
           N MEAN STD VAR MIN MAX;
   VAR variable-list;
   TITLE1 'your-name';
RUN;

You can see that the only difference involves the VARDEF option in the PROC MEANS
statement. When you requested the sample variance, you specified VARDEF=N. Now that you
are instead requesting the estimated population variance, the option reads VARDEF=DF.
The "DF" here stands for "degrees of freedom." Remember that, in this context, the degrees
of freedom are equal to N − 1. This means that, when you request VARDEF=DF, SAS uses
N − 1 as the divisor in computing the variance and standard deviation.
Here are the statements used to request the estimated population variance and standard
deviation for the AGE variable:
PROC MEANS DATA=D1 VARDEF=DF
           N MEAN STD VAR MIN MAX;
   VAR AGE;
   TITLE1 'JANE DOE... DIVISOR = N-1';
RUN;

Output 7.16 presents the results of this PROC MEANS statement.


                        JANE DOE... DIVISOR = N-1

                           The MEANS Procedure

                         Analysis Variable : AGE

  N          Mean       Std Dev      Variance       Minimum       Maximum
-----------------------------------------------------------------------------
 22    46.0454545     6.1989037    38.4264069    33.0000000    59.0000000
-----------------------------------------------------------------------------

Output 7.16. The estimated population standard deviation and variance as produced by PROC
MEANS with the option VARDEF=DF.

Output 7.16 shows that the estimated population standard deviation for AGE rounds to
6.199 and that the estimated population variance is 38.426. Notice that the estimated
population variance (38.426) is larger than the sample variance (36.680) that was reported
in Output 7.15. This is to be expected, since the divisor used in computing the estimated
population variance (N − 1) is a smaller number than the divisor used in computing the
sample variance (N). Dividing by a smaller number results in a larger value for the
estimated population variance.
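As a quick check (a worked illustration added here, using the values from Outputs 7.15 and
7.16): the two estimates share the same sum of squared deviations and differ only in their
divisors, so multiplying the sample variance by N/(N − 1) reproduces the estimated
population variance:

     sX² = SX² × N/(N − 1) = 36.6798 × (22/21) ≈ 38.4264
     sX  = √38.4264 ≈ 6.1989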


Conclusion
This chapter has shown you how to perform a variety of descriptive procedures. You have
learned (a) how to compute measures of central tendency to determine the location of your
sample on a continuum, (b) how to compute measures of variability to determine the
dispersion of your scores, and (c) how to interpret a stem-and-leaf plot to identify the shape
of a distribution. After creating a data set, you should generally perform simple procedures
such as these before moving on to more sophisticated analyses.
But what if your data are not yet ready to be analyzed with more sophisticated procedures?
For example, what if you have administered a depression-screening questionnaire to
subjects, have entered their responses, and now need to sum their responses to the individual
items to create a single score that represents their level of depression? In situations such as
these, it is necessary to perform some type of data manipulation: operations in which you
(or a computer application) transform existing variables or create new variables from
existing variables.
When conducting research in the behavioral sciences and education, it is commonplace that
researchers need to perform transformations on their raw data prior to performing more
sophisticated statistical procedures. Fortunately, SAS makes it easy to do this through the
use of simple mathematical equations, IF-THEN statements, and other operations. The
next chapter shows how to use these tools to recode reversed items, create total scores
from individual scale items, create data subsets, and perform other forms of data
manipulation.

Chapter 8: Creating and Modifying Variables and Data Sets
Introduction ...........................................................217
   Overview ............................................................217
   Why It Is Often Necessary to Modify Variables and Data Sets ........217
Example 8.1: An Achievement Motivation Study ...........................218
   The Study ...........................................................218
   Data Set to Be Analyzed .............................................220
   SAS Program to Read the Raw Data ....................................221
Using PROC PRINT to Create a Printout of Raw Data ......................222
   Why You Need to Use PROC PRINT ......................................222
   Using PROC PRINT to Print the Current Data Set ......................222
   A Common Misunderstanding Regarding PROC PRINT ......................225
Where to Place Data Manipulation and Data Subsetting Statements ........225
   Overview ............................................................225
   Placing These Statements Immediately Following the INPUT Statement ..225
   Placing the Statements Immediately After Creating a New Data Set ....226
Basic Data Manipulation ................................................228
   Overview ............................................................228
   Creating Duplicate Variables with New Variable Names ................228
   Creating New Variables from Existing Variables ......................230
   Recoding Reversed Variables .........................................233
   Where Should the Recoding Statements Go? ............................234
Recoding a Reversed Item and Creating a New Variable for the
   Achievement Motivation Study ........................................235
   The Scale ...........................................................235
   Creating the New SAS Variable .......................................235
   Recoding the Reversed Item ..........................................236
   The SAS Program .....................................................236
   The SAS Output ......................................................237
   The SAS Log .........................................................238
Using IF-THEN Control Statements .......................................239
   Overview ............................................................239
   Creating a New AGE Variable .........................................239
   Comparison Operators ................................................241
   The Syntax of the IF-THEN Statement .................................242
   Using ELSE Statements ...............................................243
   Using the Conditional Statements AND and OR .........................244
   Working with Character Variables ....................................245
Data Subsetting ........................................................248
   Overview ............................................................248
   The Syntax of Data Subsetting Statements ............................248
   An Example ..........................................................249
   An Example with Multiple Subsets ....................................251
   Using Comparison Operators and the Conditional Statements AND and OR 252
   Eliminating Observations That Have Missing Data on Some Variables ...252
Combining a Large Number of Data Manipulation and Data Subsetting
   Statements in a Single Program ......................................256
   Overview ............................................................256
   A Longer SAS Program ................................................256
   Some General Guidelines .............................................259
Conclusion .............................................................260


Introduction
Overview
This chapter shows you how to modify a data set by creating new variables and modifying
existing variables. It shows how to write SAS statements that contain simple mathematical
formulas, and how to use IF-THEN control statements. It shows how to use subsetting IF
statements to analyze data for various subgroups within a data set. Finally, it shows how to
eliminate unwanted observations from a data set so that analyses are performed only on
subjects that have no missing data.
Why It Is Often Necessary to Modify Variables and Data Sets
Very often researchers obtain a data set in which the data are not yet in a form appropriate
for analysis. For example, imagine that you are conducting research on job satisfaction.
Perhaps you wish to compute the correlation between subject age and a single index of job
satisfaction. You administer a 10-item questionnaire that assesses job satisfaction to 200
employees, and enter their responses to the 10 individual questionnaire items. You now
need to add together each subjects response to those 10 items to arrive at a single composite
score that reflects that subjects overall level of satisfaction. This computation is very easy
to perform within the SAS program by including a number of data manipulation statements.
Data manipulation statements are SAS statements that transform the data set in some way.
They may be used to recode reversed variables, create new variables from existing variables,
and perform a wide range of other tasks.
At the same time, it is possible that your original data set contains observations that you do
not wish to include in your analyses. Perhaps the questionnaire was administered to hourly
as well as nonhourly employees, and you wish to analyze only data from the hourly
employees. In addition, you may wish to analyze data only from subjects who have usable
data on all of the study's variables. In these situations, you may include data subsetting
statements to eliminate the unwanted subjects from the sample. Data subsetting statements
are SAS statements that eliminate unwanted observations from a sample, so that only a
specified subgroup is included in the resulting data set.
The SAS programming language is so comprehensive and flexible that it can perform
virtually any type of data manipulation task imaginable. A complete treatment of these
capabilities would easily fill a book, and is therefore beyond the scope of this text.
However, this chapter reviews some basic SAS statements that can be used to perform a
wide variety of research tasks (particularly in research that involves questionnaire data).
Those who need additional help should consult Hatcher and Stepanski (1994).


Example 8.1: An Achievement Motivation Study


The Study
The use of data manipulation and data subsetting statements is illustrated here by analyzing
data from a fictitious study dealing with achievement motivation. Briefly, achievement
motivation is the desire to exceed some standard of performance. People who score high on
achievement motivation try to improve on their past performance, take moderate risks, and
tend to set challenging goals for themselves.
Suppose that you have constructed a 5-item scale designed to assess achievement motivation
in a sample of college students. The scale consists of five statements, and a subject who
completes the scale uses a 6-point response format to indicate the extent to which he or she
agrees with each statement. The scale is designed so that subjects who possess a great deal
of achievement motivation should tend to agree with items 1, 2, 3, and 5, but disagree with
item 4.
Your 5-item scale appears on the following short questionnaire. The questionnaire also
contains a few additional items to assess demographic information (sex, age, and academic
major).


Directions: Please indicate the extent to which you agree or disagree
with each of the following statements. You will do this by circling the
appropriate number to the left of that statement. The following format
shows what each response number stands for:

     6 = Agree Very Strongly
     5 = Agree Strongly
     4 = Agree Somewhat
     3 = Disagree Somewhat
     2 = Disagree Strongly
     1 = Disagree Very Strongly

For example, if you "Disagree Very Strongly" with the first statement,
circle the "1" to the left of that statement. If you "Agree Somewhat,"
circle the "4," and so forth.

   ----------------
     Circle Your
      Response
   ----------------
   1 2 3 4 5 6   1. I try very hard to improve on my performance in
                    classes.

   1 2 3 4 5 6   2. I take moderate risks to get ahead in classes.

   1 2 3 4 5 6   3. I try to perform better than my fellow students.

   1 2 3 4 5 6   4. I try to achieve as little as possible in
                    classes.

   1 2 3 4 5 6   5. I do my best work when my class assignments are
                    fairly difficult.

   6. What is your sex?
      ______ Female (F)
      ______ Male (M)

   7. What is your age in years? _______________

   8. What is your major?
      ______ Arts and Sciences (1)
      ______ Business (2)
      ______ Education (3)

Data Set to Be Analyzed


You administer the questionnaire to nine college students. Their responses are summarized
in Table 8.1.
Table 8.1
Data from the Achievement Motivation Study
______________________________________________________________

               Agree-Disagree
                 Questions
             __________________
Subject      Q1  Q2  Q3  Q4  Q5    Sex    Age    Major
______________________________________________________________
1. Marsha     6   5   5   2   6     F      22      1
2. Charles    2   1   2   5   2     M      25      1
3. Jack       5   .   4   3   6     M      30      1
4. Cathy      5   6   6   .   6     F      41      2
5. Emmett     4   4   5   2   5     M      22      2
6. Marie      5   6   6   2   6     F      20      2
7. Cindy      5   5   6   1   5     F      21      3
8. Susan      2   3   1   5   2     F      25      3
9. Fred       4   5   4   2   5     M      23      3
______________________________________________________________

As is the case with most tables of data in this guide, Table 8.1 follows the conventions that
the horizontal rows of the table represent different subjects, and the vertical columns of the
table represent different variables.
Below the heading Q1 are responses to the first achievement motivation statement (I try
very hard to improve on my performance in classes). For example, with respect to Q1, you
can see that

Marsha circled a 6 (for Agree Very Strongly)

Charles circled a 2 (for Disagree Strongly)

Jack circled a 5 (for Agree Strongly).

Below the heading Q2 are responses to the second achievement motivation statement (I
take moderate risks to get ahead in classes). Responses to the remaining achievement
motivation questions appear below the headings Q3, Q4, and Q5, respectively.
Subject sex is recorded below the heading Sex (with the value F representing females
and the value M representing males), and subject age appears below Age.


Values below the heading Major indicate whether a given subject is majoring in the arts
and sciences versus business versus education. The value 1 represents subjects majoring
in the arts and sciences, the value 2 represents subjects majoring in business, and the
value 3 represents subjects majoring in education.
SAS Program to Read the Raw Data
Here is the program that creates a SAS data set from the data appearing in Table 8.1:
 1     OPTIONS LS=80 PS=60;
 2     DATA D1;
 3        INPUT SUB_NUM
 4              Q1
 5              Q2
 6              Q3
 7              Q4
 8              Q5
 9              SEX $
10              AGE
11              MAJOR;
12     DATALINES;
13     1 6 5 5 2 6  F  22  1
14     2 2 1 2 5 2  M  25  1
15     3 5 . 4 3 6  M  30  1
16     4 5 6 6 . 6  F  41  2
17     5 4 4 5 2 5  M  22  2
18     6 5 6 6 2 6  F  20  2
19     7 5 5 6 1 5  F  21  3
20     8 2 3 1 5 2  F  25  3
21     9 4 5 4 2 5  M  23  3
22     ;

Some notes regarding the preceding DATA step:

The INPUT statement appears on lines 3–11.

The subjects' actual names were not included in the data set. However, each subject was
given a unique subject number (from 1 through 9) for purposes of identification. These
subject numbers are contained in the SAS variable SUB_NUM. The variable name
SUB_NUM appears on line 3 of the program.

The SAS variables Q1–Q5 contain subject responses to achievement motivation items 1–5,
respectively. These variable names appear in the INPUT statement on lines 4 through 8.

Line 9 shows that the SAS variable name SEX was used to represent subject sex (this was
the only character variable in the data set, as is indicated by the $ that follows the name
SEX).

Lines 10–11 show that the SAS variable name AGE was used to represent subject age, and
MAJOR was used to represent subject major.


The data lines themselves appear on lines 13–21. You can see that these data lines
correspond to the data appearing in Table 8.1. Remember that the first column contains the
unique subject numbers assigned to each subject (this variable was given the SAS variable
name SUB_NUM).

Using PROC PRINT to Create a Printout of Raw Data


Why You Need to Use PROC PRINT
Before discussing data manipulation and data subsetting, this section first shows you how to use
the PRINT procedure. The PRINT procedure (PROC PRINT) is useful for generating a printout
of your raw data (i.e., a printout of your data as they appear in an internal SAS data set).
You can use PROC PRINT to review each subject's score on each variable in your data set
(or on any subset of variables that you choose). PROC PRINT is useful for a wide variety of
purposes, but this section focuses on just two:

After you have created a SAS data set, you can use PROC PRINT to print the contents of
that data set to verify that SAS read your data the way that you intended.

After you have used data manipulation statements to create new SAS variables, you can
use PROC PRINT to print the values of that variable and verify that it was created in the
manner that you intended.

Using PROC PRINT to Print the Current Data Set


Printing all variables in a data set. Here is the syntax for the PROC step that prints the
raw data for all variables in your data set:
PROC PRINT DATA=data-set-name;
TITLE1 ' your-name ';
RUN;
Here are the PRINT procedure statements that print the raw data for the achievement
motivation study described above. To illustrate where the statements should be placed in the
program, the last few data lines from the preceding data set are also reproduced below:
[The first part of the DATA step appears here]
7 5 5 6 1 5  F  21  3
8 2 3 1 5 2  F  25  3
9 4 5 4 2 5  M  23  3
;
PROC PRINT DATA=D1;
   TITLE1 'JANE DOE';
RUN;


Output 8.1 presents the PRINT results generated by the preceding statements.
                              JANE DOE

 Obs    SUB_NUM    Q1    Q2    Q3    Q4    Q5    SEX    AGE    MAJOR
  1        1        6     5     5     2     6     F      22       1
  2        2        2     1     2     5     2     M      25       1
  3        3        5     .     4     3     6     M      30       1
  4        4        5     6     6     .     6     F      41       2
  5        5        4     4     5     2     5     M      22       2
  6        6        5     6     6     2     6     F      20       2
  7        7        5     5     6     1     5     F      21       3
  8        8        2     3     1     5     2     F      25       3
  9        9        4     5     4     2     5     M      23       3

Output 8.1. Results of PROC PRINT performed on initial data set, achievement motivation
study.

For the most part, Output 8.1 duplicates the data that appeared in Table 8.1. The most
obvious difference is the fact that subject names do not appear in Output 8.1; instead, each
data line is identified with an observation number.
These observation numbers appear in the column headed Obs. The first entry in the
column headed Obs is 1 for observation #1, the second entry is 2 for observation
#2, and so forth. This Obs variable was generated by SAS.
When the observations in a data set are individual subjects (as is the case with the current
achievement motivation study), the observation numbers are essentially subject numbers. This
means that observation #1 consists of data for subject #1 (Marsha, from Table 8.1),
observation #2 consists of data for subject #2 (Charles), and so forth. This can be easily
verified by comparing the observation numbers in the first column to the values of the subject
number variable (SUB_NUM), which appears as the second column in the output.
The remaining columns of Output 8.1 correspond to the columns of Table 8.1 presented
earlier. Specifically:
The column headed Q1 presents subject responses to question 1, the first achievement
motivation item. In the same way, columns Q2-Q5 present responses to the remaining
achievement motivation items.
The column headed SEX identifies the sex for each subject in the study.
The column headed AGE represents subject age.
The column headed MAJOR identifies the area that each subject majored in. You will
remember that, in the way that these data were entered, the value 1 identifies arts and
sciences majors, the value 2 identifies business majors, and the value 3 identifies
education majors.
Printing a subset of variables in a data set. Sometimes you will wish to print raw data for
just a few variables in a data set. When this is the case, you should use the VAR statement
with a PROC PRINT statement. In the VAR statement, you list just the names of the
variables that you want to print. Here is the syntax:
PROC PRINT DATA=data-set-name;
VAR variable-list ;
TITLE1 ' your-name ';
RUN;
For example, the following will cause PROC PRINT to print raw values for just the SEX
and AGE variables:
PROC PRINT DATA=D1;
VAR SEX AGE;
TITLE1 'JANE DOE';
RUN;
Output 8.2 presents the results generated by the preceding statements:
     JANE DOE

Obs    SEX    AGE
 1      F      22
 2      M      25
 3      M      30
 4      F      41
 5      M      22
 6      F      20
 7      F      21
 8      F      25
 9      M      23

Output 8.2. Results of PROC PRINT in which only the SEX and AGE variables
were listed in the VAR statement.

You can see that the values for SEX and AGE in Output 8.2 are identical to the values
appearing in Output 8.1.


A Common Misunderstanding Regarding PROC PRINT


Students learning SAS often misunderstand PROC PRINT: they sometimes assume that a
SAS program must contain PROC PRINT in order to generate a paper printout of their
results. This is not the case. PROC PRINT simply generates a printout of your raw data
(i.e., subjects' individual scores on the variables in your data set). If you have run some
other SAS procedure such as PROC MEANS or PROC FREQ, you do not have to include
PROC PRINT in your program to create a paper printout of the results generated by those
procedures.

Where to Place Data Manipulation and Data Subsetting Statements
Overview
In general, data manipulation and subsetting statements should appear only within a SAS
DATA step. Remember that a DATA step begins with the DATA statement, and ends when
SAS encounters a PROC (procedure) statement. This means that, if you prepare a DATA
step, end the DATA step with a PROC, and then place some manipulation or subsetting
statements immediately after the PROC, an error results.
To avoid this error (and keep things simple), place your data manipulation and data
subsetting statements either:

immediately following the INPUT statement, or

immediately following the creation of a new data set.

Placing These Statements Immediately Following the INPUT Statement
This guideline is illustrated by referring to the study on achievement motivation. Suppose
that you prepared the following SAS program to analyze data obtained in your study.


In the following program, lines 13 and 14 indicate where you could place data manipulation
or data subsetting statements in that program:

 1     OPTIONS LS=80 PS=60;
 2     DATA D1;
 3        INPUT Q1
 4              Q2
 5              Q3
 6              Q4
 7              Q5
 8              SEX $
 9              AGE
10              MAJOR;
11
12
13        /*place data manipulation statements and
14          data subsetting statements here*/
15
16     DATALINES;
17     6 5 5 2 6  F  22  1
18     2 1 2 5 2  M  25  1
19     5 . 4 3 6  M  30  1
20     5 6 6 . 6  F  41  2
21     4 4 5 2 5  M  22  2
22     5 6 6 2 6  F  20  2
23     5 5 6 1 5  F  21  3
24     2 3 1 5 2  F  25  3
25     4 5 4 2 5  M  23  3
26     ;
27
28     PROC MEANS DATA=D1;
29     RUN;

Placing the Statements Immediately After Creating a New Data Set


A new data set may be created at virtually any point in a SAS program (even after PROCs
have been requested).
SAS programmers often create a new data set so that, initially, it is a duplicate of an existing
data set (perhaps the one created with a preceding INPUT statement). If data manipulation
or data subsetting statements follow the creation of this new data set, the new data set
displays the modifications requested by those statements.
To create a duplicate of an existing data set, use the following syntax:
DATA new-data-set-name;
   SET existing-data-set-name;


Here is an example of statements that you could use:


DATA D2;
SET D1;
The preceding lines tell SAS to create a new data set named D2, and make this new data set
a duplicate of the existing data set, D1. Now that a new data set has been created, you can
write as many data manipulation and subsetting statements as you like. However, once you
write a PROC statement, that ends the DATA step, and no more manipulation or subsetting
statements may be written beyond that point (unless you create yet another data set, perhaps
calling it something like D3, later in the program).
Here is an example of how a program might have been written so that the manipulation and
subsetting statements follow the creation of a new data set:

 1     OPTIONS LS=80 PS=60;
 2     DATA D1;
 3        INPUT Q1
 4              Q2
 5              Q3
 6              Q4
 7              Q5
 8              SEX $
 9              AGE
10              MAJOR;
11     DATALINES;
12     6 5 5 2 6  F  22  1
13     2 1 2 5 2  M  25  1
14     5 . 4 3 6  M  30  1
15     5 6 6 . 6  F  41  2
16     4 4 5 2 5  M  22  2
17     5 6 6 2 6  F  20  2
18     5 5 6 1 5  F  21  3
19     2 3 1 5 2  F  25  3
20     4 5 4 2 5  M  23  3
21     ;
22     PROC MEANS DATA=D1;
23     RUN;
24
25     DATA D2;
26        SET D1;
27
28        /*place data manipulation statements and
29          data subsetting statements here*/
30
31     PROC MEANS DATA=D2;
32     RUN;
Some notes about the preceding program:

The DATA statement on line 2 tells SAS to give this data set the name D1.

Lines 2–21 create the initial data set in the usual way.

Lines 22–23 cause PROC MEANS to be performed on the initial data set.

Lines 25–26 cause a new DATA step to begin. Line 25 tells SAS to give this new data set
the name D2, and line 26 tells SAS to make D2 an exact copy of D1.

Any data manipulation or subsetting statements that appear in lines 28–29 affect only the
new data set, D2.

The PROC MEANS statement in line 31 requests that some simple descriptive statistics be
computed. By default, PROC MEANS always computes the mean, standard deviation, and a
few other descriptive statistics. It is clear that these statistics would be computed from the
data in the new data set D2, because of the DATA=D2 option that appears in the PROC
MEANS statement. If the statement had instead specified DATA=D1, the analyses would
have instead been performed on the original data set.

Basic Data Manipulation


Overview
Data manipulation involves performing some type of transformation on one or more
variables within a DATA step. This section discusses several types of transformations that
are frequently required in research, such as creating duplicate variables with new variable
names, creating new variables from existing variables, and recoding reversed items.
Creating Duplicate Variables with New Variable Names
Suppose you gave a variable a certain name when it was input, but now wish the variable to
have a different, perhaps more meaningful name when it appears later in the SAS program
or in SAS output. One way to rename an existing variable is to create a new variable that
is identical to the existing variable, and assign a new, more meaningful name to this new
variable.
This can easily be accomplished by writing a statement that uses the following syntax:

   new-variable-name = existing-variable-name;

For example, in the achievement motivation study, you used the SAS variable name SEX
to represent the subject's sex. Suppose for a moment that, at some point in the program, you
wish to use the SAS variable name GENDER instead of SEX. This could be done with
the following statement:
GENDER = SEX;
The preceding statement tells SAS to create a new variable named GENDER, and make it
a duplicate of SEX. This means that, if a given subject had a value of F on SEX, she
will also have a value of F on GENDER; if a given subject had a value of M on SEX,
he will also have a value of M on GENDER.
The key, of course, is to make sure that you place statements such as these only within a
DATA step. Below is an example of how this could be done with the achievement
motivation program (to conserve space, only the last few data lines from the program are
reproduced below):
       [The first part of the DATA step appears here]
18     7 5 5 6 1 5  F  21  3
19     8 2 3 1 5 2  F  25  3
20     9 4 5 4 2 5  M  23  3
21     ;
22     PROC MEANS DATA=D1;
23     RUN;
24
25     DATA D2;
26        SET D1;
27
28     GENDER = SEX;
29
30     PROC FREQ DATA=D2;
31        TABLES GENDER;
32     RUN;

Some notes about the preceding program:

The data lines end with line 20, and lines 22–23 cause PROC MEANS to be performed on
the initial data set.

Lines 25–26 cause a new DATA step to begin. Line 25 tells SAS to give this new data set
the name D2, and line 26 tells SAS to make D2 a duplicate of D1.

Line 28 tells SAS to create a new variable called GENDER in the new data set D2, and
to make it a duplicate of the existing variable, SEX.

Lines 30–32 cause PROC FREQ to be performed on GENDER. Notice that the
DATA=D2 part of line 30 specifies that the analysis should be performed using the data
set named D2.

Conforming to the rules for SAS variable names. When you create a new variable name,
remember to make it conform to the rules for SAS variable names discussed in Chapter 4,
"Data Input" (e.g., begin the new variable's name with a letter). Also, note that each
statement that creates a duplicate of an existing variable must end with a semicolon.
Duplicating variables versus renaming variables. Technically, it should be clear that the
previous program did not really rename the variable initially named SEX. What it
actually did was create a duplicate of the variable SEX, and assign a new name to the
duplicate variable. This means that the resulting data set contained both the original
variable under its old name (SEX) and the duplicate variable under its new name
(GENDER).
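As an aside (an addition here, not part of this book's examples): if you truly want the new
data set to contain GENDER but not SEX, SAS's RENAME= data set option can rename the
variable as the data set is copied. A minimal sketch:

DATA D2;
   SET D1 (RENAME=(SEX=GENDER));  /* D2 contains GENDER; SEX no longer exists */
RUN;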
Creating New Variables from Existing Variables
A job satisfaction example. It is often necessary to perform mathematical operations on
existing variables, and use the results to create a new variable. For example, suppose that
you created the following 3-item scale designed to measure job satisfaction:
Directions: Please indicate the extent to which you agree or disagree
with each of the following statements. You will do this by circling the
appropriate number to the left of that statement. The following format
shows what each response number stands for:

     6 = Agree Very Strongly
     5 = Agree Strongly
     4 = Agree Somewhat
     3 = Disagree Somewhat
     2 = Disagree Strongly
     1 = Disagree Very Strongly

   ----------------
     Circle Your
      Response
   ----------------
   1 2 3 4 5 6   1. I am satisfied with my job.

   1 2 3 4 5 6   2. My job satisfies most of my work-related needs.

   1 2 3 4 5 6   3. I like my job.

Subjects respond to each of these three items using a 6-point response format in which 1 =
Disagree Very Strongly and 6 = Agree Very Strongly (similar to the achievement
motivation scale presented earlier).
Suppose that you administer this 3-item scale to 100 subjects and write a SAS program to
analyze their responses. You use the SAS variable name Q1 to represent responses to
item 1, Q2 to represent responses to item 2, and Q3 to represent responses to item 3.
Suppose that, for each subject, you would like to compute a single score to represent overall
level of job satisfaction. For a given subject, this single score will simply be the sum of his
or her responses to items 1, 2, and 3 from the scale. The following formula shows how it
would be computed:
Overall job satisfaction = Q1 + Q2 + Q3


Scores on this overall job satisfaction variable must fall somewhere within the following
range:

The lowest possible score would be a score of 3. This score would be obtained if the
subject circled 1 for each of the three items (for Disagree Very Strongly).

The highest possible score would be a score of 18. This score would be obtained if the
subject circled 6 for each of the three items (for Agree Very Strongly).

Obviously, with this scale, higher scores indicate higher levels of job satisfaction.
There are at least two ways that you can create this single overall job satisfaction variable. The
more difficult way would be to pull out your pocket calculator, look at a given subject's
responses to Q1, Q2, and Q3, and add these three responses together. The easier way would be
to write a simple SAS statement that does this work for you. The following section shows how.
Creating new variables using simple formulas. To create a new variable by performing
mathematical operations on existing variables, use the following syntax:
   new-variable-name = formula-including-existing-variables;

For example, assume that you have written a SAS program that inputs responses to the three
job satisfaction questions, and have given them the SAS variable names Q1, Q2, and Q3.
You can now write a SAS statement within a DATA step to create the single overall job
satisfaction variable that you need. The following statement does this:
SATIS = Q1 + Q2 + Q3;
The preceding statement tells SAS to create a new variable named SATIS. A given
subject's score on SATIS should be equal to the sum of Q1, Q2, and Q3. You can now use
the variable SATIS in subsequent analyses: you can compute your subjects' mean score on
SATIS, correlate SATIS with other variables, and perform a wide variety of other analyses.
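For instance, here is a minimal sketch (the data set names D1 and D2 are hypothetical
stand-ins for this job satisfaction study) that creates SATIS and then computes descriptive
statistics for it:

DATA D2;
   SET D1;
   SATIS = Q1 + Q2 + Q3;   /* overall job satisfaction score */

PROC MEANS DATA=D2;
   VAR SATIS;
RUN;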
When creating new variables this way, be sure that all variables on the right side of the
equals sign are existing variables. This means that they already exist in the data set,
either because they are listed in the INPUT statement, or because they were created with
earlier data manipulation statements.
Symbols for arithmetic operators. The preceding statement that created the new variable
SATIS used two arithmetic operators: the equals sign (=) and the plus sign (+). You use
arithmetic operators to tell SAS about the types of mathematical operations you wish to
perform on your data. Here is a list of the symbols for these arithmetic operators to use
when writing SAS statements:
+ Addition
- Subtraction
* Multiplication
/ Division
= Equals
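For illustration, here is a small hedged sketch (the variable names TOTAL, DIFF, PROD, and
HALF are hypothetical, not part of the study) showing each operator inside a DATA step:

DATA D2;
   SET D1;
   TOTAL = Q1 + Q2;    /* addition; "=" assigns the result */
   DIFF  = Q1 - Q2;    /* subtraction                      */
   PROD  = Q1 * Q2;    /* multiplication                   */
   HALF  = Q1 / 2;     /* division                         */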


Using parentheses. When you write formulas, make heavy use of parentheses.
Remember that operations enclosed within parentheses are performed first, and
operations outside the parentheses are performed later. Using parentheses ensures that
operations are performed in the sequence that you expect.
For example, suppose that, with the preceding study on job satisfaction, you want each
subjects score on SATIS to be equal to the average of his or her responses to Q1, Q2, and
Q3 (rather than the sum of his or her responses to Q1, Q2, and Q3). The following statement
would compute this average:
SATIS = (Q1 + Q2 + Q3) / 3;
This statement tells SAS to (a) create a new variable named SATIS; (b) in creating this new
variable, begin by adding together Q1, Q2, and Q3; and (c) divide this sum by 3. The
resulting quotient is that subject's score on SATIS.


By using parentheses, you ensure that the addition is performed first, and the division is
performed second.
In contrast, consider what would have happened if you had instead written the statement in
the following way, without parentheses:
SATIS = Q1 + Q2 + Q3 / 3;
In this case, SAS would have begun by dividing each subject's score on Q3 by 3. The
resulting quotient would then have been added to Q1 and Q2. Obviously, this would have
resulted in a very different score for SATIS.
Why the difference? Because division has priority over addition as an arithmetic operator.
When no parentheses are included in a formula, division is performed before addition is
performed.
When an expression contains more than one operator, SAS follows a set of rules that
determine which operations are performed first, which are performed second, and so forth.
Here are the rules that pertain to mathematical operators (+, -, /, and *):

Multiplication and division operators (* and /) have equal priority, and they are performed
first.

Addition and subtraction operators (+ and -) have equal priority, and they are performed
second.

To protect yourself from errors, use parentheses when writing formulas. Because operations
that are included inside parentheses are performed first, using parentheses gives you control
over the sequence in which operations are executed.
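As a quick numeric illustration (hypothetical values, added here for clarity), suppose a
subject answered Q1 = 4, Q2 = 5, and Q3 = 6:

SATIS = (Q1 + Q2 + Q3) / 3;   /* (4 + 5 + 6) / 3 = 15 / 3 = 5      */
SATIS = Q1 + Q2 + Q3 / 3;     /*  4 + 5 + (6 / 3) = 4 + 5 + 2 = 11 */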


Recoding Reversed Variables


Very often a questionnaire will contain a number of "reversed" items. A reversed item is a
question stated so that its meaning is the opposite of the other items in that scale. For
example, consider the following (somewhat revised) items from the job satisfaction scale:
   ----------------
     Circle Your
      Response
   ----------------
   1 2 3 4 5 6   1. I am satisfied with my job.

   1 2 3 4 5 6   2. My job satisfies most of my work-related needs.

   1 2 3 4 5 6   3. I hate my job.

Items 1 and 2 from the preceding scale are stated in the same way that they were stated
when the scale was first presented a few pages earlier. Item 3, however, has been changed;
item 3 is now a reversed item. In the original version of this scale, item 3 stated "I like my
job." In the current version of the scale, item 3 now states the opposite: "I hate my job."
In a sense, all of the questions in this 3-item scale are measuring the same thing: whether
the subject feels satisfied with his or her job. Items 1 and 2 are stated so that, the more
strongly you agree with the statement, the higher is your level of job satisfaction (remember
that, with the response format, 1 = Disagree Very Strongly and 6 = Agree Very
Strongly).
However, item 3 is now a reversed item: It is stated so that the more strongly you agree
with it, the lower is your level of job satisfaction. Here, scores of 1 indicate a higher level
of satisfaction, and scores of 6 indicate lower satisfaction (which is just the reverse of items
1 and 2).
It would be nice if all three items were consistent, so that scores of 6 always indicate high
satisfaction, and scores of 1 always indicate low satisfaction. This requires that you recode
item 3 so that people who circled a 6 are given a score of 1 instead; people who circled a 5
are given a score of 2 instead; and so on. This can be done very easily with the following
statement:
V3 = 7 - V3;
In SAS, these are called assignment statements. This book refers to them as recoding
statements. The preceding statement tells the computer to create a new version of the
variable V3 by subtracting the subject's existing (old) score on V3 from the number 7. The
result will be the subject's new score on V3. Notice that now, if a person's old score on V3
was 6, his or her new score is 1; if the old score was 1, the new score is 6, and so on.
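Spelled out for every point on the 6-point scale (a small illustration added here), the
statement V3 = 7 - V3; produces this mapping:

   Old V3:  6  5  4  3  2  1
   New V3:  1  2  3  4  5  6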
The syntax for this recoding statement is as follows:

   existing-variable = constant - existing-variable;

What is the constant? It will always be equal to the number of response points on your
scale plus 1. For example, the job satisfaction scale included 6 response points: a subject
could choose from a range of responses, beginning with 1 for "Disagree Very Strongly"
through 6 for "Agree Very Strongly." It was a 6-point scale, so the constant was 6 + 1 = 7.
What would the constant be if the following 7-point response format had been used instead?
     7 = Agree Very Strongly
     6 = Agree Strongly
     5 = Agree Somewhat
     4 = Neither Agree nor Disagree
     3 = Disagree Somewhat
     2 = Disagree Strongly
     1 = Disagree Very Strongly

The constant would have been 8, because 7 + 1 = 8. This means that the recoding statement
would have read:
V3 = 8 - V3;

Where Should the Recoding Statements Go?


In most cases, reversed items should be recoded before other data manipulations are
performed on them. For example, suppose that you want to create a new variable named
SATIS, which stands for job satisfaction. With this scale, higher scores indicate higher
levels of satisfaction. For a given subject, his or her score on this scale will be the average
of his or her responses to items 1, 2, and 3 from the preceding scale. Because item 3 is a
reversed item, it is important that it be recoded before it is added to items 1 and 2 in
calculating this scale score. Therefore, the correct sequence of statements is as follows:
V3 = 7 - V3;
SATIS = (V1 + V2 + V3) / 3;
The following sequence is not correct:
SATIS = (V1 + V2 + V3) / 3;
V3 = 7 - V3;
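To see why the order matters, here is a small numeric illustration (the values are
hypothetical, added here for clarity). Suppose a subject circled V1 = 6, V2 = 6, and V3 = 1
on the reversed item, meaning the subject is actually highly satisfied:

/* Correct order: V3 is recoded to 7 - 1 = 6 before averaging */
V3 = 7 - V3;                  /* V3 becomes 6                */
SATIS = (V1 + V2 + V3) / 3;   /* (6 + 6 + 6) / 3 = 6.0       */

/* Incorrect order: SATIS is computed from the unrecoded V3  */
SATIS = (V1 + V2 + V3) / 3;   /* (6 + 6 + 1) / 3 = 4.33      */
V3 = 7 - V3;                  /* too late to affect SATIS    */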


Recoding a Reversed Item and Creating a New Variable for the Achievement Motivation Study
The Scale
The beginning of this chapter presented a short questionnaire that could be used to assess
achievement motivation in a sample of college students. The five questionnaire items that
assessed achievement motivation are reproduced here:
   ----------------
     Circle Your
      Response
   ----------------
   1 2 3 4 5 6   1. I try very hard to improve on my performance in
                    classes.

   1 2 3 4 5 6   2. I take moderate risks to get ahead in classes.

   1 2 3 4 5 6   3. I try to perform better than my fellow students.

   1 2 3 4 5 6   4. I try to achieve as little as possible in
                    classes.

   1 2 3 4 5 6   5. I do my best work when my class assignments are
                    fairly difficult.

It is clear that all five items were designed to measure academic achievement motivation.
With items 1, 2, 3, and 5, the more the subject agrees with the item, the higher his or her
level of achievement motivation (you will remember that the scale uses a response format in
which 1 = Disagree Very Strongly and 6 = Agree Very Strongly).
Item 4, however, is a reversed item; it states "I try to achieve as little as possible in classes."
The more strongly subjects agree with item 4, the lower their level of achievement
motivation.
Creating the New SAS Variable
Suppose that you wish to create a new SAS variable named ACH_MOT. A given
subject's score on this new variable would be equal to the average of his or her responses to
items 1–5 from the achievement motivation questionnaire. Higher scores on ACH_MOT
indicate higher levels of achievement motivation. The SAS statement that creates
ACH_MOT looks like this:
ACH_MOT = (Q1 + Q2 + Q3 + Q4 + Q5) / 5;


Recoding the Reversed Item


However, you have a problem: As was stated earlier, item 4 from the scale is a reversed
item. Before you can include it in the preceding statement to create ACH_MOT, you must
first recode Q4 so that it is no longer reversed. Here is the SAS statement that will
accomplish this:
Q4 = 7 - Q4;
In the preceding statement, the constant is 7. This is because the achievement motivation
scale uses a 6-point response format, and 6 + 1 = 7.
The SAS Program
Finally, you are ready to put it all together. Here are the statements that recode the reversed
item, Q4, and create the new SAS variable, ACH_MOT:
       [The first part of the DATA step appears here]
18     7 5 5 6 1 5  F  21  3
19     8 2 3 1 5 2  F  25  3
20     9 4 5 4 2 5  M  23  3
21     ;
22     DATA D2;
23        SET D1;
24
25     Q4 = 7 - Q4;
26     ACH_MOT = (Q1 + Q2 + Q3 + Q4 + Q5) / 5;
27
28     PROC PRINT DATA=D2;
29        VAR Q1 Q2 Q3 Q4 Q5 ACH_MOT;
30        TITLE1 'JANE DOE';
31     RUN;

Some notes about the preceding program:

The last data lines from the achievement motivation study appear on lines 18–20 (to save
space, only the last few lines are presented).

A new DATA step begins on line 22. This was necessary, because you can recode
variables and create new variables only within a DATA step.

Line 25 presents the statement that recodes Q4.

Line 26 presents the statement that creates ACH_MOT. Each subject's score on
ACH_MOT is equal to the average of his or her responses to Q1–Q5.

Lines 28–31 request that the PRINT procedure be performed. Notice that the VAR
statement on line 29 tells SAS to print only the variables Q1, Q2, Q3, Q4, Q5, and
ACH_MOT.


The SAS Output


Output 8.3 presents the results generated by the preceding program.
                       JANE DOE

 Obs    Q1    Q2    Q3    Q4    Q5    ACH_MOT
  1      6     5     5     5     6      5.4
  2      2     1     2     2     2      1.8
  3      5     .     4     4     6       .
  4      5     6     6     .     6       .
  5      4     4     5     5     5      4.6
  6      5     6     6     5     6      5.6
  7      5     5     6     6     5      5.4
  8      2     3     1     2     2      2.0
  9      4     5     4     5     5      4.6

Output 8.3. Results of the PROC PRINT statement in which Q1–Q5 and ACH_MOT were listed in
the VAR statement, achievement motivation study.

Some notes about Output 8.3:

•  The column headed "Q4" presents each subject's score on SAS variable Q4 (responses to item 4 from the questionnaire). This output presents Q4 as it existed after being recoded by the SAS statement Q4 = 7 - Q4;.

•  In Output 8.3, Observation #1 corresponds to subject #1 (Marsha) from Table 8.1, presented at the beginning of this chapter. In Table 8.1, Marsha had a score of 2 on Q4; in Output 8.3, her score on Q4 has been recoded to be 5. This is as expected.

•  If you compare the variable Q4 as it appears in Table 8.1 to the variable Q4 as it appears in Output 8.3, you will see that the recoding statement appears to have had the intended effect in recoding all responses to Q4.

•  The final column in Output 8.3 is headed "ACH_MOT". This column provides each subject's score on the new variable, ACH_MOT. The score on ACH_MOT for subject #1 is 5.4. This represents the average of subject #1's responses to Q1, Q2, Q3, Q4, and Q5 (6, 5, 5, 5, and 6, respectively). The output shows that the score on ACH_MOT for subject #2 is 1.8. This represents the average of subject #2's responses to Q1, Q2, Q3, Q4, and Q5 (2, 1, 2, 2, and 2, respectively). You can see that scores on ACH_MOT were determined in the same way for each of the remaining subjects in Output 8.3.

•  Subject #3 has missing data on ACH_MOT (a single period appears where you would expect his score on ACH_MOT to appear). This is because Subject #3 (Jack) had missing data on the variable Q2, which was used in creating ACH_MOT. Whenever a subject has missing data on an existing variable that is used in the creation of a new variable, that subject is assigned missing data on the new variable as well (at least this is the case with the type of data manipulation statements presented in this chapter).

•  In the same way, you can see that Subject #4 (Cathy) also had missing data on ACH_MOT. This is because she had missing data on Q4, which was also used in the creation of ACH_MOT.
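As an aside (this alternative is not used in the chapter), SAS also provides a MEAN function that averages only the nonmissing values passed to it. If you wanted subjects such as Jack and Cathy to receive a score based on the items they did answer, a minimal sketch would be:

   DATA D2;
      SET D1;
      Q4 = 7 - Q4;
      /* MEAN() ignores missing arguments, so a subject who   */
      /* skipped one item still receives a score on ACH_MOT.  */
      ACH_MOT = MEAN(OF Q1-Q5);

Whether averaging over only the answered items is appropriate depends on your measurement strategy, so treat this as an option rather than a recommendation.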

The SAS Log

The SAS log contains your SAS program (minus the data lines), along with any notes, warnings, or error messages created by SAS as it executes the program. Log 8.1 presents an excerpt from the SAS log created by the preceding program.

   22   DATA D2;
   23      SET D1;
   24
   25      Q4 = 7 - Q4;
   26      ACH_MOT = (Q1 + Q2 + Q3 + Q4 + Q5) / 5;

   NOTE: Missing values were generated as a result of performing an
         operation on missing values.
         Each place is given by: (Number of times) at (Line):(Column).
         1 at 25:11   1 at 26:18   1 at 26:23
         2 at 26:28   2 at 26:33   2 at 26:39

Log 8.1. Note about missing values produced by data manipulation statements.

The excerpt of the SAS log appearing in Log 8.1 contains the statements that create the new SAS data set D2 (lines 22–23), the statement that recodes Q4 (line 25), and the statement that creates ACH_MOT from existing variables (line 26).

Log 8.1 shows that, just below these statements, SAS generated the following: "NOTE: Missing values were generated as a result of performing an operation on missing values." The note lists the locations in the SAS program where these missing values were generated.

A note such as this is not necessarily a cause for alarm. SAS automatically prints this type of note whenever you are creating a new variable from existing variables and some of the existing variables contain missing data. The preceding section headed "The SAS Output" pointed out that the program contained missing data for the following two subjects:

•  Subject #3 (Jack) had missing data on the variable Q2, which was used in creating the new variable ACH_MOT.

•  Subject #4 (Cathy) had missing data on Q4, which was also used in the creation of ACH_MOT.

The note that appears in Log 8.1 was generated by SAS to remind you that it was generating missing values on the new variable that it was creating (ACH_MOT) because these two subjects had missing values on the two existing variables, Q2 and Q4.

When you receive a note such as this, you should review it to verify that the number of missing values being created is reasonable, given the number of missing values that appear in your initial data set. If the number of missing values being generated seems to be inappropriate, you should attempt to identify the source of the problem by carefully reviewing the data values in your data set, along with all data manipulation statements.
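One concrete way to perform that check (my own suggestion; the original program does not include it) is to ask PROC MEANS for the number of nonmissing and missing values on each variable in the new data set:

   /* N reports the count of nonmissing values and NMISS the */
   /* count of missing values for each variable listed.      */
   PROC MEANS DATA=D2 N NMISS;
      VAR Q1 Q2 Q3 Q4 Q5 ACH_MOT;
   RUN;

Comparing the NMISS column against the missing values you expect in the raw data makes it easy to spot an error in the data manipulation statements.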

Using IF-THEN Control Statements
Overview
An IF-THEN control statement allows you to ensure that operations are performed on a
given subject's data only if certain conditions are true regarding that subject. You can use
IF-THEN control statements to modify existing variables or create new variables. This
section illustrates a number of ways that IF-THEN control statements can be used to
perform tasks that are commonly required in research.
Creating a New AGE Variable

For example, in the achievement motivation study described earlier, one of the SAS variables included in the data set is AGE: each subject's age in years. Suppose that you now wish to create a new variable called AGE2. You will use the following rules in assigning scores to subjects on AGE2:

•  If a given subject's score on AGE is less than 25, then his or her score on AGE2 will be zero.

•  If a given subject's score on AGE is greater than or equal to 25, then his or her score on AGE2 will be 1.


The following IF-THEN control statements create AGE2 according to these rules:
[The first part of the DATA step appears here]

18   7 5 5 6 1 5   F   21   3
19   8 2 3 1 5 2   F   25   3
20   9 4 5 4 2 5   M   23   3
21   ;
22   DATA D2;
23      SET D1;
24
25      AGE2 = .;
26      IF AGE LT 25 THEN AGE2 = 0;
27      IF AGE GE 25 THEN AGE2 = 1;
28
29   PROC PRINT DATA=D2;
30      VAR AGE AGE2;
31      TITLE1 'JANE DOE';
32   RUN;

Some notes about the preceding program:

•  The last data lines from the achievement motivation study appear on lines 18–20 (to save space, only the last few lines are presented).

•  A new DATA step begins on line 22. This was necessary because you can use IF-THEN control statements to create a new variable only within a DATA step.

•  Line 25 tells SAS to create a new variable called AGE2, and begin by assigning missing data (.) to all subjects on AGE2.

•  Line 26 tells SAS that, if the score on AGE for a given subject is less than 25, then that subject's score on AGE2 should be 0 (the "LT" in this statement stands for "is less than").

•  Line 27 tells SAS that, if the score on AGE for a given subject is greater than or equal to 25, then that subject's score on AGE2 should be 1 (the "GE" in this statement stands for "is greater than or equal to").

•  Lines 29–32 request that the PRINT procedure be performed on variables AGE and AGE2.


The preceding program generates results presented in Output 8.4.

           JANE DOE

   Obs    AGE    AGE2

    1     22      0
    2     25      1
    3     30      1
    4     41      1
    5     22      0
    6     20      0
    7     21      0
    8     25      1
    9     23      0

Output 8.4. Results of the PROC PRINT in which AGE and AGE2 were listed in the VAR statement, achievement motivation study.

In Output 8.4, the column headed "AGE" presents subjects' scores on AGE as they appeared in the initial SAS data set. The column headed "AGE2" presents subjects' scores on the new AGE2 variable, as created by the IF-THEN control statements.

Observation #1 presents the AGE score for subject #1 (Marsha), which was 22. This subject's score on AGE2 was 0, which was as expected: The preceding IF-THEN control statements indicated that if a subject's score on AGE was less than 25, then her score on AGE2 should be 0.

Observation #2 presents the AGE score for subject #2 (Charles), which was 25. This subject's score on AGE2 was 1, which was as expected: The preceding IF-THEN control statements indicated that if a subject's score on AGE was greater than or equal to 25, then his score on AGE2 should be 1.

Output 8.4 shows that each remaining subject's score on AGE2 was created according to the same rules.
Comparison Operators

The preceding example introduced you to the concept of comparison operators. The comparison operator LT represented "is less than," and the comparison operator GE represented "is greater than or equal to."


The following comparison operators may be used with IF-THEN statements:

   =           is equal to
   NE          is not equal to
   GT or >     is greater than
   GE          is greater than or equal to
   LT or <     is less than
   LE          is less than or equal to
The Syntax of the IF-THEN Statement

The syntax is as follows:

   IF expression THEN statement ;

The "expression" usually consists of some comparison involving existing variables. The "statement" usually involves some operation performed on existing variables or new variables.
To illustrate the use of the IF-THEN statement, this section again refers to the fictitious Learning Aptitude Test (abbreviated LAT) presented earlier in this guide. A previous chapter indicated that the LAT included a verbal subtest as well as a mathematical subtest. Suppose that you have obtained LAT scores for a sample of subjects, and now wish to create a new variable called LATVGRP, which is an abbreviation for "LAT-verbal group." This variable will be created with the following provisions:

•  If you do not know what a subject's LAT Verbal test score is, that subject will have a score of "." (for "missing data") on LATVGRP.

•  If the subject's score is under 500 on the LAT Verbal test, the subject will have a score of 0 on LATVGRP.

•  If the subject's score is 500 or greater on the LAT Verbal test, the subject will have a score of 1 on LATVGRP.

Suppose that the variable LATV already exists in your data set and that it contains each subject's actual score on the LAT Verbal test. You can now use it to create the new variable, LATVGRP, by writing the following statements:

   LATVGRP = .;
   IF LATV LT 500 THEN LATVGRP = 0;
   IF LATV GE 500 THEN LATVGRP = 1;

The preceding statements tell SAS to create a new variable called LATVGRP, and begin by setting everyone's score as equal to "." (missing). If a subject's score on LATV is less than 500, then SAS sets his or her score on LATVGRP as equal to 0. If a subject's score on LATV is greater than or equal to 500, then SAS sets his or her score on LATVGRP as equal to 1.


Using ELSE Statements

You could have performed the preceding operations more efficiently by using the ELSE statement. Here is the syntax for using the ELSE statement with the IF-THEN statement:

   IF expression THEN statement ;
   ELSE IF expression THEN statement ;

The ELSE statement provides alternative actions that SAS may take if the original IF expression is not true. For example, consider the following:

   1   LATVGRP = .;
   2   IF LATV LT 500 THEN LATVGRP = 0;
   3   ELSE IF LATV GE 500 THEN LATVGRP = 1;

The preceding tells SAS to

•  create a new variable called LATVGRP, and initially assign all subjects a value of "missing"

•  assign a particular subject a score of 0 on LATVGRP if that subject has an LATV score less than 500

•  assign a particular subject a score of 1 on LATVGRP if that subject has an LATV score greater than or equal to 500.

Obviously, the preceding statements were identical to the earlier statements that created LATVGRP, except that the word ELSE has now been added to the beginning of the third line. In fact, these two approaches result in assigning exactly the same values on LATVGRP to each subject. So what, then, is the advantage of including the ELSE statement? The answer has to do with efficiency: When an ELSE statement is included, the actions specified in that statement are executed only if the expression in the preceding IF statement is not true.

For example, consider the situation in which subject 1 has a score on LATV that is less than 500. Line 2 in the preceding statements would assign that subject a score of 0 on LATVGRP. SAS would then ignore line 3 (because it contains the ELSE statement), thus saving computer time. If line 3 did not contain the word ELSE, SAS would execute the line, checking to see whether the LATV score for subject 1 is greater than or equal to 500 (which is actually unnecessary, given what was learned in line 2).

Regarding missing data, notice that line 2 of the preceding program assigns subjects to group 0 (under LATVGRP) if their scores on LATV are less than 500. Unfortunately, a score of "missing" (.) on LATV is viewed as being less than 500 (actually, SAS views it as being less than 0). This means that subjects with missing data on LATV will be assigned to group 0 under LATVGRP by line 2 of the preceding program. This is not desirable.


To prevent this from happening, you may rewrite the program in the following way:

   1   LATVGRP = .;
   2   IF LATV GE 200 AND LATV LT 500 THEN LATVGRP = 0;
   3   ELSE IF LATV GE 500 THEN LATVGRP = 1;

Line 2 of the program now tells SAS to assign subjects to group 0 only if their scores on
LATV are both greater than or equal to 200, and less than 500. This modification uses the
conditional AND statement, which is discussed in greater detail in the following section.
Finally, remember to use the ELSE statement only in conjunction with a preceding IF
statement and always to place it immediately following the relevant IF statement.
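As a further illustration (my own example, not from the original text), ELSE statements can be chained to sort subjects into more than two groups. The sketch below uses the AGE variable from the achievement motivation study; note that each condition includes a lower bound, so subjects with missing data on AGE fail every test and keep their value of "." on the hypothetical variable AGEGRP:

   AGEGRP = .;
   IF AGE GE 18 AND AGE LT 25 THEN AGEGRP = 1;
   ELSE IF AGE GE 25 AND AGE LT 35 THEN AGEGRP = 2;
   ELSE IF AGE GE 35 THEN AGEGRP = 3;

The cutoffs of 18, 25, and 35 are arbitrary choices made for this illustration.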
Using the Conditional Statements AND and OR
As the preceding section suggests, it is also possible to use the conditional statement AND
within an IF-THEN statement or an ELSE statement. For example, consider the following:
LATVGRP = .;
IF LATV GT 400 AND LATV LT 500 THEN LATVGRP = 0;
ELSE IF LATV GE 500 THEN LATVGRP = 1;
The second statement tells SAS "if LATV is greater than 400 and less than 500, then give
this subject a score on LATVGRP of 0." This means that subjects are given a score of 0
only if they are both over 400 and under 500. What happens to those who have a score of
400 or less on the LATV? They are given a score of "." on LATVGRP. That is, they are
classified as having missing data on LATVGRP. This is because they (along with everyone
else) were given a score of "." in the first statement, and neither of the later statements
replaces that "." with a 0 or a 1. However, for subjects whose scores are over 400, one of
the later statements will replace the "." with either a 0 or a 1.
It is also possible to use the conditional statement OR within an IF-THEN statement or an
ELSE statement. For example, suppose that you have a variable in your data set called
ETHNIC. With this variable, subjects were assigned the value 5 if they were Caucasian, 6 if
they were African-American, or 7 if they were Asian-American. Suppose that you now wish
to create a new variable called MAJORITY: Subjects will be assigned a value of 1 on this
variable if they are in the majority group (i.e., if they are Caucasians), and they will be
assigned a value of 0 on this variable if they are in a minority group (if they are either
African-Americans or Asian-Americans). The following statements would create this
variable:
MAJORITY=.;
IF ETHNIC = 5 THEN MAJORITY = 1;
ELSE IF ETHNIC = 6 OR ETHNIC = 7 THEN MAJORITY = 0;
In the preceding statements, all subjects are first assigned a value of missing on
MAJORITY. If their value on ETHNIC is 5, their score on MAJORITY is changed to 1,


and SAS ignores the following ELSE statement. If their value on ETHNIC is not 5, then SAS moves on to the ELSE statement. There, if the subject's value on ETHNIC is either 6 or 7, the subject is assigned a value of 0 on MAJORITY.
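A simple way to verify that a recoding like this worked as intended (a check I am adding; it does not appear in the original text) is to crosstabulate the old and new variables with PROC FREQ. Assuming the variables live in a data set named D2:

   /* Every nonzero cell should pair ETHNIC=5 with MAJORITY=1, */
   /* and ETHNIC=6 or 7 with MAJORITY=0; any other combination */
   /* signals an error in the IF-THEN and ELSE statements.     */
   PROC FREQ DATA=D2;
      TABLES ETHNIC*MAJORITY;
   RUN;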
Working with Character Variables
Using single quotation marks. When you are working with character variables (variables
in which the values may consist of letters or special characters), it is important that you
enclose values within single quotation marks in the IF-THEN and ELSE statements.
Converting character values to numeric values. For example, suppose that you administered the achievement motivation questionnaire (from the beginning of this chapter) to a sample of subjects. This questionnaire asked subjects to identify their sex. In entering the data, you created a character variable called SEX in which the value "F" represented female subjects and the value "M" represented male subjects.

Suppose that you now wish to create a new variable called SEX2. SEX2 will be a numeric variable in which the value 0 is used to represent females and the value 1 is used to represent males. The following SAS statements could be used to create and print this new variable:
[The first part of the DATA step appears here]

18   7 5 5 6 1 5   F   21   3
19   8 2 3 1 5 2   F   25   3
20   9 4 5 4 2 5   M   23   3
21   ;
22   DATA D2;
23      SET D1;
24
25      SEX2 = .;
26      IF SEX = 'F' THEN SEX2 = 0;
27      IF SEX = 'M' THEN SEX2 = 1;
28
29   PROC PRINT DATA=D2;
30      VAR SEX SEX2;
31      TITLE1 'JANE DOE';
32   RUN;

Some notes about the preceding program:

•  The last data lines from the achievement motivation study appear on lines 18–20 (to save space, only the last few lines are presented).

•  A new DATA step begins on line 22. This was necessary because you can use IF-THEN control statements to create a new variable only within a DATA step.

•  Line 25 tells SAS to create a new variable called SEX2, and begin by assigning missing data (.) to all subjects on SEX2.

•  Line 26 tells SAS that, if a given subject's value on SEX is equal to "F," then her value on SEX2 should be 0.

•  Line 27 tells SAS that, if a given subject's value on SEX is equal to "M," then his value on SEX2 should be 1.

•  On line 26, notice that the "F" is enclosed within single quotation marks. This was necessary because SEX was a character variable. However, when SEX2 is set to zero on line 26, the zero is not enclosed within single quotation marks. This is because SEX2 is not a character variable; it is a numeric variable. Use single quotation marks only to enclose character variable values.
Output 8.5 presents the results generated by the preceding program.

           JANE DOE

   Obs    SEX    SEX2

    1      F      0
    2      M      1
    3      M      1
    4      F      0
    5      M      1
    6      F      0
    7      F      0
    8      F      0
    9      M      1

Output 8.5. Results of PROC PRINT in which SEX and SEX2 were listed in the VAR statement, achievement motivation study.

Output 8.5 presents each subject's values for the variables SEX and SEX2. Notice that, if a given subject's value for SEX is "F," her value on SEX2 is equal to zero; if a given subject's value on SEX is "M," his value on SEX2 is equal to 1. This is as expected, given the preceding IF-THEN control statements.
Converting numeric values to character values. The same conventions apply when you
convert numeric values to character values: Within the IF-THEN control statements, the
values for character variables should be enclosed within single quotation marks, but the
values for numeric variables should not be enclosed within single quotation marks.
For example, you may remember that the achievement motivation questionnaire asked
subjects to indicate their major. That section of the questionnaire is reproduced here:
8. What is your major?

      ______ Arts and Sciences (1)
      ______ Business (2)
      ______ Education (3)


In entering the data, you created a numeric variable called MAJOR. You used the value 1
to represent subjects majoring in the arts and sciences, the value 2 to represent subjects
majoring in business, and the value 3 to represent subjects majoring in education.
Suppose that you now wish to create a new variable called MAJOR2, which will also identify the area in which the subjects majored. However, MAJOR2 will be a character variable, and the values of MAJOR2 will be three characters long. Specifically,

•  the value "A&S" represents subjects majoring in the arts and sciences

•  the value "BUS" represents subjects majoring in business

•  the value "EDU" represents subjects majoring in education.

The following statements use IF-THEN control statements to create the new MAJOR2 variable:

   1   MAJOR2 = '  .';
   2   IF MAJOR = 1 THEN MAJOR2 = 'A&S';
   3   IF MAJOR = 2 THEN MAJOR2 = 'BUS';
   4   IF MAJOR = 3 THEN MAJOR2 = 'EDU';

Some notes about the preceding program:

•  Line 1 creates the new variable MAJOR2 and initially assigns missing data to all subjects. It does this with the following statement:

      MAJOR2 = '  .';

   Notice that there is room for three characters within the single quotation marks: two blank spaces and a single period (to represent missing data). This was important, because the subsequent IF-THEN statements assign 3-character values to MAJOR2.

•  Line 2 indicates that, if a given subject has a value of 1 on MAJOR, then that subject's value on MAJOR2 should be "A&S".

•  Line 3 indicates that, if a given subject has a value of 2 on MAJOR, then that subject's value on MAJOR2 should be "BUS".

•  Line 4 indicates that, if a given subject has a value of 3 on MAJOR, then that subject's value on MAJOR2 should be "EDU".

•  Once again, notice that when line 2 includes the expression

      IF MAJOR = 1

   the 1 does not appear within single quotation marks. This is because MAJOR is a numeric variable. However, when line 2 includes the statement

      THEN MAJOR2 = 'A&S';

   the "A&S" does appear within single quotation marks. This is because MAJOR2 is a character variable.
If a later section of your program included a PROC PRINT statement to print the contents of MAJOR and MAJOR2, the results would look something like Output 8.6.

            JANE DOE

   Obs    MAJOR    MAJOR2

    1       1       A&S
    2       1       A&S
    3       1       A&S
    4       2       BUS
    5       2       BUS
    6       2       BUS
    7       3       EDU
    8       3       EDU
    9       3       EDU

Output 8.6. Results of the PROC PRINT in which MAJOR and MAJOR2 were listed in the VAR statement, achievement motivation study.

Data Subsetting
Overview
An earlier section of this chapter indicated that data subsetting statements are SAS
statements that eliminate unwanted observations from a sample, so that only a specified
subgroup is included in the resulting data set. Often, it is necessary to perform an analysis
on only a subset of the subjects who are included in the data set. For example, you may
wish to review the mean survey responses provided by just the female subjects. Or, you
may wish to review mean survey responses provided by just those subjects majoring in the
arts and sciences. Subsetting IF statements may be used to obtain these results.
The Syntax of Data Subsetting Statements
Here is the syntax for the statements that perform data subsetting:
DATA new-data-set-name ;
SET existing-data-set-name ;
IF comparison ;
The "comparison" generally includes some existing variable and at least one comparison
operator.


An Example
The SAS program. For example, suppose you wish to compute mean survey responses for
just those subjects who are majoring in the arts and sciences. The following statements
accomplish this:
[The first part of the DATA step appears here]

18   7 5 5 6 1 5   F   21   3
19   8 2 3 1 5 2   F   25   3
20   9 4 5 4 2 5   M   23   3
21   ;
22   DATA D2;
23      SET D1;
24
25      IF MAJOR = 1;
26
27   PROC MEANS DATA=D2;
28      TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
29   RUN;

Some notes about the preceding program:

•  The last data lines from the achievement motivation study appear on lines 18–20 (to save space, only the last few lines are presented).

•  A new DATA step begins on line 22. This was necessary because you can perform data subsetting only within a DATA step.

•  Lines 22–23 tell SAS to create a new data set, name it D2, and create it as a duplicate of data set D1.

•  Line 25 tells SAS to retain a given observation for D2 only if that observation's value on MAJOR is equal to 1. This will retain in the data set D2 only those subjects who majored in the arts and sciences (because the number 1 was used to represent this group under the MAJOR variable).

•  Lines 27–29 request that PROC MEANS be performed on the data set.

•  Line 27 includes the option DATA=D2, which specifies that PROC MEANS should be performed on the data set D2. This makes sense, because D2 is the data set that contains just the arts and sciences majors.


Output 8.7 presents the results generated by the preceding program.

                  JANE DOE--ARTS AND SCIENCES MAJORS

                         The MEANS Procedure

   Variable    N          Mean       Std Dev       Minimum       Maximum
   ----------------------------------------------------------------------
   SUB_NUM     3     2.0000000     1.0000000     1.0000000     3.0000000
   Q1          3     4.3333333     2.0816660     2.0000000     6.0000000
   Q2          2     3.0000000     2.8284271     1.0000000     5.0000000
   Q3          3     3.6666667     1.5275252     2.0000000     5.0000000
   Q4          3     3.3333333     1.5275252     2.0000000     5.0000000
   Q5          3     4.6666667     2.3094011     2.0000000     6.0000000
   AGE         3    25.6666667     4.0414519    22.0000000    30.0000000
   MAJOR       3     1.0000000     0             1.0000000     1.0000000
   ----------------------------------------------------------------------

Output 8.7. Results of the PROC MEANS performed on data set consisting of arts and sciences majors only, achievement motivation study.

Some notes about this output:

•  PROC MEANS was performed on all of the numeric variables in the data set because the VAR statement had been omitted from the SAS program.

•  In the column headed "N," you can see that, for most variables, the analyses were performed on just three subjects. This makes sense, because Table 8.1 showed that just three subjects were majoring in the arts and sciences.

•  To the right of the variable name MAJOR, you can see that the mean score on MAJOR is 1.0, the minimum value for MAJOR is 1.0, and the maximum value for MAJOR is 1.0. This is what you would expect if your data set consisted exclusively of arts and sciences majors: Each of them should have a value on MAJOR that is equal to 1.


An Example with Multiple Subsets

It is possible to write a single program that creates multiple data sets, with each data set consisting of a different subgroup of subjects. This is done with the following program:

[The first part of the DATA step appears here]

18   7 5 5 6 1 5   F   21   3
19   8 2 3 1 5 2   F   25   3
20   9 4 5 4 2 5   M   23   3
21   ;
22   DATA D2;
23      SET D1;
24      IF MAJOR = 1;
25   PROC MEANS DATA=D2;
26      TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
27   RUN;
28
29   DATA D3;
30      SET D1;
31      IF MAJOR = 2;
32   PROC MEANS DATA=D3;
33      TITLE1 'JANE DOE--BUSINESS MAJORS';
34   RUN;
35
36   DATA D4;
37      SET D1;
38      IF MAJOR = 3;
39   PROC MEANS DATA=D4;
40      TITLE1 'JANE DOE--EDUCATION MAJORS';
41   RUN;

Some notes about the preceding program:

•  Lines 22–24 create a new data set named D2. The subsetting IF statement on line 24 ensures that this data set will contain only subjects with a value of 1 on the variable MAJOR. This means that the data set will contain only arts and sciences majors. Lines 25–27 request that PROC MEANS be performed on this data set.

•  Lines 29–31 create a new data set named D3. The subsetting IF statement on line 31 ensures that this data set will contain only subjects with a value of 2 on the variable MAJOR. This means that the data set will contain only business majors. Lines 32–34 request that PROC MEANS be performed on this data set.

•  Lines 36–38 create a new data set named D4. The subsetting IF statement on line 38 ensures that this data set will contain only subjects with a value of 3 on the variable MAJOR. This means that the data set will contain only education majors. Lines 39–41 request that PROC MEANS be performed on this data set.


Specifying the initial data set in the SET statement. Notice that, throughout the preceding
program, the SET statement always specifies D1, as shown here:
SET D1;
This is because the data set D1 was the initial data set and the only data set that contained all
of the initial observations. When creating a new data set that will consist of a subset of this
initial data set, you will usually want to specify the initial data set in your SET statement.
Specifying the current data set in the PROC statements. PROC MEANS statements appear on lines 25, 32, and 39 of the preceding program. Notice that, in each case, the DATA= option of the PROC MEANS statement always specifies the data set that has just been created. Line 25 reads "PROC MEANS DATA=D2," line 32 reads "PROC MEANS DATA=D3," and line 39 reads "PROC MEANS DATA=D4." This ensures that the first PROC MEANS is performed on the data set containing just the arts and sciences majors, the second PROC MEANS is performed on the data set containing just the business majors, and the third PROC MEANS is performed on the data set containing just the education majors.
Using Comparison Operators and the Conditional Statements AND and OR

When writing a subsetting IF statement, you may use all of the comparison operators described above (such as LT or GE) as well as the conditional statements AND and OR.
For example, suppose that you have created an initial data set named D1 that contains the
SAS variables SEX (which represents subject sex) and AGE (which represents subject age).
You now wish to create a second data set named D2, and a subject will be retained in D2
only if she is a female, and she is 65 years of age or older. The following statements will
accomplish this:
DATA D2;
SET D1;
IF SEX = 'F' AND AGE GE 65;
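Similarly (an extra illustration of my own, not shown in the original text), OR can be used in a subsetting IF statement to retain subjects who satisfy any one of several conditions. For example, to keep both the business majors and the education majors from the achievement motivation data set in a single subset:

   DATA D3;
      SET D1;
      /* Retain an observation if the subject majored in either */
      /* business (MAJOR = 2) or education (MAJOR = 3).         */
      IF MAJOR = 2 OR MAJOR = 3;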

Eliminating Observations That Have Missing Data on Some Variables

Overview. One of the most common difficulties encountered by researchers in the social sciences and education is the problem of missing data. Briefly, the missing data problem involves not having scores on all variables for all subjects in a data set.


Missing data in the achievement motivation study. To illustrate the concept of missing data, Table 8.1 is reproduced here as Table 8.2:

Table 8.2
Data from the Achievement Motivation Study
_____________________________________________________
                Agree-Disagree
                  Questions
              __________________
Subject       Q1  Q2  Q3  Q4  Q5    Sex   Age   Major
_____________________________________________________
1. Marsha      6   5   5   2   6     F     22     1
2. Charles     2   1   2   5   2     M     25     1
3. Jack        5   .   4   3   6     M     30     1
4. Cathy       5   6   6   .   6     F     41     2
5. Emmett      4   4   5   2   5     M     22     2
6. Marie       5   6   6   2   6     F     20     2
7. Cindy       5   5   6   1   5     F     21     3
8. Susan       2   3   1   5   2     F     25     3
9. Fred        4   5   4   2   5     M     23     3
_____________________________________________________

Table 8.2 uses a single period (.) to represent missing data. The table reveals missing data for the third subject (Jack) on the variable Q2: There is a single period in the location where you would expect to see Jack's score for Q2. Similarly, the table also reveals missing data for the fourth subject (Cathy) on variable Q4.


In Chapter 4, "Data Input," you learned that you should also use a single period to represent missing data when entering data in a SAS data set. This was shown in the section "SAS Program to Read the Raw Data," earlier in this chapter. The initial DATA step from that program is reproduced below:

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT SUB_NUM
 4            Q1
 5            Q2
 6            Q3
 7            Q4
 8            Q5
 9            SEX $
10            AGE
11            MAJOR;
12   DATALINES;
13   1 6 5 5 2 6   F   22   1
14   2 2 1 2 5 2   M   25   1
15   3 5 . 4 3 6   M   30   1
16   4 5 6 6 . 6   F   41   2
17   5 4 4 5 2 5   M   22   2
18   6 5 6 6 2 6   F   20   2
19   7 5 5 6 1 5   F   21   3
20   8 2 3 1 5 2   F   25   3
21   9 4 5 4 2 5   M   23   3
22   ;

Line 15 from the preceding program contains data from the subject Jack. You can see that a single period appears in the location where you would normally expect Jack's score on Q2. In the same way, line 16 contains data from the subject Cathy. You can see that a single period appears in the location where you would normally expect Cathy's score on Q4.
Eliminating observations with missing data from a new data set. Suppose that you now
wish to create a new data set named D2. The new data set will be identical to the initial data
set (D1) with one exception: D2 will contain only observations that have no missing data on
the five achievement motivation questionnaire items (Q1, Q2, Q3, Q4, and Q5). In other
words, you wish to include a subject in the new data set only if the subject answered all five
of the achievement motivation questionnaire items. Once you have created the new data set,
you will use PROC PRINT to print it out.


The following statements accomplish this:

[The first part of the DATA step appears here]

19   7 5 5 6 1 5   F   21   3
20   8 2 3 1 5 2   F   25   3
21   9 4 5 4 2 5   M   23   3
22   ;
23   DATA D2;
24      SET D1;
25      IF Q1 NE . AND
26         Q2 NE . AND
27         Q3 NE . AND
28         Q4 NE . AND
29         Q5 NE . ;
30
31   PROC PRINT DATA=D2;
32      TITLE1 'JANE DOE';
33   RUN;

Some notes about the preceding program:

•  The last data lines from the achievement motivation study appear on lines 19–21.

•  A new DATA step begins on line 23. This was necessary because you can perform data subsetting only within a DATA step.

•  Lines 23–24 tell SAS to create a new data set, name it D2, and initially create it as a duplicate of data set D1.

•  Lines 25–29 contain a single subsetting IF statement. The comparison operator NE that appears in the statement stands for "is not equal to." This subsetting IF statement tells SAS to retain a given observation for data set D2 only if all of the following are true:

      Q1 is not equal to missing data
      Q2 is not equal to missing data
      Q3 is not equal to missing data
      Q4 is not equal to missing data
      Q5 is not equal to missing data.

   With this subsetting IF statement, a given subject is retained in data set D2 only if he or she had no missing data on any of the five variables listed.

•  Lines 31–33 contain the PROC PRINT statement that prints the new data set. Notice that the DATA=D2 option on line 31 specifies that D2 should be printed, rather than D1.
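As a side note (this shortcut does not appear in the original program), SAS also provides an NMISS function that can express the same condition more compactly:

   DATA D2;
      SET D1;
      /* NMISS counts how many of the listed variables are missing. */
      /* Keeping only observations where that count is zero is the  */
      /* same as requiring complete data on Q1 through Q5.          */
      IF NMISS(OF Q1-Q5) = 0;

Both versions retain exactly the same observations; the longhand form simply makes the logic more explicit for beginners.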


Output 8.8 presents the results generated by the preceding program.

                            JANE DOE

   Obs   SUB_NUM   Q1   Q2   Q3   Q4   Q5   SEX   AGE   MAJOR

    1       1       6    5    5    2    6    F     22     1
    2       2       2    1    2    5    2    M     25     1
    3       5       4    4    5    2    5    M     22     2
    4       6       5    6    6    2    6    F     20     2
    5       7       5    5    6    1    5    F     21     3
    6       8       2    3    1    5    2    F     25     3
    7       9       4    5    4    2    5    M     23     3

Output 8.8. Results of the PROC PRINT performed on data set D2, achievement motivation study.

Notice that there are only seven observations in Output 8.8. The initial data set (D1)
contained nine observations, but two of these (observations for subjects Jack and Cathy)
contained missing data, and were therefore not included in data set D2. If you look at the
values for variables Q1, Q2, Q3, Q4, and Q5, you will not see any single periods indicating
missing data.

Combining a Large Number of Data Manipulation and Data Subsetting Statements in a Single Program

Overview

Most of the SAS programs presented in this chapter have been fairly simple in that only a few data manipulation or data subsetting statements have been included in each program. In practice, however, it is possible to include, within a single program, a relatively large number of statements that modify variables and data sets. This section provides an example of such a program.
A Longer SAS Program
For example, this section presents a fairly long SAS program. This program includes data
from the achievement motivation study (first presented in Table 8.1) that has been analyzed
throughout this chapter. You will see that this single program includes a wide variety of
statements that perform most of the tasks discussed in this chapter.


 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT SUB_NUM
 4            Q1
 5            Q2
 6            Q3
 7            Q4
 8            Q5
 9            SEX $
10            AGE
11            MAJOR;
12   DATALINES;
13   1 6 5 5 2 6   F   22   1
14   2 2 1 2 5 2   M   25   1
15   3 5 . 4 3 6   M   30   1
16   4 5 6 6 . 6   F   41   2
17   5 4 4 5 2 5   M   22   2
18   6 5 6 6 2 6   F   20   2
19   7 5 5 6 1 5   F   21   3
20   8 2 3 1 5 2   F   25   3
21   9 4 5 4 2 5   M   23   3
22   ;
23   PROC PRINT DATA=D1;
24      TITLE1 'JANE DOE';
25   RUN;
26
27   DATA D2;
28      SET D1;
29
30      Q4 = 7 - Q4;
31      ACH_MOT = (Q1 + Q2 + Q3 + Q4 + Q5) / 5;
32
33      AGE2 = .;
34      IF AGE LT 25 THEN AGE2 = 0;
35      IF AGE GE 25 THEN AGE2 = 1;
36
37      SEX2 = .;
38      IF SEX = 'F' THEN SEX2 = 0;
39      IF SEX = 'M' THEN SEX2 = 1;
40
41      MAJOR2 = '  .';
42      IF MAJOR = 1 THEN MAJOR2 = 'A&S';
43      IF MAJOR = 2 THEN MAJOR2 = 'BUS';
44      IF MAJOR = 3 THEN MAJOR2 = 'EDU';
45
46   PROC PRINT DATA=D2;
47      TITLE1 'JANE DOE';
48   RUN;
49
50   DATA D3;
51      SET D2;
52      IF MAJOR2 = 'A&S';
53   PROC MEANS DATA=D3;
54      TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
55   RUN;
56
57   DATA D4;
58      SET D2;
59      IF MAJOR2 = 'BUS';
60   PROC MEANS DATA=D4;
61      TITLE1 'JANE DOE--BUSINESS MAJORS';
62   RUN;
63
64   DATA D5;
65      SET D2;
66      IF MAJOR2 = 'EDU';
67   PROC MEANS DATA=D5;
68      TITLE1 'JANE DOE--EDUCATION MAJORS';
69   RUN;
70
71   DATA D6;
72      SET D2;
73      IF Q1 NE . AND
74         Q2 NE . AND
75         Q3 NE . AND
76         Q4 NE . AND
77         Q5 NE . ;
78
79   PROC PRINT DATA=D6;
80      TITLE1 'JANE DOE';
81   RUN;

Some notes concerning the preceding program:

•  Lines 1–22 input the achievement motivation data that were first presented in Table 8.1.

•  Lines 23–25 request that PROC PRINT be performed on the initial data set, D1.

•  Lines 27–28 begin a new DATA step. The new data set is named D2, and is initially created as a duplicate of D1.

•  With line 30, the reversed variable Q4 is recoded.

•  With line 31, the new variable ACH_MOT is created as the average of variables Q1, Q2, Q3, Q4, and Q5.

•  Lines 33–35 create a new variable named AGE2, based on the existing variable named AGE.

•  Lines 37–39 create a new variable named SEX2, based on the existing variable named SEX.

•  Lines 41–44 create a new variable named MAJOR2, based on the existing variable named MAJOR.

•  Lines 46–48 request that PROC PRINT be performed on the new data set, D2.

•  Lines 50–51 begin a new DATA step. The new data set is named D3, and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on line 52 that retains a subject only if his or her value on MAJOR2 is "A&S." This ensures that only arts and sciences majors are retained for the new data set. The statements on lines 53–55 request that PROC MEANS be performed on the new data set.

•  Lines 57–58 begin a new DATA step. The new data set is named D4, and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on line 59 that retains a subject only if his or her value on MAJOR2 is "BUS." This ensures that only business majors are retained for the new data set. The statements on lines 60–62 request that PROC MEANS be performed on the new data set.

•  Lines 64–65 begin a new DATA step. The new data set is named D5, and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on line 66 that retains a subject only if his or her value on MAJOR2 is "EDU." This ensures that only education majors are retained for the new data set. The statements on lines 67–69 request that PROC MEANS be performed on the new data set.

•  Lines 71–72 begin a new DATA step. The new data set is named D6, and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on lines 73–77 that retains a subject only if he or she has no missing data on Q1, Q2, Q3, Q4, or Q5. The statements on lines 79–81 request that PROC PRINT be performed on the new data set.

Some General Guidelines

When writing relatively long SAS programs such as this one, it is important to keep two points in mind. First, remember that you can perform data manipulation or data subsetting only within a DATA step. This means that in most cases you should begin a new DATA step (by using the DATA statement) before writing the statements that create new variables, modify existing variables, or create subsets of data.

Second, you must keep track of the names that you give to new data sets, and must specify the correct data set name within a given PROC statement. For example, suppose that you create a data set called D1. In the course of a lengthy SAS program, you create a number of different data sets, all based on D1. Somewhere late in the program, you create a new data set named D5, and within this data set create a new variable named ACH_MOT. You now wish to perform PROC MEANS on ACH_MOT. To do this, you must specify the data set D5 in the PROC MEANS statement, as follows:

   PROC MEANS DATA=D5;
   RUN;

If you specify any other data set (such as D1), you will not obtain the mean for ACH_MOT, as it appears only within the data set named D5. In this case, SAS will issue an error statement in your log file.

Conclusion
This chapter has shown you how to use simple formulas, IF-THEN control statements,
subsetting IF statements, and other tools to modify existing data sets. You should now be
prepared to perform the types of data manipulation that are most commonly required in
research in the social sciences and education.
For example, with these tools it should now be a simple matter for you to convert raw scores
into standardized scores. When analyzing data, researchers often like to standardize
variables so that they have a known mean (typically zero) and a known standard deviation
(typically 1). Scores that have been standardized in this way are called z scores. The
following chapter shows you how to use data manipulation statements to create z scores, and
illustrates some of the ways that z scores can be used to answer research questions.

z Scores

Introduction ................................................................. 262
   Overview .................................................................. 262
   Raw-Score Variables versus Standardized Variables ......................... 262
   Types of Standardized Scores .............................................. 262
   The Advantages of Working with z Scores ................................... 263
Example 9.1: Comparing Mid-Term Test Scores for Two Courses .................. 266
   Data Set to Be Analyzed ................................................... 266
   The DATA Step ............................................................. 267
Converting a Single Raw-Score Variable into a z-Score Variable ............... 268
   Overview .................................................................. 268
   Step 1: Computing the Mean and Sample Standard Deviation .................. 269
   Step 2: Creating the z-Score Variable ..................................... 270
   Examples of Questions That Can Be Answered with the New z-Score Variable .. 276
Converting Two Raw-Score Variables into z-Score Variables .................... 278
   Overview .................................................................. 278
   Review: Data Set to Be Analyzed ........................................... 278
   Step 1: Computing the Means and Sample Standard Deviations ................ 279
   Step 2: Creating the z-Score Variables .................................... 280
   Examples of Questions That Can Be Answered with the New z-Score Variables . 284
Standardizing Variables with PROC STANDARD ................................... 285
   Isn't There an Easier Way to Do This? ..................................... 285
   Why This Guide Used a Two-Step Approach ................................... 286
Conclusion ................................................................... 286


Introduction
Overview
This chapter shows you the advantages of working with standardized variables: variables
with specified means and standard deviations. Most of the chapter focuses on z scores:
scores that have been standardized to have a mean of zero and a standard deviation of 1. It
shows you how to use SAS data manipulation statements to convert raw scores into z scores,
and how to interpret the characteristics of a z score (its sign and size) to understand the
relative standing of that score within a sample.
Raw-Score Variables versus Standardized Variables
All of the variables presented in this guide so far have been raw-score variables. Raw-score
variables are variables that have not been transformed to have a specified mean and
standard deviation. For example, if you administer an attitude scale to a group of subjects,
compute their scores on the scale, and do not transform their scores in any way, then the
attitude scale is a raw-score variable. Depending on the nature of your scale, the sample of
scores might have almost any mean or standard deviation.
In contrast, a standardized variable is a variable that has been transformed to have a
specified mean and standard deviation. For example, consider the scores on the attitude
scale mentioned above. If you wanted to, you could convert these raw scores into z scores.
This means that you would transform the variable so that it has a mean of zero and a
standard deviation of 1. In this situation the new variable that you create (the group of z
scores) is a standardized variable.
Types of Standardized Scores
In the social sciences and education, the z score is probably the most frequently used type of
standardized score. A z score is a value that indicates the distance of a raw score from the
mean when the distance is measured in standard deviations. In other words, a z score
indicates how many standard deviations above (or below) the mean a given raw score is
located. By definition, a sample of z scores has a mean of zero and a standard deviation of 1.
Another type of standardized variable that is sometimes used in the social and behavioral
sciences is the T score. A sample of T scores has a mean of 50 and a standard deviation
of 10.
Intelligence quotient (IQ) scores are often standardized as well. A sample of IQ scores is
typically standardized so that it has a mean of 100 and a standard deviation of about 15.
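Although the chapter does not write out the conversion formulas, the definitions above imply simple linear transformations of z. Assuming a z score has already been computed, the corresponding standardized scores are

$$T = 10z + 50 \qquad\qquad IQ \approx 15z + 100$$

(the IQ relation is approximate because, as noted above, the standard deviation of IQ scores is only about 15).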

Chapter 9: z Scores 263

The Advantages of Working with z Scores

Overview. z scores can be easier to work with than raw scores for a variety of purposes. For example, z scores enable you to immediately determine a particular score's relative position in a sample, and can also make it easier to compare scores on variables that initially had different means and/or standard deviations. These advantages are discussed below.

Immediately determining the relative position of a particular score in a sample. When you look at a z score, you can immediately determine whether the corresponding raw score is above or below the mean, and how far the raw score is from the mean. You do this by viewing the sign and the absolute magnitude of the z score (the absolute magnitude of a number is simply the size of the number, regardless of sign).

Again, assume that you have administered the attitude scale discussed earlier, and have converted the raw scores of the subjects into z scores. The preceding section stated that a sample of z scores has a mean of zero and a standard deviation of 1. You can take advantage of this fact to review a subject's z score and immediately understand that subject's position within the sample.

For example, if a subject has a z score of zero, you know that she scored at the mean on the attitude scale. If her z score has a positive sign (e.g., +1.0, +2.0), then you know that she scored above the mean. If her z score has a negative sign (e.g., -1.0, -2.0), then you know that she scored below the mean.

The absolute magnitude of the subject's z score tells you how far away from the mean the corresponding raw score was located, when the raw score was measured in terms of standard deviations. If a z score has a positive sign, it tells you how many standard deviations the corresponding raw score was above the mean. For example, if a subject has a z score of +1.0, this tells you that his raw score was 1 standard deviation above the mean. If the subject has a z score of +2.0, his raw score was 2 standard deviations above the mean.

The same holds true for z scores with a negative sign, except that these z scores tell you how many standard deviations the corresponding raw score was below the mean. If a given subject has a z score of -1.0, this tells you that his raw score was 1 standard deviation below the mean; if the z score was -2.0, the corresponding raw score was 2 standard deviations below the mean.

So far the discussion has focused on z scores that are whole numbers (such as 1 or 2), but it is important to remember that z scores are typically carried out to one or more places to the right of the decimal point. It is common, for example, to see z scores with values such as 1.4 or 2.31.
Comparing scores for variables with different means and standard deviations. When
you are working with a group of variables that have different means and standard deviations,
it is difficult to compare scores across variables. For example, a raw score of 50 may be
above the mean on Variable 1, but below the mean on Variable 2.

264 Step-by-Step Basic Statistics Using SAS: Student Guide

If you need to make comparisons across variables, it is often a good idea to first convert
all raw scores on all variables into z scores. Regardless of the variable being represented,
all z scores have the same interpretation (e.g., a z score of 1.0 always means that the
corresponding raw score was 1 standard deviation above the mean).
To illustrate this concept in a more concrete way, imagine that you are an admissions
officer at a university. Each year 5,000 people apply for admission to your school. Half
of them come from states in which college applicants take the Learning Aptitude Test
(LAT), an aptitude test that contains three subtests (the Verbal subtest, the Math subtest,
and the Analytical subtest). Suppose that the LAT Verbal subtest has a range from 200 to
800, a mean of 500, and a standard deviation of 100.
The other half come from states in which college applicants take the Higher Education Aptitude Test (HEAT). This test also consists of three subtests (for verbal, math, and
analytical skills), but each subtest has a range, mean, and standard deviation that is
different from those found with the LAT. For example, suppose that the HEAT Verbal
subtest has a range from 1 to 30, a mean of 15, and a standard deviation of 5.
Suppose that you are reviewing the files of two people who have applied for admission to
your university. Applicant A comes from a state that uses the LAT, and her raw score on
the LAT Verbal subtest is 600. Applicant B comes from a state that uses the HEAT, and
his raw score on the HEAT Verbal subtest is 19. Relatively speaking, which of these two
had the higher score?
It is very difficult to make this comparison as long as the variables are in raw-score form.
However, the comparison becomes much easier once the two variables have been
converted into z scores.
The formula for computing a z score is

$$z = \frac{X - \bar{X}}{S_X}$$

where

   z   = the subject's z score
   X   = the subject's raw score
   X̄   = the sample mean
   S_X = the sample standard deviation (remember that N is used in the denominator for this standard deviation, not N - 1).


First, we will convert Applicant A's raw score into a z score (remember that Applicant A had a raw score of 600 on the LAT Verbal subtest). Below, we substitute the appropriate values into the formula:

$$z = \frac{X - \bar{X}}{S_X} = \frac{600 - 500}{100} = \frac{100}{100} = 1.0$$

So Applicant A had a z score of 1.0 (she stood 1 standard deviation above the mean).

Next, we convert Applicant B's raw score into a z score (remember that he had a raw score of 19 on the HEAT Verbal subtest). Below, we substitute the appropriate values into the formula (notice that a different mean and standard deviation are used for Applicant B's formula, compared to Applicant A's formula):

$$z = \frac{X - \bar{X}}{S_X} = \frac{19 - 15}{5} = \frac{4}{5} = 0.8$$

So Applicant B had a z score of 0.8 (he stood 8/10ths of a standard deviation above the mean).
Earlier, we asked which of the two applicants had the higher score. This question was
difficult to answer when the variables were in raw-score form, but is easier to answer now
that the variables are in z-score form. The z score for Applicant A (1.0) was slightly higher
than the z score for Applicant B (0.8). In terms of entrance exam scores, Applicant A may
be a somewhat stronger candidate.
This illustrates one of the reasons that z scores are so important in the social sciences and
education: very often, you will work with groups of variables that have different means and
standard deviations, making it difficult to compare scores from one variable to another. By
converting all scores to z scores, you create a common metric that makes it easier to make
these comparisons.


Example 9.1: Comparing Mid-Term Test Scores for Two Courses

Data Set to Be Analyzed

Suppose that you obtain test scores for 12 college students. All of the students are enrolled in a French course (French 101) and a geology course (Geology 101). All students recently took a mid-term test in each of these two courses. With the test given in French 101, scores could range from 0 to 50. The test given in Geology 101 was longer; scores on that test could range from 0 to 200.
Table 9.1 presents the scores that the 12 students obtained on these two tests.
Table 9.1
Mid-Term Test Scores for Students
_______________________________________
Subject        French 101   Geology 101
_______________________________________
01. Fred           50            90
02. Susan          46           165
03. Marsha         45           170
04. Charles        41           110
05. Paul           39           130
06. Cindy          38           150
07. Jack           37           140
08. Cathy          35           120
09. George         34           155
10. John           31           180
11. Marie          29           135
12. Emmett         25           200
_______________________________________

Table 9.1 uses the same conventions that have been used with other tables in this guide: the horizontal rows represent individual subjects, and the vertical columns represent different variables (scores on mid-term tests, in this case). Where the row for a particular student intersects with the column for a particular course, the table provides the student's score on the mid-term test for that course (e.g., the table shows that Fred received a score of 50 on the French 101 test and a score of 90 on the Geology 101 test; Susan received a score of 46 on the French 101 test and a score of 165 on the Geology 101 test, and so on).
The DATA Step

As you know, the first section of most SAS programs is the DATA step: the section in which the raw data are read to create a SAS data set. The data set of the program used in this example will include all four variables represented in Table 9.1:

•  subject numbers

•  subject names

•  scores on the French 101 test

•  scores on the Geology 101 test.

Technically, it is not necessary to create a SAS variable for subject numbers and subject names in order to compute z scores and perform the other tasks illustrated in this chapter. However, including the subject number and subject name variables will make the output somewhat easier to read.
Below is the DATA step from the SAS program that will analyze the data from Table 9.1:
 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT SUB_NUM
 4            NAME $
 5            FREN
 6            GEOL;
 7   DATALINES;
 8   01 Fred      50  90
 9   02 Susan     46 165
10   03 Marsha    45 170
11   04 Charles   41 110
12   05 Paul      39 130
13   06 Cindy     38 150
14   07 Jack      37 140
15   08 Cathy     35 120
16   09 George    34 155
17   10 John      31 180
18   11 Marie     29 135
19   12 Emmett    25 200
20   ;

Some notes concerning the preceding DATA step:

•  Line 2 assigns this data set the SAS data set name D1.

•  Line 3 assigns the SAS variable name SUB_NUM to represent subject numbers.

•  Line 4 assigns the SAS variable name NAME to represent the students' names. Notice that the variable name is followed by a "$" to tell SAS that this will be a character variable.

•  Line 5 assigns the SAS variable name FREN to represent the students' scores on the French 101 test.

•  Line 6 assigns the SAS variable name GEOL to represent the students' scores on the Geology 101 test.

•  The actual data appear on lines 8–19. You can see that these data were taken directly from Table 9.1.

Converting a Single Raw-Score Variable into a z-Score


Variable
Overview
This section shows you how to convert student scores on the French 101 mid-term test into z
scores. We can convert scores on the Geology 101 test later.
The approach recommended here involves two steps:
•  Step 1: Computing the mean and sample standard deviation for the raw-score variable.

•  Step 2: Using data manipulation statements to create the z-score variable.

This approach requires you to submit your SAS program twice. At Step 1, you will use
PROC MEANS to determine the mean and sample standard deviation for raw scores on the
French 101 test variable, FREN.
At Step 2, you will add a data manipulation statement to your SAS program. This data
manipulation statement will create a new variable to be called FREN_Z, which will be the
z-score version of student scores on the French 101 test. The data manipulation statement that
creates FREN_Z is the formula for the z score, similar to the one presented earlier in this
chapter.
After you have created the new z score variable, you will use PROC PRINT to print out
values on the variable, and will use PROC MEANS to obtain descriptive statistics.
Step 1: Computing the Mean and Sample Standard Deviation
The syntax. You will use PROC MEANS to calculate the mean and sample standard
deviation for the raw score variable, FREN. The syntax is presented below:
PROC MEANS   DATA=data-set-name   VARDEF=N   N   MEAN   STD   MIN   MAX;
   VAR raw-score-variable;
   TITLE1 'your-name';
RUN;


In the preceding syntax, one of the options specified was the VARDEF option (see the first
line). This VARDEF option specifies the divisor to be used when calculating the standard
deviation. If you request
VARDEF=N
then PROC MEANS will compute the sample standard deviation (the formula for the
sample standard deviation uses N as the divisor). In contrast, if you request
VARDEF=DF
then PROC MEANS will compute the estimated population standard deviation (the formula
for the estimated population standard deviation uses N - 1 as the divisor).
This distinction is important because ultimately (at Step 2) you will want to insert the
correct type of standard deviation into the formula that creates your z scores. When
computing z scores, it is very important that you insert the sample standard deviation into
the computational formula for z scores; you generally should not insert the estimated
population standard deviation. This means that, when writing the PROC MEANS
statement at Step 1, you should specify VARDEF=N. If you leave the VARDEF option
out, then PROC MEANS will compute the estimated population standard deviation by
default, and you do not want this.
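If you want to see the difference for yourself, you could run PROC MEANS twice on the
same variable and compare the two standard deviations. Here is a minimal sketch (not part
of the chapter's main program), using the data set D1 and the variable FREN from the
current example:

      * Sample standard deviation (divisor = N);
   PROC MEANS   DATA=D1   VARDEF=N   N   MEAN   STD;
      VAR FREN;
      TITLE1 'SAMPLE STANDARD DEVIATION';
   RUN;

      * Estimated population standard deviation (divisor = N - 1);
      * This is also what you get if you omit the VARDEF option;
   PROC MEANS   DATA=D1   VARDEF=DF   N   MEAN   STD;
      VAR FREN;
      TITLE1 'ESTIMATED POPULATION STANDARD DEVIATION';
   RUN;

Only the value from the VARDEF=N run belongs in the z-score formula used at Step 2.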
A number of other options are also included in the preceding PROC MEANS statement:
•  N requests that the sample size be printed.

•  MEAN requests that the sample mean be printed.

•  STD requests that the standard deviation be printed.

•  MIN requests that the smallest observed value be printed.

•  MAX requests that the largest observed value be printed.

The remaining sections of the preceding syntax are self-explanatory.
The actual SAS statements. Below are the actual statements that request that the MEANS
procedure be performed on the FREN variable from the current data set (note that FREN
is specified in the VAR statement):
PROC MEANS   DATA=D1   VARDEF=N   N   MEAN   STD   MIN   MAX;
   VAR FREN;
   TITLE1 'JANE DOE';
RUN;

The SAS Output. Output 9.1 presents the results generated by the preceding program.
                            JANE DOE

                      The MEANS Procedure

                  Analysis Variable : FREN

  N           Mean        Std Dev        Minimum        Maximum
 ----------------------------------------------------------------
 12     37.5000000      7.0059499     25.0000000     50.0000000
 ----------------------------------------------------------------
Output 9.1. Results of PROC MEANS performed on the raw-score variable FREN.

Some notes concerning this output:
•  FREN is the name of the variable being analyzed.

•  N is the number of subjects providing usable data. In this analysis, 12 students provided
   scores on FREN, as expected.

•  Mean is the sample mean that (at Step 2) will be inserted in your formula for computing z
   scores. Here, you can see that the sample mean for FREN is 37.50.

•  Std Dev is the sample standard deviation that (at Step 2) also will be inserted in your
   formula for computing z scores. Here, you can see that the sample standard deviation for
   FREN rounds to 7.01.

•  It is usually a good idea to check the Minimum and Maximum values to verify that
   there were no obvious typographical errors in the data. Here, the minimum and maximum
   values were 25 and 50 respectively, which seems reasonable.
Step 2: Creating the z-Score Variable
Overview. The preceding MEANS procedure provided the sample mean and sample
standard deviation for the raw-score variable, FREN. You will now include these values in a
SAS data manipulation statement that will use the raw scores included in FREN to create the
z-score variable, FREN_Z.
The formula for z scores. Remember that the formula for computing z scores is:

   z = (X - X̄) / SX

where

   z  = the subject's z score
   X  = the subject's raw score
   X̄  = the sample mean
   SX = the sample standard deviation
The SAS data manipulation statement. It is now necessary to convert this generic formula
into a SAS data manipulation statement that does the same thing (i.e., that creates z scores
by transforming raw scores). Here is the syntax for the SAS data manipulation statement
that will do this:
   z-variable = (raw-variable - mean) / standard-deviation;
You can see that the generic formula for creating z scores and the SAS data
manipulation statement presented above do the same thing:

•  They begin with a subject's score on the raw variable.

•  They subtract the sample mean from that raw-variable score.

•  The resulting difference is then divided by the sample standard deviation.

•  The result is the subject's z score.

Below is the SAS data manipulation statement that creates the z-score variable, FREN_Z.
Notice how the variable names, as well as the mean and standard deviation from Step 1,
have been inserted in the appropriate locations in this statement:
FREN_Z = (FREN - 37.50) / 7.01;
This statement tells SAS to create a new variable called FREN_Z by doing the following:
•  Begin with a given subject's score on FREN.

•  Subtract 37.50 from FREN (37.50 is the sample mean from Step 1).

•  The resulting difference should then be divided by 7.01 (7.01 is the sample standard
   deviation from Step 1).

•  The result is the subject's score on FREN_Z (a z score).

Including the data manipulation statement as part of a new SAS DATA step. In
Chapter 8, "Creating and Modifying Variables and Data Sets," you learned that you can
only create a new variable within a DATA step. In the present example, you are creating a
new variable (FREN_Z) to compute these z scores, and this means that the data
manipulation statement that creates FREN_Z must appear within a new DATA step.
In your original SAS program from Step 1, you assigned the name D1 to your initial SAS
data set. After the DATA step, you added the PROC MEANS statements that computed the
sample mean and standard deviation. In order to complete the tasks required for Step 2, you
can now append new SAS statements to that existing SAS program. Here is one way that
you can do this:
•  Begin a new DATA step after the PROC MEANS statements that were used in Step 1.

•  Begin this new DATA step by creating a new data set named D2. Initially, D2 will be
   created as a duplicate of the existing data set, D1.

•  After creating this new data set, D2, you will append the data manipulation statement
   that creates FREN_Z. This will ensure that the new z-score variable, FREN_Z, will be
   included in D2.

Following is a section of the SAS program that accomplishes this:
      [First part of the DATA step appears here]

 1    10 John     31 180
 2    11 Marie    29 135
 3    12 Emmett   25 200
 4    ;
 5
 6    PROC MEANS   DATA=D1   VARDEF=N   N   MEAN   STD   MIN   MAX;
 7       VAR FREN;
 8       TITLE1 'JANE DOE';
 9    RUN;
10
11    DATA D2;
12       SET D1;
13       FREN_Z = (FREN - 37.50) / 7.01;

Some notes concerning the preceding lines:
•  Lines 1-3 present the last three lines from the data set (to conserve space, only the last
   few lines are presented here).

•  Lines 6-9 present the PROC MEANS statements that were used in Step 1. Obviously, it
   was not really necessary to include these statements in the program that you submit at
   Step 2; if you liked, you could have simply deleted these lines. But it is a good idea to
   include them so that, within a single program, you will have all of the SAS statements
   that are used to compute the z scores.

•  Lines 11-13 create the new data set (named D2), initially creating it as a duplicate of
   D1. Line 13 presents the data manipulation statement that creates the z scores, and
   includes them in a variable named FREN_Z.

Using PROC PRINT to print out the new z-score variable. After creating the new z-score
variable, FREN_Z, you should next use PROC PRINT to print out each subject's value on
this variable. Among other things, this will enable you to check your work, to verify that the
new variable was created correctly.
Chapter 8, "Creating and Modifying Variables and Data Sets," provided the syntax for a
PROC PRINT statement. Below are the statements that use PROC PRINT to create a
printout listing each subject's value for NAME, FREN, and FREN_Z:
PROC PRINT DATA=D2;
   VAR NAME FREN FREN_Z;
   TITLE 'JANE DOE';
RUN;
Notice that, in the PROC PRINT statement, the DATA option specifies that the analysis
should be performed using the data set D2. This is important because the variable FREN_Z
appears only in D2; it does not appear in D1.
Using PROC MEANS to request descriptive statistics for the new variable. Finally, you
use PROC MEANS to obtain simple descriptive statistics (e.g., means, standard deviations)
for any new z-score variables that you create. This is useful because you already know that a
sample of z scores is supposed to have a mean of zero and a standard deviation of 1. In Step
2, you will review the results of PROC MEANS to verify that your new z-score variable
FREN_Z also has a mean of zero and a standard deviation of 1.
Here are the statements that request descriptive statistics for the current example:
PROC MEANS   DATA=D2   VARDEF=N   N   MEAN   STD   MIN   MAX;
   VAR FREN_Z;
   TITLE1 'JANE DOE';
RUN;

Two points concerning these statements:
•  The DATA option in the PROC MEANS statement specifies that the analysis should be
   performed on the new data set, D2.

•  The VARDEF option specifies VARDEF=N, which ensures that the MEANS
   procedure will compute the sample standard deviation rather than the estimated
   population standard deviation. This is appropriate when you want to verify that the
   standard deviation of the z-score variable is close to 1.
Putting it all together. So far, this chapter has presented the SAS statements needed for
Step 2. So that you will have a better idea of how all of these statements fit together, below
we present (a) the last part of the initial DATA step (from Step 1), and (b) the SAS
statements needed to perform the various tasks of Step 2:
   [First part of the DATA step appears here]

   10 John     31 180
   11 Marie    29 135
   12 Emmett   25 200
   ;
   PROC MEANS   DATA=D1   VARDEF=N   N   MEAN   STD   MIN   MAX;
      VAR FREN;
      TITLE1 'JANE DOE';
   RUN;

   DATA D2;
      SET D1;
      FREN_Z = (FREN - 37.50) / 7.01;

   PROC PRINT DATA=D2;
      VAR NAME FREN FREN_Z;
      TITLE 'JANE DOE';
   RUN;

   PROC MEANS   DATA=D2   VARDEF=N   N   MEAN   STD   MIN   MAX;
      VAR FREN_Z;
      TITLE1 'JANE DOE';
   RUN;

SAS Output generated by PROC PRINT. Output 9.2 presents the results generated by the
PRINT procedure in the preceding program:
                   JANE DOE

 Obs    NAME        FREN     FREN_Z

   1    Fred         50      1.78317
   2    Susan        46      1.21255
   3    Marsha       45      1.06990
   4    Charles      41      0.49929
   5    Paul         39      0.21398
   6    Cindy        38      0.07133
   7    Jack         37     -0.07133
   8    Cathy        35     -0.35663
   9    George       34     -0.49929
  10    John         31     -0.92725
  11    Marie        29     -1.21255
  12    Emmett       25     -1.78317

Output 9.2. Results of PROC PRINT performed on NAME, FREN, and FREN_Z.
Some notes concerning the preceding output:

•  The OBS variable is generated by SAS whenever it performs PROC PRINT. It merely
   assigns an observation number to each subject.

•  The NAME column provides each student's first name.

•  The FREN column provides each student's raw score for the French 101 mid-term test, as it
   appears in Table 9.1.

•  The FREN_Z column provides each student's score for the new z-score variable that was
   created. These z scores correspond to the raw scores for the French 101 mid-term test, as
   they appear in Table 9.1.
Were the z scores in the FREN_Z column created correctly? You can find out by computing
z scores manually for a few subjects, and verifying that your results match the results
generated by SAS. For example, the first subject (Fred) had a raw score on FREN of 50.
You can compute his z score by inserting this raw score in the z-score formula:
   z = (X - X̄) / SX = (50 - 37.50) / 7.01 = 12.50 / 7.01 = 1.783
So Fred's z score was 1.783, which rounds to 1.78. Output 9.2 shows that this is also the
value that SAS obtained. So far, these results are consistent with the conclusion that the z
scores have been computed correctly.
Reviewing the mean and standard deviation for the new z-score variable. Another way
to verify that the z scores were created correctly is to perform PROC MEANS on the z-score
variable, then verify that the mean for the new variable is approximately zero, and that the
standard deviation is approximately 1. The preceding program included PROC MEANS
statements, and the results are presented in Output 9.3.
                           JANE DOE

                      The MEANS Procedure

                 Analysis Variable : FREN_Z

  N             Mean        Std Dev        Minimum        Maximum
 ------------------------------------------------------------------
 12     -5.55112E-17      0.9994222     -1.7831669      1.7831669
 ------------------------------------------------------------------
Output 9.3. Results of PROC MEANS performed on FREN_Z.
The variable name FREN_Z tells you that this analysis was performed on the new z-score
variable.
The "Mean" column contains the sample mean for this z-score variable, -5.55112E-17. You
might be concerned that something is wrong because this number does not appear to be
approximately zero, as we had expected. But the number is, in fact, very close to zero.
The number is presented in scientific notation. The actual value presented in Output 9.3 is
-5.55112, and E-17 tells you that the decimal place must be moved 17 spaces to the left.
Thus, the actual mean is -0.0000000000000000555112. Obviously, this mean is very close
to zero, and should reassure us that the z-score variable was probably created correctly.
Why was the mean for FREN_Z not exactly zero? The answer is that we did not use a great
deal of precision in creating FREN_Z. Here again is the data manipulation statement that
created it:
FREN_Z = (FREN - 37.50) / 7.01;
Notice that we went to only two places beyond the decimal point when we typed the sample
mean (37.50) and standard deviation (7.01). If we had carried these values out a greater
number of decimal places, our z-score variable would have been created with greater
precision, and the mean score on FREN_Z would have been even closer to zero.
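For example, here is a sketch of a higher-precision version of the data manipulation
statement, using the full values reported by PROC MEANS in Output 9.1 (mean =
37.5000000, sample standard deviation = 7.0059499):

   DATA D2;
      SET D1;
         * Same formula, with the Step 1 results carried to seven decimal places;
      FREN_Z = (FREN - 37.5000000) / 7.0059499;

This extra precision is not required for the purposes of this chapter; it simply shows that the
small departures from 0 and 1 are rounding artifacts rather than errors in the program.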
The standard deviation for FREN_Z appears below the heading "Std Dev" in Output 9.3.
You can see that the standard deviation for this variable is 0.9994222. Again, this is very
close to the value of 1 that is expected with a sample of z scores (the fact that it is not
exactly 1, again, is due to the somewhat weak precision used in our data manipulation
statement). The results suggest that the z-score variable was probably created in the correct
manner.
Examples of Questions That Can Be Answered with the New z-Score Variable
Reviewing the sign and absolute magnitude of a z score. The introduction section of this
chapter discussed a number of advantages of working with z scores. One of these advantages
involves the fact that, by simply reviewing a z score, you can immediately determine the
relative position of that score within a sample. Specifically,
•  The sign of a z score tells you whether the raw score appears above or below the mean
   (a positive sign means above, a negative sign means below).

•  The absolute magnitude of the z score tells you how far away from the mean the
   corresponding raw score was located, in terms of standard deviations (e.g., a z score of
   1.2 tells you that the raw score was 1.2 standard deviations from the mean).

Output 9.4 presents the results of the PROC PRINT that were previously presented as
Output 9.2. This output provides each subjects z score for the French 101 test, as created by
the preceding program. This output is reproduced again so that you can see how the results
that it contains can be used to answer questions about the location of specific scores within
the sample. This section provides the answers for each of the questions. Be sure that you
understand the reasoning that led to these answers, as you might be asked to answer similar
questions as part of an exercise when you complete this chapter.
                   JANE DOE

 Obs    NAME        FREN     FREN_Z

   1    Fred         50      1.78317
   2    Susan        46      1.21255
   3    Marsha       45      1.06990
   4    Charles      41      0.49929
   5    Paul         39      0.21398
   6    Cindy        38      0.07133
   7    Jack         37     -0.07133
   8    Cathy        35     -0.35663
   9    George       34     -0.49929
  10    John         31     -0.92725
  11    Marie        29     -1.21255
  12    Emmett       25     -1.78317

Output 9.4. Results of PROC PRINT performed on NAME, FREN, and FREN_Z
(to illustrate the questions that can be answered with z scores).

Questions regarding the new z-score variable, FREN_Z, that appears in Output 9.4:
1. Question: Fred's raw score on the French 101 test was 50 (Fred was Observation #1).
   What was the relative position of this score within the sample? Explain your answer.
   Answer: Fred's score was 1.78 standard deviations above the mean. I know that his
   score was above the mean because his z score was a positive value. I know that his
   score was 1.78 standard deviations from the mean because the absolute value of the z
   score was 1.78.
2. Question: Cindy's raw score on the French 101 test was 38 (Cindy was Observation
   #6). What was the relative position of this score within the sample? Explain your
   answer.
   Answer: Cindy's score was 0.07 standard deviations above the mean. I know that her
   score was above the mean because her z score was a positive value. I know that her
   score was 0.07 standard deviations from the mean because the absolute value of the z
   score was 0.07.
3. Question: Marie's raw score on the French 101 test was 29 (Marie was Observation
   #11). What was the relative position of this score within the sample? Explain your
   answer.
   Answer: Marie's score was 1.21 standard deviations below the mean. I know that her
   score was below the mean because her z score was a negative value. I know that her
   score was 1.21 standard deviations from the mean because the absolute value of the z
   score was 1.21.
Converting Two Raw-Score Variables into z-Score Variables
Overview
In many situations, it is necessary to convert one or more raw-score variables into z-score
variables. In these situations, you should follow the same two-step sequence described above:
(a) compute the mean and standard deviation for each raw-score variable, and then (b) write
data manipulation statements that will create the new z-score variables. You will write a
separate data manipulation statement for each new z-score variable to be created. Needless
to say, when you write a data manipulation statement for a given variable, it is important to
insert the correct mean and standard deviation into that statement (i.e., the mean and
standard deviation for the corresponding raw-score variable).
This section shows you how to convert two raw-score variables into two new z-score
variables. It builds on the preceding section by analyzing the same data set (the data set with
test scores for French 101 and Geology 101). Because almost all of the concepts discussed
here have already been discussed in an earlier section of the chapter, the present material will
be covered in less detail.
Review: Data Set to Be Analyzed
An earlier section of this chapter, "Example 9.1: Scores on Mid-Term Tests in Two
Courses," described the four variables included in your data set:

•  The first variable was given the SAS variable name SUB_NUM. This was a numeric
   variable that included each student's subject number.

•  The second variable was given the SAS variable name NAME. This was a character
   variable that included each subject's first name.

•  The third variable was given the SAS variable name FREN. This was a numeric
   variable that included each subject's raw score on the mid-term test given in the French
   101 course.

•  The fourth variable was given the SAS variable name GEOL. This was a numeric
   variable that included each subject's raw score on the mid-term test given in the
   Geology 101 course.

The same earlier section also provided each subject's values on these variables, then
presented the SAS DATA step to read the data into a SAS data set.
Step 1: Computing the Means and Sample Standard Deviations
Your first task is to compute the sample mean and sample standard deviation for the two test
score variables, FREN and GEOL. This can be done by adding the following statements to a
SAS program that already contains the SAS DATA step:
PROC MEANS   DATA=D1   VARDEF=N   N   MEAN   STD   MIN   MAX;
   VAR FREN GEOL;
   TITLE1 'JANE DOE';
RUN;

The preceding statements are identical to the PROC MEANS statements presented earlier,
except that the VAR statement now lists both FREN and GEOL. This will cause SAS to
compute the mean and sample standard deviation (along with some other descriptive
statistics) for both of these variables.
Output 9.5 presents the results that were generated by the preceding statements.
                              JANE DOE

                         The MEANS Procedure

Variable     N            Mean        Std Dev        Minimum        Maximum
----------------------------------------------------------------------------
FREN        12      37.5000000      7.0059499     25.0000000     50.0000000
GEOL        12     145.4166667     29.7530344     90.0000000    200.0000000
----------------------------------------------------------------------------
Output 9.5. Results of PROC MEANS performed on the raw-score variables
FREN and GEOL.

The "Mean" column presents the sample mean for the two test-score variables.
The "Std Dev" column presents the sample standard deviations. You can see that, for FREN,
the mean is 37.50 and the sample standard deviation is 7.01 (obviously, these figures had to
be identical to the figures presented in Output 9.1 because the same variable was analyzed).
For the second variable, GEOL, the mean is 145.42, and the sample standard deviation is
29.75.
With these means and standard deviations successfully computed, you can now move on to
Step 2, where they will be inserted into data manipulation statements that will create the
new z-score variables.
Step 2: Creating the z-Score Variables
The SAS data manipulation statements. Earlier in this chapter, you saw the following
syntax for the data manipulation statement that creates a z-score variable:

   z-variable = (raw-variable - mean) / standard-deviation;
In this step you will create a z-score variable for scores on the French 101 test, and give it
the SAS variable name FREN_Z. In doing this, the mean and sample standard deviation for
FREN from Output 9.5 will be inserted in the formula (because you are working with the
same mean and standard deviation, this data manipulation statement will be identical to the
data manipulation statement for FREN_Z that was presented earlier).
FREN_Z = (FREN - 37.50) / 7.01;
Next, you will create a z-score variable for scores on the Geology 101 test, and give it the
SAS variable name GEOL_Z. In doing this, the mean and sample standard deviation for
GEOL from Output 9.5 will be inserted in the formula:
GEOL_Z = (GEOL - 145.42) / 29.75;
Including the data manipulation statements as part of a new SAS DATA step.
Remember that new SAS variables can be created only within a DATA step. Therefore,
within your SAS program, you will begin a new DATA step prior to writing the two
preceding statements that create FREN_Z and GEOL_Z. This is done in the following
excerpt from the SAS program:
      [First part of the DATA step appears here]

 1    10 John     31 180
 2    11 Marie    29 135
 3    12 Emmett   25 200
 4    ;
 5
 6    PROC MEANS   DATA=D1   VARDEF=N   N   MEAN   STD   MIN   MAX;
 7       VAR FREN GEOL;
 8       TITLE1 'JANE DOE';
 9    RUN;
10
11    DATA D2;
12       SET D1;
13       FREN_Z = (FREN - 37.50) / 7.01;
14       GEOL_Z = (GEOL - 145.42) / 29.75;

Some notes about the preceding excerpt:
•  Lines 1-3 present the last three data lines from the data set. To save space, only the last
   few lines from the data set are reproduced.

•  Lines 6-9 present the PROC MEANS statements that cause SAS to compute the mean
   and standard deviation for FREN and GEOL. These statements were discussed in Step 1.

•  Lines 11-12 begin a new DATA step by creating a new data set named D2. Initially, D2
   is created as a duplicate of D1.

•  Lines 13-14 present the data manipulation statements that create the new z-score
   variables, FREN_Z and GEOL_Z.

Using PROC PRINT to print out the new z-score variables. After you create the new
z-score variables, you can use PROC PRINT to print out each subject's value on these
variables. The following statements accomplish this:
PROC PRINT DATA=D2;
   VAR NAME FREN GEOL FREN_Z GEOL_Z;
   TITLE 'JANE DOE';
RUN;

The preceding VAR statement requests that this printout include each subject's values for
the variables NAME, FREN, GEOL, FREN_Z, and GEOL_Z.
Using PROC MEANS to request descriptive statistics for the new variables. Remember
that it is generally a good idea to use PROC MEANS to compute simple descriptive
statistics for the new z-score variables that you create. This will enable you to verify that the
mean is approximately zero, and the standard deviation is approximately 1, for each new
variable. This is accomplished by the following statements:
PROC MEANS   DATA=D2   VARDEF=N   N   MEAN   STD   MIN   MAX;
   VAR FREN_Z GEOL_Z;
   TITLE1 'JANE DOE';
RUN;

Putting it all together. The following shows the last part of the initial DATA step, along
with the SAS statements needed to perform the various tasks of Step 2:
   (First part of the DATA step appears here)

   10 John     31 180
   11 Marie    29 135
   12 Emmett   25 200
   ;
   PROC MEANS   DATA=D1   VARDEF=N   N   MEAN   STD   MIN   MAX;
      VAR FREN GEOL;
      TITLE1 'JANE DOE';
   RUN;

   DATA D2;
      SET D1;
      FREN_Z = (FREN - 37.50) / 7.01;
      GEOL_Z = (GEOL - 145.42) / 29.75;

   PROC PRINT DATA=D2;
      VAR NAME FREN GEOL FREN_Z GEOL_Z;
      TITLE 'JANE DOE';
   RUN;

   PROC MEANS   DATA=D2   VARDEF=N   N   MEAN   STD   MIN   MAX;
      VAR FREN_Z GEOL_Z;
      TITLE1 'JANE DOE';
   RUN;

SAS output generated by PROC MEANS. In the output that is generated by the preceding
program, the results of the MEANS procedure performed on FREN_Z and GEOL_Z will be
presented first. This will enable you to verify that there were no obvious errors in creating
the new z-score variables. After this is done, you can view the results from the PRINT
procedure.
Output 9.6 presents the results of PROC MEANS performed on FREN_Z and GEOL_Z.
                              JANE DOE

                         The MEANS Procedure

Variable     N             Mean        Std Dev        Minimum        Maximum
-----------------------------------------------------------------------------
FREN_Z      12     -5.55112E-17      0.9994222     -1.7831669      1.7831669
GEOL_Z      12     -0.000112045      1.0001020     -1.8628571      1.8346218
-----------------------------------------------------------------------------
Output 9.6. Results of the PROC MEANS performed on FREN_Z and GEOL_Z.
Reviewing the means and standard deviations for the new z-score variables. Output 9.3
(from earlier in this chapter) has already provided the mean and standard deviation for
FREN_Z. Those results are identical to the results for FREN_Z presented in Output 9.6,
and so the descriptive statistics for FREN_Z will not be reviewed again at this point.
The second row of results in Output 9.6 presents descriptive statistics for GEOL_Z.
You can see that the mean for this variable is -0.000112045, which is very close to the mean
of zero that you would normally expect for a z-score variable. This provides some assurance
that GEOL_Z was created correctly.
You may have noticed that the mean for GEOL_Z was not presented in scientific notation,
as was the case for FREN_Z. This is because the mean for GEOL_Z was not quite as close
to zero, and it was therefore not necessary to use scientific notation.
Where the row "GEOL_Z" intersects with the column headed "Std Dev," you can see that the
standard deviation for this variable is 1.0001020. This is very close to the standard deviation
of 1 that you would normally expect for a z-score variable, and again provides some
evidence that GEOL_Z was probably created correctly.
Because the means and standard deviations for the new z-score variables seem to be
appropriate, you can now review the individual z scores in the results that were generated by
PROC PRINT.
SAS output generated by PROC PRINT. Output 9.7 presents results generated by the
PROC PRINT statements included in the preceding program.
                      JANE DOE

 Obs    NAME        FREN    GEOL     FREN_Z     GEOL_Z

   1    Fred         50      90     1.78317   -1.86286
   2    Susan        46     165     1.21255    0.65815
   3    Marsha       45     170     1.06990    0.82622
   4    Charles      41     110     0.49929   -1.19059
   5    Paul         39     130     0.21398   -0.51832
   6    Cindy        38     150     0.07133    0.15395
   7    Jack         37     140    -0.07133   -0.18218
   8    Cathy        35     120    -0.35663   -0.85445
   9    George       34     155    -0.49929    0.32202
  10    John         31     180    -0.92725    1.16235
  11    Marie        29     135    -1.21255   -0.35025
  12    Emmett       25     200    -1.78317    1.83462

Output 9.7. Results of PROC PRINT performed on NAME, FREN, GEOL,
FREN_Z, and GEOL_Z.

In Output 9.7, the column headed "FREN" presents subjects' raw scores for the French 101
test.
Similarly, the column headed "GEOL" presents raw scores for the Geology 101 test.
At this point, however, you are more interested in the new z-score variables that appear in
the columns headed "FREN_Z" and "GEOL_Z." These columns provide standardized
versions of scores on the French 101 and Geology 101 tests, respectively. Because these test
scores have been standardized, you can use them to answer a different set of questions.
These questions and their answers are discussed in the following section.
Examples of Questions That Can Be Answered with the New z-Score Variables
The introduction section of this chapter indicated that one of the advantages of working with
z scores is the fact that they enable you to compare scores on variables that otherwise would
have different means and standard deviations. For example, assume that you are working
with the raw-score versions of scores on the French 101 and Geology 101 tests. Scores on
the French 101 test could possibly range from 0 to 50, and scores on the Geology 101 test
could possibly range from 0 to 200. This resulted in the two tests having very different
means and standard deviations.
Comparing scores on variables with different means and standard deviations. Suppose
that you wanted to know: "Compared to the other students, did the student named Susan
(Observation #2 in Output 9.7) score higher on the French 101 test or on the Geology 101
test?" This question is difficult to answer if you focus on the raw scores (the columns
headed FREN and GEOL in Output 9.7). Although her score on the French 101 test was 46,
and her score on the Geology 101 test was higher at 165, this certainly does not mean that
she did better on the Geology 101 test; her score may have been higher there simply because
the test was on a 200-point scale (rather than the 50-point scale used with the French 101
test).
Comparing scores on z-score variables. The data becomes much more meaningful when
you review the z-score versions of the variables named FREN_Z and GEOL_Z. These
columns show that Susan's z score on the French 101 test was 1.21, while her z score on the
Geology 101 test was lower at 0.66. Because you are working with z scores, you know that
both of these variables have the same mean (zero) and the same standard deviation (1). This
means that you can directly compare the two scores. Clearly, Susan did better on the French
101 test than on the Geology 101 test (compared to other students).
The following section provides some questions that could be asked about the performance of
students on the French 101 and Geology 101 tests. Following each question is the correct
answer based on the z scores presented in Output 9.7. Be sure that you understand the
reasoning that led to these answers, as you might be asked to answer similar questions on
your own as part of an exercise when you complete this chapter.
Questions regarding the new z-score variables, FREN_Z and GEOL_Z, that appear in
Output 9.7:
1. Question: Compared to the other students, did the student named Fred (Observation
   #1 in Output 9.7) score higher on the French 101 test or on the Geology 101 test?
   Explain your answer.
   Answer: Compared to the other students, Fred scored higher on the French 101 test
   than on the Geology 101 test. I know this because his z score on the French 101 test
   was a positive value (1.78), while his z score on the Geology 101 test was a negative
   value (-1.86).
2. Question: Compared to the other students, did the student named Cindy (Observation
#6 in Output 9.7) score higher on the French 101 test or on the Geology 101 test?
Explain your answer.
Answer: Compared to the other students, Cindy scored higher on the Geology 101 test
than on the French 101 test. I know this because both z scores were positive, and her z
score on the Geology 101 test (0.15) was higher than her score on the French 101 test
(0.07).
3. Question: Compared to the other students, did the student named Cathy (Observation
   #8 in Output 9.7) score higher on the French 101 test or on the Geology 101 test?
   Explain your answer.
   Answer: Compared to the other students, Cathy scored higher on the French 101 test
   than on the Geology 101 test. I know this because both z scores were negative, and her
   z score on the French 101 test (-0.36) was closer to zero than her score on the Geology
   101 test (-0.85).
Standardizing Variables with PROC STANDARD
Isn't There an Easier Way to Do This?
This chapter has presented a two-step approach that can be used to standardize variables,
converting raw-score variables into z-score variables. It is worth mentioning, however, that
when you work with SAS, there is often more than one way to accomplish a data
management or statistical analysis task. This applies to the creation of z scores as well.
The SAS System includes a procedure named STANDARD that can be used to standardize
variables in a SAS data set. This procedure enables you to begin with a raw-score variable
and standardize it so that it has a specified mean and standard deviation. If you specify that
the new variable should have a mean of zero and a standard deviation of 1, then you have
created a z-score variable. You can use the new standardized variable in subsequent
analyses.
Using PROC STANDARD has a number of advantages over the approach to standardization
taught in this chapter. One of these advantages is that it enables you to complete the
standardization process in one step, rather than two.
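For the curious, here is a minimal sketch of the one-step approach (the output data set name
D3 is an arbitrary choice made for this illustration; note that the standardized scores replace
the original values of FREN and GEOL in the new data set):

   PROC STANDARD   DATA=D1   OUT=D3   MEAN=0   STD=1   VARDEF=N;
      VAR FREN GEOL;
   RUN;

Specifying MEAN=0 and STD=1 requests z scores, and VARDEF=N (as in the two-step
approach) bases the standardization on the sample standard deviation.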
Why This Guide Used a Two-Step Approach
If PROC STANDARD has this advantage, then why did the current chapter teach the
somewhat more laborious, two-step procedure? This was done because it was expected that
this guide will normally be used in a basic statistics course, and the present approach is
somewhat more educational: you begin with the generic formula for creating z scores, and
translate that generic formula into a SAS data manipulation statement that actually creates z
scores. This approach should reinforce your understanding of what a z score is, and exactly
how a z score is obtained.
For more detailed information on the use of the STANDARD procedure for computing
z scores and other types of standardized variables, see the SAS Procedures Guide (1999c).

Conclusion
Up until this point, this guide has largely focused on basic concepts in statistics and the SAS
System. You have learned the basics of how to use SAS, and have learned about SAS
procedures that perform elementary types of data analysis.
The next chapter will take you to a new level, however, as it presents the first inferential
statistic to be covered in this text. In Chapter 10, you will learn how to use the SAS System
to compute Pearson correlation coefficients. The Pearson correlation coefficient is a
measure of association that is used to investigate the relationship between two numeric
variables. In Chapter 10, "Bivariate Correlation," you will learn the assumptions that
underlie this statistic, will see examples in which PROC CORR is used to compute Pearson
correlations, and will learn how to prepare analysis reports that summarize the results
obtained from correlational research.

Bivariate Correlation

Introduction..........................................................................290
   Overview................................................................................................ 290
Situations Appropriate for the Pearson Correlation Coefficient.........290
   Overview................................................................................................ 290
   Nature of the Predictor and Criterion Variables ..................................... 291
   The Type-of-Variable Figure .................................................................. 291
   Example of a Study Providing Data Appropriate for This Procedure...... 291
   Summary of Assumptions for the Pearson Correlation Coefficient ........ 293
Interpreting the Sign and Size of a Correlation Coefficient................293
   Overview................................................................................................ 293
   Interpreting the Sign of a Correlation Coefficient ................................... 293
   Interpreting the Size of a Correlation Coefficient ................................... 295
   The Coefficient of Determination............................................................ 296
Interpreting the Statistical Significance of a Correlation
   Coefficient .......................................................................297
   Overview................................................................................................ 297
   The Null Hypothesis for the Test of Significance ................................... 297
   The Alternative Hypothesis for the Test of Significance......................... 297
   The p Value ........................................................................................... 298
Problems with Using Correlations to Investigate Causal
   Relationships ...................................................................299
   Overview................................................................................................ 299
   Correlations and Cause-and-Effect Relationships ................................. 300
   An Initial Explanation ............................................................................. 300
   Alternative Explanations ........................................................................ 300
   Obtaining Stronger Evidence of Cause and Effect................................. 302
   Is Correlational Research Ever Appropriate?......................................... 302
Example 10.1: Correlating Weight Loss with a Variety of
   Predictor Variables..........................................................303
   Overview................................................................................................ 303
   The Study .............................................................................................. 303
   The Criterion Variable and Predictor Variables in the Analysis.............. 304
   Data Set to Be Analyzed........................................................................ 305
   The DATA Step for the SAS Program.................................................... 306
Using PROC PLOT to Create a Scattergram........................307
   Overview................................................................................................ 307
   Why You Should Create a Scattergram Prior to Computing a Correlation
      Coefficient .......................................................................................... 307
   Syntax for the SAS Program.................................................................. 308
   Results from the SAS Output................................................................. 310
Using PROC CORR to Compute the Pearson Correlation
   between Two Variables...................................................313
   Overview................................................................................................ 313
   Syntax for the SAS Program.................................................................. 313
   Results from the SAS Output................................................................. 315
   Steps in Interpreting the Output ............................................................. 315
   Summarizing the Results of the Analysis............................................... 318
Using PROC CORR to Compute All Possible Correlations for a
   Group of Variables...........................................................320
   Overview................................................................................................ 320
   Writing the SAS Program....................................................................... 321
   Results from the SAS Output................................................................. 322
Summarizing Results Involving a Nonsignificant Correlation.............324
   Overview................................................................................................ 324
   The Results from PROC CORR............................................................. 324
   The Results from PROC PLOT .............................................................. 325
   Summarizing the Results of the Analysis............................................... 328
Using the VAR and WITH Statements to Suppress the Printing
   of Some Correlations.......................................................329
   Overview................................................................................................ 329
   Writing the SAS Program....................................................................... 329
   Results from the SAS Output................................................................. 331
Computing the Spearman Rank-Order Correlation Coefficient
   for Ordinal-Level Variables..............................................332
   Overview................................................................................................ 332
   Situations Appropriate for This Statistic ................................................. 332
   Example of When to Compute the Spearman Rank-Order
      Correlation Coefficient........................................................................ 332
   Writing the SAS Program....................................................................... 333
   Understanding the SAS Output.............................................................. 333
Some Options Available with PROC CORR ..........................333
   Overview................................................................................................ 333
   Where in the Program to Request Options ............................................ 334
   Description of Some Options ................................................................. 334
   Where to Find More Options for PROC CORR ...................................... 335
Problems with Seeking Significant Results ........................335
   Overview................................................................................................ 335
   Reprise: Null Hypothesis Testing with Just Two Variables ................... 335
   Null Hypothesis Testing with a Larger Number of Variables .................. 336
   How to Avoid This Problem.................................................................... 337
Conclusion............................................................................338

Introduction
Overview
This chapter shows you how to use SAS to compute correlation coefficients. Most of the
chapter focuses on the Pearson product-moment correlation coefficient. You use this
procedure when you want to determine whether there is a significant relationship between
two numeric variables that are each assessed on an interval scale or ratio scale (there are a
number of additional assumptions that must also be met; these will be discussed below). The
chapter also illustrates the use of the Spearman rank-order correlation coefficient, which is
appropriate for variables assessed on an ordinal scale of measurement.
This chapter discusses a number of issues related to the conduct of correlational research. It
shows how to interpret the sign and size of correlation coefficients, and how to determine
whether they are statistically significant. It cautions against "fishing" for significant findings
by computing large numbers of correlation coefficients in a single study. It also cautions
against using correlational findings to draw conclusions about cause-and-effect
relationships.
This chapter shows you how to use PROC PLOT to create a scattergram so that you can
verify that the relationship between two variables is linear. It then illustrates the use of
PROC CORR to compute Pearson correlation coefficients. It shows (a) how to compute the
correlation between just two variables, (b) how to compute all possible correlations between
a number of variables, and (c) how to use the VAR and WITH statements to selectively
suppress the printing of some correlations. It shows how to prepare analysis reports for
correlations that are statistically significant, as well as for correlations that are
nonsignificant.
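As a preview, the skeleton of such an analysis might look like the following sketch. It
assumes a data set named D1 containing two numeric variables named X and Y; these names
are placeholders chosen for this illustration, and the chapter's own examples use more
meaningful names:

      * Create a scattergram to verify that the relationship is linear;
   PROC PLOT DATA=D1;
      PLOT Y*X;
   RUN;

      * Compute the Pearson correlation between the two variables;
   PROC CORR DATA=D1;
      VAR X Y;
   RUN;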

Situations Appropriate for the Pearson Correlation Coefficient
Overview
A correlation coefficient is a number that summarizes the nature of the relationship
between two variables. Most of this chapter focuses on the Pearson product-moment
correlation coefficient. The Pearson correlation coefficient is appropriate when both
variables being analyzed are assessed on an interval or ratio level, and the relationship
between the two variables is linear. The symbol for the Pearson product-moment correlation
is r.
The first part of this section describes the types of situations in which this statistic is
typically computed, and discusses a few of the assumptions underlying the procedure. A
more complete summary of assumptions is presented at the end of this section.
Nature of the Predictor and Criterion Variables
Predictor variable. In computing a Pearson correlation coefficient, the predictor variable
should be a numeric variable that is assessed on an interval or ratio scale of measurement.
Criterion variable. The criterion variable should also be a numeric variable that is assessed
on an interval or ratio scale of measurement.
The Type-of-Variable Figure
When researchers compute Pearson correlation coefficients, they are typically studying the
relationship between (a) a criterion variable that is a multi-value numeric variable and (b) a
predictor variable that is also a multi-value numeric variable.
Chapter 2, "Terms and Concepts Used in This Guide," introduced you to the concept of the
type-of-variable figure. A type-of-variable figure indicates the types of variables that are
included in an analysis when the variables are classified according to the number of values
that they assume. Using this scheme, all variables can be classified as being either
dichotomous variables, limited-value variables, or multi-value variables.
The following figure illustrates the types of variables that are typically being analyzed when
computing a Pearson correlation coefficient.
     Criterion             Predictor

       Multi       =         Multi

Chapter 2 indicated that the symbol that appears to the left of the equal sign in this type of
figure represents the criterion variable in the analysis. The "Multi" symbol that appears to
the left of the equal sign in the above figure shows that the criterion variable in the
computation of a Pearson correlation is typically a multi-value variable (a variable that
assumes more than six values in your sample).
Chapter 2 also indicated that the symbol that appears to the right of the equal sign in this
type of figure represents the predictor variable in the analysis. The "Multi" symbol that
appears to the right of the equal sign in the above figure shows that the predictor variable in
the computation of a Pearson correlation is also typically a multi-value variable.
Example of a Study Providing Data Appropriate for This Procedure
Predictor and criterion variables. Suppose that you are an industrial psychologist who is
studying prosocial organizational behavior. Employees score high on prosocial
organizational behavior when they do helpful things for the organization or for other
employees: helpful things that are beyond their normal job responsibilities. This might
include volunteering for some new assignment, helping a new employee on the job, or
helping to clean up the shop.
Suppose that you want to identify variables that may be correlated with prosocial
organizational behavior. Based on a review of the literature, you hypothesize that perceived
organizational fairness may be related to this variable. Employees score high on perceived
organizational fairness when they believe that the organization's management has treated
them equitably.
Research method. You conduct a study to determine whether there is a significant
correlation between prosocial organizational behavior and perceived organizational fairness
in a sample of 300 employees. To assess prosocial organizational behavior, you develop a
checklist of prosocial behaviors, and ask supervisors to evaluate each of their subordinates
with this checklist by checking off behaviors each time they are displayed by the
subordinate.
To assess perceived organizational fairness, you use a questionnaire scale developed by
other researchers. The questionnaire contains items such as "This organization treats me
fairly." Employees circle a number from 1 to 7 to indicate the extent to which they agree or
disagree with each item. You sum responses to the individual items to create a single
summed score for each employee. With this variable, higher scores indicate greater
agreement that the organization treats them fairly.
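In SAS terms, creating such a summed score might look like the following sketch. It
assumes that the individual questionnaire responses were read into variables named FAIR1
through FAIR5; the variable names and the number of items are hypothetical, chosen only
for this illustration:

   DATA D2;
      SET D1;
         * Sum the individual fairness items into one scale score (hypothetical item names);
      FAIRNESS = FAIR1 + FAIR2 + FAIR3 + FAIR4 + FAIR5;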
To analyze the data, you compute the correlation between the measure of prosocial behavior
and the measure of perceived fairness. You hypothesize that there will be a positive
correlation between the two variables.
Why this questionnaire data would be appropriate for this procedure. Earlier sections
have indicated that, to compute a Pearson product-moment correlation coefficient, the
predictor variable should be a numeric variable that is assessed on an interval or ratio scale
of measurement. The predictor variable in this study consisted of scores on a questionnaire
scale that is designed to assess perceived organizational fairness. Most researchers would
agree that scores from this type of summated rating scale can be viewed as constituting an
interval scale of measurement (assuming that the scale was developed properly).
To compute a Pearson correlation, the criterion variable in the analysis should also be a
numeric variable that is assessed on an interval or ratio scale of measurement. The criterion
variable in the current study was prosocial organizational behavior. A particular employee's
score on this variable is the number of prosocial behaviors (as assessed by the employee's
supervisor) that they have displayed in a specified period of time. This "number of prosocial
behaviors" variable has equal intervals and a true zero point. Therefore, this variable appears
to be assessed on the ratio level.
To review, when you compute a Pearson correlation, the predictor and the criterion variable
are usually multi-value variables. To determine whether this is the case for the current study,
you would use PROC FREQ to create simple frequency tables for the predictor and criterion
variables (similar to those shown in Chapter 5, "Creating Frequency Tables"). You would
know that both variables were multi-value variables if you observed more than six values for
each of them in their frequency tables.
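A sketch of that check appears below. It assumes that the two scores are stored in SAS
variables named FAIR (perceived fairness) and PROSOC (prosocial behavior); both names
are hypothetical:

   PROC FREQ DATA=D1;
      TABLES FAIR PROSOC;
   RUN;

Each frequency table lists every value that was observed for a variable, so you can simply
count the distinct values to classify the variable.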
Summary of Assumptions for the Pearson Correlation Coefficient

•  Interval-level measurement. Both the predictor and criterion variables should be
   assessed on an interval or ratio level of measurement.

•  Random sampling. Each subject in the sample should contribute one score on the
   predictor variable, and one score on the criterion variable. These pairs of scores should
   represent a random sample drawn from the population of interest.

•  Linearity. The relationship between the criterion variable and the predictor variable
   should be linear. This means that, in the population, the mean criterion scores at each
   value of the predictor variable should fall on a straight line. The Pearson correlation
   coefficient is not appropriate for assessing the strength of the relationship between two
   variables involved in a curvilinear relationship.

•  Bivariate normal distribution. The pairs of scores should follow a bivariate normal
   distribution. This means that (a) scores on the criterion variable should form a normal
   distribution at each value of the predictor variable and (b) scores on the predictor variable
   should form a normal distribution at each value of the criterion variable. Scores that
   represent a bivariate normal distribution form an elliptical scattergram when plotted (i.e.,
   their scattergram is shaped like a football: relatively fat in the middle and tapered on the
   ends).

Interpreting the Sign and Size of a Correlation Coefficient
Overview
As was stated earlier, a correlation coefficient is a number that represents the nature of the
relationship between two variables. To understand the nature of this relationship, you will
review the sign of the coefficient (whether it is positive or negative), as well as the size of
the coefficient (whether it is relatively close to zero or close to ±1.00). This section shows
you how.
Interpreting the Sign of a Correlation Coefficient
Overview. A correlation coefficient may be either positive (+) or negative (-). The sign of
the correlation tells you about the direction of the relationship between the two variables.
Positive correlation. A positive correlation means that high values on one variable tend to
be associated with high values on the other variable, and low values on one variable tend to
be associated with low values on the other variable.
For example, consider the fictitious industrial psychology study described above. In that
study, you assessed two variables:
•  Prosocial organizational behavior. This refers to positive, helpful things that an
   employee might do to help his or her organization. Assume that, according to the way that
   you measured this variable, higher scores represent higher levels of prosocial
   organizational behavior.

•  Perceived organizational fairness. This refers to the extent to which the employee
   believes that the organization has treated him or her in a fair and equitable way. Again,
   assume that, according to the way that you measured this variable, higher scores represent
   higher levels of perceived organizational fairness.

Suppose that you have reviewed the research literature, and it suggests that employees who
score high on perceived organizational fairness are likely to feel grateful to their employing
organizations, and are likely to repay them by engaging in prosocial organizational behavior.
Based on this idea, you conduct a correlational study.
If you measured these two variables in a sample of 300 employees, you would probably find
that there is a positive correlation between perceived organizational fairness and prosocial
organizational behavior. Consistent with the definition provided above, you would probably
find that both of the following are true:

• employees with high scores on perceived organizational fairness would also tend to have high scores on prosocial organizational behavior
• employees with low scores on perceived organizational fairness would also tend to have low scores on prosocial organizational behavior.

In social science research, there are countless additional examples of pairs of variables that
would demonstrate positive correlations. Here are just a few examples:

• In a sample of college students, there would probably be a positive correlation between scores on the Scholastic Aptitude Test (SAT) and subsequent grade point average in college.
• In a sample of contestants in a body-building contest, there would probably be a positive correlation between the number of hours that they spend training and their overall scores as body builders.

Negative correlation. In contrast to a positive correlation, a negative correlation means that high values on one variable tend to be associated with low values on the second variable. To illustrate this concept, consider what kind of variables would probably show a negative correlation with prosocial organizational behavior. For example, imagine that you develop a multi-item scale designed to measure burnout among employees. For our purposes, burnout refers to the extent to which employees feel exhausted, stressed, and unable to cope on the job. Assume that, according to the way that you measured this variable, higher scores represent higher levels of burnout.

Suppose that you have reviewed the research literature, and it suggests that employees who
score high on burnout are probably too exhausted to engage in any prosocial organizational
behavior. Based on this idea, you conduct a correlational study.
If you measured burnout and prosocial organizational behavior in a sample of employees,
you would probably find that there is a negative correlation between the two variables.
Consistent with the definition provided above, you would probably find that both of the
following are true:

• employees with high scores on burnout would tend to have low scores on prosocial organizational behavior
• employees with low scores on burnout would tend to have high scores on prosocial organizational behavior.

In social science research, there are also countless examples of pairs of variables that would
demonstrate negative correlations. Here are just a few examples:

• In a sample of college students, there would probably be a negative correlation between the number of hours they spent at parties each week, and their subsequent grade point average in college.
• In a sample of contestants in a body-building contest, there would probably be a negative correlation between the amount of junk food that they eat and their subsequent overall scores as body builders.

Interpreting the Size of a Correlation Coefficient
Overview. You interpret the size of a correlation coefficient to determine the strength of the
relationship between the two variables. Generally speaking, the larger the size of the
coefficient (in absolute value), the stronger the relationship. Absolute value refers to how
large the correlation coefficient is, regardless of its sign.
When there is a strong relationship between two variables, you are able to predict values on one variable from values on the second variable with a relatively high degree of accuracy. When there is a weak relationship between two variables, you are able to predict values on one variable from values on the second variable with a relatively low degree of accuracy.
A guide. Below is an informal guide for interpreting the approximate strength of the
relationship between two variables, based on the absolute value of the coefficient:
   1.00  =  Perfect correlation
    .80  =  Strong correlation
    .50  =  Moderate correlation
    .20  =  Weak correlation
    .00  =  No correlation
For example, the above guide suggests that you should view a correlation as being relatively strong if the correlation coefficient were +.80 (or –.80). Similarly, it suggests that you should view a correlation as being relatively weak if the correlation coefficient were +.20 (or –.20).
Again, remember to consider the absolute value of the coefficient when you interpret the size of the correlation. This means that a correlation of –.50 is just as strong as a correlation of +.50; a correlation of –.75 is just as strong as a correlation of +.75, and so forth.
The above guide shows that the possible values of correlation coefficients range from –1.00 through zero through +1.00. This means that you will never obtain a Pearson product-moment correlation below –1.00, or above +1.00.
Perfect correlation. A correlation of ±1.00 is a perfect correlation. When the correlation between two variables is ±1.00, it means that you can predict values on one variable from values on the second variable with no errors. For all practical purposes, the only time you will obtain a perfect correlation is when you correlate a variable with itself.
Zero correlation. A correlation of .00 means that there is no relationship between the two
variables being studied. This means that, if you know how a subject is rated on one variable,
it does not allow you to predict how that subject is rated on the second variable with any
accuracy.
The Coefficient of Determination
The coefficient of determination refers to the proportion of variance in one variable that is
accounted for by variability in the second variable. This issue of proportion of variance
accounted for is an important one in statistics; in the chapters that follow, you will learn
some techniques for calculating the percentage of variance in a criterion variable that is
accounted for by a predictor variable.
The coefficient of determination is relatively simple to compute if you have calculated a Pearson correlation coefficient. The formula is as follows:

   Coefficient of determination = r²

In other words, to compute the coefficient of determination, you simply square the correlation coefficient. For example, suppose that you find that the correlation between two variables is equal to .50. In this case:

   Coefficient of determination = r²
   Coefficient of determination = (.50)²
   Coefficient of determination = .25

So, when the Pearson correlation is equal to .50, the coefficient of determination is equal to .25. This means that 25% of the variability in the criterion variable is associated with variability in the predictor variable.
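If you would like SAS to perform this arithmetic for you, a short DATA step will do it. The following is a minimal sketch (the variable names R and R_SQUARED are arbitrary); it simply squares a correlation coefficient that you have already obtained from PROC CORR:

DATA _NULL_;
   R = .50;             * correlation coefficient taken from the PROC CORR output;
   R_SQUARED = R**2;    * coefficient of determination;
   PUT R_SQUARED=;      * writes R_SQUARED=0.25 to the SAS log;
RUN;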

Interpreting the Statistical Significance of a Correlation Coefficient
Overview
When researchers report the results of correlational research, they typically indicate whether
the correlation coefficients that they have computed are statistically significant. When
researchers report that a correlation coefficient is statistically significant, they typically
mean that the coefficient is significantly different from zero.
To understand the concept of statistical significance, it is necessary to first understand the
concepts of the null hypothesis, the alternative hypothesis, and the p value, as they apply to
correlational research. Each of these concepts is discussed in the following sections.
The Null Hypothesis for the Test of Significance
A null hypothesis is a statistical hypothesis about a population or about the relationship
between two or more different populations. A null hypothesis typically states either that (a)
there is no relationship between the variables being studied, or that (b) there is no difference
between the populations being studied. In other words, the null hypothesis is typically a hypothesis of "no relationship" or "no difference."
When you are conducting correlational research and are investigating the correlation between
two variables, your null hypothesis will typically state that, in the population, there is no
correlation between these two variables. For example, again suppose that you are studying the
relationship between perceived organizational fairness and prosocial organizational behavior
in a sample of employees. Your null hypothesis for this analysis might be stated as follows:
   Statistical null hypothesis (H0): ρ = 0; In the population, the correlation between perceived organizational fairness and prosocial organizational behavior is equal to zero.
In the preceding statement, the symbol H0 is the symbol for the null hypothesis. The symbol ρ (rho) is the Greek letter that represents the correlation between two variables in the population. When the above null hypothesis states "ρ = 0," it is essentially stating that the correlation between these two variables is equal to zero in the population.
The Alternative Hypothesis for the Test of Significance
Like the null hypothesis, the alternative hypothesis is a statistical hypothesis about a
population or about the relationship between two or more different populations. In contrast
to the null hypothesis, however, the alternative hypothesis typically states either that (a)
there is a relationship between the variables being studied, or that (b) there is a difference
between the populations.

For example, again consider the fictitious study investigating the relationship between
perceived organizational fairness and prosocial organizational behavior. The alternative
hypothesis for this study would state that there is a relationship between these two variables
in the population. In formal terms, it could be stated this way:
Statistical alternative hypothesis (H1): 0; In the population, the correlation
between perceived organizational fairness and prosocial organizational behavior is not
equal to zero.
Notice that the above alternative hypothesis was stated as a nondirectional hypothesis. It
does not predict whether the actual correlation in the population is positive or negative; it
simply predicts that it will not be equal to zero.
The p Value
Overview. When you use SAS to compute a Pearson correlation coefficient, it automatically
provides a p value for that coefficient. This p value may range in size from .00 through 1.00.
You will review this p value to determine whether the coefficient is statistically significant.
This section shows how to interpret these p values.
What a p value represents. In general terms, a probability value (or p value) is the
probability that you would obtain the present results if the null hypothesis were true. The
exact meaning of a p value depends upon the type of analysis that you are performing. When
you compute a correlation coefficient, the p value represents the probability that you would
obtain a correlation coefficient this large or larger in absolute magnitude if the null
hypothesis were true.
Remember that the null hypothesis that you are testing states that, in the population, the
correlation between the two variables is equal to zero. Suppose that you perform your
analysis, and you find that, for your sample, the obtained correlation coefficient is .15
(symbolically, r = .15). This is a relatively weak correlation; it is fairly close to zero.
Assume that the p value associated with this correlation coefficient is .89 (symbolically,
p =.89). Essentially, this p value is making the following statement:

If the null hypothesis were true, the probability that you would obtain a correlation
coefficient of .15 is fairly high at 89%.

Clearly, 89% is a high probability. Under these circumstances, it seems reasonable to retain
your null hypothesis. You will conclude that, in the population, the correlation between
these two variables probably is equal to zero. You will conclude that your obtained
correlation coefficient of r =.15 is not statistically significant. In other words, you will
conclude that it is not significantly different from zero.
Now consider another fictitious outcome. Suppose that you perform your analysis, and you
find that, for your sample, the obtained correlation coefficient is .70 (symbolically, r = .70).
This is a relatively strong correlation. Suppose that the p value associated with this

correlation coefficient is .01 (symbolically, p = .01). This p value is making the following
statement:

If the null hypothesis were true, the probability that you would obtain a correlation
coefficient of .70 is fairly low at only 1%.

Most researchers would agree that 1% is a fairly low probability. Given this low probability,
it now seems reasonable to reject your null hypothesis. You will conclude that, in the
population, the correlation between these two variables is probably not equal to zero. You
will conclude that your obtained correlation coefficient of r =.70 is statistically significant.
In other words, you will conclude that it is significantly different from zero.
Deciding whether to reject the null hypothesis. With the above examples, you saw that, when the p value is a relatively large value (such as p = .89) you should not reject the null hypothesis, and that when the p value is a relatively small value (such as p = .01) you should reject the null hypothesis. This naturally leads to the question, "Just how small must the p value be to reject the null hypothesis?" The answer to this question will depend on a number of factors, such as the nature of your research and the importance of not erroneously rejecting a true null hypothesis. To keep things simple, however, this book will adopt the following guidelines:

• If the p value is less than .05, you should reject the null hypothesis.
• If the p value is .05 or larger, you should not reject the null hypothesis.

This means that if your p value is less than .05 (such as p = .0400, p = .0121, or p < .0001),
you should reject the null hypothesis, and should conclude that your obtained correlation
coefficient is statistically significant. Conversely, if your p value is .05 or larger (such as
p = .0510, p = .5456, or p = .9674), you should not reject the null hypothesis, and should
conclude that your obtained correlation is statistically nonsignificant.
This book will use this guideline throughout all of the remaining chapters. Most of the
chapters following this one will report some type of significance test. In each of those
chapters, you will continue to use the same rule that you will reject the null hypothesis if
your p value is less than .05.
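Because this decision rule is purely mechanical, it can even be expressed in a few lines of SAS code. The following sketch (using a made-up p value of .0121) is offered only to make the logic concrete:

DATA _NULL_;
   P = .0121;                                         * p value taken from the SAS output;
   IF P < .05 THEN PUT 'Reject the null hypothesis.';
   ELSE PUT 'Do not reject the null hypothesis.';
RUN;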

Problems with Using Correlations to Investigate Causal Relationships
Overview
When you compute a Pearson correlation, you can review the correlation coefficient to learn
about the nature of the relationship between the two variables (e.g., whether the relationship
is positive or negative; whether the relationship is relatively strong or weak). However, a
single Pearson correlation coefficient by itself will not tell you anything about whether there
is a causal relationship between the two variables. This section discusses the concept of cause-and-effect relationships, and cautions against using simple correlational analyses to provide evidence for such relationships.
Correlations and Cause-and-Effect Relationships
Chapter 2, "Terms and Concepts Used in This Guide," discussed some of the differences between experimental research versus nonexperimental (correlational) research. It indicated that correlational research generally provides relatively weak evidence concerning cause-and-effect relationships. This is especially the case when investigating the correlation between just two variables (as is the case in this chapter). This means that, if you observe a strong, significant relationship between two variables, it should not be taken as evidence that one of the variables is exerting a causal effect on the other.
An Initial Explanation
For example, suppose that you began your investigation with a hypothesis of cause and
effect. Suppose that you hypothesized that perceived organizational fairness has a causal
effect on prosocial organizational behavior: If you can increase employee perceptions of
fairness, this will cause the employees to display an increase in prosocial organizational
behavior. This cause-and-effect relationship is illustrated in Figure 10.1.

[Figure 10.1. Hypothesized relationship between prosocial organizational behavior and perceived organizational fairness.]

Suppose that you conduct your study, and find that there is a relatively large, positive, and
significant correlation between the fairness variable and the prosocial behavior variable.
Your first impulse might be to rejoice that you have obtained "proof" that fairness has a causal effect on prosocial behavior. Unfortunately, few researchers will find your evidence
convincing.
Alternative Explanations
The reason has to do with the concept of alternative explanations. You have found a
significant correlation, and have offered your own explanation for what it means: It means
that fairness has a causal effect on prosocial behavior. However, it will be easy for other
researchers to offer very different explanations that may be equally plausible in explaining
why you obtained a significant correlation. And if others can generate alternative

explanations, few researchers are going to be convinced that your explanation must be
correct.
For example, suppose that you have obtained a strong, positive correlation, and have
presented your results at a professional conference. You tell your audience that these results "prove" that perceived fairness has an effect on prosocial behavior, as illustrated in
Figure 10.1.
At the end of your presentation, a member of the audience may rise and ask if it is not
possible that there is a different explanation for your correlation. She may suggest that the
two variables are correlated because of the influence of an underlying third variable. One
such possible underlying variable is illustrated in Figure 10.2.

[Figure 10.2. An alternative explanation for the observed correlation between prosocial organizational behavior and perceived organizational fairness.]

Figure 10.2 suggests that fairness and prosocial behavior may be correlated because they are
both influenced by the same underlying third variable: the personality trait of optimism.
The researcher in your audience may argue that it is reasonable to assume that optimism has
a causal effect on prosocial behavior. She could argue that, if people are generally optimistic, they will be more likely to believe that showing prosocial behaviors will result in
rewards from the organization (such as promotions). This argument supports the causal
arrow that goes from optimism to prosocial behavior in Figure 10.2.
The researcher in your audience could go on to argue that it is reasonable to assume that
optimism also has a causal effect on perceived fairness. She could argue that optimistic
people tend to look for the good in all situations, including their work situations. This could
cause optimistic people to describe their organizations as treating them more fairly
(compared to pessimistic people). This argument supports the causal arrow that goes from
optimism to perceived fairness in Figure 10.2.
In short, the researcher in your audience could argue that there is no causal relationship between prosocial behavior and fairness at all; the only reason they are correlated is that they are both influenced by the same underlying third variable: optimism.

This is why correlational research provides relatively weak evidence of cause-and-effect relationships: It is often possible to generate more than one explanation for an observed correlation between two variables.
Obtaining Stronger Evidence of Cause and Effect
Researchers who wish to obtain stronger evidence of cause-and-effect relationships typically
rely on one (or both) of two approaches. The first approach is to conduct experimental
research. Chapter 2 of this guide argued that, when you conduct a true experiment, it is
sometimes possible to control all important extraneous variables. This means that, when you
conduct a true experiment and obtain a significant effect for your independent variable, it
provides more convincing evidence that it was truly your independent variable that had an
effect on the dependent variable, and not an underlying "third variable." In other words, with
a well-designed experiment, it is more difficult for other researchers to generate plausible
alternative explanations for your results.
Another alternative for researchers who wish to obtain stronger evidence of cause-and-effect
relationships is to use correlational data, but analyze them with statistical procedures that are
much more sophisticated than the procedures discussed in this text. These sophisticated correlational procedures go by names such as path analysis, causal modeling, and structural equation modeling. Hatcher (1994) provides an introduction to some of these procedures.
Is Correlational Research Ever Appropriate?
None of this is meant to discourage you from conducting correlational research. There are
many situations in which correlational research is perfectly appropriate. These include:

• Situations in which it would be unethical or impossible to conduct an experiment. There are many situations in which it would be unethical or impossible to conduct a true experiment. For example, suppose that you want to determine whether physically abusing children will cause those individuals to become child abusers themselves when they grow up. Obviously, no one would wish to conduct a true experiment in which half of the child subjects are assigned to an "abused" condition, and the other half are assigned to a "non-abused" condition. Instead, you might conduct a correlational study in which you simply determine whether the way people were previously treated by their parents is correlated with the way that they now treat their own children.

• As an early step in a research program that will eventually include experiments. In many situations, experiments are more expensive and time-consuming to conduct, relative to correlational research. Therefore, when researchers believe that two variables may be causally related, they sometimes begin their program of research by conducting a simple nonexperimental study to see if the two variables are, in fact, correlated. If yes, the researcher may then be sufficiently encouraged to proceed to more ambitious controlled studies such as experiments.

Also, remember that in many situations researchers are not even interested in testing cause-and-effect relationships. In many situations they simply wish to determine whether two variables are correlated. For example, a researcher might simply wish to know whether high scores on Test X tend to be associated with high scores on Test Y.
And that is the approach that will be followed in this chapter. In general, it will discuss
studies in which researchers simply wish to determine whether one variable is correlated
with another. To the extent possible, it will avoid using language that implies possible
cause-and-effect relationships between the variables.

Example 10.1: Correlating Weight Loss with a Variety of Predictor Variables
Overview
Most of this chapter focuses on analyzing data from a fictitious study that produces data that
can be analyzed with the Pearson correlation coefficient. In this study, you will investigate
the correlation between weight loss and a number of variables that might be correlated with
weight loss. This section describes these variables, and shows you how to prepare the
DATA step.
The Study
Hypotheses. Suppose that you are conducting a study designed to identify the variables that
are predictive of weight loss in men. You want to test the following research hypotheses:

• Hypothesis 1: Weight loss will be positively correlated with motivation: Men who are highly motivated to lose weight will tend to lose more weight than those who are less motivated.
• Hypothesis 2: Weight loss will be positively correlated with time spent exercising: Men who exercise many hours each week will tend to lose more weight than those who exercise fewer hours each week.
• Hypothesis 3: Weight loss will be negatively correlated with calorie consumption: Men who consume many calories each day will tend to lose less weight than those who consume fewer calories each day.
• Hypothesis 4: Weight loss will be positively correlated with intelligence: Men who are highly intelligent will tend to lose more weight than those who are less intelligent.

Research method. To test these hypotheses, you conduct a correlational study with a group of 22 men over a 10-week period. At the beginning of the study, you administer a 5-item scale that is designed to assess each subject's motivation to lose weight. The scale consists of statements such as "It is very important to me to lose weight." Subjects respond to each item using a 7-point response format in which 1 = "Disagree Very Strongly" and 7 = "Agree Very Strongly." You sum their responses to the five items to create a single motivation score for each subject. Scores on this measure may range from 5 to 35, with higher scores representing greater motivation to lose weight. You will correlate this motivation scale with subsequent weight loss to test Hypothesis 1 (from above).
Throughout the 10-week study, you ask each subject to record the number of hours that he
exercises each week. At the end of the study, you determine the average number of hours
spent exercising for each subject, and correlate this number of hours spent exercising with
subsequent weight loss to test Hypothesis 2.
Throughout the study, you also ask each subject to keep a log of the number of calories that
he consumes each day. At the end of the study, you compute the average number of calories
consumed by each subject. You will correlate this measure of daily calorie intake with
subsequent weight loss. You use this correlation to test Hypothesis 3.
At the beginning of the study you also administer the Wechsler Adult Intelligence Scale (WAIS) to each subject. The combined IQ score from this instrument will serve as the measure of intelligence in your study. You will correlate IQ with subsequent weight loss to test Hypothesis 4.
Throughout the 10-week study, you weigh each subject and record his body weight in kilograms (1 kilogram is equal to approximately 2.2 pounds). When the study is completed, you subtract each subject's body weight at the end of the study from his weight at the beginning of the study. You use the resulting difference as your measure of weight loss.
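As a concrete illustration of this computation, the following DATA step sketch shows one way to create the weight-loss variable from two weight measurements. The data set and variable names (D0, D2, WT_START, WT_END) are hypothetical; they do not appear in this chapter's actual program, which reads the kilograms-lost scores directly:

DATA D2;
   SET D0;                        * D0 is assumed to contain the two weights;
   KG_LOST = WT_START - WT_END;   * beginning weight minus ending weight;
                                  * positive values indicate weight lost;
RUN;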
The Criterion Variable and Predictor Variables in the Analysis
This study involves one criterion variable and four predictor variables. The criterion variable
is weight loss measured in kilograms (kgs). In the analysis, you will give this variable the
SAS variable name KG_LOST for "kilograms lost."
The first predictor variable is motivation to lose weight, as measured by the questionnaire
described earlier in this example. In the analysis, you will use the SAS variable name
MOTIVAT to represent this variable.
The second predictor variable is the average number of hours the subjects spent exercising
each week during the study. In the analysis, you will use the SAS variable name EXERCISE
to represent this variable.
The third predictor variable is the average number of calories consumed during each day of
the study. In the analysis, you will use the SAS variable name CALORIES to represent this
variable.
The final predictor variable is intelligence, as measured by the WAIS. In the analysis, you
will use the SAS variable name IQ to represent this variable.

Data Set to Be Analyzed
Table 10.1 presents fictitious scores for each subject on each of the variables to be analyzed in this study.
Table 10.1
Variables Analyzed in the Weight Loss Study
___________________________________________________________________
              Kilograms                Hours        Calories
Subject       lost       Motivation   exercising   consumed    IQ
___________________________________________________________________
01. John       2.60          5            0           2400     100
02. George     1.00          5            0           2000     120
03. Fred       1.80         10            2           1600     130
04. Charles    2.65         10            5           2400     140
05. Paul       3.70         10            4           2000     130
06. Jack       2.25         15            4           2000     110
07. Emmett     3.00         15            2           2200     110
08. Don        4.40         15            3           1400     120
09. Edward     5.35         15            2           2000     110
10. Rick       3.25         20            1           1600      90
11. Ron        4.35         20            5           1800     150
12. Dale       5.60         20            3           2200     120
13. Bernard    6.44         20            6           1200      90
14. Walter     4.80         25            1           1600     140
15. Doug       5.75         25            4           1800     130
16. Scott      6.90         25            5           1400     140
17. Sam        7.75         25            .           1400     100
18. Barry      5.90         30            4           1600     100
19. Bob        7.20         30            5           2000     150
20. Randall    8.20         30            2           1200     110
21. Ray        7.80         35            4           1600     130
22. Tom        9.00         35            6           1600     120
___________________________________________________________________

Table 10.1 provides scores for 22 male subjects. The first subject appearing in the table is named John. Table 10.1 shows the following values for John on the study's variables:
• He lost 2.60 kgs of weight by the end of the study.
• His score on the motivation to lose weight scale was 5 (out of a possible 35).
• His score on Hours Exercising was 0, meaning that he exercised zero hours per week on the average.
• His score on calories was 2400, meaning that he consumed 2400 calories each day, on the average.
• His IQ was 100 (with the WAIS, the mean IQ is 100 and the standard deviation is about 15 in the population).
Scores for the remaining subjects can be interpreted in the same way.

The DATA Step for the SAS Program
Below is the DATA step for the SAS program that will read the data presented in Table 10.1:
 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT   SUB_NUM
 4              KG_LOST
 5              MOTIVAT
 6              EXERCISE
 7              CALORIES
 8              IQ;
 9   DATALINES;
10   01 2.60  5 0 2400 100
11   02 1.00  5 0 2000 120
12   03 1.80 10 2 1600 130
13   04 2.65 10 5 2400 140
14   05 3.70 10 4 2000 130
15   06 2.25 15 4 2000 110
16   07 3.00 15 2 2200 110
17   08 4.40 15 3 1400 120
18   09 5.35 15 2 2000 110
19   10 3.25 20 1 1600  90
20   11 4.35 20 5 1800 150
21   12 5.60 20 3 2200 120
22   13 6.44 20 6 1200  90
23   14 4.80 25 1 1600 140
24   15 5.75 25 4 1800 130
25   16 6.90 25 5 1400 140
26   17 7.75 25 . 1400 100
27   18 5.90 30 4 1600 100
28   19 7.20 30 5 2000 150
29   20 8.20 30 2 1200 110
30   21 7.80 35 4 1600 130
31   22 9.00 35 6 1600 120
32   ;

Some notes about the preceding program:
• Line 1 of the preceding program contains the OPTIONS statement which, in this case, specifies the size of the printed page of output. One entry in the OPTIONS statement is PS=60, which is an abbreviation for PAGESIZE=60. This key word requests that each page of output have up to 60 lines of text on it. Depending on the font that you are using (and other factors), requesting PS=60 may cause the bottom of your scatterplot to be cut off when it is printed. If this happens, you should change the OPTIONS statement so that it requests just 50 lines of text per page. You will do this by including PS=50 in your OPTIONS statement, rather than PS=60. Your complete OPTIONS statement should appear as follows:

   OPTIONS LS=80 PS=50;

• You can see that lines 3–8 of the preceding program provide the INPUT statement. There, the SAS variable name SUB_NUM is used to represent subject number, KG_LOST is used to represent kilograms lost, MOTIVAT is used to represent motivation to lose weight, and so forth.
• The data themselves appear in lines 10–31. The data on these lines are identical to the data appearing in Table 10.1, except that the names of the subjects have been removed.

Using PROC PLOT to Create a Scattergram
Overview
A scattergram is a type of graph that is useful when you are plotting one multi-value
variable against a second multi-value variable. This section explains why it is always
necessary to plot your variables in a scattergram prior to computing a Pearson correlation
coefficient. It then shows how to use PROC PLOT to create a scattergram, and how to
interpret the output created by PROC PLOT.
Why You Should Create a Scattergram Prior to Computing a
Correlation Coefficient
What is a scattergram? A scattergram (also called a scatterplot) is a graph that plots the
individual data points from a correlational study. It is particularly useful when both of the
variables being analyzed are multi-value variables. Each data point in a scattergram
represents a single observation (typically a human subject). The data point indicates where
the subject stands on both the predictor variable (the X variable) and the criterion variable
(the Y variable).
You should always create a scattergram for a given pair of variables prior to computing the
correlation between those variables. This is because the Pearson correlation coefficient is
appropriate only if the relationship between the variables is linear; it is not appropriate when
the relationship is nonlinear.
Linear versus nonlinear relationships. There is a linear relationship between two
variables when their scattergram follows the form of a straight line. This means that, in the
population, the mean criterion scores at each value of the predictor variable should fall on a
straight line. When there is a linear relationship between X and Y, it is possible to draw a
straight line through the center of the scattergram.
In contrast, there is a nonlinear relationship between two variables if their scattergram
does not follow the form of a straight line. For example, imagine that you have constructed
a test of creativity, and have administered it to a large sample of college students. With this
test, higher scores reflect higher levels of creativity. Imagine further that you obtain the
Learning Aptitude Test (LAT) verbal test scores for these students, plot their LAT scores
against their creativity scores, creating a scattergram. With this scattergram, LAT scores are
plotted on the horizontal axis, and creativity scores are plotted on the vertical axis.
Suppose that this scattergram shows that (a) students with low LAT scores tend to have low
creativity scores, (b) students with moderate LAT scores tend to have high creativity scores,
and (c) students with high LAT scores tend to have low creativity scores. Such a
scattergram would take the form of an upside-down "U." It would not be possible to draw a good-fitting straight line through the data points of this scattergram, and this is why we
would say that there is a nonlinear (or perhaps a curvilinear) relationship between LAT
scores and creativity scores.
Problems with nonlinear relationships. When you use the Pearson correlation coefficient
to assess the relationship between two variables involved in a nonlinear relationship, the
resulting correlation coefficient usually underestimates the actual strength of the relationship
between the two variables. For example, computing the Pearson correlation between the
LAT scores and creativity scores (from the preceding example) might result in a correlation
coefficient of .10, which would indicate only a very weak relationship between the two
variables. And yet there might actually be a fairly strong relationship between LAT scores
and creativity: It may be possible to predict someone's creativity with great accuracy if you
know where they stand on the LAT. Unfortunately, you would never know this if you did
not first create the scattergram.
The implication of all this is that you should always create a scattergram to verify that there
is a linear relationship between two variables before computing a Pearson correlation for
those variables. Fortunately, this is very easy to do using the SAS PLOT procedure.
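If you would like to see what such a nonlinear scattergram looks like, you can plot simulated data. The following sketch is purely illustrative: the data are generated by the program rather than taken from any study, and the data set and variable names (CURVE, LAT, CREATIVE) are hypothetical:

DATA CURVE;
   DO LAT = 200 TO 800 BY 25;
      * creativity peaks at LAT = 500, with a little random noise added;
      CREATIVE = 100 - ((LAT - 500) / 40)**2 + 5 * RANNOR(12345);
      OUTPUT;                      * write one observation per LAT value;
   END;
RUN;
PROC PLOT DATA=CURVE;
   PLOT CREATIVE*LAT;              * should display an upside-down U shape;
RUN;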
Syntax for the SAS Program
Here is the syntax for requesting a scattergram with the PLOT procedure:
PROC PLOT   DATA=data-set-name;
   PLOT     criterion-variable*predictor-variable;
TITLE1 'your-name';
RUN;

The variable listed as the criterion-variable in the preceding program will be plotted on the
vertical axis (the Y axis), and the predictor-variable will be plotted on the horizontal axis
(the X axis).

Suppose that you wish to compute the correlation between KG_LOST (kilograms lost) and
MOTIVAT (the motivation to lose weight). Prior to computing the correlation, you would
use PROC PLOT to create a scattergram plotting KG_LOST against MOTIVAT. The SAS
statements that would create this scattergram are:
     [First part of the DATA step appears here]
28   20 8.20 30 2 1200 110
29   21 7.80 35 4 1600 130
30   22 9.00 35 6 1600 120
31   ;
32   PROC PLOT DATA=D1;
33      PLOT KG_LOST*MOTIVAT;
34   TITLE1 'JANE DOE';
35   RUN;

Some notes about the preceding program:
• To conserve space, the preceding shows only the last few data lines from the DATA step on lines 28–30. This DATA step was presented in full in the preceding section titled "The DATA Step for the SAS Program."
• The PROC PLOT statement appears on line 32. The DATA option for this statement requests that the analysis be performed on the data set named D1.
• The PLOT statement appears on line 33. It requests that KG_LOST serve as the criterion variable (Y variable) in the plot, and that MOTIVAT serve as the predictor variable (X variable). This means that KG_LOST will appear on the vertical axis, and MOTIVAT will appear on the horizontal axis.
• Lines 34–35 present the TITLE1 and RUN statements for the program.

Results from the SAS Output
Output 10.1 presents the scattergram that was created by the preceding program.

[Output 10.1 appears here: a PROC PLOT scattergram titled "JANE DOE," with the heading "Plot of KG_LOST*MOTIVAT. Legend: A = 1 obs, B = 2 obs, etc." KG_LOST is plotted on the vertical axis (values 1 through 9) and MOTIVAT on the horizontal axis (values 5 through 35), and each letter "A" marks the pair of scores for one subject.]

Output 10.1. Scattergram plotting kilograms lost against the motivation to lose weight.

Understanding the scattergram. Notice that, in this output, the criterion variable (KG_LOST) is plotted on the vertical axis, while the predictor variable (MOTIVAT) is plotted on the horizontal axis. Each letter in a scattergram represents one or more individual subjects. For example, consider the letter "A" that appears in the top right corner of the scattergram. This letter is located directly above a score of "35" on the MOTIVAT axis (the horizontal axis). It is also located directly to the right of a score of "9" on the KG_LOST axis. This means that this letter represents a person who had a score of 35 on MOTIVAT and a score of 9 on KG_LOST.
In contrast, now consider the letter "A" that appears in the lower left corner of the scattergram. This letter is located directly above a score of "5" on the MOTIVAT axis (the horizontal axis). It is also located directly to the right of a score of "1" on the KG_LOST axis. This means that this letter represents a person who had a score of 5 on MOTIVAT and a score of 1 on KG_LOST. Each of the remaining letters in the scattergram can be interpreted in the same fashion.
The legend at the top of the output says, "Legend: A = 1 obs, B = 2 obs, etc." This means that a particular letter in the graph may represent one or more observations (human subjects, in this case). If you see the letter "A", it means that a single person is located at that point (i.e., a single subject had that particular combination of scores on KG_LOST and MOTIVAT). If you see the letter "B", it means that two people are located at that point, and so forth. You can see that only the letter "A" appears in this output, meaning that there was no point in the scattergram where more than one person scored.
Drawing a straight line through the scattergram. The shape of the scattergram in
Output 10.1 shows that there is a linear relationship between KG_LOST and MOTIVAT.
This can be seen from the fact that it would be possible to draw a good-fitting straight line
through the center of the scattergram. To illustrate this, Output 10.2 presents the same
scattergram, this time with a straight line drawn through its center.

[Output 10.2 appears here: the same scattergram of KG_LOST*MOTIVAT as in Output 10.1, this time with a straight line drawn through the center of the band of data points.]

Output 10.2. Graph plotting kilograms lost against the motivation to lose weight, with straight line drawn through the center of the scattergram.

Strength of the relationship. The general shape of the scattergram also suggests that there
is a fairly strong relationship between the two variables: Knowing where a subject stands on
the MOTIVAT variable enables us to predict, with some accuracy, where that subject will
stand on the KG_LOST variable. Later, we will compute the correlation coefficient for
these two variables to see just how strong the relationship is.
Positive versus negative relationships. Output 10.2 shows that the relationship between
MOTIVAT and KG_LOST is positive: Large values on MOTIVAT are associated with
large values on KG_LOST, and small values on MOTIVAT are associated with small values
on KG_LOST. This makes intuitive sense: You would expect that the subjects who are
highly motivated to lose weight would in fact be the subjects who would lose the most weight. When there is a positive relationship between two variables, the scattergram will
stretch from the lower left corner to the upper right corner of the graph (as is the case in
Output 10.2).
In contrast, when there is a negative relationship between two variables, the scattergram will
be distributed from the upper left corner to the lower right corner of the graph. A negative
relationship means that small values on the predictor variable are associated with large
values of the criterion variable, and large values on the predictor variable are associated with
small values on the criterion variable.
Because the relationship between MOTIVAT and KG_LOST is linear, it is reasonable to
proceed with the computation of a Pearson correlation for this pair of variables.

Using PROC CORR to Compute the Pearson Correlation between Two Variables
Overview
This chapter illustrates three different ways of using PROC CORR to compute correlation
coefficients. It shows (a) how to compute the correlation between just two variables, (b)
how to compute all possible correlations between a number of variables, and (c) how to use
the VAR and WITH statements to selectively suppress the printing of some correlations.
The present section focuses on the first of these: computing the correlation between just two
variables. This section shows how to manage the PROC step for the analysis, how to
interpret the output produced by PROC CORR, and how to prepare a report that summarizes
the results.
Syntax for the SAS Program
In some instances, you may wish to compute the correlation between just two variables.
Here is the syntax for the statements that will accomplish this:
PROC CORR   DATA=data-set-name   options;
   VAR   variable1   variable2;
TITLE1 'your-name';
RUN;
In the PROC CORR statement, you specify the name of the data set to be analyzed and
request any options for the analysis. A section toward the end of this chapter will discuss
some of the options available with PROC CORR.

You use the VAR statement to list the names of the two variables to be correlated. (The
choice of which variable is variable1 and which is variable2 is arbitrary.) For example,
suppose that you want to compute the correlation between the number of kilograms lost
(KG_LOST) and the motivation to lose weight (MOTIVAT). Here are the required
statements:
     [First part of the DATA step appears here]
28   20 8.20 30 2 1200 110
29   21 7.80 35 4 1600 130
30   22 9.00 35 6 1600 120
31   ;
32   PROC CORR   DATA=D1;
33      VAR KG_LOST MOTIVAT;
34   TITLE1 'JANE DOE';
35   RUN;

Some notes concerning the preceding statements:
• To conserve space, the preceding code block shows only the last few data lines from the DATA step on lines 28–30. This DATA step was presented in full in the preceding section titled "The DATA Step for the SAS Program."
• The PROC CORR statement appears on line 32. The DATA option for this statement requests that the analysis be performed on the data set named D1.
• The VAR statement appears on line 33. It requests that SAS compute the correlation between KG_LOST and MOTIVAT.
• Lines 34–35 present the TITLE1 and RUN statements for the program.

Results from the SAS Output
The preceding program results in a single page of output, reproduced here as Output 10.3:

                              JANE DOE                             11

                          The CORR Procedure

              2  Variables:    KG_LOST  MOTIVAT

                           Simple Statistics

 Variable     N       Mean    Std Dev        Sum    Minimum    Maximum
 KG_LOST     22    4.98591    2.27488  109.69000    1.00000    9.00000
 MOTIVAT     22   20.00000    8.99735  440.00000    5.00000   35.00000

              Pearson Correlation Coefficients, N = 22
                     Prob > |r| under H0: Rho=0

                          KG_LOST      MOTIVAT
             KG_LOST      1.00000      0.88524
                                        <.0001
             MOTIVAT      0.88524      1.00000
                           <.0001

Output 10.3. Results of PROC CORR in which kilograms lost is correlated with the motivation to lose weight.

Steps in Interpreting the Output
Make sure that everything looks right. The first part of Output 10.3 presents simple descriptive statistics for the variables being analyzed. This enables you to verify that "everything looks right": that the correct number of cases were analyzed, that no variables were out of bounds, and so on.
The names of the variables appear below the "Variable" heading, and the statistics for the variables appear to the right of the variable names.
The column headed "N" shows that 22 subjects provided usable data for the KG_LOST variable.
The column headed "Mean" shows that the mean for KG_LOST was 4.99 kilograms.
The column headed "Std Dev" shows that the standard deviation was 2.27.
It is always important to review the "Minimum" and "Maximum" columns to verify that no impossible scores appear in the data.
In Output 10.3, the "Minimum" column shows that the lowest observed score on kilograms lost was 1.0 kilograms.
The "Maximum" column shows that the highest observed score was 9.0 kilograms.

These scores seem reasonable: When converted to pounds, these figures indicate that the
least successful subject lost approximately 2.2 pounds, and the most successful subject lost
approximately 19.8 pounds. Because these figures seem reasonable, they provide no
obvious evidence that you have entered an impossible score when you typed the data
(again, these proofing procedures do not guarantee that no errors were made in typing data,
but they are useful for identifying some types of errors).
The descriptive statistics for the motivation variable (to the right of MOTIVAT) also seem
reasonable, given the nature of the scale used. Since the descriptive statistics provide no
obvious evidence of typing or programming mistakes, you can review the correlations more
confidently.
Interpreting a matrix of correlation coefficients. So that it will be easy to review, Output
10.3 is reproduced again here as Output 10.4.
                              JANE DOE                             11

                          The CORR Procedure

              2  Variables:    KG_LOST  MOTIVAT

                           Simple Statistics

 Variable     N       Mean    Std Dev        Sum    Minimum    Maximum
 KG_LOST     22    4.98591    2.27488  109.69000    1.00000    9.00000
 MOTIVAT     22   20.00000    8.99735  440.00000    5.00000   35.00000

              Pearson Correlation Coefficients, N = 22
                     Prob > |r| under H0: Rho=0

                          KG_LOST      MOTIVAT
             KG_LOST      1.00000      0.88524
                                        <.0001
             MOTIVAT      0.88524      1.00000
                           <.0001

Output 10.4. Correlation coefficients and p values obtained when kilograms lost is correlated with the motivation to lose weight.

The bottom half of Output 10.4 provides the correlations that were requested in the VAR statement. There are actually four correlation coefficients in the output because your statement requested that the system compute every possible correlation between the variables KG_LOST and MOTIVAT. This caused the computer to compute the correlation between KG_LOST and MOTIVAT, between MOTIVAT and KG_LOST, between KG_LOST and KG_LOST, and between MOTIVAT and MOTIVAT.
You can see that the correlation matrix in Output 10.4 consists of two rows, with one row headed "KG_LOST" and the other headed "MOTIVAT." It also contains two columns, also with one headed "KG_LOST" and the other headed "MOTIVAT." The point at which a given row and column intersect is called a cell.
In Output 10.4, find the cell that appears at the intersection of the row headed "KG_LOST" and the column that is also headed "KG_LOST." This cell appears in the upper left corner of the matrix of correlation coefficients in Output 10.4. This cell provides information about the Pearson correlation between KG_LOST and KG_LOST (i.e., the correlation obtained when KG_LOST is correlated with itself). You can see that the correlation between KG_LOST and KG_LOST is equal to 1.00000, or 1.00. A correlation of 1.00 is a perfect correlation. This result makes sense, because you will always obtain a perfect correlation when you correlate a variable with itself. Similarly, in the lower right corner of the matrix (where the row headed "MOTIVAT" intersects with the column headed "MOTIVAT"), you see that the correlation between MOTIVAT and MOTIVAT is also 1.00.
When a particular cell provides information about one variable correlated with a different
variable, that cell will provide at least two pieces of information: (a) the Pearson correlation
between the two variables, and (b) the probability value (p value) associated with that
correlation. The information is arranged according to the following format:
correlation
p value
In other words, the top number in the cell is the correlation coefficient, and the bottom
number is the p value that is associated with that correlation.
For example, find the cell where the column headed "KG_LOST" intersects with the row headed "MOTIVAT." The top number in this cell is .88524, which rounds to .89. This means that the Pearson correlation between KG_LOST and MOTIVAT is .89. The earlier section titled "Interpreting the Size of a Correlation Coefficient" showed you how to interpret the size of a correlation coefficient. There, you learned that correlations that are larger than .80 (in absolute magnitude) are generally considered to represent fairly strong relationships. This means that there is a fairly strong relationship between kilograms lost and motivation (in this fictitious example, at least).
Below this correlation coefficient, you find the following entry: <.0001. This is the probability value, or p value, associated with this correlation coefficient. A section earlier in this chapter titled "The p Value" indicated that, if a p value for a correlation coefficient is less than .05, you should reject the null hypothesis, and should conclude that the correlation is significantly different from zero. The present p value of <.0001 is clearly less than .05. Therefore, you will reject the null hypothesis and conclude that the correlation between kilograms lost and motivation is statistically significant.
Interpreting the sign of the coefficient. An earlier section indicated that a correlation coefficient can be either positive (+) or negative (–). You also learned that, with a positive relationship, large values on one variable tend to be associated with large values on the second variable, and small values on one variable tend to be associated with small values on the second variable. You further learned that, with a negative relationship, large values on one variable tend to be associated with small values on the second variable.
In Output 10.4, the correlation between KG_LOST and MOTIVAT was .88524, which means that there was a positive correlation between these two variables. Although the positive symbol (+) does not appear in front of the number .88524, it is common convention to assume that a correlation coefficient is positive unless the negative sign (–) appears in front of it.
Because the correlation between KG_LOST and MOTIVAT is positive, you know that large
values on MOTIVAT are associated with large values on KG_LOST, and that small values
on MOTIVAT are associated with small values on KG_LOST. This is consistent with the
relationship that you observed in the scattergram of Output 10.2.
Determine the sample size. The size of the sample that is used in computing a correlation coefficient may appear in one of two places on the output page. First, if all correlations in the analysis are based on the same number of subjects, the sample size appears only once on the page, in the line above the matrix of correlations. This line with the sample size appears just below the descriptive statistics. In Output 10.4 the line takes the following form:

   Pearson Correlation Coefficients, N = 22
          Prob > |r| under H0: Rho=0

The "N =" portion of this output indicates the sample size. In this case, you can see that the sample size was 22.
On the other hand, if you compute every possible correlation for a group of variables, and if
some correlations are based on sample sizes that are different from the sample sizes used to
compute other correlations, then these sample sizes will appear in a different location.
Specifically, the sample size for a given correlation will appear in the cell that presents that
correlation. In this situation, the information in each cell is presented according to the
following format:
correlation
p value
N
This format shows that the sample size (N) appears just below the correlation coefficient and probability value for that coefficient. These issues are discussed in greater detail in the section titled "Using PROC CORR to Compute All Possible Correlations for a Group of Variables."
Summarizing the Results of the Analysis
The present chapter and the chapters that follow show you how to perform null hypothesis
significance tests. These chapters will show you how to prepare brief analysis reports:
reports in which you state the null hypothesis being tested, describe the results that you
obtained, and summarize your conclusions regarding the null hypothesis.
The exact way that you prepare an analysis report will vary depending on the statistic that
you used to analyze your data.

As an illustration, the following report summarizes the results of the analysis in which you computed the correlation between the number of kilograms lost and the motivation to lose weight.

A) Statement of the research question: The purpose of this study was to determine whether the motivation to lose weight is correlated with the amount of weight actually lost over a 10-week period.
B) Statement of the research hypothesis: There will be a positive relationship between motivation to lose weight and the amount of weight actually lost over a 10-week period.
C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.
   The predictor variable was the motivation to lose weight. This was a multi-value variable and was assessed on an interval scale.
   The criterion variable was the number of kilograms of weight lost during the 10-week study. This was a multi-value variable and was assessed on a ratio scale.
D) Statistical test: Pearson product-moment correlation coefficient.
E) Statistical null hypothesis (H0): ρ = 0; In the population, the correlation between the motivation to lose weight and the number of kilograms lost is equal to zero.
F) Statistical alternative hypothesis (H1): ρ ≠ 0; In the population, the correlation between the motivation to lose weight and the number of kilograms lost is not equal to zero.
G) Obtained statistic: r = .89
H) Obtained probability (p) value: p < .0001
I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.
J) Conclusion regarding the research hypothesis: These findings provide support for the study's research hypothesis.
K) Coefficient of determination: .79.
L) Formal description of the results for a paper: Results were analyzed by computing a Pearson product-moment correlation coefficient. This analysis revealed a significant positive correlation between the motivation to lose weight and the amount of weight actually lost, r = .89, p < .0001. The nature of the correlation coefficient showed that subjects who scored higher on the motivation to lose weight tended to lose more weight than those who scored lower on the motivation to lose weight. The coefficient of determination showed that motivation accounted for 79% of the variance in the amount of weight actually lost.
Some notes about the preceding report:

• Item H in the preceding report indicated the following:
  H) Obtained probability (p) value: p < .0001
This item used the less-than sign (<), because the less-than sign actually appeared in
Output 10.4, presented previously (that is, the p value that appeared below the
correlation coefficient was printed as <.0001). If the less-than sign is not actually
printed in the SAS Output, you should instead use the equal sign (=) when indicating
your obtained p value. For example, suppose that the p value for this correlation
coefficient had actually been printed as .0143. In that situation, you would have used the
equal sign, as below:
H) Obtained probability (p) value: p = .0143

• Item K reported the coefficient of determination as follows:
  K) Coefficient of determination: .79
An earlier section in this chapter indicated that the coefficient of determination is the
proportion of variance in one variable that is accounted for by variability in the second
variable. It indicated that, to compute the coefficient of determination, you square the
obtained correlation coefficient. In the present case, the obtained correlation coefficient
was r = .89. The coefficient of determination was therefore computed in this way:
Coefficient of determination = r²
Coefficient of determination = (.89)²
Coefficient of determination = .79
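
If you prefer to let SAS perform this arithmetic, a short DATA step will do it. The
following is a minimal sketch (the variable names R and R_SQUARED are arbitrary; the
value .88524 is the unrounded coefficient from Output 10.4):

   DATA _NULL_;
      R = 0.88524;         /* unrounded correlation from Output 10.4          */
      R_SQUARED = R**2;    /* square it to obtain the coefficient of          */
                           /* determination                                   */
      PUT R_SQUARED=;      /* write the result to the SAS log                 */
   RUN;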

Using PROC CORR to Compute All Possible Correlations for a Group of Variables
Overview
When you conduct a study that involves a number of variables, it is often a good idea to
compute correlations for all possible pairings of the variables. This enables you to see the
big picture regarding the nature of the relationship between the variables. In addition, if
you submit your research report for publication, the reviewers for most research journals
will expect you to include a table that contains these correlations.


This section shows how to use PROC CORR to compute all possible correlations for a group
of variables. It also shows you how to interpret results from the matrix of correlations that
will be produced by PROC CORR.
Writing the SAS Program
The syntax. In some situations, you may have measured a number of numeric variables in a
study, and want to compute every possible correlation between every possible pair of
variables. The syntax for the SAS statements that will do this is:
PROC CORR   DATA=data-set-name   options ;
   VAR  variable-list ;
   TITLE1  'your-name';
RUN;

The actual program. In the preceding VAR statement, the variable-list should simply be
a list of all numeric variables that you wish to analyze. For example, the current weight loss
study involves one criterion variable (KG_LOST), and four predictor variables (MOTIVAT,
EXERCISE, CALORIES, and IQ). Here are the SAS statements that will cause PROC
CORR to compute every possible correlation between these variables:
     [First part of the DATA step appears here]
28   20 8.20 30 2 1200 110
29   21 7.80 35 4 1600 130
30   22 9.00 35 6 1600 120
31   ;
32   PROC CORR   DATA=D1;
33      VAR  KG_LOST  MOTIVAT  EXERCISE  CALORIES  IQ;
34      TITLE1  'JANE DOE';
35   RUN;

The only difference between these SAS statements and those presented in the preceding
section is the fact that the VAR statement (on line 33) now lists all variables in the study,
rather than just KG_LOST and MOTIVAT.
Omitting the VAR statement. It should be noted that, if the VAR statement had been
omitted from the program, PROC CORR would again have computed every possible
correlation between every possible combination of numeric variables. If your data set
consists of a large number of numeric variables (as is often the case with conducting
research with questionnaires), this will result in a long printout of results with a very large
matrix of correlation coefficients. This might be undesirable.
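
As a sketch of what such a program might look like, the statements below would correlate
every numeric variable in the data set D1 (only the TITLE1 value, carried over from the
other examples, is arbitrary):

   PROC CORR   DATA=D1;
      TITLE1  'JANE DOE';
   RUN;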


Results from the SAS Output


Results that were generated by the statements in the preceding section are reproduced here:
                              JANE DOE                                12
                          The CORR Procedure

   5 Variables:  KG_LOST  MOTIVAT  EXERCISE  CALORIES  IQ

                          Simple Statistics
Variable    N       Mean    Std Dev        Sum    Minimum    Maximum
KG_LOST    22    4.98591    2.27488  109.69000    1.00000    9.00000
MOTIVAT    22   20.00000    8.99735  440.00000    5.00000   35.00000
EXERCISE   21    3.23810    1.84132   68.00000          0    6.00000
CALORIES   22       1773  356.14579      39000       1200       2400
IQ         22  120.00000   17.99471       2640   90.00000  150.00000

                Pearson Correlation Coefficients
                  Prob > |r| under H0: Rho=0
                    Number of Observations

            KG_LOST   MOTIVAT  EXERCISE  CALORIES        IQ
KG_LOST     1.00000   0.88524   0.53736  -0.55439   0.02361
                       <.0001    0.0120    0.0074    0.9169
                 22        22        21        22        22
MOTIVAT     0.88524   1.00000   0.47845  -0.54984   0.10294
             <.0001              0.0282    0.0080    0.6485
                 22        22        21        22        22
EXERCISE    0.53736   0.47845   1.00000  -0.22594   0.31201
             0.0120    0.0282              0.3247    0.1685
                 21        21        21        21        21
CALORIES   -0.55439  -0.54984  -0.22594   1.00000   0.19319
             0.0074    0.0080    0.3247              0.3890
                 22        22        21        22        22
IQ          0.02361   0.10294   0.31201   0.19319   1.00000
             0.9169    0.6485    0.1685    0.3890
                 22        22        21        22        22

Output 10.5. All possible correlations between the five variables
that were assessed in the weight loss study.

Understanding the output. In most respects, Output 10.5 (which presents the correlations
between all five variables) is very similar to Output 10.4 (which essentially presented only
the correlation between KG_LOST and MOTIVAT).
For example, the top section of the output presents simple descriptive statistics.
The lower section of the output presents the correlation coefficients.
The difference is that Output 10.5 presents a larger matrix of correlation coefficients. This
matrix consists of five rows running horizontally from left to right (each headed with the
name of one of the variables being analyzed), and five columns running vertically from top
to bottom (also headed with the name of one of the variables being analyzed).


Interpreting the information in each cell. Where the row for one variable intersects with
the column for another variable, you will find a cell that provides three pieces of information
concerning the correlation between those two variables: (a) the Pearson correlation between
the variables, (b) the p value associated with that correlation, and (c) the size of the sample
(N) on which the correlation is based.
Within the cell, this information is presented in the following format:
correlation
p value
N
For example, in Output 10.5 consider the cell appearing where the row KG_LOST
intersects with the column MOTIVAT. This cell shows that the correlation between the
variables is .88524.
The p value for this correlation is < 0.0001.
The size of the sample on which the correlation is based is 22. These figures are identical to
those shown in Output 10.4.
Now consider the cell that appears at the intersection of the row KG_LOST and the
column headed IQ. The information in this cell shows that the correlation between
KG_LOST and IQ is .02361, which rounds to .02. This is a very small correlation
coefficient, indicating a very weak relationship between the two variables.
Predictably, the p value for this correlation is quite large at .9169.
As with the earlier correlation coefficient, the size of the sample on which the correlation is
based is equal to 22. Given that the p value for this correlation is quite large at .9169 (which
is higher than the standard criterion of .05), it is clear that this correlation is not significantly
different from zero.
As a final example, consider the cell where the row KG_LOST intersects with the column
EXERCISE. Here, you can see that the correlation in this cell is based on just 21 subjects,
which is one less than the sample size for most of the other correlations in this matrix. Why
is this sample size smaller? It is because there was one instance of missing data for the
EXERCISE variable. If you review Table 10.1 (presented at the beginning of this chapter),
you will see that Subject #17 (Sam) had missing data on EXERCISE. This means that the
correlation between EXERCISE and any other variable will be based on 21 observations,
not 22.


Summarizing Results Involving a Nonsignificant Correlation
Overview
This section discusses the case of a correlation coefficient that is not significantly different
from zero. This is done so that you will be able to prepare analysis reports summarizing
nonsignificant correlations.
The Results from PROC CORR
Output 10.5 presented a matrix of correlations containing a few nonsignificant correlations.
For convenience, part of that output is reproduced here as Output 10.6.

                Pearson Correlation Coefficients
                  Prob > |r| under H0: Rho=0
                    Number of Observations

            KG_LOST   MOTIVAT  EXERCISE  CALORIES        IQ
KG_LOST     1.00000   0.88524   0.53736  -0.55439   0.02361
                       <.0001    0.0120    0.0074    0.9169
                 22        22        21        22        22
MOTIVAT     0.88524   1.00000   0.47845  -0.54984   0.10294
             <.0001              0.0282    0.0080    0.6485
                 22        22        21        22        22
EXERCISE    0.53736   0.47845   1.00000  -0.22594   0.31201
             0.0120    0.0282              0.3247    0.1685
                 21        21        21        21        21
CALORIES   -0.55439  -0.54984  -0.22594   1.00000   0.19319
             0.0074    0.0080    0.3247              0.3890
                 22        22        21        22        22
IQ          0.02361   0.10294   0.31201   0.19319   1.00000
             0.9169    0.6485    0.1685    0.3890
                 22        22        21        22        22

Output 10.6. Correlation coefficient and p value obtained for the correlation
between kilograms lost and IQ.

The correlation coefficient. As mentioned in the preceding section, the results of this
analysis revealed that the relationship between KG_LOST and IQ was very weak.
Information about this correlation appears in the cell where the row KG_LOST intersects
with the column IQ. This cell shows that the Pearson correlation between the number of
kilograms lost and subject IQ was only .02361, which rounds to .02. This correlation is quite
close to zero, indicating that there is essentially no relationship between KG_LOST and IQ:
Knowing where a subject stands on IQ does not predict where he stands on KG_LOST.


The probability value. The second entry in this cell provides the p value associated with
this correlation. You will remember that this p value indicates the probability that you would
obtain a correlation coefficient this large or larger if the actual correlation between these
variables in the population was equal to zero.
Your obtained p value in this case is .9169. This means that, if the actual correlation
between KG_LOST and IQ in the population was equal to zero, the probability that you
would obtain a correlation of .02 in a sample with N = 22 is equal to .9169 (a very high
probability).
As stated earlier, this guide recommends that a statistic be considered statistically significant
only when its p value is less than .05. Because the p value for this correlation (.9169) is
much larger than this criterion of .05, you would conclude that the correlation between
KG_LOST and IQ is not statistically significant. In other words, you would conclude that
the correlation between KG_LOST and IQ is not significantly different from zero. This only
makes sense, because your obtained correlation coefficient was only .02, a value quite close
to zero.
The Results from PROC PLOT
Why create a scattergram? An earlier section of this chapter indicated that you should
always use PROC PLOT to create a scattergram for two variables before computing the
correlation between them. Among other things, this scattergram will help you verify that the
relationship between the two variables is linear (a linear relationship is an assumption for the
Pearson correlation coefficient).
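
As a reminder of the statements involved, a request for such a scattergram might look like
the following sketch; with PROC PLOT, the variable named to the left of the asterisk is
placed on the vertical axis:

   PROC PLOT   DATA=D1;
      PLOT KG_LOST*IQ;
      TITLE1  'JANE DOE';
   RUN;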
At this point, it will be instructive to review the scattergram plotting KG_LOST against IQ.
This illustrates what a scattergram might look like when there is virtually no relationship
between the two variables. Output 10.7 presents this scattergram.


Plot of KG_LOST*IQ.                 JANE DOE
Legend: A = 1 obs, B = 2 obs, etc.

[Scattergram: KG_LOST (vertical axis, 1 to 9) plotted against IQ
(horizontal axis, 90 to 150). The plotted points form a roughly
circular cloud with no apparent slope.]

Output 10.7. Scattergram plotting kilograms lost against subject IQ.

Interpreting the shape of the scattergram. An earlier section said that, when there is a
relationship between two variables, their scattergram will usually be elliptical in shape
(shaped like a football), and will display some degree of slope. If the correlation is positive,
the ellipse will slant from the lower left corner to the upper right corner; if the correlation is
negative, the ellipse will slant from the upper left corner to the lower right corner.
However, you can see that the scattergram in Output 10.7 does not form an ellipse; instead,
it is fairly round in shape. If you were to draw a best-fitting straight line through the
scattergram, the line would not have any slope to speak of; it would be horizontal. To
illustrate this, Output 10.8 reproduces the same scattergram as appears in Output 10.7, this
time with a good-fitting straight line drawn through it.


Plot of KG_LOST*IQ.                 JANE DOE
Legend: A = 1 obs, B = 2 obs, etc.

[Scattergram: the same plot of KG_LOST against IQ as in Output 10.7,
this time with a best-fitting straight line drawn through the center
of the point cloud. The line is nearly horizontal.]

Output 10.8. Scattergram plotting kilograms lost against subject IQ, with
straight line drawn through the center of the scattergram.

These characteristics (a round scattergram with a near-horizontal slope) are the
characteristics of a scattergram for two variables that are not correlated with one another.


Summarizing the Results of the Analysis


Suppose that, in this study, you had initially hypothesized that IQ would be correlated with
the number of kilograms of weight lost. Here is an example of how you might prepare a
report summarizing your analysis and results:
A) Statement of the research question: The purpose of this
study was to determine whether subject IQ is correlated with
the amount of weight lost over a 10-week period.
B) Statement of the research hypothesis: There will be a
positive relationship between subject IQ and the amount of
weight lost over a 10-week period.
C) Nature of the variables: This analysis involved two
variables. The predictor variable was subject IQ: scores from
a standard intelligence test. This was an interval-level
variable. The criterion variable was the number of kilograms
of weight lost during the 10-week study. The criterion
variable was assessed on a ratio scale.
D) Statistical test: Pearson correlation coefficient.
E) Statistical null hypothesis (H0): ρ = 0; In the population,
the correlation between IQ and the number of kilograms lost is
equal to zero.
F) Statistical alternative hypothesis (H1): ρ ≠ 0; In the
population, the correlation between IQ and the number of
kilograms lost is not equal to zero.
G) Obtained statistic: r = .02
H) Obtained probability (p) value: p = .9169
I) Conclusion regarding the statistical null hypothesis: Fail
to reject the null hypothesis.
J) Conclusion regarding the research hypothesis: These
findings fail to provide support for the study's research
hypothesis.
K) Coefficient of determination: .00
L) Formal description of the results for a paper: Results were
analyzed by computing a Pearson correlation coefficient. This
analysis revealed a nonsignificant correlation between IQ and
the amount of weight lost, r = .02, p = .9169. The coefficient
of determination showed that IQ accounted for approximately 0%
of the variance in the amount of weight actually lost.


Some notes regarding the preceding report:

• Item K of the preceding report indicated that the coefficient of determination for this
analysis was equal to .00. This is because the correlation coefficient was r = .02. This
value squared was equal to .0004, which rounds to .00.

• Item L of the preceding report provides a formal description of the results for a paper.
Notice that this paragraph does not provide a description of the nature of the relationship
between the two variables (i.e., it does not report that high scores on IQ tended to be
associated with high scores on the amount of weight lost). This is because the correlation
coefficient was found to be nonsignificant. When the results are nonsignificant,
researchers typically do not describe the nature of the relationship between the variables in
their papers.

• Item L also reports that the "...coefficient of determination showed that IQ accounted for
approximately 0% of the variance in the amount of weight actually lost" (italics added).
The word approximately was used here because, technically, the coefficient of
determination was not exactly zero; it was .0004, which became zero only because it was
rounded to two decimal places.

Using the VAR and WITH Statements to Suppress the Printing of Some Correlations
Overview
An earlier section showed you how to compute all possible correlations for a group of
variables. When you are working with a moderate-to-large number of variables, however,
this approach has some disadvantages. Among these are the fact that it can result in a very
large number of correlations and many pages of output.
In those situations, it is sometimes preferable to print a limited number of correlations. You
can include the VAR and WITH statements within the PROC step to achieve this. Using this
approach, SAS will again print a matrix of correlation coefficients. The variables that you
list in the VAR statement will appear as the columns in this matrix, and the variables that
you list in the WITH statement will appear as the rows. This gives you greater control over
your output, and enables you to avoid printing all possible correlations.
Writing the SAS Program
The syntax. Here is the syntax for using the VAR and WITH statements with PROC
CORR:
PROC CORR   DATA=data-set-name   options ;
   VAR   column-variables ;
   WITH  row-variables ;
RUN;


The second line in the preceding syntax is the VAR statement. This statement includes the
entry column-variables to indicate that any variable that you list there will appear as a
column (running up and down) in the resulting matrix of correlation coefficients. The third
line of the general form is the WITH statement. This statement includes the entry row-variables to indicate that any variable that you list there will appear as a row (running from
left to right) in the resulting matrix of correlation coefficients. You can list the same variable
in both the VAR and WITH statements.
Suppose that you want to compute correlations between two sets of variables. One set of
variables will be KG_LOST and MOTIVAT, and the second set of variables will be
EXERCISE, CALORIES, and IQ. As output, you want to create a matrix of correlation
coefficients in which the column variables are KG_LOST and MOTIVAT, and the row
variables are EXERCISE, CALORIES, and IQ.
Here are the SAS statements that will cause PROC CORR to create this matrix of
correlations. Notice that the VAR statement includes KG_LOST and MOTIVAT, and the
WITH statement includes EXERCISE, CALORIES, and IQ.
     [First part of the DATA step appears here]
28   20 8.20 30 2 1200 110
29   21 7.80 35 4 1600 130
30   22 9.00 35 6 1600 120
31   ;
32   PROC CORR   DATA=D1;
33      VAR   KG_LOST  MOTIVAT;
34      WITH  EXERCISE  CALORIES  IQ;
35      TITLE1  'JANE DOE';
36   RUN;


Results from the SAS Output


Results generated by the statements in the preceding section are reproduced in Output 10.9:
                              JANE DOE                                13
                          The CORR Procedure

   3 With Variables:  EXERCISE  CALORIES  IQ
   2      Variables:  KG_LOST   MOTIVAT

                          Simple Statistics
Variable    N       Mean    Std Dev        Sum    Minimum    Maximum
EXERCISE   21    3.23810    1.84132   68.00000          0    6.00000
CALORIES   22       1773  356.14579      39000       1200       2400
IQ         22  120.00000   17.99471       2640   90.00000  150.00000
KG_LOST    22    4.98591    2.27488  109.69000    1.00000    9.00000
MOTIVAT    22   20.00000    8.99735  440.00000    5.00000   35.00000

                Pearson Correlation Coefficients
                  Prob > |r| under H0: Rho=0
                    Number of Observations

            KG_LOST   MOTIVAT
EXERCISE    0.53736   0.47845
             0.0120    0.0282
                 21        21
CALORIES   -0.55439  -0.54984
             0.0074    0.0080
                 22        22
IQ          0.02361   0.10294
             0.9169    0.6485
                 22        22

Output 10.9. Output produced by using the VAR and WITH statements
in the SAS program.

In Output 10.9, notice that the variables KG_LOST and MOTIVAT appear as columns. This
is because they were listed in the VAR statement. Notice that the variables EXERCISE,
CALORIES, and IQ appear as rows. This is because they were listed in the WITH
statement.


Computing the Spearman Rank-Order Correlation Coefficient for Ordinal-Level Variables
Overview
So far, this chapter has focused on computing the Pearson product-moment correlation
coefficient, which is appropriate when both of your variables are assessed on an interval
scale or ratio scale. In some situations, however, one or both of your variables may be
assessed on an ordinal scale. In those situations, it may be more appropriate to compute the
Spearman rank-order correlation coefficient. This is easy to do with SAS.
Situations Appropriate for This Statistic
You can use the Spearman rank-order correlation coefficient when either of the following is
true:
• both of your variables are assessed on an ordinal scale
• one of your variables is assessed on an ordinal scale, and the other variable is assessed on
an interval or ratio scale.

In Chapter 2 of this book, you learned that values on an ordinal scale represent the rank
order of the subjects with respect to the variable that was being assessed. When a variable is
on an ordinal scale, equal differences in scale values do not necessarily have equal
quantitative meaning. The best example of an ordinal scale is a variable that is created by
rank ordering the subjects according to a particular construct.
Example of When to Compute the Spearman Rank-Order Correlation Coefficient
For example, suppose that you measured two of the variables in your weight-loss study in a
particular way. Suppose that, to create the variable KG_LOST, you simply rank-ordered
your subjects with respect to how much weight they lost. The subject who lost the most
weight was assigned a value (score) of 1, the subject who lost the next most weight was
given a value of 2, and so on.
You used the same procedure to create a second variable in your study, MOTIVAT: You
rank-ordered your subjects with respect to their level of motivation to lose weight. The
subject who was most motivated to lose weight was given a value of 1, the subject who was
next most motivated was given a value of 2, and so on.
The variables KG_LOST and MOTIVAT are now rank-order variables. When you compute
the correlation between them, it is no longer appropriate to compute the Pearson correlation
coefficient. It is now appropriate to compute the Spearman rank-order correlation
coefficient.


Writing the SAS Program


The syntax for computing a Spearman correlation is identical to the syntax for computing a
Pearson correlation, except that you must include the key word SPEARMAN in the options
field of the PROC CORR statement. The syntax is as follows:
PROC CORR   DATA=data-set-name   SPEARMAN;
   VAR  variable-list ;
   TITLE1  'your-name';
RUN;

For example, to compute the Spearman correlation between KG_LOST and MOTIVAT, the
SAS statements would appear as follows:
PROC CORR DATA=D1 SPEARMAN;
VAR KG_LOST MOTIVAT ;
TITLE1 'JANE DOE';
RUN;

Understanding the SAS Output


When you compute Spearman correlations, the output is almost identical to the output that is
produced for Pearson correlations. This means that the output page will contain a matrix of
cells. Within a cell you will find the Spearman correlation at the top with the p value, and
the sample size below. You will interpret these correlation coefficients and p values in the
same manner as with the Pearson correlation coefficient. You will know that it is a matrix of
Spearman correlations appearing on a page of output because a heading similar to the
following heading will appear above the matrix:
Spearman Correlation Coefficients, N = 22
Prob > |r| under H0: Rho=0

Some Options Available with PROC CORR


Overview
Several options for PROC CORR enable you to control the way that data are analyzed, the
types of statistics that are computed, and the way that output is presented. This section
discusses a few of the more popular options, and lists the keywords that can be used to
request these options.


Where in the Program to Request Options


An earlier section showed the following syntax for the CORR procedure:
PROC CORR   DATA=data-set-name   options ;
   VAR  variable-list ;
   TITLE1  'your-name';
RUN;

The word options appears as part of the PROC CORR statement, meaning that this is the
location where keywords for options should appear. For example, the previous section
showed that the keyword SPEARMAN requests that PROC CORR compute Spearman
correlations rather than Pearson correlations. If you had computed the Spearman correlation
between KG_LOST and MOTIVAT, the last part of your SAS program would have looked
like the following (notice where the keyword SPEARMAN appears on line 32):
     [First part of the DATA step appears here]
28   20 8.20 30 2 1200 110
29   21 7.80 35 4 1600 130
30   22 9.00 35 6 1600 120
31   ;
32   PROC CORR   DATA=D1  SPEARMAN;
33      VAR  KG_LOST  MOTIVAT;
34      TITLE1  'JANE DOE';
35   RUN;

Description of Some Options


Below are the keywords for some of the options that tend to be used most frequently when
conducting research in the social and behavioral sciences. Remember that these keywords
would appear in the same location that the keyword SPEARMAN appeared on line 32 in the
program presented in the preceding section.
ALPHA
Prints coefficient alpha (a measure of reliability that is often used for summated
rating scales). Coefficient alpha will be computed for the variables listed in the VAR
statement. You must use the NOMISS option in conjunction with the ALPHA
option. The NOMISS option is discussed later.
COV
Prints covariances between the variables. This is useful when you need to create a
variance-covariance table, rather than a table of correlations.
KENDALL
Prints Kendall's tau-b coefficient, a measure of bivariate association for variables
assessed at the ordinal level.


NOMISS
Excludes from the analysis any observation (subject) with missing data on any of the
variables listed in the VAR statement. Using this option ensures that all correlations
will be based on exactly the same observations (and, therefore, on the same number
of observations).
NOPROB
Suppresses the printing of p values associated with the correlations.
RANK
For each variable, reorders the correlations from highest to lowest (in absolute
value) and prints them in this order.
SPEARMAN
Prints Spearman rank-order correlation coefficients, which are appropriate for
variables that are measured on an ordinal level.
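
These keywords can be combined in a single PROC CORR statement. For example, the
following sketch requests coefficient alpha and excludes observations with missing data
(remember that ALPHA requires NOMISS). In practice you would list the items of a single
summated rating scale in the VAR statement; the weight-loss variables appear here only to
keep the example consistent with the rest of the chapter:

   PROC CORR   DATA=D1  ALPHA  NOMISS;
      VAR  KG_LOST  MOTIVAT  EXERCISE  CALORIES  IQ;
      TITLE1  'JANE DOE';
   RUN;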
Where to Find More Options for PROC CORR
PROC CORR is a part of base SAS software, and this means that you will find a complete
listing of all of the options that are available with PROC CORR in the SAS Procedures
Guide (SAS Institute Inc., 1999c). Please note that you will not find PROC CORR covered
in the SAS/STAT User's Guide (SAS Institute Inc., 1999d).

Problems with Seeking Significant Results


Overview
One of the positive aspects of using SAS is that it makes it very easy to compute a large
number of correlation coefficients for a large number of variables. But this also creates a
trap into which many researchers fall. This trap involves computing a large number of
coefficients, searching through the results to identify the significant coefficients, and then
preparing a research report in which you create the impression that the significant
relationships that you obtained were the ones that you predicted from the beginning. This
section explains why this approach is not good science.
Reprise: Null Hypothesis Testing with Just Two Variables
The study. Suppose that you are investigating the relationship between just two variables in a
sample of 200 adults: You are investigating the correlation between height (in inches) and IQ.
Suppose that you don't know the correlation between these two variables, but it is in fact equal
to zero in the population. That is, if it were possible to study the entire population of possible
subjects, you would find that there is absolutely no relationship between height and IQ.


You state your null hypothesis as follows:


Statistical null hypothesis (H0): ρ = 0; In the population, the correlation between
height and IQ is equal to zero.
You draw a random sample of 200 adults, measure their height in inches, assess their IQ,
and compute the correlation between the two variables.
Making a Type I error. Before the study began, you made the decision that you would
reject the null hypothesis of no correlation only if the p value that you obtain is less than .05.
This decision gives you some protection against making a Type I error. When you make a
Type I error, you reject a null hypothesis that was trueyou reject a null hypothesis that
should not have been rejected. In other words, you conclude that there is a correlation
between the two variables in the population when, in fact, there is none.
When you conduct a study such as the one described above, what is the probability that you
will commit a Type I error? It is equal to the criterion that you set for your p value. If you
decide that you will reject the null hypothesis only if your p value is less than .05, then you
have only a 5% chance of making a Type I error. If you decide that you will reject the null
hypothesis only if your p value is less than .01, then you have only a 1% chance of making a
Type I error. This criterion is typically referred to as the alpha level that you set for your test
(symbolized as = .05 or = .01, for example). The alpha level can be defined as the
significance level that a researcher sets for a null hypothesis test; it is the probability that the
analysis will result in a Type I error.
If you are computing the correlation between just two variables and you set alpha at .05,
then you know that you have only a 5% chance of making a Type I error. This enables you
to have some confidence in your findings.
Null Hypothesis Testing with a Larger Number of Variables
Overview. The situation is different when you compute a large number of correlations
between a large number of variables. This is where researchers often commit Type I errors
without knowing it.
The study. Consider the following situation: Suppose that you have obtained data on a
relatively large number of variables from a sample of 200 subjects. Imagine that, in the
population, there is absolutely no correlation between any of these variables. That is, if you
were able to gather data from the entire population, you would find that every possible
correlation between these variables is equal to zero. But you don't know this, because you
cannot study the entire population.
So instead you gather data from your 200 subjects, and compute every possible correlation
between the variables. Assume that this results in exactly 100 correlation coefficients. You
decide that you will consider a correlation coefficient to be significant if its p value is less
than .05.


The results. You review the results produced by SAS, and find that five of your correlation
coefficients are significant at the .05 level. Some of these correlations look interesting. For
example, you find that there is a significant correlation between height and IQ. You prepare an
article that summarizes your findings, and it is eventually published by a research journal. All
over the world, researchers are now reading about the relationship between height and IQ.
Making Type I errors. The problem, of course, is that you have made a Type I error. In the
population, the correlation between height and IQ is actually equal to zero. Yet you have
rejected this null hypothesis, and have told the research community that there is a correlation
between these two variables.
How could this have happened? It is because you computed such a large number of
correlations. Whenever you (a) set your alpha level at .05 and (b) compute 100 correlations,
you should expect about five of the correlations to be statistically significant, even if there is
actually no correlation between any of the variables in the population. When you set alpha at
.05, it means that you expect a 5% chance of making a Type I error. If you compute just one
correlation coefficient, you are fairly safe. But if you compute 100 correlation coefficients,
about five of them (that is, about 5% of the 100 coefficients) are going to be significant
simply due to sampling error.
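Some simple arithmetic makes the problem concrete (this calculation assumes that the 100
tests are independent, which is rarely exactly true in practice, so treat it as an
approximation):
Expected number of significant correlations = 100 × .05 = 5
Probability of at least one Type I error = 1 − (1 − .05)^100 ≈ .99
In other words, with 100 tests conducted at the .05 level, finding a handful of significant
correlations is virtually guaranteed, even when every correlation in the population is zero.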
Sampling error occurs whenever a sample is not perfectly representative of the population
from which it was drawn. Some degree of sampling error occurs just about any time that you
work with a sample. Because of sampling error, five pairs of variables in your data set
demonstrated correlations that were fairly large; large enough to be statistically significant.
But these correlations were not significant because the variables were actually correlated in
the population. They were significant because of sampling error.
Because you published an article in which you discussed your significant results, you have
now misled other researchers into believing that there is a correlation between height and IQ
when, in fact, there is not. You have led them down a blind alley that will ultimately prove
to be a waste of their time.
How to Avoid This Problem
Unfortunately, a large number of researchers conduct research in the manner described
above. But there are several things that you can do to avoid this path:

In any study, you should generally compute as few correlations as possible. Any analysis
should be driven by theory or by previous research. You should never blindly compute a
large number of correlations, and then review the results to see what turns out
significant.

The research reports that you prepare should specify the number of correlations that you
computeincluding correlations that proved to be nonsignificant. In this way your
reviewers can assess the probability that your results are due to sampling error.

• In some cases in which you compute a relatively large number of correlations, you may
want to consider using a more conservative alpha level for each correlation. For example,
you may decide that you will consider a correlation to be significant only if it is significant
at the .001 level (rather than the .05 level). The logic behind this approach is similar to the
logic behind using the Bonferroni t test to control the familywise error rate when
performing analysis of variance (ANOVA). See Howell (1997, pp. 362–365) for a
discussion of the Bonferroni t test.

Conclusion
This chapter has shown you how to use PROC PLOT and PROC CORR to investigate the
relationship between two numeric variables. It has shown you how to compute the Pearson
correlation coefficient, and how to determine whether that correlation coefficient is
significantly different from zero.
A statistical procedure that is similar to bivariate correlation is bivariate linear regression.
Linear regression is a procedure for identifying the best-fitting straight line that
summarizes the relationship between two numeric variables. Bivariate linear regression
enables you to assess where an individual stands on one variable, and then use that score to
predict where they probably stand on a second variable.
The following chapter introduces you to bivariate linear regression. It shows how to use
PROC REG to perform regression, how to determine whether the resulting regression
coefficient is significantly different from zero, how to draw a regression line through a
scattergram, and how to use this regression line to predict where specific subjects will
probably stand on the criterion variable.

Bivariate Regression

Introduction ...................................................... 341
  Overview ........................................................ 341
Choosing between the Terms Predictor Variable, Criterion Variable,
  Independent Variable, and Dependent Variable ................... 341
  Overview ........................................................ 341
  Nonexperimental Research ........................................ 341
  Experimental Research ........................................... 342
  Choosing the Terms to Use ....................................... 343
Situations Appropriate for Bivariate Linear Regression ........... 344
  Overview ........................................................ 344
  Scale of Measurement Used with the Predictor and Criterion
    Variables ..................................................... 344
  The Type-of-Variable Figure ..................................... 344
  Example of a Study Providing Data Appropriate for This Procedure 345
  Summary of Assumptions Underlying Bivariate Linear Regression ... 346
Example 11.1: Predicting Weight Loss from a Variety of
  Predictor Variables ............................................. 346
  Overview ........................................................ 346
  The Study ....................................................... 347
  The Predictor Variables and Criterion Variable in the Analysis .. 347
  Data Set to be Analyzed ......................................... 348
  The DATA Step for the SAS Program ............................... 349
Using PROC REG: Example with a Significant Positive
  Regression Coefficient .......................................... 350
  Overview ........................................................ 350
  Verifying That Your Data Are Appropriate for the Analysis ....... 351
  Writing the SAS Program with PROC REG ........................... 351
  Results from the SAS Output ..................................... 353
  Steps in Interpreting the Output ................................ 353
  Drawing a Regression Line through the Scattergram ............... 358
  Reviewing the Table of Predicted Values ......................... 365
  Predicting Y within the Range of X .............................. 367
  Summarizing the Results of the Analysis ......................... 367
  Notes about the Preceding Summary ............................... 370
Using PROC REG: Example with a Significant Negative
  Regression Coefficient .......................................... 371
  Overview ........................................................ 371
  Correlation between Kilograms Lost and Calorie Consumption ...... 371
  Using PROC PLOT to Create a Scattergram ......................... 372
  Using PROC REG to Perform the Regression Analysis ............... 374
  Summarizing the Results of the Analysis ......................... 376
  Notes about the Preceding Summary ............................... 378
Using PROC REG: Example with a Nonsignificant
  Regression Coefficient .......................................... 379
  Overview ........................................................ 379
  Correlation between Kilograms Lost and IQ Scores ................ 379
  Using PROC REG to Perform the Regression Analysis ............... 379
  Summarizing the Results of the Analysis ......................... 380
  Note about the Preceding Summary ................................ 382
Conclusion ........................................................ 383


Introduction
Overview
Linear regression is a procedure for identifying the best-fitting straight line that summarizes
the relationship between two variables. It is called bivariate regression here because this
chapter focuses on situations in which you are dealing with just two variables: a single
predictor variable and a single criterion variable.
You may use bivariate regression in the same types of situations in which a Pearson
correlation coefficient is appropriate: situations in which you want to investigate the
relationship between two numeric variables that are assessed on an interval scale or ratio
scale. When variables X and Y are correlated with one another, you can use linear regression
for prediction: If you know where a given subject stands on the X variable, you can use
regression procedures to compute a best estimate of where that subject probably stands on the
Y variable.
This chapter builds on the information already covered in Chapter 10, "Bivariate Correlation."
It begins by providing some guidance on the appropriate use of the terms predictor variable,
criterion variable, independent variable, and dependent variable. It shows you how to use the
SAS System's REG procedure to analyze data from nonexperimental research and to develop
the regression equation for a pair of variables. It illustrates how you can use this regression
equation to draw a best-fitting straight line through a scattergram created by PROC PLOT.
Finally, it shows how PROC REG can be used to create a table of predicted values, along with
the residuals of prediction.

Choosing between the Terms Predictor Variable, Criterion Variable, Independent Variable, and Dependent Variable
Overview
In the remaining chapters of this book, you will often see the terms predictor variable,
criterion variable, independent variable, and dependent variable. Two of these terms are more
appropriately used with one type of research investigation, while the other two terms are more
appropriately used with a different type of investigation. This section reviews these two types
of research, and offers guidelines for using these terms.
Nonexperimental Research
Most of the studies described in this chapter are examples of nonexperimental research,
research in which you are studying the relationship between naturally occurring variables. In


nonexperimental research, you are not manipulating or controlling variables; rather, you are
studying the variables as they naturally exist.
For example, suppose that you are interested in studying the relationship between taking
vitamin E and the symptoms of depression. You might believe that, the more vitamin E a
person takes, the fewer symptoms of depression that person will report. To investigate this
idea, you administer a survey to a large sample of individuals. The survey assesses two
variables: (a) the quantity of vitamin E taken by the subject, and (b) the depression symptoms
reported by the subject. You do not manipulate or control the amount of vitamin E that the
subjects are taking; you are simply measuring the amount of vitamin E that they normally
take on their own. You will analyze the data to determine whether there is a relationship
between the two variables. If there is, it may be possible to use scores on the "amount of
vitamin E taken" variable to predict scores on the "symptoms of depression" variable.
The study described above involved one criterion variable and one predictor variable. A
criterion variable is an outcome variable that can be predicted from one or more predictor
variables. The criterion variable is often the main focus of the study because it is the outcome
variable that is mentioned in the statement of the research problem. In the study described
above, the criterion variable was the "symptoms of depression" variable. It was the outcome
variable that was of central interest in the study. When discussing bivariate regression, a
criterion variable is sometimes referred to as a Y variable, and is often symbolized by the
letter Y.
In nonexperimental research, a predictor variable is a variable that is used to predict values
on the criterion. In some studies, you might even believe that the predictor variable has a
causal effect on the criterion (although nonexperimental research generally provides only
weak evidence of cause and effect). In the study described above, the predictor variable was
the amount of vitamin E taken. If you had found that there was a relationship between the
two variables, you could have gone on to use scores on the "amount of vitamin E taken"
variable to predict scores on the "symptoms of depression" variable. When discussing
bivariate regression, a predictor variable is sometimes referred to as an X variable, and is
often symbolized by the letter X.
Experimental Research
In contrast to nonexperimental research, experimental research typically involves a much
higher degree of control over the subjects and over the environmental conditions experienced
by the subjects. Chapter 2 of this book indicated that most experimental research is identified
by three characteristics:

subjects are randomly assigned to experimental conditions

the researcher manipulates an independent variable

subjects in different treatment conditions are treated similarly with regard to all variables
except for the independent variable.


An experiment involves at least one independent variable and at least one dependent variable.
An independent variable is a variable whose values (or levels) are selected by the
experimenter to determine what effect the independent variable has on the dependent variable.
In contrast, a dependent variable is some aspect of the subject's behavior that is assessed to
determine whether it has been affected by the independent variable.
To illustrate these concepts, it is possible to modify the preceding study so that it becomes a
true experiment. Suppose that you begin with a sample of 200 subjects, and you randomly
assign 100 of the subjects to an experimental group and the other 100 subjects to a control group.
Subjects in the experimental group are each given 60 IU (International Units) of vitamin E
each day. Subjects in the control group are each given zero IU of vitamin E each day. Subjects
experience these conditions over a six-month period, and at the end of this period each subject
completes a questionnaire in which he or she reports the number of symptoms of depression
he or she has experienced recently.
In this study, the independent variable was the amount of vitamin E taken. You know this
because this was the variable that was manipulated by you, the researcher. In this study, the
dependent variable consisted of scores on the symptoms of depression questionnaire. You
know this because this was the aspect of the subjects' behavior that you measured to
determine whether it had been affected by the independent variable (the amount of vitamin E
taken).
Choosing the Terms to Use
From the preceding, you can see that the term predictor variable is to some extent a
counterpart to the term independent variable: The term predictor variable is relevant to
nonexperimental research, and the term independent variable is relevant to experimental
research. In the same way, the term criterion variable is to some extent a counterpart to the
term dependent variable. Because students are sometimes confused by the use of these terms,
this section will summarize the circumstances under which it is best to use the terms
independent and dependent variable, and the circumstances under which it is best to use the
terms predictor and criterion variable.

• In general, it is best to use the terms independent variable and dependent variable only when
you are discussing a true experiment in which the researcher has actually manipulated some
variable of interest. In this book, however, you will see some exceptions to this rule. For
example, this chapter shows only examples of nonexperimental research. However, when
the data are analyzed, you will see that the SAS output includes the heading Dependent
Variable where the criterion variable is listed.
• In general, it is best to use the terms predictor variable and criterion variable when you are
discussing nonexperimental research (i.e., research in which you are simply studying the
relationship between naturally occurring variables). However, the terms predictor variable
and criterion variable are general-purpose terms, and it is also acceptable to use them when
discussing experimental research. In fact, this book uses them with respect to both types of
research. For example, the following chapters of this book will show you how to prepare
analysis reports for both experimental as well as nonexperimental studies. For simplicity,
these research reports will use the headings Predictor variable and Criterion variable,
regardless of whether the study being described was a true experiment or a nonexperimental
study.

Situations Appropriate for Bivariate Linear Regression


Overview
Linear regression is generally used for purposes of prediction. You use this statistic when you
suspect that there might be a significant relationship between two variables, and you want to
use subject scores on one variable (the predictor variable) to predict subject scores on the
second variable (the criterion variable).
The first part of this section describes the types of situations in which linear regression is
usually performed, and discusses a few of its statistical assumptions. A more complete
summary of assumptions is presented at the end of this section.
Scale of Measurement Used with the Predictor and Criterion
Variables
Predictor variable. When performing linear regression and prediction, the predictor variable
should be a numeric variable that is assessed on an interval or ratio scale of measurement.
Criterion variable. The criterion variable should also be a numeric variable that is assessed
on an interval or ratio scale of measurement.
The Type-of-Variable Figure
When researchers perform bivariate regression, they are typically studying the relationship
between (a) a criterion variable that is a multi-value numeric variable and (b) a predictor
variable that is also a multi-value numeric variable. The type-of-variable figure below
illustrates this situation:
     Criterion          Predictor
       Multi      =       Multi
The Multi symbol that appears to the left of the equal sign in the above figure represents the
fact that the criterion variable in this analysis is usually a multi-value variable (a variable that
assumes more than six values in your sample). The Multi symbol that appears to the right of
the equal sign shows that the predictor variable in the regression analysis is also typically a
multi-value variable.


Example of a Study Providing Data Appropriate for This Procedure


The study. Suppose that you are a college administrator who wants to use scores on a
fictitious test called the Learning Aptitude Test (LAT) to predict whether applicants will be
successful if admitted to your university. You administer the aptitude test to a group of 400
high school students while they are still in high school. Two years later, you track down the
students (who are now in college) and record their college grade point averages (GPAs). At
that time, you compute the correlation between the students LAT scores and their college
GPAs. You are pleased to find that there is a strong positive correlation between the two
variables. This means that it should be possible to use LAT scores to predict college GPA
with some degree of accuracy.
Using the same data set, you then use bivariate regression to further investigate the
relationship between the two variables. You use the REG procedure to perform a regression in
which college GPA is the criterion variable, and LAT scores are the predictor variable. The
output from PROC REG includes the regression coefficient and intercept that you need to
create a regression equation (these terms will be explained later in this chapter).
In subsequent years, you use this regression equation to predict college GPA for applicants
who want to be admitted to your university. Specifically, you obtain the LAT scores from
high school students who have applied for admission. You insert the applicants' LAT scores
in the regression equation, and use the equation to predict what the students' GPAs will
probably be if they are admitted to your school. You then accept those students who are likely
to have GPAs over 2.00 (according to the regression equation).
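
Although PROC REG is covered in detail later in this chapter, a sketch of the statements for
such an analysis might look like the following (the data set name D1 and the variable names
GPA and LAT are hypothetical names for this fictitious study):

   PROC REG   DATA=D1;
      MODEL  GPA = LAT;
      TITLE1  'JANE DOE';
   RUN;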
Why these data are appropriate for this procedure. This section has already indicated
that, in bivariate regression, the predictor variable should be a numeric variable that is
assessed on an interval or ratio scale of measurement. The predictor variable in this study
consisted of scores on the Learning Aptitude Test (LAT). The LAT is fictitious, but if we
assume that it is similar to other carefully developed aptitude tests (such as the Graduate
Record Exam), then most researchers will agree that its scores constitute an interval scale of
measurement.
To perform regression, the criterion variable should also be on an interval scale or ratio scale
of measurement. The criterion variable in the present study was college grade point average.
Assuming that GPA in this study was assessed in the usual fashion (e.g., using the system in
which 4.00 represents straight A's), most researchers would probably agree that it is also
on an interval scale of measurement.
Remember that, when you perform linear regression, the predictor and the criterion variables
are usually multi-value variables. To determine whether this is the case for the current study,
you would use the FREQ procedure to create simple frequency tables for the predictor and
criterion variables (similar to those shown in Chapter 5, "Creating Frequency Tables"). If you
observe more than six values for each of them in their frequency tables, then you know that
both variables are multi-value variables.
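To make this concrete, here is a minimal sketch of such a check. The data set name D1 and the variable names LAT and GPA are hypothetical names used only for this illustration, and the sketch assumes that the scores have already been read in with a DATA step:

   PROC FREQ DATA=D1;
      TABLES LAT GPA;      /* one simple frequency table per variable */
      TITLE1 'JANE DOE';
      RUN;

If each of the resulting frequency tables displays more than six distinct values, you would treat both variables as multi-value variables.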


Summary of Assumptions Underlying Bivariate Linear Regression

Interval-level measurement. Both the predictor and criterion variables should be assessed
on an interval or ratio level of measurement.

Random sampling. Each subject in the sample should contribute one score on the predictor
variable, and one score on the criterion variable. These pairs of scores should represent a
random sample drawn from the population.

Linearity. The relationship between the criterion variable and the predictor variable should
be linear. This means that, in the population, the mean criterion scores at each value of the
predictor variable should fall on a straight line. Linear regression procedures are not
appropriate for assessing the nature of the relationship between two variables that are
involved in a curvilinear relationship.

Equal variances. The variances of the Y scores should be approximately the same at each
value of X. This condition is referred to as homoscedasticity.

Bivariate normal distribution. The pairs of scores should follow a bivariate normal
distribution. This means that (a) scores on the criterion variable should form a normal
distribution at each value of the predictor variable and (b) scores of the predictor variable
should form a normal distribution at each value of the criterion variable. When scores
represent a bivariate normal distribution, they form an elliptical scattergram when they are
plotted (i.e., their scattergram is shaped like a football: relatively fat in the middle and
tapered on the ends).

Normally distributed residuals. When predicting the Y variable from the X variable, the
residuals of prediction (i.e., the errors of prediction) will be normally distributed with a
mean of zero and a standard deviation of one. The concept of residuals of prediction will
be explained later in this chapter.
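Although this chapter verifies predictions with the P option on the MODEL statement, one way to examine this last assumption is to save the residuals to a new data set with PROC REG's OUTPUT statement and then test them with PROC UNIVARIATE. The following is only a sketch under assumed names (D1, Y, X, CHECK, PREDICT, and RESID are placeholders, not names used elsewhere in this chapter):

   PROC REG DATA=D1;
      MODEL Y = X;
      OUTPUT OUT=CHECK P=PREDICT R=RESID;   /* save predicted values and residuals */
      RUN;

   PROC UNIVARIATE DATA=CHECK NORMAL;
      VAR RESID;                            /* normality tests for the residuals */
      RUN;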

Example 11.1: Predicting Weight Loss from a Variety of Predictor Variables
Overview
This chapter illustrates regression and prediction by referring to the fictitious study on weight
loss that was presented in Chapter 10, "Bivariate Correlation." It is assumed that you have
completed Chapter 10 before moving on to the present chapter; for this reason, only the
highlights of that study are briefly reviewed here to refresh your memory.


The Study
Overview. Suppose that you conduct a correlational study that is designed to identify
variables that are predictive of weight loss in a sample of 22 men over a 10-week period.
Throughout the study, you assess four predictor variables that you believe should be
correlated with weight loss. At the end of the study, you assess how much weight each man
has lost, and investigate the relationship between this criterion variable and your four
predictor variables.
Method. At the beginning of the study, you administer a 5-item scale that is designed to
assess each subject's motivation to lose weight. The scale consists of statements such as "It is
very important to me to lose weight." Subjects respond to each item using a 7-point response
format in which 1 = "Disagree Very Strongly" and 7 = "Agree Very Strongly." You sum their
responses to the five items to create a single motivation score for each subject. Scores on this
measure may range from 5 to 35, with higher scores representing greater motivation to lose
weight.
Throughout the 10-week study, you ask each subject to record the number of hours that he
exercises each week. At the end of the study, you determine the average number of hours
spent exercising for each subject, and correlate this number with subsequent weight loss.
Throughout the study, you also ask each subject to keep a log of the number of calories that he
consumes each day. At the end of the study, you compute the average number of calories that
are consumed by each subject each day. You will correlate this measure of daily calorie intake
with subsequent weight loss.
At the beginning of the study, you also administer the Wechsler Adult Intelligence Scale
(WAIS) to each subject. The combined IQ score from this instrument will serve as the
measure of intelligence in your study. You then correlate IQ with subsequent weight loss.
Throughout the 10-week study, you weigh each subject and record his body weight in
kilograms (1 kilogram is equal to approximately 2.2 pounds). When the study is completed,
you subtract each subject's body weight at the end of the study from his weight at the
beginning of the study. You use the resulting difference as your measure of weight loss.
The Predictor Variables and Criterion Variable in the Analysis
This study involves one criterion variable and four predictor variables. The criterion variable
is weight loss measured in kilograms (kgs). In the analysis, you will give this variable the
SAS variable name KG_LOST, for "kilograms lost."
The first predictor variable is motivation to lose weight, as measured by the questionnaire
described above. In the analysis, you will use the SAS variable name MOTIVAT to represent
this variable.


The second predictor variable is the average number of hours that the subjects spent
exercising each week during the study. In the analysis, you will use the SAS variable name
EXERCISE to represent this variable.
The third predictor variable is the average number of calories that are consumed each day. In
the analysis, you will use the SAS variable name CALORIES to represent this variable.
The final predictor variable is intelligence, as measured by the WAIS test. In the analysis, you
will use the SAS variable name IQ to represent this variable.
Data Set to be Analyzed
Table 11.1 presents fictitious scores for each subject on each of the variables to be analyzed in
this study.
Table 11.1
Variables Analyzed in the Weight Loss Study
___________________________________________________________________
               Kilograms                 Hours       Calories
Subject           lost      Motivation  exercising   consumed    IQ
___________________________________________________________________
01. John          2.60           5          0          2400     100
02. George        1.00           5          0          2000     120
03. Fred          1.80          10          2          1600     130
04. Charles       2.65          10          5          2400     140
05. Paul          3.70          10          4          2000     130
06. Jack          2.25          15          4          2000     110
07. Emmett        3.00          15          2          2200     110
08. Don           4.40          15          3          1400     120
09. Edward        5.35          15          2          2000     110
10. Rick          3.25          20          1          1600      90
11. Ron           4.35          20          5          1800     150
12. Dale          5.60          20          3          2200     120
13. Bernard       6.44          20          6          1200      90
14. Walter        4.80          25          1          1600     140
15. Doug          5.75          25          4          1800     130
16. Scott         6.90          25          5          1400     140
17. Sam           7.75          25          .          1400     100
18. Barry         5.90          30          4          1600     100
19. Bob           7.20          30          5          2000     150
20. Randall       8.20          30          2          1200     110
21. Ray           7.80          35          4          1600     130
22. Tom           9.00          35          6          1600     120
___________________________________________________________________

Table 11.1 provides scores for 22 male subjects. The first subject appearing in the table is
named John. Table 11.1 shows the following values for John on the study's variables:

•  He lost 2.60 kgs of weight by the end of the study.

•  His score on the motivation scale was 5 (out of a possible 35).

•  His score on Hours Exercising was 0, meaning that he exercised zero hours per week on
   the average.

•  His score on calories was 2400, meaning that he consumed 2400 calories each day, on the
   average.

•  His IQ was 100 (with the WAIS, the mean IQ is 100 and the standard deviation is
   approximately 15 in the population).

Scores for the remaining subjects can be interpreted in the same way.
The DATA Step for the SAS Program
Below is the DATA step for the SAS program that will read the data presented in
Table 11.1.
 1     OPTIONS LS=80 PS=60;
 2     DATA D1;
 3        INPUT SUB_NUM
 4              KG_LOST
 5              MOTIVAT
 6              EXERCISE
 7              CALORIES
 8              IQ;
 9     DATALINES;
10     01 2.60  5 0 2400 100
11     02 1.00  5 0 2000 120
12     03 1.80 10 2 1600 130
13     04 2.65 10 5 2400 140
14     05 3.70 10 4 2000 130
15     06 2.25 15 4 2000 110
16     07 3.00 15 2 2200 110
17     08 4.40 15 3 1400 120
18     09 5.35 15 2 2000 110
19     10 3.25 20 1 1600  90
20     11 4.35 20 5 1800 150
21     12 5.60 20 3 2200 120
22     13 6.44 20 6 1200  90
23     14 4.80 25 1 1600 140
24     15 5.75 25 4 1800 130
25     16 6.90 25 5 1400 140
26     17 7.75 25 . 1400 100
27     18 5.90 30 4 1600 100
28     19 7.20 30 5 2000 150
29     20 8.20 30 2 1200 110
30     21 7.80 35 4 1600 130
31     22 9.00 35 6 1600 120
32     ;


Some notes about the preceding program:


Line 1 of the preceding program contains the OPTIONS statement, which, in this case,
specifies the size of the printed page of output.
One entry in the OPTIONS statement is PS=60, which is an abbreviation for
PAGESIZE=60. This keyword requests that each page of output have up to 60 lines of text
on it. Depending on the font that you are using (and other factors), requesting PS=60 may
cause the bottom of your scatterplot to be cut off when it is printed. If this happens, you
should change the OPTIONS statement so that it requests just 50 lines of text per page. You
will do this by including PS=50 in your OPTIONS statement, rather than PS=60. Your
complete OPTIONS statement should appear as follows:
OPTIONS LS=80 PS=50;

You can see that lines 3-8 of the preceding program provide the INPUT statement. The SAS
variable name SUB_NUM represents "subject number," KG_LOST represents "kilograms
lost," MOTIVAT represents "motivation to lose weight," and so on.
The data appear in lines 10-31. The data on these lines are identical to the data appearing in
Table 11.1, except that the names of the subjects have been removed.

Using PROC REG: Example with a Significant Positive Regression Coefficient
Overview
When there is a positive relationship between variable X and variable Y, it means that

•  low scores on X are associated with low scores on Y

•  high scores on X are associated with high scores on Y.

Chapter 10 presented one analysis in which the motivation to lose weight (MOTIVAT) was
correlated with kilograms of weight lost (KG_LOST). The analysis revealed a strong positive
relationship between the two variables. The strong relationship shows that, if you know where
a subject stands on MOTIVAT, it should be possible to predict where they stand on
KG_LOST with some degree of accuracy.
In this section, you will again explore the relationship between MOTIVAT and KG_LOST.
This time, however, you will use PROC REG to develop a regression equation that can be
used to predict weight loss scores from motivation scores.


Verifying That Your Data Are Appropriate for the Analysis


Before performing a regression analysis, you should always perform some preliminary
analyses to verify that your data are in proper form. At a minimum, you should perform
PROC MEANS and PROC UNIVARIATE to verify that the data set does not contain any
obvious typing errors and that the variables do not demonstrate a marked departure from
normality. Chapter 7, "Measures of Central Tendency and Variability," showed how to
perform these analyses, and so those procedures will not be repeated here.
In addition, you should perform the PLOT procedure to verify that the relationship between
your variables is linear and that your data form an approximately bivariate normal
distribution. Chapter 10 of this book showed how to do that.
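For the present analysis, those preliminary checks might look something like the following sketch (the particular statistic keywords requested on PROC MEANS are one reasonable choice, not a requirement):

   PROC MEANS DATA=D1 N MEAN STD MIN MAX;
      VAR MOTIVAT KG_LOST;     /* screen for obvious typing errors */
      RUN;

   PROC UNIVARIATE DATA=D1 NORMAL;
      VAR MOTIVAT KG_LOST;     /* screen for marked departures from normality */
      RUN;

   PROC PLOT DATA=D1;
      PLOT KG_LOST*MOTIVAT;    /* screen for linearity and bivariate normality */
      RUN;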
Writing the SAS Program with PROC REG
The syntax. Below is the syntax for the SAS statements that request the REG procedure:
PROC REG DATA=data-set-name ;
MODEL criterion-variable = predictor-variable /options ;
TITLE1 ' your-name ';
RUN;
The preceding shows that, in the PROC REG statement, you should use the DATA option to
identify the data set that is to be analyzed.
You can see that the MODEL statement includes an equal sign (=). To the left of this equal
sign, you provide the name of the criterion variable, that is, the outcome variable that you are
trying to predict. Many texts refer to this as the "Y variable." It is also sometimes referred to
as the "dependent variable" (you will see that the SAS output uses the heading "Dependent
Variable" when the name of this variable is printed).
To the right of the equal sign in the MODEL statement, you provide the name of the predictor
variable. Many texts refer to this as the "X variable." It is also sometimes referred to as the
"independent variable."
If you are requesting any options for the analysis, add a slash mark (/) followed by the
keyword for each option in the MODEL statement. This chapter illustrates two of the many
options available with PROC REG:
STB
The keyword STB requests that PROC REG print the standardized regression coefficient
for the analysis. A regression coefficient is an estimate of the average amount of change
that takes place in Y for every one-unit change in X. A standardized regression
coefficient is an estimate of what the regression coefficient would be if both variables were
standardized to have a mean of zero and a standard deviation of 1 (regression coefficients
will be discussed in greater detail in a later section).


P
The keyword P requests that PROC REG print predicted values (Y′ values) on the
criterion variable for each observation. The table produced by this option includes the
following:

•  An observation number for each observation.

•  Each observation's actual score on the criterion variable.

•  Each observation's predicted score on the criterion variable (based on the regression
   equation).

•  The residual for each prediction (the difference between each observation's actual score
   on the criterion versus that observation's predicted score).

A later section will discuss the concepts of predicted scores and residuals in greater detail.
For a complete discussion of other options available with PROC REG, see the chapter on the
REG procedure in the SAS/STAT User's Guide (1999d).
The program. Below are the SAS statements that request that PROC REG be performed,
specifying KG_LOST as the criterion variable and MOTIVAT as the predictor variable.
25     .
26     .
27     .
28     20 8.20 30 2 1200 110
29     21 7.80 35 4 1600 130
30     22 9.00 35 6 1600 120
31     ;
32     PROC REG DATA=D1;
33        MODEL KG_LOST = MOTIVAT / STB P;
34        TITLE1 'JANE DOE';
35        RUN;

Some notes about the preceding program:

To conserve space, the preceding shows just the last few data lines from the DATA step, on
lines 28-30. This DATA step was presented in full in the preceding section titled "The
DATA Step for the SAS Program."

The PROC REG statement appears on line 32. The DATA option for this statement requests
that the analysis be performed on the data set named D1.

The MODEL statement appears on line 33. It requests that KG_LOST serve as the criterion
variable (Y variable) in the analysis, and that MOTIVAT serve as the predictor variable (X
variable).

Lines 34-35 present the TITLE1 and RUN statements for the program.


Results from the SAS Output


The preceding program produces two pages of output. Page 1 contains the analysis of variance
table and the parameter estimates section. Page 2 contains the table of predicted values.
Output 11.1 presents Page 1 from this output.
                                   JANE DOE
                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: KG_LOST

                             Analysis of Variance

                                    Sum of          Mean
Source               DF            Squares        Square    F Value    Pr > F
Model                 1           85.16485      85.16485      72.44    <.0001
Error                20           23.51188       1.17559
Corrected Total      21          108.67673

          Root MSE             1.08425    R-Square    0.7837
          Dependent Mean       4.98591    Adj R-Sq    0.7728
          Coeff Var           21.74625

                             Parameter Estimates

                    Parameter      Standard                        Standardized
Variable     DF      Estimate         Error    t Value  Pr > |t|       Estimate
Intercept     1       0.50944       0.57450       0.89    0.3857              0
MOTIVAT       1       0.22382       0.02630       8.51    <.0001        0.88524

Output 11.1. Results of PROC REG in which kilograms of weight lost (KG_LOST)
serves as the criterion variable. Motivation to lose weight (MOTIVAT)
serves as the predictor variable.

Steps in Interpreting the Output


Step 1: Make sure that everything looks right. In some instances, you might perform a
number of regression analyses in which you use different variables as predictor and criterion
variables. In sorting through the output from these analyses, you must be able to quickly
identify the criterion variable and the predictor variable on which the current analysis is based.
To do this, first check to the right of the heading "Dependent Variable." Here, you will find
the name of the criterion variable in your analysis. In Output 11.1, you can see that the
criterion variable was KG_LOST.
Next, look in the lower left corner of the page, below the heading "Variable." Below this
heading, you will find the word "Intercept," and below "Intercept" will be the name of the
predictor variable in your analysis. In Output 11.1, the predictor variable is MOTIVAT.
Next, use the degrees of freedom printed in the output to verify that the analysis was
performed on the correct number of observations. You will find this information in the upper
half of the output, in the section headed "Analysis of Variance." In this part of the output,
look below the heading "DF" and to the right of the heading "Corrected Total." At this
location, you will find the corrected total degrees of freedom for your analysis. In Output
11.1, the entry at that location is 21. This means that the corrected total degrees of
freedom for your analysis is 21.
The corrected total degrees of freedom for this type of analysis is equal to

     N - 1

where N is equal to the total number of observations (in this case, the number of subjects who
contributed valid data for the analysis). In the weight loss study, the total number of
observations was 22, so the corrected total degrees of freedom would be

     N - 1 = 22 - 1 = 21

This value, computed manually, matches the corrected total degrees of freedom computed by
SAS, which is what you would expect to see. If the corrected total degrees of freedom in
Output 11.1 had been, for example, 12, it might have meant that there was an error in either
the DATA step or the PROC step. In fact, however, there is no evidence of problems, and so
you can proceed.
Step 2: Review the intercept and nonstandardized regression coefficient. One of your
main objectives in conducting this analysis is to develop a linear regression equation that can
be used to predict KG_LOST from MOTIVAT. The general form for this linear regression
equation is as follows:
     Y′ = b(X) + a

where:

Y′   represents a given subject's predicted score on the Y variable (the criterion
     variable).

b    represents the regression coefficient, or the slope of the regression line. This
     regression coefficient represents the average amount of change in Y that is
     associated with a one-unit change in X.

X    represents a particular subject's actual score on the X variable (the predictor
     variable).

a    represents the Y-intercept, also called the intercept constant. This is the value of Y
     at the location where the regression line crosses the Y axis (assuming that both the X
     axis and Y axis begin at zero).

To construct the regression equation for a particular data set, the only items that you need
from the preceding output are b (the regression coefficient, or slope), and a (the Y-intercept).
Both of these statistics are computed by PROC REG, and appear in the section of output titled
Parameter Estimates. This was the lower section appearing in Output 11.1. For convenience,
this section is reproduced again as Output 11.2.


                             Parameter Estimates

                    Parameter      Standard                        Standardized
Variable     DF      Estimate         Error    t Value  Pr > |t|       Estimate
Intercept     1       0.50944       0.57450       0.89    0.3857              0
MOTIVAT       1       0.22382       0.02630       8.51    <.0001        0.88524

Output 11.2. Parameter estimates section of the output of PROC REG in which
KG_LOST was regressed on MOTIVAT.

In Output 11.2, the first column is headed Variable. Below this heading are the entries
Intercept and MOTIVAT.
The third column is headed Parameter Estimate.
Where the row headed "Intercept" intersects with the column headed "Parameter Estimate,"
you will find the Y-intercept (or intercept constant) for your regression equation. In Output
11.2, this Y-intercept is 0.50944, which rounds to .509. At this point, you can insert your
Y-intercept into the regression equation, as shown here:

     Y′ = b(X) + a
     Y′ = b(X) + .509

In Output 11.2, where the row headed "MOTIVAT" intersects with the column headed
"Parameter Estimate," you will find the regression coefficient (or slope) for your regression
equation. In Output 11.2, this slope is .22382, which rounds to .224. You can now insert this
slope into the regression equation, as shown here:

     Y′ = b(X) + .509
     Y′ = .224(X) + .509
The slope for this equation indicates that, for every one-unit increase in scores on MOTIVAT,
there is an average increase of .224 units on KG_LOST. In other words, for every increase of
1 point on the motivation to lose weight scale, there is an increase of about .22 kilograms of
weight actually lost.
A later section will show how you can use this regression equation to predict how much
weight a subject is likely to lose, given his score on the motivation scale.
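As a preview, consider a subject with a motivation score of 20. Inserting that score into the
equation gives Y′ = .224(20) + .509 = 4.99, so the equation predicts that he will lose about
4.99 kilograms. (The table of predicted values presented later in this chapter reports 4.9859
for such subjects; the small difference results from rounding the coefficients to three decimal
places here.)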
The type of regression coefficient that has been discussed in this section is a nonstandardized
regression coefficient. A nonstandardized regression coefficient is obtained when the X and Y
variables have not been standardized to have equal variances. When you use the MODEL
statement options recommended here, PROC REG will also print a standardized regression
coefficient. The interpretation of a standardized coefficient will be discussed in a later section.
Step 3: Review the significance test for the nonstandardized regression coefficient. In
most cases, you will be interested in interpreting the regression coefficient (slope) of the


regression equation only if it is significantly different from zero. Fortunately, PROC REG
provides a t statistic to test the null hypothesis that the slope is equal to zero in the population.
The t statistic for this test appears in Output 11.2 in the column headed t Value.
Where the column with this heading intersects with the row headed "MOTIVAT," you can see
a t statistic of 8.51. The degrees of freedom for this t test are equal to

     N - 2

where N is equal to the number of valid observations in the analysis. In the present analysis,
N = 22, so

     (N - 2) = (22 - 2) = 20

Therefore, the present t test is based on 20 degrees of freedom.
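As a check on your reading of the output, note that this t statistic is simply the parameter
estimate divided by its standard error: .22382 / .02630 = 8.51, within rounding.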
The probability, or p value for this t test appears in Output 11.2 under the heading Pr > |t|.
Where the column with this heading intersects with the row headed MOTIVAT, you can see
the entry <.0001. This means that the p value associated with this t statistic is less than
.0001. Remember that in this book a statistic is significant if its p value is less than .05.
Clearly, the p value for MOTIVAT is statistically significant. This p value indicates that, if
the regression coefficient for the relationship between KG_LOST and MOTIVAT were
actually equal to zero in the population, there is less than 1 chance in 10,000 that you would
obtain a regression coefficient equal to .224 (or larger) in a sample of this size. Because this
probability is so low, you will reject the null hypothesis, and tentatively conclude that the
regression coefficient is probably larger than zero in the population.
It is worth mentioning that Output 11.2 also presents a t statistic that tests the null hypothesis
that the intercept is equal to zero in the population. This t statistic appears where the row
headed Intercept intersects with the column headed t Value. However, it is rare that a
researcher is interested in testing the null hypothesis that the intercept is equal to zero in the
population, so in most cases you should disregard this section of the output.
Step 4: Review the standardized regression coefficient. Earlier, it was stated that a
standardized regression coefficient is an estimate of what the regression coefficient would be
if both variables were standardized to have a mean of zero and a standard deviation of 1.
Output 11.2 also provides the standardized regression coefficient for the current analysis. For
convenience, this section is again reproduced next as Output 11.3.


                             Parameter Estimates

                    Parameter      Standard                        Standardized
Variable     DF      Estimate         Error    t Value  Pr > |t|       Estimate
Intercept     1       0.50944       0.57450       0.89    0.3857              0
MOTIVAT       1       0.22382       0.02630       8.51    <.0001        0.88524

Output 11.3. Standardized regression coefficient obtained when KG_LOST was
regressed on MOTIVAT.

The standardized regression coefficient appears in Output 11.3 below the heading
Standardized Estimate.
Where the row headed MOTIVAT intersects with the column headed Standardized
Estimate, you can see that the standardized regression coefficient for this analysis is .88524,
which rounds to .89. This number should sound familiar to you because it is equal to the
Pearson correlation between MOTIVAT and KG_LOST that was reported in the previous
chapter. And this is no coincidence: when a regression equation contains just one criterion
variable and just one predictor variable, the standardized regression coefficient will always be
equal to the Pearson correlation between the two variables.
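If you want to verify the equivalence by hand, a standard identity for bivariate regression (not
shown in the output) is that the standardized coefficient equals the nonstandardized coefficient
rescaled by the standard deviations of the two variables: standardized b = b × (sX / sY), where
sX and sY are the sample standard deviations of the predictor and criterion variables,
respectively.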
Step 5: Review the coefficient of determination. The coefficient of determination (denoted
as r2, or "R-Square" in the output) refers to the proportion of variance in the criterion variable
that is accounted for by variability in the predictor variable. This coefficient may range from
.00 to 1.00, with higher values indicating that a higher proportion of variance is accounted for.
When the coefficient of determination for two variables is high, it means that there is a strong
relationship between the two variables.
When a regression is performed with only two variables (as is the case in this chapter), the
coefficient of determination is equal to r2, the square of the Pearson correlation between the
variables. This means that, after you compute the Pearson correlation between two variables,
you can calculate the coefficient of determination by squaring the correlation coefficient.
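For example, the Pearson correlation between MOTIVAT and KG_LOST reported in the
previous chapter was .88524. Squaring it gives .88524 × .88524 = .7837, which is exactly the
R-Square value that appears in the output reproduced below.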
The coefficient of determination also appears in the output of PROC REG, in that section of
output headed Analysis of Variance. For convenience, this section of output is reproduced
again as Output 11.4.


                             Analysis of Variance

                                    Sum of          Mean
Source               DF            Squares        Square    F Value    Pr > F
Model                 1           85.16485      85.16485      72.44    <.0001
Error                20           23.51188       1.17559
Corrected Total      21          108.67673

          Root MSE             1.08425    R-Square    0.7837
          Dependent Mean       4.98591    Adj R-Sq    0.7728
          Coeff Var           21.74625

Output 11.4. Coefficient of determination (r2) obtained when KG_LOST was
regressed on MOTIVAT.

The coefficient of determination for this analysis appears in Output 11.4 to the right of the
heading R-Square. In this case, you can see that the coefficient of determination is .7837,
which rounds to .78. This coefficient indicates that approximately 78% of the variance in
KG_LOST is accounted for by variability in MOTIVAT. This is a very large percentage of
variance (however, you should remember that the results presented here are fictitious, and it is
unlikely that such a large percentage of variability in weight loss would be accounted for by
motivation in a real study).
Drawing a Regression Line through the Scattergram
Overview. In Chapter 10, "Bivariate Correlation," you learned how to use the PLOT
procedure to create a scattergram that plots a criterion variable against a predictor variable.
Once you have the output generated by PROC REG, it is a relatively simple matter to draw a
best-fitting regression line through the center of the scattergram that was created by PROC
PLOT. For students of elementary statistics, this is a useful exercise for understanding the
meaning of the regression equation generated by PROC REG.
Here is a short overview of how it will be done (the remainder of this section provides more
detailed, step-by-step instructions): First, you will print out a copy of the scattergram that
plots your Y variable (criterion variable) against your X variable (predictor variable). Next,
you will select a low value on the X variable, insert this X value into your regression equation,
and compute the Y′ (predicted Y) value that is associated with that low value of X. You will
place a dot on your scattergram that represents the location of this predicted value.
You will then start with a high value on the X variable, insert this X value into your regression
equation, and compute the Y′ value that is associated with that high value of X. You will
place a dot on your scattergram that represents the location of this predicted value.
Finally, you will draw a straight line through the scattergram connecting the two dots that you
placed there. This line represents the regression line for your scattergram: it will be a
best-fitting line that goes through the center of the scattergram. It will represent the predicted
value of Y that is associated with every possible value of X.
The remainder of this section provides step-by-step instructions for drawing this best-fitting
line.


Step 1: Printing the output from PROC PLOT. Here are the PROC PLOT statements that
will create a scattergram plotting KG_LOST against MOTIVAT:
PROC PLOT DATA=D1;
PLOT KG_LOST*MOTIVAT;
TITLE1 'JANE DOE';
RUN;
When the weight loss data from Table 11.1 are analyzed, these statements produce the
scattergram presented in Output 11.5.
[Figure: line-printer scattergram titled "JANE DOE." Plot of KG_LOST*MOTIVAT, with
KG_LOST on the vertical axis (1 to 9) and MOTIVAT on the horizontal axis (5 to 35).
Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.5. Scattergram that plots kilograms lost against motivation to lose
weight.


Step 2: Computing a predicted value of Y that is associated with a low value of X.
Review the X axis (the axis for MOTIVAT, in this case), and find a point on the X axis that
represents a relatively low score on the X variable. From Output 11.5, you can see that a
relatively low score on MOTIVAT would be a score of 10.
Next, insert this low X score into your regression equation. Here is the regression equation for
the relationship between KG_LOST and MOTIVAT, as reported earlier in this chapter:

     Y′ = .224(X) + .509

Inserting a score of 10 into this equation gives you the following:

     Y′ = .224(10) + .509
     Y′ = 2.24 + .509
     Y′ = 2.749

So Y′ (the predicted value of Y) is equal to 2.749, which rounds to 2.75. This means that, if a
subject's score on X (MOTIVAT) is 10, your regression equation predicts that his score on Y
(KG_LOST) would be 2.75.
Step 3: Marking the location on the scattergram that corresponds to this low Y′ value.
On your scattergram, find the point on the X axis that corresponds to a score of 10, and
imagine an invisible line going straight up from this point. At the same time, find the point on
the Y axis that corresponds to a score of 2.75 (Y′), and imagine an invisible line going straight
to the right from this point. The point at which your two imaginary lines intersect represents
Y′ (the predicted value of Y that is associated with an X score of 10). Place a dot at that point.
This step is illustrated in Output 11.6. The vertical dotted line goes up from the low score of
10 on the X axis. The horizontal dotted line goes to the right from the predicted value of 2.75
on the Y axis. A dot has been drawn at the point where the two lines meet.


[Figure: the same scattergram of KG_LOST against MOTIVAT, with a vertical dotted line
rising from the score of 10 on the X axis and a horizontal dotted line extending from 2.75 on
the Y axis; a dot marks the point where the two lines meet. Legend: A = 1 obs, B = 2 obs,
etc.]

Output 11.6. Scattergram that plots kilograms lost against motivation to lose
weight. (This scattergram identifies the Y′ value that is associated with
a low value of X.)

Step 4: Computing a predicted value of Y that is associated with a high value of X.
Review the X axis (the axis for MOTIVAT, in this case), and find a point on the X axis that
represents a relatively high score on the X variable. From Output 11.6, you can see that a
relatively high score on MOTIVAT would be a score of 30.
Next, insert this high X score into your regression equation. Inserting a score of 30 into this
equation provides the following:

     Y′ = .224(X) + .509
     Y′ = .224(30) + .509
     Y′ = 6.72 + .509
     Y′ = 7.229

So Y′ (the predicted value of Y) is equal to 7.229, which rounds to 7.23. This means that, if a
subject's score on X (MOTIVAT) was 30, your regression equation predicts that his score
would be 7.23 on Y (KG_LOST).
Step 5: Marking the location on the scattergram that corresponds to this high Y′ value.
On your scattergram, find the point on the X axis that corresponds to a score of 30, and
imagine an invisible line going straight up from this point. At the same time, find the point on
the Y axis that corresponds to a score of 7.23 (Y′), and imagine an invisible line going straight
to the right from this point. The point at which your two imaginary lines intersect represents
Y′ (the predicted value of Y that is associated with an X score of 30). Place a dot at that point.
This step is illustrated in Output 11.7. There, the vertical dotted line goes up from the high
score of 30 on the X axis. The horizontal dotted line goes to the right from the predicted value
of 7.23 on the Y axis. A dot has been drawn at the point where the two lines meet.


[Figure: the same scattergram of KG_LOST against MOTIVAT, with a vertical dotted line
rising from the score of 30 on the X axis and a horizontal dotted line extending from 7.23 on
the Y axis; a dot marks the point where the two lines meet. Legend: A = 1 obs, B = 2 obs,
etc.]

Output 11.7. Scattergram that plots kilograms lost against motivation to lose
weight. (This scattergram identifies the Y′ value that is associated
with a high value of X.)

Step 6: Draw a regression line that connects the two dots. The final step is simple: draw a
straight line that connects the two dots that you have just made. Be sure that your line does not
extend beyond the range of your X variable. That is, be sure that your line does not extend any
lower than your lowest observed score on the X variable, or any higher than your highest
observed score on the X variable. In the present case, this means that your regression line
should not extend to the left of a score of 5 on the X axis (the MOTIVAT axis), and it should
not extend to the right of a score of 35 on the X axis. A section appearing later in this chapter
(titled "Predicting Y within the Range of X") will discuss this issue concerning the range of
X in greater detail.
The line that you have now drawn on your output page is your regression line: it is the
best-fitting line that goes through the center of your scattergram. Points on this line represent
predicted values of Y that are associated with the values of X that appear directly below them
on the X axis.
Output 11.8 provides the final version of the scattergram with the regression line drawn
through it.

[Figure: scattergram of KG_LOST against MOTIVAT with the best-fitting regression line
drawn through the center of the points, extending from X = 5 to X = 35. Legend: A = 1 obs,
B = 2 obs, etc.]

Output 11.8. Scattergram that plots kilograms lost against motivation to lose
weight (with regression line).


Reviewing the Table of Predicted Values


The preceding section showed you how to draw a best-fitting regression line through the
center of your scattergram. You can use this regression line to identify the predicted value of
Y that is associated with any value of X. You can also use the regression line to estimate
residuals of prediction (some textbooks refer to these as the errors of prediction). A
residual of prediction is simply the difference between the predicted value of Y for a
particular subject versus the actual value of Y for that subject.
Fortunately, there is an easier way to do all of this that does not require you to work with a
regression line on a scattergram. PROC REG will print predicted values and residuals of
prediction when you include the keyword P in the MODEL statement. You may remember
that, in the SAS program presented earlier in this chapter, the MODEL statement did in fact
include the keyword P. The table of predicted values and residuals produced by this option
appears here as Output 11.9.
                                   JANE DOE
                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: KG_LOST

                               Output Statistics

                          Dep Var      Predicted
                 Obs      KG_LOST          Value      Residual
                   1       2.6000         1.6286        0.9714
                   2       1.0000         1.6286       -0.6286
                   3       1.8000         2.7477       -0.9477
                   4       2.6500         2.7477       -0.0977
                   5       3.7000         2.7477        0.9523
                   6       2.2500         3.8668       -1.6168
                   7       3.0000         3.8668       -0.8668
                   8       4.4000         3.8668        0.5332
                   9       5.3500         3.8668        1.4832
                  10       3.2500         4.9859       -1.7359
                  11       4.3500         4.9859       -0.6359
                  12       5.6000         4.9859        0.6141
                  13       6.4400         4.9859        1.4541
                  14       4.8000         6.1050       -1.3050
                  15       5.7500         6.1050       -0.3550
                  16       6.9000         6.1050        0.7950
                  17       7.7500         6.1050        1.6450
                  18       5.9000         7.2241       -1.3241
                  19       7.2000         7.2241       -0.0241
                  20       8.2000         7.2241        0.9759
                  21       7.8000         8.3433       -0.5433
                  22       9.0000         8.3433        0.6567

                 Sum of Residuals                            0
                 Sum of Squared Residuals             23.51188
                 Predicted Residual SS (PRESS)        27.64729

Output 11.9. Table of predicted values and residuals for the PROC REG analysis
in which KG_LOST was regressed on MOTIVAT.


The table in Output 11.9 consists of four columns.


The observation column. The first column is headed Obs and provides observation
numbers for each of your subjects. Reading down this column, you can see that there are 22
observations; that is, there were 22 subjects in the data set that you analyzed (from Table
11.1). If you compare Output 11.9 to Table 11.1 (presented at the beginning of this chapter),
you can see that observation #1 is John, observation #2 is George, observation #3 is Fred, and
so on.
The criterion variable column. The second column in Output 11.9 is headed Dep Var
KG_LOST, which stands for Dependent Variable: KG_LOST. This column presents each
subjects actual score on KG_LOST, the criterion variable in the study. If you compare this
column to the column headed Kilograms Lost from Table 11.1, you will find that they are
identical.
The predicted value column. The third column in Output 11.9 is headed "Predicted Value."
This column presents each subject's Y′ score: his predicted score on the Y variable
(KG_LOST), given his score on the X variable (MOTIVAT). These predicted scores are based
on the regression equation presented earlier in this chapter:

     Y′ = .224(X) + .509

This column is useful because it eliminates your having to insert X values into the regression
equation and compute Y′ values by hand. For example, assume that you want to know what
predicted value of Y is associated with an X value of 10. If you review Table 11.1 from the
beginning of this chapter, you can see that the third subject in the table (Fred) has a score of
10 on the X variable (that is, a score of 10 on MOTIVAT). In Output 11.9, Fred appears as
observation #3. Output 11.9 shows that the predicted Y′ value for observation #3 is 2.7477,
which rounds to 2.75. In other words, PROC REG predicts that subjects with an X score of 10
are likely to have a Y score of 2.75. This number (2.75) is the same value that we computed
previously by inserting an X score of 10 in the regression equation. In short, the results
presented in Output 11.9 are useful because they can save you from having to do a large
number of manual calculations when you need to compute Y′ values.
The residual column. Finally, the fourth column in Output 11.9 is headed "Residual." This
column presents the residuals of prediction (also called the errors of prediction). A residual
is computed by subtracting from a subject's actual Y score the predicted Y score (Y′) that was
generated by the regression equation.
For example, consider observation #1 from Output 11.9. For observation #1, the actual score
on the Y variable was 2.6000, and the predicted score on Y was 1.6286. Subtracting the latter
from the former produces the following:

     2.6000 - 1.6286 = 0.9714

So the residual of prediction is equal to .9714. You can see that this is exactly the value that
appears in the "Residual" column for observation #1 in Output 11.9. Performing this
subtraction for each of the remaining observations in Output 11.9 shows that their residual
scores were computed in the same way.
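If you would like to reproduce these columns yourself, a short DATA step can apply the
regression equation to every observation. This is only a sketch: the names PRED, Y_HAT,
and RESID are chosen for this illustration, and the computed values will differ from Output
11.9 in the later decimal places because the coefficients are rounded here to three digits:

   DATA PRED;
      SET D1;                          /* the weight loss data set from this chapter */
      Y_HAT = .224 * MOTIVAT + .509;   /* predicted value from the regression equation */
      RESID = KG_LOST - Y_HAT;         /* residual = actual score minus predicted score */
      RUN;

   PROC PRINT DATA=PRED;
      VAR SUB_NUM KG_LOST Y_HAT RESID;
      RUN;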
Predicting Y within the Range of X
When you perform a regression analysis, it is important to always predict Y only within the
range of X. For example, suppose that you are using this regression formula (presented
earlier) to compute predicted scores on Y:

     Y′ = .224(X) + .509

You have already learned that, to do this, you insert a value of X into the formula, and solve
for Y′. However, in doing this, it is important that

•  you do not insert a value of X that is lower than the lowest observed value of X in your
   sample, and

•  you do not insert a value of X that is higher than the highest observed value of X in your
   sample.

For example, the preceding regression equation was based on an analysis in which the X
variable was MOTIVAT (motivation), and the Y variable was KG_LOST (kilograms of
weight lost). Table 11.1 provided the data set for this analysis. That table shows that the
lowest observed score on the X variable (motivation) was 5, and the highest observed score
was 35. This means that when you are predicting scores on Y, you should not use an X score
lower than 5, or higher than 35.
What is the reason for this? Remember that the preceding regression equation was based on a
specific sample of data. In this sample, the lowest X score was 5, and the highest was 35. You
cannot be sure that you would have obtained the same regression equation (i.e., the same
intercept and regression coefficient) if you had analyzed a sample with a greater range on the
X variable (i.e., a sample in which the lowest X score was below 5 and the highest was above
35). That is why you should only predict Y within the range of X that was observed in your
sample.
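One way to honor this rule in a program is to compute predictions only for X values that fall
inside the observed range, as in the following sketch (the data set name NEWPRED, the
variable PRED_KG, and the three sample motivation scores are hypothetical):

   DATA NEWPRED;
      INPUT MOTIVAT @@;
      IF 5 <= MOTIVAT <= 35 THEN PRED_KG = .224 * MOTIVAT + .509;
      ELSE PRED_KG = .;     /* outside the observed range of X: no prediction */
   DATALINES;
   10 30 40
   ;
   PROC PRINT DATA=NEWPRED;
      RUN;

Here the score of 40 lies above the highest observed MOTIVAT score of 35, so it receives a
missing value rather than a prediction.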
Summarizing the Results of the Analysis
Overview. When you perform bivariate regression, it is possible to compute a regression
coefficient (b) that represents the nature of the relationship between the predictor variable and
the criterion variable. You have also learned that PROC REG produces a t statistic for this
regression coefficient. This t statistic tests the null hypothesis that your sample was drawn
from a population in which this regression coefficient is equal to zero.
This section shows you how to prepare an analysis report that is appropriate for this null
hypothesis test. You will see that this analysis report is somewhat similar to the one used for
the Pearson correlation coefficient (presented in the last chapter). The following report


summarizes the results of the analysis in which the predictor variable was the motivation to
lose weight, and the criterion variable was the number of kilograms lost.
A) Statement of the research question: The purpose of this
study was to determine whether the regression coefficient
representing the relationship between the motivation to lose
weight and the amount of weight actually lost over a 10-week
period is significantly different from zero.

B) Statement of the research hypothesis: There will be a
positive relationship between the motivation to lose weight and
the amount of weight that is actually lost over a 10-week
period.

C) Nature of the variables: This analysis involved one
predictor variable and one criterion variable.

   The predictor variable was the motivation to lose weight.
   This was a multi-value variable and was assessed on an
   interval scale.

   The criterion variable was the number of kilograms of weight
   lost during the 10-week study. This was a multi-value
   variable and was assessed on a ratio scale.

D) Statistical procedure: Linear bivariate regression.

E) Statistical null hypothesis (H0): b = 0; In the population,
the regression coefficient representing the relationship
between the motivation to lose weight and the number of
kilograms of weight lost is equal to zero.

F) Statistical alternative hypothesis (H1): b ≠ 0; In the
population, the regression coefficient representing the
relationship between the motivation to lose weight and the
number of kilograms of weight lost is not equal to zero.

G) Obtained statistic: b = .224, t(20) = 8.51

H) Obtained probability (p) value: p < .0001

I) Conclusion regarding the statistical null hypothesis: Reject
the null hypothesis.

J) Conclusion regarding the research hypothesis: These
findings provide support for the study's research hypothesis.

K) Coefficient of determination: .78.

L) Formal description of the results for a paper: Results were
analyzed by using linear regression to regress kilograms of
weight lost on the motivation to lose weight. This analysis
revealed a significant regression coefficient, b = .224, t(20)
= 8.51, p < .0001. The nature of the regression coefficient
showed that, on the average, an increase of .224 kilograms of
weight loss was associated with every 1-unit increase in the
motivation to lose weight. The analysis showed that motivation
accounted for 78% of the variance in weight loss.

M) Figure representing the results: Output 11.10 is a
scattergram showing the relationship between the motivation to
lose weight and kilograms of weight actually lost.

[Figure: scattergram of KG_LOST against MOTIVAT with the regression line drawn through
it, as in Output 11.8. Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.10. Scattergram representing the relationship between the
motivation to lose weight and kilograms of weight actually lost.


Notes about the Preceding Summary


Degrees of freedom for the t test. Item G in the preceding summary provides the
nonstandardized regression coefficient for the analysis, along with the t statistic and degrees
of freedom associated with that regression coefficient. This section of the summary is again
reproduced here:

     G) Obtained statistic: b = .224, t(20) = 8.51

The preceding shows that the t statistic for the analysis was 8.51. The (20) that appears next to
the t statistic represents the degrees of freedom associated with that test.
An earlier section of this chapter titled "Steps in Interpreting the Output" discussed where on
the output you can find the nonstandardized regression coefficient and the t statistic associated
with that coefficient. However, the output does not provide the degrees of freedom for this
test; the degrees of freedom must be computed manually.
As explained earlier, the formula for computing the degrees of freedom for this t test is

     N - 2

where N is equal to the number of pairs of scores. When individual human subjects are your
unit of observation (as is the present case), then N is equal to the number of subjects that are
included in your analysis.
The present analysis was based on 22 subjects, so the degrees of freedom are calculated as

     (N - 2) = (22 - 2) = 20
Presenting the p value. Item H in the preceding report presents the obtained probability value
associated with the t statistic:

     H) Obtained probability (p) value: p < .0001

You can see that this item used the less-than sign (<) to indicate that the obtained p value
was less than .0001. This item uses the less-than sign because the less-than sign actually
appeared in the SAS output (see Output 11.2, presented previously). There, the p value was
presented as "<.0001."
If the less-than sign had not appeared with the p value in the PROC REG output, then you
would have instead used the equal sign (=) when reporting your p value in the analysis
report. For example, assume that, in the PROC REG output, the p value was reported as
".0367" (without the "<" sign). If this had been the case, you would have reported the p value
in your analysis report as follows:

     H) Obtained probability (p) value: p = .0367


Regression line in the scattergram. Output 11.10 shows the scattergram that was created
when PROC PLOT was used to plot KG_LOST against MOTIVAT. The regression line
drawn through the center of the scattergram was created by following the steps described in
the preceding section titled Drawing a Regression Line through the Scattergram.

Using PROC REG: Example with a Significant Negative Regression Coefficient
Overview
When there is a negative relationship between variable X and variable Y, it means that

•  high scores on X are associated with low scores on Y

•  low scores on X are associated with high scores on Y.

Among the variables in the weight loss study, it seems likely that the average number of
calories consumed each day should demonstrate a negative correlation with weight loss. It
only makes sense that (a) people with high scores for calorie consumption are going to have
low scores on weight loss (lose little weight), and (b) people with low scores for calorie
consumption are going to have high scores on weight loss (lose more weight).
To illustrate the nature of a negative correlation, the following sections show how to use the
results of PROC CORR, PROC PLOT, and PROC REG to explore the relationship between
calorie consumption and kilograms lost.
Correlation between Kilograms Lost and Calorie Consumption
Chapter 10, "Bivariate Correlation," showed you how to use PROC CORR to compute the
Pearson correlation between a number of variables. Output 11.11 reproduces a correlation
matrix that was first presented in Chapter 10. This matrix includes every possible correlation
between the five variables measured in the weight loss study (remember that these results are
fictitious):


                      Pearson Correlation Coefficients
                        Prob > |r| under H0: Rho=0
                          Number of Observations

            KG_LOST     MOTIVAT    EXERCISE    CALORIES          IQ

KG_LOST     1.00000     0.88524     0.53736    -0.55439     0.02361
                         <.0001      0.0120      0.0074      0.9169
                 22          22          21          22          22

MOTIVAT     0.88524     1.00000     0.47845    -0.54984     0.10294
             <.0001                  0.0282      0.0080      0.6485
                 22          22          21          22          22

EXERCISE    0.53736     0.47845     1.00000    -0.22594     0.31201
             0.0120      0.0282                  0.3247      0.1685
                 21          21          21          21          21

CALORIES   -0.55439    -0.54984    -0.22594     1.00000     0.19319
             0.0074      0.0080      0.3247                  0.3890
                 22          22          21          22          22

IQ          0.02361     0.10294     0.31201     0.19319     1.00000
             0.9169      0.6485      0.1685      0.3890
                 22          22          21          22          22

Output 11.11. All possible correlations between the five variables assessed in
the weight loss study.
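For reference, a matrix such as this one could be produced with a single PROC CORR step
along the following lines (this is only a sketch; Chapter 10 presents the full program):

   PROC CORR DATA=D1;
      VAR KG_LOST MOTIVAT EXERCISE CALORIES IQ;
      TITLE1 'JANE DOE';
      RUN;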

In Output 11.11, find the cell that appears at the point where the row headed "KG_LOST"
intersects with the column headed "CALORIES." There, you can see that the correlation
between weight loss and calorie consumption is equal to -.55439, which rounds to -.55. The
sign of this coefficient indicates that the relationship between the two variables is negative.
The p value of .0074 is less than the standard criterion of .05, which means that the coefficient
is significantly different from zero.
Because you know that the relationship is negative, you can use PROC PLOT to create a
scattergram that will reveal the general shape of the bivariate distribution.
Using PROC PLOT to Create a Scattergram
Following are the SAS statements that will cause PROC PLOT to create a scattergram for the
variables KG_LOST and CALORIES:
PROC PLOT DATA=D1;
PLOT KG_LOST*CALORIES;
TITLE1 'JANE DOE';
RUN;


Output 11.12 presents the results generated by the preceding statements.


[Figure: line-printer scattergram titled "JANE DOE." Plot of KG_LOST*CALORIES, with
KG_LOST on the vertical axis (1 to 9) and CALORIES on the horizontal axis (1200 to 2400);
the points show a loose negative trend. Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.12. Scattergram plotting kilograms lost against calorie consumption.

The scattergram in Output 11.12 demonstrates two notable features. First, you can see that
there is some tendency for subjects with low scores on CALORIES to have relatively high
scores on KG_LOST, and for subjects with high scores on CALORIES to have relatively low
scores on KG_LOST. This, of course, is the defining characteristic of a negative relationship.
Second, you can see that, although there is something of a negative trend in the data, it is not
an extremely strong trend. This can be seen in the general shape of the scattergram: It forms
an ellipse, but it is not a particularly narrow ellipse. This reflects the fact that the correlation
between CALORIES and KG_LOST is not an extremely strong relationship; the correlation is
only -.55.
For purposes of contrast, turn back to Output 11.10, which presented a scattergram plotting
KG_LOST against MOTIVAT. The ellipse in that figure was much narrower, with scores
clustering around the (imaginary) regression line much more tightly. This was because the
correlation between KG_LOST and MOTIVAT was stronger at .89.
Using PROC REG to Perform the Regression Analysis
Program and output. Following are the statements that will cause PROC REG to perform a
regression analysis with KG_LOST as the criterion variable and CALORIES as the predictor
variable:
PROC REG DATA=D1;
   MODEL KG_LOST = CALORIES / STB P;
   RUN;

The preceding statements produced two pages of output. The first page includes the
analysis of variance table and parameter estimates tables. These results are presented in
Output 11.13.

                                   JANE DOE
                               The REG Procedure
                                 Model: MODEL1
                          Dependent Variable: KG_LOST

                             Analysis of Variance

                                    Sum of          Mean
Source               DF            Squares        Square    F Value    Pr > F
Model                 1           33.40216      33.40216       8.87    0.0074
Error                20           75.27458       3.76373
Corrected Total      21          108.67673

          Root MSE             1.94003    R-Square    0.3074
          Dependent Mean       4.98591    Adj R-Sq    0.2727
          Coeff Var           38.91032

                             Parameter Estimates

                    Parameter      Standard                        Standardized
Variable     DF      Estimate         Error    t Value  Pr > |t|       Estimate
Intercept     1      11.26348       2.14745       5.25    <.0001              0
CALORIES      1      -0.00354       0.00119      -2.98    0.0074       -0.55439

Output 11.13. Results of PROC REG with weight lost as the criterion variable,
and calorie consumption as the predictor variable (Analysis of Variance and
Parameter Estimates tables).

Variable names. When you review the output of PROC REG, you should first verify that you
are looking at the output for the correct criterion variable and predictor variable. The name of
the criterion variable appears toward the top of the page, to the right of the heading
"Dependent Variable." In Output 11.13, the criterion variable is KG_LOST. The name of
the predictor variable appears in the "Parameter Estimates" section, below the heading
"Variable." For the current analysis, you can see that the predictor variable is CALORIES.
Slope and intercept. To construct the regression equation for this analysis, you will need the
nonstandardized regression coefficient (slope) and the intercept. These statistics appear in the
"Parameter Estimates" section below the heading "Parameter Estimate." Output 11.13 shows
that, for the current analysis, the regression coefficient is -0.00354, which rounds to
-.0035, and that the intercept is 11.26348, which rounds to 11.26. The regression equation
for this analysis therefore takes the following form:
Y = b (X) + a
Y = -.0035 (X) + 11.26
The size and sign of the slope in this regression equation show that, on the average, a
decrease of .0035 kilograms of weight loss was associated with every 1-unit increase in
calorie consumption. The regression coefficient was negative, and not positive, because
people who consumed more calories tended to lose less weight.
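To see how the equation can be used for prediction, consider a hypothetical subject who consumes 2,000 calories per day (a value chosen purely for illustration):

Y = -.0035 (2000) + 11.26
Y = -7.00 + 11.26
Y = 4.26

The equation predicts that such a subject would lose about 4.26 kilograms over the 10-week period.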
Significance test. As mentioned previously, PROC REG provides a t test for the null
hypothesis that your sample was drawn from a population in which the slope is equal to zero.
This appears in Output 11.13 below the heading "t Value." You can see that the t statistic in
this case is equal to -2.98. The probability value for this t test is .0074. Because the
probability value is less than the standard criterion of .05, you can conclude that the slope is in
fact significantly different from zero.
Standardized regression coefficient. Below the heading "Standardized Estimate" you can
see that the standardized regression coefficient for the analysis is -0.55439, which rounds
to -.55. You might remember that the Pearson correlation between KG_LOST and
CALORIES was also -.55. This was no coincidence, because the standardized regression
coefficient in bivariate regression will always be equal to the Pearson correlation between the
two variables that are being analyzed.
Coefficient of determination. Finally, to the right of the heading "R-Square," you can see
that the coefficient of determination for the analysis is .3074, which rounds to .31. This
value is the square of the Pearson correlation between the two variables. It shows that
approximately 31% of the variance in kilograms lost was accounted for by calorie
consumption.
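As a quick check on this value, you can square the correlation reported earlier: (-.55439) × (-.55439) = .3074, which again rounds to .31.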


Summarizing the Results of the Analysis


The following report summarizes the results of the analysis in which the number of kilograms
lost was correlated with calorie consumption.
A) Statement of the research question: The purpose of this
study was to determine whether the regression coefficient,
representing the relationship between calorie consumption and
the amount of weight lost over a 10-week period, is
significantly different from zero.
B) Statement of the research hypothesis: There will be a
negative relationship between calorie consumption and the
amount of weight that is actually lost over a 10-week period.
C) Nature of the variables: This analysis involved one
predictor variable and one criterion variable.
The predictor variable was calorie consumption. This was a
multi-value variable and was assessed on a ratio scale.
The criterion variable was the number of kilograms of weight
lost during the 10-week study. This was a multi-value
variable and was assessed on a ratio scale.
D) Statistical procedure:

Linear bivariate regression.

E) Statistical null hypothesis (H0): b = 0; In the population,


the regression coefficient, representing the relationship
between calorie consumption and the number of kilograms of
weight lost, is equal to zero.
F) Statistical alternative hypothesis (H1): b ≠ 0; In the
population, the regression coefficient, representing the
relationship between calorie consumption and the number of
kilograms of weight lost, is not equal to zero.
G) Obtained statistic:

b = -.0035, t (20) = -2.98

H) Obtained probability (p) value:

p = .0074

I) Conclusion regarding the statistical null hypothesis: Reject


the null hypothesis.
J) Conclusion regarding the research hypothesis: These
findings provide support for the study's research hypothesis.
K) Coefficient of determination:

.31.

L) Formal description of the results for a paper: Results were


analyzed by using linear regression to regress kilograms of
weight lost on the average number of calories that were
consumed each day. This analysis revealed a significant
regression coefficient, b = -.0035, t(20) = -2.98, p = .0074.


The nature of the regression coefficient showed that, on the
average, a decrease of .0035 kilograms of weight loss was
associated with every 1-unit increase in calories consumed per
day. The analysis showed that calorie consumption accounted for
31% of the variance in weight loss.
M) Figure representing the results: Output 11.14 presents a
scattergram showing the relationship between calorie
consumption and kilograms of weight lost.
[Scattergram: Plot of KG_LOST*CALORIES. Legend: A = 1 obs, B = 2 obs, etc.
Vertical axis: KG_LOST (1 to 9); horizontal axis: CALORIES (1200 to 2400).]

Output 11.14. Scattergram plotting kilograms lost against calorie consumption.


Notes about the Preceding Summary


Degrees of freedom for the t test. Item G in the preceding summary provides the
nonstandardized regression coefficient for the analysis, along with the t statistic and degrees
of freedom that are associated with that regression coefficient. The "20" that appears in
parentheses is the number of degrees of freedom associated with that test.
Again, the formula for computing the degrees of freedom for this t test is:
df = N - 2
The present analysis was based on 22 subjects, so the degrees of freedom are calculated as
df = (N - 2) = (22 - 2) = 20
Presenting the p value. Item H in the preceding report presents the obtained probability value
associated with the t statistic:
H) Obtained probability (p) value:

p = .0074

Notice that this item uses the equal sign (=) to show that the p value is equal to .0074. It
uses the equal sign because the less-than sign (<) did not appear with the p value in the
PROC REG output.
Regression line in the scattergram. Output 11.14 shows the scattergram created when PROC
PLOT was used to plot KG_LOST against CALORIES. The regression line drawn through the
center of the scattergram was created by following the steps described in the preceding section
Drawing a Regression Line through the Scattergram.
Formal description of the results. The last sentence in item L reads in part, "...on the
average, a decrease of .0035 kilograms of weight loss was associated with every 1-unit
increase in calories that were consumed per day" (italics added). This sentence uses the word
"decrease" rather than "increase" because the slope of the regression coefficient (-.0035) was
a negative value rather than a positive value.


Using PROC REG: Example with a Nonsignificant Regression Coefficient
Overview
In this section you will learn to recognize and summarize results in which a regression
analysis yields a nonsignificant regression coefficient.
Correlation between Kilograms Lost and IQ Scores
Output 11.15 presents the Pearson correlations between KG_LOST and the four predictor
variables in the weight loss study (this is detail taken from Output 11.11).
            KG_LOST    MOTIVAT   EXERCISE   CALORIES         IQ
KG_LOST     1.00000    0.88524    0.53736   -0.55439    0.02361
                        <.0001     0.0120     0.0074     0.9169
                 22         22         21         22         22

Output 11.15. Pearson correlations for the relationship between kilograms lost
and the four predictor variables of the weight loss study.

In Output 11.15, find the cell where the row headed "KG_LOST" intersects with the column
headed "IQ." This cell provides information about the correlation between kilograms of
weight lost and the scores on a standard IQ test.
The top figure in the column is the Pearson correlation between KG_LOST and IQ. You can
see that this correlation is .02361, which rounds to .02. This near-zero correlation
suggests that there is virtually no relationship between the two variables. Below the
correlation, the probability value for the correlation is .9169. Because this p value is
greater than .05, you will conclude that this correlation coefficient is not significantly different
from zero. This means that the regression coefficient from the regression analysis (to be
discussed below) will also fail to be significantly different from zero.
Using PROC REG to Perform the Regression Analysis
Output 11.16 presents the first page of output that is generated when PROC REG is used to
regress KG_LOST on IQ. You should review this table and interpret it in the usual way. In
particular, note the nonsignificant t statistic that tests the null hypothesis that the regression
coefficient (slope) is equal to zero in the population.


                                JANE DOE
                            The REG Procedure
                              Model: MODEL1
                      Dependent Variable: KG_LOST

                          Analysis of Variance
                                 Sum of         Mean
Source               DF         Squares       Square    F Value    Pr > F
Model                 1         0.06060      0.06060       0.01    0.9169
Error                20       108.61613      5.43081
Corrected Total      21       108.67673

         Root MSE              2.33041    R-Square    0.0006
         Dependent Mean        4.98591    Adj R-Sq   -0.0494
         Coeff Var            46.73990

                           Parameter Estimates
                  Parameter     Standard                      Standardized
Variable    DF     Estimate        Error   t Value  Pr > |t|      Estimate
Intercept    1      4.62767      3.42745      1.35    0.1920             0
IQ           1      0.00299      0.02826      0.11    0.9169       0.02361

Output 11.16. Results of PROC REG with kilograms of weight lost as the
criterion variable and IQ as the predictor variable (Analysis of
Variance and Parameter Estimate tables).

In Output 11.16, you can see that the t statistic for the regression coefficient is only 0.11.
The p value associated with this statistic is quite large at .9169. Because the p value is
larger than the standard criterion of .05, you can conclude that the regression coefficient is not
significantly different from zero.
Summarizing the Results of the Analysis
The following report summarizes the results of the analysis in which the number of kilograms
lost was regressed on IQ scores.
A) Statement of the research question: The purpose of this
study was to determine whether the regression coefficient
representing the relationship between IQ and the amount of
weight lost over a 10-week period is significantly different
from zero.
B) Statement of the research hypothesis: There will be a
positive relationship between IQ and the amount of weight that
is actually lost over a 10-week period.
C) Nature of the variables: This analysis involved one
predictor variable and one criterion variable.
The predictor variable was IQ. This was a multi-value
variable and was assessed on an interval scale.
The criterion variable was the number of kilograms of weight
lost during the 10-week study. The criterion variable was
also a multi-value variable and was assessed on a ratio
scale.
D) Statistical procedure:

Linear bivariate regression.

E) Statistical null hypothesis (H0): b = 0; In the population,


the regression coefficient representing the relationship
between IQ and the number of kilograms of weight lost is equal
to zero.
F) Statistical alternative hypothesis (H1): b ≠ 0; In the
population, the regression coefficient representing the
relationship between IQ and the number of kilograms of weight
lost is not equal to zero.
G) Obtained statistic:

b = .0030, t (20) = .11

H) Obtained probability (p) value:

p = .9169

I) Conclusion regarding the statistical null hypothesis: Fail
to reject the null hypothesis.

J) Conclusion regarding the research hypothesis: These


findings fail to provide support for the study's research
hypothesis.
K) Coefficient of determination:

.00.

L) Formal description of the results for a paper: Results were


analyzed by using linear regression to regress kilograms of
weight lost on IQ. This analysis revealed a nonsignificant
regression coefficient, b = .0030, t(20) = .11, p = .9169. The
analysis showed that IQ accounted for less than 1% of the
variance in weight loss.
M) Figure representing the results: Output 11.17 presents a
scattergram showing the relationship between IQ and kilograms
of weight lost.


[Scattergram: Plot of KG_LOST*IQ. Legend: A = 1 obs, B = 2 obs, etc.
Vertical axis: KG_LOST (1 to 9); horizontal axis: IQ (90 to 150).]

Output 11.17. Scattergram plotting kilograms lost against IQ.

Note about the Preceding Summary


Item L from the preceding summary provides a formal description of the results for a paper. It
is similar to the summary for the two other analyses reported earlier in this chapter, with the
exception that this description omits any interpretation of the meaning of the regression
coefficient (e.g., "The nature of the regression coefficient showed that, on the average, an
increase of .0030 kilograms of weight loss was associated with every 1-unit increase in IQ").
This is because the regression coefficient (slope) in the analysis was not significantly different
from zero. When statistics such as this are nonsignificant, a summary of results typically does
not include any attempt to interpret the nature of the relationship.

Conclusion
Many of the statistics covered in a course on basic statistics can be divided into two families:
tests of association versus tests of group differences. With a test of association, you are
typically studying a single population of individuals and wish to know whether there is a
relationship between two (or more) variables within that population. For example, in Chapter
10, Bivariate Correlation, you learned about the Pearson correlation coefficient, one of the
most widely used measures of association. You learned how to use SAS to compute a Pearson
correlation coefficient and to test the null hypothesis that the correlation was equal to zero in
the population. In this chapter, you learned how to compute a related measure, the regression
coefficient, and test the null hypothesis that the coefficient was equal to zero in the
population.
Now that you are familiar with tests of association, you will begin learning about tests of
group differences. With a test of group differences, you typically want to know whether two
(or more) populations differ from one another with respect to their mean scores on a criterion
(or dependent) variable. The t test and analysis of variance (ANOVA) are widely used tests of
group differences. Discussion of this family of tests begins in Chapter 12 with one of the more
elementary tests of group differences: the single-sample t test.


Chapter 12: Single-Sample t Test

Introduction
   Overview
Situations Appropriate for the Single-Sample t Test
   Overview
   Example of a Study Providing Data Appropriate for This Procedure
   Summary of Assumptions Underlying the Single-Sample t Test
Results Produced in a Single-Sample t Test
   Overview
   Test of the Null Hypothesis
   Confidence Interval for the Mean
   Effect Size
Example 12.1: Assessing Spatial Recall in a Reading Comprehension Task (Significant Results)
   Overview
   The Study
   Data Set to Be Analyzed
   Choosing the Comparison Number for the Analysis
   Writing the SAS Program
   Output Produced by PROC TTEST
   Steps in Interpreting the Output
   Summarizing the Results of the Analysis
One-Tailed Tests versus Two-Tailed Tests
   Dividing the Obtained p Value by 2
   Caution
Example 12.2: An Illustration of Nonsignificant Results
   Overview
   The Study
   Interpreting the SAS Output
   Summarizing the Results of the Analysis
Conclusion

Introduction
Overview
This chapter shows you how to use the SAS System to perform a t test for a single-sample
mean. This is a parametric procedure that is appropriate when you want to test the null
hypothesis that a given sample of data was drawn from a population that has a specified
mean. The chapter shows you how to write the appropriate SAS program, interpret the
output, and prepare a report summarizing the results of the analysis. Special emphasis is
given to testing the null hypothesis, interpreting the confidence interval for the mean, and
computing an index of effect size.

Situations Appropriate for the Single-Sample t Test


Overview
You may use a single-sample t test when

• You have obtained interval- or ratio-level data from a single sample of subjects.

• You want to determine whether the mean for this sample is significantly different from
some specified population mean.

You may perform this analysis only when the population mean of interest is known. That is,
you may perform it when the population mean has already been established by earlier
research, or when it is established by theoretical considerations. It is not necessary that the
standard deviation of scores in the population be known; this will be estimated from the
sample data.
Example of a Study Providing Data Appropriate for This Procedure
The study. Suppose you are conducting research with the ESP Club on campus. The club
includes ten members who claim to be able to predict the future. To prove it, they each
complete 100 trials in which they predict the results of a coin flip (i.e., they predict prior to
the flip whether the results will be heads or tails).
The members show some variability in their performance. One member achieves a score of
only 40 correct guesses, another achieves a score of 70 correct guesses, and so on. When
you average their scores, you find that the average for the group is 60. This means that, on
the average, the ten members guessed correctly on 60 out of 100 flips. Members of the ESP
Club are very happy with this average score. They point out that, if they did not have
precognition skills, they should have made an average of only 50 correct guesses out of 100
flips, based on the probability that correctly guessing a flip was only .5.


It is true that their sample mean of 60 correct guesses is higher than the hypothetical
population mean of 50 correct guesses, but is it significantly higher? To find out, you
perform a single-sample t test, testing the null hypothesis that the sample mean came from a
population in which the population mean was actually equal to 50. If you reject this null
hypothesis, it will provide some support to the club members' claim that they have ESP.
Why these data would be appropriate for this procedure. To perform a single-sample t
test, you need a criterion variable. The criterion variable should be a numeric variable that is
assessed on an interval or ratio scale of measurement. In the present study, the criterion
variable is the number of correct guesses out of 100 coin flips. Conceivably, subjects could
get a score of zero, a score of 100, or any number in between. You know that this variable is
on a ratio scale because equal differences between scale values do have equal quantitative
meaning, and also because there is a true zero point (a score of zero on this measure means
you made no correct guesses at all). Therefore, this assumption appears to be met (additional
assumptions for this test are listed in the following section).
When researchers perform a single-sample t test, the numeric criterion variable being
analyzed is usually a multi-value variable. You could ensure that this is the case in the
present study by using PROC FREQ to create a frequency table for the criterion variable
(number of correct guesses), and verifying that the variable assumes more than six values in
your sample.
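Such a check might look like the following sketch. Note that the data set name D1 and the variable name GUESSES are hypothetical here, because the SAS program for the ESP study is not actually presented in this chapter:

PROC FREQ DATA=D1;
   TABLES GUESSES;
RUN;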
Summary of Assumptions Underlying the Single-Sample t Test
Level of measurement. The criterion variable should be a numeric variable that is assessed
on an interval or ratio level of measurement.
Random sampling. Scores on the criterion variable should represent a random sample
drawn from the population of interest.
Normal distributions. The sample should be drawn from a normally distributed population
(you can use PROC UNIVARIATE to test the null hypothesis that the sample is from a
normally distributed population). If the sample contains over 30 subjects, the single-sample t
test is robust against moderate departures from normality (when a test is robust against
violations of certain assumptions, it means that violating those assumptions will have only a
negligible effect on the results).
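If you wanted to perform this check, the NORMAL option of PROC UNIVARIATE requests the tests of normality. The following is a minimal sketch, using the data set and criterion variable from Example 12.1 later in this chapter:

PROC UNIVARIATE DATA=D1 NORMAL;
   VAR RECALL;
RUN;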

Results Produced in a Single-Sample t Test


Overview
When you use PROC TTEST to perform a single-sample t test, SAS automatically performs
a test of the null hypothesis and estimates a confidence interval for the mean. If you use a
few statistics that are included in the output of PROC TTEST, it is relatively easy to
compute by hand an index of effect size. This section explains the meaning of these results.
Test of the Null Hypothesis
Overview. When you perform a single-sample t test, you begin with the null hypothesis that
your sample was drawn from a population with a specific mean. This is analogous to saying
that your sample represents a population with a specific mean. SAS computes the mean of
your sample, and compares it against the hypothetical population mean. In this analysis, it
computes a t statistic. The more different your sample mean is from the population mean
stated in the null hypothesis, the larger this t statistic will be (in absolute terms).
SAS also computes a p value (probability value) associated with this t statistic. If this p
value is less than some standard criterion (alpha level), you will reject the null hypothesis.
This book recommends that you use an alpha level of .05. This means that, if your obtained
p value is less than .05, you will reject the null hypothesis that your sample was drawn from
a population with the mean stated in your null hypothesis. In this case, you will conclude
that you have statistically significant results.
The statistical null hypothesis. As an illustration, consider the fictitious study on coin flips.
In that study, each of ten subjects participated in 100 trials in which they attempted to
predict the results of coin flips. Theoretically, we would expect the average subject to be
correct 50 times (because they had a .5 probability of being correct from chance, there were
100 coin flips, and 100 × .5 = 50). This means that you begin with the null hypothesis that
your sample was drawn from a population in which the mean number of correct guesses was
50. Symbolically, you can state the null hypothesis in this way:
H0: µ = 50
In this null statement, H0 is the symbol for the null hypothesis, and µ is the symbol for the
population mean.
The statistical alternative hypothesis. Before stating the alternative hypothesis, you must
first decide whether you wish to perform a directional test or a nondirectional test. A
directional test is sometimes called a one-sided or one-tailed test, and it involves stating a
directional alternative hypothesis. You state a directional alternative hypothesis if you not
only predict that there will be a difference, but also make a specific prediction about the
direction of that difference. For example, if you have strong reason to believe that the mean
number of correct guesses in your sample will be significantly higher than 50, you might
state the following directional alternative hypothesis:
Statistical alternative hypothesis (H1): µ > 50; In the population, the average number
of correct guesses is greater than 50.
This is analogous to predicting that your sample was drawn from a population in which the
mean is greater than 50. In the preceding alternative hypothesis, notice that the symbol form
of the alternative hypothesis uses the greater-than sign (>) rather than the equal sign (=) to
reflect this direction.
On the other hand, if you have strong reason to believe that the mean number of correct
guesses in your sample will be significantly lower than 50, you might state a different
directional alternative hypothesis:
Statistical alternative hypothesis (H1): µ < 50; In the population, the average number
of correct guesses is less than 50.
This is analogous to predicting that your sample was drawn from a population in which the
mean is less than 50. In the preceding alternative hypothesis, notice that the symbol form of
the alternative hypothesis uses the less-than sign (<).
You may remember from Chapter 2 that this book recommends that, in most cases, you
should not use a directional alternative hypothesis, and should instead use a nondirectional
alternative hypothesis. With a single-sample t test, a nondirectional alternative hypothesis
simply predicts that your sample was not drawn from a population that has a specific mean.
It predicts that there will be a significant difference between your sample mean and the
population mean stated in the null hypothesis, but it does not predict whether your sample
mean will be higher or lower than the population mean. A nondirectional t test is often
referred to as a two-sided or two-tailed test.
For the current study, a nondirectional alternative hypothesis would be stated in this fashion:
Statistical alternative hypothesis (H1): µ ≠ 50; In the population, the average number
of correct guesses is not equal to 50.
With the preceding alternative hypothesis, notice that the symbol form of the hypothesis
includes the not-equal sign (≠). This reflects the fact that you are simply predicting that the
actual population mean is not 50; you are not specifically predicting whether it is higher than
50 or lower than 50.
Obtaining significant results. An earlier section of this chapter asked you to suppose that
you have conducted this study and found that, in reality, the average number of correct
guesses in your sample is 60. This sample mean of 60 is higher than the hypothetical
population mean of 50, but is it significantly higher? To find out, you perform a single-sample t test. Suppose that, after performing this test, your obtained p value is .001. This
obtained p value is less than our standard criterion of .05, and so you reject the null
hypothesis. You tentatively conclude that your sample was probably not drawn from a
population in which the mean number of correct guesses is 50. In other words, you conclude
that there is a statistically significant difference between your sample mean of 60 and the
hypothetical population mean of 50.


Confidence Interval for the Mean


Confidence interval defined. When you use PROC TTEST to perform a single-sample t
test, SAS also automatically computes a confidence interval for the mean. A confidence
interval is an interval that extends from a lower confidence limit to an upper confidence
limit and is assumed to contain a population parameter with a stated probability, or level of
confidence.
An example. As an illustration, consider the coin toss study described earlier. Suppose that
you analyze your data, and compute the 95% confidence interval for the mean. Remember
that your sample mean is 60. Suppose that SAS estimates that the 95% interval extends from
55 (the lower limit) to 65 (the upper limit). This means that there is a 95% probability that
your sample of ten subjects was drawn from a population in which the mean number of
correct guesses was somewhere between 55 and 65. You do not know the exact mean of this
population, but you estimate that there is a 95% probability that it is somewhere between 55
and 65.
Notice that, with this confidence interval, you are not stating that there is a 95% probability
that the sample mean is somewhere between 55 and 65. You know exactly what the sample
mean is: you have already computed it to be 60. The confidence interval that SAS
computed is a probability statement about the population mean, not the sample mean.
Effect Size
The need for an index of effect size. Suppose that you perform a single-sample t test and
your obtained p value is less than the standard criterion of .05. You therefore reject the null
hypothesis. You know that there is a statistically significant difference between the
population mean stated under the null hypothesis versus the observed sample mean. But is it
a relatively large difference? The null hypothesis test, by itself, does not tell you whether the
difference is large or small. In fact, if your sample is large, you may obtain statistically
significant results even if the difference is relatively trivial.
Effect size defined. Because of this problem with null hypothesis testing, many researchers
are now supplementing these tests with measures of effect size. The exact definition of
effect size will vary, depending upon the type of analysis that you are performing. For a
single-sample t test, we can define effect size as the degree to which the sample mean differs
from the population mean, stated in terms of the standard deviation of the population. The
symbol for effect size is d, and the formula is as follows:
d = | X - µ0 | / σ

where:
X = the observed mean of the sample
µ0 = the hypothetical population mean stated under the null hypothesis
σ = the standard deviation of the null hypothesis population.


The preceding formula shows that, to compute effect size, you subtract the population mean
from the sample mean and divide the resulting difference by the population standard
deviation. This means that the effect size is essentially the number of standard deviations
that the sample mean differs from the null hypothesis population mean.
One problem with the preceding formula is that σ (the standard deviation of the null
hypothesis population) is often unknown. In these situations, you may instead use sX, the
estimated population standard deviation. In Chapter 7, "Measures of Central Tendency and
Variability," you learned that sX is an estimate of the population standard deviation, and is
computed from sample data. You may recall that the formula for sX uses N - 1 in the
denominator, whereas the formula for σ uses N. Later, you will see that PROC TTEST
automatically reports sX, which makes it relatively easy to compute the effect size for a
single-sample t test. With sX substituted for σ, the formula for the d statistic becomes the
following:
d = | X - µ0 | / sX

An example. Suppose that you analyze data from the coin toss study described earlier.
Under the null hypothesis, the population mean µ0 = 50. Suppose that your observed sample
mean (X) is 60, and the estimated population standard deviation (sX) is 15. In this situation,
you would compute effect size as follows:

d = | X - µ0 | / sX
d = | 60 - 50 | / 15
d = 10 / 15
d = .667
d = .67
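Because PROC TTEST does not report d, you can also let SAS do this arithmetic with a short DATA step. The following is a sketch, using the values from the example above:

DATA _NULL_;
   XBAR = 60;                     /* observed sample mean (X)            */
   MU0  = 50;                     /* population mean stated under H0     */
   SX   = 15;                     /* estimated population std. deviation */
   D    = ABS(XBAR - MU0) / SX;   /* effect size                         */
   PUT 'Effect size d = ' D 5.2;  /* writes the result to the SAS log    */
RUN;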

Guidelines for interpreting effect size. This effect size of .67 indicates that the observed
sample mean was approximately .67 standard deviations from the hypothetical population
mean. This is an effect, but would most researchers view it as being a relatively large effect?


In making this interpretation, many researchers refer to the guidelines provided in Cohen's
(1969) classic book on statistical power. Cohen's guidelines appear in Table 12.1:
Table 12.1
Guidelines for Interpreting Effect Size
_________________________________________
Effect size          Obtained d statistic
_________________________________________
Small effect         d = .20
Medium effect        d = .50
Large effect         d = .80
_________________________________________

Earlier, you computed the effect size for the coin toss study as d = .67. The guidelines
appearing in Table 12.1 show that this is somewhere between a medium effect and a large
effect (remember that the coin toss study is fictitious!).

Example 12.1: Assessing Spatial Recall in a Reading Comprehension Task (Significant Results)
Overview
This section shows you how to use PROC TTEST to perform a single-sample t test. You
will use a fictitious study that provides data appropriate for this procedure. You will learn
how to arrange your data set, write the SAS program, interpret the results of PROC TTEST,
and summarize the results of your analysis in a report.
The Study
Suppose that you are a cognitive psychologist doing exploratory research in the area of
spatial recall. You ask a sample of 13 subjects to read the same text (say, a 100-page-long
excerpt from a biology textbook). Each page of text is divided into four quadrants, and a
single paragraph appears in each quadrant, as follows:
[Figure not reproduced: a sample page divided into four quadrants, with one paragraph of
text in each quadrant.]
After reading the 100-page text, the subject takes a 100-item test. However, in this study,
you are not interested in whether subjects can remember the facts they have learned. Instead,
what you really want to know is whether subjects can remember where on the page they saw
specific pieces of information.
For example, assume that item 5 on the test deals with the concept of "ecological niche,"
and the information relevant to this item appeared in quadrant 2 of page 5 in the text. In
responding to this item, the subject indicates the quadrant that the relevant information
appeared in by circling "1" for quadrant 1, "2" for quadrant 2, and so forth. A response is
considered correct if the subject correctly recalls the quadrant in which the information
appeared.
Needless to say, this is a difficult task. If the subjects are totally unable to recall the location
on the page where the information appeared, you would expect them to guess the correct
quadrant about 25% of the time. This is because they have four choices (1, 2, 3, and 4), and
if they guessed randomly they would be correct about one time in four just due to chance.
Since there are 100 items on the test, you would therefore expect them to guess correctly on
about 25 items due to chance alone (because 100 × 25% = 25). If the subjects get more than
25 items correct, it may mean that they have good spatial recall skills: that they really can
remember where the information appeared on the page.
So you determine how many correct guesses were made by each of your 13 subjects. You
then use a single-sample t test to determine whether their average number of correct guesses
was significantly larger than the theoretical population mean of 25 correct guesses.
Data Set to Be Analyzed
Table 12.2 shows the scores on the criterion variable: the number of correct guesses made
by each of the 13 subjects.
Table 12.2
Number of Correct Guesses Regarding Spatial Location
______________________
            Correct
Subject     guesses
______________________
01          31
02          34
03          26
04          36
05          31
06          30
07          29
08          30
09          34
10          28
11          28
12          30
13          33
______________________


Choosing the Comparison Number for the Analysis


With this analysis, you will test the null hypothesis that your sample of 13 subjects was
drawn from a population in which the average score is 25. This is another way of saying that
your sample represents a population in which the average score is 25. Symbolically, it can
be represented this way:
H0: µ = 25
So 25 is the comparison number for this analysis. In this type of analysis, the comparison
number is always the mean score that you would expect in the population, based on
theoretical considerations. In the present study, the comparison number is 25 because

• Subjects are making 100 guesses.

• If the subjects have no spatial recall skills, the average number of correct guesses they
make should be about the number that you would expect based on chance (guessing
randomly).

• Since there are four possible responses, the probability of guessing correctly when
guessing randomly is equal to 25%.

• Theoretically, if subjects are guessing randomly, you would expect them to make about
25 correct guesses, because 25% of 100 guesses is equal to 25 correct guesses.

This forms the basis for your single-sample t test. If the average score in your sample is
significantly different from 25 (the comparison number), you will be able to reject this null
hypothesis. If the average score in your sample is significantly larger than 25, it will provide
support for the idea that your subjects do have spatial recall skills.
Remember that the comparison number was 25 in this analysis because this was the number
specified in the null hypothesis. In a different study and analysis, a different comparison
number would likely be used, depending on the nature of the null hypothesis.


Writing the SAS Program


The DATA step. Suppose that you create a SAS data set that contains just two variables.
The variable SUB_NUM contains a unique subject number for each subject, and the variable
RECALL contains each subjects score on the criterion variable (the number of correct
guesses regarding spatial location). Here is the DATA step for this program (line numbers in
italic have been added on the left):
1     OPTIONS LS=80 PS=60;
2     DATA D1;
3        INPUT SUB_NUM
4              RECALL;
5     DATALINES;
6     01 31
7     02 34
8     03 26
9     04 36
10    05 31
11    06 30
12    07 29
13    08 30
14    09 34
15    10 28
16    11 28
17    12 30
18    13 33
19    ;

The data lines appear on lines 6-18 of the DATA step. You can see that these are the same
data lines that appeared in Table 12.2.
The PROC Step. The syntax for the PROC step of a program that will perform a single-sample t test is as follows:
PROC TTEST   DATA=data-set-name
             H0=comparison-number
             ALPHA=alpha-level;
   VAR criterion-variable;
   TITLE1 'your-name';
RUN;

In the syntax, the PROC TTEST statement contains the following option:
H0=comparison-number
The comparison-number that appears in this option should be the population mean stated
under the null hypothesis. It is the mean score that you wish your sample mean score to be
compared against. The number that you type in this location will depend on the nature of
your study. The preceding section, "Choosing the Comparison Number for the Analysis,"
stated that the appropriate comparison number for the current analysis is 25. Therefore, you
will include the following option in the PROC TTEST statement:
H0=25
Note that the "0" that appears in the preceding option "H0" is a zero (0), and is not the
uppercase letter O.
If you omit the H0 option from the PROC TTEST statement, the default comparison number
is zero.
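For example, the following statements (a sketch) would compare the sample mean against a population mean of zero, because the H0 option has been omitted:

PROC TTEST DATA=D1;
   VAR RECALL;
RUN;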
The syntax for the PROC TTEST statement also contains the following option:
ALPHA=alpha-level
The ALPHA option allows you to specify the size of the confidence interval that you will
estimate around the sample mean. Specifying ALPHA=0.01 produces a 99% confidence
interval, specifying ALPHA=0.05 produces a 95% confidence interval, and specifying
ALPHA=0.1 produces a 90% confidence interval. Suppose that, in this analysis, you wish to
create a 95% confidence interval for your mean. This means that you will include the
following option in the PROC TTEST statement:
ALPHA=0.05
The preceding syntax included the following VAR statement:
VAR criterion-variable;
In the VAR statement, criterion-variable should be the name of the variable that is of central
interest in the analysis. In the present study, the criterion variable is RECALL: the number
of correct guesses regarding spatial location. This means that the SAS program will contain
the following VAR statement:
VAR RECALL;
Below are the statements that constitute the PROC step for the current analysis:
PROC TTEST DATA=D1 H0=25 ALPHA=0.05;
   VAR RECALL;
   TITLE1 'JOHN DOE';
RUN;

The complete SAS program. Here is the program that you can use to analyze the fictitious
data from the preceding study. This program will perform a single-sample t test to determine
whether the mean RECALL score in a sample of 13 subjects is significantly different from a
comparison number of 25. It will estimate the 95% confidence interval for the mean.


OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM
         RECALL;
DATALINES;
01 31
02 34
03 26
04 36
05 31
06 30
07 29
08 30
09 34
10 28
11 28
12 30
13 33
;
PROC TTEST DATA=D1 H0=25 ALPHA=0.05;
   VAR RECALL;
   TITLE1 'JOHN DOE';
RUN;

Output Produced by PROC TTEST


With the line size and page size options requested in the OPTIONS statement, the preceding
program would produce one page of output, shown in Output 12.1.
                                 JOHN DOE
                            The TTEST Procedure

                                Statistics
                Lower CL             Upper CL   Lower CL             Upper CL
Variable     N      Mean      Mean       Mean    Std Dev   Std Dev    Std Dev
RECALL      13    29.057    30.769     32.481     2.0315     2.833     4.6765

                                Statistics
             Variable    Std Err    Minimum    Maximum
             RECALL       0.7857         26         36

                                 T-Tests
             Variable     DF    t Value    Pr > |t|
             RECALL       12       7.34      <.0001

Output 12.1. Results of the TTEST procedure performed on subject recall data.


Here are several main points in Output 12.1 from PROC TTEST:
• The name of the variable being analyzed appears below the heading "Variable." In this
case, you can see that the criterion variable being analyzed was RECALL.
• The output includes two sections headed "Statistics." These sections report the mean,
standard deviation, confidence intervals, and other information.
• The section headed "T-Tests" reports the t statistic and other information relevant for the
null hypothesis that the sample was drawn from a population with a specified mean. The
following section describes the various sections of PROC TTEST output in greater detail.
Steps in Interpreting the Output
1. Make sure that everything looks right. The output from the analysis is again
reproduced as Output 12.2. Review the sections identified below to
verify that there were no obvious errors in entering your data or requesting the TTEST
procedure.
                                 JOHN DOE
                            The TTEST Procedure

                                Statistics
                Lower CL             Upper CL   Lower CL             Upper CL
Variable     N      Mean      Mean       Mean    Std Dev   Std Dev    Std Dev
RECALL      13    29.057    30.769     32.481     2.0315     2.833     4.6765

                                Statistics
             Variable    Std Err    Minimum    Maximum
             RECALL       0.7857         26         36

                                 T-Tests
             Variable     DF    t Value    Pr > |t|
             RECALL       12       7.34      <.0001

Output 12.2. Sections to review to verify that there were no obvious errors in
writing the SAS program or entering data.

First check the name of the criterion variable to verify that you are looking at results for the
correct variable (RECALL, in this case).
Check the number of valid observations in the column headed "N" to verify that the data set
includes the expected number of subjects. Here, the N is 13, as expected.
Next review the mean, the minimum value, and the maximum value to verify that you
have not made any obvious errors in keying the data (e.g., verify that you don't have, say, a
maximum observed score of 200, although the highest possible score was supposed to be
100). So far, these results do not reveal any problems.


                                 JOHN DOE
                            The TTEST Procedure

                                Statistics
                Lower CL             Upper CL   Lower CL             Upper CL
Variable     N      Mean      Mean       Mean    Std Dev   Std Dev    Std Dev
RECALL      13    29.057    30.769     32.481     2.0315     2.833     4.6765

                                Statistics
             Variable    Std Err    Minimum    Maximum
             RECALL       0.7857         26         36

                                 T-Tests
             Variable     DF    t Value    Pr > |t|
             RECALL       12       7.34      <.0001

Output 12.3. Sections to review for the test of the studys null hypothesis.

2. Review the results of the t test. Output 12.3 presents the output from PROC TTEST
again, this time identifying information relevant to the test of the null hypothesis.
Remember that the null hypothesis for your study states that your sample was drawn from a
population in which the mean score was 25 (25 is the number of correct responses that you
would expect if your subjects were responding correctly at a chance level). Symbolically,
the null hypothesis was stated this way:
H0: µ = 25
Output 12.3 shows that the mean score obtained for your sample of 13 subjects was 30.769.
This is higher than the population mean of 25 that was stated in the null hypothesis, but is it
significantly higher? To find out, you will need to consult the results of the t test.
The results of this test appear in the lower part of the output, in the section headed "T-Tests."
Below the heading "t Value," you can see that the obtained t statistic for your analysis is 7.34.
The section headed "DF" provides the degrees of freedom for the t test, which in this case
was 12.
Finally, the section headed "Pr > |t|" provides the p value (probability value) for the t test.
Remember that the p value is the probability that you would obtain a t value this large or
larger (in absolute terms) if the null hypothesis were true. Output 12.3 shows that the p value
for this test is <.0001 (less than one in 10,000).
This book recommends that you should reject the null hypothesis whenever the p value
associated with a test is less than .05. In the present case, the obtained p value is less than
.0001, which is much less than .05. Therefore, you will reject the null hypothesis that your
sample was drawn from a population in which the average recall score was 25. You will
conclude that your obtained sample mean of 30.769 is significantly higher than the
hypothetical mean of 25. In other words, you will conclude that the subjects in your sample
were able to recall spatial location at a rate higher than would be expected with random
guessing.
3. Review the confidence interval for the mean. An earlier section of this chapter
indicated that a confidence interval is an interval that extends from a lower confidence limit
to an upper confidence limit and is assumed to contain a population parameter with a stated
probability, or level of confidence. For example, if you compute the 95% confidence
interval for a single-sample t test, you can be 95% sure that the actual population mean is
somewhere between those two limits.
When you wrote your SAS program to perform this analysis, you included the following
PROC TTEST statement:
PROC TTEST DATA=D1 H0=25 ALPHA=0.05;

The option ALPHA=0.05 that is included in this statement requests that SAS compute the
95% confidence interval for the mean (if you had desired the 99% confidence interval, you
would have included the option ALPHA=0.01). The 95% confidence interval for the current
analysis appears in the output for the PROC TTEST and is shown again as Output 12.4.
                                 JOHN DOE
                            The TTEST Procedure

                                Statistics
                Lower CL             Upper CL   Lower CL             Upper CL
Variable     N      Mean      Mean       Mean    Std Dev   Std Dev    Std Dev
RECALL      13    29.057    30.769     32.481     2.0315     2.833     4.6765

                                Statistics
             Variable    Std Err    Minimum    Maximum
             RECALL       0.7857         26         36

                                 T-Tests
             Variable     DF    t Value    Pr > |t|
             RECALL       12       7.34      <.0001

Output 12.4. Sections to review to find the confidence interval for the mean.

You have already seen that the mean RECALL score for your sample of 13 subjects was
30.769.
In Output 12.4, the lower confidence limit for the mean appears below the heading "Lower
CL Mean." You can see that the lower confidence limit in this case is 29.057.
In the same output, the upper confidence limit for the mean appears below the heading
"Upper CL Mean." You can see that the upper confidence limit in this case is 32.481. Taken
together, these results show that the 95% confidence interval for the current analysis ranges
from 29.057 to 32.481. This means that there is a 95% likelihood that the actual mean for
the population from which your sample was drawn is somewhere between 29.057 and
32.481.
Notice that this confidence interval does not contain the mean of 25 correct recalls that was
stated in the null hypothesis. This finding is consistent with the idea that the average number
of correct guesses displayed by your sample is significantly greater than 25, the number that
would have been expected with purely random guessing.
When you are looking for the confidence interval for the mean from an analysis such as this,
be sure that you do not look under the headings "Lower CL Std Dev" and "Upper
CL Std Dev." Under these headings, you will find instead the lower and upper confidence
limits (respectively) for the study's standard deviation. With most studies conducted in the
social sciences and in education, it is much more common to report the confidence interval
for the mean rather than the confidence interval for the standard deviation.
4. Compute the index of effect size. An earlier section defined effect size as the degree to
which the sample mean differs from the population mean, stated in terms of the standard
deviation of the population. The symbol for effect size is d. When the population standard
deviation is being estimated from sample data, the formula for d is as follows:

d = | X - µ0 | / sX

In the preceding formula, X represents the observed sample mean, µ0 represents the
theoretical population mean as stated in the null hypothesis, and sX represents the standard
deviation of the null hypothesis population, as estimated from sample data.
Although SAS does not include d as a part of the output from PROC TTEST, the statistic is
relatively easy to compute by hand, as we will see. Two of the three values to be inserted
into the preceding formula appear on the SAS output. See Output 12.5 for the output from
the current analysis.


                                 JOHN DOE
                            The TTEST Procedure

                                Statistics
                Lower CL             Upper CL   Lower CL             Upper CL
Variable     N      Mean      Mean       Mean    Std Dev   Std Dev    Std Dev
RECALL      13    29.057    30.769     32.481     2.0315     2.833     4.6765

                                Statistics
             Variable    Std Err    Minimum    Maximum
             RECALL       0.7857         26         36

                                 T-Tests
             Variable     DF    t Value    Pr > |t|
             RECALL       12       7.34      <.0001

Output 12.5. Sections to review to compute d, the index of effect size.

First, you will need the observed sample mean. In Output 12.5, this appears below the
heading "Mean." As you have already seen, the sample mean from this analysis is 30.769,
and this value is now inserted in the formula for d:

d = | 30.769 - µ0 | / sX

With the preceding formula, the symbol sX represents the population standard deviation, as
estimated from sample data. In Output 12.5, this value appears below the heading "Std Dev."
This standard deviation was based on sample data, using N - 1 in the denominator (rather
than N). Output 12.5 shows that the estimated population standard deviation is 2.833. This
value is now inserted in the formula below:

d = | 30.769 - µ0 | / 2.833

The final value in the formula is µ0, which represents the population mean under the null
hypothesis. For the current study, the null hypothesis was stated symbolically in this way:

H0: µ = 25

Your null hypothesis stated that the current sample was drawn from a population in which
the mean score is 25. The number 25 was chosen because that is the number of correct
guesses you would expect the subjects to make if they were guessing in a random fashion.
You will remember that this is the reason that you chose 25 as the comparison number for
the H0 option that you included in the PROC TTEST statement:

PROC TTEST DATA=D1 H0=25 ALPHA=0.05;

This should generally be the case. In most instances, the comparison number from your
PROC TTEST statement will serve as the value of µ0 in the formula for computing d.
Because the comparison number for the current analysis is 25, that value is now inserted in
the formula below and the value of d is computed:

d = | 30.769 - 25 | / 2.833
d = 5.769 / 2.833
d = 2.036
d = 2.04
And so the index of effect size for the current analysis is 2.04. Is this a relatively large effect
or a relatively small effect? Cohen's guidelines for evaluating effect size are shown again in
Table 12.3:
Table 12.3
Guidelines for Interpreting Effect Size
_________________________________________
Effect size          Obtained d statistic
_________________________________________
Small effect         d = .20
Medium effect        d = .50
Large effect         d = .80
_________________________________________

According to the table, an effect is considered large if d = .80. For the current analysis, d =
2.04, which is much larger than .80. Therefore, you can conclude that the present data
produced a relatively large index of effect size.
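If you prefer to let SAS do the arithmetic for d, the computation can again be expressed as a short DATA step (a sketch; the values are copied from Output 12.5 and from the H0 option):

DATA _NULL_;
   D = ABS(30.769 - 25) / 2.833;   /* |sample mean - H0 mean| / sX */
   PUT 'Effect size d = ' D 5.2;   /* writes 2.04 to the SAS log   */
RUN;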
Summarizing the Results of the Analysis
Here is the analysis report for the current analysis. Notice that it follows a format that is not
identical to the format used for analysis reports in the previous chapter. Following the report
are a number of notes that clarify the information that it contains.


A) Statement of the research question: The purpose of this


study was to determine whether subjects performing a four-choice spatial recall task will perform at a level that is
higher than the level expected with random responding.
B) Statement of the research hypothesis: Subjects performing
a four-choice spatial recall task will perform at a level that
is higher than the level expected with random responding.
C) Nature of the variable: The criterion variable was RECALL,
the number of times the subjects correctly guessed the
quadrant where targeted information appeared on the text page.
This was a multi-value variable and was assessed on a ratio
scale.
D) Statistical test:  Single-sample t test.

E) Statistical null hypothesis (H0): µ = 25; In the


population, the average number of correct recalls is equal to
25 out of 100 (the number expected with random responding).
F) Statistical alternative hypothesis (H1): µ ≠ 25; In the
population, the average number of correct recalls is not equal
to 25 out of 100.
G) Obtained statistic:

t = 7.34

H) Obtained probability (p) value:

p < .0001

I) Conclusion regarding the statistical null hypothesis:


Reject the null hypothesis.
J) Confidence interval: The sample mean on the criterion
variable (number of correct recalls) was 30.769. The 95%
confidence interval for the mean extended from 29.057 to
32.481.
K) Effect size:

d = 2.04.

L) Conclusion regarding the research hypothesis: These


findings provide support for the study's research hypothesis.
M) Formal description of results for a paper: Results were
analyzed using a single-sample t test. This analysis revealed
a significant t value, t(12) = 7.34, p < .0001. In the
sample, the mean number of correct recalls was 30.769 (SD =
2.833), which was significantly higher than the 25 correct
recalls that would have been expected with random responding.
The 95% confidence interval for the mean extended from 29.057
to 32.481. The effect size was computed as d = 2.04. According
to Cohen's (1969) guidelines, this represents a relatively
large effect.


Notes regarding the preceding report. In general, the preceding summary was prepared
according to the conventions recommended by the Publication Manual of the American
Psychological Association (1994). A few words of explanation may be necessary:

•  The second sentence of item M reports the obtained t statistic in the following way:

      t(12) = 7.34, p < .0001.

   The "12" that appears in parentheses in the excerpt is the degrees of freedom for the
   analysis. With a single-sample t test, the degrees of freedom are equal to N - 1, where N
   represents the number of subjects who provided valid data for the analysis. In the present
   case, N = 13, so it makes sense that the degrees of freedom would be 13 - 1 = 12. In the
   SAS output, the degrees of freedom for the t test appear below the heading "DF."

•  In item H and item M, the probability value for the t statistic is presented in this way:

      p < .0001

   This item used the less-than sign (<) because the less-than sign actually appeared in
   Output 12.3 (that is, the p value that appeared below the heading "Pr > |t|" was
   "<.0001"). If the less-than sign is not actually printed in the SAS output, you should
   instead use the equal sign (=) when indicating your obtained p value. For example,
   assume that the p value for this t statistic had actually been printed as ".0143." In that
   situation, you would have used the equal sign:

      H) Obtained probability (p) value: p = .0143

•  The third sentence of item M states,

      In the sample, the mean number of correct recalls was
      30.769 (SD = 2.833)...

   The symbol SD represents the standard deviation of the criterion variable, RECALL. You
   will remember that this standard deviation appeared in Output 12.5 below the heading
   "Std Dev."

One-Tailed Tests versus Two-Tailed Tests


Dividing the Obtained p Value by 2
When you use PROC TTEST to perform a single-sample t test, the resulting p value is the
probability value for a nondirectional test (i.e., for a two-tailed or two-sided test). If you
instead wish to perform a directional test (i.e., a one-tailed or one-sided test), just divide the
p value on the output by 2; the result will be the p value for a directional test.
For example, imagine that you perform an analysis that produces a nondirectional p value of
.08 in the output. This is larger than the standard criterion of .05, and thus fails to attain
significance as a nondirectional test. However, suppose that, prior to the analysis, you
decided that a directional test was appropriate. You therefore divide the obtained p value of
.08 by 2, resulting in a directional p value of .04. This obtained p value of .04 is less than the
standard criterion of .05, and so you reject the null hypothesis and conclude that you have
significant results. It makes sense that the directional test was significant while the
nondirectional test was not, as a directional test has greater power.
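If you want SAS itself to perform this halving, a one-statement DATA step will do it. The
following is just a sketch of the arithmetic, using the hypothetical .08 from the example
above:

DATA _NULL_;
   P_TWO = .08;         /* nondirectional p value from the output */
   P_ONE = P_TWO / 2;   /* directional (one-tailed) p value       */
   PUT P_ONE=;          /* writes P_ONE=0.04 to the SAS log       */
RUN;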
Caution
Avoid the temptation to begin an analysis as a nondirectional test, and then to decide that it
was actually a directional test when you see that the results are nonsignificant as a two-tailed
test. By following such a strategy, your actual probability of making a Type I error is higher
than the standard .05 level assumed by your readers. In most analyses, you are advised to
use a nondirectional test (see Abelson [1995, pp. 57–59] for a discussion of one-tailed tests,
two-tailed tests, and alternatives).

Example 12.2: An Illustration of Nonsignificant Results


Overview
So that you will be able to recognize SAS results that show a nonsignificant t value, this
section provides the results of a t test that failed to attain significance. This section focuses
on the same recall study described in the previous section. Here, however, you will analyze
a different fictitious data set designed to produce nonsignificant results. You will review the
SAS output and then prepare an analysis report for the new results.
The Study
Overview. The previous section described a study in which subjects read 100 pages of text,
and then tried to recall where on the page a given piece of information appeared. They
completed 100 trials, and so their scores on the criterion variable could range from zero (if
none of their guesses were correct) to 100 (if all of their guesses were correct). If they are
guessing randomly, you expect them to be correct 25 times out of 100, on the average
(because each page is divided into four sections, and 100 divided by 4 is equal to 25).
The data set. Table 12.4 provides data from this fictitious study. Notice that the scores
under the heading "Correct guesses" are different from those that appeared in Table 12.2.



Table 12.4
Number of Correct Guesses Regarding Spatial Location
(Data Producing Nonsignificant Results)
______________________
            Correct
Subject     guesses
______________________
01          24
02          30
03          27
04          26
05          25
06          33
07          23
08          24
09          26
10          20
11          26
12          28
13          27
______________________

The SAS program. The SAS program to analyze the preceding data set would be identical to
the SAS program presented earlier (in the section titled "Writing the SAS Program"), except
that the scores for the criterion variable RECALL would be replaced with those appearing in
Table 12.4. The option H0=25 would again be included in the PROC TTEST statement to
request that the sample mean be compared against a population value of 25.
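For example, assuming that the DATA step reads the Table 12.4 scores into a data set
named D1 (an assumed name; use whatever name your DATA step assigns), the PROC step
might look like this sketch:

PROC TTEST DATA=D1 H0=25 ALPHA=0.05;
   VAR RECALL;
   TITLE1 'JOHN DOE';
RUN;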
Interpreting the SAS Output
Reviewing the sample mean. Output 12.6 presents the results of the single-sample t test
performed on the data from Table 12.4.
JOHN DOE

The TTEST Procedure

                                Statistics

                 Lower CL            Upper CL   Lower CL             Upper CL
Variable    N    Mean       Mean     Mean       Std Dev    Std Dev   Std Dev
RECALL     13    24.127     26.077   28.027     2.3137     3.2265    5.3261

                                Statistics

Variable     Std Err     Minimum     Maximum
RECALL       0.8949           20          33

                                 T-Tests

Variable      DF      t Value      Pr > |t|
RECALL        12         1.20        0.2520

Output 12.6. Results of the TTEST procedure performed on subject recall data
(nonsignificant results).


Your first clue that you are probably going to obtain nonsignificant results is the size of the
sample mean itself. Output 12.6 shows that the sample mean is 26.077. This value is very
close to the population mean of 25, which is stated by the null hypothesis.
Reviewing the results of the t test. To determine whether the sample mean differs from the
population mean, consult the T-Tests section of Output 12.6. There you will find:

•  the obtained t statistic. For this analysis it was only 1.20.

•  the p value. For this t statistic, the value is .2520.
Because this p value is larger than our standard criterion of .05, you fail to reject the null
hypothesis. You conclude that the sample mean of 26.077 is not significantly different from
the population mean of 25. In other words, you conclude that the number of correct guesses
made by your subjects was not significantly greater than the number that would be expected
with random guessing.
Reviewing the confidence interval. These results are also reflected by the 95% confidence
interval for the mean that SAS computed. Output 12.6 shows that the lower confidence
limit for the mean is 24.127, and the upper confidence limit for the mean is 28.027. This
means that there is a 95% likelihood that your sample was drawn from a population in which
the population mean was somewhere between 24.127 and 28.027. Notice that this interval
contains the number 25, which was the population mean stated by the study's null
hypothesis. This confidence interval gives us another way of seeing that your sample could
very likely have come from a population in which the mean number of correct responses
was 25. And that is why you failed to reject the null hypothesis with your t test. Whenever a
confidence interval contains the population mean stated by your null hypothesis, you can
expect the corresponding t statistic to be nonsignificant. This, of course, assumes that you
use the same alpha level for both the confidence interval and the t test (e.g., you compute the
95% confidence interval and set alpha = .05 for the t test).
Computing the index of effect size. The formula for d, the effect size index, is again
provided below:

   d = | X̄ - μ0 | / sX

Thus far, we have already established that the sample mean (X̄) for the current analysis is
26.077, and that the population mean (μ0) under the null hypothesis is again 25. The only
remaining value needed for the formula is sX, the estimated population standard deviation.
This can be found in Output 12.6.


Under the heading "Std Dev," you can see that the standard deviation is 3.2265. When you
insert these values into the formula, you can compute the effect size for the current analysis
as follows:
   d = | X̄ - μ0 | / sX

   d = | 26.077 - 25 | / 3.2265

   d = 1.077 / 3.2265

   d = .334

   d ≈ .33

The obtained effect size index is .33. To determine whether this is considered relatively
large or relatively small, we consult Cohen's (1969) guidelines in Table 12.5:
Table 12.5
Guidelines for Interpreting Effect Size
_________________________________________
Effect size          Obtained d statistic
_________________________________________
Small effect         d = .20
Medium effect        d = .50
Large effect         d = .80
_________________________________________

According to Table 12.5, your obtained effect size of .33 falls somewhere between a small
effect (d = .20) and a medium effect (d = .50).
Summarizing the Results of the Analysis
Here is the analysis report for the current analysis. Notice that the results have been changed
to be consistent with those reported in Output 12.6.
A) Statement of the research question: The purpose of this
study was to determine whether subjects performing a four-choice
spatial recall task will perform at a level that is
higher than the level expected with random responding.

B) Statement of the research hypothesis: Subjects performing
a four-choice spatial recall task will perform at a level that
is higher than the level expected with random responding.

C) Nature of the variable: The criterion variable was RECALL,
the number of times the subjects correctly guessed the
quadrant where targeted information appeared on the text page.
This was a multi-value variable and was assessed on a ratio
scale.

D) Statistical test: Single-sample t test.

E) Statistical null hypothesis (H0): μ = 25; In the
population, the average number of correct recalls is equal to
25 out of 100 (the number expected with random responding).

F) Statistical alternative hypothesis (H1): μ ≠ 25; In the
population, the average number of correct recalls is not equal
to 25 out of 100.

G) Obtained statistic: t = 1.20

H) Obtained probability (p) value: p = .2520

I) Conclusion regarding the statistical null hypothesis: Fail
to reject the null hypothesis.

J) Confidence interval: The sample mean on the criterion
variable (number of correct recalls) was 26.077. The 95%
confidence interval for the mean extended from 24.127 to
28.027.

K) Effect size: d = .33.

L) Conclusion regarding the research hypothesis: These findings
fail to provide support for the study's research hypothesis.

M) Formal description of the results for a paper: Results
were analyzed using a single-sample t test. This analysis
revealed a nonsignificant t value, t(12) = 1.20, p = .2520.
In the sample, the mean number of correct recalls was 26.077
(SD = 3.2265), which was not significantly higher than the 25
correct recalls that would have been expected with random
responding. The 95% confidence interval for the mean extended
from 24.127 to 28.027. The effect size was computed as d =
.33. According to Cohen's (1969) guidelines, this falls
somewhere between a small effect and a medium effect.


Conclusion
In this chapter you learned how to use SAS to perform a single-sample t test. This test is
useful when you have computed the mean score on a criterion variable for just one sample,
and wish to determine whether this mean is significantly different from a specified
population value.
With this foundation laid, you are now ready to move on to one of the most widely used
inferential statistics: the independent-samples t test. The independent-samples t test is useful
when you have obtained numeric data from two samples and wish to determine whether the
sample means are significantly different from each other. For example, you might use this
test when you have conducted an experiment with an experimental group and a control
group, and wish to determine whether the observed difference between the two group means
is significant. Chapter 13 discusses the assumptions underlying this test, and shows how it
can be performed using SAS.

Independent-Samples t Test
Introduction..........................................................................................415
Overview................................................................................................................ 415
Independent Samples versus Paired Samples ...................................................... 415
Situations Appropriate for the Independent-Samples t Test ..............417
Overview................................................................................................................ 417
Nature of the Predictor and Criterion Variables ..................................................... 417
The Type-of-Variable Figure .................................................................................. 417
Example of a Study Providing Data Appropriate for This Procedure...................... 418
Summary of Assumptions Underlying the Independent-Samples t Test ................ 419
Results Produced in an Independent-Samples t Test .........................420
Overview................................................................................................................ 420
Test of the Null Hypothesis .................................................................................... 420
Confidence Interval for the Difference between the Means ................................... 426
Effect Size.............................................................................................................. 426
Example 13.1: Observed Consequences for Modeled Aggression:
Effects on Subsequent Subject Aggression
(Significant Differences) .................................................................428
Overview................................................................................................................ 428
The Study .............................................................................................................. 428
The Predictor Variable and Criterion Variables in the Analysis.............................. 429
Data Set to Be Analyzed........................................................................................ 430
The DATA Step for the Program ............................................................................ 431
Writing the SAS Program....................................................................................... 432
Results from the SAS Output................................................................................. 434


Steps in Interpreting the Output ............................................................................. 435


Summarizing the Results of the Analysis............................................................... 443
Example 13.2: An Illustration of Results Showing
Nonsignificant Differences..............................................................446
Overview................................................................................................................ 446
The SAS Output..................................................................................................... 446
Interpreting the Output ........................................................................................... 446
Summarizing the Results of the Analysis............................................................... 448
Conclusion............................................................................................450


Introduction
Overview
This chapter shows you how to use the SAS System to perform an independent-samples t
test. You use this test when you want to compare two independent groups, to determine
whether there is a significant difference between the groups with respect to their mean
scores on some numeric criterion variable. The criterion variable must be assessed on an
interval or ratio level (additional assumptions will be discussed). This chapter shows you
how to write the appropriate SAS program, interpret the output, and prepare a report
summarizing the results of the analysis. Special emphasis is given to testing the null
hypothesis, interpreting the confidence interval for the difference between the means, and
computing an index of effect size.
Independent Samples versus Paired Samples
Overview. This chapter shows how to perform the independent-samples t test, and the
following chapter shows how to perform the paired-samples t test (also called the
correlated-samples t test or related-samples t test). Your choice of which procedure to use
depends in part upon whether the observations in the two samples are independent. When
observations are independent, you may analyze the data using an independent-samples t
test; when observations are not independent, you should analyze the data using a
paired-samples t test.
This section explains the meaning of independence as it applies in this context and provides
examples of studies that provide data appropriate for the independent-samples t test versus
the paired-samples t test.
Independent samples. Suppose that you conduct an experiment in which you obtain scores
on a dependent variable under an experimental condition and a control condition. The scores
obtained in these two conditions are independent if the probability of a specific score
occurring under one condition is not influenced by the scores that occur under the other
condition. If the scores are independent, then the samples are considered independent and it
is appropriate to analyze the data using the independent-samples t test (assuming that certain
other assumptions are met).
There are a number of ways that researchers can create independent samples. One way is to
begin with a pool of subjects, and then randomly assign half of the subjects to the
experimental condition, and the other half to the control condition. With this procedure, the
scores that occur under one condition do not have any influence on scores that occur under
the other condition, and so they are independent. This research design is called a
randomized-subjects design. The randomized-subjects design is generally regarded as
being a fairly strong experimental design because the random assignment of subjects to
conditions makes it likely that the two groups will be equivalent on most important variables
at the outset of the study.


A second way to create independent samples is to use a subject variable (such as subject
sex) as a quasi-independent variable in a study. For example, assume that you conduct an
investigation in which you compare males versus females with respect to their scores on a
test of verbal reasoning. In this instance, the scores are again independent because a specific
score obtained under one condition (say the male condition) cannot be paired in any
meaningful way with the specific score obtained under the second condition (the female
condition). Therefore, it would again be appropriate to analyze the data using the
independent-samples t test. You should remember, however, that the type of research design
described here is not nearly as desirable as the randomized-subjects design described earlier.
This is because, when the so-called independent variable is actually a subject
characteristic (such as subject sex), it is not likely that the groups will be equivalent on most
important variables at the outset of the study.
Paired samples. In some types of investigations, observations are not independent. In those
studies, it is possible to pair scores obtained under one condition with scores obtained under
a different condition. The resulting samples are referred to as paired samples, correlated
samples, or related samples.
There are a number of ways that researchers can create paired samples. One way is to use a
repeated-measures design in conducting the study. With a repeated-measures design, each
subject is exposed to every treatment condition under the independent variable. This means
that each subject provides a score on the dependent variable under each condition of the
independent variable. Scores obtained in this way are no longer independent, because scores
obtained under different treatment conditions are obtained from the same set of subjects.
A second way to create paired samples is to use a matched-subjects design. With this
approach, the subjects that are assigned to different conditions are matched on some variable
of interest. This means that a subject in one condition is paired with a subject in another
condition. Subjects are paired because they are similar to each other on some matching
variable.
For example, assume that you randomly assign half of your subjects to an experimental
condition, and the other half to a control condition. Assume that the dependent variable in
your study will again be scores on a test of verbal reasoning. Therefore, you match the
subjects on IQ scores, because you believe that IQ scores will be correlated with scores on
the verbal reasoning dependent variable. This means that, if subject #1 (in the experimental
condition) has a high score on IQ, she will be matched with a subject in the control
condition who also has a high score on IQ. Scores on the dependent variable that are
obtained in this way are no longer independent, because they have been obtained from pairs
of subjects who are similar to each other on the matching variable. When the data are
analyzed, they will be analyzed in a special way to take advantage of this matching (this will
be illustrated in the next chapter).
Between-subjects designs versus within-subjects designs. Randomized-subjects designs
and other research procedures that produce data appropriate for an independent-samples t
test are typically referred to as between-subjects designs. Repeated-measures designs,
matched-subjects designs, and other procedures that produce data appropriate for a
paired-samples t test are typically referred to as within-subjects designs.

Situations Appropriate for the Independent-Samples t Test
Overview
The independent-samples t test is a test of group differences. You use this statistic when you
want to determine whether there is a significant difference between two groups with respect
to their mean scores on some numeric criterion (or dependent) variable. The t test is
appropriate when you are comparing exactly two groups, and is not appropriate for studies
that involve more than two groups. For guidance in analyzing data from studies with three or
more groups, see Chapters 15 and 16.
The first part of this section describes the types of situations in which this statistic is
typically computed, and discusses a few of the assumptions underlying the procedure. A
more complete summary of assumptions is presented at the end of this section.
Nature of the Predictor and Criterion Variables
Predictor variable. To perform an independent-samples t test, the predictor (or
independent) variable should be a dichotomous variable (i.e., a variable that assumes just
two values). The predictor variable in an independent-samples t test is simply the variable
that indicates which group a subject is in. The predictor variable may be assessed on any
scale of measurement: nominal, ordinal, interval, or ratio.
Criterion variable. The criterion (or dependent) variable should be a numeric variable that
is assessed on an interval or ratio scale of measurement.
The Type-of-Variable Figure
The following figure illustrates the types of variables that are typically being analyzed when
performing an independent-samples t test.
Criterion                  Predictor
Variable                   Variable

  Multi          =            Di

The symbol that appears to the left of the equals sign represents the criterion variable in the
analysis. The word Multi that appears in the figure shows that the criterion variable in an
independent-samples t test is typically a multi-value variable (a variable that assumes more
than six values in your sample).


The letters Di on the right of the equals sign show that the predictor variable in this
procedure is a dichotomous variable (i.e., a variable that assumes just two values). As was
stated earlier, the predictor variable in this procedure simply indicates which group a subject
is in.
Example of a Study Providing Data Appropriate for This Procedure
The study. Suppose you are a criminologist doing research on drunk driving. In your
current project, you wish to determine whether people who live in dry counties (counties
that prohibit the sale of alcohol) tend to drive under the influence of alcohol less frequently
than people who live in wet counties (counties that allow the sale of alcohol).
Suppose that you survey people in both types of counties about their behavior. Your
criterion variable is the number of times that each subject has driven a car under the
influence of alcohol in the past month. You then use an independent-samples t test to
determine whether the average score for the subjects in the dry counties is significantly
lower than the average score for subjects in the wet counties.
Why these data would be appropriate for this procedure. Earlier sections have indicated
that, to perform an independent-samples t test, you need a predictor variable. The predictor
variable should be a dichotomous variable. The predictor variable in this study was type of
county policy toward alcohol. You know that this is a dichotomous variable, because it
consists of just two values: dry versus wet (that is, each subject was classified as either
being from a dry county or from a wet county).
Earlier sections have stated that, to perform an independent-samples t test, the criterion
variable should be a numeric variable that is assessed on an interval or ratio scale of
measurement. The criterion variable in the present study is the number of times driving
under the influence of alcohol in past month. You know that scores on this criterion
variable are assessed on a ratio level of measurement because equal intervals between scale
scores have equal quantitative meaning, and also because there is a true zero point (i.e., if
someone has a score of zero, it means that he or she did not drink and drive at all).
An earlier section also indicated that, when researchers perform a t test, the criterion
variable is usually a multi-value variable. To determine whether this is the case for the
current study, you would use PROC FREQ to create a simple frequency table for the
criterion variable (similar to those shown in Chapter 5, "Creating Frequency Tables"). You
would know that the criterion is a multi-value variable if you observe more than six values
in its frequency table.
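As a sketch, if the survey responses were read into a hypothetical data set named DRIVING
with the criterion variable named DUI_FREQ (both names are assumptions for illustration),
the frequency table could be requested as follows:

PROC FREQ DATA=DRIVING;
   TABLES DUI_FREQ;   /* criterion: times driving under the influence */
RUN;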


Summary of Assumptions Underlying the Independent-Samples t Test

•  Level of measurement. The criterion variable should be assessed on an interval or ratio
   level of measurement. The predictor variable may be assessed on any level of
   measurement.

•  Independent observations. A given observation should not be dependent on any other
   observation in either group. In an experiment, this is normally achieved by drawing a
   random sample, and randomly assigning each subject to only one of the two treatment
   conditions. This assumption would be violated if a given subject contributed scores on the
   criterion variable under both treatment conditions. The independence assumption is also
   violated when one subject's behavior influences another subject's behavior within the
   same condition. For example, if subjects are given experimental instructions in groups of
   five, and are allowed to interact in the course of providing scores on the criterion variable,
   it is likely that their scores will not be independent: each subject's score is likely to be
   affected by the other subjects in that group. In these situations, scores from the subjects
   constituting a given group of five should be averaged, and these average scores should
   constitute the unit of analysis. None of the tests discussed in this text are robust against
   violations of the independence assumption.

•  Random sampling. Scores on the criterion variable should represent a random sample
   drawn from the populations of interest.

•  Normal distributions. Each sample should be drawn from a normally distributed
   population (you can use PROC UNIVARIATE with the NORMAL option to test the null
   hypothesis that the sample is from a normally distributed population; see the sketch
   following this list). If each sample contains over 30 subjects, the test is robust against
   moderate departures from normality (when a test is robust against violations of certain
   assumptions, it means that violating those assumptions will have only a negligible effect
   on the results). If the assumption of normality is violated, you may instead analyze your
   data using PROC NPAR1WAY. For guidance, see the NPAR1WAY procedure in the
   SAS/STAT User's Guide.

•  Homogeneity of variance. To use the equal-variances t test, you should draw the
   samples from populations with equal variances on the criterion. If the null hypothesis of
   equal population variances is rejected, you should use the unequal-variances t test. Both
   types of tests are provided in the output of PROC TTEST, to be described in the following
   section.
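The normality check mentioned in the list can be requested with PROC UNIVARIATE and
the NORMAL option. The following is a minimal sketch that assumes a data set named D1
containing the predictor VIDGRP and the criterion AGGRESS (the names used in Example
13.1, later in this chapter); the BY statement tests each group separately and requires that
the data first be sorted:

PROC SORT DATA=D1;
   BY VIDGRP;
RUN;

PROC UNIVARIATE DATA=D1 NORMAL;
   VAR AGGRESS;
   BY VIDGRP;   /* test normality separately within each group */
RUN;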


Results Produced in an Independent-Samples t Test


Overview
When you use PROC TTEST to perform an independent-samples t test, SAS automatically
performs a test of the null hypothesis and estimates a confidence interval for the difference
between the means. Using a few statistics that are included in the output of PROC TTEST,
it is also relatively easy to compute an index of effect size by hand. This section explains
the meaning of these results.
Test of the Null Hypothesis
Overview. This section provides a concrete example of another study that would provide
data appropriate for an independent-samples t test. It discusses the statistical null hypothesis
and alternative hypothesis that would be stated for the analysis. It discusses the distinction
between directional versus nondirectional alternative hypotheses as they apply to the
independent-samples t test. Finally, it discusses the meaning of rejecting the null hypothesis,
as it applies to the current example.
A study on memory. Suppose that you are conducting research on the herbal supplement
Ginkgo biloba. You wish to determine whether taking Ginkgo biloba affects subject
performance on a memory test.
You begin with a pool of 100 subjects. You randomly assign 50 of the subjects to an
experimental condition. Subjects in this condition will take 60 mg of Ginkgo biloba three
times per day for six weeks. You label this the ginkgo condition.

You randomly assign the other 50 subjects to a control condition. Subjects in this condition
take a placebo pill three times per day for six weeks. You label this the placebo condition.

At the end of the six-week period, you administer a memory test to all subjects. Scores on
this test may range from zero to 100, with higher scores representing better memory.
Assume that scores on this test are assessed on an interval scale of measurement and are
normally distributed within conditions.
Samples and populations. You assume that your sample of 50 subjects in the ginkgo
condition represents a population of subjects taking the same dose of Ginkgo biloba. The
symbol μ1 represents the mean score of this population on the memory test that you are
using. Of course, you cannot compute μ1 (because you cannot administer your memory test
to all members of the population), but you can compute the mean score on the memory test
for your sample of 50 subjects in the ginkgo condition. The symbol X̄1 will represent this
sample mean.

Similarly, you assume that your sample of 50 subjects in the placebo condition represents a
different population: a population of subjects who are not taking ginkgo. The symbol μ2
represents the mean score of this population on the memory test that you are using. Again,
you cannot actually compute μ2, but you can compute the mean memory test score for your
sample of 50 subjects in the placebo condition. The symbol X̄2 will represent this sample
mean.
The statistical null hypothesis for a nondirectional test. In Chapter 12, you learned that
you state statistical hypotheses differently depending on whether you plan to perform a
directional test or a nondirectional test. This section and the next section focus on the
situation in which you plan to perform a nondirectional test: a test in which you predict
that there will be a difference, but do not make a specific prediction regarding the nature of
that difference (i.e., a situation in which you do not predict which group will score higher on
the criterion variable). A nondirectional test is sometimes called a two-sided test or a
two-tailed test. A later section will show how to state null hypotheses and alternative
hypotheses when you wish to perform a directional test.
In earlier chapters, you learned that a statistical null hypothesis is a hypothesis of no
difference or no association. The statistical null hypothesis describes the results that you will
obtain if your independent variable has no effect. Symbolically, the null hypothesis for an
independent-samples t test can be stated in a number of ways. Here is one possibility:
H0: μ1 = μ2
The preceding null hypothesis states that the mean of population 1 is equal to the mean of
population 2. If you are conducting an experiment with two treatment conditions, this is
equivalent to saying that, in the population, the mean for the subjects in condition 1 is equal
to the mean for the subjects in condition 2.
To make this concrete, think about the study on Ginkgo biloba that you are conducting.
Assume for a moment that ginkgo has no effect on memory. If that were the case, you would
expect the mean score on the memory test obtained for the population of people taking
ginkgo to be equal to the mean score obtained for the population of people taking the
placebo. This means that, in verbal terms, you could state the null hypothesis for the current
study in this way: In the population, there is no difference between subjects in the ginkgo
condition versus subjects in the placebo condition with respect to their mean scores on the
criterion variable (the memory test).
You can see how this would be an appropriate representation of the null hypothesis. If
ginkgo really has no effect on memory, you would expect the mean memory test score for
the ginkgo population to be equal to the mean memory test score for the placebo population.
When you prepare an analysis report for an independent-samples t test, you should provide
the null hypothesis in symbolic form combined with the null hypothesis in verbal form.
Below is an example of how this could be done for the ginkgo study:
Statistical null hypothesis (H0): μ1 = μ2; In the population, there is no difference
between subjects in the ginkgo condition versus subjects in the placebo condition with
respect to their mean scores on the criterion variable (the memory test).


Some textbooks portray this statistical null hypothesis in a somewhat different fashion.
Some textbooks represent it in this way:

H0: μ1 - μ2 = 0

Symbolically, the preceding null hypothesis states that the difference between μ1 and μ2 is
equal to zero. You can see that this is essentially the same as saying that μ1 is equal to μ2,
because if two means are equal to each other, then the difference between them must be
equal to zero. Therefore, these two ways of representing the null hypothesis are essentially
equivalent.

Below is an example of how you might state this type of null hypothesis in an analysis
report:

Statistical null hypothesis (H0): μ1 - μ2 = 0; In the population, the difference between
subjects in the ginkgo condition versus subjects in the placebo condition with respect to
their mean scores on the criterion variable (the memory test) is equal to zero.
The statistical alternative hypothesis for a nondirectional test. The statistical alternative
hypothesis is a hypothesis that there is a difference or that there is an association. Again, this
section illustrates the nature of the alternative hypothesis when you are performing a
nondirectional test: a test in which you are not predicting which group will score higher on
the criterion variable.
Symbolically, a nondirectional alternative hypothesis for an independent-samples t test may
be stated in this way:

H1: μ1 ≠ μ2

The preceding alternative hypothesis states that the mean of population 1 is not equal to the
mean of population 2. If you are conducting an experiment with two treatment conditions,
this is equivalent to saying that, in the population, the mean for the subjects in condition 1 is
not equal to the mean for the subjects in condition 2.

In verbal terms, there are a number of ways that this alternative hypothesis may be stated.
Here is one possibility for expressing the alternative hypothesis in very general terms:

Statistical alternative hypothesis (H1): μ1 ≠ μ2; In the population, there is a difference
between subjects in the first condition versus subjects in the second condition with
respect to their mean scores on the criterion variable.

To make things more concrete, here is one way of stating the alternative hypothesis for the
Ginkgo biloba study described previously:

Statistical alternative hypothesis (H1): μ1 ≠ μ2; In the population, there is a difference
between subjects in the ginkgo condition versus subjects in the placebo condition with
respect to their mean scores on the criterion variable (the memory test).


The preceding section on the statistical null hypothesis indicated that some textbooks state
the null hypothesis in this way:

H0: μ1 - μ2 = 0

This null hypothesis states that the difference between the mean of population 1 and the
mean of population 2 is equal to zero (this is equivalent to saying that there is no difference
between the two population means). If you state your null hypothesis in this way, it is
appropriate to state a corresponding alternative hypothesis in a similar fashion. Below is one
way that this could be done for the ginkgo study:

Statistical alternative hypothesis (H1): μ1 - μ2 ≠ 0; In the population, the difference
between subjects in the ginkgo condition versus subjects in the placebo condition with
respect to their mean scores on the criterion variable (the memory test) is not equal to
zero.
Obtaining significant results with a nondirectional test. When you use PROC TTEST to
perform an independent-samples t test, the procedure analyzes data from your sample and
computes a t statistic. With other factors held constant, the greater the observed difference
between your two sample means, the larger this t statistic will be (in absolute terms).
SAS also computes a p value (probability value) associated with this t statistic. If this p
value is less than some standard criterion (alpha level), you will reject the null hypothesis.
This book recommends that you use an alpha level of .05. This means that, if your obtained
p value is less than .05, you will reject the null hypothesis that your two samples were
drawn from populations with the same mean. In this case, you will conclude that you have
statistically significant results.
When you perform a nondirectional test, the only requirement for rejecting the null
hypothesis is that you obtain a p value that is below some set criterion (such as an alpha
level of .05). When you perform a nondirectional test, it does not matter which sample mean
is higher than the other: as long as your p value is less than the criterion, you may reject the
null hypothesis.
The null and alternative hypotheses for a directional test. The preceding section showed
you how to state the null hypothesis and alternative hypothesis for a nondirectional test. This
section will show how to state these hypotheses for a directional test.
A directional test is a test in which you not only predict that there will be a difference, but
you also make a specific prediction regarding the nature of that difference. For example,
when you are performing an independent-samples t test, you would use a directional test if
you were making a specific prediction about which group was going to score higher on the
criterion variable. A directional test is sometimes called a one-sided test or a one-tailed test.
Consider the Ginkgo biloba study. It is possible that previous research suggests that taking
ginkgo should have a positive effect on memory. Therefore, you develop the following
research hypothesis: Subjects who take Ginkgo biloba will later demonstrate higher scores
on the memory test, compared to subjects who take the placebo.


It will probably be easier to understand the statistical hypotheses if you consider the
statistical alternative hypothesis prior to considering the statistical null hypothesis.
For the current study on ginkgo and memory, the statistical alternative hypothesis could be
stated in the following way:

Statistical alternative hypothesis (H1): μ1 > μ2; In the population, subjects in the ginkgo
condition will score higher than subjects in the placebo condition with respect to their
mean scores on the criterion variable (the memory test).

With the preceding alternative hypothesis, assume that μ1 represents the mean score on the
memory test for the ginkgo condition in the population, and μ2 represents the mean score on
the memory test for the placebo condition in the population. You can see that the symbolic
version of the alternative hypothesis predicts that μ1 > μ2 (i.e., that the ginkgo population will
score higher than the placebo population). This is consistent with the research hypothesis.
Below is the statistical null hypothesis that serves as counterpart to the preceding alternative
hypothesis:
Statistical null hypothesis (H0): μ1 ≤ μ2; In the population, subjects in the ginkgo
condition score lower than or equal to subjects in the placebo condition with respect to
their mean scores on the criterion variable (the memory test).
The preceding null hypothesis predicts that the ginkgo population scores "lower than or
equal to" the placebo population on the memory test. This hypothesis includes the words
"lower than" as well as the words "equal to" because there are two types of outcomes that
you could obtain that would fail to support your research hypothesis:

•  If there were no statistically significant difference between the ginkgo condition and the
   placebo condition, this outcome would fail to support your research hypothesis. This is
   why your null hypothesis must contain some type of statement to the effect that the ginkgo
   population is equal to the placebo population on the memory test.

•  If the placebo condition scores significantly higher than the ginkgo condition on the
   memory test, this outcome would also fail to support your research hypothesis (in fact, this
   outcome would be the exact opposite of your research hypothesis!). This is why your null
   hypothesis must contain some type of statement to the effect that the ginkgo population
   scores lower than the placebo population on the memory test.

When you performed a nondirectional test (earlier), your statistical null hypothesis was
stated in the following way:

H0: μ1 = μ2

But this type of null hypothesis would be inadequate for a directional test. For the reasons
stated above, the null hypothesis for the current ginkgo study must contain the ≤ sign rather
than the = sign.
This section has shown you how your null and alternative hypotheses should appear when
your research hypothesis predicts that μ1 > μ2. But what about a situation in which your
research hypothesis predicts that μ1 < μ2? In this situation, your null and alternative
hypotheses must predict the opposite direction of results.

For example, assume that the research literature actually suggests that Ginkgo biloba has a
negative effect on memory, rather than a positive effect. If this were the case, you might
state your research hypothesis in this way: "Subjects who take Ginkgo biloba will later
demonstrate lower scores on the memory test, compared to subjects who take the placebo."

Below is the alternative hypothesis that would be appropriate for such a research hypothesis:

Statistical alternative hypothesis (H1): μ1 < μ2; In the population, subjects in the ginkgo
condition will score lower than subjects in the placebo condition with respect to their
mean scores on the criterion variable (the memory test).

Below is the null hypothesis that would be appropriate for such a research hypothesis:

Statistical null hypothesis (H0): μ1 ≥ μ2; In the population, subjects in the ginkgo
condition score higher than or equal to subjects in the placebo condition with respect to
their mean scores on the criterion variable (the memory test).
Obtaining significant results with a directional test. When you perform a directional test
with PROC TTEST, you will again obtain a t statistic and a p value that is associated with
that statistic. It is important to remember, however, that the p value computed by SAS is the
p value for a nondirectional test. In order to compute the p value for a directional test, it is
necessary to divide this p value by 2.
For example, suppose that you perform the analysis, and the results of PROC TTEST
include a p value of .06. This is larger than the standard criterion of .05 that is recommended
by this book. Therefore, at first glance your results appear to be nonsignificant.
However, this p = .06 is relevant only to the nondirectional test. To compute the p value for
the directional test, you divide it by two, and arrive at an actual value of p = .03. This is
below the standard criterion of .05, which means that your results are in fact statistically
significant.
When you perform a directional test, it is important to remember that there are actually two
conditions that must be met before you may reject your null hypothesis. They are:

•  Your p value must be below some set criterion (as described above), and

•  Your sample means must be in the direction specified by the alternative hypothesis.

The second of these two conditions emphasizes that you may reject the null hypothesis only
if the mean that you predicted would be higher (in your alternative hypothesis) is, in fact,
higher.
For example, consider the ginkgo study. Assume that you began your analysis with the
following statistical alternative hypothesis:


Statistical alternative hypothesis (H1): μ1 > μ2; In the population, subjects in the ginkgo
condition will score higher than subjects in the placebo condition with respect to their
mean scores on the criterion variable (the memory test).
If this were the case, you would be justified in rejecting your null hypothesis only if your p
value were less than .05 (for example), and the sample mean for the ginkgo condition were,
in fact, higher than the sample mean for the placebo condition. If the sample mean for the
placebo condition were higher than the sample mean for the ginkgo condition, it would not
be appropriate to reject the null hypothesis even if the p value were less than .05.
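Both conditions can be checked mechanically once you have the output in hand. The
following DATA step is only a sketch, using hypothetical values for the two sample means
and for the nondirectional p value printed by PROC TTEST:

DATA _NULL_;
   P_TWO     = .06;   /* nondirectional p value from PROC TTEST */
   MEAN_GINK = 56;    /* hypothetical ginkgo sample mean        */
   MEAN_PLAC = 50;    /* hypothetical placebo sample mean       */
   P_ONE = P_TWO / 2; /* directional (one-tailed) p value       */
   IF P_ONE < .05 AND MEAN_GINK > MEAN_PLAC THEN
      PUT 'Reject the null hypothesis (directional test).';
   ELSE
      PUT 'Fail to reject the null hypothesis.';
RUN;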
Confidence Interval for the Difference between the Means
Confidence interval defined. When you use PROC TTEST to perform an
independent-samples t test, SAS also automatically computes a confidence interval for the
difference between the means. In the last chapter, you learned that a confidence interval is an
interval that extends from a lower confidence limit to an upper confidence limit and is
assumed to contain a population parameter with a stated probability, or level of confidence.
In that chapter, you learned that SAS can compute the confidence interval for a mean. In this
chapter, you will see that SAS can also compute a confidence interval for the difference
between two means.
An example. As an illustration, consider the Ginkgo biloba study described above. Assume
that the mean score on the memory test for a sample of subjects in the ginkgo condition is
56, and the mean score for a sample of subjects in the placebo condition is 50. The
difference between the two sample means is 56 - 50 = 6. This means that the observed
difference between the sample means is equal to 6. Assume that you analyze your data using
PROC TTEST, and SAS estimates that the 95% confidence interval for this difference
between the means extends from 4 (the lower confidence limit) to 8 (the upper confidence
limit). This means that there is a 95% probability that, in the population, the actual
difference between the ginkgo condition versus the placebo condition is somewhere between
4 and 8 points on the memory test. You do not know the actual difference between the
population means, but you estimate that there is a 95% probability that it is somewhere
between 4 and 8 points.

Notice that with this confidence interval you are not stating that there is a 95% probability
that the difference between the sample means is somewhere between 4 and 8 points. You
know exactly what the difference between the sample means is; you have already
computed it to be 6 points. The confidence interval computed by SAS is a probability
statement about the difference in the populations, not in the samples.
Effect Size
The need for an index of effect size. In Chapter 12, you learned that, when you perform a
significance test, it is best to supplement that test with an estimate of effect size. When you
analyze data from an experiment, an index of effect size indicates how large the treatment
effect was: It indicates how much change occurs in the dependent variable as a result of a
change in the independent variable. In Chapter 12, you also learned about an index of effect
size that can be used with a single-sample t test. With that statistical procedure, effect size
was defined as the degree to which the sample mean differs from the population mean,
stated in terms of the standard deviation of the population.
Effect size defined. When you perform an independent-samples t test, effect size can be
defined in a somewhat different way. With this procedure, effect size can be defined as the
degree to which one sample mean differs from a second sample mean, stated in terms of
the standard deviation of the population. The symbol for effect size is d (as was the case in
the previous chapter), and the formula (adapted from Thorndike and Dinnel [2001]) is as
follows:

   d = | X̄1 - X̄2 | / sp

where:

   X̄1 = the observed mean of sample 1 (i.e., the subjects in treatment condition 1)
   X̄2 = the observed mean of sample 2 (i.e., the subjects in treatment condition 2)
   sp = the pooled estimate of the population standard deviation.
You can see that the procedure for computing d is fairly straightforward: You simply
subtract one sample mean from the other sample mean and divide the absolute value of the
result by the pooled estimate of the population standard deviation. The resulting statistic
represents the number of standard deviations that the two means differ from one another.
The index of effect size is not automatically computed by PROC TTEST, although it can
easily be calculated by hand from other statistics that do appear on the procedure output. A
later section of this chapter will show how this is done, and will provide some guidelines for
interpreting the size of d.
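To preview that calculation, the arithmetic itself is easy to script. The following DATA step
is a sketch with placeholder values; in practice you would type in the two sample means and
the pooled standard deviation from your PROC TTEST output:

DATA _NULL_;
   XBAR1 = 56;                       /* mean of sample 1 (placeholder)       */
   XBAR2 = 50;                       /* mean of sample 2 (placeholder)       */
   SP    = 10;                       /* pooled estimate of the population SD */
   D     = ABS(XBAR1 - XBAR2) / SP;  /* effect size index                    */
   PUT D=;                           /* writes D=0.6 to the SAS log          */
RUN;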


Example 13.1: Observed Consequences for Modeled Aggression: Effects on Subsequent
Subject Aggression (Significant Differences)
Overview
This section provides a complete illustration of an independent-samples t test. It describes a
simple fictitious experiment that produces data appropriate for an independent-samples t
test. It shows how to use SAS to perform an independent-samples t test on data from this
study. It also shows how to handle the DATA step, how to run PROC TTEST, how to
interpret the output from PROC TTEST, and how to prepare a report that summarizes the
results of the analysis. This section illustrates significant results; a section that follows will
illustrate nonsignificant results.
Note: Although the study described here is fictitious, it was inspired by the actual
investigation reported by Bandura (1965).
The Study
The research hypothesis. Suppose that you are a social psychologist studying aggression in
children. In your investigation, you wish to determine whether exposure to aggressive
models causes children to behave more aggressively. You have a hypothesis that children
who see a model rewarded for engaging in aggressive behavior will themselves behave more
aggressively than children who see a model punished for engaging in aggressive behavior.
This research hypothesis is illustrated in Figure 13.1.

Figure 13.1. Causal relationship between consequences for model and the number of
subsequent subject aggressive acts, as predicted by the research hypothesis.


The causal arrow going from right to left in Figure 13.1 illustrates your prediction that
observed consequences for the model should have an effect on the number of subsequent
subject aggressive acts displayed by the subject (the child who observes the model). You
predict that:

•  When children observe a model behaving aggressively and receiving the consequence of
   being rewarded, those children will themselves subsequently behave more aggressively.

•  When children observe a model behaving aggressively and receiving the consequence of
   being punished, those children will themselves subsequently behave less aggressively.

The study. To test this hypothesis you conduct a study in two stages. In Stage 1, you begin
with a pool of 36 children, and randomly assign each child to either a model-rewarded
condition or to a model-punished condition. In both conditions, children watch a
videotape of a model behaving in an aggressive fashion with an inflatable Bobo doll. In
the videotape, the model punches the Bobo doll in the nose, strikes it with a rubber mallet,
and kicks it around the room.
However, the ending of the video is different for children in the two conditions (this is how
you manipulate the independent variable). For the 18 children in the model-rewarded
condition the video ends with a second adult entering the room, praising the model, and
rewarding him with candy and soft drinks. For the 18 children in the model-punished
condition, the video ends with the second adult entering the room, scolding the model, and
spanking him.
The independent variable in your study, therefore, is the observed consequences for the
model. Half of your subjects observe the model being rewarded, while the other half observe
the model being punished.
In Stage 2, you measure your dependent variable: the number of aggressive acts displayed
by the children after they have watched the video. In Stage 2, each child is given the
opportunity to play for 30 minutes (alone) in a room similar to the one seen in the video. The
play room contains a Bobo doll (as in the video), a rubber mallet, and a wide variety of other
toys. While the child plays, you and your assistants view the child through a one-way mirror
and count the number of aggressive acts performed by the child (e.g., punching the inflatable
doll). This number of aggressive acts serves as the dependent variable in your study.
The Predictor Variable and Criterion Variables in the Analysis
The predictor variable in your study is observed consequences for the model, which
consists of just two conditions: the model-rewarded condition versus the model-punished
condition. In the analysis, you will give this variable the SAS variable name VIDGRP,
which stands for video group.
The criterion variable in your study is the number of aggressive acts displayed by the
children during the 30-minute play period after they have watched the videotape. This
variable is assessed on a ratio scale, since it has equal intervals and a true zero point. In the
analysis, you will give it the SAS variable name AGGRESS, which refers to subject
aggressive acts.
Data Set to Be Analyzed
Table 13.1 provides the data set that you will analyze.
Table 13.1
Number of Aggressive Acts Displayed by Subjects as a Function of
Video Group
____________________________________
             Video          Aggressive
Subject      group (a)      acts
____________________________________
01           PUN            3
02           PUN            6
03           PUN            4
04           PUN            8
05           PUN            7
06           PUN            0
07           PUN            5
08           PUN            2
09           PUN            4
10           PUN            5
11           PUN            6
12           PUN            1
13           PUN            2
14           PUN            3
15           PUN            4
16           PUN            5
17           PUN            3
18           PUN            4
19           REW            8
20           REW            9
21           REW            5
22           REW            10
23           REW            7
24           REW            7
25           REW            5
26           REW            3
27           REW            4
28           REW            11
29           REW            6
30           REW            6
31           REW            9
32           REW            8
33           REW            6
34           REW            7
35           REW            7
36           REW            10
_____________________________________
(a) With the variable Video group, the value PUN identifies subjects in
the model-punished condition, and the value REW identifies subjects in
the model-rewarded condition.

You can see that Table 13.1 consists of three columns. The first column is headed Subject,
and simply provides a unique subject number for each participant.
The second column is headed Video group. The values that appear in this column indicate
the treatment condition to which a given subject was assigned. The value PUN identifies the
subjects who were assigned to the model-punished condition, and the value REW
identifies the subjects who were assigned to the model-rewarded condition. You can see
that Subjects 1-18 were assigned to the model-punished condition, and Subjects 19-36 were
assigned to the model-rewarded condition.
The third column in the table is headed Aggressive acts, and this column indicates the
number of aggressive acts that each child displayed in Stage 2 of the study, after watching
the video. You can see that Subject 1 displayed 3 aggressive acts, Subject 2 displayed 6
aggressive acts, and so forth.
The DATA Step for the Program
Suppose that you prepare a SAS program to input the data set presented in Table 13.1. You
use the SAS variable name SUB_NUM to represent subject numbers, the SAS variable
name VIDGRP to represent the video group predictor variable, and you use the SAS
variable name AGGRESS to represent subject scores on the criterion variable of the number
of aggressive acts. Following are the SAS statements that constitute the DATA step of this
program. Notice that a dollar sign ($) appears to the right of VIDGRP to identify it as a
character variable. Notice also that the data set appearing below the DATALINES statement
is essentially identical to the one appearing in Table 13.1.
OPTIONS LS=80 PS=60;
DATA D1;
INPUT SUB_NUM
VIDGRP $
AGGRESS;
DATALINES;
01 PUN 3
02 PUN 6
03 PUN 4
04 PUN 8
05 PUN 7
06 PUN 0
07 PUN 5
08 PUN 2
09 PUN 4
10 PUN 5
11 PUN 6
12 PUN 1
13 PUN 2
14 PUN 3
15 PUN 4
16 PUN 5
17 PUN 3
18 PUN 4
19 REW 8
20 REW 9
21 REW 5
22 REW 10
23 REW 7
24 REW 7
25 REW 5
26 REW 3
27 REW 4
28 REW 11
29 REW 6
30 REW 6
31 REW 9
32 REW 8
33 REW 6
34 REW 7
35 REW 7
36 REW 10
;

Writing the SAS Program


The PROC Step. The syntax for the PROC step that will perform an independent-samples t
test is as follows:
PROC TTEST DATA=data-set-name ALPHA=alpha-level;
CLASS predictor-variable;
VAR criterion-variable;
TITLE1 'your-name';
RUN;

In this syntax, the PROC TTEST statement contains the following option:
ALPHA=alpha-level
This ALPHA= option allows you to specify the size of the confidence interval that PROC
TTEST will estimate for the observed difference between sample means. Specifying
ALPHA=0.01 produces a 99% confidence interval, specifying ALPHA=0.05 produces a
95% confidence interval, and specifying ALPHA=0.1 produces a 90% confidence interval.
Assume that, in this analysis, you wish to create a 95% confidence interval. This means that
you will include the following option in the PROC TTEST statement:
ALPHA=0.05

The SAS statements. Here are the statements that will perform an independent-samples t
test on the current data set:
PROC TTEST DATA=D1 ALPHA=0.05;
CLASS VIDGRP;
VAR AGGRESS;
TITLE1 'JANE DOE';
RUN;
Some notes about the syntax:

The PROC step begins with the PROC TTEST statement, in which you provide the name
of the data set to be analyzed. In this case, the data set was D1.

In the CLASS statement, you provide the name of the predictor variable in the analysis.
For a t test, this will always be a dichotomous variable that simply indicates which group a
given subject is in. In this analysis, the predictor variable was VIDGRP.

In the VAR statement, you provide the name of the numeric criterion variable to be
analyzed. In the current analysis, the criterion variable was AGGRESS.

The PROC step ends with the usual TITLE1 and RUN statements.
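
Incidentally, if you want a 99% confidence interval rather than a 95% confidence
interval, only the ALPHA= option needs to change. Here is a minimal variation of the
preceding PROC step (everything else stays the same):
PROC TTEST DATA=D1 ALPHA=0.01;
CLASS VIDGRP;
VAR AGGRESS;
TITLE1 'JANE DOE';
RUN;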

The Complete SAS Program. Here is the program, including the DATA step, that you
can use to analyze the fictitious data from the preceding study:
OPTIONS LS=80 PS=60;
DATA D1;
INPUT SUB_NUM
VIDGRP $
AGGRESS;
DATALINES;
01 PUN 3
02 PUN 6
03 PUN 4
04 PUN 8
05 PUN 7
06 PUN 0
07 PUN 5
08 PUN 2
09 PUN 4
10 PUN 5
11 PUN 6
12 PUN 1
13 PUN 2
14 PUN 3
15 PUN 4
16 PUN 5
17 PUN 3
18 PUN 4
19 REW 8
20 REW 9
21 REW 5
22 REW 10
23 REW 7
24 REW 7
25 REW 5
26 REW 3
27 REW 4
28 REW 11
29 REW 6
30 REW 6
31 REW 9
32 REW 8
33 REW 6
34 REW 7
35 REW 7
36 REW 10
;
PROC TTEST DATA=D1 ALPHA=0.05;
CLASS VIDGRP;
VAR AGGRESS;
TITLE1 'JANE DOE';
RUN;

Results from the SAS Output


Output 13.1 presents the results obtained from the preceding program.
                                 JANE DOE                                     1

                             The TTEST Procedure

                                 Statistics

                           Lower CL          Upper CL  Lower CL
Variable  Class         N      Mean    Mean      Mean   Std Dev  Std Dev
AGGRESS   PUN          18    2.9766       4    5.0234    1.5443    2.058
AGGRESS   REW          18    6.0338  7.1111    8.1884    1.6256   2.1663
AGGRESS   Diff (1-2)         -4.542  -3.111     -1.68     1.709   2.1128

                                 Statistics

                         Upper CL
Variable  Class           Std Dev   Std Err  Minimum  Maximum
AGGRESS   PUN              3.0852    0.4851        0        8
AGGRESS   REW              3.2476    0.5106        3       11
AGGRESS   Diff (1-2)       2.7682    0.7043

                                  T-Tests

Variable    Method           Variances      DF    t Value    Pr > |t|
AGGRESS     Pooled           Equal          34      -4.42      <.0001
AGGRESS     Satterthwaite    Unequal      33.9      -4.42      <.0001

                            Equality of Variances

Variable    Method      Num DF    Den DF    F Value    Pr > F
AGGRESS     Folded F        17        17       1.11    0.8350

Output 13.1. Results of the PROC TTEST analysis of aggression data;
significant differences observed.

The results in Output 13.1 are divided into four sections. The contents of these sections
will be very briefly summarized here, and will be discussed in much greater detail later in
this chapter.
The first section of Output 13.1 contains a table labeled Statistics that provides simple
univariate statistics for the study's criterion variable. This table provides means, standard
deviations, and other statistics.
The second section contains another table also labeled Statistics that provides additional
univariate statistics, such as the standard error of the difference between means.
The third section contains a table headed T-Tests that provides the results of the
independent-samples t test. In fact, the results of two t tests are actually presented here; one
test assumes equal variances, and a second assumes unequal variances (more on this later).
Finally, a fourth table is headed Equality of Variances. This section presents the results of
an F' statistic which tests the null hypothesis that the two samples were drawn from
populations with equal variances.
Steps in Interpreting the Output
1. Make sure that everything looks right. Before reviewing the results of the t tests, you
should always review the results of the two tables headed Statistics to verify that there
were no obvious errors in preparing your SAS program. For example, you should verify that
the number of subjects in each condition (as reported in the output) is what you expect.
Reviewing these tables will also provide you with an understanding of the general trend in
your results. To make this easier, the two statistics tables from Output 13.1 are reproduced
as Output 13.2.
                                 JANE DOE                                     1

                             The TTEST Procedure

                                 Statistics

                           Lower CL          Upper CL  Lower CL
Variable  Class         N      Mean    Mean      Mean   Std Dev  Std Dev
AGGRESS   PUN          18    2.9766       4    5.0234    1.5443    2.058
AGGRESS   REW          18    6.0338  7.1111    8.1884    1.6256   2.1663
AGGRESS   Diff (1-2)         -4.542  -3.111     -1.68     1.709   2.1128

                                 Statistics

                         Upper CL
Variable  Class           Std Dev   Std Err  Minimum  Maximum
AGGRESS   PUN              3.0852    0.4851        0        8
AGGRESS   REW              3.2476    0.5106        3       11
AGGRESS   Diff (1-2)       2.7682    0.7043

Output 13.2. Two statistics tables from the PROC TTEST analysis of
aggression data; significant differences observed.

Below the heading Variable you will find the name of the criterion variable in your
analysis. In this case, you can see that the criterion variable is AGGRESS (the number of
aggressive acts displayed by subjects).

Below the heading Class you will find the names of the values that were used to identify
the two treatment conditions. In the present analysis, you can see that these two values were
PUN (which was used to identify subjects in the model-punished condition) and REW
(which was used to identify subjects in the model-rewarded condition). To the right of PUN
you find statistics relevant to the model-punished sample (e.g., this sample's mean and
standard deviation on the criterion variable). To the right of REW you find statistics relevant
to the model-rewarded sample.
The third entry down in the Class column is Diff (1-2). This row provides information
about the difference between sample 1 (the model-punished sample) and sample 2 (the
model-rewarded sample). Among other things, this row reports the difference between the
means of these two samples on the criterion variable.
The column labeled N indicates the number of subjects in each sample. You can see that
there were 18 subjects in the model-punished condition, and also 18 subjects in the
model-rewarded condition.
The column headed Mean provides the average score for each sample on the criterion
variable. Output 13.2 shows that the mean score for subjects in the model-punished
condition was 4, and that the mean score for subjects in the model-rewarded condition was
7.1111. This means that, on the average, children in the model-rewarded condition
displayed a greater number of aggressive acts. However, at this point you do not know
whether this difference is statistically significant. You will learn that later, when you review
the results of the t test.
The third entry in the Mean column (to the right of Diff (1-2)) is -3.111. This indicates
that the difference between the means of the model-punished condition versus the
model-rewarded condition is equal to -3.111.
The column headed Std Dev provides the estimated population standard deviations (sx) for
the two samples (computed separately). You can see that the estimated population standard
deviation for the model-punished condition is 2.058, and the corresponding statistic for the
model-rewarded condition is 2.1663.
The third entry in the Std Dev column, to the right of Diff (1-2), is the pooled estimate of
the population standard deviation (sp). You will need this statistic when you later compute
the index of effect size, d. For the current analysis, Output 13.2 shows that the pooled
estimate of the population standard deviation is 2.1128.
The second statistics table has a column headed Std Err that provides standard errors. Of
greatest interest is the third entry down, which appears to the right of Diff (1-2). This
entry is the standard error of the difference between the means: the estimated standard
deviation of the sampling distribution of differences between means. For the current
analysis, you can see that the standard error of the difference is 0.7043.
The last two columns in the second statistics table are headed Minimum and
Maximum. These columns display the lowest observed score and the highest observed
score (respectively) for your two samples. It is a good idea to review these columns to verify
that they do not contain any out-of-bounds values. An out-of-bounds value is a value that
is either too low or too high to be reasonable, given the nature of your criterion variable. In
the present study, the criterion variable was the number of aggressive acts displayed by the
children during a 30-minute period after they watched the videotape. These columns show
that, for subjects in the model-punished condition, scores on this criterion variable ranged
from a low of zero to a high of 8. For subjects in the model-rewarded condition, scores on
the criterion variable ranged from 3 to 11. For both groups these numbers seem reasonable.
Therefore, information from the Minimum and Maximum columns does not provide any
evidence that you made any obvious errors in typing your data or preparing other sections of
your SAS program.
2. Review the F' test for equality of variances. Output 13.1 shows that PROC TTEST
actually computes two t statistics, but only one of these will be relevant for a specific
analysis. One of the t statistics is the standard statistic based on the assumption that the two
samples were drawn from populations with equal variances. The second t statistic is based
on the assumption that the two samples were drawn from populations with unequal
variances.
To determine which t statistic is appropriate for your analysis, you will refer to the F' test
that appears in a table labeled Equality of Variances. This table appears toward the bottom
of the output of PROC TTEST. For convenience, that table is reproduced again at this point
as Output 13.3.
                            Equality of Variances

Variable    Method      Num DF    Den DF    F Value    Pr > F
AGGRESS     Folded F        17        17       1.11    0.8350

Output 13.3. The equality of variances table from the PROC TTEST
analysis of aggression data.

The heading Equality of Variances that appears above this table conveys the nature of the
null hypothesis that is being tested. This null hypothesis states that, in the population, there
is no difference between the model-punished condition versus the model-rewarded condition
with respect to their variances on the AGGRESS criterion variable (notice that this null
hypothesis deals with the difference between variances, not means). PROC TTEST
computes an F' statistic to test this null hypothesis. If the p value for the resulting F' test is
less than .05, you will reject the null hypothesis of no differences and conclude that the
variances are unequal. In this case, you will refer to the results of the unequal variances t
test. On the other hand, if the p value is greater than .05, you can tentatively conclude that
the variances are equal, and instead use the results of the equal variances t test.
The F' statistic relevant to this test appears below the heading F Value in Output 13.3. For
the present analysis, you can see that the F' value is 1.11.
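(If you want to verify this value by hand, the folded F statistic is the ratio of the larger
sample variance to the smaller sample variance. Using the standard deviations reported in
Output 13.2, this ratio is (2.1663)^2 / (2.058)^2 = 4.693 / 4.235, or approximately 1.11.)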
The p value for this F' statistic appears below the heading Pr > F. For the current analysis,
this p value is 0.8350. This means that the probability of obtaining an F' this large or larger
when the population variances are equal is quite large: it is 0.8350. Your obtained p value
is greater than the criterion of .05, and so you fail to reject the null hypothesis of equal
variances; instead, you tentatively conclude that the variances are equal. This means that
you can interpret the equal variances t statistic (in the step that follows this one).
To review, here is a summary of how you are to interpret the results presented in the
Equality of Variances table:

When the Pr > F' is nonsignificant (greater than .05), report the t test based on equal
variances.

When the Pr > F' is significant (less than .05), report the t test based on unequal
variances.

3. Review the t test for the difference between the means. You are now ready to
determine whether there is a significant difference between your sample means. To do this,
you will refer to the Statistics table and the T-Tests table from your output. For your
convenience, those tables are reproduced here as Output 13.4.
                                 JANE DOE                                     1

                             The TTEST Procedure

                                 Statistics

                           Lower CL          Upper CL  Lower CL
Variable  Class         N      Mean    Mean      Mean   Std Dev  Std Dev
AGGRESS   PUN          18    2.9766       4    5.0234    1.5443    2.058
AGGRESS   REW          18    6.0338  7.1111    8.1884    1.6256   2.1663
AGGRESS   Diff (1-2)         -4.542  -3.111     -1.68     1.709   2.1128

                                 Statistics

                         Upper CL
Variable  Class           Std Dev   Std Err  Minimum  Maximum
AGGRESS   PUN              3.0852    0.4851        0        8
AGGRESS   REW              3.2476    0.5106        3       11
AGGRESS   Diff (1-2)       2.7682    0.7043

                                  T-Tests

Variable    Method           Variances      DF    t Value    Pr > |t|
AGGRESS     Pooled           Equal          34      -4.42      <.0001
AGGRESS     Satterthwaite    Unequal      33.9      -4.42      <.0001

Output 13.4. The statistics table and t-tests table from the PROC TTEST
analysis of aggression data; significant differences observed.

The sample means for your two treatment conditions appear in the Statistics section of the
output in the column headed Mean. From Output 13.4, you can see that the mean for the
model-punished condition was 4, and the mean for the model-rewarded condition was
7.1111. There appears to be a fairly large difference between these two means. But is the
difference statistically significant? To find out, you will review the results of the t test.
The t test that you are about to review is a test of the following statistical null hypothesis
(the following is for a nondirectional test):
Statistical null hypothesis (H0): μ1 = μ2; In the population, there is no difference
between the subjects in the model-rewarded condition versus the subjects in the model-
punished condition with respect to their mean scores on the criterion variable (the
number of aggressive acts displayed by the subjects).
The t statistic relevant to this null hypothesis appears in the T-Tests section of the output.
From Output 13.4, you can see that there are actually two t statistics reported there.
Below the heading Variances, you can see the entries Equal and Unequal. To the right
of Equal, you will see the results for the equal-variances t test. You should report the
results in this row if the F' test (discussed in the preceding section) was nonsignificant. To
the right of Unequal, you will see the results for the unequal-variances t test. You should
report the results in this row if the F' test (discussed in the preceding section) was
significant. You will remember that the F' test for the current analysis was nonsignificant.
This means that you will focus on the results of the equal-variances t test (i.e., the results
that are presented in the Equal row).
The degrees of freedom for this test appear in the column headed DF. You can see that the
degrees of freedom for the equal-variances t test are 34.
The obtained t statistic for the current analysis appears in the column headed t Value. You
can see that the equal-variances t statistic for the current analysis is -4.42.
The probability value (p value) for this t statistic appears in the column headed Pr > | t |.
This value estimates the probability that you would obtain the present results if the null
hypothesis were true. From Output 13.4, you can see that the p value for the equal-variances
t statistic is <.0001. This means that the probability that you would obtain the present
results if the null hypothesis were true is less than .0001 (less than 1 in 10,000). This is a
very low probability. This book recommends that you should generally reject the null
hypothesis anytime that your obtained p value is less than .05. For the present analysis, your
p value is <.0001, which is much less than .05. Therefore, you will reject the null
hypothesis, and will conclude that the difference between your sample means is statistically
significant.
Rejecting the null hypothesis means that you will tentatively accept your statistical
alternative hypothesis. For the current study, the nondirectional alternative hypothesis may
be stated as follows:
Statistical alternative hypothesis (H1): μ1 ≠ μ2; In the population, there is a
difference between the subjects in the model-rewarded condition versus the subjects in
the model-punished condition with respect to their mean scores on the criterion variable
(the number of aggressive acts displayed by the subjects).
Earlier, you reviewed the column headed Mean in Output 13.4, and you saw that the mean
for the model-punished condition was 4, and the mean for the model-rewarded condition
was 7.1111. The relative size of these means, combined with the results of the t test, tells
you that the subjects in the model-rewarded condition displayed a significantly higher
number of aggressive acts than the subjects in the model-punished condition.

This section has shown you how to interpret the p value that SAS computed for a
nondirectional (two-tailed) test. If you instead wish to perform a directional (one-tailed) test,
you should divide the obtained p value by 2. For example, suppose that you perform an
analysis, and the SAS Output displays a p value of .0800. This is the p value for a
nondirectional test (because PROC TTEST computes this by default). If you wish to
perform a directional test instead, you would divide this p value by 2, resulting in a final p
value of .0400. This p value of .0400 is the appropriate p value for a directional test.
4. Review the confidence interval for the difference between the means. When you use
PROC TTEST, SAS computes a confidence interval for the difference between the means.
The PROC TTEST statement included in the program for the current analysis contained the
following ALPHA option:
ALPHA=0.05
This option causes SAS to compute the 95% confidence interval. If you had instead wanted
the 99% confidence interval, you would have used this option instead:
ALPHA=0.01
The 95% confidence interval appears in the Statistics table in the PROC TTEST output.
That table is reproduced here as Output 13.5:
                                 JANE DOE                                     1

                             The TTEST Procedure

                                 Statistics

                           Lower CL          Upper CL  Lower CL
Variable  Class         N      Mean    Mean      Mean   Std Dev  Std Dev
AGGRESS   PUN          18    2.9766       4    5.0234    1.5443    2.058
AGGRESS   REW          18    6.0338  7.1111    8.1884    1.6256   2.1663
AGGRESS   Diff (1-2)         -4.542  -3.111     -1.68     1.709   2.1128

Output 13.5. The lower confidence limit and upper confidence limit for the
difference between the means.

The row headed AGGRESS Diff (1-2) contains the information that you will need at this
point. This row provides information about the difference between condition 1 (the model-
punished condition) versus condition 2 (the model-rewarded condition). Where the row
headed AGGRESS Diff (1-2) intersects with the column headed Mean, you can see that
the observed difference between the sample means is -3.111. This difference is computed
by starting with the sample mean for condition 1 (the model-punished condition), and
subtracting from it the sample mean of condition 2 (the model-rewarded condition). This
observed difference indicates that, on the average, subjects in the model-punished condition
displayed 3.111 fewer aggressive acts, compared to subjects in the model-rewarded
condition.
As you remember from Chapter 12, a confidence interval extends from a lower confidence
limit to an upper confidence limit. To find the lower confidence limit for the difference, find
the location where the row headed AGGRESS Diff (1-2) intersects with the column
headed Lower CL Mean. There, you can see that the lower confidence limit for the
difference is -4.542. To find the upper confidence limit for the difference, find the location
where the row headed AGGRESS Diff (1-2) intersects with the column headed
Upper CL Mean. There, you can see that the upper confidence limit for the difference
is -1.68.
When they are combined, these findings indicate that the 95% confidence interval for the
difference between means extends from -4.542 to -1.68. This means that you can estimate
with a 95% probability that the actual difference between the mean of the model-punished
condition and the mean of the model-rewarded condition (in the population) is somewhere
between -4.542 and -1.68. Notice that this interval does not contain the value of zero. This
is consistent with your rejection of the null hypothesis in the previous section (i.e., you
rejected the null hypothesis which stated that, in the population, there is no difference
between the subjects in the model-rewarded condition versus the subjects in the model-
punished condition with respect to their mean scores on the criterion variable).
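(As a rough arithmetic check on these limits, the interval equals the observed difference
plus and minus the critical t value for 34 degrees of freedom, approximately 2.03,
multiplied by the standard error of the difference: -3.111 ± (2.03)(0.7043) yields limits of
approximately -4.54 and -1.68, matching the values in Output 13.5.)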
5. Compute the index of effect size. Earlier in this chapter, you learned that effect size can
be defined as the degree to which one sample mean differs from a second sample mean,
stated in terms of the standard deviation of the population. The symbol for effect size is d,
and the formula for effect size is as follows:

   d = | X̄1 - X̄2 | / sp

where:

   X̄1 = the observed mean of sample 1 (i.e., the subjects in treatment condition 1)
   X̄2 = the observed mean of sample 2 (i.e., the subjects in treatment condition 2)
   sp = the pooled estimate of the population standard deviation.
Although SAS does not automatically compute effect size, you can easily do so yourself
using the information that appears in the Statistics table from the output of PROC TTEST,
reproduced as Output 13.6.

                                 JANE DOE                                     1

                             The TTEST Procedure

                                 Statistics

                           Lower CL          Upper CL  Lower CL
Variable  Class         N      Mean    Mean      Mean   Std Dev  Std Dev
AGGRESS   PUN          18    2.9766       4    5.0234    1.5443    2.058
AGGRESS   REW          18    6.0338  7.1111    8.1884    1.6256   2.1663
AGGRESS   Diff (1-2)         -4.542  -3.111     -1.68     1.709   2.1128

Output 13.6. Information needed to compute the index of effect size.

In the preceding formula, X̄1 is the observed mean of sample 1 (which, in the present study,
is the model-punished sample). The column headed Mean shows that the mean for the
model-punished condition is 4. In the formula, X̄2 is the observed mean of sample 2 (which,
in the present study, is the model-rewarded sample). The column headed Mean shows that
the mean for the model-rewarded condition is 7.1111.
In the preceding formula, sp represents the pooled estimate of the population standard
deviation. This statistic appears in Output 13.6 in the location where the row headed
AGGRESS Diff (1-2) intersects with the column headed Std Dev. For the present
analysis, you can see that the pooled estimate of the population standard deviation is
2.1128.
You can now insert these statistics into the formula and compute the index of effect size in
this way:

   d = | X̄1 - X̄2 | / sp

   d = | 4.0000 - 7.1111 | / 2.1128

   d = | -3.1111 | / 2.1128

   d = 3.1111 / 2.1128

   d = 1.4725

   d = 1.47

And so the obtained index of effect size for the current analysis is 1.47. This means that the
sample mean for the model-punished condition differs from the sample mean of the
model-rewarded condition by 1.47 standard deviations. To determine whether this is a
relatively large difference or a relatively small difference, you can consult the guidelines
provided by Cohen (1969). Cohen's guidelines are reproduced in Table 13.2:


Table 13.2
Guidelines for Interpreting Effect Size
_________________________________________
Effect size            Obtained d statistic
_________________________________________
Small effect           d = .20
Medium effect          d = .50
Large effect           d = .80
_________________________________________

Your obtained d statistic of 1.47 is larger than the large effect value of .80 that appears in
Table 13.2. This means that the manipulation in your study produced a relatively large
effect.
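
If you would rather let SAS perform this arithmetic, a short DATA step such as the one
below will reproduce the calculation. This is only a convenience sketch, not part of the t
test itself: the data set name EFFECT and its variable names are arbitrary, and the three
input values are simply copied from Output 13.6.
DATA EFFECT;
MEAN1 = 4.0000;   * Mean for the model-punished condition;
MEAN2 = 7.1111;   * Mean for the model-rewarded condition;
SP = 2.1128;      * Pooled estimate of the population standard deviation;
D = ABS(MEAN1 - MEAN2) / SP;   * Index of effect size, d;
RUN;
PROC PRINT DATA=EFFECT;
RUN;
Printing the data set EFFECT should display a value for D of approximately 1.47.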
Summarizing the Results of the Analysis
The following format may be used to summarize the result of the analysis:
A) Statement of the research question: The purpose of this
study was to determine whether children who observe a model
being rewarded for engaging in aggressive behavior will later
demonstrate a greater number of aggressive acts, compared to
children who observe a model being punished for engaging in
aggressive behavior.
B) Statement of the research hypothesis: Children who observe
a model being rewarded for engaging in aggressive behavior
will later demonstrate a greater number of aggressive acts,
compared to children who observe a model being punished for
engaging in aggressive behavior.
C) Nature of the variables: This analysis involved one
predictor variable and one criterion variable.
The predictor variable was the observed consequences for the
model. This was a dichotomous variable that was assessed on
a nominal scale and included two levels: a model-rewarded
condition (coded as REW), and a model-punished condition
(coded as PUN).
The criterion variable was the number of aggressive acts
displayed by the children after observing the model. This
was a multi-value variable and was assessed on a ratio
scale.
D) Statistical test: Independent-samples t test.

E) Statistical null hypothesis (H0): μ1 = μ2; In the
population, there is no difference between the subjects in the
model-rewarded condition versus the subjects in the model-
punished condition with respect to their mean scores on the
criterion variable (the number of aggressive acts displayed by
the subjects).
F) Statistical alternative hypothesis (H1): μ1 ≠ μ2; In the
population, there is a difference between the subjects in the
model-rewarded condition versus the subjects in the model-
punished condition with respect to their mean scores on the
criterion variable (the number of aggressive acts displayed by
the subjects).
G) Obtained statistic: t = -4.42

H) Obtained probability (p) value: p < .0001

I) Conclusion regarding the statistical null hypothesis:
Reject the null hypothesis.
J) Confidence interval: Subtracting the mean of the model-
rewarded condition from the mean of the model-punished
condition resulted in an observed difference of -3.111. The
95% confidence interval for this difference extended from
-4.542 to -1.68.
K) Effect size: d = 1.47.

L) Conclusion regarding the research hypothesis: These
findings provide support for the study's research hypothesis.
M) Formal description of results for a paper: Results were
analyzed using an independent-samples t test. This analysis
revealed a significant difference between the two conditions,
t(34) = -4.42, p < .0001. The sample means are displayed in
Figure 13.2, which shows that subjects in the model-rewarded
condition scored significantly higher on aggression compared
to subjects in the model-punished condition (for model-
rewarded group, M = 7.11, SD = 2.17; for model-punished group,
M = 4.00, SD = 2.06). The observed difference between the
means was -3.11, and the 95% confidence interval for the
difference between means extended from -4.54 to -1.68. The
effect size was computed as d = 1.47. According to Cohen's
(1969) guidelines, this represents a relatively large effect.

N) Figure representing the results:

Figure 13.2. Mean number of subject aggressive acts as a function of the
observed consequences for the model.
Notes regarding the preceding report. Item M of the preceding report provided a
description of the results for a published paper. The second sentence of that summary reports
the obtained t statistic in the following way:
t(34) = -4.42, p < .0001.
The number 34 that appears in parentheses in the preceding excerpt represents the degrees of
freedom for the analysis. With an independent-samples t test, the degrees of freedom are
equal to N - 2, where N represents the total number of subjects from both groups combined.
In the present case, N = 36, so it makes sense that the degrees of freedom would be
36 - 2 = 34. In Output 13.4 (presented earlier), the degrees of freedom appeared below the
heading DF.
The third sentence of Item M contains the following excerpt:
...(for model-rewarded group, M = 7.11, SD = 2.17; for
model-punished group, M = 4.00, SD = 2.06)
In this excerpt, the symbol M represents sample mean and SD represents sample standard
deviation. These statistics may be found in Output 13.6: Means appear in the column
headed Mean, and standard deviations appear in the column headed Std Dev.

Example 13.2: An Illustration of Results Showing Nonsignificant Differences
Overview
This section presents the results of an analysis of a different data set, a data set designed to
produce nonsignificant results. This will allow you to see how nonsignificant results will
appear in the output of PROC TTEST. A later section will also show you how to summarize
nonsignificant results in an analysis report.
The SAS Output
Output 13.7 contains the results of the analysis of a different fictitious data set, one in which
the means for the two treatment conditions were not significantly different.
                                 JANE DOE                                     1

                             The TTEST Procedure

                                 Statistics

                           Lower CL          Upper CL  Lower CL
Variable  Class         N      Mean    Mean      Mean   Std Dev  Std Dev
AGGRESS   PUN          18    2.9766       4    5.0234    1.5443    2.058
AGGRESS   REW          18    3.9143  4.9444    5.9745    1.5544   2.0714
AGGRESS   Diff (1-2)         -2.343  -0.944    0.4542    1.6701   2.0647

                                 Statistics

                         Upper CL
Variable  Class           Std Dev   Std Err  Minimum  Maximum
AGGRESS   PUN              3.0852    0.4851        0        8
AGGRESS   REW              3.1054    0.4882        1        9
AGGRESS   Diff (1-2)       2.7052    0.6882

                                  T-Tests

Variable    Method           Variances      DF    t Value    Pr > |t|
AGGRESS     Pooled           Equal          34      -1.37      0.1790
AGGRESS     Satterthwaite    Unequal        34      -1.37      0.1790

                            Equality of Variances

Variable    Method      Num DF    Den DF    F Value    Pr > F
AGGRESS     Folded F        17        17       1.01    0.9789

Output 13.7. Results of PROC TTEST analysis of aggression data;
nonsignificant differences observed.

Interpreting the Output


Overview. You would normally interpret Output 13.7 following the same steps that were
listed in Example 13.1. To save space, however, this section focuses on the results that are
most relevant to the significance test, the confidence interval, and the index of effect size.

Review the F' test for equality of variances. This test appears in the table headed
Equality of Variances. You will remember that you begin with the null hypothesis that, in
the population, the variances for the two conditions are equal.
The p value associated with this null hypothesis is .9789. Because this is greater than the
usual criterion of .05, you conclude that it is nonsignificant and fail to reject the null
hypothesis. This means that you will interpret the equal-variances t test in the next step.
Review the t test for the difference between the means. In the column headed Mean,
you can see that the mean for the model-punished condition is 4, and the mean for the
model-rewarded condition is 4.9444. There is a difference between the means, but is this
difference large enough to be statistically significant? To find out, you will review the t
statistic.
You can see that the obtained t statistic for the current analysis is -1.37, and that the p value
associated with this statistic is .1790. This p value is above the standard criterion of .05, and
so you conclude that your results are nonsignificant and fail to reject the null hypothesis.
You will conclude that the observed difference between the means is probably due to
sampling error.
Review the confidence interval for the difference between the means. The Mean
column in Output 13.7 shows that the observed difference between the model-punished
condition and the model-rewarded condition is -.944.
The Lower CL Mean column shows that the lower confidence limit for this difference is
-2.343.
The Upper CL Mean column shows that the upper confidence limit for this difference is
.4542. Combined, this means that the 95% confidence interval for the difference between
means extends from -2.343 to .4542. Notice that this interval does include the value of zero,
which is consistent with your finding that the difference between means is nonsignificant.
Compute the index of effect size. You have already seen that the mean for the model-
punished condition is 4, and the mean for the model-rewarded condition is 4.9444. The only
other piece of information that you need to compute the effect size is sp, the pooled estimate
of the population standard deviation.
This appears in Output 13.7 as the third entry in the column headed Std Dev. There, you
can see that the pooled estimate is 2.0647. You may now insert these statistics into the
formula for effect size:
   d = | X̄1 - X̄2 | / sp

   d = | 4.0000 - 4.9444 | / 2.0647

   d = | -.9444 | / 2.0647

   d = .9444 / 2.0647

   d = .4574

   d = .46

And so the index of effect size for the current analysis is .46. According to Cohen's
guidelines appearing in Table 13.2, this falls somewhere between a small effect and a
medium effect.
Summarizing the Results of the Analysis
Here is an example of how you might summarize the preceding analysis:
A) Statement of the research question: The purpose of this
study was to determine whether children who observe a model
being rewarded for engaging in aggressive behavior will later
demonstrate a greater number of aggressive acts, compared to
children who observe a model being punished for engaging in
aggressive behavior.
B) Statement of the research hypothesis: Children who observe
a model being rewarded for engaging in aggressive behavior
will later demonstrate a greater number of aggressive acts,
compared to children who observe a model being punished for
engaging in aggressive behavior.
C) Nature of the variables: This analysis involved one
predictor variable and one criterion variable.
The predictor variable was the observed consequences for the
model. This was a dichotomous variable that was assessed on
a nominal scale and included two levels: a model-rewarded
condition (coded as REW), and a model-punished condition
(coded as PUN).
The criterion variable was the number of aggressive acts
displayed by the children after observing the model. This
was a multi-value variable and was assessed on a ratio
scale.
D) Statistical test: Independent-samples t test.

E) Statistical null hypothesis (H0): μ1 = μ2; In the
population, there is no difference between the subjects in the
model-rewarded condition versus the subjects in the model-
punished condition with respect to their mean scores on the
criterion variable (the number of aggressive acts displayed by
the subjects).
F) Statistical alternative hypothesis (H1): μ1 ≠ μ2; In the
population, there is a difference between the subjects in the
model-rewarded condition versus the subjects in the model-
punished condition with respect to their mean scores on the
criterion variable (the number of aggressive acts displayed by
the subjects).
G) Obtained statistic: t = -1.37

H) Obtained probability (p) value: p = .1790

I) Conclusion regarding the statistical null hypothesis: Fail
to reject the null hypothesis.

J) Confidence interval: Subtracting the mean of the model-
rewarded condition from the mean of the model-punished
condition resulted in an observed difference of -.944. The 95%
confidence interval for this difference extended from -2.343
to .4542.
K) Effect size: d = .46.

L) Conclusion regarding the research hypothesis: These
findings fail to provide support for the study's research
hypothesis.
M) Formal description of results for a paper: Results were
analyzed using an independent-samples t test. This analysis
revealed a nonsignificant difference between the two
conditions, t(34) = -1.37, p = .1790. The sample means are
displayed in Figure 13.3, which shows that the subjects in the
model-rewarded condition displayed a mean score on aggression
which was similar to that displayed by the subjects in the
model-punished condition (for model-rewarded group, M = 4.94,
SD = 2.07; for model-punished group, M = 4.00, SD = 2.06). The
observed difference between the means was -.94, and the 95%
confidence interval for the difference between means extended
from -2.34 to .45. The effect size was computed as d = .46.
According to Cohen's (1969) guidelines, this falls somewhere
between a small effect and a medium effect.

N) Figure representing the results:

Figure 13.3. Mean number of subject aggressive acts as a function of the
observed consequences for the model (nonsignificant results).

Conclusion
An earlier section in this chapter indicated that researchers make a distinction between an
independent-samples t test versus a paired-samples t test. This chapter has focused on the
independent-samples test, which is appropriate when the subjects in Condition 1 and
Condition 2 are two entirely different groups of people, and when the subjects in Condition
1 are not matched with subjects in Condition 2 in any systematic way.
But what if subjects in Condition 1 were matched with subjects in Condition 2? For
example, what if each subject in Condition 1 were paired with a corresponding subject in
Condition 2 on the basis of similarity in demographic characteristics? For example, a young
male Caucasian in Condition 1 might be paired with a young male Caucasian in Condition 2;
a middle-aged female African-American in Condition 1 might be paired with a middle-aged
female African-American in Condition 2, and so forth. Data obtained from a study such as
this should not be analyzed using the independent-samples t test discussed in the current
chapter, because the observations from the two groups are no longer independent. Instead,
these data should be analyzed using a paired-samples t test, a statistical procedure that is
covered in the following chapter.

Paired-Samples t Test
Introduction..........................................................................................453
Overview................................................................................................................ 453
Situations Appropriate for the Paired-Samples t Test ........................453
Overview................................................................................................................ 453
The Independent-Samples t Test versus the Paired-Samples t Test ..................... 453
Nature of the Predictor and Criterion Variables ..................................................... 454
The Type-of-Variable Figure .................................................................................. 454
Examples of Studies Providing Data Appropriate for This t Test ........................... 454
Summary of Assumptions Underlying the Paired-Samples t Test.......................... 456
Similarities between the Paired-Samples t Test and the
Single-Sample t Test .......................................................................457
Introduction ............................................................................................................ 457
A Repeated-Measures Study ................................................................................. 457
Outcome 1: The Manipulation Has an Effect ........................................................ 457
Outcome 2: The Manipulation Has No Effect........................................................ 459
Summary ............................................................................................................... 460
Results Produced in a Paired-Samples t Test .....................................461
Overview................................................................................................................ 461
Test of the Null Hypothesis .................................................................................... 461
Confidence Interval for the Difference between the Means ................................... 462
Effect Size.............................................................................................................. 463

Example 14.1: Women's Responses to Emotional versus
Sexual Infidelity ...............................................................463
Overview................................................................................................................ 463
Background............................................................................................................ 464
The Study .............................................................................................................. 464
The Predictor and Criterion Variables in the Analysis ............................................ 466
Data Set to Be Analyzed........................................................................................ 467
The SAS DATA Step for the Program.................................................................... 468
The PROC Step for the Program ........................................................................... 468
The Complete SAS Program ................................................................................. 471
Steps in Interpreting the Output ............................................................................. 472
Summarizing the Results of the Analysis............................................................... 479
Notes Regarding the Preceding Report ................................................................. 481
Example 14.2: An Illustration of Results Showing
Nonsignificant Differences..............................................................483
Overview................................................................................................................ 483
The SAS Output..................................................................................................... 483
Steps in Interpreting the Output ............................................................................. 483
Summarizing the Results of the Analysis............................................................... 485
Conclusion............................................................................................487

Introduction
Overview
This chapter shows you how to use the SAS System to perform a paired-samples t test, also
known as a correlated-sample t test, a matched-samples t test, a t test for dependent samples,
and a t test for a within-subjects design. This is a parametric procedure that is appropriate
when you want to determine whether the mean score that is obtained under one condition is
significantly different from the mean score obtained under a second condition. With this test,
each score in one condition is paired with, or dependent upon, a specific score in the second
condition. This chapter shows you how to write the appropriate SAS program, interpret the
output, and prepare a report that summarizes the results of the analysis.

Situations Appropriate for the Paired-Samples t Test


Overview
The paired-samples t test is a test of differences between means. You use this test when you
want to compare average scores on a criterion variable obtained under two conditions, to
determine whether there is a significant difference between the two means. Your study
should contain no more than two conditions. The criterion variable must be on an interval or
ratio level, and each observation in one set of scores must be paired in a meaningful way
with a corresponding observation in the second set of scores.
This section describes the types of situations in which this statistic is typically computed. A
summary of assumptions underlying the procedure is presented at the end of the section.
The Independent-Samples t Test versus the Paired-Samples t Test
Before conducting a t test, it is important to determine whether your data should be analyzed
using an independent-samples t test or a paired-samples t test. Briefly, the independent-
samples t test is appropriate if the observations that are obtained under one treatment
condition are independent of (unrelated to) the observations obtained under the other
treatment condition. For example, this would be the case if both of the following were true:

you conducted an experiment with one group of subjects in the experimental condition and
an entirely different group of subjects in the control condition

you made no effort to match subjects in the two conditions.

In contrast, a paired-samples t test is appropriate if each observation in one set of scores is
paired in a meaningful way with a corresponding observation in the second set of scores.
This is normally accomplished either by using a repeated-measures design or a matching
procedure.
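In terms of the SAS program itself, the most visible difference is in the PROC step: rather
than naming a predictor variable in a CLASS statement and a criterion variable in a VAR
statement, a paired analysis names the two sets of scores in a PAIRED statement. The
following is only a minimal preview sketch; the data set name and the variable names
TIME1 and TIME2 are hypothetical, and a complete program is developed later in this
chapter.
PROC TTEST DATA=D1 ALPHA=0.05;
PAIRED TIME1*TIME2;
TITLE1 'JANE DOE';
RUN;
In the PAIRED statement, each subject's score on TIME1 is paired with that same
subject's score on TIME2.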

This chapter provides examples of studies that would provide data appropriate for a paired-
samples t test. A more complete discussion of the differences between independent samples
versus paired samples is provided in Chapter 13, Independent-Samples t Test, in the
section titled Independent Samples versus Paired Samples.
Nature of the Predictor and Criterion Variables
Predictor variable. To perform a paired-samples t test, the predictor (or independent)
variable should be a dichotomous variable (i.e., a variable that assumes only two values).
The predictor variable may be assessed on any scale of measurement: nominal, ordinal,
interval, or ratio.
Criterion variable. The criterion (or dependent) variable should be a numeric variable that
is assessed on an interval or ratio scale of measurement.
The Type-of-Variable Figure
The figure below illustrates the types of variables that are typically being analyzed when
performing a paired-samples t test.
Criterion          Predictor

  Multi      =       Di
The Multi symbol that appears in the above figure shows that the criterion variable in a
paired-samples t test is typically a multi-value variable (a variable that assumes more than
six values in your sample).
The Di symbol that appears to the right of the equal sign in the above figure shows that
the predictor variable in this procedure is a dichotomous variable (i.e., a variable that
assumes only two values).
Examples of Studies Providing Data Appropriate for This t Test
Overview. Earlier, this chapter stated that the paired-samples t test is typically used to
analyze data from studies that employ either a repeated-measures design or a subject-
matching procedure. These two approaches to research are illustrated in the following two
studies.
Study 1. A physiological psychologist wants to determine whether there is a relationship
between the affiliative motive (the desire for warm relationships with others) and release of
the neurotransmitter, dopamine. She measures dopamine levels in saliva in a single group of
subjects at Time 1, and then shows the group a film designed to arouse the affiliative
motive. After showing the film (at Time 2), she again measures dopamine levels in saliva.
She analyzes the data to determine whether the mean dopamine level obtained at Time 2 is
significantly higher than the mean dopamine level obtained at Time 1. Because this research
design involves taking repeated measures from a single sample of subjects, she uses a
paired-samples t test to compare Time 1 dopamine levels to Time 2 dopamine levels.
Why these data would be appropriate for this procedure. In this study, the predictor
variable is treatment condition (before the affiliative-film condition versus after the
affiliative-film condition). You know that this is a dichotomous variable because it consists
of only two conditions.
The criterion variable is dopamine levels in saliva. Earlier sections indicated that, to
perform a paired-samples t test, the criterion variable must be on an interval or ratio scale of
measurement. Here, you can assume that the researcher's measure of dopamine levels in
saliva is on a ratio scale because it has equal intervals and a true zero point (i.e., when
subjects have a score of zero, it means that they have no dopamine in their saliva).
Finally, earlier sections indicated that, to perform a paired-samples t test, each score in one
condition must be paired with, or dependent upon, a specific score in the second condition.
The researcher achieves this in the current study by using a repeated-measures research
design. Each subject provides scores on the criterion variable under both experimental
conditions: They provide scores on the dopamine measure (a) at Time 1 (before watching
the film), and again (b) at Time 2 (after watching the film). In performing the paired-
samples t test, each subject's score at Time 1 will be paired with his or her score at Time 2
(later sections will show how this is done).
Study 2. A doctoral candidate in political science wants to determine whether the way that
political issues are presented affects citizen support for government programs. To find out,
he prepares two different versions of a film about a federal program that provides aid to the
poor.

The abstract version of the film deals with issues of poverty in abstract, impersonal
terms, using a good number of statistics and charts.

The personal version of the film discusses the same issues by focusing on the lives of
two families actually living in poverty.

He shows the abstract version of the film to one group of subjects, and the personal version
of the film to a second group. After the film, each subject rates his or her support for the
federal program described.
Before the study was conducted, each subject in the abstract group was paired with a similar
subject in the personal group. Subjects were matched so that the two subjects in each pair
were similar with respect to income, sex, and education. In other words, this is a study that
uses a matched-subjects design.
Why these data would be appropriate for this procedure. In this study, the predictor
variable is presentation condition (abstract condition versus personal condition). You
know that this is a dichotomous variable because it consists of only two conditions.
The criterion variable is rated support for the program. Most researchers would agree that,
if this criterion variable is assessed with a carefully developed summated rating scale, it is
probably on an interval scale of measurement (meaning that it displays equal intervals but
no true zero point).
As was stated earlier, an additional requirement for a paired-samples t test is the requirement
that each score in one condition must be paired with, or dependent upon, a specific score in
the second condition. Although there are two groups of subjects in this study, they are not
truly independent because the groups were formed using a matching procedure. Specifically,
subjects were matched for income, sex, and education. This means that (for example), a
wealthy female with a college education in the abstract condition was matched with a
wealthy female with a college education in the personal condition; a poor male with a high
school education in the abstract condition was matched with a poor male with a high school
education in the personal condition, and so on. In performing the t test, a particular subject's
score on the criterion variable in one group will be paired with the score of his or her
matched counterpart in the other group.
Summary of Assumptions Underlying the Paired-Samples t Test
Level of measurement. The criterion variable should be assessed on an interval or ratio
level of measurement. The predictor variable should be a dichotomous variable (that is, it
should have only two categories), and it may be assessed on any level of measurement.
Paired observations. A particular observation that appears in one condition must be paired
in some meaningful way with a corresponding observation that appears in the other
condition. This is often accomplished by using a repeated-measures design in which each
subject contributes one score under Condition 1, and a separate score under Condition 2.
Observations can also be paired by using a matching procedure.
Independent observations. A particular subject's score in one condition should not be
affected by any other subject's score in either of the two conditions. It is, of course,
acceptable for a subject's score in one condition to be dependent upon his or her own score
in the other condition. This is another way of saying that it is acceptable for subjects' scores
in Condition 1 to be correlated with their scores in Condition 2.
Random sampling. Subjects contributing data should represent a random sample drawn
from the populations of interest.
Normal distribution for difference scores. The differences in paired scores should be
normally distributed. These difference scores are usually created by (a) beginning with a subject's score on the criterion variable obtained under one treatment condition, and (b) subtracting from it that subject's score on the criterion variable obtained under the other treatment condition (the nature of these difference scores will be discussed in greater
detail in the following section). It is not necessary that the individual criterion variables be
normally distributed, as long as the distribution of difference scores is normally distributed.
Homogeneity of variance. The populations represented by the two conditions should have
equal variances on the criterion.
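If you wish to examine the normality assumption empirically, one possible check (a sketch, not part of any program presented in this chapter) is to compute the difference scores in a DATA step and submit them to PROC UNIVARIATE with the NORMAL option, which prints tests of normality. The data set name PAIRS and the variable name D below are hypothetical:

PROC UNIVARIATE DATA=PAIRS NORMAL;
   VAR D;     * D = the difference scores (Condition 1 minus Condition 2);
   TITLE1 'NORMALITY CHECK ON DIFFERENCE SCORES';
RUN;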

Similarities between the Paired-Samples t Test and the Single-Sample t Test
Introduction
This section explains why a paired-samples t test is essentially equivalent to a single-sample
t test. It begins by describing a fictitious study that would provide data appropriate for a
paired-samples t test. It shows how it is possible to use the data from this study to create
difference scores and then perform a single-sample t test on these difference scores. Finally, it
explains why the results of this single-sample t test can be used to determine whether there
is a significant difference between the scores that were obtained under the two treatment
conditions. Tables illustrate the type of data that you would expect to see if your
experimental manipulation had an effect, as well as the type of data that you would expect to
see if your manipulation had no effect.
A Repeated-Measures Study
Suppose that you conduct a study in which you wish to determine whether taking the herb
ginkgo biloba will affect subjects' scores on a test of memory. For your study, you ask a
sample of six subjects to ingest 60 mg of ginkgo daily for one month. At the end of the
month you administer a memory test to these subjects. With this test, higher scores indicate
better memory. You refer to this as the ginkgo condition (Condition 1).
Later, you have the same six subjects ingest placebo pills (pills that do not contain any
substance expected to have any effect on memory). After doing this for one month, you have
these subjects take the same memory test that was taken earlier. You refer to this as the
placebo condition (Condition 2).
Outcome 1: The Manipulation Has an Effect
The results. Table 14.1 provides some fictitious data obtained from the memory study.
These are the type of data that you would expect to see if your manipulation had an effect on
memory test scores.


Table 14.1
Scores on the Memory Task Obtained under the Ginkgo Condition versus the
Placebo Condition (Outcome 1)
________________________________________________________
              Ginkgo           Placebo
              condition        condition        Difference
Subject       (Condition 1)    (Condition 2)    scores (D)
________________________________________________________
   1               50               40               10
   2               65               60                5
   3               40               34                6
   4               62               50               12
   5               70               64                6
   6               61               52                9
________________________________________________________
Means:             58               50                8
________________________________________________________

Below the heading "Subject" in the table, you will find subject numbers assigned to each participant (six subjects participated in the study). Below the heading "Ginkgo condition (Condition 1)" you will find each subject's score on the memory task under the ginkgo condition. Below the heading "Placebo condition (Condition 2)" you will find each subject's score on the memory test under the placebo condition.
Reviewing the results one row at a time, you can see that Subject 1 had a score of 50 on the
memory test under the ginkgo condition, and a score of 40 under the placebo condition.
Subject 2 had a score of 65 under the ginkgo condition, and a score of 60 under the placebo
condition. Results for the remaining subjects can be interpreted in the same way.
The mean scores obtained under the two conditions appear to the right of the heading "Means" at the bottom of the table. In Table 14.1, you can see that the mean memory test
score obtained under the ginkgo condition was 58, and the mean memory test score obtained
under the placebo condition was 50. Since the mean score obtained under the ginkgo
condition was somewhat higher than that obtained under the placebo condition, this may be
seen as evidence that taking ginkgo has a positive effect on memory (you should not draw
any conclusions at this point as you have not yet performed any statistical analyses).
Creating difference scores. For each subject, it is possible to create a difference score by
subtracting the score obtained under the placebo condition from the score obtained under
the ginkgo condition. This difference score will be represented using the symbol D. In
Table 14.1, difference scores have already been created for each subject, and they appear
in the column headed "Difference scores (D)." You can see that

•  The difference score for Subject 1 was 10 (because 50 - 40 = 10)

•  The difference score for Subject 2 was 5 (because 65 - 60 = 5)

•  The difference score for Subject 3 was 6 (because 40 - 34 = 6)

The difference scores for the remaining subjects can be interpreted in the same way.

In Table 14.1, the mean of the difference scores is shown where the row headed "Means" intersects with the column headed "Difference scores (D)." You can see that the average difference score was 8. This means that, on the average, scores obtained under the ginkgo condition were 8 points higher than scores obtained under the placebo condition. Again, this is consistent with the idea that ginkgo might have a positive effect on memory (if the average difference score had been a negative number, it would have been consistent with the idea that ginkgo has a negative effect on memory).
Performing a single-sample t test. Until now, you have been looking at the data, but have
not performed any significance tests. If you wanted to know whether there is a significant
difference between the mean memory test score obtained under the ginkgo condition versus
the placebo condition, it would be possible to perform a single-sample t test on the scores
appearing in the column headed Difference scores (D). In this analysis, you could test the
null hypothesis that this sample was drawn from a population in which the mean difference
score is equal to zero. Symbolically, this null hypothesis can be represented like this:
H0: = 0
Again, your null hypothesis states that your sample was drawn from a population in which
the mean difference score is equal to zero. This is because the null hypothesis is the
hypothesis of no difference; it is the hypothesis that states that your manipulation did not
have any effect. If ginkgo does not really have any effect on memory, this means that (in the
population) there should be no difference between average memory scores obtained under
the ginkgo condition versus those obtained under the placebo condition.
When you perform a paired-samples t test, you are actually performing a single-sample t test
on a sample of difference scores (D scores). In this analysis, you test the null hypothesis that
your sample was drawn from a population of D scores in which the average score is equal to
zero. If the average difference score observed in your sample is substantially different from
zero, you will reject this null hypothesis and conclude that your manipulation probably did
have an effect. If the average difference score obtained in your sample is fairly close to zero,
you will fail to reject the null hypothesis and conclude that your manipulation probably did
not have an effect.
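To make this equivalence concrete, the following sketch (based on the data in Table 14.1; the data set name MEMORY and the variable names GINKGO, PLACEBO, and D are used here only for illustration) computes each subject's difference score in the DATA step and then performs a single-sample t test on the D scores:

OPTIONS LS=80 PS=60;
DATA MEMORY;
   INPUT SUB_NUM GINKGO PLACEBO;
   D = GINKGO - PLACEBO;     * difference score for each subject;
DATALINES;
1 50 40
2 65 60
3 40 34
4 62 50
5 70 64
6 61 52
;
PROC TTEST DATA=MEMORY H0=0;
   VAR D;                    * single-sample t test on the D scores;
RUN;

This program produces the same t statistic and p value that you would obtain by analyzing the two original variables with the PAIRED statement that is described later in this chapter.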
Outcome 2: The Manipulation Has No Effect
As an additional example, this section provides some fictitious results that would be
consistent with the idea that ginkgo does not have an effect on memory. Table 14.2 contains
the same headings that appeared in Table 14.1, but you can see that a different data set has
now been inserted in the body of the table.


Table 14.2
Scores on the Memory Task Obtained under the Ginkgo Condition versus the
Placebo Condition (Outcome 2)
________________________________________________________
              Ginkgo           Placebo
              condition        condition        Difference
Subject       (Condition 1)    (Condition 2)    scores (D)
________________________________________________________
   1               50               45                5
   2               65               70               -5
   3               40               39                1
   4               62               63               -1
   5               70               66                4
   6               61               65               -4
________________________________________________________
Means:             58               58                0
________________________________________________________

To the right of the heading "Means" (toward the bottom of the table), you can see that the observed sample mean obtained under the ginkgo condition is 58, and the mean obtained under the placebo condition is now also 58. Because the means for the two conditions are now identical, these data fail to support the idea that ginkgo has a positive effect on memory.
In the column headed "Difference scores (D)," you can see that some of the difference
scores are now positive, and some are negative. In fact, the negative difference scores now
cancel out the positive ones, resulting in a mean difference score of zero. Again, this result
fails to support the idea that ginkgo has a positive effect on memory.
In the present example, the average difference score is equal to zero. However, remember
that your obtained average difference score does not have to be exactly equal to zero to be
seen as evidence that your manipulation had no effect. Even when the manipulation is
ineffective, you should still expect the mean difference score to sometimes be a bit above
zero, and to sometimes be a bit below zero, simply due to sampling error. To determine whether it is far enough above (or below) zero to reject the null hypothesis, you should perform
a paired-samples t test.
Summary
In summary, a paired-samples t test is essentially equivalent to a single-sample t test. When
you use SAS to perform a paired-samples t test, the application (a) creates difference scores
by subtracting the scores obtained under Condition 2 from the scores obtained under
Condition 1, and (b) performs a single-sample t test on these difference scores. If the
probability value (p value) obtained from this single-sample t test is significant (i.e., if the p
value is less than .05), you might conclude that you have a statistically significant difference
between the mean scores obtained in Condition 1 versus Condition 2.

Results Produced in a Paired-Samples t Test


Overview
When you perform a paired-samples t test, you can interpret the test of the null hypothesis,
the confidence interval for the difference between means, and the index of effect size. This
section discusses the meaning of these results.
Test of the Null Hypothesis
The single-sample convention. The preceding section indicated that a paired-samples t test
is essentially equivalent to a single-sample t test. For this reason, many statistics textbooks
show readers how to state a null hypothesis for a paired-samples test using a convention
similar to that used with the single-sample t test.
For example, here is one example of how you could state a nondirectional statistical null
hypothesis for the preceding memory study using the single-sample convention:
Statistical null hypothesis (H0): µD = 0; In the population, the average difference score
created by subtracting placebo condition scores from ginkgo condition scores is equal to
zero.
Here is the corresponding nondirectional statistical alternative hypothesis:
Statistical alternative hypothesis (H1): µD ≠ 0; In the population, the average difference
score created by subtracting placebo condition scores from ginkgo condition scores is not
equal to zero.
When you perform the paired-samples t test, SAS computes an obtained t statistic and a p
value associated with that statistic. If your obtained p value is less than .05, you may reject
the null hypothesis and tentatively accept the alternative hypothesis. In your report, you
would indicate that you obtained a statistically significant difference between mean scores
for the two conditions.
The two-sample convention. There is, however, another way to state the same hypotheses.
Because a paired-samples t test typically involves comparing scores obtained under two
treatment conditions, it is also possible to use conventions for stating hypotheses similar to
those introduced in Chapter 13, "Independent-Samples t Test." The following illustrates the
way that you would state a nondirectional statistical null hypothesis for the memory study
using the two-sample convention:
Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference
between the placebo condition versus the ginkgo condition with respect to mean scores on
the criterion variable (the memory test).

Here is the corresponding nondirectional statistical alternative hypothesis:
Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the population, there is a difference
between the placebo condition versus the ginkgo condition with respect to mean scores on
the criterion variable (the memory test).
Again, when you perform the paired-samples t test, you review the obtained t statistic and p
value computed by SAS. If the p value is less than .05, you reject the null hypothesis and
tentatively accept the alternative.
Convention to be used in this chapter. These two approaches to stating hypotheses are
essentially equivalent, so either approach may be used. To maintain continuity with the
preceding chapter, however, this chapter will focus on the two-sample convention as was
illustrated above.
Nondirectional versus directional tests. With the paired-samples t test, you may perform
either a nondirectional test (in which you do not make a specific prediction about which
condition will display the higher mean) or a directional test (in which you do make a
specific prediction about which condition will display the higher mean). The exact nature of
your null hypothesis and alternative hypothesis will vary depending on whether you plan to
perform a directional test or a nondirectional test. To conserve space, however, this section
will not review the differences between these two types of tests and hypotheses because they
were discussed in detail in the previous chapter. See the section titled "Test of the Null Hypothesis" in Chapter 13, "Independent-Samples t Test."
Confidence Interval for the Difference between the Means
When you use PROC TTEST to perform a paired-samples t test, SAS automatically
computes a confidence interval for the difference between the means. In Chapter 12, "The Single-Sample t Test," you learned that a confidence interval is an interval that extends
from a lower confidence limit to an upper confidence limit and is assumed to contain a
population parameter with a stated probability, or level of confidence. In that chapter, you
learned that SAS can compute the confidence interval for a mean. When you perform a
paired-samples t test, SAS instead computes a confidence interval for the difference between
the means observed in the two conditions.
For example, again consider the ginkgo biloba study described earlier. Assume that the
mean score on the memory test under the ginkgo condition is 56, and the mean score under
the placebo condition is 50. The difference between the two means is 56 - 50 = 6. This
means that the observed difference between means is equal to 6. Assume that you analyze
your data using PROC TTEST, and SAS estimates that the 95% confidence interval for this
difference extends from 4 (the lower confidence limit) to 8 (the upper confidence limit).
This means that there is a 95% probability that, in the population, the actual difference
between the ginkgo condition versus the placebo condition is somewhere between 4 and 8
points on the memory test.

Effect Size
When you perform a paired-samples t test, effect size can be defined as the degree to which
the mean obtained under one condition differs from the mean obtained under the second
condition, stated in terms of the standard deviation of the population of difference scores.
The symbol for effect size is d, and the formula (adapted from Spatz [2001]) is as follows:
d = | X1 - X2 | / sD

where:
X1 = the observed mean of the sample of scores obtained under Condition 1
X2 = the observed mean of the sample of scores obtained under Condition 2
sD = the estimated standard deviation of the population of difference scores.
You can see that the procedure for computing d is fairly straightforward: You simply
subtract one mean from the other mean and divide the absolute value of the result by the
estimated standard deviation of the difference scores. The resulting statistic represents the
number of standard deviations that the two means differ from one another.
The index of effect size is not automatically computed by PROC TTEST, although it can
easily be calculated by hand from other statistics that do appear on the output of PROC
MEANS and PROC TTEST. A later section of this chapter will show how this is done, and
will provide some guidelines for interpreting the size of d.

Example 14.1: Women's Responses to Emotional versus Sexual Infidelity
Overview
This section describes a fictitious study in which female subjects provide scores on a
measure of psychological distress under two treatment conditions: (a) while imagining their
partner engaging in emotional infidelity, and (b) while imagining their partner engaging in
sexual infidelity. The chapter shows how to perform a paired-samples t test to determine
whether the mean distress score obtained under the emotional infidelity condition is
significantly different from the mean distress score obtained under the sexual infidelity
condition.
The section begins by describing the study and the data produced by the subjects. It shows
how to prepare the SAS DATA step and how to write the PROC statements that will provide
the needed results. It shows how to interpret the SAS output and prepare an analysis report.
Special emphasis is placed on

•  interpreting the test of the null hypothesis

•  interpreting the confidence interval for the difference between means

•  computing the index of effect size.

Note: Although the investigation described here is fictitious, it was inspired by an actual
study reported by Buunk et al. (1996).
Background
This fictitious study investigates the way that people respond to infidelity in a romantic
partner. Some researchers in the field of sociobiology believe that there may be differences
between women versus men with respect to what types of jealousy-provoking situations
cause them the greatest distress (e.g., Daly et al., 1982). Some have predicted that

•  a man should be more distressed by the thought that his partner has had sexual intercourse with another man, compared to the thought that his partner has developed a deep emotional attachment to another man

•  a woman should be more distressed by the thought that her partner has developed a deep emotional attachment to another woman, compared to the thought that her partner has had sexual intercourse with another woman.

The study described in this section addresses the second of these two predictions (i.e., the prediction that a woman should be more distressed at the thought that her partner has developed a deep emotional attachment, compared to a sexual attachment).
The Study
Overview. Suppose that you are a sociologist conducting research in this area. There are a
number of ways that you could test the ideas about sex and jealousy presented in the
preceding section. One approach might involve focusing on just one of the sexes (for example, women) and determining whether the way that women respond to infidelity
depends, in part, on the nature of the infidelity.
For example, you might ask each subject in a sample of women to imagine how she would
feel if she had learned that her partner had committed emotional infidelity (i.e., that her
partner formed a deep emotional attachment to another person). While imagining this, each
woman could rate how distressed she would feel in this situation.
Next, you could ask each member of the same group of women to imagine how she
would feel if she had learned that her partner had committed sexual infidelity (i.e., that her
partner had experienced sexual intercourse with another person). While imagining this, she
could again rate how distressed she would feel.
Research hypothesis. Here, your research question might be summarized in this fashion: "The purpose of this study is to determine whether there is a difference between emotional infidelity versus sexual infidelity with respect to the amount of psychological distress that they produce in women."

Your research hypothesis might be "When asked to imagine how they would feel if they learned that their partner had been unfaithful, women will display higher levels of psychological distress when imagining emotional infidelity than when imagining sexual infidelity."
In summary, your study is designed to determine whether the type of infidelity (emotional
versus sexual) has an effect on the subjects' level of distress. The causal nature of your
research hypothesis is illustrated in Figure 14.1.

Figure 14.1. Causal relationship between type of infidelity and psychological distress, as predicted by the research hypothesis.

Statistical hypotheses. You will use the two-sample convention for stating your statistical
hypotheses. Your research hypothesis (above) is technically a directional hypothesis,
because it predicts that distress scores will be higher under the emotional infidelity condition
than under the sexual infidelity condition. Nevertheless, you will state your statistical
hypotheses as nondirectional hypotheses to avoid concentrating all of your region of
rejection in just one tail of the sampling distribution. Below is the nondirectional statistical
null hypothesis for your analysis:
Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference
between the emotional infidelity condition versus the sexual infidelity condition with
respect to mean scores on the criterion variable (the measure of psychological distress).
Here is the corresponding nondirectional statistical alternative hypothesis:
Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the population, there is a difference
between the emotional infidelity condition versus the sexual infidelity condition with
respect to mean scores on the criterion variable (the measure of psychological distress).
Research method. Suppose that you conduct the study using a repeated-measures design
with 17 female subjects. Each subject is presented with a set of 12 scenarios that may or
may not cause her to experience psychological distress. Some examples of the scenarios
include "failing to get a promotion at work," "learning that a friend is ill," and "being in a minor automobile collision."
For each of the 12 scenarios, the subjects read the description of the situation and try to
imagine how they would feel if this event actually happened to them. After imagining this,
they rate how distressed they feel by responding to the following four items (for each item,
the subject would circle one number from 1 to 7):
Not at all distressed    1   2   3   4   5   6   7    Extremely distressed
Not at all upset         1   2   3   4   5   6   7    Extremely upset
Not at all angry         1   2   3   4   5   6   7    Extremely angry
Not at all hurt          1   2   3   4   5   6   7    Extremely hurt

For a given subject, you sum her responses to the four items, resulting in a single distress
score that can range from a low of 4 (if she circled 1s for each item) to a high of 28 (if she
circled 7s for each item). With this scale, higher scores indicate higher levels of distress.
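As a brief illustration of this scoring procedure, the following DATA step sketch sums the four ratings to create the distress score. The data set name RAW and the variable names ITEM1 through ITEM4 are hypothetical; the program presented later in this chapter begins with the distress scores already computed:

DATA SCORED;
   SET RAW;                                    * hypothetical data set containing the four ratings;
   DISTRESS = ITEM1 + ITEM2 + ITEM3 + ITEM4;   * possible scores range from 4 to 28;
RUN;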
Although the subjects respond in this way to 12 different scenarios, there are only two
scenarios that you are actually interested in: one that deals with emotional infidelity, and
one that deals with sexual infidelity. The emotional infidelity scenario reads as follows:
Imagine how you would feel if your romantic partner formed a
deep emotional attachment to another person.
The sexual infidelity scenario reads as follows:
Imagine how you would feel if your romantic partner
experienced sexual intercourse with another person.
Obviously, subjects would make one set of distress ratings for the emotional infidelity
scenario, and a different set of distress ratings for the sexual infidelity scenario. As the
researcher, you want to determine whether the ratings obtained under the emotional
infidelity condition are significantly higher than the ratings obtained under the sexual
infidelity condition. Because the same group of people provided ratings under both
conditions, this is essentially a repeated-measures design. You may therefore analyze the
data using a paired-samples t test.
The Predictor and Criterion Variables in the Analysis
Technically, the predictor variable in your study is type of infidelity, a nominal-level
variable that consists of just two conditions (emotional infidelity versus sexual infidelity).
The criterion variable in your study is distress, measured by the subjects' scores on the four-item ratings described above. The distress measure is on an interval scale, as it has approximately equal intervals but no true zero point.
However, in this analysis you will not create one SAS variable to represent your predictor
variable (type of infidelity), and a separate SAS variable to represent scores on the criterion
(distress), as might be the case if you were going to perform an independent-samples t test.
Instead, you will create both of the following:

•  one SAS variable to contain distress scores obtained under the emotional infidelity condition

•  a second SAS variable to contain distress scores obtained under the sexual infidelity condition.

The following sections show how the data should be arranged.


Data Set to Be Analyzed
Following are the (fictitious) scores on the distress measure obtained for the 17 subjects
under the two treatment conditions:
Table 14.3
Data from the Infidelity Study (Significant Differences)
________________________________________
                 Distress scores
            _________________________
            Emotional      Sexual
            infidelity     infidelity
Subject     condition      condition
________________________________________
  01            21             18
  02            24             19
  03            23             21
  04            27             24
  05            25             25
  06            24             21
  07            26             22
  08            25             21
  09            28             21
  10            20             19
  11            22             20
  12            27             23
  13            26             23
  14            23             22
  15            22             19
  16            22             20
  17            23             20
________________________________________

The preceding data set consists of three columns. The first column is headed "Subject," and provides a unique number for each subject. The next two columns appear under the major heading "Distress scores." The values in these two columns indicate the scores that each subject displayed on the measure of psychological distress. The values appearing in the column titled "Emotional infidelity condition" indicate the distress scores that the subjects displayed when they imagined their partners engaging in emotional infidelity. The values appearing in the column titled "Sexual infidelity condition" indicate the distress scores that the subjects displayed when they imagined their partners engaging in sexual infidelity.
Each horizontal row in the preceding table presents the distress scores obtained for a specific
subject under the two treatment conditions. For example, the first row presents results for
Subject 1, who provided a distress score of 21 under the emotional infidelity condition, and
a distress score of 18 under the sexual infidelity condition. Subject 2 provided a score of 24 under the emotional condition, and a score of 19 under the sexual condition. The remaining
rows may be interpreted in the same fashion.
The SAS DATA Step for the Program
Suppose that you prepare a SAS program to input the data set presented in Table 14.3. You use the SAS variable name SUB_NUM to represent subject numbers, the SAS variable name EMOTION to represent subjects' distress scores obtained under the emotional infidelity condition, and the SAS variable name SEXUAL to represent subjects' scores obtained under the sexual infidelity condition.
Following are the statements that constitute the DATA step of this program. Notice that the
INPUT statement makes use of the SAS variable names that were described in the previous
paragraph. Notice also that the data set itself (which appears after the DATALINES
statement) is essentially identical to the data set that appears in Table 14.3.
OPTIONS LS=80 PS=60;
DATA D1;
INPUT SUB_NUM
EMOTION
SEXUAL;
DATALINES;
01 21 18
02 24 19
03 23 21
04 27 24
05 25 25
06 24 21
07 26 22
08 25 21
09 28 21
10 20 19
11 22 20
12 27 23
13 26 23
14 23 22
15 22 19
16 22 20
17 23 20
;

The PROC Step for the Program


Overview. The PROC step of your program will include two SAS procedures. First, PROC
MEANS enables you to review the means and other descriptive statistics obtained on the
criterion variable under the two treatment conditions. Second, PROC TTEST enables you to
produce a test of the null hypothesis and the confidence interval for the difference between
means. This section shows you how to request both procedures.

Syntax for PROC MEANS. When performing a paired-samples t test, the syntax for PROC
MEANS is as follows:
PROC MEANS DATA=data-set-name;
   VAR criterion-variable-1 criterion-variable-2;
   TITLE1 'your-name';
RUN;

The second line of the preceding syntax contains the VAR statement. You can see that the
VAR statement contains the entries criterion-variable-1 and criterion-variable-2. These
entries represent scores on the criterion variable obtained under Condition 1 and scores
on the criterion variable obtained under Condition 2, respectively.
SAS statements for PROC MEANS. Here are the statements that would request PROC
MEANS for the current analysis:
PROC MEANS DATA=D1;
VAR EMOTION SEXUAL;
TITLE1 'JANE DOE';
RUN;
You can see that the PROC MEANS statement specifies DATA=D1. This is because D1
is the name of the data set that you created.
Below the PROC MEANS statement, the VAR statement lists EMOTION and SEXUAL.
This is because EMOTION contains scores on the criterion variable obtained under
Condition 1 (the emotional infidelity condition), and SEXUAL contains scores on the
criterion variable obtained under Condition 2 (the sexual infidelity condition). Notice how
this is consistent with the syntax of the VAR statement presented above.
Syntax for PROC TTEST. The syntax for the section of the program that will request a
paired-samples t test is as follows:
PROC TTEST   DATA=data-set-name
             H0=comparison-number
             ALPHA=alpha-level;
   PAIRED criterion-variable-1*criterion-variable-2;
   TITLE1 'your-name';
RUN;

In the preceding syntax, the PROC TTEST statement contains the following option:
H0=comparison-number
The comparison-number that appears in this option should be the mean difference score
expected under the null hypothesis. When you perform a paired-sample t test, the mean
difference score that is usually expected under the null hypothesis is zero. Therefore, you
should generally use the following option when performing a paired-samples t test:
H0=0

Note that the "0" that appears in the preceding option "H0" is a zero (0), and is not the uppercase letter "O." In addition, the "0" that appears to the right of the equal sign is also a zero, and is not the uppercase letter "O."
If you omit the H0 option from the PROC TTEST statement, the default comparison number is zero. This means that, in most cases, there is no harm in omitting this option.
The syntax of the PROC TTEST statement also contains the following option:
ALPHA=alpha-level
This ALPHA option enables you to specify the size of the confidence interval that you will
estimate around the difference between the means. Specifying ALPHA=0.01 produces a
99% confidence interval, specifying ALPHA=0.05 produces a 95% confidence interval, and
specifying ALPHA=0.1 produces a 90% confidence interval. Suppose that, in this analysis,
you wish to create a 95% confidence interval. This means that you will include the
following option in the PROC TTEST statement:
ALPHA=0.05
The preceding general form includes the following PAIRED statement:
PAIRED criterion-variable-1*criterion-variable-2;

In the PAIRED statement, you should list the names of the two SAS variables that contain
the scores on the criterion variable obtained under the two treatment conditions. Notice that
there is an asterisk (*) that separates the two variable names.
An earlier section of this chapter indicated that when SAS performs a paired-samples t test it
subtracts scores obtained under one condition from scores obtained under the other
condition to create a new variable that consists of the resulting difference scores. The order
in which you type your criterion variable names in the PAIRED statement determines how these difference scores are created. Specifically, SAS subtracts scores on criterion-variable-2 from scores on criterion-variable-1. In other words, it subtracts scores on the variable on the right side of the asterisk from scores on the variable on the left side of the asterisk.
SAS statements for PROC TTEST. Here are the actual statements that request SAS to
perform a paired-samples t test on the present data set:
PROC TTEST DATA=D1 H0=0 ALPHA=0.05;
PAIRED EMOTION*SEXUAL;
RUN;
In the preceding PROC TTEST statement, you have requested the option H0=0. This option
requests that SAS test the null hypothesis that your sample was drawn from a population in
which the average difference score was equal to zero.
The PROC TTEST statement also includes the option ALPHA=0.05. This statement
requests that SAS compute the 95% confidence interval for the difference between the
means.

The PAIRED statement lists the SAS variable EMOTION on the left side of the asterisk and
SEXUAL on the right side. This means that scores on SEXUAL will be subtracted from
scores on EMOTION to compute difference scores. With this arrangement, if the mean
difference score is a positive number, you will know that the subjects displayed higher
distress scores under the emotional infidelity condition than under the sexual infidelity
condition, on the average. On the other hand, if the mean difference score is a negative
number, you will know that the subjects displayed higher distress scores under the sexual
infidelity condition than under the emotional infidelity condition.
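To illustrate this point, here is a hypothetical variation on the preceding statements in which the two variable names are reversed in the PAIRED statement:

PROC TTEST DATA=D1 H0=0 ALPHA=0.05;
   PAIRED SEXUAL*EMOTION;   * difference scores = SEXUAL minus EMOTION;
RUN;

With this version, the sign of each difference score (and of the mean difference score) would simply be reversed; the absolute value of the t statistic and the p value would be unchanged.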
The Complete SAS Program
Below is the complete SAS program, including the DATA step, to analyze the fictitious data from the preceding study. Notice that PROC MEANS and PROC TTEST are both included in the same program.
OPTIONS LS=80 PS=60;
DATA D1;
INPUT SUB_NUM
EMOTION
SEXUAL;
DATALINES;
01 21 18
02 24 19
03 23 21
04 27 24
05 25 25
06 24 21
07 26 22
08 25 21
09 28 21
10 20 19
11 22 20
12 27 23
13 26 23
14 23 22
15 22 19
16 22 20
17 23 20
;
PROC MEANS DATA=D1;
VAR EMOTION SEXUAL;
TITLE1 'JANE DOE';
RUN;
PROC TTEST DATA=D1 H0=0 ALPHA=0.05;
PAIRED EMOTION*SEXUAL;
RUN;

Steps in Interpreting the Output


Overview. The preceding program produces two pages of output. The first page contains the
results produced by PROC MEANS. These results include the means and standard
deviations (along with other information) for the variables EMOTION and SEXUAL. The
second page of output contains the results of the paired-samples t test (test of the null
hypothesis, confidence intervals, and other information).
This section shows you how to make sense of this output. It shows you:

•  how to verify that there were no obvious errors in typing the data or program

•  how to determine which treatment condition displayed the higher mean on the criterion variable

•  how to interpret the t test for the difference between the means

•  how to interpret the confidence interval

•  how to compute the index of effect size.

1. Make sure that everything looks correct. The first page of output provides results from PROC MEANS performed on EMOTION and SEXUAL. These results appear here as Output 14.1.
                              JANE DOE
                         The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
EMOTION     17      24.0000000       2.2912878      20.0000000      28.0000000
SEXUAL      17      21.0588235       1.9193289      18.0000000      25.0000000
-------------------------------------------------------------------------------
Output 14.1. Reviewing the results of PROC MEANS for signs of possible errors.

Below the heading "Variable" you will see the names of the variables being analyzed. In the row to the right of "EMOTION" you will find descriptive statistics for EMOTION, and in the row to the right of "SEXUAL" you will find descriptive statistics for SEXUAL.
Check the number of usable observations in the column headed "N" to verify that the data set includes the expected number of subjects. Here, the N is 17, as expected.
Remember that the variable EMOTION contains psychological distress scores obtained under the emotional infidelity condition, and the variable SEXUAL contains psychological distress scores obtained under the sexual infidelity condition. With the distress variable, the lowest possible score was supposed to be 4, and the highest possible score was supposed to be 28. With this in mind, you can review the descriptive statistics in Output 14.1 to verify that you did not key any impossible values. An impossible value is a value that is out of bounds: one that is either lower than the lowest possible score (4), or higher than the highest possible score (28).

The average distress scores obtained under the two conditions can be found in the column headed "Mean." In Output 14.1, you can see that the mean for EMOTION was 24.0000, and the mean for SEXUAL was 21.0588. Both of these means seem reasonable for a scale where scores could range from 4 to 28.
The column titled "Minimum" contains the lowest score that was observed for each variable. Output 14.1 shows that the lowest observed score on EMOTION was 20, and the lowest observed score on SEXUAL was 18. Neither of these is lower than the lowest possible score of 4, and so you see no obvious evidence of an error in typing your data.
The column titled "Maximum" shows the highest score that was observed for each variable. Output 14.1 shows that the highest observed score on EMOTION was 28, and the highest observed score on SEXUAL was 25. Neither of these is higher than the highest possible score of 28, and so you again see no obvious evidence of an error in typing your data.
Next, look at the results produced by PROC TTEST to see if they display any obvious errors
in preparing the program. These results are presented below as Output 14.2.
                              JANE DOE
                         The TTEST Procedure

                                Statistics

                          Lower CL              Upper CL    Lower CL
Difference           N        Mean      Mean        Mean     Std Dev    Std Dev
EMOTION - SEXUAL    17      2.0989    2.9412      3.7835      1.2201     1.6382

                                Statistics

                     Upper CL
Difference            Std Dev    Std Err    Minimum    Maximum
EMOTION - SEXUAL       2.4933     0.3973          0          7

                                 T-Tests

Difference             DF    t Value    Pr > |t|
EMOTION - SEXUAL       16       7.40      <.0001

Output 14.2. Reviewing the results of PROC TTEST for signs of possible errors.

When you use SAS to perform a paired-samples t test, the results look similar to the results from a single-sample t test. The output consists of three tables. Two of the tables are headed "Statistics" in Output 14.2. These tables provide information regarding the difference score variable that was created in the analysis. The third table is headed "T-Tests," and this table provides information relevant to the paired-samples t test itself.
In the first statistics table, under the heading "Difference," you will find the names of the two variables that were used to create difference scores in your analysis. In Output 14.2, you can see the entry "EMOTION - SEXUAL" below this heading, meaning that SAS used the variables EMOTION and SEXUAL to create difference scores for the current analysis. This is as it should be. On the output, the entry "EMOTION - SEXUAL" tells you that, to create these difference scores, each subject's score for the SEXUAL variable was subtracted from her score for the EMOTION variable. Again, this is how you requested it in your SAS program, and so there is no obvious sign of an error.

Below the heading "N," you will find the number of valid observations in the data set. When you conduct a repeated-measures study (as you did in the present case), N should be equal to the number of subjects in your study. Output 14.2 shows that N is equal to 17, which seems correct.
Below the heading "Mean," you will find the average difference score that was created when SEXUAL scores were subtracted from EMOTION scores. Output 14.2 shows that the mean difference score is 2.9412. This means that, according to PROC TTEST, the average distress score was 2.9412 points higher under the emotional infidelity condition than under the sexual infidelity condition.
To determine whether there is any obvious error, you can manually compute the difference
between means, and compare it against this mean difference score of 2.9412. To compute
the difference manually, you need to refer to the results of PROC MEANS from Output 14.1
(presented earlier). There, you saw that the mean score on EMOTION was 24.0000, and the
mean score on SEXUAL was 21.0588. Subtracting the latter mean from the former results in
the following:
24.0000 - 21.0588 = 2.9412
You can therefore see that using the means from PROC MEANS to manually compute the mean difference results in the same difference that was reported under "Mean" in the output of PROC TTEST, shown in Output 14.2. Again, this suggests that there was no error made in typing the data or the SAS program itself. Because the output has passed this check,
you can now review the results to see what implications they have for your research
question.
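If either check had revealed a discrepancy, one simple way to hunt for a typing error (an optional step, not part of the program presented earlier) would be to list the raw data with PROC PRINT and compare the listing against Table 14.3:

PROC PRINT DATA=D1;
   VAR SUB_NUM EMOTION SEXUAL;   * list every subject's scores for inspection;
   TITLE1 'JANE DOE';
RUN;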
2. Review the means on the criterion variable that you obtained under the two
conditions. In an earlier section of this chapter, you stated the following research question: "The purpose of this study is to determine whether there is a difference between emotional infidelity versus sexual infidelity with respect to the amount of psychological distress that they produce in women." To obtain an answer to this question, you will review a number of
pieces of information. One of the first things you will review will be the mean scores for
psychological distress that the women displayed under the emotional infidelity condition
versus the sexual infidelity condition. These means were presented in Output 14.1. For
convenience, they are again reproduced here as Output 14.3.
                              JANE DOE                                        1
                         The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
EMOTION     17      24.0000000       2.2912878      20.0000000      28.0000000
SEXUAL      17      21.0588235       1.9193289      18.0000000      25.0000000
-------------------------------------------------------------------------------
Output 14.3. Mean scores on the criterion variable (psychological distress) obtained under the emotional infidelity condition versus the sexual infidelity condition.

As was mentioned earlier, the mean distress score obtained under the emotional infidelity condition was 24.0000, while the mean distress score obtained under the sexual infidelity condition was only 21.0588. The mean score obtained under the emotional infidelity condition was a bit higher than that obtained under the sexual infidelity condition, but was it enough higher to conclude that there is a statistically significant difference between the two means? To find out, you must consult the paired-samples t test.
3. Review the t test for the difference between means. The paired-samples t test for the
current analysis appears in the results of PROC TTEST, presented earlier. For convenience,
that output is reproduced here as Output 14.4.

                              JANE DOE                                        2
                         The TTEST Procedure

                                Statistics

                          Lower CL              Upper CL    Lower CL
Difference           N        Mean      Mean        Mean     Std Dev    Std Dev
EMOTION - SEXUAL    17      2.0989    2.9412      3.7835      1.2201     1.6382

                                Statistics

                     Upper CL
Difference            Std Dev    Std Err    Minimum    Maximum
EMOTION - SEXUAL       2.4933     0.3973          0          7

                                 T-Tests

Difference             DF    t Value    Pr > |t|
EMOTION - SEXUAL       16       7.40      <.0001

Output 14.4. The paired-samples t test, comparing mean distress scores obtained under the emotional infidelity condition versus the sexual infidelity condition.

You have already learned that the value that appears below the heading "Mean" in the output represents the mean difference score created when SEXUAL scores are subtracted from EMOTION scores. Output 14.4 shows that this mean difference score is 2.9412. When you perform a paired-samples t test, you determine whether this obtained mean difference score is significantly different from zero.
The results of this test appear lower in the output, in the section titled "T-Tests."
The degrees of freedom for this test appear below the heading "DF." You can see that there are 16 degrees of freedom for the current analysis.
Below the heading "t Value," you will find the obtained t statistic for the current paired-samples t test. You can see that the obtained t statistic is 7.40.
To determine whether the obtained t value is statistically significant, you consult the probability value (or p value) associated with that statistic. This p value appears below the heading "Pr > |t|." You can see that the p value for the current analysis is <.0001, which means that it is less than one in ten thousand. This text recommends that you reject the null hypothesis whenever your obtained p value is less than .05. The obtained p value of

<.0001 is clearly less than this criterion of .05. Therefore, you reject the null hypothesis.
Remember that your statistical null hypothesis stated the following:
Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference
between the emotional infidelity condition versus the sexual infidelity condition with
respect to mean scores on the criterion variable (the measure of psychological distress).
In your analysis report, you will indicate that there was a statistically significant difference
between mean distress scores obtained under the emotional infidelity condition, versus the
sexual infidelity condition. These results, when combined with the results produced by
PROC MEANS, show that the subjects displayed significantly higher levels of distress when
they imagined emotional infidelity than when they imagined sexual infidelity.
4. Review the confidence interval for the difference between the means. When you use
PROC TTEST, SAS computes a confidence interval for the difference between the means.
The PROC TTEST statement that is included in the program for the current analysis
contained the following ALPHA option:
ALPHA=0.05
This option causes SAS to compute the 95% confidence interval. If you had wanted the 99% confidence interval, you would have instead used this option:
ALPHA=0.01
The 95% confidence interval appears in the "Statistics" table in the PROC TTEST output. That table is reproduced here as Output 14.5:
                              JANE DOE
                         The TTEST Procedure

                                Statistics

                          Lower CL              Upper CL    Lower CL
Difference           N        Mean      Mean        Mean     Std Dev    Std Dev
EMOTION - SEXUAL    17      2.0989    2.9412      3.7835      1.2201     1.6382

Output 14.5. The lower confidence limit and upper confidence limit for the difference between the means.

Below the heading "Difference," you can see the entry "EMOTION - SEXUAL." This means that the information that appears in this row is information about the difference score variable that was created by subtracting SEXUAL scores from EMOTION scores.
Below the heading "Mean," you can see that the average difference score is 2.9412. This indicates that the average distress score obtained under the emotional infidelity condition is 2.9412 points higher than the average distress score obtained under the sexual infidelity condition.
As you remember from Chapter 12, a confidence interval extends from a lower confidence
limit to an upper confidence limit.

To find the lower confidence limit for the current difference between means, look below the heading "Lower CL Mean." There, you can see that the lower confidence limit for the difference is 2.0989.
To find the upper confidence limit for the difference, look below the heading "Upper CL Mean." There, you can see that the upper confidence limit for the difference is 3.7835.
Combined, these findings indicate that the 95% confidence interval for the difference
between means ranges from 2.0989 to 3.7835. This indicates that you can estimate with a
95% probability that the actual difference between the mean of the emotional infidelity
condition and the mean of the sexual infidelity condition (in the population) is somewhere
between 2.0989 and 3.7835.
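If you wish, you can verify these limits by hand from the mean difference and its standard error (both shown in Output 14.5). The following DATA step sketch performs the check; TINV is the SAS function that returns quantiles of the t distribution:

DATA _NULL_;
   DIFF  = 2.9412;            * mean difference score, from Output 14.5;
   SE    = 0.3973;            * standard error of the difference, from Output 14.5;
   TCRIT = TINV(.975, 16);    * two-tailed critical t for alpha = .05 and df = 16;
   LOWER = DIFF - (TCRIT * SE);
   UPPER = DIFF + (TCRIT * SE);
   PUT 'LOWER CL = ' LOWER 6.4 '   UPPER CL = ' UPPER 6.4;
RUN;

Allowing for rounding, the limits printed in the SAS log agree with the values of 2.0989 and 3.7835 reported by PROC TTEST.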
Notice that this interval does not contain the value of zero. This is consistent with your rejection of the null hypothesis in the previous section (i.e., you rejected the null hypothesis that stated "In the population, there is no difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable [the measure of psychological distress]"). If the null hypothesis had been true, you would have expected the confidence interval to include a value of zero (i.e., a difference score of zero). The fact that your confidence interval does not contain a value of zero is consistent with your rejection of this null hypothesis.
5. Compute the index of effect size. Earlier in this chapter, you learned that effect size can
be defined as the degree to which the mean score obtained under one condition differs from
the mean score obtained under the second condition, stated in terms of the standard
deviation of the population of difference scores. The symbol for effect size is d. When
performing a paired-samples t test, the formula for effect size is as follows:
d = | X1 - X2 | / sD

where
X1 = the observed mean of the sample of scores obtained under Condition 1
X2 = the observed mean of the sample of scores obtained under Condition 2
sD = the estimated standard deviation of the population of difference scores.
Although SAS does not automatically compute effect size, you can easily do so by using the
information that appears in the output of PROC MEANS and PROC TTEST. First, you will
need the mean scores on the psychological distress criterion variable obtained under the
two treatment conditions. These mean scores appear in the output of PROC MEANS, and
that output is reproduced again here as Output 14.6.

                              JANE DOE                                        1
                         The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
EMOTION     17      24.0000000       2.2912878      20.0000000      28.0000000
SEXUAL      17      21.0588235       1.9193289      18.0000000      25.0000000
-------------------------------------------------------------------------------
Output 14.6. Information from the results of PROC MEANS that are needed to compute the index of effect size.

In the preceding formula, X1 represents the observed mean of the sample of scores that were
obtained under Condition 1 (the emotional infidelity condition).
In Output 14.6, you can see that the mean score on the distress variable that was obtained
under the emotional infidelity condition was 24.0000. In the preceding formula, X2
represents the observed mean of the sample of scores obtained under Condition 2 (the sexual
infidelity condition).
In Output 14.6, you can see that the mean score on the distress variable that was obtained
under this condition was 21.0588. Substituting these two means in the formula results in the
following:
d = | 24.0000 - 21.0588 | / sD

In the formula for d, sD represents the estimated standard deviation of the population of difference scores. This statistic appears in the "Statistics" table from the results of PROC TTEST. The relevant table from the current analysis is reproduced here as Output 14.7.

                                Statistics

                          Lower CL              Upper CL    Lower CL
Difference           N        Mean      Mean        Mean     Std Dev    Std Dev
EMOTION - SEXUAL    17      2.0989    2.9412      3.7835      1.2201     1.6382

                                Statistics

                     Upper CL
Difference            Std Dev    Std Err    Minimum    Maximum
EMOTION - SEXUAL       2.4933     0.3973          0          7

Output 14.7. The estimated standard deviation of the population of difference scores.

The estimated standard deviation of the population of difference scores appears below the heading "Std Dev." For the current analysis, you can see that this standard deviation is 1.6382. Substituting this value in the formula results in the following:
d = | 24.0000 - 21.0588 | / 1.6382

d = 2.9412 / 1.6382

d = 1.7954

d = 1.80

And so the obtained index of effect size for the current analysis is 1.80. This means that the
mean distress score obtained under the emotional infidelity condition differs from the mean
distress score obtained under the sexual infidelity condition by 1.80 standard deviations. To
determine whether this is a relatively large difference or a relatively small difference, you can consult the guidelines provided by Cohen (1969). Cohen's guidelines are reproduced in
Table 14.4.
Table 14.4
Guidelines for Interpreting Effect Size
_________________________________________
Effect size          Obtained d statistic
_________________________________________
Small effect               d = .20
Medium effect              d = .50
Large effect               d = .80
_________________________________________

Your obtained d statistic of 1.80 is larger than the "large effect" value of .80 that appears in Table 14.4. This means that the manipulation in your study produced a relatively large effect.
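If you prefer to let SAS perform this arithmetic, the following DATA step sketch computes d from the three statistics identified above (the two means from Output 14.6 and the standard deviation of the difference scores from Output 14.7) and writes the result to the SAS log:

DATA _NULL_;
   MEAN1 = 24.0000;                  * mean, emotional infidelity condition (Output 14.6);
   MEAN2 = 21.0588;                  * mean, sexual infidelity condition (Output 14.6);
   SD_D  = 1.6382;                   * std dev of the difference scores (Output 14.7);
   D     = ABS(MEAN1 - MEAN2) / SD_D;
   PUT 'EFFECT SIZE d = ' D 5.2;
RUN;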
Summarizing the Results of the Analysis
Following is an analysis report that summarizes the preceding research question and results.
A) Statement of the research question: The purpose of this
study is to determine whether there is a difference between
emotional infidelity versus sexual infidelity with respect to
the amount of psychological distress that they produce in
women.
B) Statement of the research hypothesis: When asked to
imagine how they would feel if they learned that their partner
had been unfaithful, women will display higher levels of
psychological distress when imagining emotional infidelity
than when imagining sexual infidelity.
C) Nature of the variables: This analysis involved one
predictor variable and one criterion variable.
The predictor variable was type of infidelity. This was a
dichotomous variable, was assessed on a nominal scale, and

included two conditions: an emotional infidelity condition versus a sexual infidelity condition.
The criterion variable was subjects' scores on a 4-item
measure of distress. This was a multi-value variable and was
assessed on an interval scale.
D) Statistical test: Paired-samples t test.
E) Statistical null hypothesis (H0): µ1 = µ2; In the study population, there is no difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).
F) Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the study population, there is a difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).
G) Obtained statistic: t = 7.40

H) Obtained probability (p) value: p = .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.
J) Confidence interval: Subtracting the mean of the sexual
infidelity condition from the mean of the emotional infidelity
condition resulted in an observed difference of 2.94. The 95%
confidence interval for this difference ranged from 2.10 to
3.78.
K) Effect size: d = 1.80.

L) Conclusion regarding the research hypothesis: These findings provide support for the study's research hypothesis.
M) Formal description of results for a paper: Results were
analyzed using a paired-samples t test. This analysis
revealed a statistically significant difference between the
two conditions, t(16) = 7.40, p = .0001. The sample means are
displayed in Figure 14.2, which shows that the mean distress
score obtained under the emotional infidelity condition was
significantly higher than the mean distress score obtained
under the sexual infidelity condition (for emotional
infidelity, M = 24.00, SD = 2.29; for sexual infidelity, M =
21.06, SD = 1.92). The observed difference between the means
was 2.94, and the 95% confidence interval for the difference
between means ranged from 2.10 to 3.78. The effect size was
computed as d = 1.80. According to Cohen's (1969) guidelines,
this represents a relatively large effect.


N) Figure representing the results:

Figure 14.2. Mean scores on the measure of psychological distress as a
function of the type of infidelity (significant differences).

Notes Regarding the Preceding Report


The output. Much of the information that was presented in the preceding analysis report
was taken from the results of PROC MEANS and PROC TTEST, presented earlier. For your
convenience, that output (combined) is reproduced here as Output 14.8.
                            The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
EMOTION     17      24.0000000       2.2912878      20.0000000      28.0000000
SEXUAL      17      21.0588235       1.9193289      18.0000000      25.0000000
-------------------------------------------------------------------------------

                            The TTEST Procedure

                                 Statistics

                          Lower CL             Upper CL   Lower CL
Difference            N       Mean       Mean      Mean    Std Dev    Std Dev
EMOTION - SEXUAL     17     2.0989     2.9412    3.7835     1.2201     1.6382

                                 Statistics

                      Upper CL
Difference             Std Dev    Std Err    Minimum    Maximum
EMOTION - SEXUAL        2.4933     0.3973          0          7

                                  T-Tests

Difference                DF    t Value    Pr > |t|
EMOTION - SEXUAL          16       7.40      <.0001

Output 14.8. Results from PROC MEANS and PROC TTEST, infidelity study
(significant differences).
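If you want to regenerate this combined output, statements along the following lines
would do it. This is a sketch: the data set name D1 is an assumption (use whatever name
your own DATA step created), while the variable names EMOTION and SEXUAL are taken
from the output itself:

   PROC MEANS DATA=D1 N MEAN STD MIN MAX;
      VAR EMOTION SEXUAL;     * Descriptive statistics for each condition ;
   RUN;

   PROC TTEST DATA=D1 ALPHA=0.05;
      PAIRED EMOTION*SEXUAL;  * Paired-samples t test on the
                                EMOTION - SEXUAL difference scores        ;
   RUN;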


Rounding to two decimal places. Item J in the preceding report provided information about
the 95% confidence interval for the difference between the means. This information was
taken from the first "Statistics" table in the output of PROC TTEST. Notice that, in
the analysis report, the values have been rounded to two decimal places. Item J refers to an
observed difference of 2.94. The 95% confidence interval for this difference ranged from
2.10 to 3.78.
When you present statistics in an analysis report, in most cases you should round them to
two decimal places (you might have noticed that most of the statistics in the preceding
analysis have been rounded to two decimal places). You should report more than two
decimal places when it is necessary to convey important information, such as the p value
associated with the statistic (most analysis reports in this text report p values to four decimal
places).
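If you want SAS to do this rounding for you, the ROUND function handles it. The following
DATA _NULL_ step is purely illustrative, and its variable names are arbitrary:

   DATA _NULL_;
      DIFF  = 2.9412;            * Observed difference from Output 14.8   ;
      RDIFF = ROUND(DIFF, .01);  * Rounds to two decimal places: 2.94     ;
      PUT RDIFF=;                * Writes RDIFF=2.94 to the SAS log       ;
   RUN;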
The t statistic and related information. The obtained t statistic for the analysis is reported
in Items G and M of the analysis report. The t statistic itself appears in the output of PROC
TTEST below the heading "t Value."

Item M from the preceding report provides a summary of the results for a paper. The second
sentence reports the obtained t statistic in the following way:

t(16) = 7.40, p = .0001.

The 16 that appears in parentheses in the preceding excerpt represents the degrees of
freedom for the analysis. With a paired-samples t test, the degrees of freedom are equal to
N - 1, where N represents the number of difference scores computed. In the present case,
N = 17, so it makes sense that the degrees of freedom would be 17 - 1 = 16. In the output
from PROC TTEST, the degrees of freedom are reported below the heading "DF."

The p = .0001 from this excerpt indicates that the obtained p value for the analysis was
.0001. This came from the output of PROC TTEST, below the heading "Pr > |t|."
Means and standard deviations for the treatment conditions. In the formal description of
results for a paper (in Item M), the third sentence states:
...(for emotional infidelity, M = 24.00, SD = 2.29; for
sexual infidelity, M = 21.06, SD = 1.92).
In this excerpt, the symbol M represents the mean, and SD represents the standard
deviation. This sentence reports the mean and standard deviation of EMOTION (which
contained distress scores obtained under the emotional infidelity condition) and SEXUAL
(which contained distress scores obtained under the sexual infidelity condition). These
means and standard deviations appear in the results of PROC MEANS, under the headings
"Mean" and "Std Dev," respectively.


Example 14.2: An Illustration of Results Showing Nonsignificant Differences
Overview
This section presents the results of an analysis of a different data set, one that is
designed to produce nonsignificant results. This will enable you to see how nonsignificant
results might appear in your output. A later section will show you how to summarize
nonsignificant results in an analysis report.
The SAS Output
Output 14.9 resulted from the analysis of a different fictitious data set, one in which the
means for the two treatment conditions are not significantly different.
                            The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
EMOTION     17      21.0588235       1.9193289      18.0000000      25.0000000
SEXUAL      17      20.9411765       1.5996323      18.0000000      24.0000000
-------------------------------------------------------------------------------

                            The TTEST Procedure

                                 Statistics

                          Lower CL             Upper CL   Lower CL
Difference            N       Mean       Mean      Mean    Std Dev    Std Dev
EMOTION - SEXUAL     17     -0.807     0.1176    1.0424     1.3396     1.7987

                                 Statistics

                      Upper CL
Difference             Std Dev    Std Err    Minimum    Maximum
EMOTION - SEXUAL        2.7375     0.4362         -4          3

                                  T-Tests

Difference                DF    t Value    Pr > |t|
EMOTION - SEXUAL          16       0.27      0.7909

Output 14.9. Results from PROC MEANS and PROC TTEST, infidelity study
(nonsignificant differences).

Steps in Interpreting the Output


Overview. You would normally interpret Output 14.9 following the same steps that were
listed in the previous section titled "Steps in Interpreting the Output." However, this section
will focus only on those results that are most relevant to the significance test, the confidence
interval, and the index of effect size.


1. Review the means on the criterion variable obtained under the two conditions. In the
output from PROC MEANS, below the heading "Mean," you can see that the mean
distress score obtained under the emotional infidelity condition was 21.0588, and the mean
score obtained under the sexual infidelity condition was 20.9412. You can see that there
does not appear to be a large difference between these two treatment means.

2. Review the t test for the difference between the means. From the results of PROC
TTEST, below the heading "t Value," you can see that the obtained t statistic for the current
analysis is 0.27. The p value that is associated with this t statistic is 0.7909. Because this
obtained p value is larger than the standard criterion of .05, you fail to reject the null
hypothesis. In your report, you will indicate that the difference between the two treatment
conditions is not statistically significant.

3. Review the confidence interval for the difference between the means. From the output
of PROC TTEST, you can see that the observed difference between the means for the two
treatment conditions is 0.1176. The 95% confidence interval for this difference ranges
from -0.807 to 1.0424. Notice that this interval does include the value of zero,
which is consistent with your failure to reject the null hypothesis.
4. Compute the index of effect size. The formula for computing effect size in a paired-
samples t test is reproduced here:

   d = | X̄1 - X̄2 | / sD

The symbols X̄1 and X̄2 represent the sample means on the distress variable obtained
under the two treatment conditions. From Output 14.9, you can see that the mean distress
score obtained under the emotional infidelity condition was 21.0588, and the mean distress
score obtained under the sexual infidelity condition was 20.9412. Inserting these means
into the formula for the effect size index results in the following:

   d = | 21.0588 - 20.9412 | / sD


The symbol sD in the formula represents the estimated standard deviation of difference
scores in the population. This statistic may be found in the Statistics table produced by
PROC TTEST. This table is reproduced here:

                                 Statistics

                          Lower CL             Upper CL   Lower CL
Difference            N       Mean       Mean      Mean    Std Dev    Std Dev
EMOTION - SEXUAL     17     -0.807     0.1176    1.0424     1.3396     1.7987

                                 Statistics

                      Upper CL
Difference             Std Dev    Std Err    Minimum    Maximum
EMOTION - SEXUAL        2.7375     0.4362         -4          3

Output 14.10. Estimated population standard deviation that is needed to
compute effect size for the infidelity study (nonsignificant differences).

In Output 14.10, you can see that the estimated population standard deviation is 1.7987.
Substituting this value into the formula for effect size results in the following:
   d = | 21.0588 - 20.9412 | / 1.7987

   d = .1176 / 1.7987

   d = .0654

   d = .07

Thus, the index of effect size for the current analysis is .07. Cohen's guidelines (appearing
in Table 14.4) indicated that a "small" effect was obtained when d = .20. The value of .07
that was obtained with the present analysis was well below this criterion, indicating that the
present manipulation produced less than a small effect.
Summarizing the Results of the Analysis
Following is an analysis report that summarizes the preceding research question and results.
A) Statement of the research question: The purpose of this
study is to determine whether there is a difference between
emotional infidelity versus sexual infidelity with respect to
the amount of psychological distress that they produce in
women.
B) Statement of the research hypothesis: When they are asked
to imagine how they would feel if they learned that their
partner had been unfaithful, women will display higher levels
of psychological distress when imagining emotional infidelity
than when imagining sexual infidelity.


C) Nature of the variables: This analysis involved one
predictor variable and one criterion variable.

The predictor variable was type of infidelity. This was a
dichotomous variable, was assessed on a nominal scale, and
included two conditions: an emotional infidelity condition
versus a sexual infidelity condition.

The criterion variable was subjects' scores on a 4-item
measure of distress. This was a multi-value variable and was
assessed on an interval scale.

D) Statistical test: Paired-samples t test.

E) Statistical null hypothesis (H0): µ1 = µ2; In the study
population, there is no difference between the emotional
infidelity condition versus the sexual infidelity condition
with respect to mean scores on the criterion variable (the
measure of psychological distress).

F) Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the
study population, there is a difference between the emotional
infidelity condition versus the sexual infidelity condition
with respect to mean scores on the criterion variable (the
measure of psychological distress).
G) Obtained statistic: t = 0.27

H) Obtained probability (p) value: p = .7909

I) Conclusion regarding the statistical null hypothesis: Fail
to reject the null hypothesis.

J) Confidence interval: Subtracting the mean of the sexual
infidelity condition from the mean of the emotional infidelity
condition resulted in an observed difference of 0.12. The 95%
confidence interval for this difference extended from -0.81 to
1.04.

K) Effect size: d = .07.

L) Conclusion regarding the research hypothesis: These
findings fail to provide support for the study's research
hypothesis.
M) Formal description of results for a paper: Results were
analyzed using a paired-samples t test. This analysis revealed
a statistically nonsignificant difference between the two
conditions, t(16) = 0.27, p = .7909. The sample means are
displayed in Figure 14.3, which shows that the mean distress
score obtained under the emotional infidelity condition was
similar to the mean distress score obtained under the sexual
infidelity condition (for emotional infidelity, M = 21.06, SD
= 1.92; for sexual infidelity, M = 20.94, SD = 1.60). The
observed difference between the means was 0.12, and the 95%
confidence interval for the difference between means ranged
from -0.81 to 1.04. The effect size was computed as d = .07.
According to Cohen's (1969) guidelines, this represents less
than a small effect.
N) Figure representing the results:

Figure 14.3. Mean scores on the measure of psychological distress as a
function of the type of infidelity (nonsignificant differences).

Conclusion
In this chapter, you learned how to perform a paired-samples t test. With the information
learned here, along with the information learned in Chapter 13, Independent-Samples t Test,
you should now be prepared to analyze data from many types of studies that compare two
treatment conditions.
But what if you are conducting an investigation that involves more than two treatment
conditions? For example, you might conduct a study that investigates the effect of caffeine on
learning in laboratory rats. Such a study might involve four treatment conditions: (a) a group
given zero mg of caffeine, (b) a group given 1 mg of caffeine, (c) a group given 2 mg of
caffeine, and (d) a group given 3 mg of caffeine.
You might think that the way to analyze the data from this study would be to perform a series
of t tests in which you compare every possible combination of conditions. But most
researchers and statisticians would advise against this approach. Instead, most statisticians and
researchers would counsel you to analyze your data using a one-way analysis of variance
(abbreviated as ANOVA).


Analysis of variance is one of the most flexible and widely used statistical procedures in the
behavioral sciences and education. It is essentially an expansion of the t test because it enables
you to analyze data from studies that involve more than two treatment conditions. The
following chapter shows you how to use SAS to perform a one-way analysis of variance.

Chapter 15: One-Way ANOVA with One Between-Subjects Factor
Introduction ...................................................................491
   Overview ....................................................................491
Situations Appropriate for One-Way ANOVA with One Between-Subjects Factor .....491
   Overview ....................................................................491
   Nature of the Predictor and Criterion Variables .............................491
   The Type-of-Variable Figure .................................................492
   Example of a Study Providing Data That Are Appropriate for This Procedure ...492
   Summary of Assumptions Underlying One-Way ANOVA with One
      Between-Subjects Factor ..................................................493
A Study Investigating Aggression ...............................................494
   Overview ....................................................................494
   Research Method .............................................................495
   The Research Design .........................................................496
Treatment Effects, Multiple Comparison Procedures, and a New Index of
   Effect Size .................................................................497
   Overview ....................................................................497
   Treatment Effects ...........................................................497
   Multiple Comparison Procedures ..............................................498
   R2, an Index of Variance Accounted For ......................................499
Some Possible Results from a One-Way ANOVA .....................................500
   Overview ....................................................................500
   Significant Treatment Effect, All Multiple Comparison Tests are Significant .500
   Significant Treatment Effect, Two of Three Multiple Comparison Tests Are
      Significant ..............................................................502
   Nonsignificant Treatment Effect .............................................504
Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect ...........505
   Overview ....................................................................505
   Choosing SAS Variable Names and Values to Use in the SAS Program ............505
   Data Set to Be Analyzed .....................................................506
   Writing the SAS Program .....................................................507
   Keywords for Other Multiple Comparison Procedures ...........................510
   Output Produced by the SAS Program ..........................................511
   Steps in Interpreting the Output ............................................511
   Using a Figure to Illustrate the Results ....................................525
   Analysis Report for the Aggression Study (Significant Results) ..............526
   Notes Regarding the Preceding Analysis Report ...............................528
Example 15.2: One-Way ANOVA Revealing a Nonsignificant Treatment Effect ........529
   Overview ....................................................................529
   The Complete SAS Program ....................................................530
   Steps in Interpreting the Output ............................................531
   Using a Graph to Illustrate the Results .....................................534
   Analysis Report for the Aggression Study (Nonsignificant Results) ...........535
Conclusion .....................................................................537


Introduction
Overview
This chapter shows how to enter data into SAS and prepare SAS programs that will perform
a one-way analysis of variance (ANOVA) using the GLM procedure. This chapter focuses
on between-subjects research designs: designs in which each subject is exposed to only one
condition under the independent variable. It shows how to determine whether there is a
significant effect for the study's independent variable, how to use multiple comparison
procedures to identify the pairs of groups that are significantly different from each other,
how to request confidence intervals for differences between the means, and how to interpret
an index of effect size. Finally, it shows how to prepare a report that summarizes the results
of the analysis.

Situations Appropriate for One-Way ANOVA with One Between-Subjects Factor
Overview
One-way ANOVA is a test of group differences: it enables you to determine whether there
are significant differences between two or more treatment conditions with respect to their
mean scores on a criterion variable. ANOVA has an important advantage over a t test: A t
test enables you to determine whether there is a significant difference between only two
groups. ANOVA, on the other hand, enables you to determine whether there is a significant
difference between two or more groups. ANOVA is routinely used to analyze data from
experiments that involve three or more treatment conditions.
In summary, one-way ANOVA with one between-subjects factor can be used when you
want to investigate the relationship between (a) a single predictor variable (which classifies
group membership) and (b) a single criterion variable.
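As a preview of the syntax developed later in this chapter, the analysis reduces to a short
PROC GLM call. The names below are placeholders in the same style as the syntax
templates used in this guide, not real variables:

   PROC GLM DATA=data-set-name;
      CLASS predictor-variable;                       * Grouping (classification) variable ;
      MODEL criterion-variable = predictor-variable;  * Criterion modeled as a
                                                        function of group membership       ;
   RUN;
   QUIT;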
Nature of the Predictor and Criterion Variables
Predictor variable. With ANOVA, the predictor (or independent) variable is a type of
classification variable: it simply indicates which group a subject is in.
Criterion variable. With analysis of variance, the criterion (or dependent) variable is
typically a multi-value variable. It must be a numeric variable that is assessed on either an
interval or ratio level of measurement. The criterion variable in the analysis must also satisfy
a number of additional assumptions, and these assumptions are summarized in a later
section.


The Type-of-Variable Figure


The figure below illustrates the types of variables that are typically being analyzed when
researchers perform a one-way ANOVA with one between-subjects factor.
   Criterion          Predictor
     Multi       =       Lmt

The "Multi" symbol that appears in the above figure shows that the criterion variable in an
ANOVA is typically a multi-value variable (a variable that assumes more than six values in
your sample).

The "Lmt" symbol that appears to the right of the equal sign in the above figure shows that
the predictor variable in this procedure is usually a limited-value variable (i.e., a variable
that assumes just two to six values).
Example of a Study Providing Data That Are Appropriate for This
Procedure
The Study. Suppose that you are an industrial psychologist studying work motivation and
work safety. You are trying to identify interventions that may increase the likelihood that
employees will engage in safe behaviors at work.
In your current investigation, you are working with pizza deliverers. With this population,
employers are interested in increasing the likelihood that the drivers will display safe driving
behaviors. They are particularly interested in interventions that may increase the frequency
with which the drivers come to a full stop at stop signs (as opposed to a more dangerous
"rolling stop").
You are investigating two research questions:

• Does setting safety-related goals for employees increase the frequency with which they
  will engage in safe behaviors?

• Do participatively set goals tend to be more effective than goals that are assigned by a
  supervisor without participation?

To explore these questions, you conduct an experiment in which you randomly assign 30
pizza deliverers to one of three treatment conditions:

• 10 drivers are assigned to the participative goal-setting condition. These drivers meet as
  a group and set goals for themselves with respect to how frequently they should come to a
  full stop at stop signs.

• 10 drivers are assigned to the assigned goal-setting condition. The drivers in this group
  meet with their supervisors, and the supervisors assign goals regarding how frequently
  they should come to a full stop at stop signs (unbeknownst to the drivers, these goals are
  the same as the goals developed by the preceding group).

• 10 drivers are assigned to the control condition. Drivers in this condition do not
  experience any goal-setting.

You secretly observe the drivers at stop signs over a two-month period, noting how many
times each driver comes to a full stop out of 30 opportunities. You perform a one-way
ANOVA to determine whether there are significant differences between the three groups
with respect to their average number of full stops out of 30. The ANOVA procedure permits
you to determine (a) whether the two goal-setting groups displayed a greater average
number of full stops, compared to the control group, and (b) whether the participative
goal-setting group displayed a greater number of full stops, compared to the assigned
goal-setting group.
Why these data would be appropriate for this procedure. The preceding study involved a
single predictor variable and a single criterion variable. The predictor variable was type of
motivational intervention. You know that this was a limited-value variable because it
assumed only three values: a participative goal-setting condition, an assigned goal-setting
condition, and a control condition. This predictor variable was assessed on a nominal scale
because it indicates group membership but does not convey any quantitative information.
However, remember that the predictor variable used in a one-way ANOVA may be assessed
on any scale of measurement.
The criterion variable in this study was the number of full stops displayed by the pizza
drivers. This was a numeric variable, and you know it was assessed on a ratio scale because
it had equal intervals and a true zero point. You would know that this was a multi-value
variable if you used a procedure such as PROC FREQ to verify that the drivers' scores
displayed a relatively large number of values (i.e., some drivers had zero full stops out of a
possible 30, other drivers had 30 full stops out of a possible 30, and still other drivers had
any number of full stops between these two extremes).
Before you analyze your data with ANOVA, you first want to perform a number of other
preliminary analyses on your data to verify that they meet the assumptions underlying this
statistical procedure. The most important of these assumptions are summarized in the
following section.
Note: Although the study described here is fictitious, it is based on a real study reported by
Ludwig and Geller (1997).
Summary of Assumptions Underlying One-Way ANOVA with One
Between-Subjects Factor

• Level of measurement. The criterion variable must be a numeric variable that is assessed
  on an interval or ratio level of measurement. The predictor variable may be assessed on
  any level of measurement, although it is essentially treated as a nominal-level
  (classification) variable in the analysis.

• Independent observations. A particular observation should not be dependent on any
  other observation in any group. In practical terms, this means that (a) each subject is
  exposed to just one condition under the predictor variable and (b) subject matching
  procedures are not used.

• Random sampling. Scores on the criterion variable should represent a random sample
  drawn from the study populations.

• Normal distributions. Each group should be drawn from a normally distributed
  population. If each group contains over 30 subjects, the test is robust against moderate
  departures from normality (in this context, robust means that the test will still provide
  accurate results as long as violations of the assumptions are not large). You should analyze
  your data with PROC UNIVARIATE using the NORMAL option (see the sketch after this
  list) to determine whether your data meet this assumption. Be warned that the tests for
  normality that are provided by PROC UNIVARIATE tend to be fairly sensitive when
  samples are large.

• Homogeneity of variance. The populations that are represented by the various groups
  should have equal variances on the criterion. If the number of subjects in the largest group
  is no more than 1.5 times greater than the number of subjects in the smallest group, the test
  is robust against moderate violations of the homogeneity assumption (Stevens, 1986).
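As an illustration of the normality check mentioned in the list above, the statements below
assume the data set name (D1) and criterion variable name (SUB_AGGR) that are used later
in this chapter; substitute your own names as needed:

   PROC UNIVARIATE DATA=D1 NORMAL;
      VAR SUB_AGGR;   * The NORMAL option requests tests of normality
                        (e.g., the Shapiro-Wilk test) for this variable ;
   RUN;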

A Study Investigating Aggression


Overview
Assume that you are conducting research concerning the possible causes of aggression in
children. You are aware that social learning theory (Bandura, 1977) predicts that exposure to
aggressive models can cause people to behave more aggressively. You design a study in
which you experimentally manipulate the amount of aggression that a model displays. You
want to determine whether this manipulation affects how aggressively children subsequently
behave after they have viewed the model. Essentially, you wish to determine whether
viewing a model's aggressive behavior can lead to increased aggressive behavior on the
part of the viewer.
Your research hypothesis: There will be a positive relationship between the level of
aggression displayed by a model and the number of aggressive acts later demonstrated by
children who observe the model. Specifically, you predict the following:

• Children who witness a high level of aggression will demonstrate a greater number of
  aggressive acts than children who witness a moderate or low level of aggression.

• Children who witness a moderate level of aggression will demonstrate a greater number of
  aggressive acts than children who witness a low level of aggression.

You perform a single investigation to test these hypotheses. The following sections describe
the research method in more detail.
Note: Although the study and results presented here are fictitious, they are inspired by a real
study reported by Bandura (1965).


Research Method
Overview. You conduct a study in which 24 nursery-school children serve as subjects. The
study is conducted in two stages. In Stage 1, you show a short videotape to your subjects.
You manipulate the independent variable by varying what the children see in this videotape.
In Stage 2, you assess the dependent variable (the amount of aggression displayed by the
children) to determine whether it has been affected by this independent variable.
The following sections refer to your independent variable as a predictor variable. This is
because the term predictor variable is more general, and is appropriate regardless of
whether your variable is a true manipulated independent variable (as in the present case), or
a nonmanipulated subject variable (such as subject sex).
Stage 1: Manipulating the predictor variable. The predictor variable in your study is the
level of aggression displayed by the model or, more concisely, "model aggression." You
manipulate this independent variable by randomly assigning each child to one of three
treatment conditions:

• Eight children are assigned to the low-model-aggression condition. When the subjects in
  this group watch the videotape, they see a model demonstrate a relatively low level of
  aggressive behavior. Specifically, they see a model (an adult female) enter a room that
  contains a wide variety of toys. For 90% of the tape, the model engages in nonaggressive
  play (e.g., playing with building blocks). For 10% of the tape, the model engages in
  aggressive play (e.g., violently punching an inflatable bobo doll).

• Another eight children are assigned to the moderate-model-aggression condition. They
  watch a videotape of the same model in the same playroom, but they observe the model
  displaying a somewhat higher level of aggressive behavior. Specifically, in this version of
  the tape, the model engages in nonaggressive play (again, playing with building blocks)
  50% of the time, and engages in aggressive play (again, punching the bobo doll) 50% of
  the time.

• Finally, the remaining eight children are assigned to the high-model-aggression
  condition. They watch a videotape of the same model in the same playroom, but in this
  version the model engages in nonaggressive play 10% of the time, and engages in
  aggressive play 90% of the time.

Stage 2: Assessing the criterion variable. This chapter will refer to the dependent variable
in the study as a criterion variable. Again, this is because the term criterion variable is a
more general term that is appropriate regardless of whether your study is a true experiment
(as in the present case), or is a nonexperimental investigation.
The criterion variable in this study is the number of aggressive acts displayed by the
subjects or, more concisely, "subject aggressive acts." The purpose of your study was to
determine whether certain manipulations in your videotape caused some groups of children
to behave more aggressively than others. To assess this, you allowed each child to engage in
a free play period immediately after viewing the videotape. Specifically, each child was
individually escorted to a play room similar to the room that was shown in the tape. This
playroom contained a large assortment of toys, some of which were appropriate for


nonaggressive play (e.g., building blocks), and some of which were appropriate for
aggressive play (e.g., an inflatable bobo doll identical to the one in the tape). The children
were told that they could do whatever they liked in the play room, and were then left to play
alone.
Outside of the play room, three observers watched the child through a one-way mirror. They
recorded the total number of aggressive acts the child displayed during a 20-minute period
in the play room (an aggressive act could be an instance in which the child punches the
bobo doll, throws a building block, and so forth). Therefore, the criterion variable in your
study is the total number of aggressive acts demonstrated by each child during this period.
The Research Design
The research design used in this study is illustrated in Figure 15.1. You can see that this
design is represented by a figure that consists of three squares, or cells.

Figure 15.1. Research design for the aggression study.

The figure is titled "Predictor Variable: Level of Aggression Displayed by Model." The first
cell (on the left) represents the eight subjects in Level 1 (the children who saw a videotape in
which the model displayed a low level of aggression). The middle cell represents the eight
subjects in Level 2 (the children who saw the model display a moderate level of aggression).
Finally, the cell on the right represents the eight subjects in Level 3 (the children who saw
the model display a high level of aggression).


Treatment Effects, Multiple Comparison Procedures, and a New Index of Effect Size
Overview
This section introduces the three types of results that you will review when you conduct a
one-way ANOVA. First, it covers the concept of treatment effects: overall differences
among group means that are due to the experimental manipulations. Next, it discusses
multiple comparison procedures: tests used to determine which pairs of treatment
conditions are significantly different from each other. Finally, it introduces the R2 statistic,
a measure of the variance in the criterion variable that is accounted for by the predictor
variable. It discusses the use of R2 as an index of effect size in experiments.
Treatment Effects
Null and alternative hypotheses. The concept of treatment effects is best understood with
reference to the concept of the null hypothesis. For an experiment with three treatment
conditions (as with the current study), the statistical null hypothesis may generally be stated
according to this format:

Statistical null hypothesis (H0): µ1 = µ2 = µ3; In the study population, there is no
difference between subjects in the three treatment conditions with respect to their mean
scores on the criterion variable.

In the preceding null hypothesis, the symbol µ represents the population mean on the
criterion variable for a particular treatment condition. For example, µ1 represents the
population mean for Treatment Condition 1, µ2 represents the population mean for
Treatment Condition 2, and so on.
The preceding section describes a study that was designed to determine whether exposure to
aggressive models will cause subjects who view the models to behave aggressively
themselves. The statistical null hypothesis for this study may be stated in this way:
Statistical null hypothesis (H0): µ1 = µ2 = µ3; In the study population, there is no
difference between subjects in the low-model-aggression condition, subjects in the
moderate-model-aggression condition, and subjects in the high-model-aggression
condition with respect to their mean scores on the criterion variable (the number of
aggressive acts displayed by the subjects).
For a study with three treatment conditions, the statistical alternative hypothesis may
generally be stated in this way:
Statistical alternative hypothesis (H1): Not all µs are equal; In the study population,
there is a difference between at least two of the three treatment conditions with respect to
their mean scores on the criterion variable.


The statistical alternative hypothesis appropriate for the model aggression study that was
previously described may be stated in this way:
Statistical alternative hypothesis (H1): Not all µs are equal; In the study population,
there is a difference between at least two of the following three groups with respect to
their mean scores on the criterion variable: subjects in the low-model-aggression
condition, subjects in the moderate-model-aggression condition, and subjects in the
high-model-aggression condition.
The following sections of this chapter will show you how to use SAS to perform a one-way
analysis of variance. You will learn that, in this analysis, SAS computes an F statistic that
tests the statistical null hypothesis. If this F statistic is significant, you can reject the null
hypothesis that all population means are equal. You may then tentatively conclude that at
least two of the population means differ from one another. In this situation, you have
obtained a significant treatment effect.
Treatment effects. In an experiment, a significant treatment effect refers to differences
among group means that are due to the influence of the independent variable. When you
conduct a true experiment and have a significant treatment effect, this means that your
independent variable had some type of effect on the dependent variable. It means that at
least two of the treatment conditions are significantly different from each other. In most
cases, this is what you want to demonstrate.
In a single-factor experiment (an experiment in which only one independent variable is
being manipulated), there can be only one treatment effect. However, in a factorial
experiment (an experiment in which more than one independent variable is being
manipulated), there may be more than one treatment effect. The present chapter will deal
exclusively with single-factor experiments; Chapter 16, Factorial ANOVA with Two
Between-Subjects Factors, will introduce you to the concept of factorial experiments.
Multiple Comparison Procedures
As was stated previously, when an F statistic is large enough to enable you to reject the null
hypothesis, you may tentatively conclude that, in the population, at least two of the three
treatment conditions differ from one another. But which two? It is possible, for example,
that Group 1 is significantly different from Group 2, but Group 2 is not significantly
different from Group 3. Clearly, researchers need a tool that will enable them to determine
which groups are significantly different from one another.
A special type of test called a multiple comparison procedure is normally used for this
purpose. A multiple comparison procedure is a statistical test that enables researchers to
determine the significance of the difference between pairs of means from studies that
include more than two treatment conditions. A wide variety of multiple comparison
procedures are available with SAS, but this chapter will focus on just one that is very widely
used: Tukey's studentized range (HSD) test.


A later section shows how to request the Tukey test in your SAS program, and how to
interpret the output that it generates. You will supplement the Tukey test with an option that
requests confidence intervals for the difference between the means, somewhat similar to the
confidence intervals that you obtained with the independent-samples t tests that you
performed in Chapter 13. The results of these tests will give you a better understanding of
the nature of the differences between your treatment conditions.
R2, an Index of Variance Accounted For
The need for an index of effect size. In Chapter 12, Single-Sample t Test, you learned
that it is possible to conduct an experiment and obtain results that are statistically significant,
even though the magnitude of the treatment effect is trivial. This outcome is most likely to
occur when you conduct a study with a very large number of subjects. When your sample is
very large, your statistical test has a relatively large amount of power. This means that you
are fairly likely to reject the null hypothesis and conclude that you have obtained significant
differences, even when the magnitude of the difference is relatively small.
To address this problem, researchers are now encouraged to supplement their significance
tests with indices of effect size. An index of effect size is a measure of the magnitude of a
treatment effect. A variety of different effect size indices are available for use in research. In
the three chapters in this text that deal with t tests, you learned about the d statistic. The d
statistic indicates the degree to which one sample mean differs from the second sample
mean (or population mean), stated in terms of the standard deviation of the population. The
d statistic is often used as an index of effect size when researchers compute t statistics.
R2 as a measure of variance accounted for. In this chapter, you will learn about a different
index of effect size: R2. R2 is an index that is often reported when researchers perform
analysis of variance. The R2 statistic indicates the proportion of variance in the criterion
variable that is accounted for by the study's predictor variable. It is computed by dividing
the sum of squares for the predictor variable (the between-groups sum of squares) by the
sum of squares for the corrected total. Values of R2 may range from .00 to 1.00, with larger
values indicating a larger treatment effect (the word "effect" is appropriate only for
experimental research, not for nonexperimental research). The larger the value of R2, the
larger the effect that the independent variable had on the dependent variable. For example,
when a researcher conducts an experiment and obtains an R2 value of .40, she may conclude
that her independent variable accounted for 40% of the variance in the dependent variable.
Researchers typically hope to obtain relatively large values of R2.
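As a hypothetical worked example of this computation (the sums of squares here are
invented purely for illustration): if an ANOVA summary table showed a between-groups
sum of squares of 120 and a corrected total sum of squares of 300, then
R2 = 120 / 300 = .40, and the predictor variable would account for 40% of the variance in
the criterion variable.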
Interpreting the size of R2. Chapters 12 through 14 in this text provided information about
t tests and guidelines for interpreting the d statistic. Specifically, they provided tables that
showed how the size of a d statistic indicates a "small" effect versus a "moderate" effect
versus a "large" effect. Unfortunately, however, there are no similar widely accepted
guidelines for interpreting R2. For example, although most researchers would agree that R2
values less than .05 are relatively trivial, there is no widely accepted criterion for how large
R2 must be to be considered "large." This is because the significance of the size of R2
depends on the nature


of the phenomenon being studied, and also on the size of R2 values that were obtained when
other researchers have studied the same phenomenon.
For example, researchers looking for ways to improve the grades of children in elementary
schools might find that it is difficult to construct interventions that have much of an impact.
If this is the case, an experiment that produces an R2 value of .15 may be considered a big
success, and the R2 value of .15 may be considered meaningful. In contrast, researchers
conducting research on reinforcement theory using laboratory rats and a bar-pressing
procedure may find that it is easy to construct manipulations that have a major effect on the
rats' bar-pressing behavior. In these studies, they may routinely obtain R2 values over .80. If
this is the case, then a new experiment that produces an R2 value of .15 may be considered a
failure, and the R2 value of .15 may be considered relatively trivial.
The above example illustrates the problem with R2 as an index of effect size: in one
situation, an R2 value of .15 was interpreted as a meaningful proportion of variance, and in a
different situation, the same value of .15 was interpreted as a relatively trivial proportion of
variance. Therefore, before you can interpret an R2 value from a study that you have
conducted, you must first be familiar with the R2 values that have been obtained in similar
research that has already been conducted by others. It is only within this context that you can
determine whether your R2 value can be considered large or meaningful.
Summary. In summary, R2 is a measure of variance accounted for that may be used as an
index of effect size when you analyze data from experiments that use a between-subjects
design. Later sections of this chapter will show where the R2 statistic is printed in the output
of PROC GLM, and how you should incorporate it into your analysis reports.

Some Possible Results from a One-Way ANOVA


Overview
When you conduct a study that involves three or more treatment conditions, a number of
different types of outcomes are possible. Some of these possibilities are illustrated below.
All these examples are based on the aggression experiment described above.
Significant Treatment Effect, All Multiple Comparison Tests are
Significant
Figure 15.2 illustrates an outcome in which both of the following are true:

• there is a significant treatment effect for the predictor variable (level of aggression
  displayed by the model)

• all multiple comparison tests are significant.


Figure 15.2. Mean number of aggressive acts as a function
of the level of aggression displayed by the model
(significant treatment effect; all multiple comparison tests are significant).

Understanding the figure. The bar labeled "Low" in Figure 15.2 represents the children
who saw the model display a low level of aggression in the videotape. The bar labeled
"Moderate" represents the children who saw the model display a moderate level of
aggression, and the bar labeled "High" represents the children who saw the model display a
high level of aggression.

The vertical axis labeled "Subject Aggressive Acts" in Figure 15.2 indicates the mean
number of aggressive acts that the various groups of children displayed after viewing the
videotape. You can see that the "Low" bar reflects a score of approximately 5 on this axis,
meaning that the children in the low-model-aggression condition displayed an average of
about five aggressive acts in the play room after viewing the videotape. In contrast, the bar
for the children in the "Moderate" group shows a substantially higher score of about 14
aggressive acts, and the bar for the children in the "High" group shows an even higher score
of about 23 aggressive acts.
Expected statistical results. If you analyzed the data for this figure using a one-way
ANOVA, you would probably expect the overall treatment effect to be significant because at
least two of the groups in Figure 15.2 display means that appear to be substantially different


from one another. You would also expect to see significant multiple comparison tests,
because

• The mean for the moderate-model-aggression group appears to be substantially higher
  than the mean for the low-model-aggression group; this suggests that the multiple
  comparison test comparing these two groups would probably be significant.

• The mean for the high-model-aggression group appears to be substantially higher than the
  mean for the moderate-model-aggression group; this suggests that the multiple
  comparison test comparing these two groups would probably be significant.

• The mean for the high-model-aggression group appears to be substantially higher than the
  mean for the low-model-aggression group; this suggests that the multiple comparison test
  comparing these two groups would probably be significant.

Conclusions regarding the research hypotheses. Figure 15.2 shows that the greater the
amount of aggression modeled in the videotape, the greater the number of aggressive acts
subsequently displayed by the children in the play room. It would be reasonable to conclude
that the results provide support for the research hypotheses that were stated in the previous
section "A Study Investigating Aggression."

Note: Of course, you don't arrive at conclusions such as this by merely preparing a figure
and "eyeballing" the data. Instead, you perform the appropriate statistical analyses to
confirm your conclusions; these analyses will be illustrated in later sections.
Significant Treatment Effect, Two of Three Multiple Comparison
Tests Are Significant
Figure 15.3 illustrates an outcome in which both of the following are true:

• there is a significant treatment effect for the predictor variable

• two of the three possible multiple comparison tests are significant.


Figure 15.3. Mean number of aggressive acts as a function of the level of
aggression displayed by the model (significant treatment effect; two of three
multiple comparison tests are significant).

Expected statistical results. If you analyzed the data for this figure using a one-way
ANOVA, you would probably expect the overall treatment effect to be significant because at
least two of the groups in Figure 15.3 have means that appear to be substantially different
from one another. You would also expect to see two significant multiple comparison tests,
because

• The mean for the high-model-aggression group appears to be substantially higher than the
  mean for the moderate-model-aggression group.

• The mean for the high-model-aggression group appears to be substantially higher than the
  mean for the low-model-aggression group.

In contrast, there is very little difference between the mean for the moderate-model-
aggression group and the mean for the low-model-aggression group. The multiple
comparison test comparing these two groups would probably not demonstrate significance.
Conclusions regarding the research hypotheses. It is reasonable to conclude that the
results shown in Figure 15.3 provide partial support for your research hypotheses. The
results are somewhat supportive because the high-model-aggression group was more
aggressive than the other two groups. However, they were not fully supportive, as there was
not a significant difference between the "Low" and "Moderate" groups.


Nonsignificant Treatment Effect


Figure 15.4 illustrates an outcome in which the treatment effect for the predictor variable is
nonsignificant.

Figure 15.4. Mean number of aggressive acts as a function of the level of
aggression displayed by the model (treatment effect is nonsignificant).

Expected statistical results. If you analyzed the data for this figure using a one-way
ANOVA, you would probably expect the overall treatment effect to be nonsignificant. This
is because none of the groups in the figure appear to be substantially different from one
another with respect to their mean scores on the criterion variable. Each group displays a
mean of approximately 15 aggressive acts, regardless of condition.
When an overall treatment effect is nonsignificant, it is normally not appropriate to further
interpret the results of the multiple comparison procedures. This is important to remember
because a later section of this chapter will illustrate a SAS program that will request that
multiple comparison tests be computed and printed regardless of the significance of the
overall treatment effect. As the researcher, you must remember to always consult this overall
test first, and proceed to the multiple comparison results only if the overall treatment effect
is significant.
Conclusions regarding the research hypotheses. It is reasonable to conclude that the
results shown in Figure 15.4 fail to provide support for your research hypotheses.


Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect
Overview
The steps that you follow in performing an ANOVA will vary depending on whether the
treatment effect is significant. This section illustrates an analysis that results in a significant
treatment effect. It shows you how to prepare the SAS program, interpret the SAS output,
and summarize the results. These procedures are illustrated by analyzing fictitious data from
the aggression study that was described previously. In these analyses, the predictor variable
is the level of aggression displayed by the model, and the criterion variable is the number of
aggressive acts displayed by the children after viewing the videotape.
Choosing SAS Variable Names and Values to Use in the SAS
Program
Before you write a SAS program to perform an ANOVA, it is helpful to first prepare a
figure similar to Figure 15.5. The purpose of this figure is to help you choose a meaningful
SAS variable name for the predictor variable, and meaningful values to represent the
different levels under the predictor variable. Carefully choosing meaningful variable names
and values at this point will make it easier to interpret your SAS output later.

Figure 15.5. Predictor variable name and values to be used in the
SAS program for the aggression study.

SAS variable name for the predictor variable. You can see that Figure 15.5 is very
similar to Figure 15.1, except that variable names and values have now been added. Figure
15.5 again shows that the predictor variable in your study is the "Level of Aggression
Displayed by Model." Below this heading is "MOD_AGGR," which will be the SAS
variable name for the predictor variable in your SAS program (MOD_AGGR stands for
"model aggression"). Of course, you can choose any SAS variable name, but it should be
meaningful and must comply with the rules for SAS variable names.


Values to represent conditions under the predictor variable. Below the heading for the
predictor variable are the names of the three conditions under this predictor variable: Low,
Moderate, and High. Below these headings for the three conditions are the values that you
will use to represent these conditions in your SAS program: L represents children in the
low-model-aggression condition, M represents children in the moderate-model-aggression
condition, and H represents children in the high-model-aggression condition. Choosing
meaningful letters such as L, M, and H will make it easier to interpret your SAS output later.
Data Set to Be Analyzed
Table 15.1 presents the data set that you will analyze.

Table 15.1
Variables Analyzed in the Aggression Study (Data Set Will Produce a
Significant Treatment Effect)
____________________________________________________
              Model           Subject
Subject       aggression      aggression
____________________________________________________
  01          L               02
  02          L               14
  03          L               10
  04          L               08
  05          L               08
  06          L               15
  07          L               03
  08          L               12
  09          M               13
  10          M               25
  11          M               16
  12          M               20
  13          M               21
  14          M               21
  15          M               17
  16          M               26
  17          H               20
  18          H               14
  19          H               23
  20          H               22
  21          H               24
  22          H               26
  23          H               19
  24          H               29
____________________________________________________

Understanding the columns in the table. The columns in Table 15.1 provide the variables
that you will analyze in your study. The first column in Table 15.1 is headed "Subject." This
column simply assigns a subject number to each child.

The second column is headed "Model aggression." In this column, the value L identifies
children who saw the model in the videotape display a low level of aggression, M
identifies children who saw the model display a moderate level of aggression, and H
identifies children who saw the model display a high level of aggression.


Finally, the column headed "Subject aggression" indicates the number of aggressive acts
that each child displayed in the play room after viewing the videotape. This variable will
serve as the criterion variable in your study.

Understanding the rows of the table. The rows in Table 15.1 represent individual children
who participated as subjects in the study. The first row represents Subject 1. The "L" under
"Model aggression" tells you that this child was in the low condition under the predictor
variable. The "02" under "Subject aggression" tells you that this child displayed two
aggressive acts after viewing the videotape. The data lines for the remaining children may be
interpreted in the same way.
Writing the SAS Program
The DATA step. In preparing the SAS program, you will type the data similar to the way
that they appear in Table 15.1. That is, you will have one column to contain subject
numbers, one column to indicate each subject's condition under the model-aggression
predictor variable, and one column to indicate each subject's score on the criterion variable.
Here is the DATA step for your SAS program:
OPTIONS LS=80 PS=60;
DATA D1;
INPUT SUB_NUM
MOD_AGGR $
SUB_AGGR;
DATALINES;
01 L 02
02 L 14
03 L 10
04 L 08
05 L 08
06 L 15
07 L 03
08 L 12
09 M 13
10 M 25
11 M 16
12 M 20
13 M 21
14 M 21
15 M 17
16 M 26
17 H 20
18 H 14
19 H 23
20 H 22
21 H 24
22 H 26
23 H 19
24 H 29
;


You can see that the INPUT statement of the preceding program uses the following SAS
variable names:

• The SAS variable name SUB_NUM is used to represent subject numbers.

• The SAS variable name MOD_AGGR is used to code subject condition under the
  model-aggression predictor variable. Values are either "L," "M," or "H" for this
  variable. Note that the variable name is followed by the "$" symbol to indicate that
  it is a character variable.

• The SAS variable name SUB_AGGR is used to contain subjects' scores on the subject
  aggression criterion variable.

The PROC step. Following is the syntax for the PROC step needed to perform a one-way
ANOVA with one between-subjects factor, and follow it with Tukey's HSD test:
PROC GLM DATA = data-set-name ;
CLASS predictor-variable ;
MODEL criterion-variable = predictor-variable ;
MEANS predictor-variable ;
MEANS predictor-variable /
TUKEY CLDIFF ALPHA=alpha-level ;
TITLE1 ' your-name ' ;
RUN;
QUIT;
Substituting the appropriate SAS variable names into this syntax results in the following
(line numbers have been added on the left; you will not actually type these line numbers):

1     PROC GLM DATA=D1;
2        CLASS MOD_AGGR;
3        MODEL SUB_AGGR = MOD_AGGR;
4        MEANS MOD_AGGR;
5        MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
6        TITLE1 'JOHN DOE';
7     RUN;
8     QUIT;

Some notes about the preceding code:

• In Line 1, the PROC GLM statement requests the GLM procedure, and requests that the
  analysis be performed on data set D1.

• In Line 2, the CLASS statement lists the classification variable as MOD_AGGR (model
  aggression, the predictor variable in the experiment).

• Line 3 contains the MODEL statement for the analysis. The name of the criterion variable
  SUB_AGGR appears to the left of the equal sign in this statement, and the name of the
  predictor variable MOD_AGGR appears to the right of the equal sign.

• Line 4 contains the first MEANS statement:

     MEANS MOD_AGGR;


This statement requests that SAS print the group means and standard deviations for the
treatment conditions under the predictor variable. You will need these means in
interpreting the results, and you will report the means and standard deviations in your
analysis report.

• Line 5 contains the second MEANS statement:

     MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;

This second MEANS statement requests the multiple comparison procedure that will
determine which pairs of treatment conditions are significantly different from each other.
You should list the name of your predictor variable to the right of the word MEANS. In
the preceding statement, MOD_AGGR is the name of the predictor variable in the
current analysis, and so it was listed to the right of the word MEANS. The name of your
predictor variable should be followed by a slash (/) and the keywords TUKEY,
CLDIFF, and ALPHA=0.05. The keyword TUKEY requests that the Tukey HSD test be
performed as a multiple comparison procedure. The keyword CLDIFF requests that
Tukey tests be presented as confidence intervals for the differences between the means.
The keyword ALPHA=0.05 requests that alpha (the level of significance) be set at .05
for the Tukey tests, and results in the printing of 95% confidence intervals for
differences between means. If you had instead used the keyword ALPHA=0.01, it would
have resulted in alpha being set at .01 for the Tukey tests, and in the printing of 99%
confidence intervals. If you had instead used the keyword ALPHA=0.1, it would have
resulted in alpha being set at .10 for the Tukey tests, and in the printing of 90%
confidence intervals. If you omit the ALPHA option, the default is .05 (see the example
after these notes).

• Finally, lines 6, 7, and 8 contain the TITLE1, RUN, and QUIT statements for your
  program.
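
For instance, to run the Tukey tests described in the note for Line 5 at the .01 level and
obtain 99% confidence intervals, you could change only the ALPHA option in the second
MEANS statement (a minimal sketch that simply applies the option described above):

     MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.01;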

The complete SAS program. Here is the complete SAS program that will input your data
set, perform a one-way ANOVA with one between-subjects factor, and follow with Tukey's
HSD test:
OPTIONS LS=80 PS=60;
DATA D1;
INPUT SUB_NUM
MOD_AGGR $
SUB_AGGR;
DATALINES;
01 L 02
02 L 14
03 L 10
04 L 08
05 L 08
06 L 15
07 L 03
08 L 12
09 M 13
10 M 25
11 M 16
12 M 20
13 M 21
14 M 21
15 M 17
16 M 26
17 H 20
18 H 14
19 H 23
20 H 22
21 H 24
22 H 26
23 H 19
24 H 29
;
PROC GLM DATA=D1;
CLASS MOD_AGGR;
MODEL SUB_AGGR = MOD_AGGR;
MEANS MOD_AGGR;
MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
TITLE1 'JOHN DOE';
RUN;
QUIT;

Keywords for Other Multiple Comparison Procedures


The preceding section showed you how to write a program that would request the Tukey
HSD test as a multiple comparison procedure. The sections that follow will show you how
to interpret the results generated by that test. However, it is possible that some readers will
want to use a multiple comparison procedure other than the Tukey test. In fact, a wide
variety of other tests are available with SAS. They can be requested with the MEANS
statement, using the following syntax:
MEANS predictor-variable / mult-comp-proc ALPHA=alpha-level;
You should insert the keyword for the procedure that you want in the location where
mult-comp-proc appears. With some of these procedures, you can also include the CLDIFF
option to request confidence intervals for differences between means.
Here is a list of keywords for some frequently used multiple comparison procedures that are
available with SAS:
BON        Bonferroni t tests of differences between means

DUNCAN     Duncan's multiple range test

DUNNETT    Dunnett's two-tailed t test, determining if any groups are
           significantly different from a single control

GABRIEL    Gabriel's multiple-comparison procedure

REGWQ      Ryan-Einot-Gabriel-Welsch multiple range test

SCHEFFE    Scheffe's multiple-comparison procedure

SIDAK      Pairwise t tests of differences between means, with levels
           adjusted according to Sidak's inequality

SNK        Student-Newman-Keuls multiple range test

T          Pairwise t tests (equivalent to Fisher's least-significant-
           difference test when cell sizes are equal)

TUKEY      Tukey's studentized range (HSD) test
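
For example, if you preferred Bonferroni t tests to the Tukey test, you could substitute the
BON keyword from the list above into the second MEANS statement (a sketch using the
keywords shown here; everything else in the program stays the same):

     MEANS MOD_AGGR / BON CLDIFF ALPHA=0.05;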

Output Produced by the SAS Program


Using the OPTIONS statement shown, the preceding program would produce four pages of
output. The information that appears on each page is briefly summarized here; later sections
will provide detailed guidelines for interpreting these results.

• Page 1 provides class level information and the number of observations in the data set.

• Page 2 provides the ANOVA summary table from the GLM procedure.

• Page 3 provides the results of the first MEANS statement. This MEANS statement simply
  requests means and standard deviations on the criterion variable for the three treatment
  conditions.

• Page 4 provides the results of the second MEANS statement. This includes the results of
  the Tukey multiple comparison tests and the confidence intervals.

Steps in Interpreting the Output


1. Make sure that everything looks correct. With most analyses, you should begin this
process by analyzing the criterion variable with PROC MEANS or PROC UNIVARIATE to
verify that (a) no Minimum observed value in your data set is lower than the theoretically
lowest possible score, and (b) no Maximum observed value in your data set is higher than
the theoretically highest possible score. Before you analyze your data with PROC GLM, you
should review the section titled Summary of Assumptions Underlying One-Way ANOVA
with One Between-Subjects Factor, earlier in this chapter. See Chapter 7, Measures of
Central Tendency and Variability, for a discussion of PROC MEANS and PROC
UNIVARIATE.
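
For example, a quick preliminary check on the criterion variable might look like the
following (a minimal sketch, not part of the chapter's program; the statistic keywords
simply request the sample size, mean, minimum, and maximum):

     PROC MEANS DATA=D1 N MEAN MIN MAX;
        VAR SUB_AGGR;
     RUN;

Because the criterion variable is a count of aggressive acts, no minimum value should fall
below zero.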
The output created by the GLM procedure in the preceding program also contains
information that might help to identify possible errors in writing the program or in typing
the data. This section shows how to review that information.


First review the class level information that appears on Page 1 of the PROC GLM output.
This page is reproduced here as Output 15.1.
                        JOHN DOE
                    The GLM Procedure

                 Class Level Information

            Class         Levels    Values
            MOD_AGGR           3    H L M

            Number of observations    24

Output 15.1. Class level information from one-way ANOVA performed on
aggression data, significant treatment effect.

First, verify that the name of your predictor variable appears under the heading "Class."
Here, you can see that the classification variable is MOD_AGGR.
Under the heading "Levels," the output should indicate how many groups of subjects were
included in your study. Output 15.1 correctly indicates that your predictor variable consists
of three groups.
Under the heading "Values," the output should indicate the specific numbers or letters that
you used to code this predictor variable. Output 15.1 correctly indicates that you used the
values "H," "L," and "M."
It is important to use uppercase and lowercase letters consistently when you are coding
treatment conditions under the predictor variable. For example, the preceding paragraph
indicated that you used uppercase letters (H, L, and M) in coding conditions. If you had
accidentally keyed a lowercase "h" instead of an uppercase "H" for one or more of your
subjects, SAS would have interpreted that as a code for a new and different treatment
condition. This, of course, would have led to errors in the analysis. One way to guard
against this mistake is shown below.
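
If you are worried about inconsistent capitalization, one defensive step (not part of the
program shown in this chapter; a sketch using the SAS UPCASE function) is to convert the
codes to uppercase before running the analysis:

     DATA D1;
        SET D1;
        MOD_AGGR = UPCASE(MOD_AGGR);   /* converts a stray 'h' to 'H' */
     RUN;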
Finally, the last line in Output 15.1 indicates the number of observations in the data set. The
present example used three groups with eight subjects each, for a total N of 24. Output 15.1
indicates that your data set included 24 observations, so everything appears to be correct at
this point.
Page 2 of the output provides the analysis of variance table created by PROC GLM. It is
reproduced here as Output 15.2.


                                   JOHN DOE
                               The GLM Procedure

Dependent Variable: SUB_AGGR
                                     Sum of
Source                 DF           Squares     Mean Square    F Value    Pr > F
Model                   2        788.250000      394.125000      18.74    <.0001
Error                  21        441.750000       21.035714
Corrected Total        23       1230.000000

        R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
        0.640854      26.97924      4.586471         17.00000

Source                 DF         Type I SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       788.2500000     394.1250000      18.74    <.0001

Source                 DF       Type III SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       788.2500000     394.1250000      18.74    <.0001

Output 15.2. ANOVA summary table from one-way ANOVA performed on
aggression data, significant treatment effect.

Near the top of output page 2 on the left side, the name of the criterion variable being
analyzed should appear to the right of the heading "Dependent Variable." In Output 15.2,
the dependent variable is listed as SUB_AGGR. You will remember that SUB_AGGR
stands for "subject aggression." The remainder of Output 15.2 provides information about
the analysis of this dependent variable.
The top half of Output 15.2 consists of the ANOVA summary table for the analysis. This
ANOVA summary table is made up of columns with headings such as "Source," "DF,"
"Sum of Squares," and so on.
The first column of this table is headed "Source," and below this "Source" heading are three
subheadings: "Model," "Error," and "Corrected Total."
Look to the right of the heading "Corrected Total," and under the column headed "DF." For
the current output, you will see the number 23. This number represents the corrected
total degrees of freedom. This number should always be equal to N - 1, where N represents
the total number of subjects for whom you have a complete set of data. In this study, N was
24, and so the corrected total degrees of freedom should be equal to 24 - 1 = 23. Output 15.2
shows that the corrected total degrees of freedom are in fact equal to 23, so again it appears
that everything is correct so far.
Later, you will return to the ANOVA summary table that appears in Output 15.2 to
determine whether your treatment effect was significant, and to review other important
information. For now, however, continue reviewing other pages of output to see if there are
any other obvious signs of problems with your analysis.
The means and standard deviations for the three groups of subjects are found on output
page 3. This page is reproduced here as Output 15.3.


                       JOHN DOE
                   The GLM Procedure

Level of           -----------SUB_AGGR----------
MOD_AGGR     N             Mean         Std Dev
H            8       22.1250000      4.58062691
L            8        9.0000000      4.75093976
M            8       19.8750000      4.42194204

Output 15.3. Table of means and standard deviations from one-way ANOVA
performed on aggression data, significant treatment effect.

On the left side of Output 15.3 is the heading "Level of MOD_AGGR." Below this heading
are the three values used to code the three treatment conditions: "H," "L," and "M." To the
right of these values are descriptive statistics for the three groups. You should review these
descriptive statistics for any obvious signs of problems (e.g., a sample size that is too large
or too small, a group mean that is lower than the lowest possible score on the criterion or
higher than the highest possible score).
Output 15.3 shows that, for the "High" group, the sample size was 8, the mean was
22.13, and the standard deviation was 4.58.
For the "Low" group, the sample size was 8, the mean was 9.00, and the standard deviation
was 4.75.
For the "Moderate" group, the sample size was 8, the mean was 19.88, and the standard
deviation was 4.42. These sample sizes are correct, and the means and standard deviations
all seem reasonable.
In summary, these results provide no evidence of an obvious error in writing the program or
typing the data. You can therefore proceed to interpret the results that are relevant to your
research questions.
2. Determine whether the treatment effect is statistically significant. Since there are no
obvious signs from the output that you made errors in writing your program, you can now
determine whether your study's predictor variable had an effect on the study's criterion
variable. To do this, you will review an F statistic that appears in the ANOVA summary
table produced by PROC GLM. This F statistic tests the study's null hypothesis. To refresh
your memory, that null hypothesis is reproduced again here:
Statistical null hypothesis (H0): μ1 = μ2 = μ3; In the population, there is no difference
between subjects in the low-model-aggression condition, subjects in the moderate-model-
aggression condition, and subjects in the high-model-aggression condition with respect to
their mean scores on the criterion variable (the number of aggressive acts displayed by
the subjects).
The F statistic relevant to this null hypothesis appears on page 2 of the output produced by
PROC GLM. This output is reproduced here as Output 15.4.


                                   JOHN DOE
                               The GLM Procedure

Dependent Variable: SUB_AGGR
                                     Sum of
Source                 DF           Squares     Mean Square    F Value    Pr > F
Model                   2        788.250000      394.125000      18.74    <.0001
Error                  21        441.750000       21.035714
Corrected Total        23       1230.000000

        R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
        0.640854      26.97924      4.586471         17.00000

Source                 DF         Type I SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       788.2500000     394.1250000      18.74    <.0001

Source                 DF       Type III SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       788.2500000     394.1250000      18.74    <.0001

Output 15.4. Type III sums of squares from one-way ANOVA performed on
aggression data, significant treatment effect.

The F statistic that you are interested in appears in a sum of squares table toward the bottom
of Output 15.4. There are actually two sums of squares tables that appear at the bottom of
this page of output. The first table is based on the Type I sums of squares, and the second is
based on the Type III sums of squares. For the type of investigation described here, you
should generally interpret the Type III sums of squares.
Note that the results that are presented in the Type I table are often identical to the results in
the Type III table, which is the case for the current analysis. However, data from some types
of studies will lead to results in the Type I table that are not identical to the results in the
Type III table. This may happen, for example, if your study includes more than one
predictor variable or if different numbers of subjects appear in different treatment
conditions. It would typically be best to interpret the results from the Type III table, rather
than the Type I table in these situations. Therefore, to keep things simple, this text
recommends that you always interpret the Type III table for the types of studies presented
here.
On the left side of this section of output is the heading "Source," which represents source
of variation.
Below this heading you will find the name of your predictor variable, which in this case is
MOD_AGGR. To the right, you will find analysis of variance information for this treatment
effect (the model aggression treatment effect):
Below the heading "DF," you can see that the degrees of freedom associated with the
model-aggression predictor variable was equal to 2. The formula for these degrees of
freedom is k - 1, where k is equal to the number of groups being compared. The current
analysis involves three groups, and 3 - 1 = 2.
Below the heading "Type III SS," you can see that the sum of squares associated with the
model-aggression predictor variable was equal to 788.25.
Below the heading "Mean Square," you can see that the mean square associated with the
model-aggression predictor variable was equal to 394.13.
Below the heading "F Value," you can see that the obtained F statistic associated with the
model-aggression predictor variable was equal to 18.74. This is the F statistic that tests your
study's null hypothesis.
Below the heading "Pr > F," you can see that the p value (probability value) associated with
the preceding F statistic is 0.0001. This p value indicates that the F statistic is significant at
the .0001 level.
This last heading (Pr > F) gives you the probability of obtaining an F statistic that is this
large or larger, if the null hypothesis were true. In the present case, this p value is very
small: it is equal to 0.0001. When a p value is less than .05, you may reject the null
hypothesis, so in this case the null hypothesis of no population differences is rejected. This
means that you have a significant treatment effect. In other words, you may tentatively
conclude that, in the population, there is a difference between at least two of the treatment
conditions.
Because you have obtained a significant F statistic, these results seem to provide support for
your research hypothesis that model aggression has an effect on subject aggression. Later,
you will review the group means and results of the multiple comparison procedures to see if
the results are in the predicted direction. First, however, you will prepare an ANOVA
summary table to summarize some of the information from Output 15.4.
3. Prepare your own version of the ANOVA summary table. Table 15.2 provides the
completed ANOVA summary table for the current analysis.
Table 15.2
ANOVA Summary Table for Study Investigating the Relationship between
Level of Aggression Displayed by Model and Subject Aggression
(Significant Treatment Effect)
___________________________________________________________________
Source                    df          SS          MS          F       R2
___________________________________________________________________
Model aggression           2      788.25      394.13      18.74 *    .64
Within groups             21      441.75       21.04
Total                     23     1230.00
___________________________________________________________________
Note: N = 24.
* p < .0001

To complete the preceding table, you simply transfer information from Output 15.4 to the
appropriate line of the ANOVA summary table. For your convenience, Output 15.4 is
reproduced here as Output 15.5.


                                   JOHN DOE
                               The GLM Procedure

Dependent Variable: SUB_AGGR
                                     Sum of
Source                 DF           Squares     Mean Square    F Value    Pr > F
Model                   2        788.250000      394.125000      18.74    <.0001
Error                  21        441.750000       21.035714
Corrected Total        23       1230.000000

        R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
        0.640854      26.97924      4.586471         17.00000

Source                 DF         Type I SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       788.2500000     394.1250000      18.74    <.0001

Source                 DF       Type III SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       788.2500000     394.1250000      18.74    <.0001

Output 15.5. Information needed for ANOVA summary table for analysis
report, aggression study with significant treatment effect.

Here are some instructions for transferring information from Output 15.5 to the ANOVA
summary table in Table 15.2:

• Treatment effect for the predictor variable. The top line in an ANOVA summary table
  such as Table 15.2 provides information for the study's predictor (or independent)
  variable. The predictor variable in your study was level of aggression displayed by the
  model.
  Information concerning this effect appears to the right of the heading MOD_AGGR in
  Output 15.5. You can see that all of the information regarding MOD_AGGR (e.g., degrees
  of freedom, sum of squares, mean square) has been entered on the line headed "Model
  aggression" in Table 15.2.
  It is important that you give your predictor variable a short but meaningful name (such
  as "Model aggression"). You should generally not use the SAS variable name that you
  used for the predictor variable in your computer analyses because SAS variable names
  are typically too short to be meaningful to the reader of a research article.
You can see that Table 15.2 does not include an entry for the p value associated with
your F statistic. Instead, it simply includes an asterisk (*) next to the F statistic to
indicate that it was statistically significant. A note at the bottom of Table 15.2 explains
the meaning of this asterisk. The note indicates * p < .0001, which means that the F
statistic was significant at the .0001 level.
If the F value had been significant at the .01 level, your note would have looked like this:
* p < .01
If the F value had been significant at the .05 level, your note would have looked like this:
* p < .05

518 Step-by-Step Basic Statistics Using SAS: Student Guide

If the F value is not statistically significant, you do not put an asterisk next to it or place
a note at the bottom of the table.
You can include a column for the p value in an ANOVA summary table, and you might
label this column p or Probability. Below this heading, you can record the actual p
value (.0001, in this case). If you do this, you may omit the asterisk and the note at the
bottom of the table.
The last column in Table 15.2 is headed "R2," and it is in this location that you provide
the R2 value for your predictor variable.
In the output of PROC GLM, this statistic appears toward the middle of the page, below
the heading "R-Square." In Output 15.5, you can see that the R2 value is 0.640854, which
rounds to .64 (R2 is typically rounded to two decimal places). Therefore, the value .64
appears below the heading "R2" in Table 15.2. This means that the predictor variable
(model aggression) accounted for 64% of the variance in the criterion variable (subject
aggression) in this study (a worked check on this value appears after this list).

• Within groups. The "Within groups" line of an ANOVA summary table contains
  information about the error term from the analysis of variance.
  To find this information for the current analysis, look to the right of the heading "Error" in
  Output 15.5. You can see that the information from the "Error" line of Output 15.5 has
  been copied onto the line headed "Within groups" in Table 15.2.

• Total. The total degrees of freedom and the total sum of squares from an analysis of
  variance can be found to the right of the heading "Corrected Total" in the output of PROC
  GLM.
  For the current analysis, look to the right of "Corrected Total" in Output 15.5. You can see
  that the information from this line has been copied onto the line headed "Total" in
  Table 15.2.

• Note regarding sample size. Place a note at the bottom of the table to indicate the size of
  the total sample ("Note: N = 24"). Your own version of the ANOVA summary table is now
  complete.
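
As the check promised above, you can verify the R-Square value reported by PROC GLM
directly from the sums of squares in Output 15.5, because in a one-way ANOVA, R2 is the
model sum of squares divided by the corrected total sum of squares:

     R2 = 788.25 / 1230.00 = 0.6409, which rounds to .64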

4. Review the sample means and standard deviations. Earlier, you reviewed the F
statistic produced by PROC GLM and determined that it was statistically significant. This
told you that you had a significant treatment effect. However, at this point, you still do not
know which group scored higher on the criterion variable. To determine this, you will
review the output that was produced by your first MEANS statement. The first MEANS
statement was as follows:
     MEANS MOD_AGGR;

This relatively simple statement calculated the means and standard deviations on the
criterion variable for the three treatment conditions. The results produced by this MEANS
statement were presented earlier, and appear again in Output 15.6.


                       JOHN DOE
                   The GLM Procedure

Level of           -----------SUB_AGGR----------
MOD_AGGR     N             Mean         Std Dev
H            8       22.1250000      4.58062691
L            8        9.0000000      4.75093976
M            8       19.8750000      4.42194204

Output 15.6. Means and standard deviations on the criterion variable for the
three treatment conditions, aggression study with significant treatment
effect.

Below the heading "Level of MOD_AGGR," you will find the values for the various
treatment conditions. You will remember that the value "H" indicates values for the high-
model-aggression condition, "M" indicates values for the moderate-model-aggression
condition, and the value "L" indicates values for the low-model-aggression condition.
Below the heading "Mean" you will find mean scores on the criterion variable for the three
treatment conditions. The values in this column show that subjects in the high-model-
aggression condition displayed the highest average score on subject aggression, with a mean
of 22.13 (S.D. = 4.58). Subjects in the moderate-model-aggression condition displayed the
next highest average score, with a mean of 19.88 (S.D. = 4.42). Finally, subjects in the low-
model-aggression condition displayed the lowest average score, with a mean of 9.00 (S.D. =
4.75).
You now know the relative ordering of the three groups' mean scores on the criterion
variable. However, you do not know which groups' means are significantly different from
which. In order to determine this, you must consult the result of the multiple comparison
test.
5. Review the results of the multiple comparison procedure. Because the F statistic
(presented in Output 15.4) was significant, you reject the null hypothesis of no differences in
population means. Instead, you tentatively accept the alternative hypothesis, that at least
one of the population means is different from at least one of the other population means.
But because you have three experimental conditions, you now have a new problem: Which
of these groups is significantly different from which? To answer this question, you have
requested a multiple comparison procedure called Tukey's HSD test. Following is the
MEANS statement that you included in your program to request this test:
     MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;

The results of this analysis are presented here as Output 15.7.


                              JOHN DOE
                          The GLM Procedure

           Tukey's Studentized Range (HSD) Test for SUB_AGGR

NOTE: This test controls the Type I experimentwise error rate.

             Alpha                                    0.05
             Error Degrees of Freedom                   21
             Error Mean Square                    21.03571
             Critical Value of Studentized Range   3.56463
             Minimum Significant Difference         5.7803

    Comparisons significant at the 0.05 level are indicated by ***.

                     Difference
       MOD_AGGR         Between        Simultaneous 95%
       Comparison         Means       Confidence Limits

       H - M              2.250      -3.530      8.030
       H - L             13.125       7.345     18.905   ***
       M - H             -2.250      -8.030      3.530
       M - L             10.875       5.095     16.655   ***
       L - H            -13.125     -18.905     -7.345   ***
       L - M            -10.875     -16.655     -5.095   ***

Output 15.7. Results of Tukey HSD multiple comparison test performed on
aggression data, significant treatment effect.

Some notes about Output 15.7:

The heading tells you that this page presents the results of a Tukey test performed on the
criterion variable SUB_AGGR.
To the right of "Alpha" is the entry 0.05, which tells you that these tests are performed at
the .05 level of significance. You can request different levels of significance (see the section
"The PROC step" for instructions on how to do this).
To the right of "Minimum Significant Difference" is the entry 5.7803. This means that,
according to the Tukey test, the means for two treatment conditions must differ by at least
5.7803 to be considered significantly different at the .05 level. A worked check on this
value appears after this paragraph.
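
If you want to verify the minimum significant difference yourself, note that for equal group
sizes it equals the critical value of the studentized range multiplied by the square root of the
error mean square divided by the number of subjects per group (a worked check, using the
values printed in Output 15.7):

     Minimum Significant Difference = 3.56463 x sqrt(21.03571 / 8)
                                    = 3.56463 x 1.62156
                                    = 5.7803

This also shows why the high versus moderate comparison will turn out to be
nonsignificant: the difference between those means (22.125 - 19.875 = 2.250) is smaller
than 5.7803, while the other differences (13.125 and 10.875) are larger.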
In the middle of the page, a note says "Comparisons significant at the 0.05 level are
indicated by ***." This means that, in the lower portion of the table, you will know that the
comparison between two treatment conditions is significant at the .05 level if it is flagged
with three asterisks.
In the lower portion of the output, the first section is headed "MOD_AGGR Comparison."
This section indicates which treatment conditions are being compared. For example, the first
row is headed "H - M." This row provides information about the Tukey test in which the
"H" condition (the high-model-aggression condition) is compared to the "M" condition (the
moderate-model-aggression condition). Everything in this row provides information about
this one comparison.


In the lower portion of the output, the second section is headed "Difference Between
Means." The entries in this column represent the difference between the two means that are
being compared. For example, the first row contains the "H - M" comparison, which can be
read as the "High minus Moderate" comparison. This comparison was made by starting with
the mean score on the criterion variable for the "high" group, and subtracting from it the
mean score on the criterion variable for the "moderate" group. Output 15.6 showed that the
mean for the high group was 22.125, and the mean for the moderate group was 19.875. The
difference between these means is calculated as follows:
22.125 - 19.875 = 2.25
As you can see from Output 15.7, the value 2.25 appears in the column headed "Difference
Between Means" for the comparison "H - M." The remaining values in this column can be
interpreted in the same way. In the row headed "H - L," you will find the difference
between means for the high condition versus the low condition (13.125); in the row headed
"M - H," you will find the difference between means for the moderate condition versus the
high condition (-2.250), and so on.
In the lower portion of the output, on the right side of the table is an area where asterisks
("***") may appear. If these asterisks appear in the row for a particular comparison, it
means that there is a significant difference between the two treatment conditions being
compared in that row (according to the Tukey test). For example, the first row is the row for
the "H - M" comparison. Notice that there are no asterisks on the far right of this row. This
means that the difference between the high group versus the moderate group is
nonsignificant, according to the Tukey test. The second row is the row for the "H - L"
comparison. Notice that there are three asterisks on the far right side of this row. This means
that there is a significant difference between the high group versus the low group, according
to the Tukey test. The remaining rows can be interpreted in the same way. After you review
all of the comparisons, you will see that the difference between the high group and the low
group is significant, the difference between the moderate group and the low group is
significant, but the difference between the high group and the moderate group is
nonsignificant.
6. Review the confidence intervals for the differences between means. The MEANS
statement that you included in the SAS program that performed your analysis of variance is
once again presented here:
     MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;

This MEANS statement contains the options TUKEY and CLDIFF. Together, these options
request the Tukey HSD test (discussed earlier), and also request that the results be presented
in the form of confidence limits. In Chapter 12, "The Single-Sample t Test," you learned
that a confidence interval is an interval that extends from a lower confidence limit to an
upper confidence limit and is assumed to contain a population parameter with a stated
probability, or level of confidence. In this case, the population parameter that you are
interested in is the difference between two treatment means in the population. When SAS
prints the results that are generated by the preceding MEANS statement, this output will not
only contain the observed differences between the sample means (which were discussed in
the preceding section), but it will also contain confidence intervals for these differences.
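
With equal group sizes, each of these Tukey confidence intervals is simply the observed
difference between two means plus or minus the minimum significant difference (5.7803 in
this analysis). For the high versus moderate comparison, for example (a worked check, not
part of the SAS output):

     2.250 - 5.7803 = -3.530   (lower confidence limit)
     2.250 + 5.7803 =  8.030   (upper confidence limit)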
Output 15.8 reproduces the output generated by the preceding MEANS statement. To
conserve space, only the lower portion of the output page is reproduced in Output 15.8.
    Comparisons significant at the 0.05 level are indicated by ***.

                     Difference
       MOD_AGGR         Between        Simultaneous 95%
       Comparison         Means       Confidence Limits

       H - M              2.250      -3.530      8.030
       H - L             13.125       7.345     18.905   ***
       M - H             -2.250      -8.030      3.530
       M - L             10.875       5.095     16.655   ***
       L - H            -13.125     -18.905     -7.345   ***
       L - M            -10.875     -16.655     -5.095   ***

Output 15.8. Confidence limits for differences between means from
aggression study, significant treatment effect.

As stated earlier, the section of the output that is headed "MOD_AGGR Comparison"
indicates which treatment conditions are being compared. For example, the first row is
identified with "H - M," which means that this row provides information about the
comparison between the high-model-aggression group versus the moderate-model-
aggression group.
The second section in this portion of the output is headed "Difference Between Means,"
and this column indicates the observed difference between the means of the two treatment
conditions that are being compared. The entry for the first row is 2.250, which means that
the difference between the means of the high group versus the moderate group is 2.250.
The confidence interval for this difference between means appears in the section headed
"Simultaneous 95% Confidence Limits." Two columns of numbers appear below this
heading. The column of numbers on the left provides the lower limit for the confidence
interval, and the column of numbers on the right provides the upper limit for the
confidence interval.
Once again, consider the first row in the table: the row for the "H - M" comparison.
The observed difference between means for these two groups is 2.250, and the 95%
confidence interval extends from -3.530 to 8.030 (these latter two values came from the
section headed "Simultaneous 95% Confidence Limits"). This confidence interval means
that, although you do not know for sure what the actual difference is between these two
conditions in the population, you estimate that there is a 95% probability that the difference
is somewhere between -3.530 and 8.030.
Notice that the preceding confidence interval included the value of zero. This is consistent
with the results of the Tukey procedure indicating that the difference between the high
condition and the moderate condition was nonsignificant. Whenever you find a


nonsignificant difference between two conditions, the confidence interval for that difference
will typically contain zero.
Now consider the second row in the table: the row for the "H - L" comparison. This
row provides information about the comparison between the high condition versus the low
condition. The observed difference between means for these two groups is 13.125, and the
95% confidence interval extends from 7.345 to 18.905. This confidence interval means that,
although you cannot be certain what the actual difference is between these two conditions in
the population, this procedure estimates that there is a 95% probability that the difference is
between 7.345 and 18.905.
Notice that the preceding confidence interval does not include the value of zero. This is
consistent with results of the Tukey procedure indicating that the difference between the
high condition versus the low condition was statistically significant. Anytime you find a
significant difference between two conditions, the confidence interval for that difference
will typically not contain zero.
The remaining rows in Output 15.8 (for the remaining comparisons) may be interpreted in
the same way.
Notice that half of the rows in Output 15.8 seem to be redundant. For example, you have
already seen that the first row provides information about the comparison between the high
condition versus the moderate condition. You can see that the third row down provides
information about the same comparison. The difference is this: In the first row, the mean for
the moderate condition is subtracted from the mean for the high condition, while in the third
row the mean for the high condition is subtracted from the mean for the moderate condition.
The absolute values of all of the numbers involved are the same, so the two rows provide
information that is essentially redundant.
7. Prepare a table that presents the results of the Tukey tests and the confidence
intervals. Now that you know how to interpret the Tukey tests and the confidence intervals
for the differences between means, you will prepare a table that summarizes them for a
report. You can obtain the information that you need from the lower portion of the output
page that provided the results of the Tukey multiple comparison procedure. For your
convenience, those results are reproduced as Output 15.9.


    Comparisons significant at the 0.05 level are indicated by ***.

                     Difference
       MOD_AGGR         Between        Simultaneous 95%
       Comparison         Means       Confidence Limits

       H - M              2.250      -3.530      8.030
       H - L             13.125       7.345     18.905   ***
       M - H             -2.250      -8.030      3.530
       M - L             10.875       5.095     16.655   ***
       L - H            -13.125     -18.905     -7.345   ***
       L - M            -10.875     -16.655     -5.095   ***

Output 15.9. Information needed for table presenting Tukey tests and
confidence intervals, aggression data with significant treatment effect.

The table that you prepare for a published report will use a format that is very similar to the
format used with Output 15.9. You can copy information directly from the SAS output to
your table. The main difference is that you will use titles, headings, and notes that convey
information more clearly. You will also omit lines of information that are redundant. The
completed table is shown in Table 15.3.
Table 15.3
Results of Tukey Tests Comparing High-Model-Aggression Group versus
Moderate-Model-Aggression Group versus Low-Model-Aggression Group on the
Criterion Variable (Subject Aggression)
____________________________________________________
                    Difference      Simultaneous 95%
                       between      confidence limits
Comparison (a)           means      Lower       Upper
____________________________________________________
High - Moderate          2.250      -3.530      8.030
High - Low              13.125 *     7.345     18.905
Moderate - Low          10.875 *     5.095     16.655
____________________________________________________
Note: N = 24.
(a) Differences are computed by subtracting the mean for the second group
from the mean for the first group.
* Tukey test indicates that the difference between the means is
significant at p < .05.
Notice that, in Table 15.3, the column on the left is headed "Comparison." The entries in
this column are the names for the treatment conditions (e.g., "High - Moderate"), rather than
the simpler letter values that appeared in the SAS output (e.g., "H - M"). This was done to
present the information more clearly to the reader.
You can see that a single asterisk is used to identify comparisons that are statistically
significant. In Table 15.3, these asterisks appear in the column headed "Difference between
means" (in the SAS output, they had appeared on the far right side). A note at the bottom of
the table indicates the significance level that is indicated by an asterisk.
Notice that there are only three comparisons listed in Table 15.3: high versus moderate,
high versus low, and moderate versus low. This includes all possible comparisons between
the three treatment conditions in the study. It is true that Output 15.9 actually lists six
comparisons, but three of these six comparisons were redundant (as was discussed in an
earlier section).
Using a Figure to Illustrate the Results
The results of an analysis of variance are easiest to understand when they are represented in
a figure that plots the means for each of the study's treatment conditions. The mean subject
aggression scores for the current analysis were presented in Output 15.6. They are presented
in bar chart form in Figure 15.6.

Figure 15.6. Mean number of aggressive acts as a function of the level of
aggression displayed by model (significant F statistic).

You can see that the three bars in Figure 15.6 illustrate the mean scores for the three groups
that are included in the analysis. The figure shows that the mean scores are 9.00, 19.88, and
22.13 for the low-, moderate-, and high-model-aggression conditions, respectively.
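
The chapter does not show how Figure 15.6 was drawn, but if you want SAS itself to
produce a similar bar chart of group means, one possible sketch uses base SAS PROC
CHART (the result is a rough character-based version of the figure):

     PROC CHART DATA=D1;
        VBAR MOD_AGGR / SUMVAR=SUB_AGGR TYPE=MEAN;
     RUN;

Here SUMVAR= names the variable to summarize, and TYPE=MEAN requests the mean of
that variable within each level of MOD_AGGR.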


Analysis Report for the Aggression Study (Significant Results)


Following is an analysis report that summarizes the results of your analysis. A section
following this one explains where in your SAS output you can find some of the statistics that
appear in this report.
A) Statement of the research question: The purpose of this
study was to determine whether there was a relationship
between (a) the level of aggression displayed by a model and
(b) the number of aggressive acts later demonstrated by
children who observed the model.
B) Statement of the research hypothesis: There will be a
positive relationship between the level of aggression
displayed by a model and the number of aggressive acts later
demonstrated by children who observed the model. Specifically,
it is predicted that (a) children who witness a high level of
aggression will demonstrate a greater number of aggressive
acts than children who witness a moderate or low level of
aggression, and (b) children who witness a moderate level of
aggression will demonstrate a greater number of aggressive
acts than children who witness a low level of aggression.
C) Nature of the variables: This analysis involved one
predictor variable and one criterion variable:
The predictor variable was the level of aggression displayed
by the model. This was a limited-value variable, was
assessed on an ordinal scale, and included three levels:
low, moderate, and high.
The criterion variable was the number of aggressive acts
displayed by the subjects after observing the model. This
was a multi-value variable, and was assessed on a ratio
scale.
D) Statistical test: One-way ANOVA with one between-subjects
factor.
E) Statistical null hypothesis (H0): μ1 = μ2 = μ3; In the
study population, there is no difference between subjects in
the low-model-aggression condition, subjects in the
moderate-model-aggression condition, and subjects in the
high-model-aggression condition with respect to their mean
scores on the criterion variable (the number of aggressive
acts displayed by the subjects).
F) Statistical alternative hypothesis (H1): Not all μs are
equal; In the study population, there is a difference between
at least two of the following three groups with respect to
their mean scores on the criterion variable: subjects in the
low-model-aggression condition, subjects in the
moderate-model-aggression condition, and subjects in the
high-model-aggression condition.


G) Obtained statistic: F(2, 21) = 18.74
H) Obtained probability (p) value: p = .0001
I) Conclusion regarding the statistical null hypothesis:
Reject the null hypothesis.
J) Multiple comparison procedure: Tukey's HSD test showed
that subjects in the high-model-aggression condition and
moderate-model-aggression condition scored significantly
higher on subject aggression than did subjects in the
low-model-aggression condition (p < .05). With alpha set at
.05, there were no significant differences between subjects in
the high-model-aggression condition versus the
moderate-model-aggression condition.
K) Confidence intervals: Confidence intervals for differences
between the means are presented in Table 15.3.
L) Effect size: R2 = .64, indicating that model aggression
accounted for 64% of the variance in subject aggression.
M) Conclusion regarding the research hypothesis: These
findings provide partial support for the study's research
hypothesis. The findings provided support for the hypothesis
that (a) children who witness a high level of aggression will
demonstrate a greater number of aggressive acts than children
who witness a low level of aggression, as well as for the
hypothesis that (b) children who witness a moderate level of
aggression will demonstrate a greater number of aggressive
acts than children who witness a low level of aggression.
However, the study failed to provide support for the
hypothesis that children who witness a high level of
aggression will demonstrate a greater number of aggressive
acts than children who witness a moderate level of aggression.
N) Formal description of the results for a paper: Results
were analyzed using a one-way ANOVA with one between-subjects
factor. This analysis revealed a significant treatment effect
for level of aggression displayed by the model, F(2, 21) =
18.74, MSE = 21.04, p = .0001.
On the criterion variable (number of aggressive acts
displayed by subjects), the mean score for the
high-model-aggression condition was 22.13 (SD = 4.58), the
mean for the moderate-model-aggression condition was 19.88
(SD = 4.42), and the mean for the low-model-aggression
condition was 9.00 (SD = 4.75). The sample means are displayed
in Figure 15.6. Tukey's HSD test showed that subjects in the
high-model-aggression condition and moderate-model-aggression
condition scored significantly higher on subject aggression
than did subjects in the low-model-aggression condition
(p < .05). With alpha set at .05, there were no significant
differences between subjects in the high-model-aggression
condition versus the moderate-model-aggression condition.
Confidence intervals for differences between the means are
presented in Table 15.3.
In the analysis, R2 was computed as .64. This indicated
that model aggression accounted for 64% of the variance in
subject aggression.
O) Figure representing the results: See Figure 15.6.

Notes Regarding the Preceding Analysis Report


Overview. With some sections of the preceding report, it might not be clear where, in your
SAS output, to find the necessary statistics to insert in that section. The main ANOVA
summary table that was produced by PROC GLM is therefore reproduced as Output 15.10,
because much of the information needed for your report appears on this page of the
output.
                                   JOHN DOE
                               The GLM Procedure

Dependent Variable: SUB_AGGR
                                     Sum of
Source                 DF           Squares     Mean Square    F Value    Pr > F
Model                   2        788.250000      394.125000      18.74    <.0001
Error                  21        441.750000       21.035714
Corrected Total        23       1230.000000

        R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
        0.640854      26.97924      4.586471         17.00000

Source                 DF         Type I SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       788.2500000     394.1250000      18.74    <.0001

Source                 DF       Type III SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       788.2500000     394.1250000      18.74    <.0001

Output 15.10. Information needed for analysis report on aggression study with
significant treatment effect.

F Statistic, degrees of freedom, and p value. Items G and H from the preceding analysis
report are reproduced here:
G) Obtained statistic: F(2, 21) = 18.74
H) Obtained probability (p) value: p = .0001
You can see that Items G and H provide the F statistic, the degrees of freedom for this
statistic, and the p value that is associated with this statistic. It is worthwhile to review
where these terms can be found in the SAS output. The F statistic of 18.74, which appears in
Item G of the preceding report, is the F statistic associated with the model aggression
predictor variable. It can be found in Output 15.10 where the row headed MOD_AGGR
intersects with the column headed "F Value."


It is customary to list the degrees of freedom for an F statistic within parentheses. In the
preceding report, these degrees of freedom were listed in Item G as "F(2, 21)." The first
term (the 2 within the parentheses) represents the degrees of freedom for the numerator in
the F ratio (i.e., the degrees of freedom for the model aggression treatment effect). This term
appears in Output 15.10 where the row headed MOD_AGGR intersects with the column
headed "DF." The second term (the 21 within parentheses) represents the degrees of
freedom for the denominator in the F ratio (i.e., the degrees of freedom for the error term).
This term appears in Output 15.10 where the row headed "Error" intersects with the column
headed "DF."
Finally, the p value listed in Item H of the preceding report is the p value associated with the
model aggression treatment effect. It appears in Output 15.10 where the row headed
MOD_AGGR intersects with the column headed "Pr > F."
The MSE (mean square error). Item N from the analysis report provides a statistic
abbreviated as the MSE. The relevant section of Item N is reproduced here:
N) Formal description of the results for a paper: Results
were analyzed using a one-way ANOVA with one between-subjects
factor. This analysis revealed a significant treatment
effect for level of aggression displayed by the model,
F(2, 21) = 18.74, MSE = 21.04, p = .0001.
The last sentence of the preceding excerpt indicates that "MSE = 21.04." Here, MSE
stands for "mean square error." It is an estimate of the error variance in your analysis. In
the output from PROC GLM, you will find the MSE where the row headed "Error"
intersects with the column headed "Mean Square." In Output 15.10, you can see that the
mean square error is equal to 21.035714, which rounds to 21.04.
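
As with R2, you can recover the MSE from values elsewhere in the summary table, because
it is the error sum of squares divided by the error degrees of freedom (a worked check):

     MSE = 441.75 / 21 = 21.0357, which rounds to 21.04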

Example 15.2: One-Way ANOVA Revealing a Nonsignificant Treatment Effect

Overview
This section presents the results of a one-way ANOVA in which the treatment effect is
nonsignificant. These results are presented so that you will be prepared to write analysis
reports for projects in which nonsignificant outcomes are observed.


The Complete SAS Program


The study presented here is the same aggression study that was described in the preceding
section. The data will be analyzed with the same SAS program that was presented earlier.
Here, the data have been changed so that they will produce nonsignificant results. The
complete SAS program, including the new data set, is presented here:
OPTIONS LS=80 PS=60;
DATA D1;
INPUT SUB_NUM
MOD_AGGR $
SUB_AGGR;
DATALINES;
01 L 07
02 L 17
03 L 14
04 L 11
05 L 11
06 L 20
07 L 08
08 L 15
09 M 08
10 M 20
11 M 11
12 M 15
13 M 16
14 M 16
15 M 12
16 M 21
17 H 14
18 H 10
19 H 17
20 H 18
21 H 20
22 H 21
23 H 14
24 H 23
;
PROC GLM DATA=D1;
CLASS MOD_AGGR;
MODEL SUB_AGGR = MOD_AGGR;
MEANS MOD_AGGR;
MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
TITLE1 'JOHN DOE';
RUN;
QUIT;


Steps in Interpreting the Output


As with the earlier data set, the SAS program that performs this analysis produces four
pages of output. This section will present just those sections of output that are relevant to
preparing the ANOVA summary table, the confidence intervals table, the figure, and the
analysis report. This section will review the output in a fairly abbreviated manner; for a
more detailed discussion of the output of PROC GLM, see earlier sections of this chapter.
1. Determine whether the treatment effect is significant. As before, you determine
whether the treatment effect is significant by reviewing the ANOVA summary table
produced by PROC GLM. This table appears on page 2 of the output, and is reproduced here
as Output 15.11.
                                   JOHN DOE
                               The GLM Procedure

Dependent Variable: SUB_AGGR
                                     Sum of
Source                 DF           Squares     Mean Square    F Value    Pr > F
Model                   2        72.3333333      36.1666667       1.88    0.1778
Error                  21       404.6250000      19.2678571
Corrected Total        23       476.9583333

        R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
        0.151655      29.34496      4.389517         14.95833

Source                 DF         Type I SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       72.33333333     36.16666667       1.88    0.1778

Source                 DF       Type III SS     Mean Square    F Value    Pr > F
MOD_AGGR                2       72.33333333     36.16666667       1.88    0.1778

Output 15.11. ANOVA summary table for one-way ANOVA performed on
aggression data, nonsignificant treatment effect.

As with the earlier data set, you review the results of the analyses that appear in the section
headed "Type III SS," as opposed to the section headed "Type I SS."
To determine whether the treatment effect is significant, look to the right of the heading
MOD_AGGR. Here, you can see that the F statistic is only 1.88, with a p value
of .1778. The obtained p value is greater than the standard criterion of .05, which means
that this F statistic is nonsignificant. This means that you do not have a significant treatment
effect for your predictor variable.
2. Prepare your own version of the ANOVA summary table. The completed ANOVA
summary table for this analysis is presented here as Table 15.4.



Table 15.4
ANOVA Summary Table for Study Investigating the Relationship between
Level of Aggression Displayed by Model and Subject Aggression
(Nonsignificant Treatment Effect)
___________________________________________________________________
Source                    df          SS          MS          F       R2
___________________________________________________________________
Model aggression           2       72.33       36.17       1.88      .15
Within groups             21      404.63       19.27
Total                     23      476.96
___________________________________________________________________
Note: N = 24.
F statistic is nonsignificant with alpha set at .05.

Notice how information from Output 15.11 was used to fill in the relevant sections of
Table 15.4:

• Information from the line headed MOD_AGGR in Output 15.11 was transferred to
  the line headed "Model aggression" in Table 15.4.

• Information from the line headed "Error" in Output 15.11 was transferred to the line
  headed "Within groups" in Table 15.4.

• Information from the line headed "Corrected Total" in Output 15.11 was transferred
  to the line headed "Total" in Table 15.4.

• The R2 value that appeared below the heading "R-Square" in Output 15.11 was
  transferred below the heading "R2" in Table 15.4 (a worked check on this value
  appears after this list).
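
As in the first example, you can verify this R2 value directly from the sums of squares in
Output 15.11 (a worked check):

     R2 = 72.33 / 476.96 = 0.1517, which rounds to .15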

3. Review the results of the multiple comparison procedure. Notice that, unlike the
previous section, this section does not advise you to review the results of the multiple
comparison procedure (the Tukey test). This is because the treatment effect in the current
analysis was nonsignificant, and you normally would not interpret the results of multiple
comparison procedures for treatment effects that are not significant.
4. Review the confidence intervals for the differences between means. Although the
results from the F statistic are nonsignificant, it may still be useful to review the size of the
confidence intervals created in the analysis. Output 15.12 presents these intervals.


                                   JOHN DOE
                               The GLM Procedure

             Tukey's Studentized Range (HSD) Test for SUB_AGGR

     NOTE: This test controls the Type I experimentwise error rate.

             Alpha                                      0.05
             Error Degrees of Freedom                     21
             Error Mean Square                      19.26786
             Critical Value of Studentized Range     3.56463
             Minimum Significant Difference            5.532

     Comparisons significant at the 0.05 level are indicated by ***.

                        Difference
         MOD_AGGR          Between      Simultaneous 95%
        Comparison           Means     Confidence Limits

         H - M               2.250      -3.282     7.782
         H - L               4.250      -1.282     9.782
         M - H              -2.250      -7.782     3.282
         M - L               2.000      -3.532     7.532
         L - H              -4.250      -9.782     1.282
         L - M              -2.000      -7.532     3.532

Output 15.12. Confidence intervals for differences between the means created
in analysis of aggression data, nonsignificant differences.

The confidence intervals for the current analysis appear below the heading "Simultaneous
95% Confidence Limits." You can see that none of the comparisons in this section are
flagged with three asterisks, which means that none of the differences were significant
according to the Tukey test. In addition, you can see that all of the confidence intervals in
this section contain the value of zero. This is also consistent with the fact that none of the
differences were statistically significant. (In this balanced design, each interval is simply the
observed difference plus and minus the minimum significant difference of 5.532; for the
H - M comparison, for example, 2.250 ± 5.532 gives the limits -3.282 and 7.782.)

Table 15.5 presents the confidence intervals resulting from the Tukey tests that you might
prepare for a published report. The note at the bottom of the table tells the reader that all
comparisons are nonsignificant. You can see that all of the differences between means and
confidence limits came from Output 15.12.



Table 15.5
Results of Tukey Tests Comparing High-Model-Aggression Group versus
Moderate-Model-Aggression Group versus Low-Model-Aggression Group on the
Criterion Variable (Subject Aggression), Nonsignificant Differences
_____________________________________________________
                    Observed      Simultaneous 95%
                   difference    confidence limits b
                    between      ___________________
Comparison          means a       Lower       Upper
_____________________________________________________
High - Moderate      2.250       -3.282       7.782
High - Low           4.250       -1.282       9.782
Moderate - Low       2.000       -3.532       7.532
_____________________________________________________
Note: N = 24.
a Differences are computed by subtracting the mean for the second group
from the mean for the first group.
b With alpha set at .05, all Tukey test comparisons were nonsignificant.

Using a Graph to Illustrate the Results


Journal articles typically do not include a graph to illustrate group means when the treatment
effect is nonsignificant. However, a graph presenting group means will be presented here as
an illustration.
The means for the three conditions of the present investigation appeared on page 3 of the
preceding output (which presented the results from the first MEANS statement). Page 3
from the analysis is reproduced here as Output 15.13.

                               JOHN DOE
                           The GLM Procedure

       Level of            -----------SUB_AGGR-----------
       MOD_AGGR      N           Mean            Std Dev

       H             8     17.1250000         4.29077083
       L             8     12.8750000         4.45413131
       M             8     14.8750000         4.42194204

Output 15.13. Means and standard deviations produced by the MEANS statement
in analysis of aggression data, nonsignificant differences.

Below the heading "Mean" you will find the mean scores for these three groups. You
can see that the mean scores were 17.13, 12.88, and 14.88 for the high, low, and moderate
groups, respectively. Figure 15.7 illustrates these group means.
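If you would like SAS to draw a figure similar to Figure 15.7 rather than plotting the means
by hand, one option is PROC SGPLOT. This is a sketch only: SGPLOT belongs to more
recent releases of SAS than the one this guide describes, and the data set name AGGRESS is
an assumption.

   PROC SGPLOT DATA=AGGRESS;
      VLINE MOD_AGGR / RESPONSE=SUB_AGGR STAT=MEAN MARKERS;   *Plot the mean for each group;
      XAXIS LABEL='Level of Aggression Displayed by Model';
      YAXIS LABEL='Mean Number of Aggressive Acts';
   RUN;

Note that the groups will appear in alphabetical order (H, L, M) unless you recode or format
the values of MOD_AGGR.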


Figure 15.7. Mean number of aggressive acts as a function of the level of aggression
displayed by model (nonsignificant F statistic).

Analysis Report for the Aggression Study (Nonsignificant Results)


The results from the preceding analysis could be summarized in the following report. Notice
that some results (such as the results of the Tukey test) are not discussed because the
treatment effect was nonsignificant.
A) Statement of the research question: The purpose of this
study was to determine whether there was a relationship
between (a) the level of aggression displayed by a model and
(b) the number of aggressive acts later demonstrated by
children who observed the model.
B) Statement of the research hypothesis: There will be a
positive relationship between the level of aggression
displayed by a model and the number of aggressive acts later
demonstrated by children who observed the model. Specifically,
it is predicted that (a) children who witness a high level of
aggression will demonstrate a greater number of aggressive
acts than children who witness a moderate or low level of
aggression, and (b) children who witness a moderate level of
aggression will demonstrate a greater number of aggressive
acts than children who witness a low level of aggression.


C) Nature of the variables: This analysis involved one
predictor variable and one criterion variable:
The predictor variable was the level of aggression displayed
by the model. This was a limited-value variable, was
assessed on an ordinal scale, and included three levels:
low, moderate, and high.
The criterion variable was the number of aggressive acts
displayed by the subjects after observing the model. This
was a multi-value variable, and was assessed on a ratio
scale.
D) Statistical test: One-way ANOVA with one between-subjects
factor.

E) Statistical null hypothesis (H0): μ1 = μ2 = μ3; In the
population, there is no difference between subjects in the
low-model-aggression condition, subjects in the moderate-
model-aggression condition, and subjects in the high-model-
aggression condition with respect to their mean scores on the
criterion variable (the number of aggressive acts displayed by
the subjects).

F) Statistical alternative hypothesis (H1): Not all μs are
equal; In the population, there is a difference between at
least two of the following three groups with respect to their
mean scores on the criterion variable: subjects in the low-
model-aggression condition, subjects in the moderate-model-
aggression condition, and subjects in the high-model-
aggression condition.
G) Obtained statistic: F(2, 21) = 1.88

H) Obtained probability (p) value: p = .1778

I) Conclusion regarding the statistical null hypothesis: Fail
to reject the null hypothesis.

J) Multiple comparison procedure: The multiple comparison
procedure was not appropriate because the F statistic for the
ANOVA was nonsignificant.
K) Confidence intervals: Confidence intervals for
differences between the means are presented in Table 15.5.

L) Effect size: R² = .15, indicating that model aggression
accounted for 15% of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These
findings fail to provide support for the study's research
hypothesis.
N) Formal description of the results for a paper: Results
were analyzed using a one-way ANOVA with one between-subjects
factor. This analysis revealed a nonsignificant treatment
effect for level of aggression displayed by the model,
F(2, 21) = 1.88, MSE = 19.27, p = .1778.

On the criterion variable (number of aggressive acts
displayed by subjects), the mean score for the high-model-
aggression condition was 17.13 (SD = 4.29), the mean for the
moderate-model-aggression condition was 14.88 (SD = 4.42), and
the mean for the low-model-aggression condition was 12.88 (SD
= 4.45). The sample means are displayed in Figure 15.7.
Confidence intervals for differences between the means (based
on Tukey's HSD test) are presented in Table 15.5.

In the analysis, R² was computed as .15. This indicated
that model aggression accounted for 15% of the variance in
subject aggression.

O) Figure representing the results: See Figure 15.7.

Conclusion
This chapter has shown how to perform an analysis of variance on data from studies in
which only one independent variable is manipulated. However, researchers in the social
sciences and education often conduct research in which two independent variables are
manipulated simultaneously in a single study. With such investigations, it is usually not
appropriate to perform two separate one-way ANOVAs on the data (one ANOVA for the
first independent variable, and a separate ANOVA for the second independent variable).
Instead, it is usually more appropriate to analyze the data with a different statistical
procedure: a factorial ANOVA. Performing a factorial analysis of variance not only enables
you to determine whether you have significant treatment effects for your two independent
variables; it also enables you to test for an entirely different type of effect: an interaction.
Chapter 16 introduces you to the concept of an interaction, and shows how to use the GLM
procedure to perform a factorial ANOVA with two between-subjects factors.


Chapter 16: Factorial ANOVA with Two Between-Subjects Factors
Introduction.............................................................542
    Overview.............................................................542
Situations Appropriate for Factorial ANOVA with Two
  Between-Subjects Factors...............................................542
    Overview.............................................................542
    Nature of the Predictor and Criterion Variables......................542
    The Type-of-Variable Figure..........................................543
    Example of a Study Providing Data Appropriate for This Procedure.....543
    True Independent Variables versus Subject Variables..................544
    Summary of Assumptions Underlying Factorial ANOVA with Two
      Between-Subjects Factors...........................................545
Using Factorial Designs in Research......................................546
A Different Study Investigating Aggression...............................546
    Overview.............................................................546
    Research Method......................................................547
    The Factorial Design Matrix..........................................548
Understanding Figures That Illustrate the Results of a
  Factorial ANOVA........................................................550
    Overview.............................................................550
    Example of a Figure..................................................550
    Interpreting the Means on the Solid Line.............................551
    Interpreting the Means on the Broken Line............................552
    Summary..............................................................553
Some Possible Results from a Factorial ANOVA.............................553
    Overview.............................................................553
    A Significant Main Effect for Predictor A Only.......................554
    Another Example of a Significant Main Effect for Predictor A Only....556
    A Significant Main Effect for Predictor B Only.......................557
    A Significant Main Effect for Both Predictor Variables...............558
    No Main Effects......................................................559
    A Significant Interaction............................................560
    Interpreting Main Effects When an Interaction Is Significant.........562
    Another Example of a Significant Interaction.........................563
Example of a Factorial ANOVA Revealing Two Significant
  Main Effects and a Nonsignificant Interaction..........................565
    Overview.............................................................565
    Choosing SAS Variable Names and Values to Use in the SAS Program.....565
    Data Set to Be Analyzed..............................................567
    Writing the DATA Step of the SAS Program.............................568
    Data Screening and Testing Assumptions Prior to Performing
      the ANOVA..........................................................570
    Writing the SAS Program to Perform the Two-Way ANOVA.................571
    Log File Produced by the SAS Program.................................574
    Output Produced by the SAS Program...................................575
    Steps in Interpreting the Output.....................................576
    Using a Figure to Illustrate the Results.............................592
    Steps in Preparing the Graph.........................................595
    Interpreting Figure 16.11............................................596
    Preparing Analysis Reports for Factorial ANOVA: Overview.............597
    Analysis Report Concerning the Main Effect for Predictor A
      (Significant Effect)...............................................597
    Notes Regarding the Preceding Analysis Report........................599
    Analysis Report Regarding the Main Effect for Predictor B
      (Significant Effect)...............................................601
    Notes Regarding the Preceding Analysis Report........................603
    Analysis Report Concerning the Interaction (Nonsignificant Effect)...605
    Notes Regarding the Preceding Analysis Report........................607
Example of a Factorial ANOVA Revealing Nonsignificant
  Main Effects and a Nonsignificant Interaction..........................607
    Overview.............................................................607
    The Complete SAS Program.............................................608
    Steps in Interpreting the Output.....................................609
    Using a Figure to Illustrate the Results.............................612
    Interpreting Figure 16.12............................................613
    Analysis Report Concerning the Main Effect for Predictor A
      (Nonsignificant Effect)............................................614
    Analysis Report Concerning the Main Effect for Predictor B
      (Nonsignificant Effect)............................................615
    Analysis Report Concerning the Interaction (Nonsignificant Effect)...616
Example of a Factorial ANOVA Revealing a Significant Interaction.........617
    Overview.............................................................617
    The Complete SAS Program.............................................617
    Steps in Interpreting the Output.....................................618
    Using a Graph to Illustrate the Results..............................620
    Interpreting Figure 16.13............................................621
    Testing for Simple Effects...........................................622
    Analysis Report Concerning the Interaction (Significant Effect)......622
Using the LSMEANS Statement to Analyze Data from
  Unbalanced Designs.....................................................625
    Overview.............................................................625
    Reprise: What Is an Unbalanced Design?...............................625
    Writing the LSMEANS Statements.......................................626
    Output Produced by LSMEANS...........................................627
Learning More about Using SAS for Factorial ANOVA........................627
Conclusion...............................................................628


Introduction
Overview
This chapter shows how to enter data and prepare SAS programs that will perform a two-way
analysis of variance (ANOVA) using the GLM procedure. This chapter focuses on
factorial designs with two between-subjects factors, meaning that each subject is exposed to
only one condition under each independent variable. It discusses the differences between
main effects versus interaction effects in factorial ANOVA. It provides guidelines for
interpreting results that do not indicate a significant interaction, and separate guidelines for
interpreting results that do indicate a significant interaction. It shows how to use multiple
comparison procedures to identify the pairs of groups that are significantly different from
each other, how to request confidence intervals for differences between the means, how to
interpret an index of effect size, and how to prepare a figure that illustrates cell means.
Finally, it shows how to prepare a report that summarizes the results of the analysis.
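As a preview of the syntax, the skeleton of such a program appears below. This is only a
sketch; the data set and variable names are placeholders until the worked example later in
this chapter.

   PROC GLM DATA=AGGRESS2;                              *AGGRESS2 is a placeholder name;
      CLASS MOD_AGGR CONSEQ;                            *The two between-subjects factors;
      MODEL SUB_AGGR = MOD_AGGR CONSEQ MOD_AGGR*CONSEQ; *Main effects plus interaction;
      MEANS MOD_AGGR CONSEQ / TUKEY CLDIFF;             *Multiple comparisons for main effects;
   RUN;
   QUIT;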

Situations Appropriate for Factorial ANOVA with Two Between-Subjects Factors
Overview
Factorial ANOVA is a test of group differences that enables you to determine whether there
are significant differences between two or more groups with respect to their mean scores on
a criterion variable. Furthermore, it enables you to investigate group differences with respect
to two independent variables (or predictor variables) at the same time. In summary, factorial
ANOVA with two between-subjects factors may be used when you wish to investigate the
relationship between (a) two predictor variables (each of which classifies group
membership) and (b) a single criterion variable.
Nature of the Predictor and Criterion Variables
Predictor variables. In factorial ANOVA, the two predictor (or independent) variables are
classification variables, that is, variables that indicate which group a subject is in. They
may be assessed on any scale of measurement (nominal, ordinal, interval, or ratio), but they
serve mainly as classification variables in the analysis.
Criterion variable. The criterion (or dependent) variable is typically a multi-value variable.
It must be a numeric variable that is assessed on either an interval or ratio level of
measurement. The criterion variable must also satisfy a number of additional assumptions,
and these assumptions are summarized in a later section.


The Type-of-Variable Figure


The figure below illustrates the types of variables that are typically being analyzed when
researchers perform a factorial ANOVA with two between-subjects factors.

[Type-of-variable figure: Criterion = Predictors, with a "Multi" symbol for the criterion
variable and two "Lmt" symbols for the two predictor variables.]

The "Multi" symbol that appears in the above figure shows that the criterion variable in a
factorial ANOVA is typically a multi-value variable (a variable that assumes more than six
values in your sample).

The "Lmt" symbols that appear to the right of the equal sign in the above figure show that
the two predictor variables in this procedure are usually limited-value variables (that is,
variables that assume only two to six values).
Example of a Study Providing Data Appropriate for This Procedure
The study. Suppose that you are a physiological psychologist conducting research on
aggression in children. You are interested in two research questions: (a) does consuming
sugar cause children to behave more aggressively?, and (b) are boys more aggressive than
girls? To investigate these questions, you conduct an experiment in which you randomly
assign 90 children to three treatment conditions:

• 30 children are assigned to a "0-gram" condition. These children consume zero grams of
sugar each day over a two-month period.

• 30 children are assigned to a "20-gram" condition. These children consume 20 grams of
sugar each day over a two-month period.

• 30 children are assigned to a "40-gram" condition. These children consume 40 grams of
sugar each day over a two-month period.

You observe the children over the two-month period, and for each child you record the
number of aggressive acts that the child displays each day. At the end of the two-month
period, you determine whether the children in the 40-gram condition displayed a mean
number of aggressive acts that is significantly higher than the mean displayed by the other
two groups. This analysis helps you to determine whether consuming sugar causes children
to behave more aggressively.
In the same study, however, you are also interested in investigating sex differences in
aggression. At the time that subjects were assigned to conditions, you ensured that, within
each of the sugar consumption treatment groups, half of the children were male and half
were female. This means that, in your study, both of the following are true:

• 45 children were in the male group
• 45 children were in the female group.

At the end of the two-month period, you determine whether the male group displays a mean
number of aggressive acts that is significantly different from the mean number displayed by
the female group.
Why these data would be appropriate for this procedure. The preceding study involved
two predictor variables and a single criterion variable. The first predictor variable (Predictor
A) was amount of sugar consumed. You know that this was a limited-value variable,
because it assumed only three values: a 0-gram condition, a 20-gram condition, and a
40-gram condition. This predictor variable was assessed on a ratio scale, since grams of sugar
has equal intervals and a true zero point. However, remember that the predictor variables
that are used in ANOVA may be assessed on any scale of measurement. In general, they are
treated as classification variables in the analysis.
The second predictor variable (Predictor B) was subject sex. You know that this was a
dichotomous variable because it involved only two values: a male group versus a female
group. This variable was assessed on a nominal scale, since it indicates group membership
but does not convey any quantitative information.
Finally, the criterion variable in this study was the number of aggressive acts that were
displayed by the children. You know that this was a multi-value variable if you verified that
the children's scores took on a relatively large number of values (that is, some children
might have displayed zero aggressive acts each day, other children might have displayed 50
aggressive acts each day, and still other children might have displayed a variety of
aggressive acts between these two extremes). Remember that, for our purposes, we label a
variable a multi-value variable if it assumes more than six different values in the sample.
The criterion variable was assessed on a ratio scale. You know this because the number of
aggressive acts has equal intervals and a true zero point.
True Independent Variables versus Subject Variables
Notice that, with the preceding study, one of the predictor variables was a true independent
variable, while the other predictor variable was merely a subject variable. A true
independent variable is a variable that is manipulated and controlled by the researcher so
that it is independent of (uncorrelated with) any other independent variable in the study. In
this study, amount of sugar consumed (Predictor A) was a true independent variable
because it was manipulated and controlled by you, the researcher. You manipulated this
variable by randomly assigning subjects to either the 0-gram condition, the 20-gram
condition, or the 40-gram condition.


In contrast to a true independent variable, a subject variable is a characteristic of the
subject that is not directly manipulated by the researcher, but is used as a predictor variable
in the study. In the preceding study, subject sex (Predictor B) was a subject variable.
know that this is a subject variable because sex is a characteristic of the subject that is not
directly manipulated by the researcher. You know that subject sex is not a true independent
variable because it is not possible to manipulate it in a direct fashion (i.e., it is not possible
to randomly assign half of your subjects to be male, and half to be female). With a subject
variable, you simply note which condition a subject is already in; you do not assign the
subject to that condition. Other examples of subject variables include age, political party,
and race.
When you perform a factorial ANOVA, it is possible to use any combination of true
independent variables and subject variables. That is, you can perform an analysis in which

• both predictor variables are true independent variables
• both predictor variables are subject variables
• one predictor is a true independent variable and the other predictor is a subject variable.

Summary of Assumptions Underlying Factorial ANOVA with Two Between-Subjects Factors

• Level of measurement. The criterion variable must be a numeric variable that is assessed
on an interval or ratio level of measurement. The predictor variables may be assessed on
any level of measurement, although they are essentially treated as nominal-level
(classification) variables in the analysis.

• Independent observations. An observation should not be dependent on any other
observation in any cell (the meaning of the term "cell" will be explained in a later section).
In practical terms, this means that each subject is exposed to only one condition under
each predictor variable, and that subject matching procedures are not used.

• Random sampling. Scores on the criterion variable should represent a random sample
that is drawn from the populations of interest.

• Normal distributions. Each cell should be drawn from a normally distributed
population. If each cell contains over 30 subjects, the test is robust against moderate
departures from normality (in this context, "robust" means that the test will still provide
accurate results as long as violations of the assumptions are not large). You should analyze
your data with PROC UNIVARIATE using the NORMAL option to determine whether
your data meet this assumption (a sketch follows this list). Remember that the significance
tests for normality provided by PROC UNIVARIATE tend to be fairly sensitive when
samples are large.

• Homogeneity of variance. The populations represented by the various cells should have
equal variances on the criterion. If the number of subjects in the largest cell is no more
than 1.5 times greater than the number of subjects in the smallest cell, the test is robust
against moderate violations of the homogeneity assumption.
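Here is a minimal sketch of the normality check mentioned above, run once per cell of the
design. The data set and variable names are placeholders consistent with the sketches
elsewhere in this chapter.

   PROC SORT DATA=AGGRESS2;
      BY MOD_AGGR CONSEQ;            *One BY group per cell of the design;
   RUN;

   PROC UNIVARIATE DATA=AGGRESS2 NORMAL;
      BY MOD_AGGR CONSEQ;
      VAR SUB_AGGR;                  *Criterion variable;
   RUN;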


Using Factorial Designs in Research


Chapter 15, One-Way ANOVA with One Between-Subjects Factor, described a simple
experiment in which you manipulated a single independent variable: the level of aggression
that was displayed by a model. Because there was a single independent variable in that
study, it was analyzed using a one-way ANOVA.
But suppose that there are two independent variables that you wish to manipulate. In this
situation, you might think that it would be necessary to conduct two separate experiments,
one for each independent variable. But you would be wrong: In many cases, it will be
possible (and preferable) to manipulate both independent variables in a single study.
The research design that is used in these studies is called a factorial design. In a factorial
design, two or more independent variables are manipulated in a single study so that the
treatment conditions represent all possible combinations of the various levels of the
independent variables.
In theory, a factorial design might include any number of independent variables. In practice,
however, it often becomes impractical to use more than three or four. This chapter
illustrates factorial designs that include only two independent variables, and such designs
can be analyzed using a two-way ANOVA.

A Different Study Investigating Aggression


Overview
To illustrate the concept of factorial design, imagine that you are interested in conducting a
different type of study that investigates aggression in nursery-school children. You want to
test the following two research hypotheses:

• Hypothesis A: There will be a positive relationship between the level of aggression
displayed by a model and the number of aggressive acts later demonstrated by children
who observe the model. Specifically, it is predicted that (a) children who witness a high
level of aggression will demonstrate a greater number of aggressive acts than children who
witness a moderate or low level of aggression, and (b) children who witness a moderate
level of aggression will demonstrate a greater number of aggressive acts than children who
witness a low level of aggression.

• Hypothesis B: Children who observe a model being rewarded for engaging in aggressive
behavior will later demonstrate a greater number of aggressive acts, compared to children
who observe a model being punished for engaging in aggressive behavior.


You perform a single investigation to test these two hypotheses. In this investigation, you
will simultaneously manipulate two independent variables. One of the independent variables
will be relevant to Hypothesis A, and the other independent variable will be relevant to
Hypothesis B. The following sections describe the research method in more detail.
Note: Although the study and results presented here are fictitious, they are inspired by the
actual studies reported by Bandura (1965, 1977).
Research Method
Overview. Suppose that you conduct a study in which 30 nursery-school children serve as
subjects. The study is conducted in two stages. In Stage 1, you show a short videotape to
your subjects. You manipulate the two independent variables by varying what the children
see in this videotape. In Stage 2, you assess the dependent variable (the amount of
aggression displayed by the children) to determine whether it has been affected by the
independent variables.
The following sections refer to your independent variables as Predictor A and Predictor
B rather than Independent Variable A and Independent Variable B. This is because the
term predictor variable is more general, and is appropriate regardless of whether your
variable is a true manipulated independent variable (as in the present case), or a
nonmanipulated subject variable (such as subject sex).
Stage 1: Manipulating Predictor A. Predictor A in your study is the level of aggression
displayed by the model or, more concisely, model aggression. You manipulate this
independent variable by randomly assigning each child to one of three treatment conditions:

• Ten children are assigned to the "low" condition. When the subjects in this group watch
the videotape, they see a model demonstrate a relatively low level of aggressive behavior.
Specifically, they see a model (an adult female) enter a room that contains a wide variety
of toys. For 90% of the tape, the model engages in nonaggressive play (e.g., playing with
building blocks). For 10% of the tape, the model engages in aggressive play (e.g.,
violently punching an inflatable bobo doll).

• Another 10 children are assigned to the "moderate" condition. They watch a videotape of
the same model in the same playroom, but they observe the model displaying a somewhat
higher level of aggressive behavior. Specifically, in this version of the tape, the model
engages in nonaggressive play (again, playing with building blocks) 50% of the time, and
engages in aggressive play (again, punching the bobo doll) 50% of the time.

• Finally, the last 10 children are assigned to the "high" condition. They watch a videotape
of the same model in the same playroom, but in this version the model engages in
nonaggressive play 10% of the time, and engages in aggressive play 90% of the time.

Stage 1 continued: Manipulating Predictor B. Predictor B is the "consequences for the
model." You manipulate this independent variable by randomly assigning each child to one
of two treatment conditions:


• Fifteen children are assigned to the "model rewarded" condition. Toward the end of the
videotape (described above), children in this group see the model rewarded for her
behavior. Specifically, the videotape shows another adult who enters the room with the
model, praises her, and gives her cookies.

• The other 15 children are assigned to the "model punished" condition. Toward the end of
the same videotape, children in this group see the model punished for her behavior:
Another adult enters the room with the model, scolds her, shakes her finger at her, and
puts her in "time out."

Stage 2: Assessing the criterion variable. This chapter will refer to the dependent
variable in the study as a criterion variable. Again, this is because the term criterion
variable is a more general term that is appropriate regardless of whether your study is a true
experiment (as in the present case), or is a nonexperimental investigation.
The criterion variable in this study is the number of aggressive acts displayed by the
subjects or, more concisely, subject aggressive acts. The purpose of your study was to
determine whether certain manipulations in your videotape caused some groups of children
to behave more aggressively than others. To assess this, you allowed each child to engage in
a free play period immediately after viewing the videotape. Specifically, each child was
individually escorted to a playroom similar to the one shown in the tape. This playroom
contained a large assortment of toys, some of which were appropriate for nonaggressive play
(e.g., building blocks), and some of which were appropriate for aggressive play (e.g., an
inflatable bobo doll identical to the one in the tape). The children were told that they could
do whatever they liked in the play room, and were then left to play alone.
Outside of the playroom, three observers watched the child through a one-way mirror. They
recorded the total number of aggressive acts the child displayed during a 20-minute period
in the playroom (an aggressive act could be an instance in which the child punches the
bobo doll, throws a building block, and so on). Therefore, the criterion variable in your
study is the total number of aggressive acts demonstrated by each child during this period.
The Factorial Design Matrix
The factorial design of this study is illustrated in Figure 16.1. You can see that this design is
represented by a matrix that consists of two rows and three columns.


Figure 16.1. Factorial design used in the aggression study.

The columns of the matrix. When an experimental design is represented in a matrix such
as this, it is easiest to understand if you focus on only one aspect of the matrix at a time. For
example, first consider just the three columns in Figure 16.1. The three columns are headed
Predictor A: Level of Aggression Displayed by Model, and these columns represent the
various levels of the model aggression independent variable. The first column represents
the 10 subjects in level A1 (the children who saw a videotape in which the model displayed
a low level of aggression), the second column represents the 10 subjects in level A2 (the
children who saw the model display a moderate level of aggression), and the last column
represent the 10 subjects in level A3 (the children who saw the model display a high level of
aggression).
The rows of the matrix. Now consider just the two rows in Figure 16.1. These rows are
headed Predictor B: Consequences for Model. The first row is headed Level B1: Model
Rewarded, and this row represents the 15 children who saw the model rewarded for her
behavior. The second row is headed Level B2: Model Punished, and represents the 15
children who saw the model punished for her behavior.
The r × c design. It is common to refer to a factorial design as an "r × c" design, in which
r represents the number of rows in the matrix, and c represents the number of columns.
The present study is an example of a 2 × 3 factorial design because it has two rows and three
columns. If it included four levels of model aggression rather than three, it would be
referred to as a 2 × 4 factorial design.
The cells of the matrix. You can see that this matrix consists of six cells. A cell is a
location in the matrix where the column for one predictor variable intersects with the row for
a second predictor variable. For example, look at the cell where column A1 (low level of
model aggression) intersects with row B1 (model rewarded). The entry "5 Subjects"
appears in this cell, which means that there were five children who experienced this
particular combination of treatments under the two predictor variables. More specifically,


it means that there were five subjects who both (a) saw the model engage in a low level of
aggression, and (b) saw the model rewarded for her behavior.
Now look at the cell in which column A2 (moderate level of model aggression) intersects
with row B2 (model punished). Again, the cell contains the entry "5 Subjects," which means that
there was a different group of five children who experienced the treatments of (a) seeing the
model display a moderate level of aggression and (b) seeing the model punished for her
behavior. In the same way, you can see that there was a separate group of five children
assigned to each of the six cells of the matrix.
Earlier, it was said that a factorial design involves two or more independent variables being
manipulated so that the treatment conditions represent all possible combinations of the
various levels of the independent variables. The cells of Figure 16.1 illustrate this concept.
You can see that the six cells of the figure represent every possible combination of (a) level
of aggression displayed by the model and (b) consequences for the model. This means that,
for the children who saw a low level of model aggression, half of them saw the model
rewarded, and the other half saw the model punished. The same is true for the children who
saw a moderate level of model aggression, as well as for the children who saw a high level
of model aggression.
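To make the design concrete in SAS terms, each child's record needs a value for both
classification variables plus the criterion score. The following DATA step is a hypothetical
sketch: the data set name BOBO and the value codes are assumptions, there is one
illustrative line per cell rather than the five subjects per cell described above, and the scores
echo the approximate cell means that are read from Figure 16.2 later in this chapter.

   DATA BOBO;                               *BOBO is an assumed data set name;
      INPUT MOD_AGGR $ CONSEQ $ SUB_AGGR;   *Predictor A, Predictor B, criterion;
      DATALINES;
   L REW 13
   M REW 18
   H REW 24
   L PUN  2
   M PUN  7
   H PUN 13
   ;
   RUN;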

Understanding Figures That Illustrate the Results of a Factorial ANOVA
Overview
Factorial designs are popular in research for a variety of reasons. One reason is that they
allow you to test for several different types of effects in a single investigation. The types of
effects that may be produced from a factorial study will be discussed in the next section.
However, it is important to note that this advantage has a corresponding drawback: Because
they involve different types of effects, factorial designs sometimes produce results that can
be difficult to interpret, compared to the simpler results that are produced in a one-way
ANOVA. Fortunately, however, this task of interpretation can be made much easier if you
first prepare a figure that plots the results of the factorial study. This section shows how to
interpret these figures.
Example of a Figure
Figure 16.2 presents one type of figure that is often used to illustrate the results of a factorial
study. Notice that, with this figure, scores on the criterion variable (Subject Aggressive
Acts) are plotted on the vertical axis.


Figure 16.2. Example of one type of figure that is often used to illustrate the
results from a factorial ANOVA.

The three levels of Predictor A (level of aggression displayed by model) are plotted on the
horizontal axis. The first point on this axis is labeled "Low," and represents group A1 (the
children who watched the model display a low level of aggression). The middle point is
labeled "Moderate," and represents group A2 (the children who watched the model display
a moderate level of aggression). The point at the right is labeled "High," and represents
group A3 (the children who watched the model display a high level of aggression).
The two levels of Predictor B (consequences for the model) are identified by drawing two
different lines in the body of the figure itself. Specifically, the mean aggression scores
displayed by children who saw the model rewarded (level B1) are illustrated with small
circles connected by a solid line, while the mean aggression scores displayed by the children
who saw the model punished (level B2) are displayed by small triangles connected by a
broken line.
Interpreting the Means on the Solid Line
You will remember that your investigation involved six groups of subjects, corresponding to
the six cells of the factorial design matrix described earlier. Figure 16.2 illustrates the mean
score for each of these six groups. To read these means, begin by focusing on just the solid
line with circles. This line provides means for the subjects who saw the model rewarded.
First, find the circle that appears above the label "Low" on the figure's horizontal axis. This


circle represents the mean aggression score for the five children who (a) saw the model
display a low level of aggression, and (b) saw the model rewarded. Look to the left of this
circle to find the mean score for this group on the Subject Aggressive Acts axis. The circle
for this group is found at about 13 on this axis. This means that the five children in this
group displayed an average of approximately 13 aggressive acts in the playroom after they
watched the videotape.
Now find the next circle on the solid line, above the label Moderate on the horizontal axis.
This circle represents the five children in the group that (a) saw the model display a
moderate level of aggression, and (b) saw the model rewarded. Looking to the vertical axis
on the left, you can see that this group displayed a mean score of about 18. This means that
the five children in this group displayed an average of approximately 18 aggressive acts in
the playroom after they watched the videotape.
Finally, find the circle on the solid line above the label High on the horizontal axis. This
circle represents the five children in the group who (a) saw the model display a high level of
aggression, and (b) saw the model rewarded. Looking to the vertical axis on the left, you can
see that this group displayed a mean score of 24, meaning that they engaged in an average of
about 24 aggressive acts in the playroom after watching the videotape.
These three circles are all connected by a single solid line, indicating that all of these
subjects were in the same condition under Predictor B: the model-rewarded condition.
Interpreting the Means on the Broken Line
Next you will find the mean scores for the subjects in the other condition under Predictor B:
the children who saw the model punished. To do this, focus on the broken line with
triangles. First, find the triangle that appears above the label "Low" on the figure's
horizontal axis. This triangle provides the mean aggression score for the five children who
(a) saw the model display a low level of aggression, and (b) saw the model punished. Look
to the left of this triangle to find the mean score for this group on the Subject Aggressive
Acts axis. The triangle for this group is at about 2, which means that the five children in this group
displayed an average of approximately 2 aggressive acts in the playroom after they watched
the videotape. Repeating this process for the two other triangles on the broken line shows
the following:

• The five children who (a) saw the model display a moderate level of aggression, and (b)
saw the model punished displayed an average of approximately 7 aggressive acts in the
playroom.

• The five children who (a) saw the model display a high level of aggression, and (b) saw
the model punished displayed an average of approximately 13 aggressive acts in the
playroom.

These three triangles are all connected by a single broken line, indicating that all of these
subjects were in the same condition under Predictor B: the model-punished condition.


Summary
The important points to remember when interpreting the graphs in this chapter are as
follows:

• The possible scores on the criterion variable are represented on the vertical axis.

• The three levels of Predictor A are represented as three different points on the horizontal
axis.

• The two levels of Predictor B are represented by drawing two different lines within the
graph.

Now you are ready to learn about the different types of effects that are observed in factorial
designs, and how these effects appear when they are plotted in this type of graph.

Some Possible Results from a Factorial ANOVA


Overview
When a predictor variable (or independent variable) in a factorial design displays a
significant main effect, it means that, in the population, there is a difference between at least
two of the levels of that predictor variable with respect to mean scores on the criterion
variable. In a one-way analysis of variance, only one main effect is possible: the main effect
for the studys one independent variable. However, in a factorial design, there will be one
main effect possible for each predictor variable included in the study. Because the present
study involves two predictor variables, two types of main effects are possible:

• a main effect for Predictor A
• a main effect for Predictor B.

However, a factorial ANOVA can also produce an entirely different type of effect that is
not possible with a one-way ANOVA: it can reveal a significant interaction between
Predictor A and Predictor B. When an interaction is significant, it means that the
relationship between one predictor variable and the criterion variable is different at different
levels of the second predictor variable (a later section will discuss interactions in more
detail).
The following sections show how main effects and interactions might appear when plotted
in a graph.


A Significant Main Effect for Predictor A Only

Figure 16.3 shows one example of how a graph may appear when there is

• A significant main effect for Predictor A. Predictor A was the level of aggression
displayed by the model: low versus moderate versus high.

• A nonsignificant main effect for Predictor B. Predictor B was consequences for the model:
model-rewarded versus model-punished.

• A nonsignificant effect for the interaction.

In other words, Figure 16.3 shows a situation in which the only significant effect is a main
effect for Predictor A.

Figure 16.3. A significant main effect for Predictor A (level of aggression displayed by
the model); nonsignificant main effect for Predictor B; nonsignificant interaction.

Interpreting the graph. Figure 16.3 shows that a relatively low level of aggression was
displayed by subjects in the low condition of Predictor A (the level of aggression
displayed by the model). When you look above the label Low on the horizontal axis, you
can see that both the children in the model-rewarded group (represented with a small circle)
as well as the children in the model-punished group (represented with a small triangle)
display relatively low scores on aggression (the two groups demonstrated a mean of
approximately 6 aggressive acts in the playroom). However, a somewhat higher level of


aggression was demonstrated by subjects in the moderate-model-aggression condition:
When you look above the label "Moderate" on the horizontal axis, you can see that the children in
this condition displayed an average of about 12 aggressive acts. Finally, an even higher level
of aggression was displayed by subjects in the high-model-aggression condition: When you
look above the label High, you can see that the children in this condition displayed an
average of approximately 18 aggressive acts. In short, this trend shows that there was a
main effect for the model aggression variable. Figure 16.3 shows that, the greater the
amount of aggression displayed by the model in the videotape, the greater the number of
aggressive acts subsequently displayed by the children when they were in the playroom.
Characteristics of main effect for Predictor A when graphed. This leads to an important
point: When a figure representing the results of a factorial study displays a significant main
effect for Predictor A, it will demonstrate both of the following characteristics:

• Corresponding line segments are parallel.
• At least one set of corresponding line segments displays a relatively steep angle (or slope).

First, you need to understand that corresponding line segments refers to line segments that
(a) run from one point on the horizontal axis for Predictor A to the next point on the same
axis, and (b) appear immediately above and below each other in a figure. For example, the
solid line and the broken line that run from Low to Moderate in Figure 16.3 are
corresponding line segments. Similarly, the solid line and the broken line that go from
Moderate to High in Figure 16.3 are also corresponding line segments.
The first of the two preceding conditions (that the lines should be parallel) conveys that
the two predictor variables are not involved in an interaction. This is important because you
typically will not interpret a main effect for a predictor variable if that predictor variable is
involved in a significant interaction (the meaning of interaction will be discussed later in the
section "A Significant Interaction"). In Figure 16.3, you can see that the lines for the two
conditions under Predictor B (the solid line and the broken line) are parallel to one another.
This suggests that there probably is not an interaction between Predictor A (level of
aggression displayed by the model) and Predictor B (consequences for the model) in the
present study.

The second condition (that at least one set of corresponding line segments should display a
relatively steep angle) can be understood by again referring to Figure 16.3. Notice that the
segment that begins at "Low" (the low-model-aggression condition) and extends to
"Moderate" (the moderate-model-aggression condition) is not horizontal; it displays an
upward angle, a positive slope. Obviously, this is because the aggression scores for the
moderate-model-aggression group were higher than the aggression scores for the low-model-
aggression group. When you obtain a significant effect for Predictor A in your
study, you should expect to see this type of angle. Similarly, you can see that the line
segment that begins at "Moderate" and continues to "High" also displays an upward angle,
also consistent with a significant effect for the model aggression variable.


Remember that these guidelines are merely intended to help you understand what a main
effect looks like when it is plotted in a graph such as Figure 16.3. To determine whether this
main effect is statistically significant, it will of course be necessary to review the results of
the analysis of variance, to be discussed below.
Another Example of a Significant Main Effect for Predictor A Only
Figure 16.4 shows another example of a significant main effect for the model aggression
factor. You know that this figure illustrates a main effect for Predictor A, because both of
the following are true:

• the corresponding line segments are all parallel
• one set of corresponding line segments displays a relatively steep angle.

Figure 16.4. Another example of a significant main effect for Predictor A (level of
aggression displayed by the model); nonsignificant main effect for Predictor B;
nonsignificant interaction.

Notice that the solid line and the broken line that run from Low to Moderate are parallel
to each other. In addition, the solid line and the broken line that run from Moderate to
High are also parallel. This tells you that there is probably not a significant interaction
between Predictor A and Predictor B. Where an interaction is concerned, it is irrelevant that
the lines show an upward angle from Low to Moderate and then become level from
Moderate to High. The important point is that the corresponding line segments are


parallel to each other. Think of these lines as being like the rails of a railroad track that
twists and curves along the landscape: As long as the two corresponding rails are always
parallel to each other (regardless of how much they slope), the interaction is probably not
significant.
You know that Figure 16.4 illustrates a significant main effect for Predictor A because the
lines demonstrate a relatively steep angle as they run from Low to Moderate. This tells
you that the children who observed a moderate level of model aggression displayed a higher
number of aggressive acts after viewing the videotape, compared to the children who
observed a low level of model aggression. It is this difference that tells you that you
probably have a significant main effect for Predictor A.
You can see from Figure 16.4 that the lines do not demonstrate a relatively steep slope as
they run from Moderate to High. This tells you that there was probably not a significant
difference between the children who watched a moderate level of model aggression versus
those who observed a high level. But this does not change the fact that you still have a
significant effect for Predictor A. When a predictor variable contains three or more
conditions, that predictor variable will display a significant main effect if at least two of the
conditions are markedly different from each other.
A Significant Main Effect for Predictor B Only
How Predictor B is represented. You would expect to see a different type of pattern in a
graph if the main effect for the other predictor variable (Predictor B) were significant.
Earlier, you learned that Predictor A was represented in a graph by plotting three points on
the horizontal axis. In contrast, you learned that Predictor B was represented by drawing
different lines within the body of the graph: one line for each level of Predictor B. In the
present study, Predictor B was the consequences for the model variable: A solid line was
used to represent mean scores from the children who saw the model rewarded, and a broken
line was used to represent mean scores from the children who saw the model punished.
Characteristics of a main effect for Predictor B when graphed. When Predictor B is
represented in a figure by plotting separate lines for its various levels, a significant main
effect for Predictor B is revealed when the figure displays both of the following
characteristics:

• Corresponding line segments are parallel.
• At least two of the lines are relatively separated from each other.

Interpreting the graph. For example, a main effect for Predictor B in the current study is
represented by Figure 16.5. Consistent with the two preceding points, the two lines in
Figure 16.5 (a) are parallel to one another (indicating that there is probably no interaction),
and (b) are separated from one another.


Figure 16.5. A significant main effect for Predictor B (consequences for the
model); nonsignificant main effect for Predictor A, nonsignificant interaction.

Regarding the separation between the lines: Notice that, in general, the children in the
"model rewarded" condition tended to demonstrate a higher number of aggressive acts after
viewing the videotape, compared to the children in the "model punished" condition. This is
the general trend that you would expect, given the assumptions of social learning theory
(Bandura, 1977).
Notice that neither the solid line nor the broken line shows much of an angle, or slope. This
indicates that there was probably not a main effect for Predictor A (level of aggression
displayed by the model).
A Significant Main Effect for Both Predictor Variables
It is possible to obtain significant effects for both Predictor A and Predictor B in the same
investigation. When there is a significant effect for both predictor variables, you should see
all of the following:

• Corresponding line segments are parallel (indicating no interaction).
• At least one set of corresponding line segments displays a relatively steep angle
(indicating a main effect for Predictor A).
• At least two of the lines are relatively separated from each other (indicating a significant
main effect for Predictor B).


Figure 16.6 shows what the graph might look like under these circumstances.

Figure 16.6. Significant main effects for both Predictor A (level of aggression
displayed by the model) and Predictor B (consequences for the model);
nonsignificant interaction.

From Figure 16.6, you can see that the broken line and the solid line are parallel, indicating
no interaction. Both lines display an upward angle, indicating a significant effect for
Predictor A (level of aggression displayed by the model): The children in the "high"
condition were more aggressive than the children in the "moderate" condition, who in turn
were more aggressive than the children in the "low" condition. Finally, the solid line is
higher than the broken line, indicating a significant effect for Predictor B (consequences for
the model): The children who saw the model rewarded tended to be more aggressive than
the children who saw the model punished.
No Main Effects
Figure 16.7 shows what a graph might look like if there were no main effects for either
Predictor A or Predictor B. Notice that the lines are parallel (indicating no interaction), none
of the line segments display a relatively steep angle (indicating no main effect for Predictor
A), and the lines are not separated (indicating no main effect for Predictor B).


Figure 16.7. Nonsignificant main effects and a nonsignificant interaction.

A Significant Interaction
Overview. An earlier section indicated that, when you perform a two-way ANOVA, there
are three types of effects that may be observed: (a) a main effect for Predictor A, (b) a main
effect for Predictor B, and (c) an interaction between Predictor A and Predictor B. This
section provides definitions for the concept of "interaction," shows what an interaction
might look like when plotted on a graph, and addresses the issue of whether main effects
should typically be interpreted when an interaction is significant.
Definitions for interaction. The concept of an interaction can be defined in several ways.
For example, with respect to experimental research (in which you are actually manipulating
true independent variables), the following definition can be used:

An interaction is a condition in which the effect of one independent variable on the
dependent variable is different at different levels of the second independent variable.

On the other hand, for nonexperimental research (in which you are simply measuring
naturally-occurring variables rather than manipulating true independent variables), the
concept of interaction can be defined in this way:

An interaction is a condition in which the relationship between one predictor variable and
the criterion variable is different at different levels of the second predictor variable.


Characteristics of an interaction when graphed. These definitions are abstract and
somewhat difficult to understand at first reading. However, the concept of interaction is
much easier to understand by seeing an interaction illustrated in a graph. A graph indicates
that there is probably an interaction between two predictor variables when it displays the
following characteristic:
• At least one set of corresponding line segments is not parallel.

An interaction illustrated in a graph. For example, Figure 16.8 displays a significant
interaction between Predictor A and Predictor B in the present study. Notice that the solid
line and the broken line are no longer parallel: The line representing the children who saw
the model rewarded now displays a fairly steep angle, while the line for the children who
saw the model punished is relatively flat. This is the key characteristic of a figure that
displays a significant interaction: lines that are not parallel.

Figure 16.8. A significant interaction between Predictor A (level of aggression
displayed by the model) and Predictor B (consequences for the model).

Notice how the relationships depicted in Figure 16.8 are consistent with the definition for
interaction provided above: The relationship between one predictor variable (level of
aggression displayed by the model) and the criterion variable (subject aggression) is
different at different levels of the second predictor variable (consequences for the model).
More specifically, the figure shows that there is a relatively strong, positive relationship
between Predictor A (model aggression) and the criterion variable (subject aggression) for
the children in the "model rewarded" level of Predictor B. For the children in this group, the
more aggression that they saw from the model, the more aggressively they (the children)
behaved when they were in the playroom. The children who saw a high level of model
aggression displayed an average of about 23 aggressive acts, the children who saw a
moderate level of model aggression displayed an average of about 17 aggressive acts, and
the children who saw a low level of model aggression displayed an average of about 9
aggressive acts.
In contrast, notice that Predictor A (level of aggression displayed by the model) had
essentially no effect on the children in the "model punished" level of Predictor B. The
children in this condition are represented by the broken line in Figure 16.8. You can see that
this broken line is relatively flat: The children in the "high," "moderate," and "low" groups
all displayed about the same number of aggressive acts when they were in the playroom
(each group displayed about 7 aggressive acts). The fact that there were no differences
between the three conditions means that Predictor A had no effect on the children in the
model-punished condition.
Interpreting the interaction in the figure. If you conducted the study described here and
actually obtained the interaction that is illustrated in Figure 16.8, what would the results
mean with respect to the effects of your two predictor variables? These results suggest that
the level of aggression displayed by a model can have an effect on the level of aggression
later displayed by the subjects, but only if the subjects see the model being rewarded for her
aggressive behavior. However, if the subjects instead see the model being punished, then the
level of aggression displayed by the model has no effect.
Two caveats regarding this interpretation: First, remember that these results, like most of
the results presented in this book, are fictitious, and were provided only to illustrate
statistical concepts. They do not necessarily represent what researchers have discovered
when conducting research on aggression. Second, remember that the interaction illustrated
in Figure 16.8 is just one example of what an interaction might look like when plotted.
When you perform a two-way ANOVA, a significant interaction might take on any of an
infinite variety of forms. These different types of interactions will have one characteristic in
common: they will all involve corresponding line segments that are not parallel. A later
section will show a different example of how an interaction from the present study might
appear.
Interpreting Main Effects When an Interaction Is Significant
The problem. When you perform a two-way ANOVA, it is possible that you will find that
(a) the interaction term is statistically significant, and (b) one or both of the main effects are
also statistically significant. When you prepare a report summarizing the results, you will
certainly discuss the nature of your significant interaction. But is it also acceptable to discuss
and interpret the main effects that were significant?
There is some disagreement among statisticians in answering this question. Some
statisticians argue that, if the interaction is significant, you should not interpret the main
effects at all, even if they are significant. Others take a less extreme approach. They say that
it is acceptable to interpret significant main effects, as long as the primary interpretation of
the results focuses on the interaction (if it is significant).
Example. The interaction illustrated in Figure 16.8 provides a good example of why you
must be very cautious in interpreting main effects when the predictor variables are involved
in an interaction. To understand why, consider this: Based on the results presented in the
figure, would it make sense to begin your discussion of the results by saying that there is a
main effect for Predictor A (level of aggression displayed by the model)? Probably not: to
simply say that there is a main effect for Predictor A would be somewhat misleading. It is
clear that the level of aggression displayed by the model does seem to have an effect on
aggression among children who see the model rewarded, but the graph suggests that the
level of aggression displayed by the model probably does not have any real effect on
aggression among children who see the model punished. To simply say that there is a main
effect for model aggression might mislead readers into believing that exposure to aggressive
models is likely to cause increased subject aggression under any circumstances (which it
apparently does not).
Recommendations. If it is questionable to discuss main effects under these circumstances,
then how should you present your results when you have a significant interaction along with
significant main effects? In situations such as this, it often makes more sense to do the
following:

• Note that there was a significant interaction between the two predictor variables. Your
interpretation of the results should be based primarily upon this interaction.
• Prepare a figure (like Figure 16.8) that illustrates the nature of the interaction.
• If appropriate, further investigate the nature of the interaction by testing for simple effects
(a later section explains this concept of simple effects).
• If the main effect for a predictor variable was significant, interpret this main effect very
cautiously. Remind the reader that the predictor variable was involved in an interaction,
and explain how the effect of this predictor variable was different for different groups of
subjects.

Another Example of a Significant Interaction
An earlier section pointed out that an interaction can assume an almost infinite variety of
forms. Figure 16.9 illustrates a different type of interaction for the current study.


Figure 16.9. Another example of a significant interaction between Predictor A
(level of aggression displayed by the model) and Predictor B (consequences
for the model).

How would you know that these results constitute an interaction? Because one of the sets of
corresponding line segments in this figure contains lines that are not parallel. Specifically,
you can see that the solid line running from "Moderate" to "High" displays a fairly steep
angle, while the broken line running from "Moderate" to "High" does not. Obviously, these
line segments are not parallel, and that means that you have an interaction.
It is true that another set of line segments in the figure is parallel (i.e., the solid line and the
broken line running from "Low" to "Moderate" are parallel), but this is irrelevant. As long
as at least one set of corresponding line segments displays markedly different angles, the
interaction is probably significant.
As was noted before, the purpose of this section was simply to show you how main effects
and interactions might appear when plotted in a graph. Obviously, you do not conclude that
a main effect or interaction is significant by simply viewing a graph; instead, this is done by
performing the appropriate statistical analysis using the SAS System. The remainder of this
chapter shows how to perform these analyses.
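If you would like to produce a quick plot of this kind yourself, the following sketch shows
one possible approach. It is not part of the original program; it assumes the data set D1 and
the variables CONSEQ, MOD_AGGR, and SUB_AGGR that are introduced later in this
chapter, and it assumes that SAS/GRAPH is available for PROC GPLOT:

* Compute the mean of SUB_AGGR for each of the six cells;
PROC MEANS DATA=D1 NOPRINT NWAY;
   CLASS CONSEQ MOD_AGGR;
   VAR SUB_AGGR;
   OUTPUT OUT=CELLMEAN MEAN=MEAN_AGG;
RUN;
* Plot one line per level of CONSEQ across the levels of MOD_AGGR;
AXIS1 ORDER=('L' 'M' 'H');    * put Low, Moderate, High in logical order;
SYMBOL1 I=JOIN V=DOT L=1;     * solid line for the first group;
SYMBOL2 I=JOIN V=DOT L=2;     * dashed line for the second group;
PROC GPLOT DATA=CELLMEAN;
   PLOT MEAN_AGG*MOD_AGGR=CONSEQ / HAXIS=AXIS1;
RUN;
QUIT;

Because MOD_AGGR is a character variable, the AXIS1 statement is used here to keep the
horizontal axis in the order low-moderate-high rather than in alphabetical order.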


Example of a Factorial ANOVA Revealing Two Significant Main Effects and a
Nonsignificant Interaction
Overview
The steps that you follow in performing a factorial ANOVA will vary depending on whether
the interaction is significant: If the interaction is significant, you will follow one set of
steps, and if the interaction is nonsignificant you will follow a different set of steps. This
section illustrates an analysis that results in a nonsignificant interaction, along with two
significant main effects. It shows you how to prepare the SAS program, how to interpret the
SAS output, and how to write up the results. These procedures are illustrated by analyzing
fictitious data from the aggression study described previously. In these analyses, Predictor A
is the level of aggression displayed by the model, Predictor B is the consequences for the
model, and the criterion variable is the number of aggressive acts displayed by the children
after viewing the videotape.
Choosing SAS Variable Names and Values to Use in the SAS Program
Overview. Before you write a SAS program to perform a factorial ANOVA, you might find
it helpful to prepare a table similar to that in Figure 16.10. This table will help you choose
(a) meaningful SAS variable names for the variables in the analysis, (b) values to represent
the different levels under the predictor variables, and (c) cell designators for the cells that
constitute the factorial design matrix. If you carefully choose meaningful variable names and
values now, you will find it easier to interpret your SAS output later.

Figure 16.10. Variable names and values to be used in the SAS program for
the aggression study.


SAS variable name for Predictor A. You can see that Figure 16.10 is very similar to
Figure 16.1, except that variable names and values have now been added. For example,
Figure 16.10 again shows that Predictor A in your study is the "Level of Aggression
Displayed by Model." Below this heading (within parentheses) is "MOD_AGGR," which
will serve as the SAS variable name for Predictor A in your SAS program (MOD_AGGR
stands for "model aggression"). Obviously, you can choose any SAS variable name that is
meaningful and that complies with the rules for SAS variable names.
Values to represent conditions under Predictor A. Below the heading for Predictor A are
the names of the three conditions for this predictor variable: "Low," "Moderate," and
"High." Below these headings for the three conditions (within parentheses) are the values
that you will use to represent these conditions in your SAS program. You will use the value
"L" to represent children in the low-model-aggression condition, the value "M" to represent
children in the moderate-model-aggression condition, and the value "H" to represent
children in the high-model-aggression condition. Choosing meaningful letters such as L, M,
and H will make it easier to interpret your SAS output later.
SAS variable name for Predictor B. Figure 16.10 shows that Predictor B in your study is
"Consequences for Model." Below this heading (within parentheses) is "CONSEQ," which
will serve as the SAS variable name for Predictor B in your SAS program (CONSEQ
stands for "consequences").
Values to represent conditions under Predictor B. To the right of this heading are the
names for the treatment conditions under Predictor B, along with the values that will
represent these conditions in your SAS program. You will use the value "MR" to represent
children in the "Model Rewarded" condition, and "MP" to represent children in the "Model
Punished" condition.
Cell designators. Each cell in Figure 16.10 contains a cell designator that indicates the
condition a child was assigned to under both predictor variables. For example, the upper left
cell has the designator "Cell MR-L." The "MR" tells you that this group of children were in
the model-rewarded condition under Predictor B, and the "L" tells you that they were in
the low condition under Predictor A. Now consider the cell in the middle column on the
bottom row of the figure. This cell has the designator "Cell MP-M." Here, the "MP" tells
you that this group of children were in the model-punished condition under Predictor B,
and the "M" tells you that they were in the moderate condition under Predictor A.
When you work with these cell designators, remember that the value for the row always
comes first, and the value for the column always comes second. This means that the cell at
the intersection of row 2 and column 1 should be identified with the designator "MP-L," not
"L-MP." As you will soon see, being able to quickly interpret these cell designators will make
it easier to write your SAS program and to interpret the results.


Data Set to Be Analyzed
Table 16.1 presents the data set that you will analyze.
Table 16.1
Variables Analyzed in the Aggression Study (Data Set Will Produce
Significant Main Effects and a Nonsignificant Interaction)
____________________________________________________
           Consequences    Model         Subject
Subject    for model       aggression    aggression
____________________________________________________
  01       MR              L             11
  02       MR              L              7
  03       MR              L             15
  04       MR              L             12
  05       MR              L              8
  06       MR              M             24
  07       MR              M             19
  08       MR              M             20
  09       MR              M             23
  10       MR              M             29
  11       MR              H             23
  12       MR              H             29
  13       MR              H             25
  14       MR              H             20
  15       MR              H             27
  16       MP              L              4
  17       MP              L              0
  18       MP              L              9
  19       MP              L              2
  20       MP              L              8
  21       MP              M             17
  22       MP              M             20
  23       MP              M             12
  24       MP              M             17
  25       MP              M             21
  26       MP              H             12
  27       MP              H             20
  28       MP              H             21
  29       MP              H             20
  30       MP              H             18
____________________________________________________

Understanding the columns in the table. The columns of Table 16.1 provide the variables
that you will analyze in your study. The first column in Table 16.1, "Subject," assigns a
unique subject number to each child.
The second column is headed "Consequences for model." This column identifies the
condition to which children were assigned under Predictor B, consequences for the model.
In this column, the value "MR" identifies children in the model-rewarded condition, and
"MP" identifies children in the model-punished condition. You can see that children with
subject numbers 1-15 were in the model-rewarded condition, and children with subject
numbers 16-30 were in the model-punished condition.


The third column is headed "Model aggression." This column identifies the condition to
which children were assigned under Predictor A, level of aggression displayed by the model.
In this column, the value "L" identifies children who saw the model in the videotape display
a low level of aggression, "M" identifies children who saw the model display a moderate
level of aggression, and "H" identifies children who saw the model display a high level of
aggression.
Finally, the column headed "Subject aggression" indicates the number of aggressive acts
that each child displayed in the play room after viewing the videotape. This variable will
serve as the criterion variable in your study.
Understanding the rows of the table. The rows of Table 16.1 represent the individual
children who participated as subjects in the study. The first row represents Subject 1. The
"MR" under "Consequences for model" tells you that this child was in the model-rewarded
condition under Predictor B. The "L" under "Model aggression" tells you that the subject
was in the low condition under Predictor A. Finally, the "11" under "Subject aggression"
tells you that this child displayed 11 aggressive acts after viewing the videotape. The rows
for the remaining children may be interpreted in the same way.
Stepping back to get the big picture, you can see that Table 16.1 contains every possible
combination of the levels of Predictor A and Predictor B. Notice that subjects 1-15 were in
the model-rewarded condition under Predictor B, and that, within this group, subjects 1-5
were in the low-model-aggression condition, subjects 6-10 were in the moderate-model-aggression
condition, and subjects 11-15 were in the high-model-aggression condition under Predictor
A. For subjects 16-30, the pattern repeats itself, with the exception that these subjects were
in the model-punished condition under Predictor B.
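One quick way to confirm this layout after the data are entered (an illustrative sketch, not
part of the original chapter) is to crosstabulate the two predictor variables:

PROC FREQ DATA=D1;
   TABLES CONSEQ*MOD_AGGR;
RUN;

For the data in Table 16.1, the resulting 2 x 3 table should show exactly five subjects in
each of the six cells.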
Writing the DATA Step of the SAS Program
As you type the SAS program, you will enter the data similar to the way that they appear in
Table 16.1. That is, you will have one column to contain subject numbers, one column to
indicate the subject's condition under the "consequences for model" predictor variable, one
column to indicate the subject's condition under the "model aggression" predictor variable,
and one column to indicate the subject's score on the "subject aggression" criterion variable.
Here is the DATA step for your SAS program:
OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM
         CONSEQ   $
         MOD_AGGR $
         SUB_AGGR;
DATALINES;
01 MR L 11
02 MR L 7
03 MR L 15
04 MR L 12
05 MR L 8
06 MR M 24
07 MR M 19
08 MR M 20
09 MR M 23
10 MR M 29
11 MR H 23
12 MR H 29
13 MR H 25
14 MR H 20
15 MR H 27
16 MP L 4
17 MP L 0
18 MP L 9
19 MP L 2
20 MP L 8
21 MP M 17
22 MP M 20
23 MP M 12
24 MP M 17
25 MP M 21
26 MP H 12
27 MP H 20
28 MP H 21
29 MP H 20
30 MP H 18
;
You can see that the INPUT statement of the preceding program uses the following SAS
variable names:
• The variable SUB_NUM represents subject numbers.
• The variable CONSEQ represents the condition that each subject is in under the
consequences-for-model predictor variable (values are either MR or MP for this variable;
note that the variable name is followed by the $ symbol to indicate that it is a character
variable).
• The variable MOD_AGGR represents subject condition under the model-aggression
predictor variable (values are either L, M, or H for this variable; note that the variable
name is also followed by the $ symbol to indicate that it is a character variable).
• The variable SUB_AGGR contains subjects' scores on the criterion variable: the number
of aggressive acts displayed by the subject in the playroom.

Data Screening and Testing Assumptions Prior to Performing the ANOVA
Overview. Prior to performing the ANOVA, you should perform some preliminary analyses
to verify that your data are valid and that you have met the assumptions underlying analysis
of variance. This section summarizes these analyses, and refers you to the other sections of
this Guide that show you how to perform them.
Basic data screening. Before performing any statistical analyses, you should always verify
that your data are valid. This means checking for any obvious errors in typing the data or in
writing the DATA step of your SAS program. At the very least, you should analyze your
numeric variables with PROC MEANS to verify that the means are reasonable and that you
do not have any invalid values. For guidance in doing this, see Chapter 4, "Data Input," the
section "Using PROC MEANS and PROC FREQ to Identify Obvious Problems with a Data
Set."
It is also wise to create a printout of your raw data that you can audit. For guidance in doing
this, again see Chapter 4, the section "Using PROC PRINT to Create a Printout of Raw
Data."
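As a concrete illustration (this sketch is not from the original chapter; it simply applies the
procedures named above to the data set D1 used in this chapter):

PROC MEANS DATA=D1 N MEAN MIN MAX;
   VAR SUB_AGGR;
RUN;
PROC FREQ DATA=D1;
   TABLES CONSEQ MOD_AGGR;
RUN;
PROC PRINT DATA=D1;
RUN;

The MIN and MAX columns make out-of-range scores easy to spot, and the frequency
tables will reveal any mistyped group codes.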
Testing assumptions underlying the procedure. The first section of this chapter included a
list of the assumptions underlying factorial ANOVA. For many of these assumptions (such
as random sampling), there is no statistical procedure for testing the assumption. The only
way to verify that you have met the assumption is to conduct a careful review of how you
conducted the study.
On the other hand, some of these assumptions can be tested statistically using the SAS
System. In particular, an excerpt of one of the assumptions is reproduced below:

• Normal distributions. Each cell should be drawn from a normally-distributed
population.

You can use PROC UNIVARIATE to test this assumption. Using the PLOT option with
PROC UNIVARIATE prints a stem-and-leaf plot that you can use to determine the
approximate shape of the sample data's distribution. Using the NORMAL option with
PROC UNIVARIATE requests a test of the null hypothesis that the sample data were drawn
from a normally-distributed population. You should perform this test separately for each of
the cells that constitute your experimental design (remember that the tests for normality
provided by PROC UNIVARIATE tend to be fairly sensitive when samples are large). For
guidance in doing this, see Chapter 7, "Measures of Central Tendency and Variability," and
especially the section "Using PROC UNIVARIATE to Determine the Shape of
Distributions."
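One convenient way to run these tests cell by cell (a sketch, not code from the original
chapter) is to sort the data by the two predictor variables and then add a BY statement to
PROC UNIVARIATE:

PROC SORT DATA=D1 OUT=SORTED;
   BY CONSEQ MOD_AGGR;
RUN;
PROC UNIVARIATE DATA=SORTED NORMAL PLOT;
   VAR SUB_AGGR;
   BY CONSEQ MOD_AGGR;
RUN;

This produces a stem-and-leaf plot and a normality test for each of the six cells. With only
five subjects per cell in the present study, treat these tests as rough guides rather than
definitive evidence.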
Writing the SAS Program to Perform the Two-Way ANOVA
Overview. This section shows you how to write SAS statements to perform an ANOVA
with two between-subjects factors. It shows how to prepare the PROC GLM statement, the
CLASS statement, the MODEL statement, and the MEANS statement.
The syntax. Below is the syntax for the PROC step that is needed to perform a two-way
factorial ANOVA, and to follow it with Tukey's HSD test:
PROC GLM DATA=data-set-name;
   CLASS predictorB predictorA;
   MODEL criterion-variable = predictorB predictorA
                              predictorB*predictorA;
   MEANS predictorB predictorA predictorB*predictorA;
   MEANS predictorB predictorA / TUKEY CLDIFF ALPHA=alpha-level;
   TITLE1 'your-name';
RUN;
QUIT;
The actual code for the current analysis. Here, the appropriate SAS variable names have
been substituted into the syntax (line numbers have been added on the left):
1  PROC GLM DATA=D1;
2     CLASS CONSEQ MOD_AGGR;
3     MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
4     MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
5     MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
6     TITLE1 'JANE DOE';
7  RUN;
8  QUIT;

Some notes about the preceding code:
• The PROC GLM statement in Line 1 requests the GLM procedure, and requests that the
analysis be performed on data set D1.
• The CLASS statement in Line 2 lists the two classification variables as CONSEQ
(Predictor B) and MOD_AGGR (Predictor A).
• Line 3 contains the MODEL statement for the analysis. The name of the criterion variable
(SUB_AGGR) appears to the left of the equal sign in this statement. To the right of the
equal sign, you should list the following:
   - the two predictor variables (CONSEQ and MOD_AGGR, in this case). The
     names of the two predictor variables should be separated by at least one space.
   - an additional term that represents the interaction between these two predictor
     variables. To create this interaction term, type the names for Predictor B and
     Predictor A, and connect them with an asterisk (*). This should be typed as a
     single term with no spaces. For the current analysis, the interaction term was
     CONSEQ*MOD_AGGR.
• You will notice that, in all of the statements that contain the SAS variable names for
Predictor A and Predictor B, the statement lists Predictor B first, followed by Predictor A
(that is, CONSEQ is always followed by MOD_AGGR). This order may seem
counterintuitive, but there is a reason for it: When SAS lists the means for the various cells
in the study, this output will be somewhat easier to interpret if you list Predictor B prior to
Predictor A in these statements (more on this later).

• Line 4 presents the first MEANS statement:
   MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
This statement requests means and standard deviations on the criterion variable for the
various conditions under Predictor B (CONSEQ) and Predictor A (MOD_AGGR). By
including the interaction term in this statement (CONSEQ*MOD_AGGR), you ensure
that PROC GLM will also print means and standard deviations on the criterion variable
for each of the six cells in the factorial design. You will need these means in interpreting
the results.

• Line 5 presents the second MEANS statement:
   MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
In this statement, you will list the names for the two predictor variables, followed by a
slash and a number of options. The first option is requested by the keyword TUKEY,
which requests that Tukey's HSD test be performed as a multiple comparison procedure,
in the event that the main effects are significant (for an explanation of multiple
comparison procedures, see the section "Treatment Effects, Multiple Comparison
Procedures, and a New Index of Effect Size" in Chapter 15, "One-Way ANOVA with One
Between-Subjects Factor"). Technically, you could have omitted the predictor variable
CONSEQ from this statement because it contains only two conditions; you will remember
from the preceding chapter that it is not necessary to perform a multiple comparison
procedure on a predictor variable when it involves only two conditions.
The MEANS statement also contains the keyword CLDIFF, which requests that the
results of the Tukey test be printed as confidence intervals for the differences between the
means. The option ALPHA=0.05 requests that the significance level (alpha) be set at
.05 for the Tukey tests. If you had wanted alpha set at .01, you would have used the option
ALPHA=0.01, and if you had wanted alpha set at .10, you would have used the option
ALPHA=0.1.
• Although the preceding MEANS statement requests the Tukey HSD test, remember that it
is possible to request other multiple comparison procedures instead of the Tukey test (such
as the Bonferroni t test and the Scheffé multiple comparison procedure); a brief sketch
follows this list. For guidance in doing this, see Chapter 15, the section "Keywords for
Other Multiple Comparison Procedures."
• Finally, lines 6, 7, and 8 contain the TITLE, RUN, and QUIT statements for your
program.
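For instance, if you preferred Bonferroni t tests to the Tukey test, the second MEANS
statement might be written as follows (a sketch, not part of the original program; BON is
the GLM keyword for the Bonferroni procedure):

   MEANS CONSEQ MOD_AGGR / BON CLDIFF ALPHA=0.05;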

The complete SAS program. Here is the complete SAS program that will input your data
set and perform a factorial ANOVA with two between-subjects factors:
OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM
         CONSEQ   $
         MOD_AGGR $
         SUB_AGGR;
DATALINES;
01 MR L 11
02 MR L 7
03 MR L 15
04 MR L 12
05 MR L 8
06 MR M 24
07 MR M 19
08 MR M 20
09 MR M 23
10 MR M 29
11 MR H 23
12 MR H 29
13 MR H 25
14 MR H 20
15 MR H 27
16 MP L 4
17 MP L 0
18 MP L 9
19 MP L 2
20 MP L 8
21 MP M 17
22 MP M 20
23 MP M 12
24 MP M 17
25 MP M 21
26 MP H 12
27 MP H 20
28 MP H 21
29 MP H 20
30 MP H 18
;
PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
   TITLE1 'JANE DOE';
RUN;
QUIT;

Log File Produced by the SAS Program
Why review the log file? After submitting your SAS program, you should always review
the log file prior to reviewing the output file. Verify that the analysis was performed
correctly, and look for any error messages, warning messages, or other notes indicating that
something went wrong. Log 16.1 contains the log file produced by the preceding program.
NOTE: SAS initialization used:
      real time           20.53 seconds
1    OPTIONS LS=80 PS=60;
2    DATA D1;
3       INPUT SUB_NUM
4             CONSEQ   $
5             MOD_AGGR $
6             SUB_AGGR;
7    DATALINES;
NOTE: The data set WORK.D1 has 30 observations and 4 variables.
NOTE: DATA statement used:
      real time           1.43 seconds
38   ;
39   PROC GLM DATA=D1;
40      CLASS CONSEQ MOD_AGGR;
41      MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
42      MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
43      MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
44      TITLE1 'JANE DOE';
45   RUN;
NOTE: Means from the MEANS statement are not adjusted for other terms in the
      model. For adjusted means, use the LSMEANS statement.
NOTE: Means from the MEANS statement are not adjusted for other terms in the
      model. For adjusted means, use the LSMEANS statement.
46   QUIT;
NOTE: There were 30 observations read from the dataset WORK.D1.
NOTE: PROCEDURE GLM used:
      real time           3.29 seconds

Log 16.1. Log file produced by the SAS program.

Note regarding the number of observations and variables. Log 16.1 provides no
evidence of any obvious errors in conducting the analysis. For example, following line 7 of
the program, you can see a note that says "NOTE: The data set WORK.D1 has 30
observations and 4 variables." This is a good sign, because you intended your program to
have 30 observations (subjects) and four variables.
Notes regarding the LSMEANS statement. Following line 45 of the program, you can see
two notes that both say the same thing: "NOTE: Means from the MEANS statement are not
adjusted for other terms in the model. For adjusted means, use the LSMEANS statement."
This note is not necessarily a cause for alarm. When you are performing a two-way
ANOVA, it is appropriate to use the MEANS statement (rather than the LSMEANS
statement) to compute group means on the criterion variable as long as your experimental
design is balanced. An experimental design is balanced if the same number of observations
(subjects) appear in each cell of the design. For example, Figure 16.10 illustrates the
research design used in the aggression study. It shows that there are five subjects in each cell
of the design (that is, there are five subjects in the cell of subjects who experienced the
low condition under Predictor A and the model-rewarded condition under Predictor B,
there are five subjects in the cell of subjects who experienced the moderate condition
under Predictor A and the model-punished condition under Predictor B, and so forth).
Because your experimental design is balanced, there is no need to adjust the means from the
analysis for other terms in the model. This means that it is acceptable to use the MEANS
statement in your program, and it is not necessary to use the LSMEANS statement.
In contrast, a research design is typically unbalanced if some cells in the design contain a
larger number of observations (subjects) than other cells. For example, again consider
Figure 16.10. If there were 20 subjects in Cell MR-L, but only five subjects in each of the
remaining five cells, then the experimental design would be unbalanced. Note that if you are
analyzing data from an unbalanced design, using the MEANS statement may produce
marginal means that are biased. Thus, to analyze data from an unbalanced design, it is
generally preferable to use the LSMEANS (least-squares means) statement in your program,
rather than the MEANS statement. This is because the LSMEANS statement will estimate
the marginal means over a balanced population. The marginal means estimated by the
LSMEANS statement are less likely to be biased.
In summary, if your experimental design is balanced, you can ignore the note about
LSMEANS that appears in your log file. Because the design of your aggression study is
balanced, it is appropriate to use the MEANS statement.
Analyzing unbalanced designs. For guidance in analyzing data from studies with unequal
cell sizes, see the section "Using the LSMEANS Statement to Analyze Data from
Unbalanced Designs," which appears toward the end of this chapter.
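As a rough preview (a sketch, not the code from that section), the idea is simply to request
an LSMEANS statement in the GLM step so that adjusted means are printed:

PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   LSMEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
RUN;
QUIT;

The LSMEANS statement prints least-squares (adjusted) means for each main effect and
for the six cells, which is what you want when cell sizes are unequal.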
Output Produced by the SAS Program
The preceding program would produce five pages of output. Most of this output will be
presented in this section. The information that appears on each page is summarized below:
• Page 1 provides class level information and the number of observations in the data set.
• Page 2 provides the ANOVA summary table from the GLM procedure.
• Page 3 provides results from the first MEANS statement. These results consist of three
tables that present means and standard deviations for the criterion variable. These means
and standard deviations are broken down by the various levels that constitute the study:
   - The first table provides the means observed for each level of CONSEQ.
   - The second table provides the means observed for each level of MOD_AGGR.
   - The third table provides the means observed for each of the six cells in the study's
     factorial design.
• Page 4 provides the results of the Tukey multiple comparison procedure for Predictor B
(CONSEQ).
• Page 5 provides the results of the Tukey multiple comparison procedure for Predictor A
(MOD_AGGR).

Steps in Interpreting the Output
Overview. The fictitious data set that was analyzed in this section was designed so that the
interaction term would be nonsignificant. When the interaction is nonsignificant,
interpreting the results of a two-factor ANOVA is very similar to interpreting the results of a
one-factor ANOVA. This section begins by showing you how to review specific sections of
output to verify that there were no obvious errors in entering data or in writing the program.
It will then show you how to determine whether the interaction is significant, how to
determine whether the main effects were significant, how to prepare an ANOVA summary
table, and how to review the results of the multiple comparison procedures.
1. Make sure that the output looks correct. The output created by the GLM procedure in the
preceding program contains information that may help identify possible errors in writing
the program or in entering the data. This section shows how to review that information.
You will begin by reviewing the class level information, which appears on page 1 of the
PROC GLM output. This page is reproduced here as Output 16.1.
JANE DOE
The GLM Procedure

Class Level Information

Class        Levels    Values
CONSEQ            2    MP MR
MOD_AGGR          3    H L M

Number of observations    30

Output 16.1. Verifying that everything looks correct on the class level
information page; two-way ANOVA performed on aggression data, significant
main effects, nonsignificant interaction.


Below the heading "Class," you will find the names of the classification variables (the
predictor variables) in your analysis. In Output 16.1, you can see that the classification
variables are CONSEQ and MOD_AGGR.
Under the heading "Levels," the output should indicate how many levels (or conditions)
there are for each of your predictor variables. Output 16.1 shows that there were two levels
for the predictor variable CONSEQ, and there were three levels for the predictor variable
MOD_AGGR. This is as it should be.
Under the heading "Values," the output should indicate the specific numbers or letters that
you used to code the two predictor variables. Output 16.1 shows that you used the values
"MP" and "MR" to code conditions under CONSEQ, and that you used the values "H," "L,"
and "M" to code conditions under MOD_AGGR. This is all correct.
Remember that it is important to use uppercase and lowercase letters consistently when you
are coding treatment conditions under the predictor variables. For example, the preceding
paragraph indicated that you used uppercase letters (H, L, and M) in coding conditions
under MOD_AGGR. If you had accidentally keyed a lowercase "h" instead of an uppercase
"H" for one or more of your subjects, the SAS System would have interpreted that as a code
for a new and different treatment condition. This, of course, would have led to errors in the
analysis. Again, the point is that it is important to use uppercase and lowercase letters
consistently when you are coding treatment conditions, because SAS treats uppercase letters
and lowercase letters as different values.
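If you are worried about stray lowercase codes, one defensive habit (purely illustrative, and
not part of the original program) is to convert the codes to uppercase in the DATA step with
the UPCASE function:

DATA CHECK;
   INPUT CONSEQ $ @@;
   CONSEQ = UPCASE(CONSEQ);   * 'mr' and 'Mr' both become 'MR';
DATALINES;
MR mr Mp MP
;
PROC FREQ DATA=CHECK;
   TABLES CONSEQ;
RUN;

In this toy example, the frequency table shows only the two intended codes (MP and MR),
because every value was standardized before it was stored.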
Finally, the last line in Output 16.1 indicates the number of observations in the data set. The
present experimental design consisted of six cells of subjects with five subjects in each cell,
for a total of 30 subjects in the study. Output 16.1 indicates that your data set included 30
observations, and so everything appears to be correct at this point.
Page 2 of the output provides the analysis of variance table created by PROC GLM. It is
reproduced here as Output 16.2.


JANE DOE
The GLM Procedure

Dependent Variable: SUB_AGGR
                                    Sum of
Source                   DF        Squares    Mean Square   F Value   Pr > F
Model                     5    1456.166667     291.233333     22.32   <.0001
Error                    24     313.200000      13.050000
Corrected Total          29    1769.366667

R-Square     Coeff Var     Root MSE    SUB_AGGR Mean
0.822988      21.98263     3.612478         16.43333

Source                   DF      Type I SS    Mean Square   F Value   Pr > F
CONSEQ                    1     276.033333     276.033333     21.15   0.0001
MOD_AGGR                  2    1178.866667     589.433333     45.17   <.0001
CONSEQ*MOD_AGGR           2       1.266667       0.633333      0.05   0.9527

Source                   DF    Type III SS    Mean Square   F Value   Pr > F
CONSEQ                    1     276.033333     276.033333     21.15   0.0001
MOD_AGGR                  2    1178.866667     589.433333     45.17   <.0001
CONSEQ*MOD_AGGR           2       1.266667       0.633333      0.05   0.9527

Output 16.2. Verifying that everything looks correct with the ANOVA summary
table; two-way ANOVA performed on aggression data, significant main effects,
nonsignificant interaction.

Near the top of page 2 on the left side, the name of the criterion variable being analyzed
should appear to the right of the heading "Dependent Variable." In Output 16.2, the
dependent variable is listed as SUB_AGGR. You will remember that SUB_AGGR stands
for "subject aggression." The remainder of Output 16.2 provides information about the
analysis of this criterion variable.
The top half of Output 16.2 consists of the ANOVA summary table for the analysis. This
ANOVA summary table is made up of columns with headings such as "Source," "DF,"
"Sum of Squares," and so on.
The first column of this table is headed "Source," and below this "Source" heading are three
subheadings: "Model," "Error," and "Corrected Total." Look to the right of the heading
"Corrected Total," and under the column headed "DF."
For the current output, you will see the number 29. This number represents the corrected
total degrees of freedom. This number should always be equal to N - 1, where N represents
the total number of subjects for whom you have a complete set of data. In this study, N was
30, and so the corrected total degrees of freedom should be equal to 30 - 1 = 29. Output 16.2
shows that the corrected total degrees of freedom are in fact equal to 29, so again it appears
that everything is correct so far.
Later, you will return to the ANOVA summary table that appears in Output 16.2 to
determine whether any of the effects are significant (and to review other important
information). For now, however, you will continue reviewing other pages of output to see if
there are any other obvious signs of problems with your analysis.


Page 3 of the output provides means and standard deviations on the criterion variable broken
down by the various conditions under the two predictor variables. This page is reproduced
here as Output 16.3.

JANE DOE
The GLM Procedure

Level of            -----------SUB_AGGR----------
CONSEQ       N          Mean          Std Dev
MP          15     13.4000000       7.28795484
MR          15     19.4666667       7.31794923

Level of            -----------SUB_AGGR----------
MOD_AGGR     N          Mean          Std Dev
H           10     21.5000000       4.83620604
L           10      7.6000000       4.59951688
M           10     20.2000000       4.58984386

Level of    Level of        -----------SUB_AGGR----------
CONSEQ      MOD_AGGR    N        Mean          Std Dev
MP          H           5    18.2000000      3.63318042
MP          L           5     4.6000000      3.84707681
MP          M           5    17.4000000      3.50713558
MR          H           5    24.8000000      3.49284984
MR          L           5    10.6000000      3.20936131
MR          M           5    23.0000000      3.93700394

Output 16.3. Verifying that everything looks correct regarding the means and
standard deviations for the various conditions under Predictor A (model
aggression) and Predictor B (consequences for model).

Output 16.3 is divided into three tables.
The top table provides means and standard deviations on the criterion variable broken down
by levels of the predictor variable CONSEQ (consequences for the model). To the right of
the value "MP" you will find the mean and standard deviation for the model-punished
group:
• Below the heading "N," you can see that there were 15 subjects in this condition, as
expected.
• Below the heading "Mean," you can see that the mean score of this group on the subject
aggression criterion variable was 13.4, which seems reasonable.
• Below the heading "Std Dev," you can see that the group's standard deviation was
approximately 7.29, which again seems reasonable.
To the right of the value "MR," you will find the same statistics for the model-rewarded
group. These statistics can be reviewed in the same way to verify that they seem correct.
The second table in Output 16.3 is headed "Level of MOD_AGGR." This table provides
means and standard deviations for the various conditions under the MOD_AGGR predictor
variable (this was the predictor variable that manipulated the level of aggression displayed
by the model). You can use the same procedure (described above) to review the means and
standard deviations for the three conditions under this predictor variable, to verify that they
are reasonable.
Finally, the bottom table in Output 16.3 provides means and standard deviations broken
down by both predictor variables simultaneously. This is analogous to saying that it provides
means and standard deviations for each of the six cells that constitute your experimental
design. For example, consider the first row in this table. This row is identified by the value
"MP" under "Level of CONSEQ" and the value "H" under "Level of MOD_AGGR." This
means that this row provides information about the one cell of subjects who were,
respectively, in the model-punished condition under Predictor B (consequences for the
model) and in the high-model-aggression condition under Predictor A (level of aggression
displayed by the model). Other information in this row shows that (a) there were five
subjects in the cell (under "N"), (b) their mean score on the criterion variable was 18.2
(under "Mean"), and (c) their standard deviation was approximately 3.63 (under "Std Dev").
All of these figures seem reasonable.
In summary, these results provide no evidence of any obvious errors in writing the program
or in keying the data. You can therefore proceed to interpreting the results that are relevant
to your research questions.
2. Determine whether the interaction term is statistically significant. Two-factor
ANOVA allows you to test for three types of effects: (a) the main effect of Predictor A
(level of aggression displayed by model, in this case), (b) the main effect of Predictor B
(consequences for the model), and (c) the interaction between Predictor A and Predictor B.
Remember that, if the interaction term is significant, you should interpret the main effects
only with great caution. So one of your first steps must be to determine whether the
interaction is significant.
The null hypothesis for the interaction effect in the aggression study can be stated in this
way:
Statistical null hypothesis (Ho): In the population, there is no interaction between the
level of aggression displayed by the model and the consequences for the model in the
prediction of the criterion variable (the number of aggressive acts displayed by the
subjects).
You can determine whether the interaction is significant by looking at the analysis of
variance results, which appear on page 2 of the SAS Output. That page is reproduced here as
Output 16.4.


JANE DOE
The GLM Procedure

Dependent Variable: SUB_AGGR
                                    Sum of
Source                   DF        Squares    Mean Square   F Value   Pr > F
Model                     5    1456.166667     291.233333     22.32   <.0001
Error                    24     313.200000      13.050000
Corrected Total          29    1769.366667

R-Square     Coeff Var     Root MSE    SUB_AGGR Mean
0.822988      21.98263     3.612478         16.43333

Source                   DF      Type I SS    Mean Square   F Value   Pr > F
CONSEQ                    1     276.033333     276.033333     21.15   0.0001
MOD_AGGR                  2    1178.866667     589.433333     45.17   <.0001
CONSEQ*MOD_AGGR           2       1.266667       0.633333      0.05   0.9527

Source                   DF    Type III SS    Mean Square   F Value   Pr > F
CONSEQ                    1     276.033333     276.033333     21.15   0.0001
MOD_AGGR                  2    1178.866667     589.433333     45.17   <.0001
CONSEQ*MOD_AGGR           2       1.266667       0.633333      0.05   0.9527

Output 16.4. Determining whether the interaction is significant; two-way
ANOVA performed on aggression data, nonsignificant interaction,
significant main effects.

Toward the top of Output 16.4, you can see that the criterion variable being analyzed is
SUB_AGGR. The bottom half of the output page actually provides two sum of squares
tables: one based on the Type I sum of squares, and one based on the Type III sum of
squares. Remember that it is usually best to interpret only the results that are based on the
Type III sum of squares. When your cell sizes are equal (i.e., when the design is balanced),
the results from the two sections will be identical. However, when the cell sizes are not
equal, the Type III results are more appropriate.
In the lower left corner of Output 16.4 is the heading "Source." Below this heading are
the names of the predictor variables in your study (CONSEQ and MOD_AGGR), along with
the name of the interaction term (CONSEQ*MOD_AGGR). If you look to the right of the
name for the interaction term, you can see that the interaction has 2 degrees of freedom,
a value of approximately 1.267 for the Type III sum of squares, a mean square of 0.633, an
F value of 0.05, and a corresponding p value of .9527. Remember that you generally view a
result as being statistically significant only if the p value is less than .05. Since this p value
of .9527 is larger than .05, you conclude that the interaction between the two predictor
variables is nonsignificant. You can therefore proceed with your review of the two main
effects.


3. Determine whether either of the two main effects is statistically significant. A two-way
ANOVA allows you to test two null hypotheses concerning main effects: one for
Predictor A, and one for Predictor B. The null hypothesis for Predictor A (the level of
aggression displayed by the model) can be stated as follows:
Statistical null hypothesis (H0): µA1 = µA2 = µA3
In the population, there is no difference between subjects in the low-model-aggression
condition, subjects in the moderate-model-aggression condition, and subjects in the
high-model-aggression condition with respect to mean scores on the criterion variable
(the number of aggressive acts displayed by the subjects).
You can see that the preceding null hypothesis is essentially identical to the null hypothesis
that was stated for the aggression study in Chapter 15, "One-Way ANOVA with One
Between-Subjects Factor." The difference is in the symbolic representation of the null
hypothesis: µA1 = µA2 = µA3. In this symbolic representation, the subscripts to the symbol
µ are now A1, A2, and A3. These subscripts identify Level A1, Level A2, and Level A3
under Predictor A, respectively. These levels correspond to the low-model-aggression
condition, the moderate-model-aggression condition, and the high-model-aggression
condition, respectively.
The F statistic to test this null hypothesis can again be found in the ANOVA summary table
in Output 16.4. To the right of the heading MOD_AGGR, you can see that this effect
has 2 degrees of freedom, a value of 1178.87 for the Type III sum of squares, a mean square
of 589.43, an F value of 45.17, and a p value of <.0001. This p value is less than our
standard criterion of .05. With such a small p value, you can clearly reject the null
hypothesis of no main effect for model aggression. This means that at least two of the
conditions under this predictor variable must be significantly different from each other
(although you don't know which two at this point). Later, you will review the results of the
Tukey tests to see which groups (low-model-aggression, moderate-model-aggression, etc.)
are significantly different from one another.
The second test for main effects is the test for the main effect of Predictor B (the
consequences for the model). Remember that Predictor B had only two conditions (model-rewarded
versus model-punished). Therefore, you will be able to state the null hypothesis
for this predictor variable by using the same format that was used for an independent-samples
t test (which always has only two conditions). The null hypothesis for Predictor B
(consequences for the model) may be stated as follows:
Statistical null hypothesis (H0): µB1 = µB2
In the population, there is no difference between subjects in the model-rewarded
condition versus subjects in the model-punished condition with respect to mean scores
on the criterion variable (number of aggressive acts displayed by the subjects).
The appropriate F statistic is found toward the bottom of Output 16.4. To the right of the
variable name CONSEQ, you can see that this effect is associated with 1 degree of
freedom, a value of 276.03 for the Type III sum of squares, a mean square of 276.03, an F
value of 21.15, and a p value of .0001. This p value is also less than .05; therefore you may
again reject the null hypothesis. You may conclude that there is also a significant main
effect for the "consequences for model" predictor variable.
Because there are only two levels of CONSEQ, you will not have to review the results of the
Tukey test for this variable; the Tukey test is necessary only if the predictor variable has
three or more conditions. You need only look at the means for the two conditions to
determine which condition scored significantly higher than the other. A later section will
show you how to do this.
4. Prepare your own version of the ANOVA summary table. The completed ANOVA
summary table for the analysis is reproduced here as Table 16.2.
Table 16.2
ANOVA Summary Table for Study Investigating the Relationship between
Level of Aggression Displayed by Model (A), Consequences For Model (B),
and Subject Aggression (Significant Main Effects, Nonsignificant
Interaction)
________________________________________________________________________
Source                        df        SS       MS       F       p     R2
________________________________________________________________________
Model aggression (A)           2   1178.87   589.43   45.17   .0001   .67
Consequences for model (B)     1    276.03   276.03   21.15   .0001   .16
A X B Interaction              2      1.27     0.63    0.05   .9527   .00
Within groups                 24    313.20    13.05
Total                         29   1769.37
________________________________________________________________________
Note: N = 30

The headings in Table 16.2 are similar to the headings used in Chapter 15, "One-Way
ANOVA with One Between-Subjects Factor." Specifically:
• In the column headed "df," you will provide the degrees of freedom associated with
different sources of variation.
• In the column headed "SS," you will provide the Type III sum of squares.
• In the column headed "MS," you will provide the mean square for each source of
variation.
• In the column headed "F," you will provide the F value for each effect.
• In the column headed "p," you will provide the p value (probability value) for each effect
(the tables shown in Chapter 15 did not contain a separate column for p values).
• In the column headed "R2," you will provide R2 values for the different effects. You will
remember that R2 is an index of effect size that can be used with ANOVA.
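The R2 entries in Table 16.2 can be reproduced by dividing each effect's Type III sum of
squares by the corrected total sum of squares (this formula is assumed here because it
matches the tabled values; Chapter 15 introduces R2 as an effect-size index for ANOVA).
Using the sums of squares from Output 16.4:

R2 (model aggression)       = 1178.87 / 1769.37 = .67
R2 (consequences for model) =  276.03 / 1769.37 = .16
R2 (A X B interaction)      =    1.27 / 1769.37 = .00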

To complete the preceding table, you will mostly transfer information from Output 16.4 to
the appropriate line of the ANOVA summary table. One exception is the last column in the
table: the column that provides R2 values. You will have to compute those values manually
(see "R2 values" in the following list).
Here is a description of how information was transferred from Output 16.4 to Table 16.2:

Main effect for Predictor A. Predictor A in your study was the level of aggression displayed by the model. Information concerning this effect appears to the right of the heading MOD_AGGR in Output 16.4. You can see that all of the information in the row for MOD_AGGR in Output 16.4 (for example, degrees of freedom, sum of squares) has been entered on the line headed Model aggression (A) in Table 16.2. The only piece of information that is not directly provided by Output 16.4 is the R2 value for Table 16.2 (see R2 values below).

Main effect for Predictor B. Predictor B in your study was consequences for the model. Information concerning this effect appears to the right of the heading CONSEQ in Output 16.4. You can see that all of the information in the row for CONSEQ in Output 16.4 (e.g., degrees of freedom, sum of squares) has been entered on the line headed Consequences for model (B) in Table 16.2.

The A × B interaction. Information about the interaction between Predictor A and Predictor B can be found to the right of the heading CONSEQ*MOD_AGGR in Output 16.4. You can see that all of the information from the row for this CONSEQ*MOD_AGGR interaction in Output 16.4 (e.g., degrees of freedom, sum of squares) has been entered on the line headed A X B Interaction in Table 16.2.

Within groups. In the chapter about one-way ANOVA, you learned that the Within groups line of an ANOVA summary table contains information about the error term from the analysis of variance. To find this information for the current analysis, look to the right of the heading Error in Output 16.4. You can see that the information from the Error line of Output 16.4 has been copied onto the line headed Within groups in Table 16.2.

Total. In the preceding chapter, you also learned that the total degrees of freedom and the total sum of squares from an analysis of variance can be found to the right of the heading Corrected Total in the output of PROC GLM, and the same is true for a factorial ANOVA. For the current analysis, look to the right of Corrected Total in Output 16.4. You can see that the information from this line has been copied onto the line headed Total in Table 16.2.

R2 values. In earlier chapters, you learned that it is good practice to report an index of
effect size when you conduct an experiment. In general, an index of effect size is a
measure of the magnitude of a treatment effect. A variety of different types of indices are
available to researchers. In Chapter 15, you learned that an index of effect size that is often used with ANOVA is R2. The R2 statistic indicates the proportion of variance in the criterion variable that is accounted for by the study's predictor variable(s). Values of R2 may range from .00 to 1.00, with larger values indicating a larger treatment effect.
In the preceding chapter, you learned that the output of PROC GLM includes the heading R-Square, and below this heading you may find the R2 value for a one-way ANOVA. When you perform a two-way ANOVA, you will again see the heading R-Square in your output. However, you generally will not report this R2 value in your analysis. The R2 value that appears in the output of PROC GLM indicates the total percent of variance in the criterion variable that is accounted for by all of your treatment effects combined. In most cases, it is better to instead report a separate R2 value for each of your treatment effects individually. This means that you will report one R2 value for Predictor A, one R2 value for Predictor B, and one R2 value for the interaction term. You will have to perform a few simple hand calculations to compute these three values of R2.
To calculate R2 for a given effect, divide the Type III sum of squares associated with that effect by the corrected total sum of squares. For example, Table 16.2 shows that the sum of squares for model aggression (Predictor A) is 1178.87, and the total sum of squares for the analysis is 1769.37. To calculate the R2 value for the model aggression main effect (Predictor A), you substitute these terms in the formula for R2, as is done below:
   R2 = Sum of Squares for Predictor A / Total Sum of Squares
      = 1178.87 / 1769.37
      = .67

The R2 value of .67 indicates that Predictor A (the level of aggression displayed by the
model) accounted for 67% of the variance in the criterion variable (the number of
aggressive acts displayed by the subjects). This is a fairly large effect size (but remember
that these data are fictitious).
To calculate the R2 value for the consequences for model main effect (Predictor B), you divide the sum of squares for Predictor B (from Table 16.2) by the same total sum of squares:

   R2 = Sum of Squares for Predictor B / Total Sum of Squares
      = 276.03 / 1769.37
      = .16

The R2 value of .16 indicates that Predictor B accounted for 16% of the variance in the
criterion variable. This is a smaller treatment effect, compared to the treatment effect for
Predictor A.
Finally, to calculate the R2 value for the interaction between Predictor A and Predictor B, you divide the sum of squares for the A × B interaction (from Table 16.2) by the same total sum of squares:
   R2 = Sum of Squares for A X B Interaction / Total Sum of Squares
      = 1.27 / 1769.37
      = .00

This computation resulted in an R2 value of .0007, which rounded to .00. Clearly, the
interaction term accounted for almost none of the variance in the criterion variable.
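If you would rather let SAS perform this arithmetic than do it by hand, a short DATA step such as the following will reproduce all three R2 values. This is only a sketch: the dataset name R2_TABLE and its variable names are arbitrary names introduced here, and the sums of squares are simply the values copied from Output 16.4 (see Table 16.2).

   DATA R2_TABLE;
      SS_TOTAL = 1769.37;                       /* Corrected total SS     */
      R2_A  = ROUND(1178.87 / SS_TOTAL, .01);   /* Predictor A: .67       */
      R2_B  = ROUND( 276.03 / SS_TOTAL, .01);   /* Predictor B: .16       */
      R2_AB = ROUND(   1.27 / SS_TOTAL, .01);   /* A X B interaction: .00 */
   RUN;
   PROC PRINT DATA=R2_TABLE;
   RUN;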
Once you have computed the R2 values for the two main effects and the interaction
effect, you can enter them in the column headed R2 in your ANOVA summary table.
You can see that this has already been done in Table 16.2.
5. Review the sample means and the results of the multiple comparison procedures for
Predictor A. If a particular main effect is statistically significant, you can review the group
means and the results of the Tukey tests to determine which pairs of groups are significantly
different from each other. Do this in the same way that you would if you had conducted a
one-way ANOVA.
One of the MEANS statements in your program requested means and standard deviations,
broken down by the levels of the predictor variables. The output created by that statement
appeared earlier as Output 16.3. An excerpt of the same output is reproduced here as Output
16.5.
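For reference, the MEANS statement in the program that requested these descriptive statistics (it is shown with the complete program later in this chapter) was:

   MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;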

                                JANE DOE
                           The GLM Procedure

        Level of          -----------SUB_AGGR----------
        CONSEQ      N         Mean          Std Dev
        MP         15     13.4000000      7.28795484
        MR         15     19.4666667      7.31794923

        Level of          -----------SUB_AGGR----------
        MOD_AGGR    N         Mean          Std Dev
        H          10     21.5000000      4.83620604
        L          10      7.6000000      4.59951688
        M          10     20.2000000      4.58984386

Output 16.5. Means and standard deviations for conditions under Predictor A
(level of aggression displayed by the model); two-way ANOVA performed on
aggression data, significant main effects, nonsignificant interaction.

Means and standard deviations for Predictor A appear in the section headed Level of
MOD_AGGR. Below this heading, the row identified with the value H provides
information for the high-model-aggression group, the row identified with the value L
provides information for the low-model-aggression group, and the row identified with the
value M provides information for the moderate-model-aggression group.
Below the heading Mean you can see the means for the three treatment conditions. These
results show that the high-model-aggression group (identified with an H) displayed a
mean of 21.5 (meaning that these children displayed an average of 21.5 aggressive acts after
viewing the videotape). The moderate-model-aggression group (identified with an M)
displayed a mean of 20.2, and the low-model-aggression group (identified with an L)
displayed a mean of 7.6.
On the surface, it appears that the low condition scored substantially lower than the other
two conditions on the criterion variable. But will the differences be statistically significant?
To find out, you must review the results of the Tukey HSD multiple comparison procedure.
Output 16.6 provides the results of this procedure for Predictor A.
                                JANE DOE
                           The GLM Procedure

          Tukey's Studentized Range (HSD) Test for SUB_AGGR

   NOTE: This test controls the Type I experimentwise error rate.

           Alpha                                    0.05
           Error Degrees of Freedom                   24
           Error Mean Square                       13.05
           Critical Value of Studentized Range   3.53170
           Minimum Significant Difference         4.0345

    Comparisons significant at the 0.05 level are indicated by ***.

                       Difference
         MOD_AGGR       Between       Simultaneous 95%
        Comparison        Means       Confidence Limits
         H - M            1.300      -2.734      5.334
         H - L           13.900       9.866     17.934   ***
         M - H           -1.300      -5.334      2.734
         M - L           12.600       8.566     16.634   ***
         L - H          -13.900     -17.934     -9.866   ***
         L - M          -12.600     -16.634     -8.566   ***

Output 16.6. Results of Tukey HSD test for Predictor A (level of aggression
displayed by the model); two-way ANOVA performed on aggression data,
significant main effects, nonsignificant interaction.

This section will present a brief review of how to interpret the results of the Tukey tests and how to prepare a table to display the confidence intervals for the differences between the means. All of these concepts were explained in much more detail in Chapter 15, One-Way ANOVA with One Between-Subjects Factor. If you need to refresh your memory on how to interpret the output that presents the Tukey tests and confidence intervals, go to Chapter 15 and reread Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect, especially items 5-7 in the subsection Steps in Interpreting the Output.
In Output 16.6, an entry at the top of the page tells you that the criterion variable in the analysis was SUB_AGGR. Remember that SUB_AGGR represents the number of aggressive acts displayed by the subject. Lower on the page, the heading MOD_AGGR Comparison indicates that the predictor variable in this analysis is MOD_AGGR (the level of aggression displayed by the model). Below this heading are a number of entries such as H - M, H - L, and so on. These entries tell you which conditions are being compared. You will remember that you coded your data so that H would represent the high-model-aggression condition, M would represent the moderate-model-aggression condition, and L would represent the low-model-aggression condition. The same values appear in this section.
The first row is identified by the entry H - M. This means that this row provides information about the comparison between the high-model-aggression condition versus the moderate-model-aggression condition. Below the heading Difference Between Means, you can see that the difference between means for these two conditions is 1.300. The order in which the values H and M appear in the entry H - M tells you how this difference was computed: SAS started with the mean for the high-model-aggression condition, and subtracted from it the mean for the moderate-model-aggression condition. The resulting difference was 1.300.
The next two columns are headed Simultaneous 95% Confidence Limits. This section provides the 95% confidence interval for the difference between the means. Output 16.6 shows that the 95% confidence interval for the current difference extends from -2.734 to 5.334. This means that, although you are not sure what the actual difference between the means is (in the population), you estimate that there is a 95% probability that it is somewhere between -2.734 and 5.334. Notice that this confidence interval contains the value of zero. In general, when a confidence interval contains the value of zero, it means that the difference between the two conditions is not statistically significant.
A note midway down Output 16.6 says Comparisons significant at the 0.05 level are indicated by ***. This means that three asterisks will be used to flag any comparisons that are significant at p < .05. These asterisks appear on the right side of the table at the bottom of Output 16.6. You can see that no asterisks appear on the right side of the row for the entry H - M. This means that, according to the Tukey HSD test, the difference between the high-model-aggression condition versus the moderate-model-aggression condition is not statistically significant.
The row identified with H - L provides information about the comparison between the high-model-aggression condition versus the low-model-aggression condition. Information in this row shows that the difference between means for these two conditions is 13.900, and that the 95% confidence interval ranges from 9.866 to 17.934. Three asterisks (***) appear on the right side of this row, indicating that the difference between the high condition and the low condition is in fact significant at the .05 level.
Finally, the fourth row down is identified with M - L, meaning that this row provides information about the comparison between the moderate-model-aggression condition versus the low-model-aggression condition. Information in this row shows that the difference between means for these two conditions is 12.600, and that the 95% confidence interval ranges from 8.566 to 16.634. Again, three asterisks appear on the right side of the row, indicating that the difference between the moderate condition and the low condition is also significant at the .05 level.
Notice that there are three additional rows of information at the bottom of Output 16.6. It is not necessary to interpret them here, because these rows provide information that is already provided by the three other rows. For example, the third row down is identified by M - H. This row provides the same information as was provided by the row labeled H - M (discussed above). The other two rows provide similarly redundant information.
Because there is so much information provided in Output 16.6, it is useful to summarize the
most important information in a table:
Table 16.3
Results of Tukey Tests for Predictor A: Comparing High-Model-Aggression
Group versus Moderate-Model-Aggression Group versus Low-Model-Aggression
Group on the Criterion Variable (Subject Aggression)
________________________________________________________
                       Difference     Simultaneous 95%
                        between      confidence limits
Comparison              means (a)     Lower      Upper
________________________________________________________
High - Moderate          1.300       -2.734      5.334
High - Low              13.900 *      9.866     17.934
Moderate - Low          12.600 *      8.566     16.634
________________________________________________________
Note: N = 30
(a) Differences are computed by subtracting the mean for the second
group from the mean for the first group.
* Tukey test indicates that the difference between the means is
significant at p < .05.

Table 16.3 summarizes the results of the multiple comparison procedure. The asterisks in the
table show that (a) the high-model-aggression condition scored significantly higher than the
low-model-aggression condition, (b) the moderate-model-aggression condition also scored
significantly higher than the low-model-aggression condition, but that (c) the difference
between the high condition versus the moderate condition was not statistically significant.
Chapter 15 provides detailed instructions on how to take information from SAS output such as Output 16.6, and use it to create a table such as Table 16.3. In particular, see the sub-subsection titled 7. Prepare a table that presents the results of the Tukey tests and the confidence intervals.
6. Review the sample means and the confidence interval for Predictor B. Predictor B in
your study was consequences for the model. Output 16.5 displayed means on the criterion
variable broken down by levels of this predictor variable. For convenience, that output is
presented again as Output 16.7.
                                JANE DOE
                           The GLM Procedure

        Level of          -----------SUB_AGGR----------
        CONSEQ      N         Mean          Std Dev
        MP         15     13.4000000      7.28795484
        MR         15     19.4666667      7.31794923

        Level of          -----------SUB_AGGR----------
        MOD_AGGR    N         Mean          Std Dev
        H          10     21.5000000      4.83620604
        L          10      7.6000000      4.59951688
        M          10     20.2000000      4.58984386

Output 16.7. Means for conditions under Predictor B (consequences for the
model); two-way ANOVA performed on aggression data, significant main
effects, nonsignificant interaction.

Statistics related to Predictor B appear in the section headed Level of CONSEQ.


In the section headed Mean, you can see that the subjects in the model-punished condition (identified by the value MP) displayed an average of 13.4 aggressive acts, while the subjects in the model-rewarded condition (identified by the value MR) displayed an average of approximately 19.47 aggressive acts. You already know that the difference between these two means is statistically significant because you have already determined that the main effect for Predictor B (CONSEQ) from the ANOVA was statistically significant (see Output 16.4 and Table 16.2). From one perspective, it is not really necessary for you to review the results of the Tukey HSD test to determine whether there is a significant difference between these two conditions; you need to refer to the Tukey test only when the predictor variable has three or more conditions.
Although you do not need to review the results of the Tukey test to determine whether there is a significant difference between the two conditions under Predictor B, there is a different reason that you should review these results. In addition to printing the results of a significance test, the Tukey procedure also prints a confidence interval for the difference between the means (as long as you have included the CLDIFF option in the MEANS statement). Therefore, the results of the Tukey test for Predictor B are reproduced here as Output 16.8.
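For reference, the MEANS statement in the program that requested the Tukey tests and the CLDIFF confidence limits (see the complete program later in this chapter) was:

   MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;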
                                JANE DOE
                           The GLM Procedure

          Tukey's Studentized Range (HSD) Test for SUB_AGGR

   NOTE: This test controls the Type I experimentwise error rate.

           Alpha                                    0.05
           Error Degrees of Freedom                   24
           Error Mean Square                       13.05
           Critical Value of Studentized Range   2.91880
           Minimum Significant Difference         2.7225

    Comparisons significant at the 0.05 level are indicated by ***.

                      Difference
          CONSEQ       Between       Simultaneous 95%
        Comparison       Means       Confidence Limits
         MR - MP         6.067       3.344      8.789   ***
         MP - MR        -6.067      -8.789     -3.344   ***

Output 16.8. Confidence interval for Predictor B (consequences for the model); two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.

A note at the top of Output 16.8 tells you that the criterion variable being analyzed is SUB_AGGR, the number of aggressive acts displayed by the subjects. A note toward the bottom of the page tells you that the predictor variable is CONSEQ, the consequences for the model.
The first row in the bottom section contains the entry MR - MP. This entry indicates that this row provides information about the comparison in which the mean of the model-punished condition (MP) is subtracted from the mean of the model-rewarded condition (MR). In the column headed Difference Between Means, you can see that the difference between these two means is equal to 6.067 (19.467 - 13.400 = 6.067). In the column headed Simultaneous 95% Confidence Limits you can see that the 95% confidence interval for this difference goes from 3.344 to 8.789. There is a second row of information at the bottom of this output (the MP - MR row), but it is not necessary to interpret it, since it provides essentially redundant information.
The confidence interval from Output 16.8 tells you that, although you do not know what the
exact difference between the means is (in the population), you estimate that there is a 95%
probability that the actual difference is between 3.344 and 8.789. (Many research journals
will expect you to report this confidence interval in your published article.) With Predictor A
(discussed in an earlier section), you presented your confidence intervals in a table,
specifically Table 16.3. This was because Predictor A included three treatment conditions,
and this meant that you had three confidence intervals to present. In relatively complicated
situations such as that, it is often best to present your confidence intervals in the form of a
table.
However, Predictor B included only two treatment conditions, meaning that there is only
one confidence interval to present. In relatively simple situations such as this it will typically
not be necessary to prepare a table to present the confidence interval. Because there is
relatively little information to present, you can simply describe the confidence interval in the
text of your paper. Following is an example of how this might be done for Predictor B, using
the information from Output 16.8.
Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 6.067. The 95% confidence interval for this difference ranged from 3.344 to 8.789.
Using a Figure to Illustrate the Results
The results of a factorial ANOVA are easiest to understand when they are represented in a figure that plots the means for each of the cells in the study's factorial design. The factorial design used in the current study involved a total of six cells (this design was presented in Figure 16.1). The mean aggression scores for these six cells are presented on page 3 of the SAS output that is produced by the current analysis. Output page 3 is reproduced here as Output 16.9.

                                JANE DOE
                           The GLM Procedure

        Level of          -----------SUB_AGGR----------
        CONSEQ      N         Mean          Std Dev
        MP         15     13.4000000      7.28795484
        MR         15     19.4666667      7.31794923

        Level of          -----------SUB_AGGR----------
        MOD_AGGR    N         Mean          Std Dev
        H          10     21.5000000      4.83620604
        L          10      7.6000000      4.59951688
        M          10     20.2000000      4.58984386

        Level of   Level of      -----------SUB_AGGR----------
        CONSEQ     MOD_AGGR   N      Mean          Std Dev
        MP         H          5   18.2000000     3.63318042
        MP         L          5    4.6000000     3.84707681
        MP         M          5   17.4000000     3.50713558
        MR         H          5   24.8000000     3.49284984
        MR         L          5   10.6000000     3.20936131
        MR         M          5   23.0000000     3.93700394

Output 16.9. The means that are needed to plot the results of
the study in a figure.

Means for conditions under Predictor B. There are actually three tables of means on page
3 of this output. The first table provides the means for each level of CONSEQ, the SAS
variable that coded membership under the consequences-for-model predictor variable.
Below the heading CONSEQ are the values that represented the two levels of this variable.
To the right of MP, you can see that the mean for the model-punished group was 13.4. To the right of MR, you can see that the mean for the model-rewarded group was 19.47.
Means for conditions under Predictor A. The second table provides the means for each level of MOD_AGGR, the SAS variable that represents the model-aggression predictor. Below the heading MOD_AGGR are the values for the three levels of this predictor variable: H, M, and L, which represent the high-model-aggression condition, moderate-model-aggression condition, and low-model-aggression condition, respectively. To the right of these values are the mean scores displayed by subjects in these conditions.
Means for the individual cells. Finally, the third table is the most interesting, as it provides means and standard deviations for each of the six cells from the study's factorial design matrix. This table shows means and standard deviations for groups of subjects broken down by Predictor A and Predictor B. You can see that this third table has a column on the far left headed Level of CONSEQ. Below this heading are values that represent the conditions under Predictor B (MP and MR). Next to this column is a column headed Level of MOD_AGGR. Below this heading are the values that represent the conditions under Predictor A (H, L, and M).
The first line of this third table provides the mean score displayed by the cell that was coded
with an MP under CONSEQ and an H under MOD_AGGR. This line thus provides the
mean score for the five subjects who were in the model-punished condition under
Predictor B (consequences for the model), and in the high condition under Predictor A
(level of aggression displayed by the model). In other words, these subjects saw the version
of the videotape in which the model demonstrated a high level of aggression and was
subsequently punished. You can see that this group had a mean of 18.2 on the criterion
variable, and a standard deviation of 3.63.
The second line from this table provides the mean score displayed by the cell that was coded
with an MP under CONSEQ and an L under MOD_AGGR. This line therefore provides
the mean score for the five subjects who were in the model-punished condition under
Predictor B, and in the low condition under Predictor A. You can see that this group had a
mean on the criterion variable of 4.6 and a standard deviation of 3.85. You can follow the
same procedure to review the means and standard deviations for the other four groups who
participated in the study.
Plotting cell means in a graph. It can be difficult to understand the results of a factorial ANOVA by simply reviewing cell means from a table, such as the third table in Output 16.9. These results are typically easier to understand if the means are plotted in a graph. Therefore, the cell means from Output 16.9 have been plotted in Figure 16.11.
Figure 16.11. Mean number of aggressive acts as a function of the level of aggression displayed by the model and the consequences for the model (significant main effects for both Predictor A and Predictor B; nonsignificant interaction).

The vertical axis in Figure 16.11 is labeled Subject Aggressive Acts, which was the
criterion variable in the study. It is on this axis that you will plot the mean number of
aggressive acts that were displayed by the various groups of children after viewing the
videotape.
The two different lines that appear in Figure 16.11 represent the two conditions under
Predictor B (consequences for the model). The solid line represents mean scores for the
children in the model-rewarded group, and the broken line represents mean scores for the
model-punished group.
The horizontal axis of the graph labels the three conditions under Predictor A (level of aggression displayed by the model). For example, directly above the label Low, you see the mean scores displayed by subjects in the low-model-aggression condition. Notice that two mean scores are plotted: a small circle plots the mean aggression scores from subjects who were in the low-model-aggression/model-rewarded cell, and a small triangle plots mean aggression scores from subjects who were in the low-model-aggression/model-punished cell. The same system was used to plot mean scores for subjects in the remaining cells.
Steps in Preparing the Graph


Overview. Preparing a graph that plots the results of a factorial ANOVA (such as Figure 16.11) can be a confusing undertaking. For this reason, this section presents structured, step-by-step guidelines that should make the task easier. The following guidelines describe how to begin with the cell means from Output 16.9, and transform them into the figure represented in Figure 16.11.
1. Labeling the vertical axis. Label the vertical axis with the name of the criterion variable. In this case, the criterion variable is Subject Aggressive Acts. On this same axis, provide midpoints for some of the scores that were possible. In Figure 16.11, this was done by using the midpoints of 5, 10, 15, and so forth.
2. Labeling the horizontal axis. Label the horizontal axis with the name of Predictor A. In
this case, Predictor A was Level of Aggression Displayed by Model. On the same axis,
provide one midpoint for each level of this predictor variable, and label these midpoints.
Here, this was done by creating three midpoints labeled Low, Moderate, and High.
3. Drawing a solid line to represent the model-rewarded condition. Graphs such as the
one in Figure 16.11 are easiest to draw if you draw just one line at a time. Here, you will
begin by drawing a solid line to represent the mean scores of subjects in the model-rewarded
condition.
Draw small circles on the graph to indicate the mean subject aggression scores of just those subjects in the model-rewarded condition under Predictor B. First, plot the mean of the subjects who were in the model-rewarded condition under Predictor B, and were also in the low-model-aggression condition under Predictor A. Go to the cell means provided in Output 16.9, and find the entry for the group that is coded with an MR under CONSEQ and is also coded with an L under MOD_AGGR. It turns out that this is the next-to-last entry in the table, and the mean subject aggression score for this subgroup is 10.6. Therefore, draw a small circle where Low on the horizontal axis intersects with 10.6 on the Subject Aggressive Acts vertical axis.
Next, plot the mean for the subjects who were in the model-rewarded condition under Predictor B, and in the moderate-model-aggression condition under Predictor A. In Output 16.9, find the entry for the subgroup that is coded with an MR under CONSEQ and an M under MOD_AGGR. This is the last entry in the table, and their mean subject aggression score was 23.0. Therefore, draw a small circle where Moderate on the horizontal axis intersects with 23.0 on the Subject Aggressive Acts vertical axis.
Next, plot the mean for the subjects who were in the model-rewarded condition under Predictor B, and the high-model-aggression condition under Predictor A. In Output 16.9, find the entry for the subgroup that is coded with an MR under CONSEQ and an H under MOD_AGGR. This entry in the table shows that the mean subject aggression score for this group was 24.8. Therefore, draw a small circle where High intersects with 24.8.
As a final step, draw a solid line connecting these three circles. This solid line will help readers to see that the circles all represent the same condition under Predictor B (the model-rewarded condition).
4. Drawing a broken line to represent the model-punished condition. Now repeat this procedure, except that this time you will draw small triangles to represent the scores of the subjects who were in the model-punished condition under Predictor B. Remember that these are the subgroups coded with an MP under Level of CONSEQ in Output 16.9. Output 16.9 shows that the mean for the model-punished/low-model-aggression group is 4.6. To represent this score, draw a small triangle above the Low midpoint on the horizontal axis, to the right of 4.6 on the vertical axis. The mean for the model-punished/moderate-model-aggression subgroup is 17.4, and the mean for the model-punished/high-model-aggression subgroup is 18.2. Note where triangles were drawn on the figure to represent these means. After all three triangles are in place, they are connected with a broken line. With this done, the figure is now complete.
Using this system, you will know that whenever you see a solid line connecting circles, you are looking at mean subject aggression scores from subjects in the model-rewarded group, and whenever you see a broken line connecting triangles, you are looking at mean subject aggression scores from the model-punished group.
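If you would rather have SAS draw a comparable plot than prepare it by hand, the following is a minimal sketch, not part of the program presented in this chapter. It assumes that SAS/GRAPH is licensed; the output dataset name CELLMEANS and the variable name MEAN_AGG are arbitrary names introduced here. Note also that GPLOT displays the character values of MOD_AGGR in sorted order (H, L, M) rather than in the Low-Moderate-High order used in Figure 16.11, so you may need an AXIS statement to control the axis order.

   /* Compute the six cell means that Figure 16.11 plots */
   PROC MEANS DATA=D1 NOPRINT NWAY;
      CLASS CONSEQ MOD_AGGR;
      VAR SUB_AGGR;
      OUTPUT OUT=CELLMEANS MEAN=MEAN_AGG;
   RUN;
   /* SYMBOL1 goes to the first CONSEQ value in sorted order (MP) */
   SYMBOL1 VALUE=TRIANGLE INTERPOL=JOIN LINE=2;  /* MP: triangles, broken line */
   SYMBOL2 VALUE=CIRCLE   INTERPOL=JOIN LINE=1;  /* MR: circles, solid line    */
   PROC GPLOT DATA=CELLMEANS;
      PLOT MEAN_AGG*MOD_AGGR=CONSEQ;
   RUN;
   QUIT;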
Interpreting Figure 16.11
Notice how the graphic pattern of results in Figure 16.11 is consistent with the statistical results reported in Output 16.4. Specifically:

In Figure 16.11, the corresponding line segments are parallel to one another. That is, the
segments going from Low to Moderate are parallel, and the segments going from
Moderate to High are also parallel. This is the railroad track pattern that you would
expect when the interaction is nonsignificant (as was reported in Output 16.4).

The line segments going from Low to Moderate in Figure 16.11 display a marked upward angle, or slope. This is consistent with the fact that Output 16.4 reported a significant main effect for Predictor A (level of aggression displayed by the model). However, notice that there is very little angle going from Moderate to High. This outcome is consistent with the results of the Tukey HSD test (from Output 16.6), which indicated that the low-model-aggression group was significantly different from the moderate- and high-model-aggression groups, but that the moderate- and high-model-aggression groups were not significantly different from each other.

Finally, you can see that the solid line for the model-rewarded group is separated from the
broken line for the model-punished group. This shows that, generally speaking, the
children who saw the model rewarded in the videotape tended to be more aggressive
themselves, compared to children in the model-punished condition. This is consistent with
the finding from Output 16.4 that there was a main effect for Predictor B (consequences
for the model).
Preparing Analysis Reports for Factorial ANOVA: Overview


To summarize the results of a two-way ANOVA, you will use a format very similar to the
one used with the one-way ANOVA in the preceding chapter. However, in the present case
it is possible to prepare three different analysis reports from a single analysis, since it is
possible to test three different null hypotheses with a two-way ANOVA:

the hypothesis of no main effect for Predictor A (level of aggression displayed by the
model)

the hypothesis of no main effect for Predictor B (consequences for the model)

the hypothesis of no interaction.

The following sections show how to prepare analysis reports for these three types of effects,
based on the current results.
Analysis Report Concerning the Main Effect for Predictor A
(Significant Effect)
This section shows how to prepare a report concerning Predictor A, the level of aggression
displayed by the model. You will remember that the main effect for Predictor A was
significant in this analysis. The section that follows this section explains where you have to
look in the SAS output to find some of the statistics that have been inserted in the report.
A) Statement of the research question: One purpose of this
study was to determine whether there was a relationship
between (a) the level of aggression displayed by a model and
(b) the number of aggressive acts later demonstrated by
children who observed the model.
B) Statement of the research hypothesis: There will be a positive relationship between the level of aggression displayed by a model, and the number of aggressive acts later demonstrated by children who observed the model. Specifically, it is predicted that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate or low level of aggression, and (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression.
C) Nature of the variables: This analysis involved two
predictor variables and one criterion variable:
Predictor A was the level of aggression displayed by the
model. This was a limited-value variable, was assessed on an
ordinal scale, and included three levels: low, moderate, and
high. This was the predictor variable that was relevant to
the present hypothesis.
Predictor B was the consequences for the model. This was a dichotomous variable, was assessed on a nominal scale, and included two levels: model-rewarded and model-punished.
The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.
D) Statistical test: Factorial ANOVA with two between-subjects factors

E) Statistical null hypothesis (H0): µA1 = µA2 = µA3; In the population, there is no difference between subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition with respect to mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).
F) Statistical alternative hypothesis (H1): Not all µA are equal; In the population, there is a difference between at least two of the following three conditions with respect to their mean scores on the criterion variable: subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition.
G) Obtained statistic: F(2, 24) = 45.17

H) Obtained probability (p) value: p = .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.
J) Multiple comparison procedure: Tukey's HSD test showed that subjects in the high-model-aggression condition and moderate-model-aggression condition scored significantly higher on subject aggression than did subjects in the low-model-aggression condition (p < .05). With alpha set at .05, the difference between the high-model-aggression condition versus the moderate-model-aggression condition was nonsignificant.
K) Confidence intervals: Confidence intervals for differences between the means are presented in Table 16.3.

L) Effect size: R2 = .67, indicating that model aggression accounted for 67% of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These findings provide partial support for the study's research hypothesis. The findings provided support for the hypothesis
that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression, as well as for the hypothesis that (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression. However, the study failed to provide support for the hypothesis that children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate level of aggression.
N) Formal description of the results for a paper:
Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a significant main effect for level of aggression displayed by the model, F(2, 24) = 45.17, MSE = 13.05, p = .0001.
On the criterion variable (number of aggressive acts displayed by subjects), the mean score for the high-model-aggression condition was 21.50 (SD = 4.84), the mean for the moderate-model-aggression condition was 20.20 (SD = 4.59), and the mean for the low-model-aggression condition was 7.60 (SD = 4.60). Sample means for the various conditions that constituted the study are displayed in Figure 16.11.
Tukey's HSD test showed that subjects in the high-model-aggression condition and moderate-model-aggression condition scored significantly higher on subject aggression than did subjects in the low-model-aggression condition (p < .05). With alpha set at .05, the difference between the high-model-aggression condition versus the moderate-model-aggression condition was nonsignificant. Confidence intervals for differences between the means are presented in Table 16.3.
In the analysis, R2 for this main effect was computed as .67. This indicated that model aggression accounted for 67% of the variance in subject aggression.

O) Figure representing the results: See Figure 16.11.

Notes Regarding the Preceding Analysis Report


Overview. With some sections of the preceding report, it might not be clear where you
needed to look in your SAS output to find the necessary statistics to insert at that location.
This section will attempt to clarify some of those issues. First, the main ANOVA summary
table produced by PROC GLM is reproduced here as Output 16.10. Much of the information
needed for your report appears on this page of the output.
                                JANE DOE
                           The GLM Procedure

Dependent Variable: SUB_AGGR
                                   Sum of
Source                  DF        Squares    Mean Square   F Value   Pr > F
Model                    5    1456.166667     291.233333     22.32   <.0001
Error                   24     313.200000      13.050000
Corrected Total         29    1769.366667

        R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
        0.822988      21.98263      3.612478         16.43333

Source                  DF      Type I SS    Mean Square   F Value   Pr > F
CONSEQ                   1     276.033333     276.033333     21.15   0.0001
MOD_AGGR                 2    1178.866667     589.433333     45.17   <.0001
CONSEQ*MOD_AGGR          2       1.266667       0.633333      0.05   0.9527

Source                  DF    Type III SS    Mean Square   F Value   Pr > F
CONSEQ                   1     276.033333     276.033333     21.15   0.0001
MOD_AGGR                 2    1178.866667     589.433333     45.17   <.0001
CONSEQ*MOD_AGGR          2       1.266667       0.633333      0.05   0.9527

Output 16.10. Information needed from ANOVA summary table to complete an analysis report; two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.

F Statistic, degrees of freedom, and p value. Items G and H from the preceding analysis
report are reproduced here:
G) Obtained statistic: F(2, 24) = 45.17
H) Obtained probability (p) value: p = .0001
You can see that Items G and H provide the F statistic, degrees of freedom for this statistic, and p value associated with this statistic. It is useful to review where these terms can be found in the SAS output. The F statistic of 45.17 that appears in Item G of the preceding report is the F statistic associated with Predictor A, the level of aggression displayed by the model. It can be found in Output 16.10 where the row labeled MOD_AGGR intersects with the column headed F Value.
It is customary to list the degrees of freedom for an F statistic within parentheses. In the preceding report, these degrees of freedom were listed in item G as F(2, 24). The first term (2) represents the degrees of freedom for the numerator in the F ratio (i.e., the degrees of freedom for the model aggression main effect). This term appears in Output 16.10 where the row labeled MOD_AGGR intersects with the column headed DF. The second term (24) represents the degrees of freedom for the denominator in the F ratio (i.e., the degrees of freedom for the error term). This term appears in Output 16.10 where the row labeled Error intersects with the column headed DF.
The p value listed in item H of the preceding report is the p value associated with the model aggression main effect. It appears in Output 16.10 where the row labeled MOD_AGGR intersects with the column headed Pr > F.
The MSE (mean square error). The mean square error is an estimate of the error variance in your analysis. Item N from the analysis report provides a formal description of the results for a paper. The first paragraph of this report includes a statistic abbreviated as MSE. The relevant section of Item N is reproduced as follows:

N) Formal description of the results for a paper:
Results were analyzed using a factorial ANOVA with two between-subjects factors. This analysis revealed a significant main effect for level of aggression displayed by the model, F(2, 24) = 45.17, MSE = 13.05, p = .0001.

The last sentence of the preceding excerpt indicates that MSE = 13.05. In the output from PROC GLM, you will find it at the location where the row labeled Error intersects with the column headed Mean Square. In Output 16.10, you can see that the mean square error is equal to 13.050000, which rounds to 13.05.
Analysis Report Regarding the Main Effect for Predictor B
(Significant Effect)
This section shows how to prepare a report about Predictor B, consequences for the
model. You will remember that the main effect for Predictor B was significant in this
analysis. There are some differences between the way that this report was prepared, versus
the way that the report for Predictor A was prepared. See the notes following this analysis
report for details.
A) Statement of the research question: One purpose of this
study was to determine whether there was a relationship
between (a) the consequences experienced by an aggressive
model and (b) the number of aggressive acts later demonstrated
by children who had observed this model.
B) Statement of the research hypothesis: Children who observe
a model being rewarded for engaging in aggressive behavior
will later demonstrate a greater number of aggressive acts,
compared to children who observe a model being punished for
engaging in aggressive behavior.
C) Nature of the variables: This analysis involved two
predictor variables and one criterion variable:
Predictor A was the level of aggression displayed by the
model. This was a limited-value variable, was assessed on an
ordinal scale, and included three levels: low, moderate, and
high.
Predictor B was the consequences for the model. This was a
dichotomous variable, was assessed on a nominal scale, and
included two levels: model rewarded and model punished.


This was the predictor variable that was relevant to the
present hypothesis.
The criterion variable was the number of aggressive acts
displayed by the subjects after observing the model. This
was a multi-value variable, and was assessed on a ratio
scale.
D) Statistical test: Factorial ANOVA with two between-subjects factors

E) Statistical null hypothesis (H0): µB1 = µB2; In the population, there is no difference between subjects in the model-rewarded condition versus subjects in the model-punished condition with respect to mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).
F) Statistical alternative hypothesis (H1): µB1 ≠ µB2; In the population, there is a difference between subjects in the model-rewarded condition versus subjects in the model-punished condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).
G) Obtained statistic: F(1, 24) = 21.15

H) Obtained probability (p) value: p = .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.
J) Multiple comparison procedure: A multiple comparison
procedure was not necessary because this predictor variable
included just two levels.
K) Confidence interval: Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 6.067. The 95% confidence interval for this difference ranged from 3.344 to 8.789.
L) Effect size: R2 = .16, indicating that consequences for
the model accounted for 16% of the variance in subject
aggression.
M) Conclusion regarding the research hypothesis: These findings provide support for the study's research hypothesis that children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.
N) Formal description of the results for a paper:
Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a significant main effect for consequences for the model, F(1, 24) = 21.15, MSE = 13.05, p = .0001.
On the criterion variable (number of aggressive acts displayed by the subjects), the mean score for the model-rewarded condition was 19.47 (SD = 7.32), and the mean score for the model-punished condition was 13.40 (SD = 7.29). Sample means for the various conditions that constituted the study are displayed in Figure 16.11.
Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 6.067. The 95% confidence interval for this difference ranged from 3.344 to 8.789.
In the analysis, R2 for this main effect was computed as .16. This indicated that consequences for the model accounted for 16% of the variance in subject aggression.

O) Figure representing the results: See Figure 16.11.

Notes Regarding the Preceding Analysis Report


Overview. Many sections of the analysis report for Predictor B were prepared using the same format that was used for the analysis report for Predictor A. Therefore, to conserve space, this section will not repeat explanations that were provided following the analysis report for Predictor A (such as the explanation of where to look in the SAS output for the MSE statistic). Instead, this section will discuss the ways in which the report for Predictor B differed from the report for Predictor A.
Some sections of the analysis report for Predictor B were completed in a way that differed from the report for Predictor A. In most cases, this was because Predictor B consisted of only two levels (model-rewarded versus model-punished), while Predictor A consisted of three levels (low versus moderate versus high). Because Predictor B involved two levels, several sections of its analysis report used a format similar to that used for an independent-samples t test. To refresh your memory on how reports were prepared for that statistic, see Chapter 13, Independent-Samples t Test, the section Example 13.1: Observed Consequences for Modeled Aggression: Effects on Subsequent Subject Aggression (Significant Differences), in the subsection Summarizing the Results of the Analysis.
Information about the main effect for Predictor B. Much of the information about the main effect for Predictor B came from Output 16.10, in the section for the Type III sum of squares, in the row labeled CONSEQ. This includes the F statistic, the p value, as well as other information.
Item F, the statistical alternative hypothesis. For Predictor B, the statistical alternative hypothesis was stated in symbolic terms as follows:

µB1 ≠ µB2

The above format is similar to the format used with an independent-samples t test. Remember that this format is appropriate only for predictor variables that contain two levels.
Item G, the obtained statistic. Item G from the preceding analysis report presented the F statistic and degrees of freedom for Predictor B (CONSEQ). This item is reproduced again here:

G) Obtained statistic: F(1, 24) = 21.15

The degrees of freedom for this main effect appear in parentheses above. The number 1 is the degree of freedom for the numerator used in computing the F statistic for the CONSEQ main effect. This 1 comes from the ANOVA summary table that was created by PROC GLM, and was reproduced in Output 16.10, presented earlier. Specifically, this 1 appears in Output 16.10 at the location where the row labeled CONSEQ intersects with the column headed DF.
The second number in item G, 24, represents the degrees of freedom for the denominator used in computing the F statistic. This 24 appears in Output 16.10 at the location where the row labeled Error intersects with the column headed DF.
Item K, the confidence interval. Notice that, in the analysis report for Predictor B, Item K
does not refer the reader to a table that presents the confidence intervals (as was the case
with Predictor A). Instead, because there is only one confidence interval to report, it is
described in the text of the report itself.
Item N, the formal description of the results for a paper. Item N of the preceding report provides means and standard deviations for the model-rewarded and model-punished conditions under Predictor B. These statistics may be found in Output 16.9, in the section for the CONSEQ variable. The relevant portion of that output is presented again as Output 16.11.
                                JANE DOE
                           The GLM Procedure

        Level of          -----------SUB_AGGR----------
        CONSEQ      N         Mean          Std Dev
        MP         15     13.4000000      7.28795484
        MR         15     19.4666667      7.31794923

Output 16.11. Means and standard deviations for conditions under Predictor B
(consequences for model).

Analysis Report Concerning the Interaction (Nonsignificant Effect)


An earlier section stated that an interaction is an outcome for which the relationship between
one predictor variable and the criterion variable is different at different levels of the second
predictor variable. Output 16.10 showed that the interaction term for the current analysis
was nonsignificant. Below is an example of how this result could be described in a report.
A) Statement of the research question: One purpose of this
study was to determine whether there was a significant
interaction between (a) the level of aggression displayed by
the model, and (b) the consequences experienced by the model
in the prediction of (c) the number of aggressive acts later
displayed by subjects who have observed the model.
B) Statement of the research hypothesis: The positive
relationship between the level of aggression displayed by the
model and subsequent subject aggressive acts will be stronger
for subjects who have seen the model rewarded than for
subjects who have seen the model punished.
C) Nature of the variables: This analysis involved two
predictor variables and one criterion variable:
Predictor A was the level of aggression displayed by the
model. This was a limited-value variable, was assessed on an
ordinal scale, and included three levels: low, moderate, and
high.
Predictor B was the consequences for the model. This was a
dichotomous variable, was assessed on a nominal scale, and
included two levels: model rewarded and model punished.
The criterion variable was the number of aggressive acts
displayed by the subjects after observing the model. This
was a multi-value variable, and was assessed on a ratio
scale.
D) Statistical test: Factorial ANOVA with two between-subjects factors
E) Statistical null hypothesis (H0): In the population, there is no interaction between the level of aggression displayed by the model and the consequences for the model in the prediction of the criterion variable (the number of aggressive acts displayed by the subjects).
F) Statistical alternative hypothesis (H1): In the
population, there is an interaction between the level of
aggression displayed by the model and the consequences for the
model in the prediction of the criterion variable (the number
of aggressive acts displayed by the subjects).
G) Obtained statistic: F(2, 24) = 0.05

H) Obtained probability (p) value: p = .9527

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Multiple comparison procedure: Not relevant.

K) Confidence intervals: Not relevant.

L) Effect size: R2 = .00, indicating that the interaction term accounted for none of the variance in subject aggression.
M) Conclusion regarding the research hypothesis: These findings fail to provide support for the study's research hypothesis that the positive relationship between the level of aggression displayed by the model and subsequent subject aggressive acts will be stronger for subjects who have seen the model rewarded than for subjects who have seen the model punished.
N) Formal description of the results for a paper:
Results were analyzed by using a factorial ANOVA with two
between-subjects factors. This analysis revealed a
nonsignificant F statistic for the interaction between the
level of aggression displayed by the model and the
consequences for the model, F(2, 24) = 0.05, MSE = 13.05, p =
.9527. Sample means for the various conditions that
constituted the study are displayed in Figure 16.11.
In the analysis, R2 for this interaction effect was computed as .00. This indicated that the interaction accounted for none of the variance in subject aggression.
O) Figure representing the results: See Figure 16.11.
Notes Regarding the Preceding Analysis Report


Item B, statement of the research hypothesis. Item B (above) not only predicts that there
will be an interaction, but also makes a specific prediction about the nature of that
interaction. The type of interaction described here was chosen arbitrarily; remember that an
interaction can be expressed in an infinite variety of forms.
Items G and H. The degrees of freedom for the interaction term, along with the F statistic and p value for the interaction term, appeared in Output 16.10. The relevant portion of that output is presented again here as Output 16.12. The information relevant to the interaction appears to the right of the heading CONSEQ*MOD_AGGR.
Source               DF    Type III SS    Mean Square    F Value    Pr > F
CONSEQ                1     276.033333     276.033333      21.15    0.0001
MOD_AGGR              2    1178.866667     589.433333      45.17    <.0001
CONSEQ*MOD_AGGR       2       1.266667       0.633333       0.05    0.9527

Output 16.12. Excerpt from ANOVA summary table; two-way ANOVA
performed on aggression data; significant main effects, nonsignificant
interaction.

Item G from the preceding report is again reproduced below:
G) Obtained statistic: F(2, 24) = 0.05
The degrees of freedom for the F statistic appear within parentheses. The first degree of
freedom is 2, and this value appears in the DF column of Output 16.12. The F
statistic itself appears in the column headed "F Value", and the probability value for the
F statistic appears in the column headed "Pr > F".
The second number in Item G, 24, again represents the degrees of freedom for the
denominator of the F ratio. Again, these degrees of freedom may be found in Output 16.10
at the location where the row titled "Error" intersects with the column titled "DF".

Example of a Factorial ANOVA Revealing Nonsignificant


Main Effects and a Nonsignificant Interaction
Overview
This section presents the results of a factorial ANOVA in which the main effects for
Predictor A and Predictor B, along with the interaction term, are all nonsignificant. These
results are presented so that you will be prepared to write analysis reports for projects in
which nonsignificant outcomes are observed.

The Complete SAS Program
The study presented here is the same aggression study described in the preceding section.
The data will be analyzed with the same SAS program that was presented earlier, in the
section "Writing the SAS Program." Here, the data have been changed so that they will
produce nonsignificant results. The complete SAS program, including the new data set, is
presented below:
OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM
         CONSEQ   $
         MOD_AGGR $
         SUB_AGGR;
DATALINES;
01 MR L 15
02 MR L 22
03 MR L 19
04 MR L 16
05 MR L 11
06 MR M 16
07 MR M 24
08 MR M 10
09 MR M 17
10 MR M 17
11 MR H 17
12 MR H 12
13 MR H 24
14 MR H 20
15 MR H 15
16 MP L 14
17 MP L 7
18 MP L 22
19 MP L 15
20 MP L 13
21 MP M 14
22 MP M 21
23 MP M 11
24 MP M 9
25 MP M 19
26 MP H 15
27 MP H 9
28 MP H 10
29 MP H 20
30 MP H 21
;
PROC GLM DATA=D1;
CLASS CONSEQ MOD_AGGR;
MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
TITLE1 'JANE DOE';
RUN;
QUIT;

Steps in Interpreting the Output
Overview. As was the case with the earlier data set, the SAS program performing this
analysis would produce five pages of output. This section will present only those sections of
output that are relevant to preparing the ANOVA summary table, the graph, and the analysis
reports.
You will notice that the steps listed in this section are not identical to the steps listed in the
preceding section. Some of the steps (such as "1. Make sure that everything looks correct")
are not included here because the key concepts have already been covered. Some other steps
are not included here because they are typically not appropriate when effects are
nonsignificant.
1. Determine whether the interaction term is statistically significant. As was the case
before, you determine whether the interaction is significant by reviewing the ANOVA
summary table produced by PROC GLM. This table appears on output page 2, and is
reproduced here as Output 16.13.
                                JANE DOE
                           The GLM Procedure

Dependent Variable: SUB_AGGR

                                     Sum of
Source               DF             Squares    Mean Square    F Value    Pr > F
Model                 5          45.3666667      9.0733333       0.37    0.8667
Error                24         594.8000000     24.7833333
Corrected Total      29         640.1666667

R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
0.070867      31.44181      4.978286         15.83333

Source               DF      Type I SS     Mean Square    F Value    Pr > F
CONSEQ                1    40.83333333     40.83333333       1.65    0.2115
MOD_AGGR              2     4.06666667      2.03333333       0.08    0.9215
CONSEQ*MOD_AGGR       2     0.46666667      0.23333333       0.01    0.9906

Source               DF    Type III SS     Mean Square    F Value    Pr > F
CONSEQ                1    40.83333333     40.83333333       1.65    0.2115
MOD_AGGR              2     4.06666667      2.03333333       0.08    0.9215
CONSEQ*MOD_AGGR       2     0.46666667      0.23333333       0.01    0.9906

Output 16.13. ANOVA summary table for two-way ANOVA performed on
aggression data; nonsignificant main effects, nonsignificant interaction.

As was the case with the earlier data set, you will review the results of the analyses that
appear in the section headed "Type III SS", as opposed to the section headed "Type I
SS".
To determine whether the interaction is significant, look to the right of the heading
CONSEQ*MOD_AGGR. Here, you can see that the F statistic is only 0.01, with a p
value of .9906. This p value is larger than our standard criterion of .05, and so you know that
the interaction is nonsignificant. Since the interaction is nonsignificant, you may proceed to
interpret the significance tests for the main effects.
2. Determine whether either of the two main effects is statistically significant.
Information regarding the main effect for Predictor A appears to the right of the heading
MOD_AGGR. You can see that the F value for this effect is 0.08, with a
nonsignificant p value of .9215. Information regarding Predictor B appears to the right of
CONSEQ. This factor displayed a nonsignificant F statistic of 1.65 (p = .2115).
3. Prepare your own version of the ANOVA summary table. The completed ANOVA
summary table for this analysis is presented here as Table 16.4.
Table 16.4
ANOVA Summary Table for Study Investigating the Relationship Between
Level of Aggression Displayed by Model (A), Consequences for Model (B),
and Subject Aggression (Nonsignificant Interaction, Nonsignificant Main
Effects)
_______________________________________________________________________
Source                        df        SS       MS       F        p      R2
_______________________________________________________________________
Model aggression (A)           2      4.07     2.03    0.08    .9215    .01
Consequences for model (B)     1     40.83    40.83    1.65    .2115    .06
A X B Interaction              2      0.47     0.23    0.01    .9906    .00
Within groups                 24    594.80    24.78
Total                         29    640.17
_______________________________________________________________________
Note: N = 30

Notice how information from Output 16.13 was used to fill in the relevant sections of Table
16.4:
• Information from the row labeled MOD_AGGR in Output 16.13 was transferred to
the row labeled "Model aggression (A)" in Table 16.4.
• Information from the row labeled CONSEQ in Output 16.13 was transferred to the
row labeled "Consequences for model (B)" in Table 16.4.
• Information from the row labeled CONSEQ*MOD_AGGR in Output 16.13 was
transferred to the row labeled "A X B Interaction" in Table 16.4.
• Information from the row labeled Error in Output 16.13 was transferred to the
row labeled "Within groups" in Table 16.4.
• Information from the row labeled Corrected Total in Output 16.13 was transferred
to the row labeled "Total" in Table 16.4.

The R2 entries in Table 16.4 were computed by dividing the sum of squares for a particular
effect by the total sum of squares. For detailed guidelines on constructing an ANOVA
summary table (such as Table 16.4), see the subsection "4. Prepare your own version of the
ANOVA summary table" from the major section "Example of a Factorial ANOVA
Revealing Significant Main Effects and a Nonsignificant Interaction," presented earlier.
4. Prepare a table that displays the confidence intervals for Predictor A. Notice that,
unlike the previous section, this section does not advise you to review the results of the
Tukey multiple comparison procedures to determine whether there are significant
differences between the various levels of Predictor A or Predictor B. This is because the
main effects for these two predictor variables were nonsignificant, and you normally should
not interpret the results of multiple comparison procedures for main effects that are not
significant. However, the Tukey test that you requested also computes confidence intervals
for the differences between the means. Some readers might be interested in seeing these
confidence intervals, even when the main effects are not statistically significant.
To conserve space, this section will not present the SAS output that shows the results of the
Tukey tests and confidence intervals. However, it will show how to summarize the
information that was presented in that output for a paper.
Predictor A was the level of aggression displayed by the model. This predictor contained
three levels, and therefore produced three confidence intervals for differences between the
means. Because this is a fairly large amount of information to present, these confidence
intervals will be summarized in Table 16.5 here:
Table 16.5
Results of Tukey Tests for Predictor A: Comparing High-Model-Aggression
Group versus Moderate-Model-Aggression Group versus Low-Model-Aggression
Group on the Criterion Variable (Subject Aggression)
________________________________________________________
                                     Simultaneous 95%
                       Difference    confidence limits
                       between       _________________
Comparison (a,b)       means         Lower       Upper
________________________________________________________
High - Moderate        0.5000        -5.0599     6.0599
High - Low             0.9000        -4.6599     6.4599
Moderate - Low         0.4000        -5.1599     5.9599
________________________________________________________
Note: N = 30
(a) Differences are computed by subtracting the mean for the second
    group from the mean for the first group.
(b) The Tukey test indicates that none of the differences between the
    means were significant at p < .05.

5. Summarize the confidence interval for Predictor B in the text of your paper. In
contrast, Predictor B (consequences for the model) contained only two levels and therefore
produced only one confidence interval for a difference between the means. This interval
could be summarized in the text of your paper in this way:
Subtracting the mean of the model-punished condition from the mean of the
model-rewarded condition resulted in an observed difference of 2.333. The 95% confidence
interval for this difference ranged from -1.418 to 6.085.
Using a Figure to Illustrate the Results
When all main effects and interactions are nonsignificant, researchers usually do not
illustrate means for the various conditions in a graph. However, this section will present a
graph, simply to provide an additional example of how to prepare a graph from the cell
means that are provided in the SAS output.
Page 3 of the SAS output from the current program provides means and standard deviations
for the various treatment conditions manipulated in the current study. These results are
presented here as Output 16.14.

                                JANE DOE
                           The GLM Procedure

Level of             -----------SUB_AGGR----------
CONSEQ        N      Mean             Std Dev
MP           15      14.6666667       4.95215201
MR           15      17.0000000       4.27617987

Level of             -----------SUB_AGGR----------
MOD_AGGR      N      Mean             Std Dev
H            10      16.3000000       4.98998998
L            10      15.4000000       4.69515116
M            10      15.8000000       4.87168691

Level of    Level of           -----------SUB_AGGR----------
CONSEQ      MOD_AGGR     N     Mean             Std Dev
MP          H            5     15.0000000       5.52268051
MP          L            5     14.2000000       5.35723809
MP          M            5     14.8000000       5.11859356
MR          H            5     17.6000000       4.61519230
MR          L            5     16.6000000       4.15932687
MR          M            5     16.8000000       4.96990946

Output 16.14. Means needed to plot the study's results in a figure; two-way
ANOVA performed on aggression data; nonsignificant main effects;
nonsignificant interaction.

As was noted earlier, this page of output actually contains three tables of means. However,
you will be interested only in the third table, the one that presents the means broken down
by both Predictor A and Predictor B.
Figure 16.12 graphs the six cell means from the third table in Output 16.14. As before, cell
means from subjects in the model-rewarded condition are displayed as circles on a solid line,
while cell means from subjects in the model-punished condition are displayed as triangles on
a broken line.

Figure 16.12. Mean number of aggressive acts as a function of the level of
aggression displayed by the model and the consequences for the model
(nonsignificant main effects and nonsignificant interaction).

To conserve space, this section will not repeat the steps that you should follow when
transferring mean scores from Output 16.14 to Figure 16.12. For detailed guidelines on how
to construct a graph such as that in Figure 16.12, see the subsection "Steps in Preparing the
Graph" from the major section "Example of a Factorial ANOVA Revealing Significant
Main Effects and a Nonsignificant Interaction," presented earlier.
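If you would like SAS to draw such a graph for you rather than plotting the means by hand, the following sketch suggests one possible approach. It is not part of the program presented in this chapter, and it assumes that SAS/GRAPH is licensed at your site: PROC MEANS first computes the six cell means, and PROC GPLOT then plots them with a separate line for each level of CONSEQ.

PROC MEANS DATA=D1 NOPRINT NWAY;
   CLASS CONSEQ MOD_AGGR;
   VAR SUB_AGGR;
   OUTPUT OUT=CELLS MEAN=MEAN_AGG;
RUN;
PROC GPLOT DATA=CELLS;
   * One line per level of CONSEQ. Note that the values ;
   * H, L, M will be ordered alphabetically on the axis ;
   * unless you recode them first.                      ;
   PLOT MEAN_AGG*MOD_AGGR=CONSEQ;
RUN;
QUIT;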
Interpreting Figure 16.12
Notice how the graphic pattern of results in Figure 16.12 is consistent with the statistical
results reported in Output 16.13. Specifically:
• In Figure 16.12, the corresponding line segments are parallel to one another. This is the
"railroad track" pattern that you would expect when the interaction is nonsignificant (as
reported in Output 16.13).
• None of the line segments going from "Low" to "Moderate" or from "Moderate" to
"High" display a substantial angle. This is consistent with the fact that Output 16.13
reported a nonsignificant main effect for Predictor A (model aggression).
• Finally, the solid line for the model-rewarded group is not substantially separated from the
broken line for the model-punished group. This is consistent with the finding from Output
16.13 that the main effect for Predictor B (consequences for the model) was nonsignificant
(it is true that the solid line is slightly separated from the broken line, but the separation is
not enough to attain statistical significance).
Analysis Report Concerning the Main Effect for Predictor A
(Nonsignificant Effect)
The analysis report in this section provides an example of how to write up the results for a
predictor variable that has three levels and is not statistically significant. Items A through F
of this report would be identical to items A through F of the analysis report in the section
titled "Analysis Report Concerning the Main Effect for Predictor A (Significant Effect),"
presented earlier. These sections stated the research question, the research hypothesis, and so
on. Therefore, those items will not be presented again in this section.
(Items A through F would normally appear here)
G) Obtained statistic: F(2, 24) = 0.08
H) Obtained probability (p) value: p = .9215
I) Conclusion regarding the statistical null hypothesis: Fail
to reject the null hypothesis.
J) Multiple comparison procedure: The multiple comparison
procedure was not appropriate because the F statistic for the
main effect was nonsignificant.
K) Confidence intervals: Confidence intervals for differences
between the means are presented in Table 16.5.
L) Effect size: R2 = .01, indicating that model aggression
accounted for 1% of the variance in subject aggression.
M) Conclusion regarding the research hypothesis: These
findings fail to provide support for the study's research
hypothesis that there will be a positive relationship between
the level of aggression displayed by a model and the number of
aggressive acts later demonstrated by children who observed
the model.
N) Formal description of the results for a paper:
Results were analyzed by using a factorial ANOVA with two
between-subjects factors. This analysis revealed a
nonsignificant main effect for level of aggression displayed
by the model, F(2, 24) = 0.08, MSE = 24.78, p = .9215.
On the criterion variable (number of aggressive acts
displayed by subjects), the mean score for the high-model-aggression
condition was 16.30 (SD = 4.99), the mean for the
moderate-model-aggression condition was 15.80 (SD = 4.87), and
the mean for the low-model-aggression condition was 15.40 (SD
= 4.70). Sample means for the various conditions that
constituted the study are displayed in Figure 16.12.
Confidence intervals for differences between the means (based
on Tukey's HSD test) are presented in Table 16.5.
In the analysis, R2 for this main effect was computed as
.01. This indicated that model aggression accounted for 1% of
the variance in subject aggression.
O) Figure representing the results: See Figure 16.12.

Notice that the preceding "Formal description of the results for a paper" did not discuss the
results of the Tukey HSD test (other than to refer to the confidence intervals). This is
because the results of multiple comparison procedures are typically not discussed when the
main effect is nonsignificant.
Analysis Report Concerning the Main Effect for Predictor B
(Nonsignificant Effect)
The analysis report in this section provides an example of how to write up the results for a
predictor variable that has two levels and is not statistically significant. Items A through F of
this report would be identical to items A through F of the analysis report in the section titled
"Analysis Report Concerning the Main Effect for Predictor B (Significant Effect),"
presented earlier. These sections stated the research question, the research hypothesis, and so
on. Therefore, those items will not be presented again in this section.
(Items A through F would normally appear here)
G) Obtained statistic: F(1, 24) = 1.65
H) Obtained probability (p) value: p = .2115
I) Conclusion regarding the statistical null hypothesis: Fail
to reject the null hypothesis.
J) Multiple comparison procedure: Not relevant.
K) Confidence intervals: Subtracting the mean of the
model-punished condition from the mean of the model-rewarded
condition resulted in an observed difference of 2.333. The 95%
confidence interval for this difference extended from -1.418
to 6.085.
L) Effect size: R2 = .06, indicating that consequences for
the model accounted for 6% of the variance in subject
aggression.
M) Conclusion regarding the research hypothesis: These
findings fail to provide support for the study's research
hypothesis that children who observe a model being rewarded
for engaging in aggressive behavior will later demonstrate a
greater number of aggressive acts, compared to children who
observe a model being punished for engaging in aggressive
behavior.
N) Formal description of the results for a paper:
Results were analyzed by using a factorial ANOVA with two
between-subjects factors. This analysis revealed a
nonsignificant main effect for consequences for the model,
F(1, 24) = 1.65, MSE = 24.78, p = .2115.
On the criterion variable (number of aggressive acts
displayed by subjects), the mean score for the model-rewarded
condition was 17.00 (SD = 4.28), and the mean for the
model-punished condition was 14.67 (SD = 4.95). Sample means for the
various conditions that constituted the study are displayed in
Figure 16.12.
Subtracting the mean of the model-punished condition from
the mean of the model-rewarded condition resulted in an
observed difference of 2.333. The 95% confidence interval for
this difference ranged from -1.418 to 6.085.
In the analysis, R2 for this main effect was computed as
.06. This indicated that consequences for the model accounted
for 6% of the variance in subject aggression.
O) Figure representing the results: See Figure 16.12.

Analysis Report Concerning the Interaction (Nonsignificant Effect)
An earlier section of this chapter has already shown how to prepare a report for a
nonsignificant interaction term. Therefore, to save space, a similar report will not be
provided here. To find the previous example, see the subsection "Analysis Report Concerning
the Interaction (Nonsignificant Effect)" within the major section "Example of a Factorial
ANOVA Revealing Significant Main Effects and a Nonsignificant Interaction."

Example of a Factorial ANOVA Revealing a Significant Interaction
Overview
This section presents the results of a factorial ANOVA in which the interaction between
Predictor A and Predictor B is significant. You will remember that an interaction means that
the relationship between one predictor variable and the criterion variable is different at
different levels of the second predictor variable. These results are presented so that you will
be prepared to write analysis reports for projects in which significant interactions are
observed.
The Complete SAS Program
The study presented here is the same aggression study that was described in the preceding
sections. The data will be analyzed with the same SAS program that was presented earlier in
the section headed "Writing the SAS Program." Here, the data have been changed so that
they will produce a significant interaction term. The complete SAS program, including the
new data set, is presented below:
OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM
         CONSEQ   $
         MOD_AGGR $
         SUB_AGGR;
DATALINES;
01 MR L 10
02 MR L 8
03 MR L 12
04 MR L 10
05 MR L 9
06 MR M 14
07 MR M 15
08 MR M 17
09 MR M 16
10 MR M 16
11 MR H 18
12 MR H 20
13 MR H 20
14 MR H 23
15 MR H 21
16 MP L 9
17 MP L 8
18 MP L 7
19 MP L 11
20 MP L 10
21 MP M 10
22 MP M 12
23 MP M 13
24 MP M 9
25 MP M 9
26 MP H 11
27 MP H 12
28 MP H 9
29 MP H 13
30 MP H 11
;
PROC GLM DATA=D1;
CLASS CONSEQ MOD_AGGR;
MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
TITLE1 'JANE DOE';
RUN;
QUIT;

Steps in Interpreting the Output
Overview. As was the case with the earlier data set, the SAS program performing this
analysis would produce five pages of output. This section will present only those sections of
output that are relevant to preparing the ANOVA summary table, the graph, and the analysis
reports.
You will notice that the steps listed in this section are not identical to the steps listed in the
preceding sections. Some of the steps (such as "1. Make sure that everything looks correct")
are not included here because the key concepts have already been covered. Other steps in
this section are different because, when an interaction is significant, it is necessary to follow
a special sequence of steps.
1. Determine whether the interaction term is statistically significant. As was the case
before, you determine whether the interaction is significant by reviewing the ANOVA
summary table produced by PROC GLM. This table appears on output page 2, and is
reproduced here as Output 16.15.


                                JANE DOE
                           The GLM Procedure

Dependent Variable: SUB_AGGR

                                     Sum of
Source               DF             Squares    Mean Square    F Value    Pr > F
Model                 5         482.1666667     96.4333333      39.09    <.0001
Error                24          59.2000000      2.4666667
Corrected Total      29         541.3666667

R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
0.890647      12.30206      1.570563         12.76667

Source               DF       Type I SS     Mean Square    F Value    Pr > F
CONSEQ                1     187.5000000     187.5000000      76.01    <.0001
MOD_AGGR              2     206.4666667     103.2333333      41.85    <.0001
CONSEQ*MOD_AGGR       2      88.2000000      44.1000000      17.88    <.0001

Source               DF     Type III SS     Mean Square    F Value    Pr > F
CONSEQ                1     187.5000000     187.5000000      76.01    <.0001
MOD_AGGR              2     206.4666667     103.2333333      41.85    <.0001
CONSEQ*MOD_AGGR       2      88.2000000      44.1000000      17.88    <.0001

Output 16.15. ANOVA summary table for two-way ANOVA performed on
aggression data; significant interaction.

The information relevant to the interaction term appears to the right of the heading
CONSEQ*MOD_AGGR. You can see that this interaction term has an F statistic of
17.88, and a corresponding p value of .0001. Because the p value is below the standard
criterion of .05, you conclude that the interaction is statistically significant.
Other results presented in Output 16.15 show that the main effects for CONSEQ and for
MOD_AGGR are also statistically significant. However, main effects must be interpreted
very cautiously (if at all) when an interaction is significant. Therefore, the main effects for
Predictors A and B will not be interpreted in this section. Instead, the primary focus will be
on the interpretation of the interaction.
2. Prepare your own version of the ANOVA summary table. The completed ANOVA
summary table for this analysis is presented here as Table 16.6.
Table 16.6
ANOVA Summary Table for Study Investigating the Relationship Between
Level of Aggression Displayed by Model (A), Consequences for Model (B),
and Subject Aggression (Significant Interaction)
_______________________________________________________________________
Source                        df        SS        MS        F        p      R2
_______________________________________________________________________
Model aggression (A)           2    206.47    103.23    41.85    .0001    .38
Consequences for model (B)     1    187.50    187.50    76.01    .0001    .35
A X B Interaction              2     88.20     44.10    17.88    .0001    .16
Within groups                 24     59.20      2.47
Total                         29    541.37
_______________________________________________________________________
Note: N = 30

For the most part, the information that appears in Table 16.6 was taken from Output 16.15.
For detailed guidelines on constructing an ANOVA summary table (such as Table 16.6), see
the subsection "4. Prepare your own version of the ANOVA summary table" from the major
section "Example of a Factorial ANOVA Revealing Significant Main Effects and a
Nonsignificant Interaction," presented earlier.
Using a Graph to Illustrate the Results
Interactions are easiest to understand when they are plotted in a graph. You can plot the
current interaction by preparing the same type of graph that has been used throughout this
chapter.
Page 3 of the SAS output from the current program provides means and standard deviations
for the various treatment conditions manipulated in the current study. These results are
presented here as Output 16.16:

                                JANE DOE
                           The GLM Procedure

Level of             -----------SUB_AGGR----------
CONSEQ        N      Mean             Std Dev
MP           15      10.2666667       1.79151439
MR           15      15.2666667       4.69751707

Level of             -----------SUB_AGGR----------
MOD_AGGR      N      Mean             Std Dev
H            10      15.8000000       5.09465951
L            10       9.4000000       1.50554531
M            10      13.1000000       2.99814758

Level of    Level of           -----------SUB_AGGR----------
CONSEQ      MOD_AGGR     N     Mean             Std Dev
MP          H            5     11.2000000       1.48323970
MP          L            5      9.0000000       1.58113883
MP          M            5     10.6000000       1.81659021
MR          H            5     20.4000000       1.81659021
MR          L            5      9.8000000       1.48323970
MR          M            5     15.6000000       1.14017543

Output 16.16. Means needed to plot the study's results in a figure; two-way
ANOVA performed on aggression data; significant interaction.

As noted earlier, this page of output contains three tables of means. However, you will be
interested only in the third table, the one that presents the means broken down by both
Predictor A and Predictor B.
Figure 16.13 graphs the six cell means from the third table of Output 16.16. As before, cell
means from subjects in the model-rewarded condition are displayed as circles on a solid line,
while cell means from subjects in the model-punished condition are displayed as triangles on
a broken line.

Figure 16.13. Mean number of aggressive acts as a function of the level of
aggression displayed by the model and the consequences for the model
(significant interaction between Predictor A and Predictor B).

To conserve space, this section will not repeat the steps that you should follow when you
transfer mean scores from Output 16.16 to Figure 16.13. For detailed guidelines on how to
construct a graph such as Figure 16.13, see the subsection headed "Steps in Preparing the
Graph" from the major section headed "Example of a Factorial ANOVA Revealing
Significant Main Effects and a Nonsignificant Interaction," presented earlier.
Interpreting Figure 16.13
Notice how the pattern of results presented in Figure 16.13 is consistent with the statistical
results reported in Output 16.15 (that is, the finding that the interaction was significant). In
Figure 16.13, you can see that corresponding line segments are not parallel to each other.
Because the two lines are not parallel, you know that you have an interaction between
Predictor A and Predictor B. Specifically:
• The broken line (representing subjects in the model-punished condition) is relatively flat,
indicating that Predictor A (the level of aggression displayed by the model) had little if
any effect on subjects in this condition.
• In contrast, the solid line (representing subjects in the model-rewarded condition) displays
a marked upward angle, indicating that Predictor A had a stronger effect on subjects in this
condition.

Notice how these results are also consistent with the definition for an interaction that was
presented earlier: the relationship between one predictor variable (model aggression) and
the criterion variable (subject aggression) is different at different levels of the second
predictor variable (consequences for the model).
Testing for Simple Effects
When there is a simple effect for Predictor A at a particular level of Predictor B, it means
that there is a significant relationship between Predictor A and the criterion variable at that
level of Predictor B. This concept of a "simple effect" is perhaps easiest to understand by
referring again to Figure 16.13.
First, consider the solid line (the line representing the model-rewarded group) in Figure
16.13. This line displays a relatively steep angle, suggesting that there may be a significant
relationship between model aggression and subject aggression for the subjects in the
model-rewarded group. In other words, there may be a significant simple effect for model
aggression at the model-rewarded level of the "consequences for model" predictor variable.
Now consider the broken line in the same figure, the line that represents the model-punished
group. This line displays an angle that is less steep than the angle displayed by the
solid line. It is impossible to be sure by simply viewing the graph in this way, but this may
mean that there is not a simple effect for model aggression at the model-punished level of
the "consequences for model" predictor variable.
It is possible to perform tests to determine whether a simple effect is statistically significant.
For example, if you performed these tests, you might find that the simple effect for model
aggression at the model-rewarded level of Predictor B was statistically significant, but that
the simple effect for model aggression at the model-punished level of Predictor B was
nonsignificant. Whether a researcher chooses to perform tests for simple effects will depend,
in part, on the nature of the research questions being addressed in the study.
Testing for simple effects is a somewhat advanced concept, and so it is not addressed in this
text. For detailed guidelines on how to use SAS to test for simple effects, see Hatcher and
Stepanski (1994), pages 263-279.
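Even so, it may help to see the general shape of such a request. The statements below are only a sketch (they do not come from Hatcher and Stepanski, 1994); they assume that your release of SAS supports the SLICE= option of the LSMEANS statement in PROC GLM, which requests an F test for one factor at each level of the other factor:

PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   * Test the simple effect of model aggression   ;
   * at each level of consequences for the model. ;
   LSMEANS CONSEQ*MOD_AGGR / SLICE=CONSEQ;
RUN;
QUIT;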
Analysis Report Concerning the Interaction (Significant Effect)
Below is an example of how this result could be described in an analysis report.
A) Statement of the research question: One purpose of this
study was to determine whether there was a significant
interaction between (a) the level of aggression displayed by
the model and (b) the consequences experienced by the model in
the prediction of (c) the number of aggressive acts later
displayed by subjects who have observed the model.

B) Statement of the research hypothesis: The positive
relationship between the level of aggression displayed by the
model and subsequent subject aggressive acts will be stronger
for subjects who have seen the model rewarded than for
subjects who have seen the model punished.
C) Nature of the variables: This analysis involved two
predictor variables and one criterion variable:
Predictor A was the level of aggression displayed by the
model. This was a limited-value variable, was assessed on an
ordinal scale, and included three levels: low, moderate, and
high.
Predictor B was the consequences for the model. This was a
dichotomous variable, was assessed on a nominal scale, and
included two levels: model-rewarded and model-punished.
The criterion variable was the number of aggressive acts
displayed by the subjects after observing the model. This
was a multi-value variable, and was assessed on a ratio
scale.
D) Statistical test: Factorial ANOVA with two between-subjects factors
E) Statistical null hypothesis (H0): In the population, there
is no interaction between the level of aggression displayed by
the model and the consequences for the model in the prediction
of the criterion variable (the number of aggressive acts
displayed by the subjects).
F) Statistical alternative hypothesis (H1): In the
population, there is an interaction between the level of
aggression displayed by the model and the consequences for the
model in the prediction of the criterion variable (the number
of aggressive acts displayed by the subjects).
G) Obtained statistic: F(2, 24) = 17.88
H) Obtained probability (p) value: p = .0001
I) Conclusion regarding the statistical null hypothesis:
Reject the null hypothesis.
J) Multiple comparison procedure: Not relevant.
K) Confidence intervals: Not relevant.

L) Effect size: R2 = .16, indicating that the interaction
term accounted for 16% of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These
findings provide support for the study's research hypothesis
that the positive relationship between the level of aggression
displayed by the model and subsequent subject aggressive acts
will be stronger for subjects who have seen the model rewarded
than for subjects who have seen the model punished.
N) Formal description of the results for a paper:
Results were analyzed by using a factorial ANOVA with two
between-subjects factors. This analysis revealed a significant
F statistic for the interaction between the level of
aggression displayed by the model and the consequences for the
model, F(2, 24) = 17.88, MSE = 2.47, p = .0001.
Sample means for the various conditions that constituted
the study are displayed in Figure 16.13. The nature of the
interaction displayed in Figure 16.13 shows that there is a
positive relationship between model aggression and subject
aggression for subjects in the model-rewarded group: for
subjects in this condition, greater levels of model aggression
are associated with greater levels of subject aggression. On
the other hand, there is only a very weak relationship between
model aggression and subject aggression for subjects in the
model-punished group.
In the analysis, R2 for this interaction effect was
computed as .16. This indicated that the interaction accounted
for 16% of the variance in subject aggression.
O) Figure representing the results: See Figure 16.13.

The preceding analysis report was prepared following the same steps that were followed
with the other analysis reports in this chapter. The F statistic, degrees of freedom, and p value
came from Output 16.15, from the row CONSEQ*MOD_AGGR.

Using the LSMEANS Statement to Analyze Data from Unbalanced Designs
Overview
This section shows you how to use the LSMEANS statement rather than the MEANS
statement in a factorial ANOVA. LSMEANS stands for "least-squares means." It is often
appropriate to use the LSMEANS statement when you are analyzing data from an
unbalanced design. This section explains what an unbalanced design is, and shows you how
to use LSMEANS statements to request least-squares means, multiple comparison
procedures, and confidence intervals.
Reprise: What Is an Unbalanced Design?
An experimental design is balanced if the same number of observations (subjects) appear in
each cell of the design. For example, Figure 16.1 (presented toward the beginning of this
chapter) illustrates the research design that was used in the aggression study. It shows that
there are five subjects in each cell of the design (that is, there are five subjects in the cell of
subjects who experienced the "low" condition under Predictor A and the "model-rewarded"
condition under Predictor B, there are five subjects in the cell of subjects who experienced
the "moderate" condition under Predictor A and the "model-punished" condition under
Predictor B, and so on). When a research design is balanced, it is generally appropriate to
use the MEANS statement with PROC GLM to request group means, multiple comparison
procedures, and confidence intervals.
In contrast, a research design is typically unbalanced if some cells in the design contain a
larger number of observations (subjects) than other cells. For example, again consider
Figure 16.1, presented earlier. The research design illustrated there contains six cells. If
there were 20 subjects in one of the cells, but only five subjects in each of the remaining five
cells, the research design would then be unbalanced.
When you analyze data from an unbalanced design, it is generally best not to use the
MEANS statement. This is because with unequal cell sizes, the MEANS statement may
produce marginal means that are biased. When analyzing data from an unbalanced design, it
is generally preferable to use the LSMEANS statement in your program, rather than the
MEANS statement because the LSMEANS statement will estimate the marginal means over
a balanced population. LSMEANS estimates what the marginal means would be if you did
have equal cell sizes. The marginal means estimated by the LSMEANS statement are less
likely to be biased.
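A small numeric illustration (constructed for this discussion, not taken from the chapter's output) may make the difference concrete. Suppose that the two cells contributing to one marginal mean have cell means of 10 and 20, but unequal cell sizes of 20 and 5 subjects:

   Marginal mean from MEANS   = (20 x 10 + 5 x 20) / 25 = 300 / 25 = 12.0
   Marginal mean from LSMEANS = (10 + 20) / 2 = 15.0

The MEANS result is pulled toward the mean of the larger cell, while the LSMEANS result averages the two cell means as if the design were balanced.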

Writing the LSMEANS Statements
The syntax. Below is the syntax for the PROC step of a SAS program that uses the
LSMEANS statement rather than the MEANS statement:
PROC GLM DATA = data-set-name;
CLASS predictorB predictorA;
MODEL criterion-variable = predictorB predictorA
predictorB*predictorA;
LSMEANS predictorB predictorA predictorB*predictorA;
LSMEANS predictorB predictorA /
PDIFF ADJUST=TUKEY CL ALPHA=alpha-level;
TITLE1 ' your-name ';
RUN;
QUIT;
The preceding syntax is very similar to the syntax that used the MEANS statement,
presented earlier in this chapter. The first LSMEANS statement takes this form:
LSMEANS  predictorB  predictorA  predictorB*predictorA;
You can see that this LSMEANS statement is identical to the earlier MEANS statement,
except that MEANS has been replaced with LSMEANS.
The second LSMEANS statement is more complex:
LSMEANS  predictorB predictorA /
         PDIFF ADJUST=TUKEY CL ALPHA=alpha-level;
You can see that this second LSMEANS statement contains a slash, followed by a number
of key words for options. Here is what the key words request:
• PDIFF requests that SAS print p values for significance tests related to the multiple
comparison procedure. These p values will tell you whether there are significant
differences between the least-squares means for the different levels under the two
predictor variables.
• ADJUST=TUKEY requests a multiple comparison adjustment for the p values and
confidence limits for the differences between the least-squares means. Including
ADJUST=TUKEY requests an adjustment based on the Tukey HSD test. The adjustment
can also be based on other multiple-comparison procedures; see the chapter on the GLM
procedure in the SAS/STAT User's Guide for details.
• CL requests confidence limits for individual least-squares means. If you also include the
PDIFF option (as is done here), it will also print confidence limits for differences between
means. These are the type of confidence limits that have been illustrated throughout this
chapter.
• ALPHA=alpha-level specifies the significance level to be used for the multiple
comparison procedure and the confidence level to be used with the confidence limits.
Specifying ALPHA=0.05 requests that the significance level (alpha) be set at .05 for
the Tukey tests. If you had wanted alpha set at .01, you would have used the option
ALPHA=0.01, and if you had wanted alpha set at .10, you would have used the option
ALPHA=0.1.
The actual SAS statements. Following are the statements that you would include in a SAS
program to request a factorial ANOVA using the LSMEANS statement rather than the
MEANS statement. The following statements would be appropriate to analyze data from the
aggression study described in this chapter. Notice that alpha is set at .05 for the Tukey tests.
PROC GLM DATA=D1;
CLASS CONSEQ MOD_AGGR;
MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
LSMEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
LSMEANS CONSEQ MOD_AGGR /
PDIFF ADJUST=TUKEY CL ALPHA=0.05;
TITLE1 'JANE DOE';
RUN;
QUIT;

Output Produced by LSMEANS
The output produced by the LSMEANS statements is very similar to the output produced by
the MEANS statements, except that the means have been appropriately adjusted. There are a
few additional differences. For example, the MEANS statement prints the means and
standard deviations, while the LSMEANS statement prints only the adjusted means. For the
most part, however, if you have read the sections of this chapter that show how to interpret
the results of the MEANS statements, you should have little difficulty in interpreting the
results produced by LSMEANS.

Learning More about Using SAS for Factorial ANOVA
This chapter has provided an elementary introduction to the use of factorial ANOVA for
analyzing data from research. Its scope was limited to the simplest types of factorial designs:
those with only two predictor variables. Its scope was also limited to studies in which both
factors are between-subjects factors.
For a discussion of how to use SAS to perform factorial ANOVA with one between-subjects
factor and one within-subjects factor (i.e., one repeated-measures factor), see Hatcher and
Stepanski (1994). As was mentioned earlier, Hatcher and Stepanski (1994) also show how to
perform tests for simple effects.


For a guide on using SAS for an even wider variety of factorial designs, see Cody and Smith
(1997). This book shows how to analyze data from studies with up to three predictor
variables. It also shows how to handle any combination of between-subjects factors and
within-subjects factors.

Conclusion
This chapter, along with Chapters 10-15 of this text, has dealt with statistical tests in which
the criterion variable was always a numeric variable assessed on an interval or ratio scale.
Chapter 10 and Chapter 11 illustrated tests of association: procedures that allowed you to
determine whether a multi-value numeric criterion variable was associated with a predictor
variable that was also a multi-value numeric variable. Chapter 13 through Chapter 16
illustrated tests of group differences: tests that enabled you to determine whether a multivalue numeric criterion variable was associated with a predictor variable that was either a
dichotomous or limited-value variable.
But what if you are conducting a study in which the criterion variable is not a multi-value
numeric variable assessed on an interval or ratio scale? More specifically, what if you have
conducted a study in which both the criterion variable and the predictor variable are
dichotomous or limited-value variables? What if you need to determine whether there is a
significant relationship between these two variables? In studies such as these, it may be
appropriate to analyze your data using a nonparametric procedure called the chi-square test
of independence. This statistical procedure is the topic of the next (and final) chapter.

Chi-Square Test of Independence
Introduction..........................................................................................631
Overview................................................................................................................ 631
Situations That Are Appropriate for the Chi-Square Test of
Independence ..................................................................................631
Overview................................................................................................................ 631
Nature of the Predictor and Criterion Variables ..................................................... 631
The Type-of-Variable Figure .................................................................................. 632
Example of a Study That Provides Appropriate Data for This Procedure .............. 632
Summary of Assumptions Underlying the Chi-Square Test of Independence ....... 633
Using Two-Way Classification Tables .................................................634
Overview................................................................................................................ 634
General Structure of a Two-Way Classification Table............................................ 634
Two-Way Classification Table for the Juvenile Offender Study ............................. 635
Results Produced in a Chi-Square Test of Independence...................637
Overview................................................................................................................ 637
Test of the Null Hypothesis .................................................................................... 637
Effect Size.............................................................................................................. 638
A Study Investigating Computer Preferences .....................................640
Overview................................................................................................................ 640
Research Method................................................................................................... 640

Computing Chi-Square from Raw Data versus Tabular Data ..............642
Overview................................................................................................................ 642
Raw Data ............................................................................................................... 642
Tabular Data .......................................................................................................... 642
The Approach Used Here ...................................................................................... 643
Example of a Chi-Square Test That Reveals
a Significant Relationship ...............................................................643
Overview................................................................................................................ 643
Choosing SAS Variable Names and Values to Use in the Analysis ....................... 643
Data Set to Be Analyzed........................................................................................ 645
Writing the SAS Program....................................................................................... 646
Output Produced by the SAS Program .................................................................. 649
Steps in Interpreting the Output ............................................................................. 649
Using a Graph to Illustrate the Results .................................................................. 655
Analysis Report for the Computer Preferences Study (Significant Results)........... 658
Notes Regarding the Preceding Analysis Report ................................................... 659
Example of a Chi-Square Test That Reveals
a Nonsignificant Relationship .........................................................661
Overview................................................................................................................ 661
The Complete SAS Program ................................................................................. 662
Steps in Interpreting the Output ............................................................................. 662
Using a Figure to Illustrate the Results .................................................................. 664
Analysis Report for the Computer Preferences Study (Nonsignificant Results) ..... 665
Notes Regarding the Preceding Analysis Report ................................................... 667
Computing Chi-Square from Raw Data ................................................668
Overview................................................................................................................ 668
Inputting Raw Data ................................................................................................ 668
The PROC Step ..................................................................................................... 670
The Complete SAS Program ................................................................................. 671
Interpreting the SAS Output................................................................................... 671
Conclusion............................................................................................671


Introduction
Overview
This chapter shows how to enter data and prepare SAS programs that will perform a
chi-square test of independence (also called the Pearson chi-square test, the chi-square test of
association, the chi-square test of homogeneity, or the two-way chi-square test). This test is
useful when you want to determine whether there is a significant relationship between two
dichotomous or limited-value variables that are assessed on any scale of measurement. The
symbol for the chi-square statistic is χ2.
The chapter shows how to prepare SAS programs that will input either tabular data (data that
have already been summarized in a table) or raw data. It shows how to interpret the two-way
classification table produced by PROC FREQ, how to determine whether there is a
significant relationship between the two variables, and how to prepare a report that
summarizes the results of the analysis, including a figure and an index of effect size.
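To preview the general shape of such a program, here is a minimal sketch of the tabular-data approach. The variable names and the cell counts below are hypothetical (they are not the chapter's worked example); each DATALINES row describes one cell of a 2 x 2 table, and the WEIGHT statement tells PROC FREQ that the variable NUMBER holds the frequency for that cell:

DATA TABLED;
   INPUT ROW_VAR $ COL_VAR $ NUMBER;
DATALINES;
YES A 10
YES B 30
NO  A 40
NO  B 20
;
PROC FREQ DATA=TABLED;
   TABLES ROW_VAR*COL_VAR / CHISQ;
   WEIGHT NUMBER;
RUN;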

Situations That Are Appropriate for the Chi-Square Test of Independence
Overview
The chi-square test of independence is a test of association. It allows you to determine
whether a single predictor variable is related to a single criterion variable. It is called a test
of independence because, in this context, the word "independent" means "unrelated." By the
end of the analysis, you will conclude that the two variables are either independent
(unrelated) or dependent (related).
With the chi-square test, both the predictor variable as well as the criterion variable are
typically dichotomous or limited-value variables. This test is particularly useful because it
can be used with variables assessed on any scale of measurement: nominal, ordinal,
interval, or ratio. It is one of the few statistical procedures that allow you to determine
whether there is a significant relationship between two nominal-level variables.
Nature of the Predictor and Criterion Variables
Predictor variable. The predictor variable is typically a dichotomous or limited-value
variable. It may be assessed on any scale of measurement (nominal, ordinal, interval, or
ratio).

Criterion variable. The criterion variable is also typically a dichotomous or limited-value
variable. It may also be assessed on any scale of measurement (nominal, ordinal, interval, or
ratio).
The Type-of-Variable Figure
The figure below illustrates the types of variables that are typically being analyzed when
researchers perform a chi-square test of independence.
   Criterion   =   Predictor
     (Lmt)           (Lmt)

The "Lmt" symbol that appears to the left of the equal sign in the above figure indicates that
the criterion variable in a chi-square test of independence is typically a limited-value
variable (a variable that assumes two to six values in your sample). The "Lmt" symbol that
appears to the right of the equal sign indicates that the predictor variable in this procedure is
also usually a limited-value variable.
It should be noted that, in theory, either of the variables can actually assume any number of
values. In practice, however, the number of values is usually relatively small, typically from
two to six.
Example of a Study That Provides Appropriate Data for This Procedure
Overview. Suppose that you are a criminologist studying violent crime among male juvenile
delinquents. You are interested in determining whether some types of juvenile offenders are
more likely to use weapons (such as a gun) in the commission of crimes. Knowing this
would enable officials in law enforcement to make better predictions about what types of
delinquents are most likely to be dangerous.
The study. To conduct research of this nature, you need to choose a typology for classifying
juvenile offenders. One such typology has been offered by Dicataldo and Grisso (1995).
They have shown that young offenders can be classified into the following three categories:
• Immature juvenile offenders. Individuals in this category tend to be young, have
problems in school, and be child-like and dependent.
• Socialized juvenile offenders. Individuals in this category tend to display better
performance in school, display a sense of guilt, and are motivated to comply with the
court.
• Mature delinquent juvenile offenders. Individuals in this category tend to be
independent and adult-like, display little guilt, lack respect for the court, and have an
extensive history of delinquency.


You wish to determine whether there is any relationship between (a) what type of delinquent
a young person is (according to the above categories), and (b) whether that delinquent used a
weapon in his or her most recent crime. You therefore review court records of 300 cases in
which a juvenile committed a crime. For each case, you do the following:
• You direct a panel of experts to classify the young person into one of the three preceding
categories.
• You note whether or not a weapon was used in the crime.
You then perform a chi-square test of independence to determine whether there is a
significant relationship between (a) the type of offender category of the juvenile and (b)
whether or not the juvenile used a weapon. If there is a significant relationship, you would
then review the results in more detail to determine which type of offender is most likely to
use a weapon.
Why these data would be appropriate for this procedure. The preceding study involved a
single predictor variable and a single criterion variable. The predictor variable was "type of
offender." You know that this was a limited-value variable, because it assumed only three
values: an immature type, a socialized type, and a mature delinquent type. This predictor
variable was assessed on a nominal scale because it indicates group membership but does
not convey any quantitative information. However, remember that the predictor variable
used in a chi-square test of independence can be assessed on any scale of measurement.
The criterion variable in this study was whether or not the juvenile used a weapon in the
most recent crime. You know that this is a dichotomous variable, because it has only two
values: a weapon was used, versus a weapon was not used. This variable was assessed on a
nominal scale, since it doesn't really convey any meaningful quantitative information.
Note: Although the study described here is fictitious, it is based on the actual study reported
by Dicataldo and Grisso (1995).
Summary of Assumptions Underlying the Chi-Square Test of Independence
• Level of measurement. Both the predictor and criterion variables can be assessed on any
scale of measurement (nominal, ordinal, interval, or ratio).
• Random sampling. Subjects contributing data should represent a random sample drawn
from the population of interest.
• Independent cell entries. Each subject should appear in only one cell of the two-way
classification table (the concept of a "cell" and a "classification table" will be discussed in
the next section). Among other things, this means that the chi-square test of independence
should generally not be used with within-subject designs in which the same subject is
exposed to more than one experimental condition. In addition, the fact that a particular
subject appears in one cell should not affect the probability of another subject appearing in
any other cell.
• Observed frequencies should not be zero. The chi-square test might not be valid if the
observed frequency in any of the cells is zero. When this might be a problem, consider
combining categories.
• Minimum expected cell frequencies. When analyzing a 2 x 2 classification table, no cell
should display an expected frequency of less than 5. When this minimum is violated,
consider computing Fisher's exact test instead of chi-square (a sketch showing how to
request this test appears just after this list). With larger tables (e.g., 3 x 4
tables), no more than 20% of the cells should have expected frequencies less than 5. When
this minimum is violated, consider combining categories or, again, using Fisher's exact
test. Note that, although these minimums have long been advised by statistics textbooks,
Monte Carlo studies suggest that they might be overly conservative, and that the
probability of making Type I errors is not greatly increased even when these minimums
are violated. For a concise review of these issues, see Spatz (2001, pages 293-294).

Using Two-Way Classification Tables
Overview
The chi-square test is typically performed to investigate the relationship between two
dichotomous or limited-value variables. The nature of the relationship between two
variables of this kind is easiest to understand if you first prepare a two-way classification
table. This is a table in which the rows represent the categories (values) of one variable,
while the columns represent the categories (values) of the second variable.
General Structure of a Two-Way Classification Table
For example, assume that you wish to prepare a table that plots one variable that contains
two categories against a second variable that contains three categories. The general
structure of such a table appears in Figure 17.1:
Figure 17.1. General structure for a two-way classification table.
In Figure 17.1, the rows run horizontally and the columns run vertically. The point at which
a row and column intersect is called a cell, and each cell is given a unique subscript. The
first digit in this subscript indicates the row to which the cell belongs, and the second digit
indicates the column to which the cell belongs. So the basic format for cell subscripts is
cellrc, where r = row and c = column. This means that cell21 is at the intersection of row 2
and column 1, cell13 is at the intersection of row 1 and column 3, and so on.
One of the first steps in performing a chi-square test of independence is to determine exactly
how many subjects fall into each of the cells in the classification table (that is, how many
subjects appear in each subgroup). The pattern shown by these subgroups will help you
understand whether the two classification variables are related to one another.
Two-Way Classification Table for the Juvenile Offender Study
To make this example more explicit, Figure 17.2 presents a two-way classification table for
the juvenile offender study described above. You can see that, in this figure, the column
variable is "Type of Offender." The first column represents the 100 juveniles classified as
"Immature," the second column represents the 100 juveniles classified as "Socialized," and
the third column represents the 100 juveniles classified as "Mature Delinquent." This "Type
of Offender" variable will serve as the predictor variable in your study.
Figure 17.2. Number of cases in which weapons were used versus not used,
as a function of the type of juvenile offender.
The row variable in Figure 17.2 is "Use of Weapon," which can assume two values:
"Weapon Used" versus "Weapon Not Used." This variable will serve as the criterion
variable in your study.
The numbers that appear in each of the cells in the figure (such as "n = 10") are
frequencies: the number of subjects that appear in each cell. It is these frequencies that you
will analyze when you perform the chi-square test of independence.
Figure 17.2 is easiest to understand if you interpret it just one column at a time. For example,
first consider the column headed "Immature." This column consists of two cells. The top
cell (in the row labeled "Weapon Used") includes the entry "n = 10." This means that there
were only 10 juveniles who fell in this subcategory. The bottom cell (in the row labeled
"Weapon Not Used") includes the entry "n = 90," which means that 90 juveniles fell into
this subcategory. In summary, of the 100 juveniles in the immature category, 10 of them
used weapons in their crimes, while 90 did not.
Now consider the second column in Figure 17.2: the column that represents the socialized
offenders. There, you see exactly the same pattern that was seen with the immature
juveniles. Of the 100 individuals in the socialized category, 10 used weapons and 90 did not.
Finally, review the third column, the column representing the mature delinquent
offenders. You see a different pattern with this group. The top cell for this column shows
that 60 of the 100 juveniles in this group used a weapon, and the bottom cell shows that 40
did not use a weapon.
Taken together, these figures indicate that there appears to be a relationship between (a) type
of offender and (b) whether or not a weapon was used in the crime. It appears that juveniles
in the immature and socialized categories are unlikely to use a weapon, and juveniles in the
mature delinquent category are much more likely to use a weapon.
Notice that the preceding paragraph said that there appears to be a relationship between the
two variables. To determine whether there really is a significant relationship, you would
perform a chi-square test of independence. The remainder of this chapter shows you how.
Results Produced in a Chi-Square Test of Independence

Overview
When you perform a chi-square test of independence, you test a null hypothesis which states
that there is no relationship between the predictor variable and the criterion variable in the
population. If the chi-square statistic is significant, you reject this null hypothesis. In the
analysis, you also review an index of effect size to determine the strength of the relationship
between the two variables. Your choice of an index of effect size will depend on the number
of rows and columns in your two-way classification table. This section discusses these
issues in greater detail.
Test of the Null Hypothesis
Overview. This section describes the statistical null hypothesis that is tested in studies of the
type described in this chapter. It discusses the circumstances under which it is appropriate to
use the chi-square statistic to test this null hypothesis, as well as the circumstances under
which it may be more appropriate to use Fisher's exact test.
Statistical null and alternative hypotheses. For a chi-square test of independence, the null
hypothesis states that there is no relationship between the predictor variable and the criterion
variable in the population. As a concrete example, the statistical null hypothesis for the
juvenile offender study could be stated in this way:
Statistical null hypothesis (H0): In the population, there is no relationship between the
type of offender and the use of a weapon in a crime.
The statistical alternative hypothesis states that there is a relationship in the population. For
the juvenile offender study, it could be stated in this way:
Statistical alternative hypothesis (H1): In the population, there is a relationship between
the type of offender and the use of a weapon in a crime.
The chi-square statistic and p value. When you analyze your sample data with PROC
FREQ (and request the appropriate options), SAS computes a chi-square statistic
(symbolized as χ²) to test the statistical null hypothesis. If there is absolutely no relationship
between the two variables in the sample, the obtained value of chi-square will be equal to
zero. The stronger the relationship between the two variables in the sample, the larger the
obtained chi-square statistic will be.
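If you want to verify SAS's computations by hand, the chi-square statistic is computed from
the observed and expected cell frequencies according to the standard formula

   χ² = Σ [ (O - E)² / E ]

where O is the observed frequency in a given cell, E is the expected frequency for that cell
(expected frequencies are discussed below), and the summation is performed across all cells
in the two-way classification table.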
SAS also computes a p value (probability value) associated with the chi-square statistic. If
this p value is less than some standard criterion (alpha level), you will reject the null
hypothesis of no relationship between the two variables. This book recommends that you
use an alpha level of .05. This means that, if your obtained p value is less than .05, you will
reject the statistical null hypothesis. In this case, you will state that you have statistically
significant results, and will conclude that the two variables are probably related in the
population.
Fisher's exact test for 2 × 2 tables. A 2 × 2 table is a two-way classification table in which
the predictor variable consists of just two categories and the criterion variable also consists
of just two categories. When you are analyzing data from a 2 × 2 table, it is often appropriate
to review the chi-square statistic and its p value, as described in the previous section. This is
particularly the case if your sample size is relatively large.
There are some instances, however, in which it may be more appropriate to consult a
different statistic when you are analyzing data from a 2 × 2 table. The other statistic is called
Fisher's exact test. Fisher's exact test may be more desirable than the chi-square statistic if
you are analyzing data from a 2 × 2 table and your sample is relatively small. One way of
determining whether your sample is relatively small is by reviewing the expected
frequencies in each cell of your two-way classification table. The expected frequencies are
the frequencies that would be expected in a cell if there were no relationship between the
predictor variable and the criterion variable. Some statistics textbooks recommend against
using the chi-square test for a 2 × 2 table if the expected frequency in any cell is less than 5.
In those instances, it might be preferable to consult Fisher's exact test.
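The expected frequency for any cell is easy to compute by hand from the marginal totals of
the classification table:

   E = (row total × column total) / N

where the row total is the number of subjects in that cell's row, the column total is the
number of subjects in that cell's column, and N is the total sample size. For example, in the
juvenile offender study described earlier, 80 of the 300 juveniles used a weapon and 100
were classified as immature, so the expected frequency for the cell combining "Weapon
Used" with "Immature" would be (80 × 100) / 300 = 26.67.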
A later section of this chapter shows how to use PROC FREQ to perform a chi-square test of
independence. There, you will see how to use the keyword EXPECTED in the TABLES
statement to request expected frequencies, and how to use the keyword FISHER in the
TABLES statement to request Fisher's exact test.
For a conceptual introduction to Fisher's exact test, see Hays (1988), pages 781–783. For a
detailed guide on how to compute and interpret Fisher's exact test with SAS, see the
SAS/STAT User's Guide, Chapter 28, "The FREQ Procedure," the section titled "Example
28.4. Analyzing a 2 × 2 Contingency Table," pages 1342–1345.
Be warned that requesting Fisher's exact test with PROC FREQ may require a great deal of
computer time and/or memory. This is especially likely to be the case if your sample size is
large or if you analyze a table that is larger than a 2 × 2 table.
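To preview how these requests fit together, here is a minimal sketch of a PROC FREQ step
that asks for both the expected frequencies and Fisher's exact test; it uses the data set and
variable names from the computer preferences study analyzed later in this chapter:

   PROC FREQ   DATA=D1;
      TABLES   PREF*SCHOOL  /  CHISQ EXPECTED FISHER;
      WEIGHT   NUMBER;
   RUN;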
Effect Size
Overview. With most of the inferential statistical procedures covered in this book, you have
learned to report an index of effect size. The way that effect size has been defined has
varied, depending upon the statistic. With the chi-square test of independence, it is useful to
use a measure of association as an index of effect size. These measures of association are
essentially correlation coefficients, similar in some ways to the Pearson r correlation
coefficient that you learned about in Chapter 10, "Bivariate Correlation." This book
recommends that you use, as a measure of effect size, either the phi coefficient or Cramer's
V, depending upon the size of your two-way classification table. Both of these statistics can
be computed by PROC FREQ.
For 2 × 2 tables. As was mentioned in an earlier section, a 2 × 2 classification table is one
that involves a predictor variable consisting of two categories, and a criterion variable that
also consists of two categories. When analyzing data from a table such as this, the phi
coefficient (symbolized as φ) is a desirable measure of association, or effect size. The phi
coefficient is a correlation coefficient, and in general can be interpreted in the same way as
the Pearson correlation coefficient, r. For example, phi can range from approximately –1.00
through zero to approximately +1.00. Values of phi that are closer to zero indicate a weaker
relationship between the predictor and criterion variables; values that are closer to –1.00 or
+1.00 indicate a stronger relationship (for guidelines on interpreting Pearson's r, see Chapter
10, "Bivariate Correlation," the section titled "Interpreting the Sign and Size of a Correlation
Coefficient").
There are two important caveats related to the interpretation of phi, however:

•  Although the range of Pearson's r extends from exactly –1.00 to exactly +1.00, the range
   of phi extends from approximately –1.00 through approximately +1.00. This means that,
   under some circumstances, phi may be somewhat less than 1.00 (in absolute value) even
   when there is a perfect relationship between the two variables.

•  Although phi may be positive or negative, the sign of the coefficient is meaningful only if
   both the predictor and criterion variables are ordered in some meaningful way. This means
   that, for the sign to be meaningful, both variables must be assessed on either an ordinal,
   interval, or ratio scale (see Hays, 1988, pages 785–786).
For tables larger than 2 × 2. A table is larger than 2 × 2 if you have more than two
categories for the predictor variable, or more than two categories for the criterion variable,
or both. For example, the juvenile offender classification table presented in Figure 17.2
(earlier) is larger than 2 × 2 because it has two categories under "Use of Weapon," but three
categories under "Type of Offender." It is a 2 × 3 table.
For these larger tables, a desirable measure of association is Cramer's V (symbolized as V;
rarely symbolized as φc or φ′). Cramer's V is also a type of correlation coefficient. Its
values may range from zero (indicating no relationship between the two variables) to +1.00
(indicating a perfect relationship).
Summary. For 2 × 2 classification tables, use the phi coefficient as an index of effect size.
For tables that are larger than 2 × 2, use Cramer's V.
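Both indices can be computed directly from the chi-square statistic. Using their standard
definitions:

   φ = √( χ² / N )

   V = √( χ² / [ N(k - 1) ] )

where N is the total sample size and k is the number of categories for whichever variable
(row or column) has the fewer categories. Notice that for a 2 × 2 table, k - 1 = 1, so
Cramer's V reduces to the absolute value of φ. For the 2 × 3 analysis presented later in this
chapter, V = √( 97.3853 / 370 ) = .51, which agrees with the value that PROC FREQ prints.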
A Study Investigating Computer Preferences

Overview
Most of the remainder of this chapter illustrates how to use SAS to perform a chi-square test
of independence. To do this, it describes a fictitious investigation in which you are studying
a sample of college students and are trying to determine whether there is a significant
relationship between the students' school of enrollment and the type of computer that they
prefer. This section describes the study in greater detail.
Research Method
Research question. Assume that you are a university administrator preparing to purchase a
large number of new personal computers for three of the schools that constitute your
university: the School of Arts and Sciences, the School of Education, and the School of
Business. For a given school, you may purchase either IBM-compatible computers
(computers that use the Windows environment) or Macintosh computers. You now need to
know which type of computer tends to be preferred by students within each school.
In general terms, your research question is "Is there a relationship between the following
two variables: (a) school of enrollment, and (b) computer preference?" You suspect that
there probably is such a relationship. Before conducting your study, you have a hunch that
students in the School of Arts and Sciences and the School of Education will prefer
Macintosh computers, while students in the School of Business will prefer IBM-compatible
computers. The chi-square test of independence will help you determine whether this is true.
If the test shows that there is a relationship between school of enrollment and computer
preference, you will then review the two-way classification table to see which type of
computer is preferred by most students in the School of Arts and Sciences, which type is
preferred by most students in the School of Education, and so on.
Measuring the predictor and criterion variables. To investigate your research question,
you draw a random, representative sample of 370 students from the 8,000 students that
constitute the three schools. Each student is given a short questionnaire that asks just two
questions:
1. In which school are you enrolled? (circle one):
a. School of Arts and Sciences
b. School of Business
c. School of Education
2. Which type of computer do you prefer that we purchase for your school? (circle one):
a. IBM-compatible
b. Macintosh
The two questions above constitute the two nominal-level variables for your study:
Question 1 allows you to create a "school of enrollment" variable that can assume one of
three values (Arts & Sciences versus Business versus Education), while Question 2 allows
you to create a "computer preference" variable that may assume one of two values (IBM-
compatible versus Macintosh). Notice that these are both limited-value variables, which is
appropriate for the chi-square test of independence.
Two-way classification table. After you have gathered completed questionnaires from the
370 students, you prepare a two-way classification table that plots computer preference
against school of enrollment. This table (with fictitious data) is presented here as Figure
17.3. Notice that computer preference is the row variable; row 1 represents students who
preferred IBM compatibles, and row 2 represents students who preferred Macintoshes. In the
same way, you can see that school of enrollment is the column variable: column 1
represents students from the School of Arts and Sciences, column 2 represents students from
the School of Business, and column 3 represents students from the School of Education.
This is a 2 × 3 classification table.
Figure 17.3. Preference for IBM-compatible computers versus Macintosh
computers as a function of school of enrollment (significant results).
Figure 17.3 reveals the number of students who appear in each cell of the classification
table. For example, the first row of the table shows that, among those students who
preferred IBM compatibles, 30 were Arts and Sciences students, 100 were Business
students, and 20 were Education majors.
Remember that the purpose of the study was to determine whether there is any relationship
between the two variables: to determine whether school of enrollment is related to computer
preference. This is just another way of saying, "If you know what school a student is
enrolled in, does that help you predict what type of computer that student is likely to
prefer?" In the present case, the answer to this question is easiest to find if you review the
table one column at a time.
For example, if you look at the Arts and Sciences column of the table, you can see that only a
minority of these students (n = 30) preferred IBM-compatible computers, while a larger
number of students (n = 60) preferred Macintosh computers. The column for the Business
students shows the opposite trend, however: most business students (n = 100) preferred IBM
compatibles, while fewer (n = 40) preferred Macintosh computers. Finally, the pattern for the
Education students was similar to that of the Arts and Sciences students: a minority (n = 20)
preferred IBM compatibles, while a majority (n = 120) preferred Macintosh.
In short, there appears to be a relationship between school of enrollment and computer
preference, with Business students preferring IBM compatibles, and Arts and Sciences and
Education students preferring Macintoshes. But this is just a trend that you observed in the
sample. Is this trend strong enough to allow you to reject the null hypothesis that there is no
relationship in the population of students? To determine this, you must conduct the chi-
square test of independence.
Computing Chi-Square from Raw Data versus Tabular Data

Overview
The way that you will write a SAS program to perform the chi-square test will differ
somewhat depending on whether you are working with raw data or with tabular data. This
section explains the difference between these two formats.
Raw Data
Raw data are data that have not been summarized or tabulated in any way. For example,
suppose that you have administered your questionnaire to 370 students, and you have not yet
tabulated their responses: you merely have 370 completed questionnaires. So you enter the
questionnaire responses, one subject at a time. In your SAS data set, line 1 contains
responses from Subject #1, line 2 contains responses from Subject #2, and so on. In this
situation, you are working with raw data.
Tabular Data
On the other hand, tabular data are data that have already been summarized in a table. For
example, suppose that it was actually another researcher who administered this
questionnaire and then summarized subject responses in a two-way classification table
similar to Figure 17.3. This two-way classification table tells you how many people were in
cell11 (that is, how many people were in Arts and Sciences and also preferred IBM
compatibles), how many were in cell12 (that is, how many people were in the School of
Business and also preferred IBM compatibles), and so on. In this case you are dealing with
tabular data.
The Approach Used Here
In computing the chi-square statistic, there is no real advantage to using one form of data
rather than another, although you will generally have a lot less data to enter if your data are
already in tabular form. The following section shows how to input the data and request the
chi-square statistic for tabular data; a section appearing later in the chapter shows you how
to perform the same analysis using raw data.
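To make the contrast concrete, here is a minimal sketch of what the raw-data version of the
computer preferences analysis (described in the following sections) might look like. The
data lines shown are hypothetical, and there would be one line for each of the 370 subjects.
Notice that no WEIGHT statement is needed, because each data line now represents exactly
one subject:

   DATA D1;
      INPUT   PREF    $
              SCHOOL  $ ;
   DATALINES;
   IBM  ARTS
   MAC  BUS
   MAC  ED
   [367 additional data lines would go here]
   ;
   PROC FREQ   DATA=D1;
      TABLES   PREF*SCHOOL  /  ALL;
   RUN;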
Example of a Chi-Square Test That Reveals a Significant Relationship

Overview
You will use PROC FREQ to perform the chi-square test of independence. You first learned
about PROC FREQ in Chapter 5, Creating Frequency Tables. There, you used the FREQ
procedure to create a one-way table that listed values of a single variable. Here, you will use
PROC FREQ to create a two-way classification table in which your predictor variable is
crosstabulated with the criterion variable. This type of table is sometimes called a
crosstabulation table.
This section shows you how to prepare the data set, write the SAS program, interpret the
output, summarize the results in a bar chart, and prepare an analysis report. The current
section also shows you how to analyze tabular data: data that have already been organized
into a table.
Choosing SAS Variable Names and Values to Use in the Analysis
Before you write a SAS program to perform the chi-square test, it is helpful to first prepare a
figure similar to Figure 17.4. The purpose of this figure is to help you choose meaningful
SAS variable names for the predictor and criterion variables, as well as meaningful values to
represent the different categories under the predictor and criterion variables. If you carefully
choose meaningful variable names and values at this point, you will find it easier to interpret
your SAS output later.
Figure 17.4. Variable names and values to be used in the SAS program
for the computer preference study.
SAS variable name and values for the predictor variable. You can see that Figure 17.4 is
very similar to Figure 17.3 presented earlier, except that shorter, more concise labels are
used in Figure 17.4. These shorter labels will serve as the SAS variable names and values to
be used in the SAS program. For example, the label "School of Enrollment" from Figure
17.3 has been replaced with the SAS variable name SCHOOL in Figure 17.4. Obviously,
you may choose any SAS variable name you like, as long as it is meaningful and complies
with the rules for SAS variable names.
Each column in Figure 17.4 is headed with the value that will be used to code that category
in the SAS program. The figure shows that

•  the value ARTS will be used to represent the School of Arts and Sciences

•  the value BUS will be used to represent the School of Business

•  the value ED will be used to represent the School of Education.
SAS variable name and values for the criterion variable. You can see that the label
"Computer Preference" from Figure 17.3 has been replaced with the SAS variable name
PREF in Figure 17.4. This will be the SAS variable name to represent the criterion
variable in the analysis.
Each row in Figure 17.4 is labeled with the following values that will be used to code each
category in the SAS program:

•  The value IBM will be used to represent students who prefer IBM compatibles.

•  The value MAC will be used to represent students who prefer Macintosh computers.
Data Set to Be Analyzed

Table 17.1 presents the data set that you will analyze.

Table 17.1
Tabular Data Set for the Computer Preferences Study (Data Will Produce a
Significant Chi-Square Statistic)
_____________________________
Preference   School   Number
_____________________________
IBM          ARTS         30
IBM          BUS         100
IBM          ED           20
MAC          ARTS         60
MAC          BUS          40
MAC          ED          120
_____________________________
Understanding the columns of Table 17.1. The first two columns of Table 17.1 represent
the two variables that you will analyze in your SAS program. The first column is headed
"Preference," and this column indicates whether a particular group preferred IBM
compatibles versus Macintosh computers. The second column is headed "School," and this
column indicates the school in which a particular group is enrolled (Arts and Sciences versus
Business versus Education). The third column of Table 17.1 is headed "Number," and this
column simply indicates the number of students who were in a particular subgroup, or cell.
Understanding the rows of Table 17.1. Each row in Table 17.1 represents one of the six
cells from Figure 17.4. For example, the first row is coded "IBM" under "Preference," and
"ARTS" under "School." This row therefore represents the subgroup of students who (a)
preferred IBM compatibles, and (b) were in the School of Arts and Sciences. For this row,
the value "30" appears under "Number," meaning that there were 30 students in this
subgroup. This subgroup also appears in Figure 17.4, in the cell where the row labeled
"IBM" intersects with the column headed "ARTS." The "n = 30" in that cell also indicates
that there were 30 students in that subgroup.
The second row of Table 17.1 is coded "IBM" under "Preference," and "BUS" under
"School." This row therefore represents the subgroup of students who (a) preferred IBM
compatibles, and (b) were in the School of Business. For this row, the value "100" appears
under "Number," meaning that there were 100 students in this subgroup. You can see that
this row corresponds to the cell in Figure 17.4 where "IBM" intersects with "BUS."
In the same way, you can see that the six rows in Table 17.1 correspond to the six cells of
Figure 17.4. After you have tabulated your data in a two-way classification table such as
Figure 17.4, it is fairly simple to convert this information into a data set (such as Table 17.1)
that can be analyzed with PROC FREQ.
Writing the SAS Program

The SAS DATA step. When the data for a chi-square test of independence are in tabular
form (as they are in Table 17.1), it is necessary to write a special type of INPUT statement
to read the data. Here is the syntax:

   DATA  data-set-name;
      INPUT   row-variable-name     $
              column-variable-name  $
              number-variable-name  ;
   DATALINES;
   row-value   column-value   number-in-cell
   row-value   column-value   number-in-cell
   [Additional data lines would go here]
   row-value   column-value   number-in-cell
   row-value   column-value   number-in-cell
   ;
The INPUT statement in this program tells SAS that the data set includes three variables,
and the names of these three variables are symbolized as row-variable-name,
column-variable-name, and number-variable-name.
The first variable is a character variable that codes the rows of the classification table (in the
present study, the row variable was "computer preference"). The second variable is a
character variable that codes the columns of the table (here, the column variable was
"school of enrollment"). Finally, the third variable (symbolized as number-variable-name)
is a quantitative variable that codes how many subjects appear in a cell. (Specific
names will be given to these variables in the program to be presented shortly.)
Each line of data in the DATALINES section corresponds to one of the cells in the two-way
classification table that is presented in Figure 17.4. This classification table included six
cells, so there will be six data lines in the DATALINES section for the current program.
Below is the actual DATA step for inputting the tabular data presented in Figure 17.4 and
Table 17.1 (line numbers have been added on the left):
   1    OPTIONS  LS=80  PS=60;
   2    DATA D1;
   3       INPUT   PREF    $
   4               SCHOOL  $
   5               NUMBER  ;
   6    DATALINES;
   7    IBM   ARTS    30
   8    IBM   BUS    100
   9    IBM   ED      20
   10   MAC   ARTS    60
   11   MAC   BUS     40
   12   MAC   ED     120
   13   ;
The DATA statement on line 2 of the preceding program tells SAS to create a new data set
and name it D1. The INPUT statement on lines 3-5 indicates that the data set contains
three variables. The first variable is a character variable named PREF (coding the row
variable), the second is a character variable named SCHOOL (coding the column variable),
and the third variable is a numeric variable called NUMBER (indicating how many students
appear in a cell).
The DATALINES portion of the preceding program includes six lines of data, one for each
cell. The first cell represents those students who (a) preferred IBM-compatibles, and (b)
were in the School of Arts and Sciences. The value for NUMBER on this line shows that
there were 30 subjects in this cell. The remaining data lines may be interpreted in the same
fashion. You can see that these lines of data were taken directly from Table 17.1.
The PROC step. Below is the syntax for the PROC step that will create a two-way
classification table when data have been input in tabular form. The options used with these
statements (to be described below) allow you to request a chi-square test of independence,
along with additional information.

   PROC FREQ   DATA=data-set-name;
      TABLES   row-variable-name*column-variable-name  /  options;
      WEIGHT   number-variable-name;
   RUN;

Substituting the appropriate SAS variable names into this syntax results in the following
(line numbers have been added on the left):

   1   PROC FREQ   DATA=D1;
   2      TABLES   PREF*SCHOOL  /  ALL;
   3      WEIGHT   NUMBER;
   4      TITLE1   'JANE DOE';
   5   RUN;
The PROC FREQ statement on line 1 requests that the FREQ procedure be performed on
data set D1. The TABLES statement on line 2 sets PREF as the row variable and SCHOOL
as the column variable in the two-way classification table that is created by PROC FREQ.
This request is followed by a slash, the keyword ALL for the "all statistics" option (to be
described below), and a semicolon.
The WEIGHT statement on line 3 provides the name of the variable that codes the number
of subjects in each cell. In this case, the variable named NUMBER is specified. This part
of the program then ends with the usual TITLE1 and RUN statements.
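One way to appreciate what the WEIGHT statement accomplishes: if you omitted it from
the preceding program, PROC FREQ would treat each of the six data lines as a single
observation, and the resulting classification table would show a total sample size of 6 rather
than 370. The WEIGHT statement is what tells SAS that a data line such as "IBM ARTS 30"
stands for 30 subjects rather than one.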
Some options available with PROC FREQ. When PROC FREQ is used to create and
analyze a two-way classification table, SAS enables you to request a variety of options in
the TABLES statement. To request these options, you begin the TABLES statement with the
word TABLES, followed by the names of the row variable and column variable
(connected by an asterisk), followed by a slash (/), and then by the keywords for the
options that you want. For a complete list of options, see the SAS/STAT User's Guide,
Chapter 28, "The FREQ Procedure."
Here are some of the options that may be especially useful for research in the social sciences
and education:
ALL
   Requests several significance tests (including the chi-square test of independence) and
   several measures of bivariate association. Although many statistics are printed, only a
   few will be appropriate for a particular analysis. The choice of the correct statistic will
   depend upon the level of measurement that is used with the variables, the size of the
   two-way classification table, and other considerations.
CHISQ
   Requests the chi-square test of independence, and prints a number of measures of
   bivariate association based on chi-square. It is not necessary to list this option if you
   have already listed the ALL option.
FISHER
   Prints Fisher's exact test. This test is printed automatically for 2 × 2 tables (if the
   ALL or CHISQ options are specified), but must be specifically requested for larger
   tables. Be warned that, if your table has a large number of rows or columns, or if your
   sample size is large, Fisher's exact test may require a large amount of computer time
   or memory.
EXPECTED
   Prints the expected cell frequencies; that is, the cell frequencies that are expected if
   the two variables are independent (unrelated). You should request this option if you
   suspect that the expected frequency in any of your cells might be below the minimum
   described in the previous section titled "Summary of Assumptions Underlying the
   Chi-Square Test of Independence." This is also a useful option for better
   understanding the nature of the relationship between the two variables that you are
   studying.
MEASURES
   Requests several measures of bivariate association, along with their asymptotic
   standard errors (ASE). These include the Pearson and Spearman correlation
   coefficients, gamma, Kendall's tau-b, Stuart's tau-c, symmetric lambda, asymmetric
   lambda, uncertainty coefficients, and other measures. Again, only a few of these
   indices will be appropriate for a particular study. All of these measures are printed if
   the ALL option is requested.
The complete SAS program. Below is a complete SAS program that will (a) read tabular
data, (b) create a two-way classification table, and (c) print the statistics requested by the
ALL option (including the chi-square test of independence):

   OPTIONS  LS=80  PS=60;
   DATA D1;
      INPUT   PREF    $
              SCHOOL  $
              NUMBER  ;
   DATALINES;
   IBM   ARTS    30
   IBM   BUS    100
   IBM   ED      20
   MAC   ARTS    60
   MAC   BUS     40
   MAC   ED     120
   ;
   PROC FREQ   DATA=D1;
      TABLES   PREF*SCHOOL  /  ALL;
      WEIGHT   NUMBER;
      TITLE1   'JANE DOE';
   RUN;
Output Produced by the SAS Program

The preceding program produces two pages of output. Page 1 includes a two-way
classification table in which PREF is the row variable and SCHOOL is the column variable
(this table will be similar to Figure 17.4). Page 1 also includes the chi-square test of
independence, the phi coefficient, Cramer's V coefficient, and a few other statistics. Page 2
presents additional statistics requested by the ALL option in the TABLES statement. The
following sections show you how to interpret those parts of the output that are most relevant
to your research question.
Steps in Interpreting the Output
1. Verify that the SAS variable names and values are correct. You should begin by
verifying that there were no obvious errors in typing your data or in writing the SAS
program. The two-way classification table created by the FREQ procedure contains
information that can help to identify possible errors. The two-way classification table
produced by PROC FREQ appears here as Output 17.1.
                                 JANE DOE
                            The FREQ Procedure

                           Table of PREF by SCHOOL

            PREF (n)    SCHOOL (o)

            Frequency|
            Percent  |
            Row Pct  |
            Col Pct  |ARTS (r)|BUS (s) |ED (t)  |  Total
            ---------+--------+--------+--------+
            IBM (p)  |     30 |    100 |     20 |    150
                     |   8.11 |  27.03 |   5.41 |  40.54
                     |  20.00 |  66.67 |  13.33 |
                     |  33.33 |  71.43 |  14.29 |
            ---------+--------+--------+--------+
            MAC (q)  |     60 |     40 |    120 |    220
                     |  16.22 |  10.81 |  32.43 |  59.46
                     |  27.27 |  18.18 |  54.55 |
                     |  66.67 |  28.57 |  85.71 |
            ---------+--------+--------+--------+
            Total         90      140      140      370
                       24.32    37.84    37.84   100.00

Output 17.1. Two-way classification table, requested by PROC FREQ, for the
computer preferences study (significant results).
In the classification table reproduced in Output 17.1, the name of the row variable (PREF)
appears in the upper left corner (n). The first row (labeled "IBM") represents the subjects
who preferred IBM compatibles (p), and the second row (labeled "MAC") represents subjects
who preferred Macintoshes (q).
The name of the column variable (SCHOOL) appears above the three columns (o), and
each column in turn is headed with its label: column 1 is headed "ARTS" (r) and represents
the Arts and Sciences students, column 2 is headed "BUS" (s) and represents the Business
students, and column 3 is headed "ED" (t) and represents the Education students.
Your first task should be to review these SAS variable names and values. Verify that the
correct values (categories) are listed under the predictor and criterion variables. In the
present case, there is no evidence of any problems.
2. Verify that the cell frequencies are correct. Next, you will check each cell to verify that
it contains the correct frequency. To illustrate this, the two-way classification table from the
current analysis is reproduced again below as Output 17.2.
                                 JANE DOE
                            The FREQ Procedure

                           Table of PREF by SCHOOL

            PREF        SCHOOL

            Frequency|
            Percent  |
            Row Pct  |
            Col Pct  |ARTS    |BUS     |ED      |  Total
            ---------+--------+--------+--------+
            IBM      |  30(n) | 100(o) |  20(p) |    150
                     |   8.11 |  27.03 |   5.41 |  40.54
                     |  20.00 |  66.67 |  13.33 |
                     |  33.33 |  71.43 |  14.29 |
            ---------+--------+--------+--------+
            MAC      |  60(q) |  40(r) | 120(s) |    220
                     |  16.22 |  10.81 |  32.43 |  59.46
                     |  27.27 |  18.18 |  54.55 |
                     |  66.67 |  28.57 |  85.71 |
            ---------+--------+--------+--------+
            Total         90      140      140      370(t)
                       24.32    37.84    37.84   100.00

Output 17.2. Verifying that cell frequencies are correct for the computer
preferences study (significant results).

As mentioned earlier, a cell is a location in a two-way table at which a row for the criterion
variable intersects with a column for the predictor variable. For example, in Output 17.2,
look at the cell at the intersection of the row labeled IBM and the column labeled
ARTS. Four numbers appear in this cell:
30
8.11
20.00
33.33
A later section will describe what these four numbers represent, but for now we will focus
on just the top number: 30. The top number in each cell should be the frequency for that cell.
In the current study, the top number for each cell should be the number of people who
appear in each subgroup. You should always verify that the top number in each cell is
correct.
For example, in Output 17.2, the cell where the row labeled "IBM" intersects with the
column labeled "ARTS" indicates a frequency of 30 (n). This means that there were 30
subjects who (a) preferred IBM-compatible computers, and (b) were in the School of Arts
and Sciences. If you review Figure 17.4 (presented earlier), you can see that this figure of 30
is correct.
In the same way, Output 17.2 shows that there were 100 students in the IBM-BUS cell (o),
20 students in the IBM-ED cell (p), 60 students in the MAC-ARTS cell (q), 40 students in
the MAC-BUS cell (r), and 120 students in the MAC-ED cell (s). Each of these numbers
matches the cell frequencies presented in Figure 17.4.
The frequency figure in the lower right corner of this page of output provides the total
sample size for the analysis. Here, you can see that the total sample size is 370 (t). So far,
there do not appear to be any obvious errors in preparing your SAS program.
3. Review the supplementary information in each cell. The preceding section indicated
that four numbers are presented in each cell of a two-way classification table created by
PROC FREQ. For example, here again are the numbers that appear in the cell at the
intersection of IBM and ARTS:

   30      (n)
   8.11    (o)
   20.00   (p)
   33.33   (q)

Here is a description of what each figure represents:

(n) The top number in each cell is the Frequency; that is, the raw number of subjects in the
cell. The top number in the IBM-ARTS cell is 30, as was discussed above.
(o) The second number in each cell is the Percent; that is, the percent of subjects in that cell
relative to the total number of subjects (the number of subjects in the cell divided by the
total number of subjects). For example, there were 30 subjects in the IBM-ARTS cell, and a
total of 370 subjects in the study. Therefore, the cell percent is 30 / 370 = 8.11%.
(p) The third number is the Row Pct; that is, the percent of subjects in that cell, relative to the
number of subjects in that row. For example, there are 30 subjects in the IBM-ARTS cell,
and 150 subjects in the IBM row. Therefore, the row percent for this cell is 30 / 150 = 20%.
(q) The bottom number in each cell is the Col Pct; that is, the percent of subjects in that cell,
relative to the number of subjects in that column. For example, there are 30 subjects in the
IBM-ARTS cell, and 90 subjects in the ARTS column. Therefore, the column percent for
this cell is 30 / 90 = 33.33%.

In the upper left corner of Output 17.2, you can see that PROC FREQ has provided a key to
help you remember what the four values in each cell represent. That key from the
classification table in Output 17.2 is reproduced here:

   Frequency
   Percent
   Row Pct
   Col Pct
4. Review the column percent figures. In the present study, it is particularly revealing to
review the classification table one column at a time, and to pay particular attention to the last
entry in each cell: the column percent. First, consider the ARTS column in Output 17.2.
The column percent entries show that only 33.33% of the Arts and Sciences students
preferred IBM-compatible computers, while 66.67% preferred Macintosh computers. Next,
consider the BUS column, which shows the reverse trend: 71.43% of the Business students
preferred IBM compatibles while only 28.57% preferred Macintoshes. Finally, the trend of
the Education students in the ED column is similar to that for the Arts and Sciences
students: only 14.29% preferred IBM compatibles, while 85.71% preferred Macintoshes.
These percentages reinforce the hypothesis that there may be a relationship between school
of enrollment and computer preference. On the surface, it seems that students in the School
of Arts and Sciences and in the School of Education tend to prefer Macintoshes over IBM
compatibles, while students in the School of Business tend to prefer IBM compatibles over
Macintoshes. But the real question is whether this relationship is statistically significant. To
answer this, you must consult the chi-square test of independence.
5. Review the chi-square test of independence. The statistical null hypothesis for the
current study may be stated as follows:
Statistical null hypothesis (H0): In the study population, there is no relationship
between school of enrollment and computer preference.
The chi-square test of independence tests this null hypothesis. The results of the chi-square
test appear in a statistics table at the bottom of Page 1 of the PROC FREQ output. This table
is reproduced here as Output 17.3.
            Statistics for Table of PREF by SCHOOL (n)

                (o)                          (p)       (q)         (r)
            Statistic                         DF      Value        Prob
            ------------------------------------------------------
            Chi-Square (s)                  2 (t)  97.3853 (u)  <.0001 (v)
            Likelihood Ratio Chi-Square       2   102.6849      <.0001
            Mantel-Haenszel Chi-Square        1    16.9812      <.0001
            Phi Coefficient                        0.5130
            Contingency Coefficient                0.4565
            Cramer's V                             0.5130

Output 17.3. Chi-square test of independence, requested with PROC FREQ, for
the computer preferences study (significant results).
A heading at the top of the table in Output 17.3 identifies the two variables that are being
investigated (n). In the present case, you can see that the two variables are PREF and
SCHOOL.
The left side of the table in Output 17.3 is headed "Statistic" (o). Below this heading are the
names of the different statistics appearing in the table. Information that is related to the chi-
square test of independence appears as the first row of this table, to the right of the heading
"Chi-Square" (s).
The third column in Output 17.3 is headed "Value" (q). Below this heading are the actual
values for the various statistics that are computed by PROC FREQ. The obtained value of
the chi-square statistic appears where the column "Value" intersects with the row "Chi-
Square." Output 17.3 shows that the obtained value of chi-square for the current analysis is
97.3853 (u), which rounds to 97.385. As stated above, this statistic tests the null hypothesis
that, in the population, the two variables are independent, or unrelated. When the null
hypothesis is true, we expect the value of chi-square to be relatively small. The stronger the
relationship between the two variables in the sample, the larger the chi-square value will be.
To determine whether the current obtained value of 97.385 is statistically significant, you
must review the p value (probability value) that is associated with this statistic.
The last column in Output 17.3 is headed "Prob" (r). This column reports p values for some
of the statistics in the table. At the location where the row headed "Chi-Square" intersects
with the column headed "Prob," you can see that the p value for your chi-square statistic is
<.0001 (v). This p value is less than our standard alpha criterion of .05, which means that
your chi-square test of independence is statistically significant. You can therefore reject the
statistical null hypothesis, and tentatively conclude that there is a relationship between
school of enrollment and computer preference in the population. You have already identified
the nature of this relationship in the earlier section titled "4. Review the column percent
figures." There, you determined that students in the School of Arts and Sciences and the
School of Education tend to prefer Macintosh computers over IBM compatibles, while
students in the School of Business tend to prefer IBM-compatible computers over
Macintoshes.
The second column in Output 17.3 is headed "DF" (p). This column provides the degrees of
freedom for some of the statistics provided in the table. At the location where the column
headed "DF" intersects with the row headed "Chi-Square," you can see that there are 2
degrees of freedom for your chi-square statistic (t).
The degrees of freedom for the chi-square test of independence are calculated as

   df = (r - 1)(c - 1)

where

   r = number of categories for the row variable
   c = number of categories for the column variable.

For the current analysis, the row variable (PREF) had two categories, and the column
variable (SCHOOL) had three categories, so the degrees of freedom are calculated as

   df = (2 - 1)(3 - 1)
      = (1)(2)
      = 2
6. Review the index of effect size. An earlier section indicated that, when you perform a
chi-square test of independence, you should generally use either the phi coefficient or
Cramer's V as your index of effect size (your measure of association). It indicated that you
should use the phi coefficient for 2 × 2 tables, and Cramer's V for larger tables. The
classification table for the current analysis is a 2 × 3 table, because it had two categories
under the row variable (IBM compatibles versus Macintosh) and three categories under the
column variable (Arts and Sciences versus Business versus Education). Therefore, you will
report Cramer's V as your index of effect size.
For convenience, the statistics table from your analysis is reproduced here as Output 17.4.
            Statistics for Table of PREF by SCHOOL
                                                      (n)
            Statistic                         DF      Value        Prob
            ------------------------------------------------------
            Chi-Square                         2    97.3853      <.0001
            Likelihood Ratio Chi-Square        2   102.6849      <.0001
            Mantel-Haenszel Chi-Square         1    16.9812      <.0001
            Phi Coefficient (o)                     0.5130
            Contingency Coefficient                 0.4565
            Cramer's V (p)                          0.5130 (q)

Output 17.4. Index of effect size, requested with PROC FREQ, for the
computer preferences study (significant results).

Cramers V coefficient appears in the location where the row labeled Cramers V (p)
intersects with the column headed Value (n). In Output 17.4, you can see that the
obtained value of Cramers V for the current analysis is 0.5130 (q), which rounds to .51.
You will report this as your index of effect size for the current analysis. Remember that
values of Cramers V may range from zero to +1.00, with larger values indicating a stronger
relationship.
If your classification table had been a 2 2 table instead of a 2 3 table, you would have
reported the phi coefficient instead of Cramers V. In Output 17.4, this statistic appears to
the right of the label Phi Coefficient (o).
Using a Graph to Illustrate the Results
Overview. The results of a chi-square test of independence are easiest to understand when
they are represented in a graph that plots the frequencies for each of the cells in the study's
two-way classification table. The cell frequencies for the computer preferences study were
presented in the two-way classification table of Output 17.2, presented earlier. For your
convenience, that classification table is reproduced here as Output 17.5.
                                 JANE DOE
                            The FREQ Procedure

                           Table of PREF by SCHOOL

            PREF        SCHOOL

            Frequency|
            Percent  |
            Row Pct  |
            Col Pct  |ARTS    |BUS     |ED      |  Total
            ---------+--------+--------+--------+
            IBM      |  30(n) | 100(o) |  20(p) |    150
                     |   8.11 |  27.03 |   5.41 |  40.54
                     |  20.00 |  66.67 |  13.33 |
                     |  33.33 |  71.43 |  14.29 |
            ---------+--------+--------+--------+
            MAC      |  60(q) |  40(r) | 120(s) |    220
                     |  16.22 |  10.81 |  32.43 |  59.46
                     |  27.27 |  18.18 |  54.55 |
                     |  66.67 |  28.57 |  85.71 |
            ---------+--------+--------+--------+
            Total         90      140      140      370
                       24.32    37.84    37.84   100.00

Output 17.5. Cell frequencies needed to prepare bar chart; computer
preferences study, significant results.
Remember that the top number in each cell represents the number of people in that subgroup
(in Output 17.5, these frequencies are identified with the markers (n), (o), and so on). Figure
17.5 presents a bar chart that displays the cell frequencies of Output 17.5.

Figure 17.5. Bar chart displaying preference for IBM compatibles versus
Macintosh computers as a function of school of enrollment (significant
results).
Understanding the bar chart. You can see that the vertical axis of Figure 17.5 is labeled
"Frequency." This means that you will be plotting the frequency (number of people) in each
cell.
The horizontal axis of Figure 17.5 is labeled "School of Enrollment." The three sets of bars
in the figure are labeled "Arts" (for the School of Arts and Sciences), "Business" (for the
School of Business), and "Education" (for the School of Education).
You can see that there are actually two bars for each school. For each school, the solid bar
represents the number of people who preferred an IBM-compatible computer, and the white
bar represents the number of people who preferred a Macintosh.
Using cell frequencies from the SAS output to create the bar chart. The cell frequencies
from Output 17.5 were used to create the bars in Figure 17.5. First, consider the row labeled
"IBM" in Output 17.5. The frequency entries from this row were used to create the solid
bars in Figure 17.5.

•  The frequency of 30 from the ARTS column in Output 17.5 (n) was used to create the
   solid bar for the Arts group in Figure 17.5. Notice that this bar indicates a value of 30
   on the vertical axis of Figure 17.5.

•  The frequency of 100 from the BUS column in Output 17.5 (o) was used to create the
   solid bar for the Business group in Figure 17.5.

•  The frequency of 20 from the ED column in Output 17.5 (p) was used to create the
   solid bar for the Education group in Figure 17.5.

Next, consider the row labeled "MAC" in Output 17.5. The frequency entries from this row
were used to create the white bars in Figure 17.5.

•  The frequency of 60 from the ARTS column in Output 17.5 (q) was used to create the
   white bar for the Arts group in Figure 17.5.

•  The frequency of 40 from the BUS column in Output 17.5 (r) was used to create the
   white bar for the Business group in Figure 17.5.

•  The frequency of 120 from the ED column in Output 17.5 (s) was used to create the
   white bar for the Education group in Figure 17.5.
Interpreting the bar chart. If you review the pattern of frequencies in Figure 17.5, you will
see the same relationships described earlier: In both the School of Arts and Sciences and the
School of Education, the majority of students preferred Macintosh computers over IBM-
compatible computers (the white bars are taller than the solid bars for those two schools).
However, in the School of Business, the majority of students preferred IBM-compatible
computers over Macintosh computers (the solid bar is taller than the white bar for that
school).
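If you would like SAS to draw a rough, text-based approximation of this bar chart directly
from the tabular data set, the following is only a sketch that uses the base SAS CHART
procedure; it assumes the D1 data set created earlier in this chapter:

   PROC CHART   DATA=D1;
      VBAR   PREF  /  SUMVAR=NUMBER    /* bar height = sum of NUMBER     */
                      GROUP=SCHOOL;    /* one cluster of bars per school */
   RUN;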
Analysis Report for the Computer Preferences Study (Significant Results)

Overview. The following analysis report summarizes the results of the preceding analysis.
This report can be used as a model of how to prepare reports when you have performed a
chi-square test of independence and have obtained significant results. A section following
the report explains the meaning of some of the information that appears in the report.
A) Statement of the research question: The purpose of the
present study was to determine whether there is a relationship
between school of enrollment and computer preference among
college students. Specifically, this study was designed to
determine whether there is a difference between subjects in
the School of Arts and Sciences, the School of Business, and
the School of Education with respect to their preferences for
IBM-compatible computers versus Macintosh computers.
B) Statement of the research hypothesis: There will be a
relationship between school of enrollment and computer
preference such that (a) students in the School of Arts and
Sciences and students in the School of Education will be
likely to prefer Macintosh computers, while (b) students in
the School of Business will be likely to prefer IBM-compatible
computers.
C) Nature of the variables: This analysis involved two
variables:

•  The predictor variable was school of enrollment. This was
   a limited-value variable, was measured on a nominal scale,
   and could assume three values: the School of Arts and
   Sciences, the School of Business, and the School of
   Education.

•  The criterion variable was computer preference. This was a
   dichotomous variable, was measured on a nominal scale, and
   could assume two values: IBM compatible, and Macintosh.

D) Statistical test: Chi-square test of independence.
E) Statistical null hypothesis (H0): In the study population,
there is no relationship between school of enrollment and
computer preference.

F) Statistical alternative hypothesis (H1): In the study
population, there is a relationship between school of
enrollment and computer preference.

G) Obtained statistic: χ²(2, N = 370) = 97.385.

H) Obtained probability (p) value: p < .0001.
I) Conclusion regarding the statistical null hypothesis:
Reject the null hypothesis.

J) Effect size: Cramer's V was used as the index of effect
size. For this analysis, Cramer's V = .51. Values of Cramer's
V may range from zero to +1.00, with values closer to zero
indicating a weaker relationship between the predictor
variable and the criterion variable.
K) Conclusion regarding the research hypothesis: These
findings provide support for the study's research hypothesis.

L) Formal description of the results for a paper:
     Results were analyzed using a chi-square test of
independence. This analysis revealed a significant
relationship between school of enrollment and computer
preference, χ²(2, N = 370) = 97.385, p < .0001. Figure 17.5
illustrates the number of students who preferred IBM-
compatible computers versus Macintosh computers, broken down
by school of enrollment. The crosstabulation table showed
that, for subjects in the School of Arts and Sciences, a
smaller percentage preferred IBM-compatible computers versus
Macintosh computers (33% versus 67%, respectively); for the
School of Business, a larger percentage preferred IBM-
compatible versus Macintosh computers (71% versus 29%,
respectively); and for the School of Education, a smaller
percentage preferred IBM-compatible computers versus Macintosh
computers (14% versus 86%, respectively).
     Cramer's V was used to assess the strength of the
relationship between the two variables. This statistic may
range from zero to +1.00, with values closer to zero
indicating a weaker relationship. For this analysis, Cramer's
V was computed as V = .51.

M) Figure representing the results: See Figure 17.5.
Notes Regarding the Preceding Analysis Report
Reporting the chi-square statistic. In the preceding report, items G and L presented the
obtained value of the chi-square statistic. Many journals in the behavioral sciences report
obtained chi-square statistics according to this format:
     χ²(df, N = N) = value
where
•  df is equal to the degrees of freedom for the chi-square analysis.
•  N is equal to the total sample size.
•  value is the obtained chi-square statistic.


The statistics table at the bottom of Page 1 of the SAS output (Output 17.3) showed that the
chi-square test had 2 degrees of freedom and that the obtained value of chi-square was
97.385. The two-way classification table created by PROC FREQ showed that the total
sample size was 370. This table appeared in Output 17.2, and the total sample size was
provided in the lower right corner of the table.
Therefore, the results for the current analysis were reported as
     χ²(2, N = 370) = 97.385.
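For reference, the degrees of freedom for a chi-square test of independence come directly
from the dimensions of the two-way classification table:
     df = (r – 1)(c – 1)
where r is the number of rows and c is the number of columns. The present table has 2 rows
(the two computer preferences) and 3 columns (the three schools), so df = (2 – 1)(3 – 1) = 2,
which matches the value reported by PROC FREQ.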
Reporting the effect size. Item J in the preceding report provided the index of effect size.
For this analysis, the two-way classification table was larger than 2 × 2, so you used
Cramer's V as your index. Item J from the previous report is reproduced below:
     J) Effect size. Cramer's V was used as the index of effect
     size. For this analysis, Cramer's V = .51. Values of
     Cramer's V may range from zero to +1.00, with values closer
     to zero indicating a weaker relationship between the
     predictor variable and the criterion variable.
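For reference, Cramer's V bears a simple relationship to the chi-square statistic:
     V = √[χ² / (N × min(r – 1, c – 1))]
where r and c are the numbers of rows and columns in the classification table. For the
present 2 × 3 table, min(r – 1, c – 1) = 1, so V = √(97.385 / 370) = √.2632 = .51, the value
that PROC FREQ reported.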
If your table had been a 2 × 2 table instead of a 2 × 3 table, you would have instead reported
the phi coefficient as your index of effect size (remember that the phi coefficient had also
appeared in Output 17.4, presented earlier). If this had been the case, you would have
prepared Item J in this way:
     J) Effect size. The phi coefficient (φ) was used as the
     index of effect size. For this analysis, φ = .51. Values of
     φ may range from approximately –1.00 through zero to
     approximately +1.00, with values closer to zero indicating a
     weaker relationship between the predictor variable and the
     criterion variable.
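In the 2 × 2 case, the corresponding formula is φ = √(χ² / N). Unlike Cramer's V, however,
the phi coefficient is usually reported with a sign: when the two dichotomous variables are
coded numerically, φ equals the Pearson correlation between them, so its sign indicates the
direction of the relationship. This is why its range runs from approximately –1.00 to
approximately +1.00 rather than from zero to +1.00.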
Item L (the formal description of results for a paper) also reported the index of effect size in
its final paragraph. That paragraph is reproduced again below:
     Cramer's V was used to assess the strength of the
     relationship between the two variables. This statistic may
     range from zero to +1.00, with values closer to zero
     indicating a weaker relationship. For this analysis,
     Cramer's V was computed as V = .51.
If it had been appropriate to report the phi coefficient instead of Cramer's V, this paragraph
would have instead been written in this way:
     The phi coefficient (φ) was used to assess the strength
     of the relationship between the two variables. This
     statistic may range from approximately –1.00 through zero
     to approximately +1.00, with values closer to zero
     indicating a weaker relationship. For this analysis, the phi
     coefficient was computed as φ = .51.
Reporting subgroup percentages. Item L from the preceding analysis report described the
column percentages from the two-way crosstabulation table produced by PROC FREQ
(Output 17.5). To review how to interpret these percentages, see the section "4. Review the
column percent figures," presented earlier.
In some cases, it might be better to report the row percentages from the crosstabulation
table instead of the column percentages. The decision will depend on the nature of your
study, what hypotheses you are testing, and what point you are trying to make for the reader.
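To make the distinction concrete: a column percentage answers a question such as "Of the
students within a given school, what percentage preferred each type of computer?", whereas
a row percentage answers "Of the students who preferred a given type of computer, what
percentage were enrolled in each school?" Report whichever framing matches your research
question.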

Example of a Chi-Square Test That Reveals a Nonsignificant Relationship
Overview
This section presents the results of a chi-square test in which the relationship between the
two variables is nonsignificant. These results are presented so that you will be prepared to
write analysis reports for projects in which nonsignificant outcomes are observed.
The study presented here is the same computer preference study described in the preceding
section. Here, the data have been changed so that they will produce nonsignificant results.
Figure 17.6 presents the new data set that will be analyzed in this section.

Figure 17.6. Preference for IBM compatibles versus Macintosh computers as a
function of school of enrollment (nonsignificant results).


The Complete SAS Program


The data from Figure 17.6 will be analyzed with the same SAS program that was presented
earlier. The complete SAS program, including the new data set (from Figure 17.6), is
presented here:
     OPTIONS  LS=80  PS=60;
     DATA D1;
        INPUT   PREF    $
                SCHOOL  $
                NUMBER   ;
     DATALINES;
     IBM   ARTS   40
     IBM   BUS    75
     IBM   ED     68
     MAC   ARTS   50
     MAC   BUS    65
     MAC   ED     72
     ;
     PROC FREQ   DATA=D1;
        TABLES   PREF*SCHOOL / ALL;
        WEIGHT   NUMBER;
        TITLE1  'JANE DOE';
     RUN;

Steps in Interpreting the Output
Overview. As was the case with the earlier data set, the SAS program that performs this
analysis produces two pages of output. Page 1 again presents the two-way classification
table and significance tests, while page 2 presents a number of additional statistics. As
was discussed earlier, your first step should be to review the output for signs of any obvious
errors. To save space, this section will skip those steps, and will move directly to the results
that are relevant to the study's research hypothesis.
1. Review the column percent figures. As was stated in the preceding section, in the
present study, it is particularly revealing to review the classification table one column at a
time, and to pay particular attention to the last entry in each cell: the column percent.
Output 17.6 presents the two-way classification table from the current analysis.


                                JANE DOE

                           The FREQ Procedure

                        Table of PREF by SCHOOL

          PREF      SCHOOL

          Frequency|
          Percent  |
          Row Pct  |
          Col Pct  |ARTS ❶  |BUS ❷   |ED ❸    |  Total
          ---------+--------+--------+--------+
          IBM      |     40 |     75 |     68 |    183
                   |  10.81 |  20.27 |  18.38 |  49.46
                   |  21.86 |  40.98 |  37.16 |
                   |  44.44 |  53.57 |  48.57 |
          ---------+--------+--------+--------+
          MAC      |     50 |     65 |     72 |    187
                   |  13.51 |  17.57 |  19.46 |  50.54
                   |  26.74 |  34.76 |  38.50 |
                   |  55.56 |  46.43 |  51.43 |
          ---------+--------+--------+--------+
          Total          90      140      140      370
                      24.32    37.84    37.84   100.00

Output 17.6. Two-way classification table, requested by PROC FREQ, for the
computer preferences study (nonsignificant results).

First, consider the ARTS column in Output 17.6 (❶). The column percent entries show that
44.44% of the Arts and Sciences students preferred IBM-compatibles, while 55.56%
preferred Macintoshes. These percentages are fairly close to each other, indicating that
approximately equal numbers of students wanted IBM-compatibles versus Macintoshes.
Next, consider the BUS column (❷), which shows a similar trend: 53.57% of the Business
students preferred IBM-compatibles, while 46.43% preferred Macintoshes. The
same trend appears again for the Education students in the ED column (❸), which shows
that 48.57% preferred IBM-compatibles, while 51.43% preferred Macintoshes.
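As a reminder, each column percent is simply the cell frequency divided by its column
total: for the 40 Arts and Sciences students who preferred IBM-compatibles,
     40 / 90 = .4444, or 44.44%.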
In summary, there does not appear to be a strong trend in these data suggesting that the
students' preference for a specific computer depends on their school of enrollment.
Regardless of school, in each sample about half of the students preferred IBM-compatibles,
and half preferred Macintosh computers. This is the type of pattern that you would expect if
the relationship between school of enrollment and computer preference was nonsignificant.
Your next step, therefore, will be to determine whether this relationship is in fact
nonsignificant.
2. Review the chi-square test of independence. As was stated earlier, the statistical null
hypothesis for the current study can be stated as follows:
Statistical null hypothesis (H0): In the population, there is no relationship between
school of enrollment and computer preference.


The results of the chi-square test for this null hypothesis appear in Output 17.7.
              Statistics for Table of PREF by SCHOOL

     Statistic                     DF       Value      Prob
     ------------------------------------------------------
     Chi-Square                     2      1.8967 ❶   0.3874 ❷
     Likelihood Ratio Chi-Square    2      1.8994     0.3869
     Mantel-Haenszel Chi-Square     1      0.1911     0.6620
     Phi Coefficient                       0.0716
     Contingency Coefficient               0.0714
     Cramer's V ❸                          0.0716 ❹

Output 17.7. Chi-square test of independence, requested with PROC FREQ, for
the computer preferences study (nonsignificant results).

Output 17.7 shows that the obtained value of chi-square is 1.8967 (❶), which rounds to
1.897. The analysis had 2 degrees of freedom. This value of chi-square was quite small,
given the degrees of freedom. The probability value, or p value, for this chi-square statistic
is .3874 (❷). This p value is well above the standard criterion of .05, and so you fail to
reject the null hypothesis. In other words, you will retain the null hypothesis, and tentatively
conclude that school of enrollment is unrelated to computer preference in the population.
You will conclude that your results are nonsignificant.
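For reference, the chi-square statistic summarizes how far the observed cell frequencies (O)
depart from the frequencies expected under the null hypothesis (E):
     E = (row total × column total) / N          χ² = Σ [(O – E)² / E]
For example, the expected frequency for the IBM-ARTS cell in Output 17.6 is
(183 × 90) / 370 = 44.51, quite close to the observed frequency of 40; small discrepancies
of this kind across all six cells are what produce the small chi-square value of 1.897.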
3. Review the index of effect size. Because your classification table is larger than 2 × 2, you
will use Cramer's V as your index of effect size. This statistic appears in Output 17.7, in the
row headed "Cramer's V" (❸). You can see that, for the current analysis, Cramer's V =
.0716 (❹), which rounds to .07. With Cramer's V, values closer to zero indicate a weaker
relationship. The value of .07 obtained here is quite close to zero, indicating a relatively
weak relationship between school of enrollment and computer preference.
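Applying the effect size formula given earlier: with min(r – 1, c – 1) = 1 for this 2 × 3 table,
     V = √(1.8967 / 370) = √.00513 = .0716.
This also explains why the Phi Coefficient and Cramer's V rows of Output 17.7 display the
same value: the two indices coincide whenever either variable has only two categories.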
Using a Figure to Illustrate the Results
The cell frequencies (the number of people appearing in each cell) for this study were
presented in the two-way classification table of Output 17.6. Remember that the top number
in each cell represents the number of people in that subgroup. Figure 17.7 presents a bar
chart that displays the cell frequencies of Output 17.6.


Figure 17.7. Bar chart displaying preference for IBM compatibles versus
Macintosh computers as a function of school of enrollment (nonsignificant
results).

Again, notice how the results presented in Figure 17.7 are consistent with the hypothesis that
there is little relationship between school of enrollment and computer preference: Knowing
which school a student is in does not enable you to accurately predict the type of computer
that the student is likely to prefer.
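The text does not reproduce the program that generated Figure 17.7, but a grouped bar
chart of this general kind can be requested with PROC CHART, which was introduced earlier
in this guide. The following is a minimal sketch, assuming the tabular data set D1 (with the
weighting variable NUMBER) from the program presented earlier in this section:
     PROC CHART   DATA=D1;
        VBAR   PREF / GROUP=SCHOOL  SUMVAR=NUMBER  TYPE=SUM;
        TITLE1  'JANE DOE';
     RUN;
Here GROUP=SCHOOL draws a separate cluster of bars for each school, and SUMVAR=NUMBER
with TYPE=SUM makes each bar's height equal to the sum of NUMBER within that cell, that
is, the cell frequency.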
Analysis Report for the Computer Preferences Study
(Nonsignificant Results)
The following report summarizes the results of the preceding analysis. This will serve as a
model of how to prepare a report when you perform a chi-square test of independence and
obtain nonsignificant results.
A) Statement of the research question: The purpose of the
present study was to determine whether there is a relationship
between school of enrollment and computer preference among
college students. Specifically, this study was designed to
determine whether there is a difference between subjects in
the School of Arts and Sciences, the School of Business, and
the School of Education with respect to their preferences for
IBM-compatible computers versus Macintosh computers.
B) Statement of the research hypothesis: There will be a
relationship between school of enrollment and computer
preference such that (a) students in the School of Arts and
Sciences and students in the School of Education will be
likely to prefer Macintosh computers, while (b) students in
the School of Business will be likely to prefer IBM-compatible
computers.
C) Nature of the variables: This analysis involved two
variables:
•  The predictor variable was school of enrollment. This was
   a limited-value variable, was measured on a nominal scale,
   and could assume three values: the School of Arts and
   Sciences, the School of Business, and the School of
   Education.
•  The criterion variable was computer preference. This was a
   dichotomous variable, was measured on a nominal scale, and
   could assume two values: IBM compatible and Macintosh.
D) Statistical test: Chi-square test of independence.
E) Statistical null hypothesis (H0): In the study population,
there is no relationship between school of enrollment and
computer preference.
F) Statistical alternative hypothesis (H1): In the study
population, there is a relationship between school of
enrollment and computer preference.
G) Obtained statistic: χ²(2, N = 370) = 1.897.
H) Obtained probability (p) value: p = .3874.
I) Conclusion regarding the statistical null hypothesis: Fail
to reject the null hypothesis.

J) Effect size: Cramer's V was used as the index of effect
size. For this analysis, Cramer's V = .07. Values of Cramer's
V may range from zero to +1.00, with values closer to zero
indicating a weaker relationship between the predictor
variable and the criterion variable.
K) Conclusion regarding the research hypothesis: These
findings fail to provide support for the study's research
hypothesis.
L) Formal description of the results for a paper:
     Results were analyzed using a chi-square test of
independence. This analysis revealed a nonsignificant
relationship between school of enrollment and computer
preference, χ²(2, N = 370) = 1.897, p = .3874. Figure 17.7
illustrates the number of students who preferred
IBM-compatible computers versus Macintosh computers, broken
down by school of enrollment.


     Cramer's V was used to assess the strength of the
relationship between the two variables. This statistic may
range from zero to +1.00, with values closer to zero
indicating a weaker relationship. For this analysis, Cramer's
V was computed as V = .07.
M) Figure representing the results: See Figure 17.7.

Notes Regarding the Preceding Analysis Report


Reporting the subgroup percentages. Item L in the preceding report provides a formal
description of the results for a paper. Notice that it provides the chi-square value, p value,
and index of effect size, but it does not discuss the column percents (which had been
discussed in the report on the significant relationship, presented earlier). This is because,
when the overall relationship that is being tested is nonsignificant, more detailed results,
such as the column percents, are typically not reported.
Reporting the effect size. In the preceding report, you used Cramer's V as your index of
effect size because the two-way classification table was larger than 2 × 2. If your table had
been a 2 × 2 table instead of a 2 × 3 table, you would have instead reported the phi
coefficient as your index of effect size (the phi coefficient had also appeared in Output
17.7). If this had been the case, you would have prepared Item J in this way:
     J) Effect size. The phi coefficient (φ) was used as the
     index of effect size. For this analysis, φ = .07. Values of
     φ may range from approximately –1.00 through zero to
     approximately +1.00, with values closer to zero indicating a
     weaker relationship between the predictor variable and the
     criterion variable.
You also reported Cramer's V as the index of effect size in the last paragraph of Item L, the
formal description of results for a paper. If you had instead reported the phi coefficient, this
paragraph would have been written this way:
     The phi coefficient (φ) was used to assess the strength
     of the relationship between the two variables. This
     statistic may range from approximately –1.00 through zero
     to approximately +1.00, with values closer to zero
     indicating a weaker relationship. For this analysis, the phi
     coefficient was computed as φ = .07.


Computing Chi-Square from Raw Data


Overview
So far, this chapter has shown you only how to analyze data that are already in tabular form.
In contrast, raw data are data that have not been summarized in tabular form. Regarding the
computer preferences study described above: suppose that you administered your
questionnaire to 370 students, and then keyed responses to the questionnaire with one line of
data for each student. In this case, you would be working with raw data.
When working with raw data, the DATA step as well as the PROC step will follow a
somewhat different format, compared to the format discussed so far. This section shows you
the proper format for performing a chi-square test on raw data.
Inputting Raw Data
Data set to be analyzed. If the data to be analyzed are in raw form, they may be entered
following the procedures discussed in Chapter 4, "Data Input," presented earlier in this
book. For example, assume that the two-item computer preferences questionnaire
(presented at the beginning of this chapter) has been administered to the 370 subjects. Table
17.2 presents fictitious data from a small subset of these 370 subjects.
Table 17.2
Raw Data Set for the Computer Preferences Study
_______________________________
Subject    Preference    School
_______________________________
 001          MAC         ARTS
 002          IBM         BUS
 003          MAC         BUS

[Data lines for the remaining
 subjects would go here]

 368          MAC         ED
 369          IBM         BUS
 370          MAC         ED
_______________________________

The conventions used with Table 17.2 are the same conventions that have been used
throughout this text. Each column in the table represents a different variable. The first
column (headed "Subject") simply assigns a unique subject number to each respondent. The
second column (headed "Preference") indicates whether that subject preferred an
IBM-compatible or a Macintosh. The third column (headed "School") indicates the school
in which the student was enrolled.


Each row in the table provides data from a different student. For example:
•  The first row presents data from Subject 001, who preferred a Macintosh and was enrolled
   in the School of Arts and Sciences.
•  The second row presents data from Subject 002, who preferred an IBM-compatible and
   was enrolled in the School of Business.
And so forth. Remember that Table 17.2 presents data from only the first three subjects and
the last three subjects in the sample. If the table were complete, it would have data for 370
subjects (i.e., it would have 370 lines of data).
The DATA step. Below is the syntax for the DATA step when the data are in raw-score
form:
     OPTIONS  LS=80  PS=60;
     DATA  data-set-name;
        INPUT   subject-number-variable-name
                row-variable-name     $
                column-variable-name  $  ;
     DATALINES;
     [data lines go here]
     ;
Syntax for the previous INPUT statement indicates that the name for the row variable
should come before the name for the column variable, but in fact this order is arbitrary (in
the INPUT statement, at least).
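What actually determines the arrangement of the printed table is the TABLES statement: in
TABLES PREF*SCHOOL, the variable listed first (PREF) defines the rows of the table and the
variable listed second (SCHOOL) defines the columns, which is why the outputs in this
chapter are headed "Table of PREF by SCHOOL."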
The actual DATA step that would input the data from Table 17.2 is as follows (line numbers
have been added on the left):
     1     OPTIONS  LS=80  PS=60;
     2     DATA D1;
     3        INPUT   SUB_NUM
     4                PREF    $
     5                SCHOOL  $ ;
     6     DATALINES;
     7     001  MAC  ARTS
     8     002  IBM  BUS
     9     003  MAC  BUS
           [Remaining data lines would appear here]
     375   368  MAC  ED
     376   369  IBM  BUS
     377   370  MAC  ED
     378   ;

The INPUT statement on lines 3-5 of this program tells SAS that the data set includes three
variables. The first of these three variables is SUB_NUM (which represents subject
number), the second variable is PREF (for computer preference), and the third variable is
SCHOOL (for school of enrollment).
The data lines themselves appear on lines 7-377. You can see that these data lines are the
same as in Table 17.2.
The preceding program specified SCHOOL and PREF as character variables with values
such as "ARTS" and "MAC," but it also would have been possible to code them as numeric
variables. For example, SCHOOL could have been coded so that 1 = Arts and Sciences, 2 =
Business, and 3 = Education. The analysis could have then proceeded in the usual way,
although you would then need to (a) keep a record of exactly which group is represented by
each numeric value, or (b) use the VALUE statement of PROC FORMAT to attach
meaningful value labels (such as "ARTS" and "BUS") to the variable categories when they
are printed, as shown in the sketch below.
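The following is a minimal sketch of that second approach. The format names PREFFMT and
SCHLFMT, and the particular numeric codes, are illustrative choices rather than part of the
original program:
     PROC FORMAT;
        VALUE  preffmt   1 = 'IBM'
                         2 = 'MAC';
        VALUE  schlfmt   1 = 'ARTS'
                         2 = 'BUS'
                         3 = 'ED';
     RUN;
     DATA D1;
        INPUT   SUB_NUM
                PREF
                SCHOOL ;
        FORMAT  PREF   preffmt.
                SCHOOL schlfmt.;
     DATALINES;
     001 2 1
     002 1 2
     003 2 2
     ;
     PROC FREQ   DATA=D1;
        TABLES   PREF*SCHOOL / ALL;
        TITLE1  'JANE DOE';
     RUN;
With the FORMAT statement in place, PROC FREQ prints the labels IBM, MAC, ARTS,
BUS, and ED in the classification table rather than the underlying numeric codes.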
The PROC Step
Below is the syntax for the PROC step that will perform the chi-square analysis using raw
data:
     PROC FREQ   DATA=data-set-name;
        TABLES   row-variable-name*column-variable-name / options ;
        TITLE1  'your name';
     RUN;

Substituting the appropriate SAS variable names into this syntax results in the following
(line numbers have been added on the left):
     379   PROC FREQ   DATA=D1;
     380      TABLES   PREF*SCHOOL / ALL;
     381      TITLE1  'JANE DOE';
     382   RUN;

How do these statements differ from the PROC-step statements presented earlier? The
earlier statements (from the section in which the analysis was performed on tabular data)
included a WEIGHT statement after the TABLES statement. In the example immediately
above, the WEIGHT statement has been dropped. It is not required when you are analyzing
raw data: each line of raw data already represents exactly one subject, so each line should be
counted only once.


The Complete SAS Program


Below is the complete SAS program (containing only a subset of data) that will perform a
chi-square test of independence using raw data from the computer preferences study (line
numbers have been added on the left):
     1     OPTIONS  LS=80  PS=60;
     2     DATA D1;
     3        INPUT   SUB_NUM
     4                PREF    $
     5                SCHOOL  $ ;
     6     DATALINES;
     7     001  MAC  ARTS
     8     002  IBM  BUS
     9     003  MAC  BUS
     ...   [Remaining data lines appear here]
     375   368  MAC  ED
     376   369  IBM  BUS
     377   370  MAC  ED
     378   ;
     379   PROC FREQ   DATA=D1;
     380      TABLES   PREF*SCHOOL / ALL;
     381      TITLE1  'JANE DOE';
     382   RUN;

Interpreting the SAS Output


From this point forward, the analysis proceeds in exactly the same way as when the data set
was based on tabular data. The same options can be requested, and the results are interpreted
in exactly the same way.

Conclusion
This concludes the final chapter of Step-by-Step Basic Statistics Using SAS: Student Guide.
This book was designed to introduce you to just a few of the elementary statistical
procedures that are used in research for the social sciences and education. Given its limited
scope, it has covered only a very small percentage of the statistical procedures that are made
possible with SAS. However, there are a wide variety of additional books available that
show how to perform other types of statistical procedures using SAS software.


A Step-by-Step Approach to Using the SAS System for Univariate and Multivariate Statistics
(Hatcher and Stepanski, 1994) covers many of the same procedures and statistics discussed
in the present text, but at a somewhat more advanced level. It also covers more advanced
statistical procedures, including one-way ANOVA with one repeated-measures factor,
factorial ANOVA with repeated-measures factors and between-subjects factors, multiple
regression, and principal component analysis. Cody and Smith's (1997) Applied Statistics
and the SAS Programming Language covers similar subject matter, and provides a more
in-depth treatment of factorial ANOVA with within-subject factors. It also covers a number
of more advanced statistics.
The SAS Institute's Books by Users program has published a large number of texts that can
be useful to researchers using SAS. These texts range in level of sophistication from the
elementary to the advanced. You may obtain a copy of the SAS Publishing Catalog from
SAS Publications (1-800-727-3228). You may also view the entire SAS Publishing Catalog
on the Web at support.sas.com/pubs.

References
Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Lawrence Erlbaum
   Associates.
Bandura, A. (1965). Influence of models' reinforcement contingencies on the acquisition of imitative
   responses. Journal of Personality and Social Psychology, 1, 589–595.
Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice Hall.
Buunk, B. P., Angleitner, A., Oubaid, V., & Buss, D. M. (1996). Sex differences in jealousy in
   evolutionary and cultural perspective: Tests from the Netherlands, Germany, and the United States.
   Psychological Science, 7, 359–363.
Carpenter, A. L., & Shipp, C. E. (1995). Quick results with SAS/GRAPH software. Cary, NC: SAS
   Institute Inc.
Cody, R. P., & Smith, J. K. (1997). Applied statistics and the SAS programming language, fourth
   edition. Upper Saddle River, NJ: Prentice Hall.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic.
Daly, M., Wilson, M., & Weghorst, S. J. (1982). Male sexual jealousy. Ethology and Sociobiology,
   3, 11–27.
Delwiche, L. D., & Slaughter, S. J. (1998). The little SAS book: A primer, second edition. Cary, NC:
   SAS Institute Inc.
Dicataldo, F., & Grisso, T. (1995). A typology of juvenile offenders based on the judgments of
   juvenile court professionals. Criminal Justice and Behavior, 22, 246–262.
Gilmore, J. (1996). Painless Windows 3.1: A beginner's handbook for SAS users. Cary, NC: SAS
   Institute Inc.
Gilmore, J. (1997). Painless Windows: A handbook for SAS users. Cary, NC: SAS Institute Inc.
Gilmore, J. (1999). Painless Windows: A handbook for SAS users, second edition. Cary, NC: SAS
   Institute Inc.
Hatcher, L. (1994). A step-by-step approach to using the SAS System for factor analysis and
   structural equation modeling. Cary, NC: SAS Institute Inc.
Hatcher, L., & Stepanski, E. J. (1994). A step-by-step approach to using the SAS System for
   univariate and multivariate statistics. Cary, NC: SAS Institute Inc.
Hatcher, L. (2001). Using the SAS windowing environment: A quick tutorial. Cary, NC: SAS
   Institute Inc.
Hays, W. L. (1988). Statistics, fourth edition. New York: Holt, Rinehart, & Winston.
Howell, D. C. (1997). Statistical methods for psychology, fourth edition. Belmont, CA: Duxbury
   Press.
Locke, E. A., & Latham, G. P. (1990). A theory of goal setting and task performance. Englewood
   Cliffs, NJ: Prentice-Hall.
Ludwig, T. D., & Geller, E. S. (1997). Assigned versus participative goal setting and response
   generalization: Managing injury control among professional pizza deliverers. Journal of Applied
   Psychology, 82, 253–261.
Rusbult, C. E. (1980). Commitment and satisfaction in romantic associations: A test of the
   investment model. Journal of Experimental Social Psychology, 16, 172–186.
SAS Institute Inc. (1998). Getting started with the SAS System, version 7. Cary, NC: SAS Institute
   Inc.
SAS Institute Inc. (1999a). SAS language reference: Concepts, version 8. Cary, NC: SAS Institute
   Inc.
SAS Institute Inc. (1999b). SAS language reference: Dictionary, version 8. Cary, NC: SAS Institute
   Inc.
SAS Institute Inc. (1999c). SAS procedures guide, version 8, volumes 1 and 2. Cary, NC: SAS
   Institute Inc.
SAS Institute Inc. (1999d). SAS/STAT user's guide, version 8, volumes 1, 2, and 3. Cary, NC: SAS
   Institute Inc.
SAS Institute Inc. (2000). SAS/GRAPH software: Reference, version 8, volumes 1 and 2. Cary, NC:
   SAS Institute Inc.
Schlotzhauer, S. D., & Littell, R. C. (1997). SAS System for elementary statistical analysis, second
   edition. Cary, NC: SAS Institute Inc.
Spatz, C. (2001). Basic statistics: Tales of distributions, seventh edition. Belmont, CA:
   Wadsworth/Thomson Learning.
Stevens, J. (1986). Applied multivariate statistics for the social sciences. Hillsdale, NJ: Lawrence
   Erlbaum Associates.
Thorndike, R. M., & Dinnel, D. L. (2001). Basic statistics for the behavioral sciences. Upper Saddle
   River, NJ: Merrill Prentice Hall.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

This combined index includes entries for Basic Statistics Using SAS: Student Guide and Exercises.

Page numbers preceded by "E" indicate pages in Basic Statistics Using SAS: Exercises.
All other page numbers refer to this book.

Index
A
absolute magnitude/value
correlation coefficients 295296
z scores 276278
accuracy of data, verifying 181
achievement motivation study (example)
218222
creating variable conditionally 239241
data manipulation and subsetting statements,
combined 256260
eliminating missing data 252256
recoding and creating variables 235239
ADJUST= option, LSMEANS statement (GLM)
626
aggression in children study (example)
See child aggression study (example)
All Files option (Open dialog) 80
ALL option, TABLES statement (FREQ) 648
Allow cursor movement past end of line
option
58
alpha level 336
ALPHA= option
CORR procedure 334
LSMEANS statement (GLM) 626627
MEANS statement (GLM) 509, 572573
TTEST procedure 396397, 401, 432, 470,
476477
alternative explanations 300302
alternative (directional) hypotheses 2021
memory performance with Gingko biloba
423426
one- and two-tailed tests 406407
paired-samples t tests 462
single-sample t tests 389390
analysis reports
bivariate regression, negative coefficient
376378
bivariate regression, nonsignificant
coefficient
380383
bivariate regression, positive coefficient
367371
chi-square test of independence 658–661, 665–667
factorial ANOVA 597–607, 614–616, 622–624
independent-samples t tests 443445,
448450
nonsignificant correlation coefficients
328329
one-way ANOVA 526529, 535537
paired-samples t tests 479482, 485487
Pearson correlation coefficient 318320
significance tests 318320
single-sample t tests, nonsignificant results
410411
single-sample t tests, significant results
405406
statistical null hypothesis 318320
AND statements 244, 252
ANOVA, factorial
See factorial ANOVA with two betweensubjects factors
ANOVA, one-way
See one-way ANOVA with betweensubjects
factor
Appearance Options tab (Enhanced Editor) 58
approximately normal distribution 191193,
E47
stem-and-leaf plots 191192
arithmetic operators 231
arrow keys, navigating SAS programs 81
assignment statements 233
association, tests of 383
assumptions, statistical 35
average
See central tendency statistics
See mean

B
balanced factorial ANOVA designs 575, 625
bar charts
See also frequency bar charts
subgroup-mean bar charts 161, 174177
basic statistical measures table 184, 192

best-fitting lines for variables
See bivariate linear regression
between-subjects designs 416417
See also factorial ANOVA with two
between-subjects factors
See also one-way ANOVA with betweensubjects factor
bimodal distributions 198–200, E47
mean, median, mode 199
stem-and-leaf plots 198
binary (dichotomous) variables 2728
type-of-variable figures for 33
bivariate correlation
See correlation coefficients
bivariate linear regression 338, 341383, E100,
E115
appropriate situations and assumptions
344346
criterion variables 347348
degrees of freedom 354, 371
drawing regression line through scattergrams
358364, 372374
negative regression coefficient 371379
nonsignificant regression coefficient
379383
p value 371
positive regression coefficient 350371
predictor variables 347348
residuals of prediction 365367
weight loss, correlating with predictor
variables 346350
BON option, MEANS statement (GLM) 509

C
campaign finance studies
See political donations study #1 (example)
See political donations study #2 (example)
categorical variables 22
nominal scales 25
causal relationships, correlations to investigate
299303

cause-and-effect relationships 30, 300302


central tendency statistics
See also mean
median (median score) 186, 192, 194196,
199
mode (modal score) 185, 192, 194196,
199
UNIVARIATE procedure for 183184
variance and standard deviation 204213
character value conversions 245248
character variables
$ for 128
frequency bar charts for 163164
in IF-THEN statements 245248
in INPUT statement 128130
CHART procedure
frequency bar charts 163173, E33, E38
HBAR statement 163, 168173, 175
subgroup-mean bar charts 161, 174177
VBAR statement 163, 168173 , 175
charts
See graphs
chi-square statistic 637
chi-square test of independence 631671,
E209, E218
analysis reports 658661, 665667
appropriate situations for 631634
computing from raw data 642643,
668671
criterion variables 632, 644
DATA step 646647, 669670
DATALINES statement 646647
FREQ procedure for 647649, 670, E209,
E218
graphing test results 655657
index of effect size 638639, 654655, 660
INPUT statement 646647, 669670
nonsignificant relationship 661667
output files 649655
p value 638, 654
plotting 655657
predictor variables 631, 644
significant relationship 643661
statistical alternative hypothesis 637
statistical null hypothesis 637
two-way classification tables 634637,
641642
type-of-variable figures 632
child aggression study (example) 494496
criterion variables 428430
factorial ANOVA, significant effect for
predictor A only 554557
factorial ANOVA, significant effect for
predictor B only 557558

factorial ANOVA, significant interaction
561564
factorial ANOVA graphs, interpreting
550553, 592594
factorial ANOVA graphs, preparing
595596, 612, 620
factorial design matrix 548550
nonsignificant main effects and interaction
607617
nonsignificant results 446450
nonsignificant treatment effect 529537
predictor variables 428430
significant interactions 617625
significant main effects, nonsignificant
interaction 565607
significant results 428445
significant treatment effects 505529
CHISQ option, TABLES statement (FREQ) 648
CL option, LSMEANS statement (GLM) 626
classification variables 22
nominal scales 25
CLDIFF option, MEANS statement (GLM) 509,
572
Clear All command 73
Clear text on submit option 58, 73, 78, 106
clear window contents 73
clicking, defined 49
closing windows 54
coefficient alpha 334
coefficient of determination 296, 357358, 375
column input 115116
columns, data set 23, 113115
comparison number 395
comparison operators 241242
subsetting data 252
computer preferences among college students
(example) 640642
chi-square test, nonsignificant relationship
661667
chi-square test, significant relationship
643661
conditional data modification 239248
conditional statements 244
subsetting data 252
confidence intervals
independent-samples t tests 426, 440441,
447
paired-samples t tests 462, 463, 476, 484
single-sample t tests 391, 401, 409
control group, defined 32
Copy command 8892
copying lines into SAS programs 8892
CORR procedure 313320, E91, E100
ALPHA= option 334

computing all possible correlations


320323
computing Pearson correlation coefficient
313320, E91, E100
COV option 334
KENDALL option 334
NOMISS option 335
NOPROB option 335
options for 333335
RANK option 335
SPEARMAN option 333, 335
suppressing correlation coefficients
329331
VAR statement 314, 321, 329331
WITH statement 329331
correcting errors in SAS programs 97100,
107108
correlated-samples t tests
See t tests, paired-samples
correlation
negative 294295, 317318
perfect 296
positive 293294, 317318
zero 2627
correlation coefficients (bivariate correlation)
290338
See also CORR procedure
See also Pearson correlation coefficient
absolute magnitude/value 295296
calculating E91, E100
causal relationships, investigating 299303
coefficient of determination 296, 357358,
375
correlating weight loss with predictor
variables 303307
matrices of 316317
nonsignificant, summarizing results
324329
sample size and 318
scattergrams, creating 307313, 325327
sign and size of, interpreting 293296,
317318
significant results 335338
Spearman rank-order correlation coefficient
332333
statistical alternative hypothesis 297, 298
statistical null hypothesis 297299
statistical significance, interpreting
297299
suppressing 329331
correlational research 2930, 3637, 300303,
341342
COV option, CORR procedure 334

criminal activity among juvenile offenders
(example) 632639
criterion variables 30, 341344
bivariate linear regression 347348
chi-square test of independence 632, 644
child aggression study 428430
factorial ANOVA 542
independent-samples t tests 417419
one-way ANOVA 491
paired-samples t tests 453
Pearson correlation coefficient 291
scattergrams, creating 307313, 325327
women's responses to sexual infidelity 466,
484
crosstabulation tables 643
Cut command 92

D
DAT files 62
data 200201
See also central tendency statistics
See also data modification
See also inputting data
See also subsetting data
analyzing with MEANS and FREQ
procedures 131138
creating and analyzing E9, E14
creating and modifying E63
defined 21
distribution shape 190200
missing, eliminating observations with
252256
missing, representing in input 124, 127
printing raw data 139142, 222225
stem-and-leaf plots 187198, E56
variability measures 200203, E56
verifying accuracy of 181
data files 62
data inputting
See inputting data
data modification 217260, E63
See also subsetting data
combining several statements 256260
conditional operations 239248
creating z-score variables 272, 280
duplicating variables with new names
228230
placing statements in DATA step 225228
reasons for 217
DATA= option, PROC statements 252

data screening for factorial ANOVA 570


data sets 23, 113115
DATA statement 118
DATA step 41, E9, E14
See also examples by name
chi-square test of independence 646647,
669670
data manipulation and subsetting statements
225228
one-way ANOVA 507, 530
data subsetting
See subsetting data
DATALINES statement 120
chi-square test of independence 646647
debugging SAS programs 97100, 107108
degrees of freedom 209
bivariate linear regression 354, 371
deleting lines from SAS programs 8487
dependent variables 3132, 342344
descriptive statistics 204
determination, coefficient of 296, 357358,
375
Deviation IQs 262
df (degrees of freedom) 209
bivariate linear regression 354, 371
dichotomous (binary) variables 2728
type-of-variable figures for 33
differences between the means
independent-samples t tests 436443, 447
paired-samples t tests 461463, 475476,
484
directional (alternative) hypotheses 2021
independent-samples t tests 423426
memory performance with Gingko biloba
423426
one- and two-tailed tests 406407
paired-samples t tests 462
single-sample t tests 389390
DISCRETE option, VBAR/HBAR statements
(CHART) 172173
distribution shape 190200, E47
approximately normal distribution 191193
bimodal distributions 198200
skewed distributions 196197
UNIVARIATE procedure 190200, E47
dollar sign ($), for character variables 128
double-clicking, defined 49
Drag and drop text editing option 58
DUNCAN option, MEANS statement (GLM)
509
DUNNETT option, MEANS statement (GLM)
509


E
editing SAS programs 8193
Editor window 52
maximizing 5556
menu bar 59
title bar 67
Window menu 5960
effect size 391393, 402404
chi-square test of independence 638639,
654655, 660
independent-samples t tests 426427,
441443, 447448
one-way ANOVA 499500
paired-samples t tests 463, 477479, 484
single-sample t tests 409
ELSE statements 243
emotional responses to infidelity
See women's responses to sexual infidelity
(example)
Enhanced Editor Options dialog 5658, 7778,
106
equality of variances, testing 437438, 447
error messages 43, 70
debugging SAS programs 97100, 107108
estimated population standard deviation 209,
212213, E56
calculating to create z-score variables 269,
E77, E84
estimated population variance 209, 212213,
E56
executing SAS programs 6769, E3, E6
with errors 96101
exiting SAS 74
expected frequencies 638
EXPECTED option, TABLES statement (FREQ)
648
experimental conditions 32, 497
experimental groups, defined 32
experimental research 3132, 300, 342343
experiments
three conditions 3536
two conditions 3435
Explorer window 53
closing 54
maximizing 5556
extremes table 184

F
F test for equality of variances 437438, 447
factorial ANOVA with two between-subjects
factors 542628, E187, E198

aggression study 546550


analysis reports 597607, 614616,
622624
appropriate situations for 542545
assumptions for 545
balanced designs 575, 625
criterion variables 542
data screening and assumption tests 570
GLM procedure 571573
graphs, interpreting 550553, 592594
graphs, preparing 595596, 612, 620
LSMEANS statement 627
no main effects 559
nonsignificant main effects and interaction
607617
output files 575592
predictor variables 542
significant effect for both predictors
558559
significant effect for predictor A only
554557
significant effect for predictor B only
557558
significant interactions 560564, 617625
significant main effects, nonsignificant
interaction 565607
simple effects, testing for 622
summary tables 583586
type-of-variable figures 543
unbalanced designs 575, 625627
file name extensions for SAS programs 6162
file types 62, 6364
Files of type (Open dialog) 80
financial donations studies
See political donations study #1 (example)
See political donations study #2 (example)
FISHER option, TABLES statement (FREQ)
648
Fisher's exact test 638
fitting straight lines to variables
See bivariate linear regression
floppy disks, saving programs on 6466
formatted input 116
free-formatted input 115
FREQ procedure 131134, 147, E9, E14, E21,
E27
analyzing data with 131138
chi-square test of independence 647649,
670, E209, E218
creating frequency tables 153156
interpreting results of 137138
questions answerable by frequency tables
157158
TABLES statement 647649

FREQ procedure (continued)
WEIGHT statement 647, 670
frequency bar charts 160, 162173, E33, E38
See also frequency tables
character variables 163164
controlling number of bars 168169
numeric variables 165167
setting bar midpoints 170171
values as separate bars 172173
frequency tables
See also frequency bar charts
creating 147158, E21
numeric variables 165171
questions answerable by 157158
reviewing 137138
stem-and-leaf plots 187198, E56

G
GABRIEL option, MEANS statement (GLM)
509
gathering data 21
GE (greater than or equal to) operator 240242
General Options tab (Enhanced Editor) 58,
7778, 106
Gingko biloba study
See memory performance with Gingko biloba
(example)
GLM procedure
factorial ANOVA with two between-subjects
factors 571573
LSMEANS statement 575, 625627
MEANS statement 509511, 572573
MODEL statement 572
one-way ANOVA 508509, 530, E167,
E177
goal setting theory E38
graphs
See also frequency tables
See also scattergrams
See also type-of-variable figures
chi-square test of independence 655657
factorial ANOVA graphs, interpreting
550553, 592594
factorial ANOVA graphs, preparing
595596, 612, 620
frequency bar charts 160, 162173, E33, E38
one-way ANOVA results 525, 534535
resolution 160
stem-and-leaf plots 187190
subgroup-mean bar charts 161, 174177

group differences, tests of 383


GT (greater than) operator 242

H
H0 option, TTEST procedure 396397,
469470
HBAR statement, CHART procedure 163
DISCRETE option 172173
LEVELS= option 168169
MIDPOINTS= option 170171
SUMVAR option 175
TYPE= option 175
high-resolution graphics 160
hyphens (-) in variable names 130
hypotheses 1621
See also statistical alternative hypotheses
See also statistical null hypotheses
nondirectional vs. directional 1921
representing with figures 34

I
I-beam pointer 49
IF-THEN control statements 239248, E70
See also subsetting data
character variables 245248
comparison operators 241242
conditional statements 244
ELSE statements 243
Indentation setting 58, 78, 106
independence tests
See chi-square test of independence
independent samples 415417
independent-samples t tests
See t tests, independent-samples
independent variables 3132, 342344
levels of 32, 497
subject vs. true independent variables
544545
index of effect size 391393, 402404
chi-square test of independence 638639,
654655, 660
independent-samples t tests 426427,
441443, 447448
one-way ANOVA 499500
paired-samples t tests 463, 477479, 484
single-sample t tests 409
index of variance accounted for 499500
industrial psychology study (example)
291292
alternative explanations 301
positive and negative correlations 294295

inferential statistics 205
infidelity, responses to
See women's responses to sexual infidelity
(example)
INFILE statement 120
INPUT statement 115116, 119
character variables in 128130
data for chi-square tests of independence
646647, 669670
rules for list input 126130
variable names in 119
inputting data 113144
analyzing data with MEANS and FREQ
procedures 131138
column input 115116
eliminating observations with missing data
252256
example of (financial donations study #1)
122143
example of (three quantitative variables)
117121
formatted input 116
free-formatted input 115
list input 115, 126130
printing raw data 139142
representing missing data 124, 127
inserting lines into SAS programs 8184
instruments for gathering data 21
interaction between predictors in factorial
ANOVA 560564
child aggression study 617625
interaction, defined 560
simple effects, testing for 622
intercept constant of regression line 354, 375
interquartile range 202203
interval scales 26

LEVELS= option, VBAR/HBAR statements


(CHART) 168169
limited-value variables 28
type-of-variable figures for 33
line numbers, displaying 5658, 7778, 106
linear regression, bivariate
See bivariate linear regression
linear relationships between variables 307313
LINESIZE= option, OPTIONS statement 109,
118, 161
list input 115, 126130
log files 4244, 52, 62
analyzing 134135
creating new variables from existing
variables 238239
debugging with 97100, 107108
factorial ANOVA 574575
reviewing contents of 6972
Log window 52
debugging with 97100, 107108
reviewing and printing contents 6972
Look in (Open dialog) 7980, 96, 106107
low-resolution graphics 160
LS= option, OPTIONS statement 109, 118,
161
LSMEANS statement, GLM procedure 575,
625627
ADJUST= option 626
ALPHA= option 626627
CL option 626
factorial ANOVA 627
PDIFF option 626
LST files 62
LT (less than) operator 240242

M
J
juvenile offenders, criminal activity of (example)
632639

K
KENDALL option, CORR procedure 334
Kendall's tau-b coefficient 334

L
LE (less than or equal to) operator 242
levels of independent variables 32, 497

main effects in factorial ANOVA


no main effects 559
nonsignificant main effects and interaction
607617
significant effect for both predictors
558559
significant effect for predictor A only
554557
significant effect for predictor B only
557558
significant main effects, nonsignificant
interaction 565607
manipulated variables 31
matched-subjects designs 416
matrices of correlation coefficients 316317
maximizing Editor window 5556

mean 186
See also differences between the means
approximately normal distribution 192
bimodal distributions 199
calculating to create z-score variables
269270, 279, E77, E84
comparing variables with different means
263265
computing with UNIVARIATE procedure
186187
factorial ANOVA graphs, interpreting
551553, 592594
factorial ANOVA graphs, preparing
595596, 612, 620
plotting for subgroups 161, 174177
reviewing for new z-score variables
275276, 283
skewed distributions 194196
MEANS procedure 131134, E9, E14, E56,
E70
analyzing data with 131138
creating z-score variables 269270, 273,
279, 281, E77, E84
interpreting results of 135136
paired-samples t tests 468469
VARDEF= option 210213, 269, 279
variance and standard deviation 210213
MEANS statement, GLM procedure 509511,
572573
ALPHA= option 509, 572573
BON option 509
CLDIFF option 509, 572
DUNCAN option 509
DUNNETT option 509
GABRIEL option 509
REGWQ option 509
SCHEFFE option 509
SIDAK option 509
SNK option 509
T option 509
TUKEY option 509, 572
measurement scales 2427
measures of central tendency
See central tendency statistics
measures of variability 200203, E56
MEASURES option, TABLES statement
(FREQ) 648
median (median score) 186, 192
bimodal distributions 199
computing with UNIVARIATE procedure
186
skewed distributions 194196
memory performance with Gingko biloba
(example)

hypothesis tests 419426


paired-samples and single-sample t tests
457460
menu bar (Editor window) 59
mid-term test scores, comparing (example)
266268
converting multiple raw-score variables into
z-score variables 278285
converting single raw-score variables into
z-score variables 268278
MIDPOINTS= option, VBAR/HBAR
statements (CHART) 170171
missing data
eliminating 252256
representing in input 124, 127
mode (modal score) 185, 192
bimodal distributions 199
computing with UNIVARIATE procedure
185
skewed distributions 194196
model-rewarded and -punished conditions
See child aggression study (example)
MODEL statement, GLM procedure 572
MODEL statement, REG procedure 351352
P option 352, 365367
STB option 351
modifying data
See data modification
moments table 184
multi-value variables 2829
type-of-variable figures for 33
multiple comparison procedures, one-way
ANOVA 498
all significant, significant treatment effect
500502, 505529
nonsignificant treatment effect 504,
529537
some significant, significant treatment effect
502503

N
names, of SAS programs 6162, 67
names, variables
See variable names
naturally occurring variables 2930
navigating among system windows 60
NE (not equal to) operator 242
negative bivariate regression coefficient
371379
negative correlation 294295, 317318
negative relationships between variables
312313

negatively skewed distributions 196197, E47
nominal scales 25
NOMISS option, CORR procedure 335
nondirectional hypotheses 1921
independent-samples t tests 420423
memory performance with Gingko biloba
420423
one- and two-tailed tests 406407
paired-samples t tests 462
nonexperimental (correlational) research
2930, 3637, 300303, 341342
nonlinear relationships between variables
307313
nonmanipulative research 2930, 3637,
300303, 341342
nonsignificant bivariate regression coefficients
379383
nonsignificant correlations
p value 325
scattergrams 325327
summarizing results 324329
nonsignificant treatment effect (one-way
ANOVA) 504, 529537
nonstandardized regression coefficient 355356,
375
NOPROB option, CORR procedure 335
normal distributions 191193, E47
NORMAL option, UNIVARIATE procedure
183
normality, tests for 184, 191193
null hypothesis
See statistical null hypotheses
null statement 121
numeric value conversions 245248

O
observational research 2930, 3637, 300303,
341342
observational units 23
observations
defined 113
eliminating missing data 252256
observational units 23
reviewing for validity 135
obtaining data 21
one-sided (one-tailed) hypotheses
See directional (alternative) hypotheses
one-tailed t tests 406407
one-way ANOVA with between-subjects factor
491537
analysis reports 526529, 535537
appropriate situations for 491494

assumptions for 493


between-subjects designs E167, E177
child aggression study 494496
criterion variables 491
DATA step 507, 530
GLM procedure 508509, 530, E167, E177
graphing results 525, 534535
index of effect size 499500
multiple comparison procedures 498
nonsignificant treatment effect 504,
529537
plotting 525, 534535
predictor variables 491
significant treatment effect, all multiple
comparison tests significant 500502,
505529
significant treatment effect, some multiple
comparison tests significant 502503
statistical alternative hypothesis 497
statistical null hypothesis 497
summary tables 516
treatment effects 497
type-of-variable figures 491
Open dialog 7880, 96, 106107
opening SAS programs 7880, 96, 106107
operators
arithmetic operators 231
comparison operators 241242, 252
precedence of 232
OPTIONS statement 109, 117118, 161, 306,
510
LINESIZE= option 109, 118, 161
PAGESIZE= option 109, 118, 161, 306
OR statements 244
subsetting data 252
ordinal scales 25
Spearman rank-order correlation coefficient
332333
organization behavior study
See prosocial organization behavior study
(example)
organizational fairness study
See perceived organizational fairness study
(example)
out-of-bounds values 136
output files 4445
analyzing 137138
chi-square test of independence 649655
factorial ANOVA 575592
LSMEANS statement, factorial ANOVA
627
one-way ANOVA with between-subjects
factor 511525, 531534
reviewing contents of 7273

Output window 53
controlling size of printed page 109
opening 60
reviewing and printing contents 7273
overtype mode 49

P
P option, MODEL statement (REG) 352,
365367
p value 193
bivariate linear regression 371
chi-square test of independence 638, 654
nonsignificant correlations 325
Pearson correlation coefficient 297298
single-sample t tests 406407
page size 109
PAGESIZE= option, OPTIONS statement 109,
118, 161, 306
paired samples 416
paired-samples t tests
See t tests, paired-samples
PAIRED statement, TTEST procedure 470
parentheses in formulas 232
Paste command 8892
PDIFF option, LSMEANS statement 626
Pearson chi-square test
See chi-square test of independence
Pearson correlation coefficient 290293, 639
analysis reports 318320
assumptions for 293
causal relationships, investigating 299303
computing with CORR procedure 313320,
E91, E100
correlating weight loss with predictor
variables 303307
criterion variables 291
p value 297298
predictor variables 291
scattergrams, creating 307313, 325327
significant results 335338
suppressing correlation coefficients 329331
type-of-variable figures 291
perceived organizational fairness study
(example) 291292
alternative explanations 301
positive and negative correlations 294295
perfect correlation 296
period (.) for missing data 124, 127
PLOT option, UNIVARIATE procedure 183,
187
PLOT procedure 307313, E100, E115
See also scattergrams

nonsignificant correlations 325327


plotting
See also frequency tables
See also scattergrams
See also stem-and-leaf plots
See also type-of-variable figures
chi-square test of independence 655657
factorial ANOVA graphs, interpreting
550553, 592594
factorial ANOVA graphs, preparing
595596, 612, 620
frequency bar charts 160, 162173, E33,
E38
one-way ANOVA results 525, 534535
resolution 160
subgroup-mean bar charts 161, 174177
political donations study #1 (example)
122143
analyzing data with MEANS and FREQ
procedures 131138
complete program 142143
data input 125130
DATA step 125
printing raw data 139142
political donations study #2 (example)
148150
creating frequency tables 153156
data input 151152
DATA step 151152
distribution shape 190200
frequency bar charts 163173
stem-and-leaf plot 183187
subgroup-mean bar charts 174177
population standard deviation 206207, 212
estimated 209, 212213, E56
z-score variables 269, E77, E84
population variance 205, 207, 212
estimated 209, 212213, E56
populations 204
populations, samples representing
See t tests, single-sample
positive bivariate regression coefficient
350371
positive correlation 293294, 317318
positive relationships between variables
312313
positively skewed distributions 194196, E47
precedence of operators 232
predicting relationships between variables
See hypotheses
predictor variables 30, 341344
See also weight loss, correlating with
predictor variables (example)
bivariate linear regression 347348

chi-square test of independence 631, 644
child aggression study 428430
factorial ANOVA 542
independent-samples t tests 417419
one-way ANOVA 491
paired-samples t tests 453
Pearson correlation coefficient 291
scattergrams, creating 307313, 325327
women's responses to sexual infidelity 466
Print dialog 71
PRINT procedure 139142, 222225, E9, E14,
E70, E77, E84
printing
Log window contents 7172
Output window contents 7273
page size 109
raw data 139142, 222225
PROC statements 41, 121
DATA= option 252
programs
See SAS programs
prosocial organization behavior study (example)
291292
alternative explanations 301
positive and negative correlations 294295
PS= option, OPTIONS statement 109, 118, 161,
306
psychology study
See industrial psychology study (example)

Q
qualitative variables 22, 25
quantitative variables 22
See also z scores
advantages of 263265, 276278, 284285
types of 263
quartiles table 184, 201
quasi-interval scales 26
quitting SAS 74
quotation mark, single (') 245

R
R2 statistic 499500
randomized-subjects designs 415
range 200201
interquartile range 202203
semi-interquartile range 203
RANK option, CORR procedure 335
ratio scales 27

raw data files 62


See also data
See also inputting data
computing chi-square test of independence
642643, 668671
raw-score variables 262
converting multiple into z-score variables
278285
converting singly into z-score variables
268278
STANDARD procedure with 285286
reading comprehension
See spatial recall in reading comprehension (example)
Recall Last Submit command 74
recoding reversed variables 233–234
example 235–239
REG procedure
MODEL statement 351–352, 365–367
negative regression coefficient 371–379
nonsignificant regression coefficient 379–383
positive regression coefficient 350–371
regression analysis exercises E100, E115
residuals of prediction 365–367
syntax for 351–352
regression, bivariate linear
See bivariate linear regression
regression coefficient 354, 375
nonstandardized significance tests 355–356, 375
standardized 351
standardized significance tests 356–358, 375
regression line
drawing through scattergrams 358–364, 372–374
intercept constant of 354, 375
slope of 354, 355–356, 375
Y-intercept of 354, 375
REGWQ option, MEANS statement (GLM) 509
related-samples t tests
See t tests, paired-samples
relationships between variables, predicted
See hypotheses
renaming variables 228–230
repeated-measures designs 416
paired-samples and single-sample t tests 457–460
research
basic approaches 29–32
experimental 31–32, 300, 342–343

research (continued)
nonexperimental 29–30, 36–37, 300–303, 341–342
research question, defined 16
research hypothesis
See hypotheses
residuals of prediction 365–367
resolution of graphics 160
restarting SAS 75
Results window 53
closing 54
reversed variables, recoding 233–234
example 235–239
rows, data set 23, 113–115
Ryan-Einot-Gabriel-Welsch multiple range test 509

S
samples 204
See also t tests, independent-samples
See also t tests, paired-samples
See also t tests, single-sample
independent vs. paired samples 415–417
relative positions of scores in 263
sampling error 337
size, with correlation coefficient computations 318
standard deviation 208–209, 210–212
standard deviation, creating z-score variables 269–270, 279, E77, E84
variance 208–209, 210–212, E56
SAS 5
exiting 74
for statistical analysis 6
restarting 75
starting 50–52, 75, 105–106
SAS Files option (Open dialog) 80
SAS log files
See log files
SAS output files
See output files
SAS programs 37–42
copying lines into 88–92
debugging 97–100, 107–108
deleting lines from 84–87
editing 81–93
file name extensions for 61–62
inserting lines 81–84
moving lines within 92–93
names of 61–62, 67
navigating 81
opening 78–80, 96, 106–107
saving 63–67
scrolling 62–63
searching for 78–80, 96, 106–107
submitting for execution 67–69, E3, E6
submitting for execution, with errors 96–101
text editors for writing 41–42
typing 61–63
SAS windowing environment
See windowing environment
Save As dialog 65–67
saving SAS programs 63–67
scales of measurement 24–27
scattergrams 307–313, E91, E100, E115
bimodal distributions 307–313, 325–327
correlation coefficients 307–313, 325–327
criterion variables 307–313, 325–327
drawing regression line through 358–364, 372–374
nonsignificant correlations 325–327
Pearson correlation coefficient 307–313, 325–327
PLOT procedure for E100, E115
predictor variables 307–313, 325–327
residuals of prediction 365–367
SCHEFFE option, MEANS statement (GLM) 509
scores
See also quantitative variables
See also z scores
median score 186, 192, 194–196, 199
mid-term test scores (example) 266–285
modal score 185, 192, 194–196, 199
raw-score variables 262, 268–286
relative position in samples 263
T scores 262
screening data for factorial ANOVA 570
scrolling SAS programs 62–63
semi-interquartile range 203
semicolon (;) in SAS code 40
SET statement 252
sexual infidelity, responses to
See women's responses to sexual infidelity (example)
Shapiro-Wilk statistic 193
Show line numbers option 58, 78, 106
SIDAK option, MEANS statement (GLM) 509
sign
correlation coefficients 293–295, 317–318
z scores 276–278
significance tests
analysis reports 318–320
nonstandardized regression coefficient 355–356, 375

slope of regression line 355–356, 375
standardized regression coefficient 356–358, 375
significant treatment effect (one-way ANOVA)
all multiple comparison tests significant 500–502, 505–529
some multiple comparison tests significant 502–503
simple effects, testing for 622
single quotation mark (') 245
single-sample t tests
See t tests, single-sample
skewed distributions 194–197, E47
negatively skewed 196–197, E47
positively skewed 194–196, E47
slope of regression line 354, 375
significance tests 355–356, 375
SNK option, MEANS statement (GLM) 509
spatial recall in reading comprehension (example)
nonsignificant t statistic 407–411
significant t statistic 393–406
SPEARMAN option, CORR procedure 333, 335
Spearman rank-order correlation coefficient 332–333
standard deviation 204–210, E56
comparing variables with different 263–265
estimated population standard deviation 209, 212–213, E56
estimated population standard deviation, creating z-score variables 269, E77, E84
MEANS procedure for 210–213
population standard deviation 206–207, 212
reviewing for new z-score variables 275–276, 283
sample standard deviation 208–209, 210–212
sample standard deviation, creating z-score variables 269–270, 279, E77, E84
single-sample t tests 402
standard error of the difference between the means 436, 437, 447
STANDARD procedure 285–286
standardized regression coefficient 351
significance tests 356–358, 375
t statistic 356–358, 375
standardized variables 262
See also z scores
STANDARD procedure for 285–286
starting SAS 50–52, 75, 105–106
statistical alternative hypotheses 19
chi-square test of independence 637
correlation coefficients 297, 298
independent-samples t tests 419–426, 439–440
nondirectional vs. directional 20–21
one-way ANOVA 497
paired-samples t tests 461–462
single-sample t tests 389–390
women's responses to sexual infidelity (example) 466
statistical analysis, SAS for 6
statistical assumptions 35
statistical measures table 184, 192
statistical null hypotheses 18–19
analysis reports 318–320
chi-square test of independence 637
correlation coefficients 297–299
independent-samples t tests 419–426, 438–439
one-way ANOVA 497
paired-samples t tests 461–462, 476
significant results 335–338
single-sample t tests 389
women's responses to sexual infidelity (example) 466
statistics
See also central tendency statistics
See also t statistic
chi-square statistic 637
descriptive 204
inferential 205
R² statistic 499–500
Shapiro-Wilk statistic 193
STB option, MODEL statement (REG) 351
stem-and-leaf plots 187–190, E56
approximately normal distribution 191–192
bimodal distributions 198
negatively skewed distributions 196
positively skewed distributions 194
UNIVARIATE procedure 187–190
string variables 116
Student-Newman-Keuls multiple range test 509
subgroup-mean bar charts 161, 174–177
subject vs. true independent variables 544–545
submitting SAS programs for execution 67–69, E3, E6
with errors 96–101
subsetting data 217, 248–256, E70
combining several statements 256–260
comparison operators and statements 252
multiple subsets 251–252
placing statements in DATA step 225–228
syntax for 248
summary tables, ANOVA 516, 583–586

SUMVAR option, VBAR/HBAR statements (CHART) 175
suppressing correlation coefficients 329–331
syntax, guidelines for 131

T
T option, MEANS statement (GLM) 509
T scores 262
t statistic E123, E129
See also t tests, independent-samples
See also t tests, paired-samples
See also t tests, single-sample
nonstandardized regression coefficient 355–356, 375
standardized regression coefficient 356–358, 375
t tests, independent-samples 415–450, E135, E142
analysis reports 443–445, 448–450
appropriate situations and assumptions 417–419
child aggression study, nonsignificant results 446–450
child aggression study, significant results 428–445
confidence intervals 426, 440–441, 447
criterion variables 417–419
directional hypotheses 423–426
effect size 426–427, 441–443, 447–448
nondirectional hypotheses 420–423
paired-samples t tests vs. 415–417, 453
predictor variables 417–419
standard error 437
statistical alternative hypothesis 419–426, 439–440
statistical null hypothesis 419–426, 438–439
t statistic 438–440, 447, E135, E142
TTEST procedure E135, E142
type-of-variable figures 417–419
t tests, one-tailed 406–407
t tests, paired-samples 453–488, E151, E159
analysis reports 479–482, 485–487
appropriate situations and assumptions 453–456
confidence intervals and effect size 463
criterion variables 453
effect size 463, 477–479, 484
hypothesis tests 461–462
independent-samples t tests vs. 415–417, 453
MEANS procedure 468–469
predictor variables 453
repeated-measures designs 457–460
single-sample t tests vs. 457–460
statistical alternative hypothesis 461–462
statistical null hypothesis 461–462, 476
t statistic 475, E151, E159
TTEST procedure 468–469, E151, E159
type-of-variable figures 453
women's responses to sexual infidelity, nonsignificant results 483–487
women's responses to sexual infidelity, significant results 463–482
t tests, single-sample 387–410, E123, E129
analysis reports 405–406, 410–411
appropriate situations and assumptions 387–388
confidence intervals and effect size 391–393, 401, 409
hypothesis tests 389–390
nonsignificant results 407–411
one- and two-tailed tests 406–407
paired-samples t tests vs. 457–460
repeated-measures designs 457–460
spatial recall in reading comprehension 393–406
standard deviation 402
statistical alternative hypothesis 389–390
statistical null hypothesis 389
TTEST procedure E123, E129
t tests, two-tailed 406–407
tables
See also frequency tables
basic statistical measures table 184, 192
crosstabulation tables 643
extremes table 184
moments table 184
quartiles table 184, 201
summary tables, ANOVA 516, 583–586
two-way classification tables 634–637, 641–643
TABLES statement, FREQ procedure 647–649
ALL option 648
CHISQ option 648
EXPECTED option 648
FISHER option 648
MEASURES option 648
test scores (mid-term), comparing (example) 266–268
converting multiple raw-score variables into z-score variables 278–285
converting single raw-score variables into z-score variables 268–278
testing assumptions for ANOVA 570
tests for normality 184, 191–193
tests of association 383
tests of group differences 383

text editors for writing SAS programs 41–42
title bar (Editor window) 67
treatment conditions 32, 497
treatment effects 497–498
true independent variables vs. subject variables 544–545
true zero point 26–27
TTEST procedure 388–393
ALPHA= option 396–397, 401, 432, 470, 476–477
child aggression study, nonsignificant results 446–450
child aggression study, significant results 428–445
H0 option 396–397, 469–470
independent-samples t tests E135, E142
one- and two-tailed tests 406–407
paired-samples t tests 468–469, E151, E159
PAIRED statement 470
single-sample t tests E123, E129
spatial recall in reading comprehension 396–404
TUKEY option, MEANS statement (GLM) 509, 572
Tukey's HSD test 509, 572, 587–589
two-sided (two-tailed) hypotheses
See nondirectional hypotheses
two-tailed t tests 406–407
two-way ANOVA
See factorial ANOVA with two between-subjects factors
two-way chi-square test
See chi-square test of independence
two-way classification tables 634–637
computer preferences among college students 641–642
raw data vs. (chi-square test) 642–643
type-of-variable figures 32–37, 291, 344
chi-square test of independence 632
dichotomous (binary) variables 33
factorial ANOVA 543
independent-samples t tests 417–419
limited-value variables 33
multi-value variables 33
one-way ANOVA 491
paired-samples t tests 453
Pearson correlation coefficient 291
Type I error 336–337
TYPE= option, VBAR/HBAR statements (CHART) 175
typing SAS programs 61–63
typing SAS programs 6163

U
unbalanced factorial ANOVA designs 575, 625–627
Undo command 81
UNIVARIATE procedure 183–184
distribution shape 190–200, E47
mean, computing 186–187
median (median score), computing 186
mode (modal score), computing 185
NORMAL option 183
PLOT option 183, 187
stem-and-leaf plots 187–190
testing ANOVA assumptions 570
variability measures 200–203, E56

V
valid observations, reviewing for 135
values
absolute magnitude/value 276–278, 295–296
defined 22
out-of-bounds values 136
variables classified by number of displayed values 27–29
VAR statement, CORR procedure 314
omitting 321
suppressing correlation coefficients 329–331
VARDEF= option, MEANS procedure 210–213, 269, 279
variability measures 200–203, E56
variable names
creating variables from existing variables 230–232, 235–241
duplicating variables with new names 228–230
in INPUT statement 119
variables
See also criterion variables
See also predictor variables
See also quantitative variables
See also raw-score variables
See also type-of-variable figures
See also z scores
analyzing with MEANS and FREQ procedures 131–134
categorical variables 22, 25
character variables 128–130, 163–164, 245–248
classification variables 22

variables (continued)
creating from existing variables 230–232, 235–241, E9, E14, E63
defined 22
dependent variables 31–32, 342–344
dichotomous (binary) variables 27–28, 33
duplicating with new names 228–230
hyphens in names 130
independent variables 31–32, 342–344, 497, 544–545
inputting, rules for 126
limited-value variables 28, 33
linear and nonlinear relationships 307–313
manipulated variables 31
multi-value variables 28–29, 33
naming in INPUT statement 119
naturally occurring 29–30
negative relationships between 312–313
number of displayed values 27–29
positive relationships between 312–313
printing raw data for 139–142, 222–225
qualitative 22, 25
renaming 228–230
recoding reversed variables 233–239
scales of measurement 24–27
standardized 262, 285–286
string variables 116
subject vs. true independent variables 544–545
variables, correlation between
See bivariate correlation
variables, fitting line to
See bivariate linear regression
variables, predicted relationships between
See hypotheses
variables, relationship between
See chi-square test of independence
variance 204–210
coefficient of determination 296, 357–358, 375
estimated population variance 209, 212–213, E56
F test for equality of variances 437–438, 447
index of variance accounted for 499–500
MEANS procedure for 210–213
population variance 205, 207, 212
R² statistic 499–500
sample variance 208–209, 210–212, E56
VBAR statement, CHART procedure 163
DISCRETE option 172–173
LEVELS= option 168–169
MIDPOINTS= option 170–171
SUMVAR option 175
TYPE= option 175
verifying data accuracy 181

W
weapon use in criminal activity by juveniles (example) 632–639
weight loss, correlating with predictor variables (example)
bivariate linear regression for 346–350
computing all possible correlations 320–323
CORR procedure 313–320
negative regression coefficient 371–379
nonsignificant correlations 324–329
nonsignificant regression coefficient 379–383
Pearson correlation coefficient for 303–307
positive regression coefficient 350–371
Spearman rank-order correlation coefficient 332–333
suppressing correlation coefficients 329–331
WEIGHT statement, FREQ procedure 647, 670
Window menu (Editor window) 59–60
windowing environment 42, 47–110, E3, E6
editing programs 81–93
exiting 74
managing Log and Output window contents 69–73
saving programs 63–67
starting and restarting 50–52, 75, 105–106
submitting programs for execution 67–69
submitting programs for execution, with errors 96–101
typing programs 61–63
windows 52–60
See also Editor window
See also Log window
See also Output window
bringing to foreground 60
clearing contents 73
closing 54
Explorer window 53, 54, 55–56
navigating among 60
Results window 53, 54
WITH statement, CORR procedure 329–331
within-subjects designs 416–417
women's responses to sexual infidelity (example)
criterion variables 466, 484

nonsignificant results 483–487
predictor variables 466
significant results 463–482
statistical alternative hypothesis 466
statistical null hypothesis 466

Y
Y-intercept of regression line 354, 375

Z
z scores 262–286
absolute magnitude/value 276–278
advantages of 263–265
converting raw-score variables into E77, E84
converting raw-score variables into, more than one 278–285
converting raw-score variables into, singly 268–278
creating with MEANS procedure 269–270, 273, 279, 281, E77, E84
creating z-score variables 272, 280
estimated population standard deviation and 269, E77, E84
mean and z-score variables 269–270, 275–276, 279, 283, E77, E84
sign 276–278
STANDARD procedure with 285–286
zero correlation 296
zero point 26–27

Special Characters
$ for character variables 128
- (hyphens) in variable names 130
() (parentheses) in formulas 232
. (period) for missing data 124, 127
; (semicolon) in SAS code 40
' (single quotation mark) 245


Call your local SAS office to order these books from Books by Users Press:

Advanced Log-Linear Models Using SAS, by Daniel Zelterman (Order No. A57496)
Annotate: Simply the Basics, by Art Carpenter (Order No. A57320)
Applied Multivariate Statistics with SAS Software, Second Edition, by Ravindra Khattree and Dayanand N. Naik (Order No. A56903)
Applied Statistics and the SAS Programming Language, Fourth Edition, by Ronald P. Cody and Jeffrey K. Smith (Order No. A55984)
An Array of Challenges – Test Your SAS Skills, by Robert Virgile (Order No. A55625)
Beyond the Obvious with SAS Screen Control Language, by Don Stanley (Order No. A55073)
Carpenter's Complete Guide to the SAS Macro Language, by Art Carpenter (Order No. A56100)
The Cartoon Guide to Statistics, by Larry Gonick and Woollcott Smith (Order No. A55153)
Categorical Data Analysis Using the SAS System, Second Edition, by Maura E. Stokes, Charles S. Davis, and Gary G. Koch (Order No. A57998)
Cody's Data Cleaning Techniques Using SAS Software, by Ron Cody (Order No. A57198)
Common Statistical Methods for Clinical Research with SAS Examples, Second Edition, by Glenn A. Walker (Order No. A58086)
Concepts and Case Studies in Data Management, by William S. Calvert and J. Meimei Ma (Order No. A55220)
Debugging SAS Programs: A Handbook of Tools and Techniques, by Michele M. Burlew (Order No. A57743)
Efficiency: Improving the Performance of Your SAS Applications, by Robert Virgile (Order No. A55960)
A Handbook of Statistical Analyses Using SAS, Second Edition, by B.S. Everitt and G. Der (Order No. A58679)
Health Care Data and the SAS System, by Marge Scerbo, Craig Dickstein, and Alan Wilson (Order No. A57638)
The How-To Book for SAS/GRAPH Software, by Thomas Miron (Order No. A55203)
In the Know ... SAS Tips and Techniques From Around the Globe, by Phil Mason (Order No. A55513)
Integrating Results through Meta-Analytic Review Using SAS Software, by Morgan C. Wang and Brad J. Bushman (Order No. A55810)
Learning SAS in the Computer Lab, Second Edition, by Rebecca J. Elliott (Order No. A57739)
The Little SAS Book: A Primer, by Lora D. Delwiche and Susan J. Slaughter (Order No. A55200)
The Little SAS Book: A Primer, Second Edition, by Lora D. Delwiche and Susan J. Slaughter (Order No. A56649; updated to include Version 7 features)
Logistic Regression Using the SAS System: Theory and Application, by Paul D. Allison (Order No. A55770)
Longitudinal Data and SAS: A Programmer's Guide, by Ron Cody (Order No. A58176)
Maps Made Easy Using SAS, by Mike Zdeb (Order No. A57495)
Models for Discrete Data, by Daniel Zelterman (Order No. A57521)
Multiple Comparisons and Multiple Tests Using SAS Text and Workbook Set (books in this set also sold separately), by Peter H. Westfall, Randall D. Tobias, Dror Rom, Russell D. Wolfinger, and Yosef Hochberg (Order No. A58274)
Multiple-Plot Displays: Simplified with Macros, by Perry Watts (Order No. A58314)
Multivariate Data Reduction and Discrimination with SAS Software, by Ravindra Khattree and Dayanand N. Naik (Order No. A56902)
The Next Step: Integrating the Software Life Cycle with SAS Programming, by Paul Gill (Order No. A55697)
Output Delivery System: The Basics, by Lauren E. Haworth (Order No. A58087)
Painless Windows: A Handbook for SAS Users, by Jodie Gilmore (Order No. A55769; for Windows NT and Windows 95)
Painless Windows: A Handbook for SAS Users, Second Edition, by Jodie Gilmore (Order No. A56647; updated to include Version 7 features)
PROC TABULATE by Example, by Lauren E. Haworth (Order No. A56514)
Professional SAS Programmer's Pocket Reference, Fourth Edition, by Rick Aster (Order No. A58128)
Professional SAS Programmer's Pocket Reference, Second Edition, by Rick Aster (Order No. A56646)
Professional SAS Programming Shortcuts, by Rick Aster (Order No. A59353)
Programming Techniques for Object-Based Statistical Analysis with SAS Software, by Tanya Kolosova and Samuel Berestizhevsky (Order No. A55869)
Quick Results with SAS/GRAPH Software, by Arthur L. Carpenter and Charles E. Shipp (Order No. A55127)
Quick Results with the Output Delivery System, by Sunil K. Gupta (Order No. A58458)
Quick Start to Data Analysis with SAS, by Frank C. Dilorio and Kenneth A. Hardy (Order No. A55550)
Reading External Data Files Using SAS: Examples Handbook, by Michele M. Burlew (Order No. A58369)
Regression and ANOVA: An Integrated Approach Using SAS Software, by Keith E. Muller and Bethel A. Fetterman (Order No. A57559)
Reporting from the Field: SAS Software Experts Present Real-World Report-Writing Applications (Order No. A55135)
SAS Applications Programming: A Gentle Introduction, by Frank C. Dilorio (Order No. A56193)
SAS for Forecasting Time Series, Second Edition, by John C. Brocklebank and David A. Dickey (Order No. A57275)
SAS for Linear Models, Fourth Edition, by Ramon C. Littell, Walter W. Stroup, and Rudolf J. Freund (Order No. A56655)
SAS for Monte Carlo Studies: A Guide for Quantitative Researchers, by Xitao Fan, Ákos Felsővályi, Stephen A. Sivo, and Sean C. Keenan (Order No. A57323)
SAS Macro Programming Made Easy, by Michele M. Burlew (Order No. A56516)
SAS Programming by Example, by Ron Cody and Ray Pass (Order No. A55126)
SAS Programming for Researchers and Social Scientists, Second Edition, by Paul E. Spector (Order No. A58784)
SAS Software Roadmaps: Your Guide to Discovering the SAS System, by Laurie Burch and SherriJoyce King (Order No. A56195)
SAS Software Solutions: Basic Data Processing, by Thomas Miron (Order No. A56196)
SAS Survival Analysis Techniques for Medical Research, Second Edition, by Alan B. Cantor (Order No. A58416)
SAS System for Elementary Statistical Analysis, Second Edition, by Sandra D. Schlotzhauer and Ramon C. Littell (Order No. A55172)
SAS System for Forecasting Time Series, 1986 Edition, by John C. Brocklebank and David A. Dickey (Order No. A5612)
SAS System for Mixed Models, by Ramon C. Littell, George A. Milliken, Walter W. Stroup, and Russell D. Wolfinger (Order No. A55235)
SAS System for Regression, Third Edition, by Rudolf J. Freund and Ramon C. Littell (Order No. A57313)
SAS System for Statistical Graphics, First Edition, by Michael Friendly (Order No. A56143)
The SAS Workbook and Solutions Set (books in this set also sold separately), by Ron Cody (Order No. A55594)
Selecting Statistical Techniques for Social Science Data: A Guide for SAS Users, by Frank M. Andrews, Laura Klem, Patrick M. O'Malley, Willard L. Rodgers, Kathleen B. Welch, and Terrence N. Davidson (Order No. A55854)
Solutions for Your GUI Applications Development Using SAS/AF FRAME Technology, by Don Stanley (Order No. A55811)
Statistical Quality Control Using the SAS System, by Dennis W. King (Order No. A55232)
A Step-by-Step Approach to Using the SAS System for Factor Analysis and Structural Equation Modeling, by Larry Hatcher (Order No. A55129)
A Step-by-Step Approach to Using the SAS System for Univariate and Multivariate Statistics, by Larry Hatcher and Edward Stepanski (Order No. A55072)
Step-by-Step Basic Statistics Using SAS: Student Guide and Exercises (books in this set also sold separately), by Larry Hatcher (Order No. A57541)
Strategic Data Warehousing Principles Using SAS Software, by Peter R. Welbrock (Order No. A56278)
Survival Analysis Using the SAS System: A Practical Guide, by Paul D. Allison (Order No. A55233)
Table-Driven Strategies for Rapid SAS Applications Development, by Tanya Kolosova and Samuel Berestizhevsky (Order No. A55198)
Tuning SAS Applications in the MVS Environment, by Michael A. Raithel (Order No. A55231)
Univariate and Multivariate General Linear Models: Theory and Applications Using SAS Software, by Neil H. Timm and Tammy A. Mieczkowski (Order No. A55809)
Using SAS in Financial Research, by Ekkehart Boehmer, John Paul Broussard, and Juha-Pekka Kallunki (Order No. A57601)
Using the SAS Windowing Environment: A Quick Tutorial, by Larry Hatcher (Order No. A57201)
Visualizing Categorical Data, by Michael Friendly (Order No. A56571)
Working with the SAS System, by Erik W. Tilanus (Order No. A55190)
Your Guide to Survey Research Using the SAS System, by Archer Gravely (Order No. A55688)

JMP Books

Basic Business Statistics: A Casebook, by Dean P. Foster, Robert A. Stine, and Richard P. Waterman (Order No. A56813)
Business Analysis Using Regression: A Casebook, by Dean P. Foster, Robert A. Stine, and Richard P. Waterman (Order No. A56818)
JMP Start Statistics, Second Edition, by John Sall, Ann Lehman, and Lee Creighton (Order No. A58166)
Regression Using JMP, by Rudolf J. Freund, Ramon C. Littell, and Lee Creighton (Order No. A58789)

support.sas.com/pubs

Welcome * Bienvenue * Willkommen * Yohkoso * Bienvenido

SAS Publishing Is Easy to Reach

Visit our Web site located at support.sas.com/pubs
You will find product and service details, including
• companion Web sites
• sample chapters
• tables of contents
• author biographies
• book reviews

Learn about
• regional users group conferences
• trade show sites and dates
• authoring opportunities
• e-books

Explore all the services that SAS Publishing has to offer!

Your Listserv Subscription Automatically Brings the News to You
Do you want to be among the first to learn about the latest books and services available from SAS Publishing?
Subscribe to our listserv newdocnews-l and, once each month, you will automatically receive a description of the
newest books and which environments or operating systems and SAS release(s) each book addresses.
To subscribe,
1. Send an e-mail message to listserv@vm.sas.com.
2. Leave the Subject line blank.
3. Use the following text for your message:
   subscribe NEWDOCNEWS-L your-first-name your-last-name
   For example: subscribe NEWDOCNEWS-L John Doe

You're Invited to Publish with SAS Institute's Books by Users Press
If you enjoy writing about SAS software and how to use it, the Books by Users program at SAS Institute
offers a variety of publishing options. We are actively recruiting authors to publish books and sample code.
If you find the idea of writing a book by yourself a little intimidating, consider writing with a co-author. Keep in
mind that you will receive complete editorial and publishing support, access to our users, technical advice and
assistance, and competitive royalties. Please ask us for an author packet at sasbbu@sas.com or call
919-531-7447. See the Books by Users Web page at support.sas.com/bbu for complete information.

Book Discount Offered at SAS Public Training Courses!
When you attend one of our SAS Public Training Courses at any of our regional Training Centers in the United
States, you will receive a 20% discount on book orders that you place during the course. Take advantage of this
offer at the next course you attend!

SAS Institute Inc.
SAS Campus Drive
Cary, NC 27513-2414
Fax 919-677-4444
E-mail: sasbook@sas.com
Web page: support.sas.com/pubs
To order books, call SAS Publishing Sales at 800-727-3228*
For product information, consulting, customer service, or training, call 800-727-0025*
For other SAS business, call 919-677-8000*

* Note: Customers outside the United States should contact their local SAS office.
