
Answering Your Questions with Statistics

Diana Suhr, University of Northern Colorado

Abstract
Organizing a project, planning your strategies, and executing statistical procedures to examine/analyze data are essential for answering your questions. A guide to determine appropriate statistical tests given a list of questions will be discussed in this presentation. Statistics range from the simple (e.g., frequencies with PROC FREQ, means with PROC MEANS) to the complex (e.g., confirmatory factor analysis with PROC CALIS).

Introduction
A plan or research proposal has been written explicitly stating the purpose, objectives, and goals of your project. Data has been collected in an organized manner as part of the written proposal or plan. Research questions, written as part of your plan, will be answered by analyzing the data collected.

Data analysis is dependent on the type of data collected and on utilizing an appropriate statistical analysis to answer research questions and to satisfy the purpose, objectives, and goals of the study. As we proceed, pertinent statistical analyses will be discussed to answer questions. Statistics discussed range from basic (e.g., frequency, average) to complex (e.g., canonical correlation, structural equation modeling).

What are statistics?
Several definitions have been offered for statistic and statistics.

A statistic is a mathematical expression describing a measurement for a sample (Gall, Borg, & Gall, 1996).

The Merriam-Webster Dictionary states a statistic is a quantity (as the mean of a sample) that is computed from a sample (2003).

Statistics is a scientific methodology dealing with the collection, classification, description, and interpretation of data collected in surveys or experiments. The purpose of statistics is to describe and draw inferences about a population (Ferguson, 1981).

Glass and Hopkins (1984) hold that statistics is a tool for systematic, objective, empirical research and for communicating knowledge. Statistics has an orderly “mother” and a gambling “father”. From the mother came counting, measuring, describing, tabulating, ordering, and taking censuses (descriptive statistics). The father relied on mathematics to increase his skill at playing the odds in games of chance (inferential statistics based on theories of probability).

Statistics, sometimes viewed as the study of variation, provides a methodology for the exploration of variation in events and for making inferences about the circumstances underlying variation (Ferguson, 1981). Emphasis on the study of variation originated with Darwin in “The Origin of Species” and was a central concept in the theory of natural selection. Darwin did not make a direct contribution to statistical methodology. He did, however, create a theoretical context which made the study of variation meaningful and required the development of statistical methods. Galton, Darwin’s half-cousin, understood the concept of variation, was responsible for the initial application of the normal curve, and contributed to the development of methods of correlation.

The Merriam-Webster Dictionary defines statistics as “a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.”

What is a sample?
A sample is a subgroup selected from a population. The method used in selecting the sample is important. If the sample is representative of the population, inferences can be made to the population.

What is a variable?
A variable refers to a property or characteristic whereby members of groups differ from one another. Before a variable can be treated statistically, it must be observed, classified, measured, and quantified.

Dependent variable: a variable whose value is determined by that of one or more other variables in a function.

Independent variable: a variable whose value is specified first and determines the value of one or more other values in an expression or function.

What are univariate statistics?
Univariate (UV) statistics consider one dependent variable at a time (Truxillo, 2003). Separate statistical tests on each of the dependent variables increase the probability of type I error (incorrectly rejecting the null hypothesis). Examples of UV statistical procedures are PROC CORR, PROC TTEST, PROC ANOVA, PROC GLM (without the MANOVA statement), and PROC REG (with one dependent variable).

What are multivariate statistics?
Multivariate (MV) statistics consider more than one dependent variable at a time (Truxillo, 2003). MV analysis fits a model to predict a vector of responses simultaneously. MV analysis allows the model to better approximate relationships in the population and controls for type I error. Examples of MV statistical procedures are PROC CANCORR, PROC CANDISC, PROC DISCRIM, PROC STEPDISC, PROC GLM with REPEATED, PROC GLM with MANOVA, PROC REG (with more than one dependent variable), PROC PRINCOMP, PROC FACTOR, and PROC CALIS.

What are scales of measurement?
A measurement is a quantified or categorized observation and involves assigning numbers to things according to rules. Measurement transforms attributes into numbers. Two types of variables are used in measurement, continuous and discrete. Continuous variables can take any value within a certain range (e.g., ages, heights, values between 1 and 5). Discrete variables take only certain values within a range (e.g., the integers between 1 and 5 are 1, 2, 3, 4, 5). Four types of measurement scales are recognized, with implications for appropriate statistical analysis.
Nominal scales use numbers to represent categories. The numbers distinguish groups and do not reflect differences in magnitude. Examples of nominal levels of measurement are gender (male, female), ethnicity, college major, eye color.

Ordinal scales use numbers to indicate rank order of observations. Examples of ordinal scales are percentile norms, social class, order of performance (e.g., 1st, 2nd, 3rd place).

Measures on interval scales represent equal distances between the amounts of the attribute measured. Examples are temperature measured in °F or °C, or calendar time.

Measures on ratio scales represent equal units from absolute zero. Observations can be compared as ratios or percentages. Examples are distance, age, weight, height.

Different statistical procedures are appropriate for different scales of measurement. Categorical data (nominal scales) are described with frequencies or percentages. An average gender value does not make sense, but 30% male and 70% female describes a sample. Describing the range and mean value of achievement tests for third graders is more appropriate than a frequency distribution. Whenever conducting a statistical analysis, be sure you have the appropriate scale of measurement.

What is data cleaning?
Data cleaning is checking values to make sure they are not out-of-range and verifying they are accurate. Data may be entered by hand, scanned from scanner forms, entered online, or transcribed from interviews. To represent the sample and make inferences to the population, data must be accurate and complete.

Data examination is essential to accurate results. Follow these steps when checking and cleaning data. Verify the input statement, check frequencies and means, examine univariate statistics, and edit data if necessary.

Making sure the input statement is correct is a good start for checking data. Was the data file created with data entry, a scanned form, or web input? Data entry can easily produce data errors. A scanned form or “bubble” sheet could have missing data if a respondent leaves an item blank or bubbles two choices when only one can be accepted. First, examine a data form and match the responses on the form with the appropriate location in the data file. Then determine column locations to create or verify the input statement.

Next, use PROC FREQ to run frequency distributions for variables in the data set. Answer the following questions. Are values out-of-range? Is there missing data? Did the subjects respond to all the questions/items? How will missing data be handled? Is there a way to contact participants to complete missing demographic data? Should missing data (e.g., on a rating scale) be replaced with mean substitution?

Run PROC UNIVARIATE FREQ PLOT NORMAL for selected variables. Answer these questions: Is the distribution normal? Is the distribution skewed? Are there outliers? Should outliers be removed from the data analysis? How will a nonnormal distribution affect the statistical analysis?

To print errors, the following example illustrates a routine for selecting inappropriate values. Use a subsetting IF statement and PROC PRINT. For example, if gender is coded as 1 and 2 and the frequency distribution shows values of 1, 2, 3, 5, use the following statements to print data errors.

    data error;
       set rawsub;
       if gender in (3,5);   /* keep only the invalid gender codes */
    run;
    proc print data = error;
    run;

Compare the PROC PRINT results with the original data sheets. Maybe during data entry, everything got moved over a column or two. Maybe on a “bubble” sheet there were 5 bubbles per item, the question only had 2 choices (e.g., male or female), and the respondent bubbled choice 3, 4, or 5.

For some variables (e.g., id number, age, gpa), PROC FREQ will produce more pages of output than you’d like to examine. For variables like age or gpa, PROC MEANS provides minimum and maximum values to determine out-of-range values.

Are there duplicate identification numbers for a one-time measurement of subjects? If the study is a repeated-measures or longitudinal design, are there missing measurements for a subject at any time point? What SAS procedures will help you answer these questions?

To check for duplicate subject identification numbers on a one-time measurement, the following code is helpful.

    proc freq data = rawsub;
       tables idnum / noprint out=newid;   /* one row per idnum, with COUNT */
    run;
    data idfl;
       set newid;
       if count gt 1;                       /* keep ids that occur more than once */
    run;
    proc print data = idfl;
       title 'duplicate id numbers';
    run;

The above code can be modified to count the number of measurements for each subject in a longitudinal or repeated measures study. Delete the “if count gt 1;” statement to determine the number of measurements for each subject.

If there were 4 measurements for each subject in the project, replace the “if count gt 1;” with “if count ne 4;” to print identification numbers of subjects with fewer than or more than 4 measurements. The code below does the same thing by deleting subjects with exactly 4 measurements.

    proc freq data = rawsub;
       tables idnum / noprint out=newid;
    run;
    data idfl;
       set newid;
       if count eq 4 then delete;           /* drop subjects with complete data */
    run;
    proc print data = idfl;
       title 'missing measurements';
    run;

After data has been checked and “cleaned”, you’re ready to proceed to the statistical analysis!

What is frequency?
Frequency is counting the number of occurrences of a value or response. Frequency is the number of individuals in a single class when objects are classified according to variations in a set of one or more specified attributes (Merriam-Webster, 2003).
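As a minimal sketch of counting occurrences, a one-way frequency table can be requested with PROC FREQ. The sketch assumes the data set rawsub and the variable gender used in the data cleaning examples; the MISSING option is an optional addition that treats missing values as a category of their own.

```sas
/* Sketch only: rawsub and gender are assumed to exist as in the */
/* data cleaning examples.                                        */
proc freq data = rawsub;
   tables gender / missing;   /* count each value, including missing */
run;
```

The output lists each value of gender with its count and percentage, which is also the quickest way to spot codes outside the expected 1 and 2.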
PROC FREQ allows you to describe your data in a concise way by producing frequency counts and crosstabulation tables. Frequency tables show the distribution of values and the number of responses for each value. Crosstabulation tables combine frequency distributions for two or more variables. For example, a table with gender by residency shows the number of male residents, female residents, male nonresidents, and female nonresidents.

Is there an association between row and column variables of a two-way table?
Statistical procedures available in PROC FREQ allow you to test associations (relationships) between variables. The chi-square statistic is based on differences between observed and expected frequencies. The test can be thought of as a test of differences between proportions. In PROC FREQ, measures of association are close to zero when there is no association and close to the maximum (or minimum) value when there is perfect association (e.g., likelihood ratio chi-square, Mantel-Haenszel chi-square, Cramer’s V). Computation and printing of statistics is specified in the TABLES statement options.

What is a frequency distribution?
A frequency distribution is an arrangement of statistical data that illustrates the frequency of the occurrence of the values of a variable (Merriam-Webster, 2003). Frequencies are arranged from smallest to largest value and can be ungrouped or grouped.

Is the distribution normal?
About 1870, a Belgian mathematician, Quetelet, and an English scientist, Galton, made a discovery about individual differences. They found the same pattern of results over and over again when arranging measurements from large samples into a frequency distribution. A symmetrical, bell-shaped curve (a normal curve) resulted when plotting human characteristics.

Bar charts help visualize a frequency distribution. The following code could help illustrate the “item 1” distribution on a 1 to 5 scale.

    proc gchart data = rawsub;
       vbar item1 / midpoints = 1 2 3 4 5;
    run;

What is the range?
The range of a variable is the highest valid value minus the lowest valid value of the variable.

Are data out-of-range?
An out-of-range value is less than the lowest valid value or greater than the highest valid value. For example, if gender is coded as 1 for male and 2 for female and a value of 6 is found for gender, the value 6 is out-of-range. A value of 8 on a scale of 1 to 5 is out-of-range.

Are percentages significantly different?
Chi-square is the appropriate statistic to determine if percentages in a two-way table are significantly different. The question could be “Is there a significant difference between boys and girls in proficiency level?” PROC FREQ with chi-square will answer the question.

    proc freq data = all;
       tables gender * level / chisq;
    run;

                   level
    Frequency|
    Row Pct  |     1 |     2 |     3 |     4 |  Total
    ---------+-------+-------+-------+-------+
    Boys     |    22 |    30 |    78 |     3 |    133
             | 16.54 | 22.56 | 58.65 |  2.26 |  51.35
    ---------+-------+-------+-------+-------+
    Girls    |     8 |    21 |    87 |    10 |    126
             |  6.35 | 16.67 | 69.05 |  7.94 |  48.65
    ---------+-------+-------+-------+-------+
    Total         30      51     165      13     259
               11.58   19.69   63.71    5.02  100.00

    Statistic      DF      Value     Prob
    -------------------------------------
    Chi-Square      3    12.2014   0.0067

    Effective Sample Size = 259
    Frequency Missing = 2

In this output the chi-square probability (0.0067) is below .05, so the distribution across proficiency levels differs significantly between boys and girls.

What is the mean?
The mean is an arithmetic average calculated by adding the values of a variable and dividing by the number of values. Some statistical procedures determine if there is a significant difference between group means.

What is the variance?
When describing a set of scores, how much the scores vary is often reported. Variance describes how much the scores spread out from low score to high score. The variance is an average of the squared deviations from the mean, obtained by adding the squared deviations and dividing by the total number of cases. The square root of the variance is called the standard deviation.

Are means significantly different?

Between two groups?
PROC TTEST determines a t-value and probability between two group means.

    proc ttest;
       class gender;
       var math read sci;
    run;

Between more than two groups?
PROC GLM can be used to determine significant differences between the means of two or more groups. Post hoc tests perform pairwise group comparisons.

Covariance?
PROC GLM and LSMEANS can be used for analysis of covariance.

    PROC GLM;
       CLASS GENDER;
       MODEL POST = GENDER PRE;
       MEANS GENDER;
       LSMEANS GENDER;
    RUN;

POST is the dependent variable, PRE is the covariate, and GENDER is the independent variable. POST values will be adjusted with the covariate, PRE.
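The “more than two groups” case above is described without code. A minimal sketch follows, assuming the data set all also holds a math score alongside the four-level variable level; that pairing is hypothetical and chosen only for illustration. The TUKEY option on the MEANS statement is one of several post hoc methods PROC GLM offers.

```sas
/* Hypothetical example: compare mean math scores across the four */
/* proficiency levels, then run pairwise comparisons.              */
proc glm data = all;
   class level;
   model math = level;       /* one-way analysis of variance      */
   means level / tukey;      /* Tukey pairwise group comparisons  */
run;
quit;
```

The overall F test answers whether any group means differ; the Tukey output then shows which pairs of levels differ while controlling the type I error rate across comparisons.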
Multivariate analysis?
PROC GLM with MANOVA specifies multivariate analysis of variance to determine if a set of means is significantly different between two or more groups.

    proc glm;
       class gender;
       model math read sci = gender;
       manova h=_all_;
    run;

Multivariate Covariance?

    PROC GLM;
       CLASS GENDER;
       MODEL MATH READ SCI = GENDER VOCAB;
       MANOVA H=_ALL_;
    RUN;

Dependent variables (math, read, sci) will be adjusted with the covariate (vocab).

Over time (repeated measures)?
Is there a significant change over time between boys and girls?

    PROC GLM;
       CLASS GENDER;
       MODEL T1 T2 T3 T4 = GENDER;
       REPEATED TIME 4;
    RUN;

Structural equation modeling (PROC CALIS) is another method to determine change over time. Measured variables are RR7, RR9, RR11, RR13. Latent variables (unobserved) are initial value (F1) and rate of change (F2). Average initial value (ml) and average rate of change (ms) are estimated.

    PROC CALIS DATA = COH7F UCOV AUG ALL;
    LINEQS
       RR7  = F1 + E7,
       RR9  = F1 + F2 + E9,
       RR11 = F1 + PV11F2 F2 + E11,
       RR13 = F1 + PV13F2 F2 + E13,
       F1   = ML INTERCEPT + D1,
       F2   = MS INTERCEPT + D2;
    STD
       E7 = VARE7,
       E9 = VARE9,
       E11 = VARE11,
       E13 = VARE13,
       D1-D2 = VARD1-VARD2;
    COV
       D1 D2 = CD1D2;
    VAR RR7 RR9 RR11 RR13;
    RUN;

Is there a relationship between variables?

Between two variables?
Correlation measures the strength of the linear relationship between two variables. If one variable can be expressed exactly as a linear function of another variable, then the correlation is 1 (directly related) or -1 (inversely related). A correlation of 0 indicates no linear relationship. PROC CORR computes correlation coefficients.

Is there a relationship between income and education?

    PROC CORR;
       VAR INCOME EDUC;
    RUN;

Is there a relationship between high school gpa, cumulative gpa (college), and ACT score? Correlations are calculated for each pair of variables: hsgpa and cumgpa, hsgpa and ACT, cumgpa and ACT.

    PROC CORR;
       VAR HSGPA CUMGPA ACT;
    RUN;

Between sets (groups) of variables?
PROC CANCORR calculates canonical (multivariate) correlation coefficients to determine whether there is a relationship between two sets of variables.

Is there a relationship between illness, fitness, exercise and hardiness, stress?

    PROC CANCORR;
       VAR ILLNESS FITNESS EXERCISE;
       WITH HARDINESS STRESS;
    RUN;

Are there direct and indirect relationships between variables?
Structural equation modeling (SEM) allows specification of direct and indirect relationships between variables (measured or latent). Path analysis specifies relationships between measured variables and can help determine direct, indirect, and mediating effects. Confirmatory factor analysis (CFA) tests the factor structure of a measurement instrument. A form of CFA tests the degree to which measured variables are influenced by a latent construct.

Does a group of variables predict the value of another variable?

Linear relationship?

    proc reg;
       model income = age educ gender;
    run;

Nonlinear relationship?

    PROC NLIN DATA=EXPL METHOD=DUD OUTEST=ENL;
       BY ID;
       PARAMETERS A=.455 B=.200 C=-0.852;
       MODEL MEDTIME = A + B * TRIAL ** C;
    RUN;

Logistic?
PROC LOGISTIC fits linear logistic regression models for binary or ordinal response data.

    PROC LOGISTIC DATA = RAWSUB;
       MODEL Y = X1-X3;
    RUN;

Multivariate regression?
Do hardiness and stress predict illness, fitness, and exercise?

    PROC REG;
       MODEL ILLNESS FITNESS EXERCISE =
             HARDINESS STRESS;
    RUN;
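With several dependent variables in one MODEL statement, PROC REG reports a separate regression for each. A multivariate test across all of them can be requested with the MTEST statement. The sketch below assumes a data set named rawsub that holds the variables from the example above; the data set name is hypothetical and the MTEST line is an addition, not part of the original example.

```sas
/* Hypothetical data set name; variable names follow the paper's example. */
proc reg data = rawsub;
   model illness fitness exercise = hardiness stress;
   /* Joint test that the hardiness and stress coefficients are zero */
   /* for all three dependent variables at once.                     */
   mtest hardiness, stress;
run;
quit;
```

The MTEST output reports multivariate statistics such as Wilks' Lambda, which address the question of prediction for the set of outcomes rather than one outcome at a time.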
Does a group of variables discriminate or distinguish group membership?
How well do 12 risk factors classify participants as having or not having cancer?

    proc discrim;
       class group;
       var risk1-risk12;
    run;

Do items measure a common construct?

Explore Factor Structure?
Are there underlying latent constructs?

    proc factor data=rawsub reorder scree
                method=principal rotate=varimax;
       var q1-q36;
    run;

Confirm Factor Structure?
How well do the items measure an underlying latent construct?

    proc calis data=rawsub cov stderr
               all kurtosis modifications residual;
    lineqs
       q10 = ps10f1 F1 + e10,
       q15 = ps15f1 F1 + e15,
       q17 = ps17f1 F1 + e17,
       q32 = ps32f1 F1 + e32;
    std
       e10 = vare10,
       e15 = vare15,
       e17 = vare17,
       e32 = vare32,
       F1 = 1;
    var q10 q15 q17 q32;
    run;

Could observations be grouped/clustered according to common characteristics?
Can respondents be grouped (clustered) according to common characteristics? Use PROC CLUSTER to group observations; PROC VARCLUS groups variables rather than observations. The following routine performs a cluster analysis, draws a diagram (PROC TREE), and merges the clusters with the data set to print and run descriptive statistics.

    PROC CLUSTER METHOD=AVE STD RSQ
                 OUTTREE=NEW1 SIMPLE;      **OUTPUT NEW1;
       ID ID;            **USE ID STATEMENT TO KEEP VARIABLES;
       VAR AGE GINV COMPL YRSCOMP DAYS HRS TSIR;
    PROC TREE DATA=NEW1 N=4 OUT=TREE1;
                         **USE NEW1, OUTPUT TREE1;
       ID ID;            **USE ID STATEMENT AGAIN;
    PROC SORT DATA = ALL;
       BY ID;            **SORT ORIGINAL DATASET;
    PROC SORT DATA = TREE1;
       BY ID;            **SORT DATASET WITH CLUSTER#S;
    DATA ALLNEW;
       MERGE ALL TREE1;  **MERGE SETS BY ID;
       BY ID;
    PROC SORT;
       BY CLUSTER;
    PROC PRINT DATA = ALLNEW;              **PRINT;
       VAR ID CLUSTER AGE GINV COMPL
           YRSCOMP DAYS HRS TSIR;
    PROC MEANS DATA = ALLNEW;              **DESCRIPTIVE STATISTICS BY CLUSTER;
       BY CLUSTER;
       VAR AGE GINV COMPL YRSCOMP DAYS HRS TSIR;
    RUN;

Can ranked data be analyzed?
Ranked data can be analyzed with nonparametric statistics.

Conclusion
You’ve got a plan explicitly stating the purpose, objectives, and goals of your project. Data has been collected in an organized manner. Research questions, written as part of your plan, were answered by analyzing the data collected. Appropriate statistical analyses were used to answer research questions and to satisfy the purpose, objectives, and goals of the study. Appendix A provides a quick reference guide to pertinent statistical analyses ranging from the basic (e.g., frequency, average) to complex (e.g., canonical correlation, structural equation modeling). You have the ability to answer your questions with statistics.

References
Darwin, C. (1888). On the origin of species, 1859. Washington Square, NY: New York University Press.
Ferguson, G. A. (1981). Statistical analysis in psychology and education, 5th Ed. New York: McGraw-Hill Book Company.
Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction. White Plains, NY: Longman Publishers USA.
Glass, G. V. & Hopkins, K. D. (1984). Statistical methods in education and psychology, 2nd Ed. Boston: Allyn and Bacon.
Kranzler, G. & Moursund, J. (1999). Statistics for the terrified, 2nd Ed. Upper Saddle River, NJ: Prentice Hall.
Merriam-Webster Dictionary Online 2003 @ http://www.m-w.com/home.htm
SAS® Applications Guide, 1980 Edition. Cary, N.C.: SAS Institute.
SAS® Language, Version 6. Cary, N.C.: SAS Institute, 1990.
SAS® Language and Procedures, Version 6, First Edition. Cary, N.C.: SAS Institute, 1989.
SAS® OnlineDoc, Version 8, SAS/STAT® User’s Guide, Chapter 63. Cary, N.C.: SAS Institute, 1999.
SAS® Procedures, Version 6, Third Edition. Cary, N.C.: SAS Institute, 1990.
Truxillo, C. (2003). Multivariate Statistical Methods: Practical Research Applications Course Notes. Cary, N.C.: SAS Institute.

About the author
Diana Suhr is a Statistical Analyst in the Office of Institutional Research at the University of Northern Colorado. She earned a Ph.D. in Educational Psychology at UNC in 1999. The first programming language she learned was Fortran in 1970. She has been a SAS programmer since 1984.

Contact
Diana Suhr, Statistical Analyst
Institutional Research
University of Northern Colorado
Greeley, CO 80639
970-351-2193, diana.suhr@unco.edu

SAS and all other SAS Institute product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Appendix A: Research Questions, Statistics, and SAS Procedures

| Question | Statistic | SAS Procedure |
|---|---|---|
| What is the frequency distribution? | frequencies | PROC FREQ |
| What is the range? | range = maximum - minimum | PROC UNIVARIATE, PROC FREQ, PROC MEANS |
| Are data “out-of-range”? | frequency values | PROC FREQ |
| Is the distribution normal? | normality, skewness, kurtosis | PROC UNIVARIATE |
| Is there a significant difference in frequencies or percentages between groups? | chi-square | PROC FREQ |
| Is the sample in the same distribution as the population? (on selected variables) | chi-square | PROC FREQ |
| Is there a relationship between two variables? | correlation | PROC CORR |
| Is there a relationship between two groups of variables? | canonical correlation | PROC CANCORR |
| Are there direct and indirect relationships between variables? | structural equation modeling | PROC CALIS |
| Is there a difference between the means? (two groups) | t-test | PROC TTEST |
| Is there a difference between the means? (more than two groups) | analysis of variance (anova) | PROC ANOVA, PROC GLM |
| Is there a difference between the means of a group of variables? | multivariate analysis of variance (manova) | PROC GLM |
| Is there change over time? | repeated measures anova | PROC GLM |
| | latent growth curve modeling | PROC CALIS |
| Does a group of variables predict the value of another variable? | regression (linear) | PROC REG, PROC GLM |
| Does a group of variables predict a nonlinear relationship with another variable? | nonlinear regression | PROC NLIN |
| Does a group of variables discriminate or distinguish group membership? | discriminant analysis | PROC DISCRIM |
| Do items measure a common construct? | exploratory factor analysis (groups items) | PROC FACTOR |
| Can the factor structure of a measurement instrument be confirmed? | confirmatory factor analysis | PROC CALIS |
| Can observations be grouped or clustered according to similarities/common characteristics? | cluster analysis (groups observations) | PROC CLUSTER, PROC TREE |
| Can ranked data be analyzed? | nonparametric tests: Wilcoxon, Mann-Whitney, Kruskal-Wallis | PROC NPAR1WAY |
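The last row of the table names PROC NPAR1WAY, for which the paper gives no code. A minimal sketch follows, assuming the data set rawsub with the two-level variable gender and the 1 to 5 rating item1 used in earlier examples; treating item1 as the ranked outcome is an illustrative choice, not the author's.

```sas
/* Sketch only: rawsub, gender, and item1 are assumed from earlier examples. */
proc npar1way data = rawsub wilcoxon;
   class gender;     /* grouping variable              */
   var item1;        /* ordinal (ranked) response      */
run;
```

With two groups the WILCOXON option reports the Wilcoxon rank-sum (Mann-Whitney) test. With more than two levels in the CLASS variable the same option yields the Kruskal-Wallis test.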
