
## Abstract

Projects include a plan, outcomes, and the measurement of results. To determine the outcomes of a project, questions are asked and data are examined. This paper offers a guide for choosing appropriate statistical tests given a list of research questions. Descriptive statistics range from normality, percentiles, and quartiles (PROC UNIVARIATE) to frequencies (PROC FREQ) and means (PROC MEANS). The presentation discusses preparing your data for analysis and using appropriate statistical tests to answer your research questions, without going into the details of coding or interpreting the analysis.

## Introduction

A plan or research proposal has been written explicitly stating the purpose, objectives, and goals of your project. Data has been collected in an organized manner as part of the written proposal or plan. Research questions, written as part of your plan, will be answered by analyzing the data collected. Data analysis is dependent on the type of data collected and on utilizing an appropriate statistical analysis to answer research questions and to satisfy the purpose, objectives, and goals of the study. As we proceed, pertinent statistical analyses will be discussed to answer your research questions. Statistics discussed range from basic (e.g., frequency, average) to complex (e.g., canonical correlation, structural equation modeling).

## What are statistics?

Several definitions have been offered for statistic and statistics. A statistic is a mathematical expression describing a measurement for a sample (Gall, Borg, & Gall, 1996). The Merriam-Webster Dictionary (2003) states that a statistic is a quantity (as the mean of a sample) that is computed from a sample.

Statistics is a scientific methodology dealing with the collection, classification, description, and interpretation of data collected in surveys or experiments. The purpose of statistics is to describe and draw inferences about a population (Ferguson, 1981). Glass and Hopkins (1984) hold that there is a place for systematic, objective, empirical research and for communicating knowledge, and that statistics is a tool for both. The Merriam-Webster Dictionary (2003) defines statistics as a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.

Statistics has an orderly mother and a gambling father. From the mother came counting, measuring, describing, tabulating, ordering, and taking censuses (descriptive statistics). The father relied on mathematics to increase his skill at playing the odds in games of chance (inferential statistics based on theories of probability).

Statistics, sometimes viewed as the study of variation, provides a methodology for exploring variation in events and for making inferences about the circumstances underlying that variation (Ferguson, 1981). Emphasis on the study of variation originated with Darwin in The Origin of Species and was a central concept in the theory of natural selection. Darwin did not make a direct contribution to statistical methodology. He did, however, create a theoretical context that made the study of variation meaningful and required the development of statistical methods. Galton, Darwin's cousin, understood the concept of variation, was responsible for the initial application of the normal curve, and contributed to the development of methods of correlation.

## What is a population?

A population is a set or group of existing items, e.g., people, objects, or events (Bowerman et al., 2010). The members of the set are all of the existing items in the population. For example, we could define employees at a company as a population, or first-time full-time freshmen at a university as a population, or WUSS 2010 attendees as a population.

## What is a census?

A census examines or measures every item in the population. Sometimes not all items of a population can be examined because of cost, the inability to obtain measurements, or the size of the population.

## What is a sample?

To describe the population when a census is not possible, a subgroup, or sample, is selected from the population. Samples are selected randomly with an appropriate method so that inferences can be made about the population.
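As an illustration of random selection, the sketch below draws a simple random sample with PROC SURVEYSELECT (part of SAS/STAT). The data set names `freshmen` and `freshmen_sample`, the sample size, and the seed are hypothetical and not taken from the original study; other sampling methods may be more appropriate for a given design.

```sas
/* Hypothetical example: WORK.FRESHMEN holds one row per member  */
/* of the population; FRESHMEN_SAMPLE receives the sampled rows. */
proc surveyselect data=freshmen
                  out=freshmen_sample
                  method=srs      /* simple random sampling without replacement */
                  n=100           /* assumed sample size */
                  seed=20100901;  /* fixed seed so the draw can be repeated */
run;
```

Fixing the seed makes the selection reproducible, which helps when the sample has to be documented or re-created later.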

## What is a variable?

A variable refers to a property or characteristic of the population. Members of the group (population) differ from one another and can be described in terms of variable differences. Before a variable can be treated statistically, it must be observed, classified, measured, and quantified. A dependent variable is a variable whose value is determined by that of one or more other variables. An independent variable is a variable whose value is specified first and influences the value of one or more other variables. For example, income is a dependent variable that could be influenced by the level of education, an independent variable.

## What are univariate statistics?

Univariate (UV) statistics consider one dependent variable at a time (Truxillo, 2003). When a study has several dependent variables, running a separate statistical test on each of them increases the probability of a Type I error (incorrectly rejecting the null hypothesis). Examples of UV statistical procedures are PROC CORR, PROC TTEST, PROC ANOVA, PROC GLM (without the MANOVA statement), and PROC REG (with one dependent variable).
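To make the Type I error inflation concrete, the following is a standard textbook approximation rather than a result from this paper. It assumes k separate tests that are independent of one another, each run at the same significance level alpha; correlated dependent variables will behave somewhat differently.

```latex
% Probability of at least one Type I error across k independent tests
\alpha_{\text{overall}} = 1 - (1 - \alpha)^{k}

% Example: k = 5 tests, each at \alpha = 0.05
\alpha_{\text{overall}} = 1 - (0.95)^{5} \approx 1 - 0.774 = 0.226
```

Under these assumptions, five separate tests at the 0.05 level carry roughly a 23% chance of at least one false rejection.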

## What are multivariate statistics?

Multivariate (MV) statistics consider more than one dependent variable at a time (Truxillo, 2003). MV analysis fits a model that predicts a vector of responses simultaneously. This allows the model to better approximate relationships in the population and controls for Type I error. Examples of MV statistical procedures are PROC CANCORR, PROC CANDISC, PROC DISCRIM, PROC STEPDISC, PROC GLM with REPEATED, PROC GLM with MANOVA, PROC REG (with more than one dependent variable), PROC PRINCOMP, PROC FACTOR, and PROC CALIS.
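As a minimal sketch of one of these procedures, the statements below use PROC GLM with a MANOVA statement to test several dependent variables at once. The data set `scores` and the variables `group`, `reading`, `math`, and `writing` are hypothetical names chosen for illustration.

```sas
/* Hypothetical data: one row per subject, a grouping variable, */
/* and three dependent variables measured on each subject.      */
proc glm data=scores;
  class group;
  model reading math writing = group;  /* several dependent variables */
  manova h=group;                      /* multivariate test of the group effect */
run;
quit;
```

Without the MANOVA statement, the same MODEL statement would produce only a separate univariate analysis for each dependent variable.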

## What are scales of measurement?

Measurement involves assigning numbers to characteristics according to rules; the resulting variable may be quantitative or qualitative. Measurement transforms attributes into numbers. Two types of variables are used in measurement, continuous and discrete. Continuous variables can take any value within a certain range (e.g., ages, heights, weights). Discrete variables take only certain values within a range (e.g., the integers between 1 and 5 are 1, 2, 3, 4, 5).

Four types of measurement scales are recognized, with implications for appropriate statistical analysis.

Nominal scales use numbers to represent categories. The numbers distinguish groups and do not reflect differences in magnitude. Examples of nominal levels of measurement are gender (male, female), ethnicity, college major, and eye color.

Ordinal scales use numbers to indicate the rank order of observations. Examples of ordinal scales are percentile norms, social class, and order of performance (e.g., 1st, 2nd, 3rd place).

Measures on interval scales represent equal distances between the amounts of the attribute measured. Examples are temperature measured in °F or °C, and calendar time.

Measures on ratio scales represent equal units from absolute zero. Observations can be compared as ratios or percentages. Examples are distance, age, weight, and height. For instance, 60 miles is twice as far as 30 miles; Joe is 25 years old and half as old as his father, who is 50 years old. Values on an interval scale cannot be compared this way: is 10°F half as cold as 20°F?

Different statistical procedures are appropriate for different scales of measurement. Categorical data (nominal scales) are described with frequencies or percentages. An average gender value does not make sense, but 30% male and 70% female describes a sample. For achievement test scores of third graders, describing the range and mean value is more appropriate than a frequency distribution. When conducting a statistical analysis, be sure your variables are on the appropriate scale of measurement.

## What is data cleaning?

Data cleaning is verifying the accuracy of your data and checking values to make sure they are not out-of-range or incorrect. Data may be entered by hand, scanned from scanner forms, entered online, or transcribed from interviews. To represent the population and make inferences from the sample, data must be accurate and complete. Data examination is essential to verify accurate results.

Follow these steps when checking and cleaning data: verify the input statement, check frequencies and means, examine univariate statistics, and edit data if necessary.

Making sure the input statement is correct is a good start for checking data. Was the data file created with data entry, a scanned form, or web input? Data entry can easily produce data errors. A scanned form or bubble sheet could have missing data if a respondent leaves an item blank or bubbles two choices when only one can be accepted. First, examine a data form and match the responses on the form with the appropriate location in the data file. Then determine column locations to create or verify the input statement. Check the accuracy of online data. Were all questions answered? Missing responses could affect the inferences made about the population.

Next, use PROC FREQ to run frequency distributions for variables in the data set, and answer the following questions. Are values out-of-range? Is there missing data? Did the subjects respond to all the questions/items? How will missing data be handled? Is there a way to contact participants to complete missing demographic data? Should missing data (e.g., on a rating scale) be replaced with mean substitution?

Run PROC UNIVARIATE with the FREQ, PLOT, and NORMAL options for selected variables, and answer these questions: Is the distribution normal? Is the distribution skewed? Are there outliers? Should outliers be removed from the data analysis? How will a nonnormal distribution affect the statistical analysis?

To print errors, use a routine that selects inappropriate values: a subsetting IF statement followed by PROC PRINT. For example, if gender is coded as 1 and 2 and the frequency distribution shows values of 1, 2, 3, 5, use the following statements to print data errors.

    data error;
      set rawsub;
      if gender in (3,5);   /* keep only records with invalid gender codes */
    run;
    proc print data=error;
    run;

Compare the PROC PRINT results with the original data sheets. Maybe during data entry, everything got moved over a column or two. Maybe on a bubble sheet there were 5 bubbles per item, the question only had 2 choices (e.g., male or female), and the respondent bubbled choice 3, 4, or 5.

For some variables (e.g., id number, age, gpa), PROC FREQ will produce more pages of output than you'd like to examine. For variables like age or gpa, PROC MEANS provides minimum and maximum values to determine out-of-range values.

Are there duplicate identification numbers for a one-time measurement of subjects? If the study is a repeated-measures or longitudinal design, are there missing measurements for a subject at any time point? What SAS procedures will help you answer these questions?

To check for duplicate subject identification numbers on a one-time measurement, the following code is helpful.

    proc freq data = rawsub;
      tables idnum / noprint out=newid;   /* one row per idnum, with its COUNT */
    run;
    data idfl;
      set newid;
      if count gt 1;                      /* keep ids that occur more than once */
    run;
    proc print data=idfl;
      title 'duplicate id numbers';
    run;

The above code can be modified to count the number of measurements for each subject in a longitudinal or repeated-measures study. Delete the `if count gt 1;` statement to list the number of measurements for each subject. If there were 4 measurements for each subject in the project, replace `if count gt 1;` with `if count ne 4;` or with `if count eq 4 then delete;` to print the identification numbers of subjects with fewer than or more than 4 measurements.

    proc freq data = rawsub;
      tables idnum / noprint out=newid;
    run;
    data idfl;
      set newid;
      if count eq 4 then delete;          /* drop subjects with complete data */
    run;
    proc print data=idfl;
      title 'missing measurements';
    run;

After data has been checked and cleaned, you're ready to proceed to the statistical analysis!
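The earlier screening steps in this section (frequency distributions with missing values, then univariate statistics) can be sketched as follows. This is only an illustration: it assumes that `rawsub` contains the variables `gender` and `item1` used elsewhere in this paper, and the choice of variables to screen depends on your own data.

```sas
/* Frequency distributions, counting missing values as their own level */
proc freq data=rawsub;
  tables gender item1 / missing;
run;

/* Univariate statistics with a frequency table, plots, and normality tests */
proc univariate data=rawsub freq plot normal;
  var item1;
run;
```

The MISSING option on the TABLES statement makes missing responses visible in the table and in the percentages, which helps when deciding how missing data will be handled.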

## What is frequency?

Frequency is the count of the number of occurrences of a value or response. Frequency is the number of individuals in a single class when objects are classified according to variations in a set of one or more specified attributes (Merriam-Webster, 2003). PROC FREQ allows you to describe your data in a concise way by producing frequency counts and crosstabulation tables. Frequency tables show the distribution of values and the number of responses for each value. Crosstabulation tables combine frequency distributions for two or more variables. For example, a table with gender by residency shows the number of male residents, female residents, male nonresidents, and female nonresidents.
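A minimal sketch of both kinds of table is shown below. It assumes the data set `rawsub` contains a `gender` variable, as in the data-cleaning examples, plus a hypothetical `residency` variable that is not part of the original code.

```sas
proc freq data=rawsub;
  tables gender;             /* one-way frequency table */
  tables gender*residency;   /* crosstabulation: gender in rows, residency in columns */
run;
```

In a crosstabulation request the first variable defines the rows and the second defines the columns.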

## Is there an association between row and column variables of a two-way table?

Statistical procedures available in PROC FREQ allow you to test associations (relationships) between variables. Chi-square involves differences between observed and expected frequencies. This test can be thought of as a test of differences between two proportions. In PROC FREQ, measures of association are close to zero when there is no association and close to their maximum (or minimum) value when there is perfect association (e.g., likelihood ratio chi-square, Mantel-Haenszel chi-square, Cramer's V). The computation and printing of statistics are requested with options in the TABLES statement.
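For reference, the usual textbook forms of the Pearson chi-square statistic and Cramer's V are given below. The notation is introduced here for illustration and does not appear in the original paper: O_ij and E_ij are the observed and expected frequencies in row i and column j, R_i and C_j are the row and column totals, n is the total sample size, and r and c are the numbers of rows and columns.

```latex
% Expected frequency under no association
E_{ij} = \frac{R_i \, C_j}{n}

% Pearson chi-square with (r - 1)(c - 1) degrees of freedom
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

% Cramer's V, which is 0 with no association and 1 with perfect association
V = \sqrt{\frac{\chi^2}{n \, \min(r - 1,\; c - 1)}}
```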

## What is a frequency distribution?

A frequency distribution is an arrangement of statistical data that illustrates the frequency of the occurrence of the values of a variable (Merriam-Webster, 2003). Frequencies are arranged from smallest to largest value and can be ungrouped or grouped.

## Is the distribution normal?

About 1870, a Belgian mathematician, Quetelet, and an English scientist, Galton, made a discovery about individual differences. They found the same pattern of results over and over again when arranging measurements from large samples into frequency distributions. A symmetrical, bell-shaped curve (a normal curve) resulted when plotting human characteristics.

Bar charts help visualize a frequency distribution. Use the following code to illustrate a frequency distribution.

    proc gchart data = rawsub;
      vbar item1 / midpoints = 1 2 3 4 5;   /* one bar per response value */
    run;

## What is the range?

The range of a variable is the highest valid value minus the lowest valid value of the variable. Are data out-of-range? An out-of-range value is one that is less than the lowest valid value or greater than the highest valid value. For example, if gender is coded as 1 for male and 2 for female and a value of 6 is found for gender, the value 6 is out-of-range. A value of 8 on a scale of 1 to 5 is out-of-range.
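One way to check the range and list out-of-range records is sketched below. It assumes that `rawsub` contains `idnum`, `age`, `gpa`, and an item `item1` scored from 1 to 5; these names come from examples elsewhere in this paper, but their presence together in one data set is an assumption.

```sas
/* Minimum, maximum, and range for variables with many distinct values */
proc means data=rawsub n nmiss min max range;
  var age gpa;
run;

/* List records whose item1 value falls outside the valid 1-5 scale */
proc print data=rawsub;
  where item1 ne . and (item1 < 1 or item1 > 5);
  var idnum item1;
run;
```

Missing values are excluded from the WHERE condition so that the listing shows only recorded values that are invalid; missing data can be reviewed separately.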

## Are percentages significantly different?

Chi-square is the appropriate statistic to determine whether percentages in a two-way table are significantly different. The question could be, "Is there a significant difference between the proficiency levels of boys and girls?" PROC FREQ with the CHISQ option will answer the question. In the output that follows, the probability value, p = 0.0067, shows significance at the 0.01 level (p < 0.01). However, the statistical test does not indicate which cells show the differences. Intuitively, you could compare row percentages.

    proc freq data = all;
      tables gender * level / chisq;
    run;

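Table of gender by level (Frequency and Row Pct)

| Gender | Statistic | Level 1 | Level 2 | Level 3 | Level 4 | Total |
|--------|-----------|---------|---------|---------|---------|-------|
| Boys   | Frequency | 22      | 30      | 78      | 3       | 133   |
| Boys   | Row Pct   | 16.54   | 22.56   | 58.65   | 2.26    | 51.35 |
| Girls  | Frequency | 8       | 21      | 87      | 10      | 126   |
| Girls  | Row Pct   | 6.35    | 16.67   | 69.05   | 7.94    | 48.65 |
| Total  | Frequency | 30      | 51      | 165     | 13      | 259   |
| Total  | Percent   | 11.58   | 19.69   | 63.71   | 5.02    | 100.00 |

The percentages in the Total row and the Total column are percentages of all 259 observations.

| Statistic  | DF | Value   | Prob   |
|------------|----|---------|--------|
| Chi-Square | 3  | 12.2014 | 0.0067 |

Effective Sample Size = 259
Frequency Missing = 2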
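Because the original data set `all` is not available, the sketch below re-enters the eight cell counts from the table above and uses a WEIGHT statement so that PROC FREQ treats each row as a count. The data set name `cells` and the variable `count` are hypothetical. The EXPECTED and CELLCHI2 options are added here, not taken from the original code, as one possible way to see which cells contribute most to the chi-square value.

```sas
/* Cell counts copied from the gender-by-level table above */
data cells;
  input gender $ level count;
  datalines;
Boys 1 22
Boys 2 30
Boys 3 78
Boys 4 3
Girls 1 8
Girls 2 21
Girls 3 87
Girls 4 10
;
run;

proc freq data=cells;
  weight count;   /* each row stands for COUNT observations */
  tables gender*level / chisq expected cellchi2 nocol nopercent;
run;
```

With these counts, the chi-square statistic should again be about 12.20 with 3 degrees of freedom. Working the cell contributions by hand suggests that levels 1 and 4 account for most of that total, which agrees with the visible differences in row percentages.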

## What is the mean?

The mean is an arithmetic average calculated by adding the values of a variable and dividing by the number of values. Some statistical procedures determine significant difference between group means.
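As a sketch of how group means might be described and then compared, the statements below use PROC MEANS and PROC TTEST. They assume that `rawsub` contains `gpa` and a cleaned `gender` variable with exactly two values; both assumptions go beyond what the original code shows.

```sas
/* Number of cases, mean, and standard deviation of gpa for each gender group */
proc means data=rawsub n mean std;
  class gender;
  var gpa;
run;

/* Two-group comparison of mean gpa; the CLASS variable must have two levels */
proc ttest data=rawsub;
  class gender;
  var gpa;
run;
```

For more than two groups, the appendix points to analysis of variance (PROC ANOVA or PROC GLM) instead of the t-test.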

## What is the variance?

When describing a set of scores, the amount of dispersion, or spread, is measured by the variance. The variance is reported along with the mean to describe how widely the scores vary. The variance is the average of the squared deviations from the mean, obtained by adding the squared deviations and dividing by the total number of cases. The square root of the variance is called the standard deviation.
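The definitions above can be written out as formulas, followed by a small worked example. The symbols and the eight scores are chosen here for illustration and are not data from the paper.

```latex
% Mean, variance (dividing by the number of cases N), and standard deviation
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
\qquad
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2
\qquad
\sigma = \sqrt{\sigma^2}

% Worked example with the scores 2, 4, 4, 4, 5, 5, 7, 9 (N = 8)
\bar{x} = \frac{40}{8} = 5
\qquad
\sum (x_i - \bar{x})^2 = 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32
\qquad
\sigma^2 = \frac{32}{8} = 4
\qquad
\sigma = 2

% Sample variance, dividing by N - 1 instead of N
s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{32}{7} \approx 4.57
```

Note that PROC MEANS and PROC UNIVARIATE divide by N - 1 by default (VARDEF=DF). Specifying VARDEF=N on the PROC statement gives the divide-by-N form described in the text.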

## Conclusion

You've got a plan explicitly stating the purpose, objectives, and goals of your project. Data has been collected in an organized manner. Research questions, a part of your plan, were investigated. Appropriate statistical analyses were used to answer research questions and to satisfy the purpose, objectives, and goals of the study. Appendix A provides a quick reference guide to pertinent statistical analyses ranging from the basic (e.g., frequency, average) to the complex (e.g., canonical correlation, structural equation modeling). You have the power with SAS and the ability with statistics to answer your research questions.

## References

Bowerman, B., O'Connell, R., Orris, J., & Murphree, E. (2010). Essentials of Business Statistics, Third Edition. Boston: McGraw-Hill Irwin.

Darwin, C. (1888). On the origin of species, 1859. Washington Square, NY: New York University Press.

Ferguson, G. A. (1981). Statistical analysis in psychology and education, 5th Ed. New York: McGraw-Hill Book Company.

Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction. White Plains, NY: Longman Publishers USA.

Glass, G. V. & Hopkins, K. D. (1984). Statistical methods in education and psychology, 2nd Ed. Boston: Allyn and Bacon.

Kranzler, G. & Moursund, J. (1999). Statistics for the terrified, 2nd Ed. Upper Saddle River, NJ: Prentice Hall.

Merriam-Webster Dictionary Online 2003 @ http://www.m-w.com/home.htm

SAS Applications Guide, 1980 Edition. Cary, N.C.: SAS Institute.

SAS Language, Version 6. Cary, N.C.: SAS Institute, 1990.

SAS Language and Procedures, Version 6, First Edition. Cary, N.C.: SAS Institute, 1989.

SAS OnlineDoc, Version 8, SAS/STAT User's Guide, Chapter 63. Cary, N.C.: SAS Institute, 1999.

SAS Procedures, Version 6, Third Edition. Cary, N.C.: SAS Institute, 1990.

Truxillo, C. (2003). Multivariate Statistical Methods: Practical Research Applications Course Notes. Cary, N.C.: SAS Institute.

Diana Suhr is a Statistical Analyst in Information Management and Technology, Institutional Reporting and Analysis Services at the University of Northern Colorado. She earned a Ph.D. in Educational Psychology at UNC in 1999. The first programming language she learned was Fortran in 1970. She has been a SAS programmer since 1984.

## Contact

Diana Suhr, Statistical Analyst
Institutional Research
University of Northern Colorado
Greeley, CO 80639
970-351-2193, diana.suhr@unco.edu

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

## Appendix A: Research Questions, Statistics, and SAS Procedures

| Question | Statistic | SAS Procedure |
|----------|-----------|---------------|
| What is the frequency distribution? | frequencies | PROC FREQ, PROC UNIVARIATE |
| What is the range? | range = maximum - minimum | PROC FREQ, PROC MEANS |
| Are data out-of-range? | frequency values | PROC FREQ |
| Is the distribution normal? | normality, skewness, kurtosis | PROC UNIVARIATE |
| Is there a significant difference in frequencies or percentages between groups? | chi-square | PROC FREQ |
| Is the sample in the same distribution as the population? (on selected variables) | chi-square | PROC FREQ |
| Is there a relationship between two variables? | correlation | PROC CORR |
| Is there a relationship between two groups of variables? | canonical correlation | PROC CANCORR |
| Are there direct and indirect relationships between variables? | structural equation modeling | PROC CALIS |
| Is there a difference between the means? (two groups) | t-test | PROC TTEST |
| Is there a difference between the means? (more than two groups) | analysis of variance (anova) | PROC ANOVA, PROC GLM |
| Is there a difference between the means of a group of variables? | multivariate analysis of variance (manova) | PROC GLM |
| Is there change over time? | repeated measures anova | PROC GLM |
| Is there change over time? | latent growth curve modeling | PROC CALIS |
| Does a group of variables predict the value of another variable? | regression (linear) | PROC REG, PROC GLM |
| Does a group of variables predict a nonlinear relationship with another variable? | nonlinear regression | PROC NLIN |
| Does a group of variables discriminate or distinguish group membership? | discriminant analysis | PROC DISCRIM |
| Do items measure a common construct? | exploratory factor analysis (groups items) | PROC FACTOR |
| Can the factor structure of a measurement instrument be confirmed? | confirmatory factor analysis | PROC CALIS |
| Can observations be grouped or clustered according to similarities/common characteristics? | cluster analysis (groups observations) | PROC CLUSTER, PROC TREE |
| Can ranked data be analyzed? | nonparametric tests (Wilcoxon, Mann-Whitney, Kruskal-Wallis) | PROC NPAR1WAY |

© Suhr, 2003