
Summary Statistics

Objectives
To determine "typical" values of variables.
To look into measures of the spread of data.
To look into measures of how variables are correlated.
To determine how to reduce the number of variables through factor analysis.
To look at how to assess the reliability of data.

Content
Descriptive Statistics
Correlation Analysis
Factor Analysis
Reliability Analysis

Measuring Center: The Mean


The most common measure of center is the arithmetic average, or mean.
To find the mean $\bar{x}$ (pronounced "x-bar") of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, x_3, \ldots, x_n$, their mean is:

$$\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

or in more compact notation

$$\bar{x} = \frac{\sum x_i}{n}$$
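As a minimal sketch of the formula above, the mean can be computed directly from its definition and compared with R's built-in `mean()`. The observations in `x` are made up for illustration only.

```r
# Hypothetical observations, used only for illustration
x <- c(4, 8, 15, 16, 23, 42)

n <- length(x)          # number of observations
x_bar <- sum(x) / n     # sum of observations divided by n

x_bar                   # mean from the definition
mean(x)                 # built-in function, should agree
```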

Measuring Center: The Median


The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger.
To find the median of a distribution:

1.Arrange all observations from smallest to largest.


2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
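The three steps above can be sketched as a small R function. The name `median_by_rule` and the sample vectors are hypothetical; in practice the built-in `median()` does the same job.

```r
# A sketch of the median rule; median_by_rule is a hypothetical helper name
median_by_rule <- function(x) {
  s <- sort(x)                  # step 1: arrange from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]              # step 2: odd n, take the center observation
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # step 3: even n, average the two center observations
  }
}

odd_obs  <- c(7, 1, 5)          # made-up data with odd n
even_obs <- c(7, 1, 5, 3)       # made-up data with even n

median_by_rule(odd_obs)
median_by_rule(even_obs)
```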

Measures of Location
The mean is an appropriate measure of the center of the data if the data have a symmetric distribution with light tails. The median is preferable if the distribution has heavy tails or is asymmetric. The median is resistant: a few extreme observations barely change it.
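To illustrate resistance, the sketch below adds one extreme value to a small made-up sample and compares how far the mean and the median move. The numbers are hypothetical and chosen only to make the effect visible.

```r
# Hypothetical sample, then the same sample with one extreme observation
base_obs     <- c(30, 32, 35, 38, 40)
with_outlier <- c(base_obs, 1000)

mean_base   <- mean(base_obs)
median_base <- median(base_obs)

mean_out    <- mean(with_outlier)     # pulled strongly toward the outlier
median_out  <- median(with_outlier)   # moves only slightly

c(mean_base = mean_base, mean_out = mean_out)
c(median_base = median_base, median_out = median_out)
```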

Measuring Spread: Quartiles


A measure of center alone can be misleading. A useful numerical description of a distribution requires both a measure of center and a measure of spread.

How to Calculate the Quartiles and the Interquartile Range

To calculate the quartiles:
1) Arrange the observations in increasing order and locate the median M.
2) The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
3) The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.
The interquartile range (IQR) is defined as: IQR = Q3 - Q1
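The procedure above can be sketched in R as follows. The helper name `quartiles_by_rule` and the data are hypothetical, and the function assumes at least two observations. Note that R's built-in `quantile()` uses a different interpolation rule by default, so its quartiles can differ slightly from this median-of-halves method.

```r
# A sketch of the median-of-halves rule; quartiles_by_rule is a hypothetical name.
# Assumes length(x) >= 2.
quartiles_by_rule <- function(x) {
  s <- sort(x)                      # step 1: order the observations
  n <- length(s)
  half <- floor(n / 2)              # size of each half (the median itself is excluded when n is odd)
  lower <- s[seq_len(half)]         # observations to the left of the median
  upper <- s[(n - half + 1):n]      # observations to the right of the median
  q1 <- median(lower)               # step 2
  q3 <- median(upper)               # step 3
  c(Q1 = q1, M = median(s), Q3 = q3, IQR = q3 - q1)
}

obs <- c(2, 4, 4, 5, 6, 7, 8, 9, 10)   # made-up data
res <- quartiles_by_rule(obs)
res

# Built-in default, shown for comparison only; it may not match the rule above
quantile(obs, c(0.25, 0.75))
```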

Five-Number Summary
The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.
Minimum Q1 M Q3 Maximum

Measuring Spread: Standard Deviation

The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation.
The standard deviation $s_x$ measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. This average squared distance is called the variance.

$$\text{variance} = s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

$$\text{standard deviation} = s_x = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
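As a worked check of the two formulas, the sketch below computes the variance and standard deviation from the definition and compares them with R's `var()` and `sd()`, which also divide by n - 1. The data are made up for illustration.

```r
# Hypothetical observations
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
x_bar <- mean(x)

variance <- sum((x - x_bar)^2) / (n - 1)   # average squared distance, dividing by n - 1
std_dev  <- sqrt(variance)                 # square root of the variance

c(variance = variance, var_builtin = var(x))
c(std_dev = std_dev, sd_builtin = sd(x))
```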

Descriptive Statistics Commands


First read in the data file:
> dogB <- read.table('Dog Biscuit Data.csv', header = TRUE, sep = ',')
> summary(dogB) # gives the min, max, 1st and 3rd quartiles, mean and median of the variables in the data
> mean(dogB$prob)
> sd(dogB$prob)
> install.packages("psych")
> library(psych)
> describe(dogB) # requires the psych package
> describeBy(dogB, dogB$edu) # descriptive statistics by edu

Frequency Table
> prob <- dogB$prob # saves some typing effort
> prob.freq = table(prob)
> prob.freq
prob
 1  2  3  4  5  6  7  8  9 10
 8 16  8  8  8 12  8  8 12 12
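A frequency table is often easier to read with percentages alongside the counts. The sketch below shows one way to do this with `prop.table()`. The vector `toy_prob` is made up, since the dog biscuit data file is not reproduced here.

```r
# Hypothetical responses standing in for dogB$prob
toy_prob <- c(1, 2, 2, 3, 3, 3, 10)

toy_freq <- table(toy_prob)                          # counts per distinct value
toy_pct  <- round(100 * prop.table(toy_freq), 1)     # counts converted to percentages

# Counts and percentages side by side
cbind(freq = toy_freq, percent = toy_pct)
```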

R Multiple Response
> subdata <- dogB[,c('kbblemix','mixbars','bonnys','doggies')] # subdata contains 4 columns representing the 4 brands
> b = sum(subdata == "r") # b is the total number of brand recalled responses
> b
[1] 216
> d = colSums(subdata == "r") # number of brand recalled responses for each brand
> d
kbblemix  mixbars   bonnys  doggies
      52       56       52       56
> f = as.numeric(c(d, b)) # f stores the frequencies
> f
[1]  52  56  52  56 216

> data.frame(brands = c(names(d), "Total"), freq = f, percent = (f/b)*100) # produce the output
    brands freq   percent
1 kbblemix   52  24.07407
2  mixbars   56  25.92593
3   bonnys   52  24.07407
4  doggies   56  25.92593
5    Total  216 100.00000

Split File Procedure


> temp = which(dogB[,'gender'] == "m") # temp contains the id# of male respondents
> mfile = dogB[temp,] # mfile is a file of male respondents
> temp = which(dogB[,'gender'] == "f")
> ffile = dogB[temp,]
> summary(mfile)
> summary(ffile)
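The same split can also be done in one step with base R's `split()`, which returns one data frame per group. This is only an alternative sketch. The data frame `toy` is made up, with a `gender` column coded "m"/"f" as in the commands above.

```r
# Hypothetical stand-in for dogB with a gender column coded "m"/"f"
toy <- data.frame(
  gender = c("m", "f", "m", "f", "f"),
  prob   = c(3, 7, 5, 9, 8)
)

by_gender <- split(toy, toy$gender)   # list with one data frame per gender
mfile <- by_gender[["m"]]             # male respondents
ffile <- by_gender[["f"]]             # female respondents

lapply(by_gender, summary)            # summary statistics for each group
```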

Sample Covariance

$$\mathrm{Cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

Sample Correlation Coefficient

$$r = \frac{\mathrm{Cov}_{xy}}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$$
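As a worked check of the covariance and correlation formulas, the sketch below computes both from the definitions and compares them with R's `cov()` and `cor()`. The paired observations are made up for illustration.

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

N <- length(x)

# Sample covariance: sum of cross-products of deviations, divided by N - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (N - 1)

# Sample correlation: covariance divided by the product of the standard deviations
r <- cov_xy / (sd(x) * sd(y))

c(cov_xy = cov_xy, cov_builtin = cov(x, y))
c(r = r, cor_builtin = cor(x, y))
```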

Things to know about the Correlation


A single numerical summary statistic that measures the strength of a linear relationship between x and y.
r varies between -1 and +1; 0 = no linear relationship.
r is an effect size: .1 = small effect, .3 = medium effect, .5 = large effect.
Coefficient of determination: $r^2$.

What is the Correlation?

Things to know about the Correlation


Consider the data set {(1,1), (2,2), (3,3), (4,4), (5,5), (0,7)}. The correlation of all six members is 0. Although five members of the data set are perfectly correlated, the sixth member (0,7) is not at all aligned with the others.
> x <- c(1,2,3,4,5,0)
> y <- c(1,2,3,4,5,7)
> cor(x,y)
[1] 0
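To see how much the single point (0,7) matters, the sketch below recomputes the correlation with that point dropped. It uses the same six points as the example. The variable names `r_all` and `r_five` are introduced here for illustration.

```r
x <- c(1, 2, 3, 4, 5, 0)
y <- c(1, 2, 3, 4, 5, 7)

r_all  <- cor(x, y)             # all six points: essentially zero
r_five <- cor(x[-6], y[-6])     # sixth point (0,7) removed: perfectly linear

c(r_all = r_all, r_five = r_five)
```

One unusual observation is enough to take r from a perfect linear relationship to none, which is why a scatterplot should accompany any reported correlation.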

R Correlation
To calculate the covariance and correlation amongst environ2, environ3 and environ5, first extract the three variables from data by:
> green <- data[,c('environ2', 'environ3', 'environ5')]
> cov(green) # calculate the covariance matrix
           environ2   environ3   environ5
environ2 0.25212121 0.09090909 0.09575758
environ3 0.09090909 0.25252525 0.08585859
environ5 0.09575758 0.08585859 0.25242424
> cor(green) # calculate the correlation matrix
          environ2  environ3  environ5
environ2 1.0000000 0.3602883 0.3795796
environ3 0.3602883 1.0000000 0.3400680
environ5 0.3795796 0.3400680 1.0000000

Factor Analysis

> library(rela) # requires the rela package
> g = as.matrix(green) # change data file to a matrix
> pa <- paf(g) # use paf function in rela to do factor analysis
> summary(pa) # obtain the output

> barplot(pa$Eigenvalues[,1]) # draw the first column of eigenvalues
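For readers without the rela package, a rough base R sketch of the eigenvalue step is given below. It takes the correlation matrix printed in the R Correlation section and applies `eigen()` to it. This is an assumption-laden stand-in: it uses the unreduced correlation matrix, so the values need not match what `paf()` reports. The names `R_mat`, `ev` and `prop_var` are introduced here for illustration.

```r
# Correlation matrix of environ2, environ3 and environ5, copied from the cor(green) output
vars <- c("environ2", "environ3", "environ5")
R_mat <- matrix(
  c(1.0000000, 0.3602883, 0.3795796,
    0.3602883, 1.0000000, 0.3400680,
    0.3795796, 0.3400680, 1.0000000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

ev <- eigen(R_mat, symmetric = TRUE)$values   # eigenvalues, largest first
prop_var <- ev / sum(ev)                      # share of total variance per component

ev
round(prop_var, 3)
```

Calling `barplot(ev)` on the result gives a scree-style plot comparable in spirit to the one drawn from `pa$Eigenvalues`.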

> pav <- varimax(pa$Factor.Loadings) # Varimax rotation
> pav # print the rotated loadings
> scores <- g %*% as.matrix(pav$loadings) # get factor scores

R Reliability Procedure
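> library(multilevel) # requires the multilevel package
> green1 <- data[, c('environ1', 'environ4')] # taken from data, since green holds only environ2, environ3 and environ5
> green2 <- green[, c('environ2', 'environ3', 'environ5')]
> cronbach(green1)$Alpha
[1] 0.85513
> cronbach(green2)$Alpha
[1] 0.62788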
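To show what the alpha value summarizes, the sketch below computes Cronbach's alpha from its standard formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of the total score), using base R only. The function name `cronbach_alpha` and the item scores are hypothetical, and this is not claimed to reproduce the internals of `cronbach()` in the multilevel package.

```r
# A sketch of the standard Cronbach's alpha formula; cronbach_alpha is a hypothetical name
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                        # number of items in the scale
  item_vars <- apply(items, 2, var)       # variance of each item
  total_var <- var(rowSums(items))        # variance of the summed scale score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Made-up item scores for five respondents
toy_items <- data.frame(
  q1 = c(1, 2, 3, 4, 5),
  q2 = c(2, 2, 3, 5, 5),
  q3 = c(1, 3, 3, 4, 4)
)

toy_alpha <- cronbach_alpha(toy_items)
toy_alpha
```

Items that move together make the variance of the total score large relative to the sum of the item variances, which pushes alpha toward 1.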

References

The Basic Practice of Statistics, 6ed. by Moore, D., W. Notz & M. Fligner, Chapters 2, 4 (pp.106-114).
