Summary Statistics

Objectives
To determine "typical" values of variables.
To look into measures of spread of data.
To look into measures of how variables are correlated.
To determine how to reduce variables through factor analysis.
To look at how to assess the reliability of data.
Content
Descriptive Statistics
Correlation Analysis
Factor Analysis
Reliability Analysis
Measuring Center: The Mean

The most common measure of center is the arithmetic average, or mean.
To find the mean $\bar{x}$ (pronounced "x-bar") of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, x_3, \dots, x_n$, their mean is:
$$\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

or in more compact notation
$$\bar{x} = \frac{\sum x_i}{n}$$
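As a quick check of the formula, the following minimal R sketch computes the mean by hand and compares it with the built-in mean(). The vector obs is made-up illustration data, not taken from the dog biscuit file.

```r
# Hypothetical observations, used only for illustration
obs <- c(4, 8, 15, 16, 23, 42)

n <- length(obs)              # number of observations
xbar_manual <- sum(obs) / n   # sum of observations divided by n
xbar_builtin <- mean(obs)     # built-in equivalent

print(xbar_manual)
print(xbar_builtin)
```

Both values should agree, since mean() implements the same arithmetic average.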
Measuring Center: The Median

The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger.
To find the median of a distribution:
1. Arrange all observations from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
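The three steps can be sketched directly in R. The function name median_by_steps and both data vectors are illustrative assumptions; in practice the built-in median() does the same job.

```r
# Median following the three steps; median_by_steps is an illustrative name
median_by_steps <- function(x) {
  s <- sort(x)                        # step 1: order the observations
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # step 2: odd n, take the center observation
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # step 3: even n, average the two center observations
  }
}

odd_obs  <- c(7, 1, 5, 3, 9)          # hypothetical data, n = 5
even_obs <- c(7, 1, 5, 3, 9, 11)      # hypothetical data, n = 6

print(median_by_steps(odd_obs))
print(median_by_steps(even_obs))
```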
Measures of Location
The mean is an appropriate measure of the center of the data if the data has a symmetric distribution with light tails. The median is more appropriate if the distribution has heavy tails or is asymmetric. The median is resistant: a few extreme observations have little effect on it.
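To illustrate resistance, the small R sketch below replaces one value in a made-up sample with an extreme observation. Both vectors are hypothetical; the point is only that the mean moves a lot while the median does not.

```r
# Hypothetical sample, roughly symmetric
obs <- c(2, 3, 4, 5, 6)
# Same sample with the largest value replaced by an extreme one
obs_outlier <- c(2, 3, 4, 5, 60)

print(c(mean = mean(obs), median = median(obs)))
print(c(mean = mean(obs_outlier), median = median(obs_outlier)))
```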
Measuring Spread: Quartiles

A measure of center alone can be misleading. A useful numerical description of a distribution requires both a measure of center and a measure of spread.
How to Calculate the Quartiles and the Interquartile Range
To calculate the quartiles:
1) Arrange the observations in increasing order and locate the median M.
2) The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
3) The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.
The interquartile range (IQR) is defined as: IQR = Q3 - Q1
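A minimal R sketch of this median-of-halves procedure follows. The function name quartiles_by_halves and the data are assumptions made for illustration. Note that R's built-in quantile() uses a different interpolation rule by default, so its quartiles can differ slightly from the ones produced here.

```r
# Quartiles as medians of the lower and upper halves of the ordered data.
# quartiles_by_halves is an illustrative name, not a base R function.
quartiles_by_halves <- function(x) {
  s <- sort(x)
  n <- length(s)
  stopifnot(n >= 2)
  half <- floor(n / 2)            # size of each half; the median is left out when n is odd
  lower <- s[seq_len(half)]       # observations to the left of the median
  upper <- s[(n - half + 1):n]    # observations to the right of the median
  q1 <- median(lower)
  q3 <- median(upper)
  c(Q1 = q1, M = median(s), Q3 = q3, IQR = q3 - q1)
}

obs <- c(5, 7, 10, 14, 18, 19, 25, 29, 31, 33)  # hypothetical data
q <- quartiles_by_halves(obs)
print(q)
```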
Five-Number Summary
The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.
Minimum Q1 M Q3 Maximum
Measuring Spread: Standard Deviation
The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation.
The standard deviation $s_x$ measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. This average squared distance is called the variance.
$$\text{variance} = s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
$$\text{standard deviation} = s_x = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
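The two formulas can be checked numerically. The sketch below uses a made-up vector obs and compares the hand computation with R's var() and sd(), both of which also divide by n - 1.

```r
# Hypothetical observations, used only for illustration
obs <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(obs)
xbar <- mean(obs)

# Sum of squared distances from the mean, divided by n - 1
variance_manual <- sum((obs - xbar)^2) / (n - 1)
sd_manual <- sqrt(variance_manual)

print(c(manual = variance_manual, builtin = var(obs)))
print(c(manual = sd_manual, builtin = sd(obs)))
```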
Descriptive Statistics Commands

First read in the data file:
> dogB <- read.table('Dog Biscuit Data.csv', header = TRUE, sep=',')
> summary(dogB) # gives the min, max, 1st & 3rd quartiles, mean and median of the variables in the data
> mean(dogB$prob)
> sd(dogB$prob)
> install.packages("psych")
> library(psych)
> describe(dogB) # requires psych package
> describeBy(dogB, dogB$edu) # descriptive statistics by edu
Frequency Table
> prob <- dogB$prob # saves some typing effort
> prob.freq = table(prob)
> prob.freq
prob
 1  2  3  4  5  6  7  8  9 10
 8 16  8  8  8 12  8  8 12 12
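Counts are often easier to read next to percentages. A minimal sketch, assuming a small made-up vector ratings in place of dogB$prob, combines table() with prop.table():

```r
# Hypothetical ratings standing in for dogB$prob
ratings <- c(1, 2, 2, 3, 3, 3, 5, 5)

freq <- table(ratings)                     # counts per distinct value
pct <- round(100 * prop.table(freq), 1)    # share of each value, in percent

print(cbind(freq = freq, percent = pct))
```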
R Multiple Response
> subdata <- dogB[,c('kbblemix','mixbars','bonnys', 'doggies')] # subdata contains the 4 columns representing the 4 brands
> b = sum(subdata == "r") # b is the total number of brand recalled responses
> b
[1] 216
> d = colSums(subdata == "r") # number of brand recalled responses for each brand
> d
kbblemix  mixbars   bonnys  doggies
      52       56       52       56
> f = as.numeric(c(d, b)) # f stores the frequencies
> f
[1]  52  56  52  56 216
> data.frame( brands = c(names(d), "Total"), freq=f, percent = (f/b)*100) # produce the output
    brands freq   percent
1 kbblemix   52  24.07407
2  mixbars   56  25.92593
3   bonnys   52  24.07407
4  doggies   56  25.92593
5    Total  216 100.00000
Split File Procedure

> temp = which(dogB[,'gender'] == "m") # temp contains the row indices of male respondents
> mfile = dogB[temp,] # mfile is a file of male respondents
> temp = which(dogB[,'gender'] == "f") # temp now contains the row indices of female respondents
> ffile = dogB[temp,] # ffile is a file of female respondents
> summary(mfile)
> summary(ffile)
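An alternative way to split a file, sketched here under the assumption of a tiny made-up data frame toy standing in for dogB, is base R's split(), which returns one data frame per group in a single call:

```r
# Hypothetical stand-in for dogB with just two columns
toy <- data.frame(
  gender = c("m", "f", "f", "m", "f"),
  prob   = c(3, 7, 5, 9, 9)
)

by_gender <- split(toy, toy$gender)   # list with one data frame per gender value

print(summary(by_gender$m))           # summary for male respondents
print(summary(by_gender$f))           # summary for female respondents

# Group means of one variable without splitting first
group_means <- tapply(toy$prob, toy$gender, mean)
print(group_means)
```

This avoids repeating the which() step for every group, which helps when the grouping variable has more than two levels.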
Sample Covariance
$$\mathrm{Cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
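A short R sketch of the sample covariance formula, using two made-up vectors x and y, and checked against the built-in cov():

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

N <- length(x)

# Sum of cross-products of deviations from the means, divided by N - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (N - 1)

print(c(manual = cov_manual, builtin = cov(x, y)))
```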
Sample Correlation Coefficient
$$r = \frac{\mathrm{Cov}_{xy}}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$$
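The correlation coefficient is the covariance rescaled by the two standard deviations. A minimal sketch with made-up vectors x and y, compared against cor():

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

print(c(manual = r_manual, builtin = cor(x, y)))
```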
Things to know about the Correlation

A single numerical summary statistic that measures the strength of a linear relationship between x and y.
r varies between -1 and +1
0 = no linear relationship
r is an effect size
.1 = small effect
.3 = medium effect
.5 = large effect
Coefficient of determination, r^2
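For a simple linear regression of y on x, the coefficient of determination r^2 equals the proportion of variance in y explained by the fitted line. The sketch below checks this with made-up data; x, y, and fit are illustrative names.

```r
# Hypothetical paired observations with a strong linear trend
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

r <- cor(x, y)
fit <- lm(y ~ x)                         # simple linear regression of y on x
r_squared_lm <- summary(fit)$r.squared   # R-squared reported by lm

print(c(r = r, r_squared = r^2, lm_r_squared = r_squared_lm))
```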
What is the Correlation?
Things to know about the Correlation

Consider the data set {(1,1), (2,2), (3,3), (4,4), (5,5), (0,7)}. The statistical correlation of the set of all six members has value 0. Although five members of the data set are perfectly correlated, the sixth member (0,7) is not at all aligned with the others.
> x <- c(1,2,3,4,5,0)
> y <- c(1,2,3,4,5,7)
> cor(x,y)
[1] 0
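To see how much the single point (0,7) matters, the same data can be re-run with that point dropped. This is a small sketch built on the vectors from the example; the names r_all and r_without_last are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 0)
y <- c(1, 2, 3, 4, 5, 7)

r_all <- cor(x, y)                  # all six points
r_without_last <- cor(x[-6], y[-6]) # drop the sixth point (0,7)

print(r_all)
print(r_without_last)
```

The first value is zero up to floating-point rounding, and the second is 1, since the remaining five points lie exactly on the line y = x.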
R Correlation
To calculate the covariance and correlation amongst environ2, environ3 and environ5, first extract the three variables from data by:
> green <- data[,c('environ2', 'environ3', 'environ5')]
> cov(green) # calculate the covariance matrix
           environ2   environ3   environ5
environ2 0.25212121 0.09090909 0.09575758
environ3 0.09090909 0.25252525 0.08585859
environ5 0.09575758 0.08585859 0.25242424
> cor(green) # calculate the correlation matrix
          environ2  environ3  environ5
environ2 1.0000000 0.3602883 0.3795796
environ3 0.3602883 1.0000000 0.3400680
environ5 0.3795796 0.3400680 1.0000000
Factor Analysis
> library(rela) # requires the rela package
> g = as.matrix(green) # change data file to a matrix
> pa <- paf(g) # use paf function in rela to do factor analysis
> summary(pa) # obtain the output
> barplot(pa$Eigenvalues[,1]) # draw the first column of eigenvalues
> pav <- varimax(pa$Factor.Loadings) # Varimax rotation
> pav # print the rotated loadings
> scores <- g %*% as.matrix(pav$loadings) # get factor scores
R Reliability Procedure
> library(multilevel)
> green1 <- data[, c('environ1', 'environ4')] # environ1 and environ4 are not in green, so extract them from data
> green2 <- data[, c('environ2', 'environ3', 'environ5')]
> cronbach(green1)$Alpha
[1] 0.85513
> cronbach(green2)$Alpha
[1] 0.62788
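For intuition about what the reliability coefficient measures, Cronbach's alpha can also be computed by hand from the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), where k is the number of items. The sketch below uses a made-up three-item data frame; it illustrates the formula only and is not claimed to reproduce the internals of the multilevel package.

```r
# Hypothetical responses of six people to three items
items <- data.frame(
  item1 = c(4, 5, 3, 4, 2, 5),
  item2 = c(4, 4, 3, 5, 2, 4),
  item3 = c(5, 5, 2, 4, 3, 4)
)

k <- ncol(items)                      # number of items
item_vars <- sapply(items, var)       # variance of each item
total_var <- var(rowSums(items))      # variance of the summed scale score

alpha <- (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(alpha)
```

Alpha rises when the items covary strongly, because the variance of the total score then exceeds the sum of the individual item variances by a wide margin.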
References
Moore, D., W. Notz & M. Fligner. The Basic Practice of Statistics, 6th ed., Chapters 2 and 4 (pp. 106-114).