Multivariate analysis
Dependency techniques
Interdependency techniques
Session I
Descriptive Analysis of Univariate
& Multivariate Data
Parametric Testing
Z test, t test, ANOVA
SEM
Descriptive & Inferential Statistics
Descriptive statistics: organize, summarize, and present data
Inferential statistics: generalize from samples to populations
2. Graphical Representations
[Table: frequency distribution with a count per category and percentages, each % = (count/total) × 100]
Scale of measurement? Nominal
Frequency Distributions
CROSS-TABULATION: categorize on the basis of more than one variable at the same time; each cell gives the number of subjects (Ss) that fall in a particular category.

                        Total
Democrats    24    1     25
Republican   19    6     25
Total        43    7     50
Graphical Representations (Graphs & Tables)
Bar graph (categorical / qualitative data)
Histogram (continuous / quantitative variables)
Polygon: line graph
Best Graph Ever
Charles Minard's 1869 graph of Napoleon's 1812 march on Moscow shows the
dwindling size of the army. The broad line on top represents the army's size on
the march from Poland to Moscow. The thin dark line below represents the
army's size on the retreat. The width of the lines represents the army size,
which started over 400,000 strong and dwindled to 10,000. The bottom lines
are temperature and time scales, and the overall plot shows distance travelled.
Best Graph Ever: Modern Version
[Figure: jagged and smooth distribution curves, with normal-curve areas marked: 95% central, 2.5% in each tail, 13.5% between 1 and 2 SD on each side]
Qualitative data:
Mode always appropriate
Mean never appropriate
Summary Statistics
Describe the data in just 2 numbers:
Measures of central tendency: the typical or average score
Measures of variability: the typical or average variation

Measures of variability:
1. Range: distance from lowest to highest (uses 2 data points)
2. Variance (uses all data points)
3. Standard Deviation
4. Standard Error of the Mean
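The four measures of variability above can be computed directly; a minimal sketch in Python with NumPy, on a small made-up data set:

```python
import numpy as np

scores = np.array([4, 8, 6, 5, 3, 7, 9, 5])   # made-up sample data

range_ = scores.max() - scores.min()          # 1. Range: uses only 2 data points
variance = scores.var(ddof=1)                 # 2. Sample variance: uses all data points
sd = scores.std(ddof=1)                       # 3. Standard deviation
sem = sd / np.sqrt(len(scores))               # 4. Standard error of the mean

print(range_, variance, sd, sem)
```

Note `ddof=1` gives the sample (N − 1) variance, matching the inferential-statistics convention used later in the deck.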
Inferential Statistics
Used to draw conclusions about a population by examining the sample
Random selection
Equal chance for anyone to be selected makes sample more representative
They help test hypotheses, answer research questions, and derive meaning from the results
Inferential statistics test the likelihood that the alternative (research) hypothesis (H1) is true
and the null hypothesis (H0) is not
Steps in Inferential Statistics
1. State the hypothesis
H0: no difference between the 2 means; any difference found is due to sampling error.
Results are stated in terms of the probability that H0 is false.
2. Set the level of significance
The probability that sample means are different enough to reject H0 (.05 or .01).
Parametric vs. Nonparametric Tests
Test statistic: in a parametric test it is based on a distribution; in a nonparametric test it is arbitrary and not based on any distribution.
Measurement level: a parametric test assumes the variables of interest are measured on an interval or ratio scale; a nonparametric test assumes a nominal or ordinal scale.
Measure of central tendency: the mean in a parametric test; the median in a nonparametric test.
Population information: a parametric test assumes complete information about the population; a nonparametric test assumes none.
Applicability: parametric tests apply to variables only; nonparametric tests apply to both variables and attributes.
Degree of association between two quantitative variables: Pearson's coefficient of correlation (parametric); Spearman's rank correlation (nonparametric).
Error in Hypothesis Testing
Type I error: the null hypothesis is rejected when it is true.
Type II error: the null hypothesis is not rejected when it is false.

Decision \ Truth      Null hypothesis    Alternative hypothesis
Do not reject null    OK                 TYPE II ERROR
Reject null           TYPE I ERROR       OK

There is always a chance of making one of these errors. We'll want to minimize the chance of doing so!
Identifying the Appropriate Statistical Test of Difference
One variable: one-way chi-square
Two variables (1 IV with 2 levels; 1 DV): t-test
Two variables (1 IV with 2+ levels; 1 DV): ANOVA
z = (X - μ) / σ = (66.41 - 63.8) / 2.66 = 0.98
Z Test Example
1. Percentile: how many 15-year-old girls are shorter than Reshma? 50% + 33.65% = 83.65%
2. What percentage of 15-year-old girls are taller than Reshma? 50% - 33.65% OR 100% - 83.65% = 16.35%
3. What percentage of 15-year-old girls are as far from the mean as Reshma (tall or short)? 16.35% + 16.35% = 32.7%
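The three percentages above all come from the standard normal CDF; a quick check with SciPy (0.3365 is the area between the mean and z = 0.98):

```python
from scipy.stats import norm

z = 0.98                         # Reshma's z-score from the example
below = norm.cdf(z)              # proportion shorter: about 0.8365 (83.65%)
above = 1 - norm.cdf(z)          # proportion taller: about 0.1635 (16.35%)
as_far = 2 * (1 - norm.cdf(z))   # both tails combined: about 0.327 (32.7%)
print(below, above, as_far)
```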
Z Test Example 2
Manu is 15 years old and 61.2 in. tall. For 15-year-old boys, μ = 67, σ = 3.19.
z = (X - μ) / σ = (61.2 - 67) / 3.19 = -1.82
t - Test
The t-test is used to test hypotheses about means when the population variance is
unknown (the usual case).
Comes in 3 varieties:
Single sample, independent samples, and dependent samples.
t - Test
Single-sample t: we have only 1 group; want to test against a hypothetical mean.
Independent-samples t: we have two means from two separate, unrelated groups.
Dependent t: we have two means; either the same people are in both groups, or people
are related, e.g., husband-wife, left hand-right hand, hospital patient and visitor.
The t Distribution
We use t when the population variance is unknown (usual case) and sample size is small (N<100,
usual case). If you use a stat package for testing hypotheses about means, you will use t.
The t distribution is a short, fat relative of the normal. The shape of t depends on its df. As N
becomes infinitely large, t becomes normal.
Single-sample z test
For large samples (N > 100) we can use z to test hypotheses about means.

z = (X̄ - μ) / est.σM,  where est.σM = sX / √N  and  sX = √[ Σ(X - X̄)² / (N - 1) ]

Suppose
H0: μ = 10; H1: μ ≠ 10; sX = 5; N = 200
Then
est.σM = sX / √N = 5 / √200 = 5 / 14.14 = .35
If
X̄ = 11, then z = (11 - 10) / .35 = 2.83; since 2.83 > 1.96, p < .05
Single-sample t-test
With a small sample size, we compute the same numbers as we did for z, but we
compare them to the t distribution instead of the z distribution.
H0: μ = 10; H1: μ ≠ 10; sX = 5; N = 25
est.σM = sX / √N = 5 / √25 = 1; with X̄ = 11, t = (11 - 10) / 1 = 1
Interval = X̄ ± t·σM = 11 ± 2.064(1) = [8.936, 13.064]   (critical t at .05 with df = 24 is 2.064)
Interval is about 9 to 13 and contains 10, so n.s.
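The same numbers, using SciPy's t distribution for the critical value (df = N − 1 = 24):

```python
import math
from scipy.stats import t

mu0, s_x, n, xbar = 10, 5, 25, 11
sem = s_x / math.sqrt(n)                          # est. sigma_M = 1
t_stat = (xbar - mu0) / sem                       # t = 1
t_crit = t.ppf(0.975, df=n - 1)                   # two-tailed .05 critical value, about 2.064
ci = (xbar - t_crit * sem, xbar + t_crit * sem)   # about (8.936, 13.064)
print(t_stat, t_crit, ci)
```

The interval contains the hypothesised mean of 10, matching the slide's "n.s." conclusion.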
The basic ANOVA situation
Two variables: 1 categorical, 1 quantitative
Main question: do the means of the quantitative variable depend on which group (given by the categorical variable) the individual is in?
[Figure: side-by-side plots of the response ("days") for each group]
Whether the observed differences are significant depends on the difference in the means, the standard deviations of each group, and the sample sizes.
Between-group variation: for each data value, look at the difference between its group mean and the overall mean: (x̄i - x̄)²
Within-group variation: for each data value, look at the difference between that value and the mean of its group: (xij - x̄i)²
How ANOVA works
The ANOVA F-statistic is the ratio of the between-group variation to the within-group variation:

F = Between / Within = MSG / MSE

A large F is evidence against H0, since it indicates that there is more difference between groups than within groups.
An even smaller example
Suppose we have three groups:
Group 1: 5.3, 6.0, 6.7
Group 2: 5.5, 6.2, 6.4, 5.7
Group 3: 7.5, 7.2, 7.9
Total sum of squares = 6.884 with df = 9.
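The three groups above can be run through a one-way ANOVA directly; a sketch with `scipy.stats.f_oneway`:

```python
from scipy.stats import f_oneway

g1 = [5.3, 6.0, 6.7]
g2 = [5.5, 6.2, 6.4, 5.7]
g3 = [7.5, 7.2, 7.9]

f_stat, p_value = f_oneway(g1, g2, g3)   # F = MSG / MSE
print(f_stat, p_value)
```

For these data F is large (Group 3 sits well above the other two relative to the within-group spread), so the group means differ significantly at the .05 level.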
Like one-way ANOVA, one-way ANCOVA is used to determine whether there are any significant
differences between two or more independent (unrelated) groups on a dependent variable.
However, instead of differences in the group means, ANCOVA looks for differences in adjusted means
(i.e., adjusted for the covariate).
One-way ANCOVA has the additional benefit of "statistically controlling" for a third variable
(sometimes known as a "confounding variable"), which is believed to affect results.
In fact, you can get almost identical results in SPSS by conducting this analysis using either the
"Analyze > Regression > Linear" dialog menus or the "Analyze > General Linear Model (GLM) >
Univariate" dialog menus.
A key (but not the only) difference between these methods is that you get slightly different output tables.
Also, regression requires that the user dummy-code factors, while GLM handles dummy coding through
the "contrasts" option. The linear regression command in SPSS also allows for variable entry in
hierarchical blocks (i.e., stages).
One way MANOVA (Multivariate analysis of
Variance )
One-way MANOVA is used to determine if there are any differences between independent
groups on more than one continuous dependent variable.
It differs from one-way ANOVA, which measures only one dependent variable.
In basic terms, a MANOVA is an ANOVA with two or more continuous response variables.
If there are two independent variables rather than one, a two-way MANOVA is used.
One-way MANOVA is an omnibus test statistic and cannot tell you which specific groups
were significantly different from each other.
Post-hoc tests have to be carried out to determine which of these groups differ significantly.
SEM (Structural Equation Modeling )
SEM is a very general statistical modeling technique widely used in behavioral science
which can be viewed as a combination of factor analysis, regression and path analysis
Multivariate analysis
Dependency techniques
Multiple regression
Discriminant analysis
Conjoint analysis
Inter-dependency techniques
Factor Analysis
Cluster Analysis
Correlation
Pearson correlation is a measure of strength and direction of association that exists
between two variables measured on at least an interval scale.
It attempts to draw a line of best fit through the data of the two variables, and the Pearson
correlation coefficient, r, indicates how far away all data points are from this line of best fit.
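A minimal illustration of r with `scipy.stats.pearsonr`, on made-up paired interval-scale data:

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]      # made-up interval-scale data
y = [2, 4, 5, 4, 5]

r, p = pearsonr(x, y)    # r: strength and direction of the linear association
print(r, p)
```

Here r is positive and fairly large, indicating the points lie close to an upward-sloping line of best fit.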
Multiple regression is used to predict the value of a variable based on the value of two or more other variables.
The variable to predict is called the dependent variable (the outcome, target, or criterion variable).
The variables used to predict the value of the dependent variable are called the independent variables.
Multiple regression also allows us to determine the overall fit (variance explained) of the model
and the relative contribution of each of the predictors to the total variance explained.
Interpreting the Result
The form of the equation to predict VO2max from Age, Weight, Heart_rate, and Gender is:
VO2max = 87.83 - (0.165 × age) - (0.385 × weight) - (0.118 × heart_rate) + (13.208 × gender)
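The fitted equation can be wrapped in a small helper to generate predictions. The coefficients come from the slide's example; the gender coding (1 = male, 0 = female) is an assumption made here for illustration:

```python
def predict_vo2max(age, weight, heart_rate, gender):
    """Predicted VO2max from the slide's fitted regression equation.

    gender coding (1 = male, 0 = female) is assumed for illustration.
    """
    return (87.83
            - 0.165 * age          # VO2max falls with age
            - 0.385 * weight       # ... and with body weight
            - 0.118 * heart_rate   # ... and with resting heart rate
            + 13.208 * gender)     # higher predicted value for gender = 1

print(predict_vo2max(age=30, weight=70, heart_rate=70, gender=1))
```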
Discriminant Analysis
Linear discriminant function analysis (i.e., discriminant analysis) performs a multivariate
test of differences between groups.
In addition, discriminant analysis is used to determine the minimum number of dimensions
needed to describe these differences.
It builds a predictive model for group membership
The model is composed of a discriminant function based on linear combinations of
predictor variables.
Those predictor variables provide the best discrimination between groups
Conjoint Analysis
Used to study factors that influence customers' purchasing decisions.
Products possess attributes such as price, color, ingredients, guarantee, environmental impact, predicted reliability, and so on.
Subjects provide data on preferences for hypothetical products defined by attribute combinations.
Conjoint analysis decomposes the judgment data into components, based on qualitative attributes of the products.
A numerical part-worth utility value is computed for each level of each attribute.
Large part-worth utilities are assigned to the most preferred levels, and small part-worth utilities are assigned to the least preferred levels.
The attributes with the largest part-worth utility range are considered the most important in predicting preference.
Conjoint analysis is a statistical model with an error term and a loss function.
Factor Analysis
A class of procedures used for data reduction and summarization.
First step: problem formulation.
Statistical Terms
Hierarchical clustering
Cluster membership
Dendrogram
[Figure: a dendrogram]
Session II
Non parametric test for
Hypothesis Testing
Run test for randomness
Sign test
Chi-Square
Mann-Whitney U test
Kruskal-Wallis test
Non Parametric Tests
They are also called distribution-free tests.
The popular tests are:
One-sample run test, a method for determining randomness
Sign test for paired data, where positive and negative signs are substituted for quantitative values
Rank sum test, also called the Mann-Whitney U test, which can be used to determine if two independent samples have been drawn from the same population
Kruskal-Wallis test, another rank sum test, which generalizes the analysis of variance
Advantages and Disadvantages
Advantages:
They do not require us to assume that the population is normally distributed.
Generally, they are easier to do and to understand.
Disadvantages:
They ignore a certain amount of information.
They are often not as efficient or sharp as parametric tests.
Note that other than the requirement that the random variables be continuous, no other
conditions about the distributions must be met in order for the Run Test to be an appropriate
test.
We have n1 observations of the random variable X and n2 observations of the random variable Y.
Suppose we combine the two sets of independent observations into one larger collection of n1 +
n2 observations, and then arrange the observations in increasing order of magnitude.
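The pooling-and-ordering step above can be sketched in a few lines; counting runs (maximal streaks of observations from the same sample) is the core of the run test. The data here are hypothetical:

```python
def count_runs(x, y):
    """Pool two samples, sort by value, and count runs of sample labels."""
    labeled = [(v, 'x') for v in x] + [(v, 'y') for v in y]
    labeled.sort(key=lambda pair: pair[0])     # arrange in increasing order of magnitude
    runs = 1
    for (_, a), (_, b) in zip(labeled, labeled[1:]):
        if a != b:                             # label changes: a new run starts
            runs += 1
    return runs

print(count_runs([1, 3, 5], [2, 4, 6]))  # perfectly interleaved samples give 6 runs
```

Many runs suggest the two samples are well mixed (same distribution); very few runs suggest the samples separate, evidence against H0.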
Example
A charter bus line has 48-passenger buses and 38-passenger buses. With X and Y denoting
the number of miles traveled per day for the 48-passenger and 38-passenger buses,
respectively, the bus company is interested in testing the equality of the two distributions
The company observed the following data on a random sample of n1 = 10 buses holding 48
passengers and n2 = 11 buses holding 38 passengers:
Sign test
Tests one population median, η (eta)
Corresponds to the t-test for one mean
Assumes the population is continuous
Small-sample test statistic: number of sample values above (or below) the median
Can use the normal approximation if n ≥ 30
[Figure: Binomial(n = 8, p = .5) probabilities: P(X) = .004, .031, .109, .219, .273, .219, .109, .031, .004 for X = 0 to 8]
The sign test uses the p-value to make the decision: the p-value is the probability of getting an observation at least as extreme as the one we got.
One sample sign test
One-Tailed Test:
H0: η = η0
Ha: η > η0 [or Ha: η < η0]
Test statistic: S = number of sample measurements greater than η0 [or S = number of measurements less than η0]
Observed significance level: p-value = 2P(x ≥ S), where x has a binomial distribution with parameters n and p = .5

Worked example (normal approximation): let pH0 = qH0 = 0.5 and n = 30, with sample proportions 0.633 above and 0.367 below.
σp̂ = √(p·q/n) = √(.5 × .5 / 30) = 0.091
Z = (0.633 - 0.5) / 0.091 = 1.462 < 1.96
This lies in the acceptance region, so at α = .05 we cannot reject H0.
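The exact version of this test uses the binomial distribution directly; with 19 of 30 values above the hypothesised median (the 0.633 proportion above), a sketch with `scipy.stats.binomtest`:

```python
from scipy.stats import binomtest

n, s = 30, 19                 # 19 of 30 observations above the hypothesised median
result = binomtest(s, n, p=0.5, alternative='greater')
print(result.pvalue)          # exact one-tailed p-value
```

The exact p-value is about 0.10, well above .05, agreeing with the normal-approximation conclusion of not rejecting H0.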
Mann-Whitney U: worked example
Two independent samples with n1 = n2 = 6; rank sums R1 = 26.5 and R2 = 51.5.
U1 = n1·n2 + n1(n1 + 1)/2 - R1 = 36 + 21 - 26.5 = 30.5
U2 = n1·n2 + n2(n2 + 1)/2 - R2 = 36 + 21 - 51.5 = 5.5
U = min(U1, U2) = 5.5
Critical value = 5. Since U = 5.5 > 5, this is a nonsignificant outcome.
Kruskal-Wallis test
The Kruskal-Wallis H test (also called the "one-way ANOVA on ranks") is a rank-based
nonparametric test used to determine whether there are statistically significant differences
between two or more groups of an independent variable on a continuous or ordinal
dependent variable.
Data need not be normally distributed
It is considered the nonparametric alternative to the one-way ANOVA, and an extension of
the Mann-Whitney U test to allow comparison of more than two independent groups.
Kruskal-Wallis H test is an omnibus test statistic and cannot tell you which specific groups of
your independent variable are statistically significantly different
Kruskal-Wallis test
H0: the k distributions are identical versus Ha: at least one distribution is different
Test statistic: Kruskal-Wallis H
1. Rank the total measurements in all k samples from 1 to n. Tied observations are assigned the average of the ranks they would have received if not tied.
2. Calculate Ti = rank sum for the ith sample, i = 1, 2, ..., k.
3. Compute the test statistic:

H = [12 / (n(n + 1))] × Σ (Ti² / ni) - 3(n + 1)

When H0 is true, the test statistic H has an approximate chi-square distribution with df = k - 1.
Use a right-tailed rejection region or p-value based on the chi-square distribution.
Example
Four groups of students were randomly assigned to be taught with four different techniques, and their test scores were compared.
H0: the distributions of scores are the same; Ha: the distributions differ in location
χ² = Σ (O - E)² / E

Example: observed vs. expected frequencies for two categories (H and T) out of 100 observations:

        Observed (O)   Expected (E)
H       40             50
T       60             50
Sum     100            100

χ² = (40 - 50)²/50 + (60 - 50)²/50 = (-10)²/50 + (10)²/50 = 100/50 + 100/50 = 2 + 2 = 4

degrees of freedom = (R - 1)(C - 1)
R = number of rows
C = number of columns
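The same calculation with `scipy.stats.chisquare`:

```python
from scipy.stats import chisquare

observed = [40, 60]
expected = [50, 50]

stat, p = chisquare(observed, f_exp=expected)   # sum of (O - E)^2 / E
print(stat, p)                                  # statistic = 4.0
```

With 1 degree of freedom the p-value is just under .05, so the observed frequencies differ significantly from the expected ones.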
Chi-Square Test for Bivariate Tables
Chi-square is a test of significance based on bivariate tables.
Columns are scores of the independent variable; there will be as many columns as there are scores. Each cell holds an observed frequency fo, with row and column marginal totals and grand total N.

χ² (calculated) = Σ (fo - fe)² / fe

To find fe: multiply the column and row marginals for each cell and divide by N:
fe = (row marginal × column marginal) / N

Example (N = 25):

                   Low              High             Total
GUN SALES  High    fo = 8           fo = 5            13
                   fe = 6.24        fe = 6.76
           Low     fo = 4           fo = 8            12
                   fe = 5.76        fe = 6.24
           Total   12               13                25

fe calculations:
(13 × 12)/25 = 156/25 = 6.24
(13 × 13)/25 = 169/25 = 6.76
(12 × 12)/25 = 144/25 = 5.76
(12 × 13)/25 = 156/25 = 6.24
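The 2 × 2 table above can be checked with `scipy.stats.chi2_contingency`; `correction=False` matches the hand calculation (no Yates continuity correction):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[8, 5],    # GUN SALES high: fo = 8, fo = 5
                  [4, 8]])   # GUN SALES low:  fo = 4, fo = 8

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, dof)
print(expected)              # the fe values: 6.24, 6.76, 5.76, 6.24
```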
Wilcoxon signed-rank test
If the sample is not Normal, use the Wilcoxon Signed-Rank Test as an alternative (the Wilcoxon Rank-Sum test is the corresponding alternative to the independent t-test).
A nonparametric test relating to the median as the measure of central tendency.
Ranks of the absolute differences between the data and the hypothesised median are calculated.
Ranks for negative and positive differences are then summed separately (W- and W+ respectively).

As the number of ranks (n) becomes larger, the distribution of W becomes approximately Normal. Generally, if n > 20:
Mean W = n(n + 1)/4
Variance W = n(n + 1)(2n + 1)/24
Z = (W - mean W) / SD(W)
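The signed-rank calculation above can be sketched with `scipy.stats.wilcoxon`, applied to differences from a hypothesised median (the data and median are made up for illustration):

```python
from scipy.stats import wilcoxon

data = [12, 15, 18, 9, 14, 16, 17, 13]   # made-up sample
median0 = 10                             # hypothesised median

diffs = [x - median0 for x in data]      # differences from the hypothesised median
w, p = wilcoxon(diffs)                   # W = smaller of W- and W+
print(w, p)
```

Here only one difference is negative (9 − 10 = −1, rank 1), so W− = 1 and the test statistic is 1.0; the small p-value rejects the hypothesised median of 10.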
Thank You!