
Using SPSS

John Zhang

ARL, IUP

Topics

A Guide to Multivariate Techniques

Preparation for Statistical Analysis

Review: ANOVA

Review: ANCOVA

MANOVA

MANCOVA

Repeated Measure Analysis

Factor Analysis

Discriminant Analysis

Cluster Analysis

Guide-1

Correlation: 1 IV – 1 DV; relationship

Regression: 1+ IVs – 1 DV; relationship/prediction

T test: 1 IV (cat.) – 1 DV; group diff.

One-way ANOVA: 1 IV (2+ cat.) – 1 DV; group diff.

One-way ANCOVA: 1 IV (2+ cat.) – 1 DV – 1+ covariates; group diff.

One-way MANOVA: 1 IV (2+ cat.) – 2+ DVs; group diff.

Guide-2

One-way MANCOVA: 1 IV (2+ cat.) – 2+ DVs – 1+ covariate; group diff.

Factorial MANOVA: 2+ IVs (2+ cat.) – 2+ DVs; group diff.

Factorial MANCOVA: 2+ IVs (2+ cat.) – 2+ DVs – 1+ covariate; group diff.

Discriminant Analysis: 2+ IVs – 1 DV (cat.); group prediction

Factor Analysis: explore the underlying structure

Preparation for Stat. Analysis-1

Screen data

– SPSS Utility procedures

– Frequency procedure

Missing data analysis (missing data should be random)

– Check if patterns exist

– Drop data case-wise

– Drop data variable-wise

– Impute missing data

Preparation for Stat. Analysis-2

Outliers (generally, statistical procedures are sensitive to outliers)

– Univariate case: boxplot

– Multivariate case: Mahalanobis distance (a chi-square statistic); a point is an outlier when its p-value is < .001

– Treatment:

Drop the case

Report two analyses (one with the outlier, one without)
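A minimal Python sketch of this Mahalanobis-distance screen, assuming the screening variables have already been gathered into a NumPy array (the data and function names are illustrative, not part of the course files):

import numpy as np
from scipy import stats

def mahalanobis_outliers(X, alpha=0.001):
    # Squared Mahalanobis distance of each case from the variable means
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    # Compare with the chi-square critical value at p = .001, df = number of variables
    return d2 > stats.chi2.ppf(1 - alpha, df=X.shape[1])

# e.g. mask = mahalanobis_outliers(np.column_stack([income, age, educ]))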

Preparation for Stat. Analysis-3

Normality

– Testing univariate normality:

Q-Q plot

Skewness and kurtosis: both should be 0 when the distribution is normal; treat the variable as non-normal when the p-value is < .01 or .001

Kolmogorov-Smirnov statistic: a significant result means not normal

– Testing multivariate normality:

Scatterplots should be elliptical

Each variable must be normal
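As a hedged illustration, the univariate checks above can be reproduced in Python with scipy; the sample x is a placeholder, not data from the course:

import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(size=200)    # placeholder sample

print("skewness:", stats.skew(x))                # about 0 for a normal variable
print("excess kurtosis:", stats.kurtosis(x))     # about 0 for a normal variable
z = (x - x.mean()) / x.std(ddof=1)
print("Kolmogorov-Smirnov:", stats.kstest(z, "norm"))   # significant => not normal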

Preparation for Stat. Analysis-4

Linearity

– A linear combination of the variables makes sense

– Pairs of variables (or combinations of variables) are linearly related

– Check for linearity:

Residual plot in regression

Scatterplots

Preparation for Stat. Analysis-5

Homoscedasticity: the covariance matrices are equal across groups

– Box’s M test: tests the equality of the covariance matrices across groups

Sensitive to departures from normality

– Levene’s test: tests the equality of variances across groups

Not sensitive to departures from normality
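A small Python sketch of Levene’s test, assuming the DV has been split into groups (the group data below are invented for illustration):

from scipy import stats

group_1 = [12, 15, 14, 10, 13, 17]   # DV values for group 1 (illustrative)
group_2 = [22, 25, 19, 24, 27, 21]   # DV values for group 2 (illustrative)

stat, p = stats.levene(group_1, group_2)
print(stat, p)   # a non-significant p suggests the group variances are equal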

Preparation for Stat. Analysis – Example-1

Steps in preparation for statistical analysis:

– Check the variable coding; recode if necessary

– Examine missing data

– Check for univariate outliers, normality, homogeneity of variances (Explore)

– Test for homogeneity of variances (ANOVA)

– Check for multivariate outliers (Regression>Save>Mahalanobis)

– Check for linearity (scatterplots; residual plots in regression)

Preparation for Stat. Analysis – Example-2

Use dataset dssft.sav

Objective: we are interested in investigating group differences (satjob2) in income (rincome91), age (age_2) and education (educ)

Check for coding: need to recode rincome91 into rincome_2 (codes 22, 98, and 99 become system missing)

– Transform>Recode>Into Different Variable

Preparation for Stat. Analysis – Example-3

Check for missing values

– Use Frequencies for categorical variables

– Use Descriptive Statistics for measurement variables

– For categorical variables:

If the missing values are < 5%, use the list-wise option

If >= 5%, define the missing value as a new category

– For measurement variables:

If the missing values are < 5%, use the list-wise option

If between 5% and 15%, use Transform>Replace Missing Values. Replacing less than 15% of the data has little effect on the outcome

If greater than 15%, consider dropping the variable or the subject
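A rough pandas sketch of the same screening logic, assuming the survey variables sit in a DataFrame (the names and values are illustrative):

import pandas as pd

df = pd.DataFrame({"satjob2": [1, 2, None, 1, 2],
                   "rincome_2": [12.0, None, 15.0, 18.0, 11.0]})

print(df.isna().mean())   # fraction of missing values per variable

# Mean replacement for a measurement variable (rough analogue of
# Transform > Replace Missing Values), only sensible for small amounts of missing data
df["rincome_2"] = df["rincome_2"].fillna(df["rincome_2"].mean())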

Preparation for Stat. Analysis – Example-4

– Check missing values for satjob2

Analyze>Descriptive Statistics>Frequencies

– Check missing values for rincome_2

Analyze>Descriptive Statistics>Descriptives

– Replace the missing values in rincome_2

Transform>Replace Missing Values

Preparation for Stat. Analysis – Example-5

Check for univariate outliers, normality, and homogeneity of variances

– Analyze>Descriptive Statistics>Explore

Put rincome_2, age_2, and educ into the Dependent List box; satjob2 into the Factor List box

– There are outliers in rincome_2; let’s change those outliers to the acceptable minimum or maximum value

Transform>Recode>Into Different Variable

– Put rincome_2 into the Original Variable box, type rincome_3 as the new name

– Replace all values <= 3 by 4; all other values remain the same

Preparation for Stat. Analysis – Example-6

Explore rincome_3 again: not normal

– Transform rincome_3 into rincome_4 by ln or sqrt

Explore rincome_4

Check for multivariate outliers

– Analyze>Regression>Linear

Put id (a dummy variable) into the Dependent box; put rincome_4, age_2, and educ into the Independent box

Click Save, then check the Mahalanobis box

Compare the Mahalanobis distances with the chi-square critical value at p=.001 and df=number of independent variables

Preparation for Stat. Analysis – Example-7

Check for multivariate normality:

– Each variable must be univariate normal

– Construct a scatterplot matrix; each scatterplot should be elliptical in shape

Check for homoscedasticity

– Univariate (ANOVA, Levene’s test)

– Multivariate (MANOVA, Box’s M test; use the .01 significance level)

Review: ANOVA -1

One-way ANOVA tests the equality of group means

– Assumptions: independent observations; normality; homogeneity of variance

Two-way ANOVA tests three hypotheses simultaneously:

– Test the interaction of the levels of the two independent variables

Interaction occurs when the effects of one factor depend on the levels of the second factor

– Test the two independent variables separately

Review: ANCOVA -1

Idea: the difference on a DV often does not depend on just one or two IVs; it may also depend on other measurement variables. ANCOVA takes such dependency into account.

– i.e. it removes the effect of one or more covariates

Assumptions: in addition to the regular ANOVA assumptions, we need:

– A linear relationship between the DV and the covariates

– The slope of the regression line is the same for each group

– The covariates are reliable and are measured without error

Review: ANCOVA -2

– Homogeneity of slopes = homogeneity of regression = no interaction between the IVs and the covariate

If the interaction between the covariate and the IVs is significant, ANCOVA should not be conducted

Example: determine if hours worked per week (hrs2) differs by gender (sex) and between those satisfied or dissatisfied with their job (satjob2), after adjusting for income (i.e. with income equalized)

Review: ANCOVA -3

– Analyze>GLM>Univariate

Move hrs2 into the DV box; move sex and satjob2 into the Fixed Factor box; move rincome_2 into the Covariate box

Click Model>Custom

– Highlight all variables and move them to the Model box

– Make sure the Interaction option is selected

Click Options

– Move sex and satjob2 into the Display Means box

– Click Descriptive statistics; Estimates of effect size; and Homogeneity tests

This tests the homogeneity of regression slopes

Review: ANCOVA -4

– If no interaction is found in the previous step, repeat the previous step except click Model>Full Factorial instead of Model>Custom

Review: ANOVA -2

– A significant interaction means the two IVs in combination have a significant effect on the DV; thus, it does not make sense to interpret the main effects.

– Assumptions: the same as one-way ANOVA

– Example: the impact of gender (sex) and age (agecat4) on income (rincome_2)

Explore (omitted)

Analyze>GLM>Univariate

– Click Model>click Full factorial>Continue

– Click Options>click Descriptive statistics; Estimates of effect size; Homogeneity tests

– Click Post Hoc>click LSD; Bonferroni; Scheffe; Continue

– Click Plots>put one IV into Horizontal Axis and the other into Separate Lines

MANOVA-1

Characteristics

– Similar to ANOVA

– Multiple DVs

– The DVs are correlated and a linear combination makes sense

– It tests whether mean differences among k groups on a combination of DVs are likely to have occurred by chance

– The idea of MANOVA is to find a linear combination that separates the groups ‘optimally’, and perform an ANOVA on that linear combination

MANOVA-2

Advantages

– The chance of discovering what actually changed as a result of the different treatments increases

– May reveal differences not shown in separate ANOVAs

– Without inflating the Type I error

– The use of multiple ANOVAs ignores some very important information (the fact that the DVs are correlated)

MANOVA-3

Disadvantages

– More complicated

– ANOVA is often more powerful

Assumptions:

– Independent random samples

– Multivariate normal distribution in each group

– Homogeneity of covariance matrix

– Linear relationship among DVs

MANOVA-4

Steps in carrying out MANOVA

– Check the assumptions

– If the MANOVA is not significant, stop

– If the MANOVA is significant, carry out univariate ANOVAs

– If a univariate ANOVA is significant, do post hoc tests

If homoscedasticity holds, use Wilks’ Lambda; if not, use Pillai’s Trace. In general, all 4 statistics should be similar.
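For readers working outside SPSS, a minimal statsmodels sketch of the same kind of one-way MANOVA; the tiny data frame is invented for illustration and is not the course data:

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "recall": [12, 14, 9, 15, 11, 8],
    "recog":  [20, 22, 17, 24, 19, 15],
    "group":  ["affect", "affect", "anxiety", "anxiety", "count", "count"],
})

fit = MANOVA.from_formula("recall + recog ~ group", data=df)
print(fit.mv_test())   # Wilks' lambda, Pillai's trace, Hotelling-Lawley, Roy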

MANOVA-5

Example: an experiment looking at the memory effects of different instructions: 3 groups of human subjects learned nonsense syllables as they were presented and were administered two memory tests: recall and recognition. The first group of subjects was instructed to like or dislike the syllables as they were presented (to generate affect). A second group was instructed that they would be tested (to induce anxiety?). The 3rd group was told to count the syllables as they were presented (interference). The objective is to assess group differences in memory.

MANOVA-6

How to do it?

– File>Open Data

Open the file As9.por in the Instruct>Zhang Multivariate Short Course folder

– Analyze>GLM>Multivariate

Move recall and recog into the Dependent Variables box; move group into the Fixed Factors box

Click Options; move group into the Display Means box (this will display the marginal means predicted by the model; these means may differ from the observed means if there are covariates or the model is not factorial); the Compare main effects box is for testing every pair of estimated marginal means for the selected factors

Click Estimates of effect size and Homogeneity tests

MANOVA-7

Push buttons:

– Plots: create a profile plot for each DV displaying group means

– Post Hoc: post hoc tests for marginal means

– Save: save predicted values, etc.

– Contrast: perform planned comparisons

– Model: specify the model

– Options:

Display Means for: display the estimated means predicted by the model

– Compare main effects: test for significant differences between every pair of estimated marginal means for each of the main effects

MANOVA-8

– Observed power: produce a statistical power analysis for your study

– Parameter estimates: check this when you need a predictive model

– Spread vs. level plot: visual display of homogeneity of variance

MANOVA-9

Example 2: check the impact of job satisfaction (satjob) and gender (sex) on income (rincome_2) and education (educ) (in gssft.sav)

– Screen data: transform educ to educ2 to eliminate cases with ‘6 or less’

– Check the assumptions: Explore

– MANOVA

MANCOVA-1

Objective: test for mean differences among groups on a linear combination of DVs after adjusting for the covariates.

Example: test whether there are differences in productivity (measured by income and hours worked) between individuals in different age groups after adjusting for education level

MANCOVA-2

Assumptions: similar to ANCOVA

SPSS how to:

– Analyze>GLM>Multivariate

Move rincome_2 and educ2 to the DV box; move sex and satjob into the IV box; move age into the Covariate box

Check for homogeneity of regression

– Click Model>Custom; highlight all variables and move them to the Model box

If the covariate-IV interaction is not significant, repeat the process but select Full factorial under Model

Repeated Measure Analysis-1

Objective: test for significant differences in means when the same observation appears in multiple levels of a factor

Examples of repeated measures studies:

– Marketing – compare customers’ ratings of 4 different brands

– Medicine – compare test results before, immediately after, and six months after a procedure

– Education – compare performance test scores before and after an intervention program

Repeated Measure Analysis-2

The logic of repeated measures: SPSS performs repeated measures ANOVA by computing contrasts (differences) across the repeated measures factor’s levels for each subject, then testing whether the means of the contrasts are significantly different from 0; any between-subjects tests are based on the means of the subjects.

Repeated Measure Analysis-3

Assumptions:

– Independent observations

– Normality

– Homogeneity of variances

– Sphericity: if two or more contrasts are to be pooled (the test of the main effect is based on this pooling), then the contrasts should be equally weighted and uncorrelated (equal variances and uncorrelated contrasts); this assumption is equivalent to the covariance matrix being diagonal with equal diagonal elements

Repeated Measure Analysis-4

Example 1: a study in which 5 subjects were tested in each of 4 drug conditions

Open the data file:

– File>Open…Data; select Repmeas1.por

SPSS repeated measures procedure:

– Analyze>GLM>Repeated Measures

Within-Subject Factor Name (the name of the repeated measures factor): a repeated measures factor is expressed as a set of variables

– Replace factor1 with Drug

Number of Levels: the number of repeated measurements

– Type 4

Repeated Measure Analysis-5

– The Measure pushbutton has two functions

For multiple dependent measures (e.g. we recorded 4 measures of physiological stress under each of the drug conditions)

To label the factor levels

– Click Measure; type memory in the Measure Name box; click Add

Click Define: here we link the repeated measures factor levels to variable names, and define between-subjects factors and covariates

– Move drug1 – drug4 to the Within-Subjects box

You can move a selected variable with the up and down buttons

Repeated Measure Analysis-6

Model button: by default a complete model

Contrast button: specify particular contrasts

Plots button: create profile plots that graph factor-level estimated marginal means for up to 3 factors at a time

Post Hoc: provides post hoc tests for between-subjects factors

Save button: allows you to save predicted values, residuals, etc.

Options: similar to MANOVA

– Click Descriptive; click Transformation Matrix (it provides the contrasts)

Repeated Measure Analysis-7

Interpret the results

1. Look at the descriptive statistics

2. Look at the test of sphericity

If the sphericity test is significant, use the multivariate results (tests on the contrasts); they test whether all of the contrast variables are zero in the population

If the sphericity test is not significant, use the Sphericity Assumed results

3. Look at the tests of within-subjects contrasts: they test the linear trend, the quadratic trend, and so on

– These may not make sense in some applications, as in this example (but they make sense in terms of time or dosage)

Repeated Measure Analysis-8

The transformation matrix provides information on what the linear contrast is, etc.

– The first table is for the average across the repeated measures factor (here the weights are all .5, meaning each variable is weighted equally; normalization requires that the sum of the squared weights equals 1)

– The second table defines the contrasts for the corresponding repeated measures factor

Linear – increases by a constant, etc.

The linear and quadratic contrasts are orthogonal, etc.

– Having concluded there are memory differences due to drug condition, we want to know which conditions differ from which others

Repeated Measure Analysis-9

Repeat the analysis, except under the Options button, move ‘drug’ into Display Means, click Compare Main Effects and select the Bonferroni adjustment

– Transformation coefficients (M matrix): shows how the variables are created for comparison. Here we compare the drug conditions, so the M matrix is an identity matrix

Suppose we want to test each adjacent pair of means: drug1 vs. drug2; drug2 vs. drug3; drug3 vs. drug4:

– Repeated Measures>Define>Contrast>select Repeated

Repeated Measure Analysis-10

Example 2: a marketing experiment was devised to evaluate whether viewing a commercial produces improved ratings for a specific brand. Ratings of 3 brands were obtained from subjects before and after viewing the commercial. Since the hope was that the commercial would improve ratings of only one brand (A), researchers expected a significant brand by pre-post commercial interaction. There are two between-subjects factors: sex and the brand used by the subject.

Repeated Measure Analysis-11

SPSS how to:

– Analyze>GLM>Repeated Measures

Replace factor1 with prepost in the Within-Subject Factor Name box; type 2 in the Number of Levels box; click Add

Type brand in the Within-Subject Factor Name box; type 3 in the Number of Levels box; click Add

Click Measure; type measure in the Measure Name box; click Add

Note: SPSS expects 2 between-subjects factors

Repeated Measure Analysis-12

Click the Define button; move the appropriate variables into place; move sex and user into the Between-Subjects Factors box

Click the Options button; move sex, user, prepost and brand into the Display Means box

Click the Homogeneity tests and Descriptive boxes

Click Plots; move user into the Horizontal Axis box and brand into the Separate Lines box

Click Continue; OK

Factor Analysis-1

The main goal of factor analysis is data reduction. A typical use of factor analysis is in survey research, where a researcher wishes to represent a number of questions with a smaller number of factors.

Two questions in factor analysis:

– How many factors are there, and what do they represent (interpretation)?

Two technical aids:

– Eigenvalues

– Percentage of variance accounted for

Factor Analysis-2

Two types of factor analysis:

– Exploratory: introduced here

– Confirmatory: SPSS AMOS

Theoretical basis:

– Correlations among variables are explained by underlying factors

– An example of a one-factor mathematical model for two variables:

V1 = L1*F1 + E1

V2 = L2*F1 + E2

Factor Analysis-3

Each variable is composed of a common factor (F1) multiplied by a loading coefficient (L1, L2 – the lambdas or factor loadings) plus a random component

V1 and V2 correlate because of the common factor, and the correlation is related to the factor loadings; thus, the factor loadings can be estimated from the correlations

A given set of correlations can yield different factor loadings (i.e. the solutions are not unique)

One should pick the simplest solution

Factor Analysis-4

A factor solution needs to be confirmed:

– By a different factoring method

– By a different sample

More on terminology

– Factor loading: interpreted as the Pearson correlation between the variable and the factor

– Communality: the proportion of variability for a given variable that is explained by the factors

– Extraction: the process by which the factors are determined from a large set of variables

Factor Analysis-5

Principal components: one of the extraction methods

– A principal component is a linear combination of observed variables that is independent (orthogonal) of the other components

– The first component accounts for the largest amount of variance in the input data; the second component accounts for the largest amount of the remaining variance, and so on

– Components being orthogonal means they are uncorrelated
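A hedged scikit-learn sketch of principal component extraction on standardized variables; the random data matrix is a stand-in for a real data set:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(100, 5))   # placeholder data matrix
Z = StandardScaler().fit_transform(X)                # standardize the variables

pca = PCA().fit(Z)
print(pca.explained_variance_)         # eigenvalues; components > 1 are often retained
print(pca.explained_variance_ratio_)   # proportion of variance per component
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # variable-component correlations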

Factor Analysis-6

Possible application of principal components:

– e.g. in survey research, it is common to have many questions addressing one issue (e.g. customer service). These questions are likely to be highly correlated. It is problematic to use such variables in some statistical procedures (e.g. regression). One can instead use factor scores, computed from the factor loadings on each orthogonal component.

Factor Analysis-7

Principal components vs. other extraction methods:

– Principal components focus on accounting for the maximum amount of variance (the diagonal of the correlation matrix)

– Other extraction methods (e.g. principal axis factoring) focus more on accounting for the correlations between variables (the off-diagonal correlations)

– A principal component can be defined as a unique combination of variables, but the factors from other methods cannot

– Principal components are used for data reduction but are more difficult to interpret

Factor Analysis-8

Number of factors:

– Eigenvalues are often used to determine how many factors to keep

Take as many factors as there are eigenvalues greater than 1

– An eigenvalue represents the amount of standardized variance accounted for by a factor

– The amount of standardized variance in a variable is 1

– The sum of the retained eigenvalues, relative to the number of variables, gives the proportion of variance accounted for

Factor Analysis-9

Rotation

– Objective: to facilitate interpretation

– Orthogonal rotation: done when data reduction is the objective and factors need to be orthogonal

Varimax: attempts to simplify interpretation by maximizing the variances of the variable loadings on each factor

Quartimax: simplifies the solution by finding a rotation that produces high and low loadings across factors for each variable

– Oblique rotation: used when there is reason to allow factors to be correlated

Oblimin and Promax (Promax runs fast)

Factor Analysis-10

Factor scores: if you are satisfied with a factor solution

– You can request that a new set of variables be created representing the score of each observation on each factor (difficult to interpret)

– You can use the lambda coefficients to judge which variables are highly related to a factor, then compute the sum or the mean of those variables for further analysis (easy to interpret)

Factor Analysis-11

Sample size: the sample size should be about 10 to 15 times the number of variables (as with other multivariate procedures)

Number of methods: there are 8 factoring methods, including principal components

– Principal axis: accounts for the correlations between the variables

– Unweighted least squares: minimizes the residuals between the observed and the reproduced correlation matrices

Factor Analysis-12

– Generalized least squares: similar to unweighted least squares but gives more weight to the variables with stronger correlations

– Maximum likelihood: generates the solution that is the most likely to have produced the correlation matrix

– Alpha factoring: considers the variables as a sample; does not use factor loadings

– Image factoring: decomposes the variables into a common part and a unique part, then works with the common part

Factor Analysis-13

Recommendations:

– Principal components and principal axis are the most commonly used methods

– When there is multicollinearity, use principal components

– Rotations are often done. Try Varimax

Factor Analysis-14

Example 1: whether a small number of athletic skills account for performance in the ten separate decathlon events

– File>Open>Data…; select Olymp88.por

– Look at the correlations:

Analyze>Correlate>Bivariate

– Principal components with orthogonal rotation

Analyze>Data Reduction>Factor

– Select all variables except score

– Click the Extraction button>click Scree Plot

– Uncheck Unrotated factor solution

– Click Continue

Factor Analysis-15

Click the Rotation button>click Varimax; Loading plots; click Continue

Click the Options button>click Sorted by size; click the Suppress absolute values box; change .1 to .3; click Continue

Click Descriptives>Univariate descriptives; KMO and Bartlett’s test of sphericity (KMO measures how well the sample data are suited for factor analysis: .9 is great and less than .5 is not acceptable; Bartlett’s test tests the sphericity of the correlation matrix); click Continue

Click OK

Factor Analysis-16

Try to validate the first factor solution using a different method

– Analyze>Data Reduction>Factor

Click Extraction>select Principal axis factoring; click Continue

Click Rotation>select Direct Oblimin (leave the delta value at 0, the most oblique value possible); type 50 in the Maximum Iterations box; click Continue

Click the Scores button>click Save as variables (this involves solving a system of equations for the factors; regression is one of the methods for solving the equations); click Continue

Click OK

Factor Analysis-17

Note: the Pattern matrix gives the standardized linear weights and the Structure matrix gives the correlations between the variables and the factors (in principal components analysis, the component matrix gives both the factor loadings and the correlations)

Discriminant Analysis-1

Discriminant analysis characterizes the relationship between a set of IVs and a categorical DV with relatively few categories

– It creates a linear combination of the IVs that best characterizes the differences among the groups

– Predictive discriminant analysis focuses on creating a rule to predict group membership

– Descriptive discriminant analysis studies the relationship between the DV and the IVs

Discriminant Analysis-2

Possible applications:

– Should a bank offer a loan to a new customer?

– Which customers are likely to buy?

– Identify patients who may be at high risk for problems after surgery

Discriminant Analysis-3

How does it work?

– Assume the population of interest is composed of distinct populations

– Assume the IVs follow a multivariate normal distribution

– DA seeks the linear combinations of the IVs that best separate the populations

– If we have k groups, we need k-1 discriminant functions

– A discriminant score is computed for each function

– These scores are used to classify cases into one of the categories

Discriminant Analysis-4

– There are three methods for classifying group membership:

Maximum likelihood method: assign a case to group k if the probability of membership is greater in group k than in any other group

Fisher (linear) classification functions: assign a case to group k if its score on the function for group k is greater than its scores on the other functions

Distance function: assign a case to group k if its distance to the centroid of that group is the minimum

Note: SPSS uses the maximum likelihood method

Discriminant Analysis-5

Basic steps in DA:

– Identify the variables

– Screen the data: look for outliers, variables that may not be good predictors, etc.

– Run DA

– Check the correct prediction rate

– Check the importance of the individual predictors

– Validate the model

Discriminant Analysis-6

Assumptions:

– IVs are either dichotomous or measurement

– Normality

– Homogeneity of variances
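A minimal scikit-learn sketch of a two-group discriminant analysis; the generated data and the 0/1 group codes are purely illustrative:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (30, 3)),    # IVs for group 0
               rng.normal(1.5, 1.0, (30, 3))])   # IVs for group 1
y = np.repeat([0, 1], 30)                        # group membership

lda = LinearDiscriminantAnalysis().fit(X, y)
print("correct classification rate:", lda.score(X, y))
scores = lda.transform(X)                        # discriminant scores per case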

Discriminant Analysis-7

Example 1: VCR buyers filled out a survey; we want to determine which set of demographic information and attitudes best predicts which customers may buy another VCR

– File>Open Data…>CSM.por

– Explore the data

– Analyze>Classify>Discriminant

Move age, complain, educ, fail, pinnovat, preliabl, puse, qual, use, and value into the Independents box

Move buyyes into the Grouping Variable box

Click Define Range; type 1 for Minimum and 2 for Maximum

Click Continue

Discriminant Analysis-8

Click Statistics>click Box’s M and Fisher’s; Continue

Click the Classify button>click Summary table; Separate-groups; Continue

Click the Save button>click Discriminant scores; Continue

Click OK

– How are the original variables related to the discriminant scores?

Graphs>Scatter>click Define

– Move pinnovat into X and dis1_1 into Y; move buyyes into the Set Markers by box

Discriminant Analysis-9

Since Box’s M test was significant, one can ask SPSS to run the DA using the ‘separate covariances’ option (under Classify) and compare the results

From the 1st analysis, we see that ‘age’ was not important; one can redo the analysis without ‘age’ and compare the results

Discriminant Analysis-10

Validate the model: leave-one-out classification

– Repeat the analysis; click Classify>click Leave-one-out classification; click Continue

Example 2: predict smoking and drinking habits

– Analyze>Classify>Discriminant

Move smkdrnk into the Grouping Variable box; move age, attend, black, class, educ, sex and white into the IV list

Click Statistics>select Fisher’s and Box’s M; Continue

Click Classify>Summary table; Combined-groups; Territorial map; Continue

Click OK

Cluster Analysis-1

Cluster analysis is an exploratory data analysis technique designed to reveal groups

How?

– By distance: observations that are close together should be in the same group, and observations in different groups should be far apart

Applications:

– Plants and animals into ecological groups

– Companies by product usage

Cluster Analysis-2

Two types of methods

– Hierarchical: requires observations to remain together once they have joined a cluster

Complete linkage

Between-groups average linkage

Ward’s method

– Nonhierarchical: no such requirement

The researcher must pick the number of clusters to run (K-means algorithm)
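A hedged Python sketch of both approaches (Ward’s hierarchical method and K-means) on standardized variables; the data matrix is a stand-in for something like the beer ratings used below:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(3).normal(size=(20, 4))       # e.g. cost, calories, sodium, alcohol
Z = StandardScaler().fit_transform(X)                    # cluster on z-scores

tree = linkage(Z, method="ward")                         # hierarchical, Ward's method
hier_labels = fcluster(tree, t=4, criterion="maxclust")  # cut the tree at 4 clusters

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)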

Cluster Analysis-3

Recommendations:

– For relatively small samples (less than a few hundred), use hierarchical clustering

– For large samples, use K-means

Example 1: evaluating 20 types of beer

– File>Open>Data; select beer.por

– Analyze>Descriptive Statistics>Descriptives

Move cost, calories, sodium, and alcohol into the variable list

Click Save standardized values; OK

Cluster Analysis-4

Analyze>Classify>Hierarchical Cluster

– Move cost, calories, sodium, and alcohol into the Variables list box

– Move beer into the Label Cases by box

– Click Plots>click Dendrogram; click None in the Icicle area; Continue

– Click Method>select Z scores from the Standardize drop-down list; Continue

– Click Save>click Range of solutions; range 2 through 5 clusters; Continue

– OK

Cluster Analysis-5

Additional analysis

– Look at the last 4 columns of the data (clu5_1 to clu2_1): they contain the memberships for each solution between 5 and 2 clusters

– Analyze>Descriptive Statistics>Frequencies

Move clu2_1 to clu5_1 into the Variables box

OK

– Obtain mean profiles for the clusters

Graphs>Line>Summaries of separate variables

– Click Define>move zcost, zcalorie, zsodium, and zalcohol to the Lines Represent box

– Click clu4_1 and move it to the Category Axis box

Path Analysis-1

Path analysis is a technique based on regression for establishing causal relationships

– Start with a diagram of the causal flow

– Direct causal effects model (regression)

The direct causal effect of an IV on a DV is its coefficient (the number of units the DV changes for a 1-unit change in the IV)

– Build on the DCEM

Two forms of a causal model:

– Diagram

– Equation (structural equation)

Path Analysis-2

An example of a causal model

– Structural equation:

Z4 = p41*Z1 + p42*Z2 + p43*Z3 + e4

– p: path coefficient

– e: disturbance

– Z4: endogenous variable

– Z1: exogenous variable

– Path diagram

An indirect effect is the product of the path coefficients along the path

Path Analysis-3

Steps in path analysis:

– Create a path diagram

– Use regression to estimate the structural equation coefficients

– Assess the model:

Compare the observed and reproduced correlations (the reproduced correlations will be computed by hand)

Path Analysis-4

Research questions:

– Is our model, which describes the causal effects among the variables ‘region of the world’, ‘status as a developing nation’, ‘number of doctors’, and ‘male life expectancy’, consistent with the observed correlations among these variables?

– If our model is consistent, what are the estimated direct, indirect, and total causal effects among the variables?

Path Analysis-5

Legal paths:

– No path may pass through the same variable more than once

– No path may go backward on an arrow after going forward on another arrow

– No path may include more than one double-headed curved arrow

Path Analysis-6

Component labels:

– D: direct effect (just one straight arrow)

– I: indirect effect (more than one straight arrow)

– S: spurious effect (there is a backward arrow)

– U: the effect is uncertain (the path starts with a double-headed curved arrow)

Path Analysis-7

If the model is in question (some of the reproduced correlations differ from the observed correlations by more than .05):

– Test all missing paths (run additional regressions and check the significance of the coefficients)

– Trim the existing paths whose coefficients are not significant

Logistic regression - Motivations

When the dependent variable is dichotomous, regular regression is not appropriate

– We want to predict a probability

– OLS regression predictions could be any numbers, not just numbers between 0 and 1

– When dealing with proportions, the variance depends on the mean, so the equal-variance assumption of OLS is violated

Motivations-Continue

Fit an S curve to the data

[Figure: S-shaped curve of the probability of owning a home (0.0 to 1.0) plotted against income]

What is Logistic Regression?

Regressions of the form

ln(Odds) = B0 + B1*X1 + … + Bk*Xk

ln(Odds) is called a logit

Odds = Prob/(1 - Prob)

Prob = exp(B0 + B1*X1 + … + Bk*Xk) / (1 + exp(B0 + B1*X1 + … + Bk*Xk))
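A small Python illustration of the relationships above (probability, odds, logit); the numbers are arbitrary:

import numpy as np

prob = 0.2
odds = prob / (1 - prob)                      # Odds = Prob / (1 - Prob)
logit = np.log(odds)                          # ln(Odds)
back = np.exp(logit) / (1 + np.exp(logit))    # recovers the original probability
print(odds, logit, back)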

Application of Logistic Regression

When to use it?

– When the dependent variable is dichotomous

Objectives:

– Run a logistic regression

– Apply a stepwise logistic regression

– Use an ROC (receiver operating characteristic) curve to assess the model

Assumptions of logistic regression

The independent variables are interval or dichotomous

All relevant predictors are included, no irrelevant predictors are included, and the form of the relationship is linear

The expected value of the error term is zero

There is no autocorrelation

Assumptions of logistic regression – Cont.

There is no correlation between the error and the independent variables

There is an absence of perfect multicollinearity between the independent variables

Need a large sample (rule of thumb: n should be > 30 times the number of parameters)

Note on assumptions

No need for normality of errors

No need for equal variance

Example

Objective: to predict low birth weight babies

Variables:

– Low: 1: <=2500 grams, 0: >2500 grams

– LWT: weight at last menstrual cycle

– Age

– Smoke

– PTL: # of premature deliveries

– HT: History of Hypertension

– UI: uterine irritability

– FTV: # of physician visits during first trimester

– Race: 1=white, 2=black, 3=other

Example

File > Open > Data > select SPSS Portable type > select Birthwt (in Regression)

Analyze > Regression > Binary Logistic

– Move ‘low’ to the Dependent list box

– Move ‘age’, ‘ftv’, ‘ht’, ‘ptl’, ‘race’, ‘smoke’, and ‘ui’ into the Covariates list box

Example (cont.)

Click the Categorical button

– Place ‘race’ in the Categorical Covariates box

Click Continue, click Save

– Click the Probabilities and Group membership check boxes

Click Continue and then the Options button

Example (cont.)

Click the Classification plots and Hosmer-Lemeshow goodness-of-fit check boxes

Click Continue, then OK
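For comparison, a hedged statsmodels sketch of the same kind of binary logistic model; the file name and formula simply mirror the variables listed earlier and are assumptions, not the course file:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("birthwt.csv")   # illustrative file with low, age, lwt, smoke, ht, ptl, ui, ftv, race

fit = smf.logit("low ~ age + ftv + ht + ptl + C(race) + smoke + ui", data=df).fit()
print(fit.summary())              # coefficients are log odds; exp(coef) gives odds ratios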

Logistic outputs

Initial summary output: information on the dependent and categorical variables

Block 0: based on a model that includes just a constant – provides baseline information

Block 1: Method = Enter – includes the full model

– The chi-square tests whether all the coefficients are 0 (similar to the F test in regression)

Logistic outputs (cont.)

The model chi-square value is the difference between the initial and final –2LL (a small value of –2LL indicates a good fit; –2LL = 0 indicates a perfect fit)

The Step and Block rows display the results of the last step and block (they are the same here because we are not using stepwise regression)

Logistic outputs (cont.)

The goodness-of-fit statistic –2LL is 203.554

Cox & Snell R square – similar to R-square in OLS

Nagelkerke R square (preferred because it can reach 1)

Hosmer and Lemeshow test: tests “there is no difference between the expected and observed counts”, i.e. we prefer a non-significant result

Logistic outputs (cont.)

Classification table: can our model predict accurately?

– Overall accuracy is 73%

– We do much better on higher birth weights

– It does a poor job on lower birth weights

– A significant model doesn’t mean high predictability

Interpretation of the coefficients

E.g. HT (hypertension)

– B = 1.736 – hypertension in the mother increases the log odds by 1.736

– Exp(B) = 5.831 – hypertension in the mother increases the odds of having a low-birth-weight baby by a factor of 5.831

– What is the change in probability?

If the original odds are 1:100 (p=.0099), they change to 5.831:100 (p=.0551); if the original odds are 1:1 (p=.5), they change to 5.831:1 (p=.854)
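A quick check of that arithmetic in Python (the helper function is just a convenience for this example):

def new_prob(old_odds, odds_ratio):
    odds = old_odds * odds_ratio
    return odds / (1 + odds)

print(new_prob(1 / 100, 5.831))   # about .055
print(new_prob(1.0, 5.831))       # about .854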

Interpretation of the coefficients (cont.)

Categorical variable race:

– First an overall effect

– Race(1) – white: the effect of being white is significant, acting to decrease the odds compared to ‘other’ by a factor of about .4

– The effect of being black is not significant compared with ‘other’

Making prediction

Suppose a mother:

– Age 20

– Weighs 130 pounds

– Smokes

– No hypertension or premature labor

– Has uterine irritability

– White

– Two visits to her doctor

Making prediction (cont.)

P(event) = 1/(1 + exp(-(a + b1*X1 + … + bk*Xk)))

P = .397

Predicted not to have a low-birth-weight baby because the probability is less than .5
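A hedged sketch of that formula as a Python function; the coefficient and predictor values would come from the fitted model and are deliberately not filled in here:

import numpy as np

def p_event(a, b, x):
    # a: intercept, b: coefficient vector, x: predictor values for one case
    return 1.0 / (1.0 + np.exp(-(a + np.dot(b, x))))

# e.g. p_event(a, np.array([b_age, b_lwt, ...]), np.array([20, 130, ...]))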

Checking classification

Need to study the characteristics of the mispredicted cases

– Transform>Compute> Pred_err=1 if…

– Analyze>Compare Means (LWT vs. Pred_err)

The mean LWT for the mispredicted cases is much lower than for the correctly predicted cases

Residual Analysis

Analyze>Regression>Logistic>click Save>click Cook’s, Leverage values, Unstandardized, Logit, and Standardized

Examining the data

– Cook’s and leverage values should be small (if a case has no influence on the regression result, the values would be 0)

– Res_1 is the residual on the probability scale (e.g. the 1st case has a predicted probability of .29804 and an actual ‘low’ value of 0, so res_1 = 0 - .29804 = -.29804)

– Zre_1 is the standardized residual of the probabilities

– Lre_1 is the residual in terms of the logit

ROC Curve (Receiver Operating Characteristic)

Sensitivity: the true positive rate

Specificity: the true negative rate

Changing the cutoff point (.5) changes both the sensitivity and the specificity

The ROC curve can help us select an ‘optimal’ cutoff point

Graphs>ROC Curve>move pre_1 to ‘Test Variable’, low to ‘State Variable’, type ‘1’ in the ‘Value of State Variable’, click ‘With diagonal reference line’ and ‘Coordinate points of the ROC Curve’
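A minimal scikit-learn sketch of the same idea; the observed outcomes and predicted probabilities below are invented:

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                      # observed 'low' values (illustrative)
y_prob = [0.1, 0.3, 0.4, 0.8, 0.2, 0.7, 0.35, 0.6]     # predicted probabilities (pre_1)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print("AUC:", roc_auc_score(y_true, y_prob))
# each (fpr, tpr) pair corresponds to one cutoff value in 'thresholds'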

ROC curve interpretation

Vertical axis: sensitivity (true positive rate)

Horizontal axis: false positive rate (1 - specificity)

Diagonal: reference line

Gives the trade-off between sensitivity and the false positive rate

Pay attention to the area where the curve rises rapidly

The 1st column of the ‘Coordinates of the Curve’ table gives the cutoff probabilities

Residual Analysis – Cont.

Examine the distribution of zre_1

– Graphs>Interactive>Histogram>drag zre_1 to the X axis, click Histogram, click Normal curve

Note: this plot need not show normality

Finding influential cases

– Graphs>Scatterplot>Define>move id to the X axis, coo_1 to the Y axis

Multicollinearity

– Use OLS regression to check (?)

Multinomial Logistic Regression

The dependent variable is categorical with two or more categories

It is an extension of logistic regression

The assumptions are the assumptions for logistic regression plus that the dependent variable has a multinomial distribution

Example

Objective: predict credit risk (3 categories) based on financial and demographic variables

Variables:

– Age

– Income

– Gender

– Marital (single, married, divsepwid)

– Numkids: # of dependent children

Example Cont.

– Numcards: # of credit cards

– Howpaid: how often paid (weekly, monthly)

– Mortgage: have a mortgage (y, n)

– Storecar: # of store credit cards

– Loans: # of other loans

– Risk: 1=bad risk, 2=bad risk-profit, 3=good risk

How does it work?

Let f(j) be the probability of being in outcome category j

– f(1) = P(bad risk-lost)

– f(2) = P(bad risk-profit)

– f(3) = P(good risk)

– g(1) = f(1)/f(3)

– g(2) = f(2)/f(3)

– g(3) = f(3)/f(3) = 1

How does it work? – Cont.

Fit the model:

– ln(g(1)) = A1 + B11*X1 + … + B1k*Xk

– ln(g(2)) = A2 + B21*X1 + … + B2k*Xk

– ln(g(3)) = ln(1) = 0 = A3 + B31*X1 + … + B3k*Xk

f(j) = g(j) / Σ g(j)

How does it work? – Cont.

f(1) = g(1) / (g(1) + g(2) + 1) = exp(A1 + B11*X1 + … + B1k*Xk) / (exp(A1 + B11*X1 + … + B1k*Xk) + exp(A2 + B21*X1 + … + B2k*Xk) + 1)

f(2) = g(2) / (g(1) + g(2) + 1) = exp(A2 + B21*X1 + … + B2k*Xk) / (exp(A1 + B11*X1 + … + B1k*Xk) + exp(A2 + B21*X1 + … + B2k*Xk) + 1)

f(3) = g(3) / (g(1) + g(2) + 1) = 1 / (exp(A1 + B11*X1 + … + B1k*Xk) + exp(A2 + B21*X1 + … + B2k*Xk) + 1)
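A hedged statsmodels sketch of the corresponding multinomial model; the file name and variable list mirror the example but are assumptions, not the course data:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("risk.csv")   # illustrative file with risk, marital, mortgage, income, numkids

fit = smf.mnlogit("risk ~ C(marital) + C(mortgage) + income + numkids", data=df).fit()
print(fit.summary())           # one set of A and B coefficients per non-reference category
probs = fit.predict(df)        # predicted f(1), f(2), f(3) for each case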

Example – Cont.

File > Open > Data > select Risk > Open

Move risk into the Dependent list box

Move marital and mortgage into the Factor(s) list box

Move income and numkids into the Covariate(s) list box

Click the Model button

– Click the Cancel button

Example (Cont.)

Click the Statistics button

– Check the Classification table check box

– Click Continue

Click Save

– The Multinomial Logistic Regression procedure in SPSS version 10 will only save model information in an XML (Extensible Markup Language) format

– Click Cancel

Click OK

Multinomial output

The Model Fit, Pseudo R-square, and Likelihood ratio tests are similar to logistic regression

The Parameter estimates table is different

– There are two sets of parameters

One for the probability ratio (bad risk-lost)/(good risk)

Another set for the probability ratio (bad risk-profit)/(good risk)

Interpretation of coefficients

Income in the ‘bad risk-lost’ section

– It is significant

– Exp(B) = .962: the expected probability ratio decreases a little (by a factor of .962) for a one-unit increase in income

How to predict?

F(1) – the chance of being in the ‘bad risk-lost’ group

F(2) – the chance of being in the ‘bad risk-profit’ group

F(3) – the chance of being in the ‘good risk’ group

F(j) = g(j)/sum(g(i))

g(j) = exp(model j)

How to predict? (cont.)

Suppose an individual:

– Is single, has a mortgage

– No children

– Income of 35,000 pounds

g(1) = .218

g(2) = .767

g(3) = 1

How to predict?

F(1)=.218/(.218+.767+1)=.110

F(2)=.386

F(3)=.504

The individual is classified as good risk
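The same classification arithmetic, checked in a couple of lines of Python:

g = [0.218, 0.767, 1.0]
total = sum(g)
print([round(gj / total, 3) for gj in g])   # [0.11, 0.386, 0.504] -> largest is 'good risk'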

Multinomial Logistic Regression with Interaction

Analyze>Regression>Multinomial Logistic>click Model, select Custom>specify your model (all main effects and the interaction between Marital and Mortgage)

Interpret the results as usual

Interaction Effects in Logistic Regression

It is similar to OLS regression:

– Add interaction terms to the model as cross-products

– In SPSS, highlighting two variables (holding down the Ctrl key) and moving them into the variable box creates the interaction term
