
Regression

Correlation and Regression

The test you choose depends on the level of measurement:

Independent                    Dependent                Test
Dichotomous                    Interval-Ratio           Independent Samples t-test
Nominal, Dichotomous           Interval-Ratio           ANOVA
Nominal, Dichotomous           Nominal, Dichotomous     Cross Tabs
Interval-Ratio, Dichotomous    Interval-Ratio           Bivariate Regression/Correlation

Correlation and Regression

Bivariate regression is a technique that fits a straight line as close as possible to all the coordinates of two continuous variables plotted on a two-dimensional graph, in order to summarize the relationship between the variables.

Correlation is a statistic that assesses the strength and direction of association of two continuous variables. It is created through a technique called regression.

Bivariate Regression

For example, a criminologist may be interested in the relationship between family income and number of children, or between self-esteem and criminal behavior.

Independent Variables        Dependent Variables
Family Income                Number of Children
Self-esteem                  Criminal Behavior

Bivariate Regression

For example, research hypotheses:

As family income increases, the number of children in families declines (negative relationship).
As self-esteem increases, reports of criminal behavior increase (positive relationship).

Independent Variables        Dependent Variables
Family Income                Number of Children
Self-esteem                  Criminal Behavior

Bivariate Regression

For example, null hypotheses:

There is no relationship between family income and the number of children in families. The relationship statistic b = 0.
There is no relationship between self-esteem and criminal behavior. The relationship statistic b = 0.

Independent Variables        Dependent Variables
Family Income                Number of Children
Self-esteem                  Criminal Behavior

Bivariate Regression

Let's look at the relationship between self-esteem and criminal behavior.

Regression starts with plots of the coordinates of the variables in a hypothesis (although in practice you will hardly ever plot your data by hand).

The data: each respondent has filled out a self-esteem assessment and reported the number of crimes committed.

Bivariate Regression

[Scatterplot: X = self-esteem (10 to 40), Y = number of crimes (0 to 10).]

What do you think the relationship is?

Bivariate Regression

[Same scatterplot: X = self-esteem, Y = crimes.]

Is it positive? Negative? No change?

Bivariate Regression

Regression is a procedure that fits a line to the data. The slope of that line acts as a model for the relationship between the plotted variables.

[Scatterplot with fitted line: X = self-esteem, Y = crimes.]

Bivariate Regression

The slope of a line is the change in the corresponding Y value for each unit increase in X (rise over run).

Slope = 0: no relationship!
Slope = 0.2: positive relationship!
Slope = -0.2: negative relationship!

[Plot showing a flat line, an upward line, and a downward line over the self-esteem/crimes axes.]

Bivariate Regression

The mathematical equation for a line:

Y = mX + b

Where:
Y = the line's position on the vertical axis at any point
X = the line's position on the horizontal axis at any point
m = the slope of the line
b = the intercept with the Y axis, where X equals zero

Bivariate Regression

The statistics equation for a line:

Ŷ = a + bX

Where:
Ŷ = the line's position on the vertical axis at any point (predicted value of the dependent variable)
X = the line's position on the horizontal axis at any point (value of the independent variable)
b = the slope of the line (called the coefficient)
a = the intercept with the Y axis, where X equals zero

Bivariate Regression

The next question: how do we draw the line?

Our goal for the line: fit the line as close as possible to all the data points, for all values of X.

Bivariate Regression

[Scatterplot: X = self-esteem, Y = crimes.]

How do we minimize the distance between a line and all the data points?

Bivariate Regression

How do we minimize the distance between a line and all the data points?

You already know of a statistic that minimizes the distance between itself and all the data values of a variable: the mean!

The mean minimizes the sum of squared deviations, Σ(Y - Ȳ)². It is where deviations sum to zero and where the squared deviations are at their lowest value.

Bivariate Regression

The mean minimizes the sum of squared deviations; it is where deviations sum to zero and where the squared deviations are at their lowest value.

Take this principle and fit the line to the place where the squared deviations (on Y) from the line, Σ(Y - Ŷ)², are at their lowest value across all X's. Here Ŷ is the value of the line.

Bivariate Regression

There are several lines you could draw where the deviations would sum to zero. Minimizing the sum of squared errors gives you the unique, best-fitting line for all the data points. It is the line that is closest to all points.

Ŷ (Y-hat) = Y value of the line at any X
Y = case value on variable Y
Y - Ŷ = residual
Σ(Y - Ŷ) = 0; therefore, we use Σ(Y - Ŷ)² and minimize that!
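The claim that the least-squares residuals sum to zero can be checked directly. A minimal sketch in Python, using made-up data (not the lecture's self-esteem data):

```python
# Fit a least-squares line to made-up data and verify that the
# residuals Y - Y-hat sum to (numerically) zero.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
# Least-squares slope and intercept
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(sum(residuals))  # ~0, up to floating-point rounding
```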

Bivariate Regression

Illustration of Y - Ŷ:

[Scatterplot with fitted line: X = self-esteem, Y = crimes. Dots = Yi, the actual Y value corresponding with each X; the line gives Ŷi, the level of the line at that X.]

Y = 10, Ŷ = 5: residual = 5
Y = 0, Ŷ = 4: residual = -4

Bivariate Regression

Illustration of (Y - Ŷ)²:

[Same scatterplot. Dots = Yi, the actual Y value at each X; the line gives Ŷi.]

(Yi - Ŷ)² = squared deviation:

Y = 10, Ŷ = 5: squared deviation = 25
Y = 0, Ŷ = 4: squared deviation = 16

Bivariate Regression

Illustration of Σ(Y - Ŷ)²:

[Same scatterplot. Dots = Yi, the actual Y value at each X; the line gives Ŷi.]

The goal: find the line that minimizes the sum of squared deviations. The best line will have the lowest value of Σ(Y - Ŷ)² (adding the squared deviations for each case in the sample).

Bivariate Regression

[Scatterplot with fitted line: X = self-esteem, Y = crimes.]

The fitted line for our example has the equation:

Ŷ = 6 - .2X

If you were to draw any other line, it would not minimize Σ(Y - Ŷ)².

In general, Y = a + bX + e, where e = the distance from the line to the data point, or error.

Bivariate Regression

We use Σ(Y - Ŷ)² and minimize that!

There is a simple, elegant formula for discovering the line that minimizes the sum of squared errors:

b = Σ((X - X̄)(Y - Ȳ)) / Σ(X - X̄)²
a = Ȳ - bX̄
Ŷ = a + bX

This is the method of least squares; it gives our least-squares estimates and indicates why we call this technique ordinary least squares, or OLS, regression.
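The two formulas translate directly into code. A minimal sketch in Python, using made-up data (not the lecture's self-esteem data):

```python
def ols(xs, ys):
    """Least-squares intercept a and slope b from the formulas above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b = sum((X - X_bar)(Y - Y_bar)) / sum((X - X_bar)^2)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar  # a = Y_bar - b * X_bar
    return a, b

a, b = ols([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(a, 10), round(b, 10))  # 2.2 0.6
```

Any other (a, b) pair would give a larger sum of squared errors for this data.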

Bivariate Regression

[Plot: Y from 1 to 10 against X = 0, 1.]

Considering that a regression line minimizes Σ(Y - Ŷ)², where would the regression line cross for an interval-ratio variable regressed on a dichotomous independent variable?

For example:
0 = Men: Mean = 6
1 = Women: Mean = 4

Bivariate Regression

[Plot: the regression line passes through the two group means at X = 0 and X = 1.]

The difference of means will be the slope. This is the same number that is tested for significance in an independent samples t-test.

0 = Men: Mean = 6
1 = Women: Mean = 4
Slope = -2; Ŷ = 6 - 2X
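This can be verified numerically. A sketch with hypothetical data chosen to match the group means above (0 = men, mean 6; 1 = women, mean 4):

```python
# With a 0/1 independent variable, the OLS slope equals the
# difference of the two group means.
men = [5, 6, 7]    # coded X = 0, mean = 6
women = [3, 4, 5]  # coded X = 1, mean = 4

xs = [0] * len(men) + [1] * len(women)
ys = men + women

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

mean_diff = sum(women) / len(women) - sum(men) / len(men)
print(a, b, mean_diff)  # 6.0 -2.0 -2.0
```

The intercept is the mean of the group coded 0, and the slope is the difference of means, exactly as on the slide: Ŷ = 6 - 2X.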

Correlation

This lecture has covered how to model the relationship between two variables with regression. Another concept is strength of association. Correlation provides that.

Correlation

[Scatterplot with fitted line: X = self-esteem (10 to 40), Y = crimes (0 to 10).]

So our equation is: Ŷ = 6 - .2X

The slope tells us the direction of association. How strong is that?

^

Correlation

Example of low negative correlation:

[Scatterplot: points widely scattered around a downward-sloping line.]

When there is a lot of difference on the dependent variable across subjects at particular values of X, there is NOT as much association (weaker).

Correlation

Example of high negative correlation:

[Scatterplot: points tightly clustered around a downward-sloping line.]

When there is little difference on the dependent variable across subjects at particular values of X, there is MORE association (stronger).

Correlation

To find the strength of the relationship between two variables, we need correlation. The correlation is the standardized slope; it refers to the standard deviation change in Y when you go up a standard deviation in X.

Correlation

The correlation is the standardized slope; it refers to the standard deviation change in Y when you go up a standard deviation in X.

Recall that the standard deviation of X is Sx = √( Σ(X - X̄)² / (n - 1) )

and the standard deviation of Y is Sy = √( Σ(Y - Ȳ)² / (n - 1) )

Pearson correlation: r = b (Sx / Sy)
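A quick numeric check that the standardized slope b(Sx/Sy) agrees with the usual covariance-based definition of r, using Python's statistics module and made-up data:

```python
import statistics as st

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = st.mean(xs)
y_bar = st.mean(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)

# Standardize the slope: r = b * (Sx / Sy)
r = b * st.stdev(xs) / st.stdev(ys)

# Same value from the covariance-based definition of r
r_direct = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
           ((n - 1) * st.stdev(xs) * st.stdev(ys))
print(round(r, 6), round(r_direct, 6))  # 0.774597 0.774597
```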

Correlation

The Pearson correlation, r:
- tells the direction and strength of the relationship between continuous variables
- ranges from -1 to +1
- is + when the relationship is positive and - when the relationship is negative
- the higher the absolute value of r, the stronger the association
- a standard deviation change in X corresponds with an r standard deviation change in Y

Correlation

The Pearson correlation, r:

The Pearson correlation is an inferential statistic too. Its test statistic is:

t(n-2) = (r - 0) / √( (1 - r²) / (n - 2) )

where 0 is the null-hypothesis value of the correlation. When t is significant, there is a relationship in the population that is not equal to zero!
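The t statistic for r follows directly from that formula. A minimal sketch with made-up data (n is tiny here, so this is purely illustrative):

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)

# t with n - 2 degrees of freedom, testing H0: correlation = 0
t = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))
print(round(r, 4), round(t, 4))  # 0.7746 2.1213
```

Compare the computed t against a critical t with n - 2 degrees of freedom to decide about the null.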

Error Analysis

Ŷ = a + bX. This equation gives the conditional mean of Y at any given value of X.

So in reality, our line gives us the expected mean of Y given each value of X. The line's equation tells you how the mean of your dependent variable changes as your independent variable goes up.

Error Analysis

As you know, every mean has a distribution around it, so there is a standard deviation. This is true for conditional means as well: you also have a conditional standard deviation.

The conditional standard deviation, or root mean square error, equals the approximate average deviation from the line:

√( SSE / (n - 2) ) = √( Σ(Y - Ŷ)² / (n - 2) )
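The root mean square error is a one-liner once the fitted line is in hand. A sketch in Python with made-up data:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Root mean square error: sqrt(SSE / (n - 2))
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
rmse = math.sqrt(sse / (n - 2))
print(round(sse, 4), round(rmse, 4))  # 2.4 0.8944
```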

Error Analysis

The assumption of homoskedasticity: the variation around the line is the same no matter the value of X. The conditional standard deviation holds for any given value of X.

If there is a relationship between X and Y, the conditional standard deviation is going to be less than the standard deviation of Y; if this is so, you have improved prediction of the mean value of Y by looking at each level of X.

If there were no relationship, the conditional standard deviation would be the same as the original, and the regression line would be flat at the mean of Y.

[Plot contrasting the conditional standard deviation around the line with the original standard deviation around Ȳ.]

Error Analysis

So guess what? We have a way to determine how much our understanding of Y is improved when taking X into account. It is based on the fact that the conditional standard deviations should be smaller than Y's original standard deviation.

Error Analysis

Proportional Reduction in Error

Let's call the variation around the mean of Y "Error 1." Let's call the variation around the line when X is considered "Error 2."

But rather than going all the way to standard deviations to determine error, let's just stop at the basic measure, the sum of squared deviations:

Error 1 (E1) = Σ(Y - Ȳ)², also called the Total Sum of Squares
Error 2 (E2) = Σ(Y - Ŷ)², also called the Sum of Squared Errors

[Plot marking Error 1 (point to mean) and Error 2 (point to line).]

R-Squared

Proportional Reduction in Error

To determine how much taking X into consideration reduces the variation in Y (at each level of X), we can use a simple formula:

(E1 - E2) / E1

which tells us the proportion or percentage of the original error that is explained by X, where:

Error 1 (E1) = Σ(Y - Ȳ)²
Error 2 (E2) = Σ(Y - Ŷ)²

R-squared

r² = (E1 - E2) / E1 = (TSS - SSE) / TSS = ( Σ(Y - Ȳ)² - Σ(Y - Ŷ)² ) / Σ(Y - Ȳ)²

r² is called the coefficient of determination. It is also the square of the Pearson correlation.
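Computing r² as a proportional reduction in error is mechanical. A sketch with made-up data:

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

tss = sum((y - y_bar) ** 2 for y in ys)                    # E1: around the mean
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # E2: around the line
r_squared = (tss - sse) / tss
print(round(r_squared, 4))  # 0.6
```

For this data r ≈ 0.7746, and 0.7746² ≈ 0.6, matching the coefficient of determination.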

R-Squared

R²:
- is the improvement obtained by using X (and drawing a line through the conditional means) in getting as near as possible to everybody's value of Y, over just using the mean of Y alone
- falls between 0 and 1
- an R² of 1 means an exact fit (there is no variation of scores around the regression line)
- an R² of 0 means no relationship (as much scatter as in the original Y variable, and a flat regression line through the mean of Y)
- would be the same for X regressed on Y as for Y regressed on X
- can be interpreted as the percentage of variability in Y that is explained by X

Some people get hung up on maximizing R², but this is too bad, because any effect is still a finding. A small R² only indicates that you haven't told the whole (or much of the) story with your variable.

Error Analysis, SPSS

Some SPSS output (Anti-Gay Marriage regressed on Age):

r² = ( Σ(Y - Ȳ)² - Σ(Y - Ŷ)² ) / Σ(Y - Ȳ)² = 196.886 / 2853.286 = .069

Here 196.886 is the regression sum of squares (line to the mean), and 2853.286 is the original sum of squares for Anti-Gay Marriage (data points to the mean); the remainder is the distance from the data points to the line.

Error Analysis

Some SPSS output (Anti-Gay Marriage regressed on Age):

r² = 196.886 / 2853.286 = .069

[Plot: Age (0, 18, 45, 89) on X; Anti-Gay Marriage on Y, scored Strong Support = 1, Support = 2, Neutral = 3, Oppose = 4, Strong Oppose = 5; overall mean M = 2.98.]

Colored lines in the plot are examples of:
- distance from each person's data point to the line, or model: new, still unexplained error
- distance from the line, or model, to the mean for each person: the reduction in error
- distance from each person's data point to the mean: the original variable's error

ANOVA Table

Q: Why do I see an ANOVA table?
A: We bust up variance to get R².

Each case has a value for the distance from the line (the conditional mean) to the overall mean Ȳ, and a value for the distance from its Y value to the line (the conditional mean).

The squared distance from the line to the mean (Regression SS) is equivalent to BSS, with df = 1.

The squared distance from the line to the data values on Y (Residual SS) is equivalent to WSS, with df = n - 2. In ANOVA, everyone in a group shares Ȳ(group); in regression, everyone at a given X shares the line's conditional mean.

The ratio of Regression to Residual sums of squares, each divided by its df, forms an F distribution in repeated sampling. If F is significant, X explains some variation in Y.

[Plot: the line intersects the group means, with BSS, WSS, and TSS distances marked against the mean of Y.]
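The regression F ratio can be assembled from the same sums of squares. A sketch with made-up data:

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

tss = sum((y - y_bar) ** 2 for y in ys)                    # total SS
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # residual SS
reg_ss = tss - sse                                         # regression SS

# F = (Regression SS / 1) / (Residual SS / (n - 2))
f = (reg_ss / 1) / (sse / (n - 2))
print(round(f, 4))  # 4.5
```

In bivariate regression this F equals the square of the t statistic for the slope, a useful sanity check.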

Dichotomous Variables

[Plot: Y from 1 to 10 against X = 0, 1; the regression line intersects the group means, with BSS, WSS, and TSS marked.]

Using a dichotomous independent variable, the ANOVA table in bivariate regression will have the same numbers and ANOVA results as a one-way ANOVA table would (and compare this with an independent samples t-test).

0 = Men: Mean = 6
1 = Women: Mean = 4
Overall mean = 5
Slope = -2; Ŷ = 6 - 2X

Regression, Inferential Statistics

Recall that statistics are divided between descriptive and inferential statistics.

Descriptive: the equation for your line is a descriptive statistic. It tells you the real, best-fitted line that minimizes squared errors.

Inferential: but what about the population? What can we say about the relationship between your variables in the population? The inferential statistics are estimates based on the best-fitted line.

Regression, Inferential Statistics

The significance of F you already understand. The ratio of the Regression (line to the mean of Y) to the Residual (line to data point) sums of squares, each divided by its degrees of freedom, forms an F ratio in repeated sampling.

Null: r² = 0 in the population. If F exceeds the critical F, then your variables have a relationship in the population (X explains some of the variation in Y).

[F distribution with the most extreme 5% of F's shaded.]

F = Regression Mean Square / Residual Mean Square

Regression, Inferential Statistics

What about the slope, or coefficient? From sample to sample, different slopes would be obtained. The slope has a sampling distribution that is normally distributed. So we can do a significance test.

[Standard normal curve, z from -3 to 3.]

Regression, Inferential Statistics

Conducting a test of significance for the slope of the regression line:

By slapping the sampling distribution for the slope over a guess of the population's slope, H0, one determines whether a sample could have been drawn from a population where the slope equals H0.

1. Two-tailed significance test for α-level = .05
2. Critical t = ±1.96
3. To find if there is a significant slope in the population:
   H0: β = 0
   Ha: β ≠ 0
4. Collect data
5. Calculate t: t = (b - β0) / s.e., where s.e. = √( Σ(Y - Ŷ)² / (n - 2) ) / √( Σ(X - X̄)² )
6. Make a decision about the null hypothesis
7. Find the p-value
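Step 5 can be carried out by hand. A sketch with made-up data (n is far too small for the ±1.96 critical value to apply; it is illustrative only):

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# s.e.(b) = sqrt(SSE / (n - 2)) / sqrt(sum((X - X_bar)^2))
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t = (b - 0) / se  # testing H0: beta = 0
print(round(se, 4), round(t, 4))  # 0.2828 2.1213
```

Note that this t equals the t computed earlier from the correlation r, as it must in bivariate regression.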

Correlation and Regression

Back to the SPSS output: the standard error, the t, and the p-value appear on the SPSS output too!

Correlation and Regression

Back to the SPSS output:

Ŷ = 1.88 + .023X

So in the GSS example, the slope is significant. There is evidence of a positive relationship in the population between age and anti-gay-marriage sentiment. 6.9% of the variation in marriage attitude is explained by age. The older Americans get, the more likely they are to oppose gay marriage.

A one-year increase in age elevates anti attitudes by .023 scale units. There is a weak positive correlation: a standard deviation increase in age produces an r (here √.069 ≈ .26) standard deviation increase in anti scale units.
