
Regression when some of the regressors are qualitative

Salary = f(education, experience, sex, race) + E
Maintenance charges of vehicle = f(age, quality) + E

Qualitative variables in regression (Dummy Variables)
In regression analysis the response variable is influenced not only by quantitative variables but also by variables of interest that are qualitative, such as gender, race, and colour.
For example, holding all other factors constant, female college professors are found to earn less than their male counterparts, and non-whites are found to earn less than whites.
This pattern may result from sex or racial discrimination.
Qualitative variables such as sex and race do influence the dependent variable and clearly should be included among the explanatory variables.
Qualitative variables usually indicate the presence or absence of a quality, such as male or female, black or white, literate or illiterate, urban or rural. The problem is how to incorporate such variables in a regression along with the quantitative variables.

Qualitative variables in regression (Dummy Variables)

One method of quantifying such attributes is to construct artificial variables that take the values 1 or 0, with 1 indicating the presence of an attribute and 0 indicating its absence.
Variables that assume such 0 and 1 values are called dummy variables. Other names for such variables are indicator variables, binary variables, qualitative variables, and categorical variables.
We can include dummy variables in a regression and conduct hypothesis tests just as we can with any quantitative variable, but the interpretation of the coefficient on a dummy variable is somewhat different from what we have seen before.

Consider data on the annual salary of male and female college teachers and their years of teaching experience.

Define two dummy variables, D2 and D3, one for each category. The model is

$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + U_i$$

$Y_i$ = annual salary of the ith college professor
$X_i$ = years of teaching experience of the ith professor
$D_{2i}$ = 1 if male professor, = 0 otherwise
$D_{3i}$ = 1 if female professor, = 0 otherwise

Salary (Y), thousands of rupees | Sex | Years of teaching experience (X) | Intercept | D2 | D3
60 | Male | 6 | 1 | 1 | 0
50 | Male | 4 | 1 | 1 | 0
40 | Female | 5 | 1 | 0 | 1
40 | Male | 3 | 1 | 1 | 0
60 | Male | 6 | 1 | 1 | 0
50 | Female | 7 | 1 | 0 | 1
60 | Female | 5 | 1 | 0 | 1

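A solution is not possible because of perfect multicollinearity (the dummy variable trap): D2 + D3 equals the intercept column in every row, so the data matrix is not of full rank.
The remedy for this multicollinearity is to drop one dummy variable, say the one for female.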
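The rank argument can be checked directly on the seven observations. The following is a minimal sketch using only the Python standard library; the helper name `matrix_rank` is chosen here for illustration and is not part of the slides.

```python
from fractions import Fraction

# Sex and experience of the seven teachers, in the order of the salary table.
sex = ["Male", "Male", "Female", "Male", "Male", "Female", "Female"]
years = [6, 4, 5, 3, 6, 7, 5]

def matrix_rank(rows):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    n_rows, n_cols = len(m), len(m[0])
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

d2 = [1 if s == "Male" else 0 for s in sex]    # male dummy
d3 = [1 if s == "Female" else 0 for s in sex]  # female dummy

# Columns: intercept, D2, D3, X (both dummies kept: the trap)
trap = [[1, a, b, x] for a, b, x in zip(d2, d3, years)]
# Columns: intercept, D2, X (female dummy dropped)
fixed = [[1, a, x] for a, x in zip(d2, years)]

rank_trap = matrix_rank(trap)
rank_fixed = matrix_rank(fixed)
print(rank_trap, "of", len(trap[0]), "columns")
print(rank_fixed, "of", len(fixed[0]), "columns")
```

With both dummies the matrix has four columns but lower rank, so the normal equations have no unique solution; after dropping one dummy the rank equals the number of columns.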

Define one dummy variable, D2, for the male category. The model is

$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \beta X_i + U_i$$

$Y_i$ = annual salary of the ith college professor
$X_i$ = years of teaching experience of the ith professor
$D_{2i}$ = 1 if male professor, = 0 otherwise

Salary (Y), thousands of rupees | Sex | Years of teaching experience (X) | D2
60 | Male | 6 | 1
50 | Male | 4 | 1
40 | Female | 5 | 0
40 | Male | 3 | 1
60 | Male | 6 | 1
50 | Female | 7 | 0
60 | Female | 5 | 0

RULE

When the model includes an intercept, the number of indicator (dummy) variables is always one less than the number of categories.

Important notes

The assignment of the values 0 and 1 is arbitrary.
The category assigned the value 0 is often referred to as the control group, base category, or omitted category, in the sense that comparisons are made with that category. In this example it is the female group.
The intercept term $\alpha_1$ in the model is the intercept for the control group.

Interpretation of results

Mean salary for a female professor:
$$E(Y_i \mid X_i, D_{2i} = 0) = \alpha_1 + \beta X_i$$

Mean salary for a male professor:
$$E(Y_i \mid X_i, D_{2i} = 1) = (\alpha_1 + \alpha_2) + \beta X_i$$

The equation estimated by OLS is

$$\hat{Y}_i = 24.4 + 6.64 D_{2i} + 4.51 X_i$$

with standard errors (16.28), (6.9) and (2.74), respectively.

1) The estimated mean salary of a female college teacher is 24.4 + 4.51X.
2) The estimated mean salary of a male college teacher is 31.04 + 4.51X.
3) The estimated mean salary of a female college teacher with 6 years of experience is 24.4 + 4.51(6) = 51.46.
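The reported coefficients and standard errors can be reproduced from the seven observations. The sketch below solves the normal equations in plain Python; the helper `invert` and the variable names are illustrative choices, not part of the slides, and any regression package should give the same numbers.

```python
import math

salary = [60, 50, 40, 40, 60, 50, 60]   # Y, thousands of rupees
d2     = [1, 1, 0, 1, 1, 0, 0]          # 1 = male, 0 = female
years  = [6, 4, 5, 3, 6, 7, 5]          # X

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

X = [[1.0, d, x] for d, x in zip(d2, years)]   # columns: intercept, D2, X
n, k = len(X), len(X[0])

# Normal equations: (X'X) b = X'y
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * y for row, y in zip(X, salary)) for i in range(k)]
xtx_inv = invert(xtx)
coef = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]

# Residual variance and standard errors of the coefficients
resid = [y - sum(b * v for b, v in zip(coef, row)) for row, y in zip(X, salary)]
s2 = sum(e * e for e in resid) / (n - k)
se = [math.sqrt(s2 * xtx_inv[i][i]) for i in range(k)]
t_d2 = coef[1] / se[1]                         # t-ratio of the male dummy

print([round(b, 2) for b in coef])
print([round(s, 2) for s in se])
print(round(t_d2, 2))
```

The ratio `t_d2` is the statistic used to judge whether the male and female intercepts differ.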

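Test of hypothesis
Compare the mean annual salary of male and female college teachers.

$H_0: \alpha_2 = 0$

$$t = \frac{\hat{\alpha}_2 - \alpha_2}{\sqrt{\operatorname{var}(\hat{\alpha}_2)}} = \frac{6.64 - 0}{6.9} = 0.96 \ (\text{ns})$$

That is, sex has no significant effect on the mean salary of college teachers.

[Figure: parallel salary lines for male (intercept $\alpha_1 + \alpha_2$) and female (intercept $\alpha_1$) teachers]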

Example: A manufacturer of small office copiers and word-processing machinery pays its salespeople a small base salary plus a commission equal to a fixed percentage of the person's sales. One of the salespeople charges that this salary structure discriminates against women. The director of personnel sees that base salary depends on length of service, but she does not know how to use the data to learn whether it also depends on gender and whether there is discrimination against women. Develop a regression model to test the gender effect on the base salary.

Problem: How can we incorporate the salesperson's gender into the regression model?

Introduce a dummy variable for GENDER:

$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \beta X_i + U_i$$

where
$Y_i$ = base salary of the ith salesperson
$X_i$ = months of experience of the ith salesperson
$D_{2i}$ = 1 if salesman, = 0 otherwise (saleswoman)

NOTE: The assignment of the values 0 and 1 is arbitrary.
Saleswoman is the base category (also called the omitted or comparison category).
The number of dummy variables is always one less than the number of categories.

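Regressions for dummy variables

Regression for saleswomen ($D_2 = 0$):
$$E(Y_i \mid X_i, D_{2i} = 0) = \alpha_1 + \beta X_i$$

Regression for salesmen ($D_2 = 1$):
$$E(Y_i \mid X_i, D_{2i} = 1) = (\alpha_1 + \alpha_2) + \beta X_i$$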

[Figure: regression lines for salesmen (intercept $\alpha_1 + \alpha_2$) and saleswomen (intercept $\alpha_1$)]

[Figure: scatter plot of salary vs. experience, by gender (female, male)]

Analysis of the model

$H_0: \alpha_2 = 0$, i.e. gender has no influence on salary.

The regression coefficient for gender is significant (p-value < .05), so we conclude that the firm does discriminate against its saleswomen. Because the coefficient of the dummy variable is positive, the salary of salesmen (the category coded 1) is higher than the salary of saleswomen.

Estimation in regression

Estimated mean salary of a salesman with 5 months of experience:

$$\hat{Y}_i = 5.460 + 0.789(1) + 0.227(5) = 7.384$$
or $7384

Estimated mean salary of a saleswoman with 5 months of experience:

$$\hat{Y}_i = 5.460 + 0.789(0) + 0.227(5) = 6.595$$
or $6595
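The same arithmetic can be written as a small function. This is only a sketch built on the three reported coefficients; the function and constant names are made up for illustration, and salary is taken to be in thousands of dollars.

```python
# Estimated coefficients as reported in the worked example
# (salary in thousands of dollars, experience in months).
INTERCEPT = 5.460   # base category: saleswoman
GENDER = 0.789      # differential intercept for salesmen (D2 = 1)
EXPERIENCE = 0.227  # common slope on months of experience

def predicted_salary(is_salesman: bool, months: float) -> float:
    """Estimated mean base salary from the dummy-variable regression."""
    d2 = 1 if is_salesman else 0
    return INTERCEPT + GENDER * d2 + EXPERIENCE * months

saleswoman = predicted_salary(False, 5)
salesman = predicted_salary(True, 5)
gap = salesman - saleswoman   # equals the dummy coefficient at any experience level
print(round(saleswoman, 3), round(salesman, 3), round(gap, 3))
```

Because the model has a common slope, the gap between the two predictions is the dummy coefficient regardless of the months of experience chosen.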


Regression With Two Qualitative Variables

Consider the regression of advertising expenditure on sales, type of firm (incorporated, not incorporated), public relations department, and quality of sales management (high, low). Using the two qualitative variables, the model is

$$Y = \alpha_1 + \alpha_2 D_2 + \alpha_3 D_3 + U$$

$D_2$ = 1 if the firm is incorporated, = 0 otherwise
$D_3$ = 1 if the quality of sales management is high, = 0 otherwise

A firm that is not incorporated and has low-quality sales management is the COMPARISON category.

Data analysis in MINITAB


$D_2$ = 1 if incorporated, = 0 otherwise (i.e. not incorporated)
$D_3$ = 1 if quality of sales management is high, = 0 otherwise (low-quality sales management)

The estimated regression equation is

$$\hat{Y} = 44000 - 16000 D_2 + 5500 D_3$$

INTERPRETATION:
Average expenditure for the base category (not incorporated, low-quality sales management) is 44000.
Expenditure decreases by 16000 if the firm is incorporated.
Expenditure increases by 5500 if the quality of sales management is high.

The nonsignificant t-ratio for D2 indicates no significant difference between the advertising expenditure of incorporated and not incorporated firms.
The nonsignificant t-ratio for D3 indicates no significant difference between the advertising expenditure of firms with high and low quality of sales management.
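With two dummies there are four combinations of categories, and the estimated equation gives a mean for each. A short sketch that tabulates them from the three reported coefficients (names such as `mean_expenditure` are illustrative):

```python
from itertools import product

# Estimated equation from the slide: Y = 44000 - 16000*D2 + 5500*D3
A1, A2, A3 = 44000, -16000, 5500

def mean_expenditure(d2: int, d3: int) -> int:
    """Estimated mean advertising expenditure for one combination of dummies."""
    return A1 + A2 * d2 + A3 * d3

cells = {(d2, d3): mean_expenditure(d2, d3)
         for d2, d3 in product((0, 1), repeat=2)}

for (d2, d3), value in cells.items():
    firm = "incorporated" if d2 else "not incorporated"
    mgmt = "high" if d3 else "low"
    print(f"{firm:17s} {mgmt:5s} {value}")
```

The model is additive, so the effect of incorporation is the same 16000 reduction at both levels of sales management quality.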

Regression With more than two categories

Suppose that, based on cross-sectional data, we want to regress the annual expenditure on health care by an individual on the income and education of the individual. Since the variable education is qualitative in nature, we consider three mutually exclusive levels of education: (i) less than high school, (ii) high school, (iii) college.

$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + U_i$$

$Y_i$ = annual expenditure on health care
$X_i$ = annual income
$D_2$ = 1 if high school education, = 0 otherwise
$D_3$ = 1 if college education, = 0 otherwise
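Coding a three-level variable into two dummies can be sketched as follows. The helper `make_dummies` and the five-row sample are hypothetical and only illustrate the coding rule; they are not the data of the example.

```python
def make_dummies(values, base):
    """Return {category: 0/1 column} for every category except `base`."""
    categories = [c for c in dict.fromkeys(values) if c != base]
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical sample, not the data of the example.
education = ["College", "High School", "Less than High School",
             "High School", "College"]

dummies = make_dummies(education, base="Less than High School")
d2 = dummies["High School"]   # D2 = 1 if high school education
d3 = dummies["College"]       # D3 = 1 if college education
print(d2)
print(d3)
```

The base category gets no column of its own: it is the case where both D2 and D3 are 0.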


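Expenditure (Y) | Income (X) | Education | D3 | D2 | D = D3 + D2
38 | 90 | College | 1 | 0 | 1
40 | 95 | College | 1 | 0 | 1
39 | 82 | College | 1 | 0 | 1
32 | 75 | College | 1 | 0 | 1
36 | 87 | College | 1 | 0 | 1
40 | 93 | College | 1 | 0 | 1
20 | 45 | High School | 0 | 1 | 1
18 | 46 | High School | 0 | 1 | 1
17 | 48 | High School | 0 | 1 | 1
28 | 52 | High School | 0 | 1 | 1
(missing) | 18 | Less than High School | 0 | 0 | 0
(missing) | 25 | Less than High School | 0 | 0 | 0
(missing) | 27 | Less than High School | 0 | 0 | 0
(missing) | 32 | Less than High School | 0 | 0 | 0
(missing) | 20 | Less than High School | 0 | 0 | 0

Expenditure values for the less-than-high-school rows are not available.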

Interpretation of results

Mean health care expenditure by an individual with less than high school education:
$$E(Y \mid D_2 = 0, D_3 = 0) = \alpha_1 + \beta X$$

Mean health care expenditure by an individual with high school education:
$$E(Y \mid D_2 = 1, D_3 = 0) = (\alpha_1 + \alpha_2) + \beta X$$

Mean health care expenditure by an individual with college education:
$$E(Y \mid D_2 = 0, D_3 = 1) = (\alpha_1 + \alpha_3) + \beta X$$

The intercept $\alpha_1$ is the intercept for the base category.
The differential intercepts $\alpha_2$ and $\alpha_3$ tell by how much the intercepts of the other two categories differ from the intercept of the base category.

TEST OF HYPOTHESIS
1) Comparison of mean expenditure on health care by an individual with high school education and one with less than high school education (base group):

$H_0: \alpha_2 = 0$

2) Comparison of mean expenditure on health care by an individual with college education and one with less than high school education (base group):

$H_0: \alpha_3 = 0$

Estimated equation: Expenditure = -4.41 + 7.92 D2 + 10.5 D3 + 0.361 X

Term | Estimate | SE | t-ratio | P-value
Intercept | -4.41 | 3.195 | |
D2 | 7.92 | 3.328 | 2.38* | 0.037
D3 | 10.5 | 7.813 | 1.34 | 0.206
X | 0.361 | 0.122 | 2.95 | 0.013
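Each t-ratio is the estimate divided by its standard error, and each group intercept is the base intercept plus the relevant differential intercept. A brief sketch using only the reported figures (the dictionary keys are labels chosen here):

```python
# Estimates and standard errors as reported for the health care regression.
estimates = {"intercept": -4.41, "D2": 7.92, "D3": 10.5, "X": 0.361}
std_errors = {"intercept": 3.195, "D2": 3.328, "D3": 7.813, "X": 0.122}

# t-ratio for H0: coefficient = 0
t_ratios = {name: estimates[name] / std_errors[name] for name in estimates}

# Intercept of each education group (the slope on income is common to all).
group_intercepts = {
    "Less than High School": estimates["intercept"],
    "High School": estimates["intercept"] + estimates["D2"],
    "College": estimates["intercept"] + estimates["D3"],
}

for name, t in t_ratios.items():
    print(f"{name:10s} t = {t:6.2f}")
for group, a in group_intercepts.items():
    print(f"{group:22s} intercept = {a:5.2f}")
```

Ratios computed from the rounded figures can differ from the reported t-ratios in the last digit, as happens for the income coefficient.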

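TEST OF HYPOTHESIS

3) Comparison of mean expenditure on health care by an individual with high school education and one with college education:

$H_0: \alpha_2 = \alpha_3$

Unrestricted Model
$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + U_i$$

Restricted Model (under $H_0: \alpha_2 = \alpha_3$)
$$Y_i = \alpha_1 + \alpha_2 D^*_{2i} + \beta X_i + U_i, \quad \text{where } D^*_{2i} = D_{2i} + D_{3i}$$

Unrestricted ANOVA
SOV | DF | SS
Regression | 3 | 3052.9
Error | 11 | 71.1
Total | 14 | 3124.0

Restricted ANOVA
SOV | DF | SS
Regression | 2 | 3051.2
Error | 12 | 72.8
Total | 14 | 3124.0

Here ESS is the error sum of squares and Edf the error degrees of freedom:

$$F = \frac{(ESS_R - ESS_{UR}) / (Edf_R - Edf_{UR})}{ESS_{UR} / Edf_{UR}} = \frac{(72.8 - 71.1) / (12 - 11)}{71.1 / 11} \approx \frac{1.70}{6.5} \approx 0.2615 \ (\text{ns})$$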
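The F ratio for comparing a restricted with an unrestricted model can be wrapped in a small function. This is a sketch; the function name `restricted_f` is illustrative, and the inputs are the error sums of squares and error degrees of freedom from the two ANOVA tables.

```python
def restricted_f(ess_r, df_r, ess_ur, df_ur):
    """F statistic for comparing a restricted with an unrestricted model.

    ess_* are error (residual) sums of squares, df_* their degrees of freedom.
    """
    numerator = (ess_r - ess_ur) / (df_r - df_ur)
    denominator = ess_ur / df_ur
    return numerator / denominator

# Health care example: restricted model forces the D2 and D3 coefficients to be equal.
f_stat = restricted_f(ess_r=72.8, df_r=12, ess_ur=71.1, df_ur=11)
print(round(f_stat, 2))
```

The statistic has (1, 11) degrees of freedom here; it is far below the 5% critical value of roughly 4.84, which agrees with the non-significant verdict.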


DUMMY VARIABLE FOR COMPARING TWO REGRESSION LINES

When we use a regression model involving time series data, there may be a structural change in the relationship between the regressand and the regressors.

By structural change we mean that the values of the parameters of the model do not remain the same through the entire time period.

Dummy variables can be used to detect structural changes in the regression parameters.

[Figure: possible pairs of regression lines: coincident (equal), parallel, concurrent, dissimilar]

Equality of two regression lines by dummy variable approach:

EXAMPLE: An economist is studying the relation between amount of saving and level of income for middle-income families from urban and rural areas, based on independent samples from the two populations. Each of the two relations can be modeled by linear regression. He wishes to compare whether, at a given income level, urban and rural families tend to save the same amount, i.e. whether the two regression lines are the same. If they are not, he wishes to explore whether at least the amounts saved out of an additional rupee of income are the same for the two groups, i.e. whether the slopes of the two regression lines are the same.

$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \beta_1 X_i + \beta_2 (D_{2i} X_i) + U_i \quad \text{(i)}$$

$Y_i$ = saving of a family
$X_i$ = income of a family
$D_{2i}$ = 1 for an urban family, = 0 otherwise

Mean saving of a rural family:
$$E(Y_i \mid D_{2i} = 0) = \alpha_1 + \beta_1 X_i \quad \text{(ii)}$$

Mean saving of an urban family:
$$E(Y_i \mid D_{2i} = 1) = (\alpha_1 + \alpha_2) + (\beta_1 + \beta_2) X_i \quad \text{(iii)}$$

Therefore estimating (i) is equivalent to estimating the two individual saving functions (ii) and (iii).
In (i), $\alpha_2$ is the differential intercept and $\beta_2$ is the differential slope coefficient, indicating by how much the slope coefficient of the urban saving function differs from the slope coefficient of the rural saving function.
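The equivalence between (i) and the pair (ii), (iii) can be illustrated by fitting the two groups separately and taking differences. The sketch below uses made-up (income, saving) pairs purely for illustration; it relies on the standard result that OLS on the fully interacted model reproduces the two separate least-squares lines.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a simple regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical (income, saving) pairs, for illustration only.
rural = [(10, 1.0), (20, 2.5), (30, 3.5), (40, 5.0)]
urban = [(10, 3.0), (20, 5.5), (30, 7.5), (40, 10.0)]

a_rural, b_rural = fit_line(*zip(*rural))   # equation (ii)
a_urban, b_urban = fit_line(*zip(*urban))   # equation (iii)

alpha1, beta1 = a_rural, b_rural            # base (rural) line
alpha2 = a_urban - a_rural                  # differential intercept
beta2 = b_urban - b_rural                   # differential slope
print(round(alpha1, 3), round(alpha2, 3), round(beta1, 3), round(beta2, 3))
```

Reading the result back through (iii), the urban line is recovered as (alpha1 + alpha2) + (beta1 + beta2) X.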

The equality of the two saving functions can be tested by considering the hypothesis

$H_0: \alpha_2 = \beta_2 = 0$

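Unrestricted Model
$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \beta_1 X_i + \beta_2 (D_{2i} X_i) + U_i$$

SOV | DF | SS
Reg | 3 | 16746734
Error | 16 | 63253266
Total | 19 | 80000000

Restricted Model
$$Y_i = \alpha_1 + \beta_1 X_i + U_i$$

SOV | DF | SS
Reg | 1 | 15806561
Error | 18 | 64193439
Total | 19 | 80000000

$$F = \frac{(ESS_R - ESS_{UR}) / (Edf_R - Edf_{UR})}{ESS_{UR} / Edf_{UR}} = \frac{(64193439 - 63253266) / (18 - 16)}{63253266 / 16} = \frac{470086.5}{3953329.125} \approx 0.12 \ (\text{ns})$$

The regression lines for urban and rural families are therefore not significantly different.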

EXAMPLE: The following data give disposable personal income and personal savings, in billions of dollars, for the United States for the period 1970-1995. Split the data into two time periods, 1970-1981 and 1982-1995, and test for structural change in the regression relation. In 1982 the USA suffered its worst peacetime recession.

Introduce a dummy variable for structural change:

$$Y_i = \alpha_1 + \alpha_2 D_i + \beta_1 X_i + \beta_2 (D_i X_i) + U_i$$

$Y_i$ = saving, $X_i$ = income
$D_i$ = 1 for observations from 1982-1995, = 0 otherwise (i.e. observations from 1970-1981)

Scatter plot


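Unrestricted model
$$Y_i = \alpha_1 + \alpha_2 D_i + \beta_1 X_i + \beta_2 (D_i X_i) + U_i$$
$$\hat{Y}_i = 1.016 + 152.479 D_i + 0.08 X_i - 0.065 (D_i X_i)$$

SOV | DF | SS
Reg | 3 | 88079.834
Error | 22 | 11790.253
Total | 25 | 99870.087

Restricted Model
$$Y_i = \alpha_1 + \beta_1 X_i + U_i$$
$$\hat{Y}_i = 62.423 + 0.038 X_i$$

SOV | DF | SS
Reg | 1 | 76621.788
Error | 24 | 23248.298
Total | 25 | 99870.087

$$F = \frac{(ESS_R - ESS_{UR}) / (Edf_R - Edf_{UR})}{ESS_{UR} / Edf_{UR}} = \frac{(23248.298 - 11790.253) / (24 - 22)}{11790.253 / 22} = \frac{5729.023}{535.92} = 10.69^*$$

The regressions are different for the two time periods.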
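The two period-specific saving functions implied by the unrestricted fit, and the F ratio, can be computed from the reported figures. This is a sketch under one assumption: the four estimated coefficients are taken with signs +, +, +, - (a negative interaction term); the variable names are illustrative.

```python
# Reported estimates; the signs (+, +, +, -) are an assumption of this sketch.
a1, a2, b1, b2 = 1.016, 152.479, 0.08, -0.065

period_1970_81 = (a1, b1)              # D = 0: intercept, slope
period_1982_95 = (a1 + a2, b1 + b2)    # D = 1: intercept, slope

# Error sums of squares and degrees of freedom from the two ANOVA tables.
ess_r, df_r = 23248.298, 24     # restricted: one line for all years
ess_ur, df_ur = 11790.253, 22   # unrestricted: dummy and interaction included

f_stat = ((ess_r - ess_ur) / (df_r - df_ur)) / (ess_ur / df_ur)

print("1970-1981: saving = %.3f + %.3f * income" % period_1970_81)
print("1982-1995: saving = %.3f + %.3f * income" % period_1982_95)
print("F = %.2f" % f_stat)
```

The statistic has (2, 22) degrees of freedom; the 5% critical value is roughly 3.44, so a value near 10.7 points to a structural change between the two periods.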

