
LINEAR REGRESSION ANALYSIS

Do Huu Luat
luat.do@luatdo.com
Outline
• Examples
• The research process
• Linear regression model
• Qualitative regressors
Relationship between variables:
Examples
• [Individual] Does a master's degree improve my
wage?
• [Producer] How much does my output increase if I
hire more labor?
• [Seller] How much would the demand for my
product increase if I reduced the price by $X?
• [Policy maker] Do subsidies (cash transfers) to
the poor reduce their work effort?
• [Farmer and Policy maker] Do lower fertilizer
prices result in higher profit?
Relationship between variables:
Examples
• [Wage and gender] Is there gender
discrimination in wages?
• [Labor supply] Do female workers work harder
than male workers?
• [Labor supply] Do laborers work more when
wages are higher?
Econometric techniques
Multiple regression
Advanced methods to obtain unbiased estimates
Multiple regression

• Multiple regression function:

𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑔𝑒𝑛𝑑𝑒𝑟 + 𝛽2 𝑠𝑘𝑖𝑙𝑙

• 𝑤𝑎𝑔𝑒 is the dependent variable
• 𝑔𝑒𝑛𝑑𝑒𝑟 and 𝑠𝑘𝑖𝑙𝑙 are independent variables
• In this equation, 𝛽1 indicates the gender
difference in wage, holding skill level constant.
• By adding 𝑠𝑘𝑖𝑙𝑙, we control for the skill effect.
• We can control for other effects by adding more
independent variables.
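The "holding skill constant" idea can be sketched with a small simulation. This is a hypothetical illustration in plain Python (not the lecture's data): the true gender gap is set to 2.0 at every skill level, and we recover it by comparing male and female mean wages within each skill level.

```python
import random

# Hypothetical data: wage = 10 + 2.0*gender + 3.0*skill + noise,
# so the true "gender gap holding skill constant" is 2.0.
random.seed(0)
data = []
for _ in range(10000):
    gender = random.randint(0, 1)                 # 1 = male
    skill = random.randint(0, 4)                  # skill level 0..4
    wage = 10 + 2.0 * gender + 3.0 * skill + random.gauss(0, 1)
    data.append((gender, skill, wage))

# Within each skill level, take the male-minus-female difference in means:
diffs = []
for s in range(5):
    males = [w for g, sk, w in data if g == 1 and sk == s]
    females = [w for g, sk, w in data if g == 0 and sk == s]
    diffs.append(sum(males) / len(males) - sum(females) / len(females))

beta1_hat = sum(diffs) / len(diffs)
print(round(beta1_hat, 1))   # close to the true 2.0
```

Multiple regression does this comparison for us in one step, and works even when the control variable is continuous.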
Unbiased estimates

• What other independent variables should we
include? Many.
• Do we miss one or more important variables?
Test.
• Are we including an irrelevant variable? Test.
• Is there any problem that makes our estimates of
𝛽 biased? Test.
• …
Many econometric techniques exist to obtain unbiased
estimates.
The research process
So even when you just want to analyse the relationship between
two variables using econometrics, it is still necessary to include
other relevant independent variables.
But then, what is “relevant”?
Or, what variables should I include in the regression function?
Before answering the question, let’s go through the research
process…
The research process

• Identify the research problem


• Develop the research question
• Survey the literature
• Construct the analytical framework (including
the regression function)
• Collect data
• Analyse data
• Interpret the results
• Write report
The Linear Regression Model (LRM)
❑The general form of the LRM is:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + ⋯ + 𝛽𝑘 𝑋𝑘𝑖 + 𝑒𝑖
❑Or, written in short form:
𝑌𝑖 = 𝜷𝑿𝑖 + 𝑒𝑖
❑𝑌 is the regressand (dependent/explained variable)
❑𝑿 is a vector of regressors
(independent/explanatory variables)
❑𝑒 is the error term (residual).
Population (True) Model
𝑌𝑖 = 𝐵0 + 𝐵1 𝑋1𝑖 + 𝐵2 𝑋2𝑖 + ⋯ + 𝐵𝑘 𝑋𝑘𝑖 + 𝑢𝑖
❑This equation is known as the population or true
model.
❑PRF: Population Regression Function
❑It consists of two components:
❑(1) A deterministic component, 𝑩𝑿 (the conditional
mean of 𝑌, or 𝐸(𝑌|𝑋)).
❑(2) A nonsystematic, or random, component 𝑢𝑖 .
Regression Coefficients
❑𝐵0 is the intercept
❑𝐵1 to 𝐵𝑘 are the slope coefficients
❑Collectively, they are the regression
coefficients or regression parameters
❑Each slope coefficient measures the (partial)
rate of change in the mean value of 𝑌 for a unit
change in the value of a regressor, ceteris paribus
Sample Regression Function: SRF
❑The sample counterpart is:
𝑌𝑖 = 𝑏0 + 𝑏1 𝑋1𝑖 + 𝑏2 𝑋2𝑖 + ⋯ + 𝑏𝑘 𝑋𝑘𝑖 + 𝑒𝑖
❑Or, written in short form:
𝑌𝑖 = 𝒃𝑿𝑖 + 𝑒𝑖
where 𝑒 is a residual.
❑The deterministic component (fitted value) is
written as:
𝑌̂𝑖 = 𝑏0 + 𝑏1 𝑋1𝑖 + 𝑏2 𝑋2𝑖 + ⋯ + 𝑏𝑘 𝑋𝑘𝑖 = 𝒃𝑿𝑖
SRF ⇨ PRF
❑What do you expect from your SRF?
❑Researchers do not have full information
❑A sampling strategy is used to obtain sample data
❑Policy recommendations are for the population
❑How do we link the SRF to the PRF?
❑Testing your research hypothesis
Population Regression Function (PRF)

[Figure: data points (x1, y1) … (x4, y4) scattered around the population
regression line E(y|x) = b0 + b1x; the errors u1 … u4 are the vertical
distances from the points to the line]
Sample Regression Function (SRF)
[Figure: the same data points with the fitted sample line ŷ = b0 + b1x;
the residuals e1 … e4 are the vertical distances from the points to the
fitted line]
Method Of Ordinary Least Squares
❑The method of Ordinary Least Squares (OLS) searches for
the coefficients that minimize the residual sum of squares
(RSS):

𝑅𝑆𝑆 = Σ𝑒𝑖² = Σ(𝑌𝑖 − 𝑏0 − 𝑏1𝑋1𝑖 − 𝑏2𝑋2𝑖 − ⋯ − 𝑏𝑘𝑋𝑘𝑖)²

❑We cannot observe the PRF, but we can estimate the SRF, so
we apply OLS to the sample data to find the SRF.
❑To obtain the values of the regression coefficients, the
derivatives of RSS with respect to the regression coefficients
are taken and set equal to zero.
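Setting those derivatives to zero yields the normal equations (X′X)b = X′Y. A minimal plain-Python sketch (hypothetical data, hand-rolled linear solve, no pivoting) of what OLS computes:

```python
# Minimal OLS sketch: solve the normal equations (X'X) b = X'Y,
# which is what setting the RSS derivatives to zero produces.

def ols(X, Y):
    """X: rows like [1, x1, x2, ...] (leading 1 = intercept); Y: list."""
    k = len(X[0])
    # Build X'X and X'Y
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(k)]
    # Gaussian elimination (no pivoting -- fine for this small example)
    A = [XtX[i] + [XtY[i]] for i in range(k)]
    for i in range(k):
        p = A[i][i]
        A[i] = [a / p for a in A[i]]
        for r in range(k):
            if r != i:
                A[r] = [a - A[r][i] * b for a, b in zip(A[r], A[i])]
    return [A[i][k] for i in range(k)]

# Data on an exact line Y = 1 + 2*X: OLS must recover intercept 1, slope 2
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
Y = [1, 3, 5, 7]
print([round(b, 6) for b in ols(X, Y)])   # [1.0, 2.0]
```

In practice one uses a statistics package (as the Stata runs below do); the point here is only that the estimator is a deterministic function of the data.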
Goodness Of Fit: R2
❑𝑅², the coefficient of determination, is an overall
measure of goodness of fit of the estimated
regression line.
❑It gives the percentage of the total variation in the
dependent variable that is explained by the
regressors.
❑It is a value between 0 (no fit) and 1 (perfect fit).
❑Let: Explained Sum of Squares 𝐸𝑆𝑆 = Σ(𝑌̂ − 𝑌̄)²
Residual Sum of Squares 𝑅𝑆𝑆 = Σ𝑒²
Total Sum of Squares 𝑇𝑆𝑆 = Σ(𝑌 − 𝑌̄)²
❑Then: 𝑅² = 𝐸𝑆𝑆/𝑇𝑆𝑆 = 1 − 𝑅𝑆𝑆/𝑇𝑆𝑆
Goodness Of Fit: R2
Degrees of Freedom: 𝑑𝑓
• 𝑛 is the total number of observations
• 𝑘 is the total number of estimated coefficients
• 𝑑𝑓 for 𝑅𝑆𝑆 = 𝑛 − 𝑘
Goodness Of Fit: R Squared Adjusted
❑A drawback of 𝑅² is that it is a non-decreasing function of 𝑘.
❑Sometimes researchers play the game of "maximizing" 𝑅²
(some believe the higher the 𝑅², the better the model).
❑To avoid this temptation, 𝑅² should take into account the
number of regressors.
❑Such an 𝑅² is called an adjusted 𝑅², denoted 𝑅̄² (R-bar
squared), and is computed from the (unadjusted) 𝑅² as
follows:

𝑅̄² = 1 − (1 − 𝑅²) · (𝑛 − 1)/(𝑛 − 𝑘)
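A small sketch (plain Python, hypothetical fitted values) computing both statistics from the sums of squares defined above:

```python
# Compute R-squared and adjusted R-squared from the TSS/RSS decomposition.

def r_squared(Y, Y_hat, k):
    n = len(Y)
    y_bar = sum(Y) / n
    tss = sum((y - y_bar) ** 2 for y in Y)                    # total SS
    rss = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))       # residual SS
    r2 = 1 - rss / tss
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k)                 # penalize extra regressors
    return r2, r2_adj

Y     = [2.0, 3.0, 5.0, 4.0, 6.0]
Y_hat = [2.2, 3.1, 4.6, 4.3, 5.8]   # hypothetical fitted values, k = 2 coefficients
r2, r2_adj = r_squared(Y, Y_hat, k=2)
print(round(r2, 3), round(r2_adj, 3))   # 0.966 0.955
```

Note the adjusted value is always at most the unadjusted one, and the gap grows with the number of regressors relative to n.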
Example
Data: 9,200 individual laborers from VHLSS 2008
• workday: number of working days in 12 months
• wage: in thousands VND/day
• gender: dummy variable, male = 1
• age: in years
• edu: schooling years (years)
• married: dummy variable, married = 1
Summary statistics
. sum workday wage gender age edu married

    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
     workday |      9258    112.0871    115.7593           0        797
        wage |      9258    92.87657    740.2821   -1599.333      60000
      gender |      9258    .4373515    .4960864           0          1
         age |      9258    33.04774    14.01728          15         65
         edu |      9246    9.024443    2.968456           0         13
     married |      9258    .4089436    .4916654           0          1

. tab gender

     gender |      Freq.     Percent        Cum.
          0 |      5,209       56.26       56.26
          1 |      4,049       43.74      100.00
      Total |      9,258      100.00

. tab married

    married |      Freq.     Percent        Cum.
          0 |      5,472       59.11       59.11
          1 |      3,786       40.89      100.00
      Total |      9,258      100.00
Regression results

. reg workday wage

      Source |       SS           df       MS      Number of obs =   9,263
-------------+----------------------------------   F(1, 9261)    =   43.97
       Model |  587776.652         1  587776.652   Prob > F      =  0.0000
    Residual |   123803535     9,261  13368.2685   R-squared     =  0.0047
-------------+----------------------------------   Adj R-squared =  0.0046
       Total |   124391311     9,262  13430.2863   Root MSE      =  115.62

     workday |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wage |    .010764   .0016233     6.63   0.000     .0075819    .0139461
       _cons |   111.2295   1.210748    91.87   0.000     108.8562    113.6028
Regression results

. reg workday wage gender age edu married

      Source |       SS           df       MS      Number of obs =   9,251
-------------+----------------------------------   F(5, 9245)    =   13.65
       Model |  910367.176         5  182073.435   Prob > F      =  0.0000
    Residual |   123276639     9,245   13334.412   R-squared     =  0.0073
-------------+----------------------------------   Adj R-squared =  0.0068
       Total |   124187006     9,250  13425.6223   Root MSE      =  115.47

     workday |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wage |   .0107087   .0016216     6.60   0.000       .00753    .0138874
      gender |   7.806794   2.439072     3.20   0.001     3.025675    12.58791
         age |    .346997   .1078171     3.22   0.001     .1356517    .5583424
         edu |   .8199364    .416426     1.97   0.049     .0036495    1.636223
     married |   2.914869   3.077622     0.95   0.344    -3.117948    8.947686
       _cons |   87.75976   6.189891    14.18   0.000      75.6262    99.89331

Reading the output: "Model" SS is the ESS, "Residual" SS is the RSS, and
"Total" SS is the TSS; workday is the regressand and the listed variables
are the regressors; each row shows a coefficient's t-statistic and its
P-value, while the F-statistic and its P-value appear at the top right.
Model selection
. estat ic

Akaike's information criterion and Bayesian information criterion

       Model |    Obs    ll(null)   ll(model)     df        AIC        BIC
-------------+--------------------------------------------------------------
           . |  9,263   -57166.77   -57144.84      2   114293.7   114307.9

. estat ic

Akaike's information criterion and Bayesian information criterion

       Model |    Obs    ll(null)   ll(model)     df        AIC        BIC
-------------+--------------------------------------------------------------
           . |  9,251   -57091.11   -57057.08      6   114126.2   114168.9

The model with the smaller AIC and BIC is preferred; here the five-regressor
model beats the wage-only model on both criteria.
Assumptions of the Classical LRM
A1 – Linear in parameters
A2 – Regressors X are fixed (non-stochastic)
A3 – Normal distribution of the error term
A4 – Homoskedasticity of the error term
A5 – No autocorrelation
A6 – Exogeneity of X
A7 – Full rank
A8 – No specification error
Assumptions of Classical LRM
❑A1: The model is linear in the parameters
❑A2: Regressors 𝑋 are fixed or nonstochastic
❑A3: Given 𝑋, the error term has zero mean and is
normally distributed: 𝐸(𝑢𝑖|𝑋) = 0 and 𝑢𝑖 ~ 𝑁(0, 𝜎²)
❑A4: Homoskedastic, or constant, variance of 𝑢𝑖 :
𝑣𝑎𝑟(𝑢𝑖|𝑋) = 𝜎² is a constant
Assumptions of Classical LRM
❑A5: No autocorrelation: 𝑐𝑜𝑣(𝑢𝑖, 𝑢𝑗|𝑋) = 0 for 𝑖 ≠ 𝑗
❑A6: No correlation between 𝑋 and 𝑢: 𝐸(𝑢|𝑋) = 0
❑A7: The number of observations must be greater
than the number of parameters, and there is no
multicollinearity (no perfect linear relationships
among the 𝑋 variables)
❑A8: No specification bias
Variance and Standard errors of OLS estimators

• For the LRM, an estimate of the variance of the
error term 𝑢𝑖 is:

𝜎̂² = Σ𝑒𝑖² / (𝑛 − 𝑘) = 𝑅𝑆𝑆 / (𝑛 − 𝑘)
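A tiny sketch of that estimate, with made-up residuals (n = 6 observations, k = 2 estimated coefficients):

```python
# Estimate the error variance from the residuals: sigma2_hat = RSS / (n - k).
# The residuals below are hypothetical, not from the Stata output.
residuals = [1.0, -2.0, 0.5, -0.5, 1.5, -0.5]
n, k = len(residuals), 2
rss = sum(e ** 2 for e in residuals)
sigma2_hat = rss / (n - k)
print(sigma2_hat)   # RSS = 8.0, df = 4, so 2.0
```

The square root of this quantity is the "Root MSE" reported in the Stata headers above.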
Obtaining the residuals
. predict u, resid
(12 missing values generated)

. sum u, detail

                          Residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%     -122.7432     -445.4255
 5%     -116.9056     -157.4657
10%     -113.8881      -155.805       Obs            9246
25%     -105.6947     -151.2645       Sum of Wgt.    9246

50%     -34.00515                     Mean        7.65e-08
                        Largest       Std. Dev.   115.3109
75%      86.24759      532.8849
90%      180.5074      590.8502       Variance    13296.61
95%      207.5081      628.8676       Skewness     .802321
99%      262.9074      681.4844       Kurtosis    2.903277
Homoskedastic Case

[Figure: conditional densities f(y|x) at x1 and x2 centered on the line
E(y|x) = b0 + b1x, all with the same spread]
Heteroskedastic Case

[Figure: conditional densities f(y|x) at x1, x2, x3 centered on the line
E(y|x) = b0 + b1x, with the spread increasing in x]
Example: working days and wage
Regression without outliers
. reg workday wage gender age edu married if wage<15000

      Source |       SS           df       MS      Number of obs =    9243
-------------+----------------------------------   F(5, 9237)    =   39.01
       Model |   2559495.5         5    511899.1   Prob > F      =  0.0000
    Residual |   121221789      9237  13123.5021   R-squared     =  0.0207
-------------+----------------------------------   Adj R-squared =  0.0201
       Total |   123781284      9242  13393.3439   Root MSE      =  114.56

     workday |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wage |   .0578099   .0044383    13.03   0.000     .0491098      .06651
      gender |   7.651627   2.421049     3.16   0.002     2.905836    12.39742
         age |   .3406722   .1070498     3.18   0.001     .1308309    .5505135
         edu |   .7197082   .4134357     1.74   0.082     -.090717    1.530133
     married |   3.178578   3.056972     1.04   0.298    -2.813763    9.170918
       _cons |   84.89901   6.147383    13.81   0.000     72.84878    96.94924
Before and after removing outliers

. estimate table Model1 Model2

    Variable |     Model1       Model2
-------------+----------------------------
        wage |  .01071412     .0578099
      gender |  8.0911102    7.6516267
         age |  .33527388    .34067219
         edu |  .82827398    .71970818
     married |  2.5600104    3.1785778
       _cons |  87.949064    84.899009
Gauss – Markov Theorem
❑On the basis of assumptions A1 to A8, the
OLS method gives best linear unbiased
estimators (BLUE):
❑(1) Estimators are linear functions of the
dependent variable Y.
❑(2) The estimators are unbiased; in repeated
applications of the method, the estimators
equal their true values on average.
❑(3) In the class of linear estimators, OLS
estimators have minimum variance; i.e., they
are efficient, or the “best” estimators.
Hypothesis testing
Testing individual coefficient: t test
Testing multiple coefficients: F test
Testing Individual Coefficient: t test
❑To test the following hypothesis:
❑𝐻0 : 𝐵𝑘 = 0
❑𝐻1 : 𝐵𝑘 ≠ 0
❑Calculate the following and use the 𝑡 table to obtain
the critical 𝑡 value with 𝑛 − 𝑘 degrees of freedom for
a given level of significance (𝛼, equal to 10%, 5%,
or 1%):

𝑡 = 𝑏𝑘 / 𝑠𝑒(𝑏𝑘)

❑If the absolute value of this statistic is greater than
the critical 𝑡 value, we can reject 𝐻0.
Testing Individual Coefficient: t test
❑Step 1: Form hypotheses
❑𝐻0 : 𝐵𝑘 = 0
❑𝐻1 : 𝐵𝑘 ≠ 0
❑Step 2: Determine the critical value, the region of
rejection, and the region of acceptance:
𝑡*𝛼/2, 𝑛−𝑘
❑Step 3: Calculate the test statistic:
𝑡 = 𝑏𝑚 / 𝑠𝑒(𝑏𝑚)
❑Step 4: Decide
Testing Individual Coefficient: t test
❑If |𝑡| > 𝑡*𝛼/2, 𝑛−𝑘 : Reject 𝐻0 at the significance level 𝛼

❑If 𝑃-value < 𝛼 : Reject 𝐻0 at the significance level 𝛼

[Figure: t distribution with the two tails (area 𝛼/2 each, P-value/2
marked) as regions of rejection and the center as the region of acceptance]
Testing Individual Coefficient: t test

. test wage

 ( 1)  wage = 0

       F(  1,  9237) =  169.65
            Prob > F =  0.0000

. test married

 ( 1)  married = 0

       F(  1,  9237) =    1.08
            Prob > F =  0.2985

Note: with a single restriction, the F(1, n − k) statistic that Stata's
test command reports is simply the square of the corresponding t statistic.
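The arithmetic behind the t test can be sketched directly. A plain-Python illustration using the wage row from the trimmed regression below (coef .0578099, se .0044383); the normal approximation to the t distribution is an assumption that is safe here because n − k is in the thousands:

```python
import math

# t statistic for H0: Bk = 0 is b_k / se(b_k). With large n - k the t
# distribution is close to standard normal, so the two-sided P-value can
# be approximated with the normal CDF via math.erf.
b, se = 0.0578099, 0.0044383        # wage coefficient and SE from the output
t = b / se
p_approx = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
print(round(t, 2))       # 13.03, matching Stata's t column
print(p_approx < 0.01)   # True: reject H0 even at the 1% level
```

For small samples one would use the t distribution's own critical values instead of the normal approximation.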
Testing Individual Coefficient: t test
. reg workday wage gender age edu married if wage<15000

      Source |       SS           df       MS      Number of obs =    9243
-------------+----------------------------------   F(5, 9237)    =   39.01
       Model |   2559495.5         5    511899.1   Prob > F      =  0.0000
    Residual |   121221789      9237  13123.5021   R-squared     =  0.0207
-------------+----------------------------------   Adj R-squared =  0.0201
       Total |   123781284      9242  13393.3439   Root MSE      =  114.56

     workday |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wage |   .0578099   .0044383    13.03   0.000     .0491098      .06651
      gender |   7.651627   2.421049     3.16   0.002     2.905836    12.39742
         age |   .3406722   .1070498     3.18   0.001     .1308309    .5505135
         edu |   .7197082   .4134357     1.74   0.082     -.090717    1.530133
     married |   3.178578   3.056972     1.04   0.298    -2.813763    9.170918
       _cons |   84.89901   6.147383    13.81   0.000     72.84878    96.94924
Testing multiple coefficients: 𝐹 Test
❑Testing the following hypothesis is equivalent to
testing the hypothesis that all the slope coefficients
are 0:
❑𝐻0 : 𝑅2 = 0
❑𝐻1 : 𝑅2 ≠ 0
❑Calculate the following and use the 𝐹 table to obtain
the critical 𝐹 value with 𝑘 − 1 degrees of freedom in
the numerator and 𝑛 − 𝑘 degrees of freedom in the
denominator for a given level of significance:

𝐹 = (𝐸𝑆𝑆/𝑑𝑓) / (𝑅𝑆𝑆/𝑑𝑓) = [𝑅²/(𝑘 − 1)] / [(1 − 𝑅²)/(𝑛 − 𝑘)]

❑If this value is greater than the critical 𝐹 value, reject
𝐻0.
Testing multiple coefficients: 𝐹 Test
❑Step 1: Form hypotheses
❑𝐻0 : 𝛽𝑚+1 = 𝛽𝑚+2 = ⋯ = 𝛽𝑘 = 0
❑𝐻1 : At least one 𝛽 is different from 0
❑Step 2: Calculate the test statistic (𝐹):

𝐹𝑐 = [(𝑅𝑆𝑆𝑅 − 𝑅𝑆𝑆𝑈)/(𝑑𝑓𝑅 − 𝑑𝑓𝑈)] / (𝑅𝑆𝑆𝑈/𝑑𝑓𝑈)

where 𝑑𝑓𝑈 = 𝑛 − 𝑘, 𝑑𝑓𝑅 = 𝑛 − 𝑚, and 𝑑𝑓𝑅 − 𝑑𝑓𝑈 = 𝑘 − 𝑚.
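The statistic is simple to compute once both models are fitted. A sketch with hypothetical numbers (not taken from the Stata output):

```python
# F statistic for dropping k - m regressors, from the restricted and
# unrestricted residual sums of squares. All numbers are hypothetical.
rss_r, rss_u = 1200.0, 1000.0   # restricted / unrestricted RSS
n, k, m = 100, 5, 2             # n obs; k coeffs unrestricted, m restricted
df_u, df_r = n - k, n - m
F_c = ((rss_r - rss_u) / (df_r - df_u)) / (rss_u / df_u)
print(round(F_c, 2))            # (200/3) / (1000/95) = 6.33
```

Intuitively, the numerator asks how much RSS rises per dropped regressor; if the dropped regressors were irrelevant, it should be about the size of the denominator and F_c should be near 1.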
Testing multiple coefficients: 𝐹 Test
❑Step 3: Determine the critical value:
𝐹𝑘−𝑚, 𝑛−𝑘(𝛼)
❑(𝑘 − 𝑚) degrees of freedom for the numerator
❑(𝑛 − 𝑘) degrees of freedom for the denominator
❑Step 4: Decide
❑If 𝐹𝑐 > 𝐹*, or
❑𝑃-value = 𝑃(𝐹 > 𝐹𝑐) < 𝛼
=> Reject 𝐻0 at the significance level 𝛼

[Figure: the F (Fisher) distribution with the upper-tail rejection region]
Testing multiple coefficients: 𝐹 Test

. test gender age edu married

( 1) gender = 0
( 2) age = 0
( 3) edu = 0
( 4) married = 0

F( 4, 9237) = 5.76
Prob > F = 0.0001
F test for overall significance
❑Step 1: Form hypotheses
❑𝐻0 : 𝛽2 = 𝛽3 = ⋯ = 𝛽𝑘 = 0 (all slope coefficients are zero)
❑𝐻1 : At least one 𝛽 is different from 0
❑Step 2: Calculate the test statistic (𝐹):

𝐹𝑐 = (𝐸𝑆𝑆/𝑑𝑓) / (𝑅𝑆𝑆/𝑑𝑓) = [𝑅²/(𝑘 − 1)] / [(1 − 𝑅²)/(𝑛 − 𝑘)]
F test for overall significance
❑Step 3: Determine the critical value:
𝐹𝑘−1, 𝑛−𝑘(𝛼)
❑(𝑘 − 1) degrees of freedom for the numerator
❑(𝑛 − 𝑘) degrees of freedom for the denominator
❑Step 4: Decide
❑If 𝐹𝑐 > 𝐹*, or
❑𝑃-value = 𝑃(𝐹 > 𝐹𝑐) < 𝛼
=> Reject the null 𝐻0 at the significance level 𝛼
F test for overall significance

. test wage gender age edu married

( 1) wage = 0
( 2) gender = 0
( 3) age = 0
( 4) edu = 0
( 5) married = 0

F( 5, 9237) = 39.01
Prob > F = 0.0000
F test for overall significance
. reg workday wage gender age edu married if wage<15000

      Source |       SS           df       MS      Number of obs =    9243
-------------+----------------------------------   F(5, 9237)    =   39.01
       Model |   2559495.5         5    511899.1   Prob > F      =  0.0000
    Residual |   121221789      9237  13123.5021   R-squared     =  0.0207
-------------+----------------------------------   Adj R-squared =  0.0201
       Total |   123781284      9242  13393.3439   Root MSE      =  114.56

     workday |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wage |   .0578099   .0044383    13.03   0.000     .0491098      .06651
      gender |   7.651627   2.421049     3.16   0.002     2.905836    12.39742
         age |   .3406722   .1070498     3.18   0.001     .1308309    .5505135
         edu |   .7197082   .4134357     1.74   0.082     -.090717    1.530133
     married |   3.178578   3.056972     1.04   0.298    -2.813763    9.170918
       _cons |   84.89901   6.147383    13.81   0.000     72.84878    96.94924
Qualitative regressors
Dummy variable as a regressor
Transforming categorical variables into dummies
Dummy regressor and structural change/difference
Example data
A survey of 20,306 individuals in the U.S.
• male 1 = male; 2 = female
• age age (year)
• wage wage ($/hour)
• tenure # years working for current employer
• union 1 = union member, 0 otherwise
• edu years of schooling (years)
• race 1 = white; 2 = black; 3 = others
• married 1 = married or living together with a
partner, 0 otherwise
Data file: wage.dta
Mincer’s wage function
• ... wage is a function of schooling years and
experience:

ln(𝑤𝑎𝑔𝑒) = 𝛽0 + 𝛽1 𝑒𝑑𝑢 + 𝛽2 𝑡𝑒𝑛𝑢𝑟𝑒 + 𝜸𝑿 + 𝜀

where 𝑿 is a set of individual characteristics.
The wage function estimated

. reg lwage edu tenure

      Source |       SS           df       MS      Number of obs =   20306
-------------+----------------------------------   F(2, 20303)   = 2478.09
       Model |  1484.78544         2  742.392718   Prob > F      =  0.0000
    Residual |  6082.42986     20303  .299582813   R-squared     =  0.1962
-------------+----------------------------------   Adj R-squared =  0.1961
       Total |   7567.2153     20305  .372677434   Root MSE      =  .54734

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0917296   .0015444    59.40   0.000     .0887025    .0947567
      tenure |   .0176723   .0004975    35.52   0.000     .0166971    .0186475
       _cons |   1.436601   .0212446    67.62   0.000      1.39496    1.478242
Exercise
• It is hypothesized that wage increases with age,
but beyond some age it starts decreasing. How
would you test this hypothesis?
• Use a quadratic form:
[in Stata]
gen age2 = age * age
The wage function estimated

. reg lwage edu tenure age age2

      Source |       SS           df       MS      Number of obs =   20306
-------------+----------------------------------   F(4, 20301)   = 1432.31
       Model |  1665.53974         4  416.384936   Prob > F      =  0.0000
    Residual |  5901.67555     20301  .290708613   R-squared     =  0.2201
-------------+----------------------------------   Adj R-squared =  0.2199
       Total |   7567.2153     20305  .372677434   Root MSE      =  .53917

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0914099    .001522    60.06   0.000     .0884266    .0943932
      tenure |   .0131021   .0005322    24.62   0.000     .0120589    .0141452
         age |   .0484024   .0030883    15.67   0.000     .0423491    .0544556
        age2 |  -.0005047    .000039   -12.92   0.000    -.0005812   -.0004281
       _cons |   .3992089   .0619266     6.45   0.000     .2778277    .5205901
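The positive age coefficient and negative age2 coefficient confirm the inverted-U hypothesis. The implied turning point follows from setting the derivative of ln(wage) with respect to age to zero:

```python
# d ln(wage)/d age = b_age + 2*b_age2*age = 0 at the peak, so
# age* = -b_age / (2 * b_age2). Using the estimates in the output above:
b_age, b_age2 = 0.0484024, -0.0005047
age_star = -b_age / (2 * b_age2)
print(round(age_star, 1))   # wage peaks at roughly age 48
```

So within this sample (ages 15 to 65), predicted wage rises with age until the late forties and declines afterwards, holding edu and tenure constant.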
Introducing dummy regressors
• Now test whether
• Gender
• Union membership
• Marital status
... affect wage.
Wage function
. reg lwage edu tenure age age2 male married union

      Source |       SS           df       MS      Number of obs =   20306
-------------+----------------------------------   F(7, 20298)   = 1043.08
       Model |  2001.93042         7   285.99006   Prob > F      =  0.0000
    Residual |  5565.28488     20298  .274178977   R-squared     =  0.2646
-------------+----------------------------------   Adj R-squared =  0.2643
       Total |   7567.2153     20305  .372677434   Root MSE      =  .52362

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0918647   .0014804    62.05   0.000      .088963    .0947665
      tenure |   .0105571   .0005355    19.72   0.000     .0095075    .0116067
         age |   .0440438   .0030179    14.59   0.000     .0381284    .0499592
        age2 |  -.0004571   .0000381   -12.00   0.000    -.0005318   -.0003825
        male |   .2416219   .0074448    32.46   0.000     .2270295    .2562144
     married |   .1015896   .0080023    12.70   0.000     .0859045    .1172748
       union |    .094119     .01064     8.85   0.000     .0732638    .1149742
       _cons |   .2894436   .0603995     4.79   0.000     .1710557    .4078316

Each dummy coefficient 𝛽 indicates the change in ln(wage) when the dummy
regressor changes from 0 to 1.
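Because the regressand is in logs, the exact percentage effect of a dummy switching from 0 to 1 is exp(β) − 1 rather than β itself; β is only a good approximation when it is small. For the male coefficient above:

```python
import math

# Exact percentage effect of a dummy in a log-wage regression:
# 100 * (exp(b) - 1). The coefficient is the male estimate above.
b_male = 0.2416219
pct = (math.exp(b_male) - 1) * 100
print(round(pct, 1))   # about 27.3 percent, versus the naive reading of 24.2
```

For the smaller dummies (married, union) the two readings are nearly identical.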


Introducing categorical variable
• Recall the variable race, which is a categorical
variable:
• race = 1 if white, = 2 if black, = 3 if others.
• How do we introduce this variable into the wage
function?
• We can't introduce it directly into the regression
function.
• Instead, we have to create a set of corresponding
dummy variables.
Transforming categorical variables to
dummy variables
• In Stata (race = 1 if white, = 2 if black, = 3 if others):

gen white = 0
replace white = 1 if race == 1
gen black = 0
replace black = 1 if race == 2
• Then introduce white and black to the regression
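The same transformation sketched in Python, on a hypothetical sample: expand the three-category code into two dummies, leaving "others" as the omitted base category (a three-category variable needs only two dummies, or the regressors would be perfectly collinear with the intercept).

```python
# Map the race code (1 = white, 2 = black, 3 = others) into two dummies;
# "others" is the omitted base category. The sample values are made up.
race = [1, 2, 3, 1, 2]
white = [1 if r == 1 else 0 for r in race]
black = [1 if r == 2 else 0 for r in race]
print(white)   # [1, 0, 0, 1, 0]
print(black)   # [0, 1, 0, 0, 1]
```

Each dummy coefficient is then interpreted relative to the omitted "others" group.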
The wage function again
. reg lwage edu tenure age age2 male married union white black

      Source |       SS           df       MS      Number of obs =   20306
-------------+----------------------------------   F(9, 20296)   =  850.55
       Model |  2072.44796         9  230.271995   Prob > F      =  0.0000
    Residual |  5494.76734     20296   .27073154   R-squared     =  0.2739
-------------+----------------------------------   Adj R-squared =  0.2735
       Total |   7567.2153     20305  .372677434   Root MSE      =  .52032

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0877139   .0015252    57.51   0.000     .0847243    .0907035
      tenure |   .0108456   .0005335    20.33   0.000        .0098    .0118912
         age |   .0492297   .0030161    16.32   0.000     .0433178    .0551415
        age2 |  -.0005244   .0000381   -13.77   0.000     -.000599   -.0004498
        male |   .2313833   .0074354    31.12   0.000     .2168094    .2459573
     married |   .0754955    .008141     9.27   0.000     .0595384    .0914525
       union |   .1009366   .0105833     9.54   0.000     .0801925    .1216807
       white |   .0708694   .0133984     5.29   0.000     .0446075    .0971314
       black |  -.0758799   .0147049    -5.16   0.000    -.1047027   -.0470571
       _cons |    .241169   .0606826     3.97   0.000     .1222261    .3601119
Gender difference in return to
education
• If we want to test whether there is a difference in
the return to education between males and females:
gen edu_male = edu * male
Testing gender difference in return to
education
. reg lwage edu tenure age age2 male edu_male married union white black

      Source |       SS           df       MS      Number of obs =   20306
-------------+----------------------------------   F(10, 20295)  =  765.46
       Model |  2072.44883        10  207.244883   Prob > F      =  0.0000
    Residual |  5494.76646     20295  .270744837   R-squared     =  0.2739
-------------+----------------------------------   Adj R-squared =  0.2735
       Total |   7567.2153     20305  .372677434   Root MSE      =  .52033

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0876133   .0023348    37.52   0.000     .0830369    .0921897
      tenure |    .010846   .0005335    20.33   0.000     .0098003    .0118918
         age |   .0492296   .0030162    16.32   0.000     .0433176    .0551415
        age2 |  -.0005244   .0000381   -13.77   0.000     -.000599   -.0004498
        male |   .2290828   .0411142     5.57   0.000     .1484957      .30967
    edu_male |   .0001706   .0029992     0.06   0.955    -.0057081    .0060493
     married |   .0754973   .0081413     9.27   0.000     .0595397    .0914549
       union |   .1009726   .0106024     9.52   0.000      .080191    .1217541
       white |   .0708411    .013408     5.28   0.000     .0445604    .0971218
       black |   -.075916   .0147189    -5.16   0.000    -.1047662   -.0470657
       _cons |   .2425737   .0655144     3.70   0.000     .1141602    .3709872
Race difference in return to
education
• Now test whether there is a difference in the return
to education between racial groups (white, black,
and others):
gen edu_white = edu*white
gen edu_black = edu*black
reg lwage edu tenure age age2 male married union
white edu_white black edu_black
Racial difference in return to
education
. reg lwage edu tenure age age2 male married union white edu_white black edu_black

      Source |       SS           df       MS      Number of obs =   20306
-------------+----------------------------------   F(11, 20294)  =  701.65
       Model |  2084.98023        11  189.543658   Prob > F      =  0.0000
    Residual |  5482.23506     20294  .270140685   R-squared     =  0.2755
-------------+----------------------------------   Adj R-squared =  0.2751
       Total |   7567.2153     20305  .372677434   Root MSE      =  .51975

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |   .0739171   .0029136    25.37   0.000     .0682061    .0796281
      tenure |    .010904   .0005331    20.46   0.000     .0098592    .0119488
         age |   .0493213   .0030129    16.37   0.000     .0434158    .0552267
        age2 |   -.000528    .000038   -13.88   0.000    -.0006026   -.0004535
        male |   .2303741   .0074287    31.01   0.000     .2158132     .244935
     married |   .0744159   .0081415     9.14   0.000     .0584578     .090374
       union |   .1013344   .0105719     9.59   0.000     .0806125    .1220562
       white |  -.2160375   .0461188    -4.68   0.000    -.3064341    -.125641
   edu_white |   .0229096   .0035645     6.43   0.000     .0159229    .0298963
       black |  -.1386274   .0607147    -2.28   0.022    -.2576331   -.0196218
   edu_black |   .0062051   .0047123     1.32   0.188    -.0030315    .0154416
       _cons |   .4052171   .0674429     6.01   0.000     .2730235    .5374107
Let’s enjoy
If you have any questions, please contact me at
luat.do@luatdo.com
