
MULTIPLE REGRESSION PART 3

Topics Outline
Dummy Variables
Interaction Terms
Nonlinear Transformations
Quadratic Transformations
Logarithmic Transformations
Dummy Variables
Thus far, the examples we have considered involved quantitative explanatory variables such as
machine hours, production runs, price, expenditures. In many situations, however, we must work
with categorical explanatory variables such as gender (male, female), method of payment
(cash, credit card, check), and so on. The way to include a categorical variable in the regression
model is to represent it by a dummy variable.
A dummy variable (also called indicator or 0–1 variable) is a variable with possible values 0 and 1.
It equals 1 if a given observation is in a particular category and 0 if it is not.
If a given categorical explanatory variable has only two categories, then you can define one
dummy variable xd to represent the two categories as
xd = 1 if the observation is in category 1, and 0 otherwise.
Example 1
Data collected from a sample of 15 houses are stored in Houses.xlsx.
House   Value ($ thousands)   Size (thousands of square feet)   Presence of Fireplace
1       234.4                 2.00                              Yes
2       227.4                 1.71                              No
...     ...                   ...                               ...
14      233.8                 1.89                              Yes
15      226.8                 1.59                              No

(a) Develop a regression model for predicting the assessed value y of houses, based on the size x1
of the house and whether the house has a fireplace.
To include the categorical variable for the presence of a fireplace, the dummy variable is
defined as
x2 = 1 if the house has a fireplace, and 0 if it does not.

To code this dummy variable in Excel, enter


=IF(C2="Yes",1,0)
in cell D2 and drag it down. The data become:
House   Value   Size   Fireplace
1       234.4   2.00   1
2       227.4   1.71   0
...     ...     ...    ...
14      233.8   1.89   1
15      226.8   1.59   0
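Outside Excel, the same 0/1 coding can be sketched in a few lines of Python (the Fireplace values below are a short illustrative sample, not the full column):

```python
# Convert a Yes/No categorical column into a 0/1 dummy variable,
# mirroring the Excel formula =IF(C2="Yes",1,0).
fireplace = ["Yes", "No", "Yes", "No"]   # illustrative sample of the column
dummy = [1 if v == "Yes" else 0 for v in fireplace]
print(dummy)  # -> [1, 0, 1, 0]
```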

Assuming that the slope of assessed value with the size of the house is the same for houses
that have and do not have a fireplace, the multiple regression model is

y = α + β1x1 + β2x2 + ε
Here are the regression results for this model.
Regression Statistics
Multiple R          0.9006
R Square            0.8111
Adjusted R Square   0.7796
Standard Error      2.2626
Observations        15

ANOVA
             df    SS         MS         F         Significance F
Regression    2    263.7039   131.8520   25.7557   0.0000
Residual     12     61.4321     5.1193
Total        14    325.1360

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   200.0905       4.3517           45.9803   0.0000    190.6090    209.5719
Size         16.1858       2.5744            6.2871   0.0000     10.5766     21.7951
Fireplace     3.8530       1.2412            3.1042   0.0091      1.1486      6.5574

(b) Interpret the regression coefficients.
The regression equation is
y = 200.0905 + 16.1858x1 + 3.8530x2
(y = a + b1x1 + b2x2)

For houses with a fireplace, you substitute x2 = 1 into the regression equation:
y = 200.0905 + 16.1858x1 + 3.8530(1)
y = 203.9435 + 16.1858x1        (1)
(y = (a + b2) + b1x1)
For houses without a fireplace, you substitute x2 = 0 into the regression equation:
y = 200.0905 + 16.1858x1 + 3.8530(0)
y = 200.0905 + 16.1858x1        (2)
(y = a + b1x1)
Interpretation of a
The expected value of a house with 0 square feet and no fireplace is $200,091, which obviously does not make sense in this context.
Interpretation of b1
The effect of x1 on y is the same for houses with or without a fireplace. When x1 increases
by one unit, y is expected to change by b1 units for houses with or without a fireplace.
Thus, holding constant whether a house has a fireplace, for each increase of 1 thousand
square feet in the size of the house, the predicted assessed value is estimated to increase by
16.1858 thousand dollars (i.e., $16,185.80).
Interpretation of b2
The slope of equations (1) and (2) is the same ( b1 = 16.1858 ), but the intercepts differ by an
amount b2 = 3.8530 . Geometrically, the two equations correspond to two parallel lines that
are a vertical distance b2 = 3.8530 apart. Therefore, the interpretation of b2 is that it
indicates the difference between the two intercepts 203.9435 and 200.0905.

Thus, holding constant the size of the house, the presence of a fireplace is estimated to
increase the predicted assessed value of the house by 3.8530 thousand dollars (i.e. $3,853).

(c) Does the regression equation provide a good fit for the observed data?
The test statistic for the slope of the size of the house with assessed value is 6.2871,
and the P-value is approximately zero.
The test statistic for presence of a fireplace is 3.1042, and the P-value is 0.0091.
Thus, each of the two variables makes a significant contribution to the model.
In addition, the coefficient of determination indicates that 81.11% of the variation in assessed
value is explained by variation in the size of the house and whether the house has a fireplace.

When a categorical variable has two categories (fireplace, no fireplace), one dummy variable is used.
When a categorical variable has m categories, m − 1 dummy variables are required, with each
dummy variable coded as 0 or 1.
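A minimal Python sketch of this m − 1 coding (the category names and helper function are illustrative, not from the file):

```python
# Build m-1 dummy variables for an m-category variable.
# The last category is the reference level and receives all zeros.
def make_dummies(values, categories):
    return [[1 if v == c else 0 for c in categories[:-1]] for v in values]

designs = ["A", "B", "C"]                 # m = 3 categories
rows = make_dummies(["A", "B", "C"], designs)
print(rows)  # -> [[1, 0], [0, 1], [0, 0]]
```

Each observation gets a row of m − 1 indicators; the all-zero row identifies the reference category.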
Example 2
Define a multiple regression model using sales (y) as the response variable and price ( x1 ) and
package design as explanatory variables. Package design is a three-level categorical variable
with designs A, B, or C.
Solution:
To model the m = 3-level categorical variable package design, m − 1 = 3 − 1 = 2 dummy
variables are needed:
x2 = 1 if package design A is used, and 0 otherwise
x3 = 1 if package design B is used, and 0 otherwise
Therefore, the regression model is
y = α + β1x1 + β2x2 + β3x3 + ε
Here the package design is coded as:
Package design   A (x2)   B (x3)
A                1        0
B                0        1
C                0        0


Interaction Terms
In the regression models discussed so far, the effect an explanatory variable has on the response
variable has been assumed to be independent of the other explanatory variables in the model.
An interaction occurs if the effect of an explanatory variable on the response variable changes
according to the value of a second explanatory variable.
For example, it is possible for advertising to have a large effect on the sales of a product when
the price of a product is low. However, if the price of the product is too high, increases in
advertising will not dramatically change sales. In other words, you cannot make general
statements about the effect of advertising on sales. The effect that advertising has on sales is
dependent on the price. Therefore, price and advertising are said to interact.
When interaction between two variables is present, we cannot study the effect of one variable on
the response y independently of the other variable. Meaningful conclusions can be developed
only if we consider the joint effect that both variables have on the response.
To account for the effect of two explanatory variables xi and xj acting together, an interaction
term (sometimes referred to as a cross-product term) xi xj is added to the model.
Example 1 (Continued)
(d) Formulate a regression model to evaluate whether an interaction exists.
In the regression model, we assumed that the effect the size of the home has on the assessed
value is independent of whether the house has a fireplace. In other words, we assumed that the
slope of assessed value with size is the same for houses with fireplaces as it is for houses
without fireplaces. If these two slopes are different, an interaction exists between the size of
the home and the fireplace.
To evaluate whether an interaction exists, the following model is considered:
y = α + β1x1 + β2x2 + β3x1x2 + ε
where x1x2 is the interaction term. With this new variable x3 = x1x2 in the model, the value
of x2 changes how x1 affects y.
If we factor out x1 we get
y = α + (β1 + β3x2)x1 + β2x2 + ε
Thus, each value of x2 yields a different slope in the relationship between y and x1.
In other words, the parameter β3 of the interaction term gives an adjustment to the
slope of x1 for the possible values of x2.

(e) Interpret the estimated regression equation.


The data for the model with an interaction term are:
House   Value   Size   Fireplace   Size×Fireplace
1       234.4   2.00   1           2.00
2       227.4   1.71   0           0.00
...     ...     ...    ...         ...
14      233.8   1.89   1           1.89
15      226.8   1.59   0           0.00

The regression output for this model is:


Regression Statistics
Multiple R          0.9179
R Square            0.8426
Adjusted R Square   0.7996
Standard Error      2.1573
Observations        15

ANOVA
             df    SS         MS        F         Significance F
Regression    3    273.9441   91.3147   19.6215   0.0001
Residual     11     51.1919    4.6538
Total        14    325.1360

                 Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        212.9522        9.6122          22.1544   0.0000    191.7959    234.1084
Size               8.3624        5.8173           1.4375   0.1784     -4.4414     21.1662
Fireplace        -11.8404       10.6455          -1.1122   0.2898    -35.2710     11.5902
Size*Fireplace     9.5180        6.4165           1.4834   0.1661     -4.6046     23.6406

The estimated regression equation is
y = a + b1x1 + b2x2 + b3x1x2
y = 212.9522 + 8.3624x1 − 11.8404x2 + 9.5180x1x2


To see the interaction effect, we have to evaluate this equation for the possible values of x 2 .


For houses with a fireplace, x2 = 1 and the regression equation is:
y = 212.9522 + 8.3624x1 − 11.8404(1) + 9.5180x1(1)
y = 201.1118 + 17.8804x1        (3)
(y = (a + b2) + (b1 + b3)x1)
For houses without a fireplace, x2 = 0 and the regression equation is:
y = 212.9522 + 8.3624x1 − 11.8404(0) + 9.5180x1(0)
y = 212.9522 + 8.3624x1        (4)
(y = a + b1x1)
Interpretation of b2
The coefficient of the indicator variable, b2 = −11.8404, provides a different intercept to
separate the houses with and without a fireplace at the origin (where Size = 0 sq ft).
Here it does not make much sense: literally, it says that for houses with a size of 0 sq ft, the
value of a house without a fireplace is about $11,840 higher than the value of a house with a
fireplace.
Interpretation of b3
The coefficient of the interaction term b3 = 9.5180 says that the slope relating the size of the house
to its value is steeper by $9,518 for houses with a fireplace than for houses without a fireplace.

The two lines, (3) and (4), meet at (size, value) = (1.2440, 223.3550). Thus, the value of a
house with a size greater than 1,244 sq ft is higher when the house has a fireplace.
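As a quick numeric check, the intersection point quoted above can be recomputed from the fitted intercepts and slopes of equations (3) and (4):

```python
# Lines (3) and (4): value = 201.1118 + 17.8804*size   (with fireplace)
#                    value = 212.9522 +  8.3624*size   (without fireplace)
a1, b1 = 201.1118, 17.8804   # with fireplace
a2, b2 = 212.9522, 8.3624    # without fireplace

# Solve a1 + b1*s = a2 + b2*s for the size where the lines meet.
size = (a2 - a1) / (b1 - b2)
value = a1 + b1 * size
print(round(size, 4), round(value, 3))  # -> 1.244 223.355
```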

(f) Does the interaction term make a significant contribution to the regression model?
To test for the existence of an interaction, the null and alternative hypotheses are:
H0: β3 = 0
Ha: β3 ≠ 0
The test statistic for the interaction of size and fireplace is 1.4834 with a P-value = 0.1661.
Because the P-value is large, you do not reject the null hypothesis.
Thus, although the slope adjuster b3 = 9.5180 suggests that the value gap between houses with
and without a fireplace widens with house size, this effect is not statistically significant.
In other words, the interaction term does not make a significant contribution to the model,
given that size and presence of a fireplace are already included. Therefore, you can conclude
that the slope of assessed value with size is the same for houses with fireplaces and without
fireplaces.
Note:
If the correlation between interaction terms and the original variables in the regression is high,
collinearity problems can result. In a regression with several variables, the number of interaction
variables that could be created is very large and the likelihood of collinearity problems is high.
Therefore, it is wise not to use interaction variables indiscriminately. There should be some good
reason to suspect that two variables might be related or some specific question that can be
answered by an interaction variable before this type of variable is used.


Nonlinear Transformations
The general linear model has the form
y = α + β1x1 + β2x2 + ⋯ + βkxk + ε
It is linear in the sense that the right side of the equation is a constant plus a sum of products of
constants and variables. However, there is no requirement that the response variable y or the
explanatory variables x1 through x k be the original variables in the data set. Most often they are,
but they can also be transformations of original variables. You can transform the response
variable y or any of the explanatory variables, the xs. You can also do both.
The purpose of nonlinear transformations is usually to straighten out the points in a
scatterplot in order to overcome violations of the assumptions of regression or to make the form
of a model linear. They can also arise because of economic considerations. Among the many
transformations available are the square root, the reciprocal, the square, and transformations
involving the common logarithm (base 10) and the natural logarithm (base e).
The type of transformation to correct for curvilinearity is not always obvious. Different
transformations may be tried and the one that appears to do the best job chosen.
There may be theoretical results as well to support the use of certain transformations in certain cases.
As always, subject matter expertise is important in any analysis. If several different transformations
straighten out the data equally well, the one that is easiest to interpret is preferred.
The most frequently used nonlinear transformations in business and economic applications are
the quadratic and logarithmic transformations.
Quadratic Transformations
One of the most common nonlinear relationships between the response variable y and an explanatory
variable x is a curvilinear relationship in which y increases (or decreases) at a changing rate for
various values of x. The quadratic regression model defined below can be used to analyze this type
of relationship between x and y.
y = α + β1x1 + β2x1² + ε
This model is similar to the multiple regression model except that the second explanatory
variable is the square of the first explanatory variable. Once again, the least squares method can
be used to compute sample regression coefficients a, b1 , and b2 as estimates of the population
parameters α, β1, and β2. The estimated regression equation for the quadratic model is
y = a + b1x1 + b2x1²
In this equation, the first regression coefficient a represents the y intercept; the second
regression coefficient b1 represents the linear effect; and the third regression coefficient b2
represents the quadratic effect.
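Because x and x² enter the model linearly, a quadratic fit is still an ordinary least-squares problem. A small Python sketch with synthetic (made-up) data, not the fly ash data:

```python
import numpy as np

# Fit y = a + b1*x + b2*x^2 by least squares.
# The data are generated from a known quadratic (no noise),
# so the recovered coefficients should match it closely.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x - 0.5 * x**2

X = np.column_stack([np.ones_like(x), x, x**2])   # design matrix [1, x, x^2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(round(a, 3), round(b1, 3), round(b2, 3))  # -> 2.0 3.0 -0.5
```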


Example 3
Fly Ash
Fly ash is an inexpensive industrial waste by-product that can be used as a substitute for Portland cement,
a more expensive ingredient of concrete. How does adding fly ash affect the strength of concrete?
Batches of concrete were prepared in which the percentage of fly ash ranged from 0% to 60%.
Data were collected from a sample of 18 batches and stored in FlyAsh.xlsx.
Batch   Strength (psi)   Fly Ash %
1       4779             0
2       4706             0
...     ...              ...
17      5030             60
18      4648             60

(a) A linear model has been fit to these data. Below is the regression output.
What do these results show?
Regression Statistics
Multiple R          0.4275
R Square            0.1827
Adjusted R Square   0.1317
Standard Error      460.7787
Observations        18

ANOVA
             df    SS             MS            F        Significance F
Regression    1      759618.0571    759618.0571   3.5778   0.0768
Residual     16    3397072.4429    212317.0277
Total        17    4156690.5000

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   4924.5952      213.2991         23.0877   0.0000    4472.4213   5376.7691
Fly Ash%      10.4171        5.5074          1.8915   0.0768      -1.2579     22.0922

[Figure: scatterplot of Strength (psi) versus Fly Ash %, and residual plot of Residuals versus Predicted Strength]

The t test indicates that the linear term is significant at the 0.10 (but not at the 0.05) level of
significance (P-value = 0.0768). The extremely low coefficient of determination
( r 2 = 0.1827) shows that the linear model explains only about 18% of the variation in strength.
Moreover, the scatterplot of the data and the plot of residuals versus fitted values indicate that
a linear model is not appropriate for these data. For example, the scatterplot of Strength versus
Fly Ash % indicates an initial increase in the strength of the concrete as the percentage of fly
ash increases. The strength appears to level off and then drop after achieving maximum
strength at about 40% fly ash. Strength for 50% fly ash is slightly below strength at 40%,
but strength at 60% is substantially below strength at 50%.
Therefore, to estimate strength based on fly ash percentage, a quadratic model seems more
appropriate for these data, not a linear model.
(b) The data for the quadratic model are:
Batch   Strength (psi)   Fly Ash %   (Fly Ash %)²
1       4779             0           0
2       4706             0           0
...     ...              ...         ...
17      5030             60          3600
18      4648             60          3600

Below are the regression results for the quadratic model.


What does the residual plot show? What is the estimated regression equation?
Regression Statistics
Multiple R          0.8053
R Square            0.6485
Adjusted R Square   0.6016
Standard Error      312.1129
Observations        18

[Figure: residual plot of Residuals versus Predicted Strength]

ANOVA
             df    SS             MS             F         Significance F
Regression    2    2695473.4897   1347736.7448   13.8351   0.0004
Residual     15    1461217.0103     97414.4674
Total        17    4156690.5000

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    4486.3611      174.7531         25.6726   0.0000    4113.8836   4858.8386
Fly Ash%       63.0052       12.3725          5.0923   0.0001      36.6338     89.3767
Fly Ash%^2     -0.8765        0.1966         -4.4578   0.0005      -1.2955     -0.4574

The curved pattern in the residual plot is gone. The points in the residual plot show a random
scatter with an approximately equal spread above and below the horizontal 0 line.
From the regression output:
a = 4,486.3611    b1 = 63.0052    b2 = −0.8765
Therefore, the quadratic regression equation is
y = 4,486.3611 + 63.0052x1 − 0.8765x1²
Predicted Strength = 4,486.3611 + 63.0052(Fly Ash%) − 0.8765(Fly Ash%)²
The following figure is a scatterplot that shows the fit of the quadratic regression curve to the
original data. (In Excel, use the option Polynomial to add the trendline.)
[Figure: Scatterplot of Fly Ash % versus Strength (psi) with the fitted quadratic trendline y = −0.8765x² + 63.005x + 4486.4]

(c) Interpret the regression coefficients.


The y intercept 4,486.3611 is the predicted strength when the percentage of fly ash is 0.
To interpret the coefficients b1 = 63.0052 and b2 = −0.8765, observe that after an initial
increase, strength decreases as fly ash percentage increases. This nonlinear relationship is
further demonstrated by predicting the strength for fly ash percentages of 20, 40, and 60.
Using the quadratic regression equation,
Predicted Strength = 4,486.3611 + 63.0052(Fly Ash%) − 0.8765(Fly Ash%)²
For Fly Ash% = 20, Predicted Strength = 4,486.3611 + 63.0052(20) − 0.8765(20)² = 5,395.865
For Fly Ash% = 40, Predicted Strength = 4,486.3611 + 63.0052(40) − 0.8765(40)² = 5,604.169
For Fly Ash% = 60, Predicted Strength = 4,486.3611 + 63.0052(60) − 0.8765(60)² = 5,111.273
Thus, the predicted concrete strength for 40% fly ash is 208.304 psi above the predicted
strength for 20% fly ash, but the predicted strength for 60% fly ash is 492.896 psi below the
predicted strength for 40% fly ash.
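These three predictions can be reproduced directly from the fitted coefficients; a small Python check:

```python
# Evaluate Predicted Strength = 4486.3611 + 63.0052*p - 0.8765*p^2
# at the fly ash percentages used in the text.
def predicted_strength(p):
    return 4486.3611 + 63.0052 * p - 0.8765 * p**2

for p in (20, 40, 60):
    print(p, round(predicted_strength(p), 3))
# -> 20 5395.865
#    40 5604.169
#    60 5111.273
```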


(d) Test the significance of the quadratic model.


The null and alternative hypotheses for testing whether there is a significant overall
relationship between strength y and fly ash percentage x1 are as follows:
H0: β1 = β2 = 0   (There is no overall relationship between x1 and y.)
Ha: β1 and/or β2 ≠ 0   (There is an overall relationship between x1 and y.)


The overall F test statistic used for this test is F = 13.8351. The corresponding P-value is 0.0004.
Because of the small P-value, you reject the null hypothesis and conclude that there is a
significant overall relationship between strength and fly ash percentage.
(e) Test the quadratic effect.
To test the significance of the contribution of the quadratic term, you use the following null
and alternative hypotheses:
H0: β2 = 0 (Including the quadratic term does not significantly improve the model.)
Ha: β2 ≠ 0 (Including the quadratic term significantly improves the model.)
The test statistic and the corresponding P-value are t = −4.4578 and P-value = 0.0005.
You reject H 0 and conclude that the quadratic term is statistically significant and should be
kept in the model.
(f) How good is the quadratic model? Is it better than the linear model?

Model                       r²     se
y = α + β1x1 + ε            18%    461
y = α + β1x1 + β2x1² + ε    65%    312

The coefficient of determination r 2 = 0.6485 shows that about 65% of the variation in strength is
explained by the quadratic relationship between strength and the percentage of fly ash.
The percentage variation explained by the linear model is much smaller: about 18%.
Another indicator that the regression has been improved by adding the quadratic term is the reduction
in the standard error s e from about 461 in the linear model to about 312 in the quadratic model.
Thus, based on our findings in (a) through (f), we can conclude that the quadratic model is
significantly better than the linear model for representing the relationship between strength
and fly ash percentage.
Note:
Although this was not the case in this example, it can happen that in a quadratic model the
quadratic term is significant and the linear term is not. In such situations (for statistical reasons
not discussed here), the general rule is to keep the linear term despite its insignificance.

Logarithmic Transformations
If scatterplots suggest nonlinear relationships, there are many nonlinear transformations of y
and/or the xs that could be tried in a regression analysis. The reason that logarithmic
transformations are arguably the most frequently used nonlinear transformations, besides the fact
that they often produce good fits, is that they can be interpreted naturally in terms of percentage
changes. In real studies, this interpretability is an important advantage over other potential
nonlinear transformations.
The log transformations put values on a different scale that compresses large distances so that
they are more comparable to smaller distances.
It is common in business and economic applications to use natural (base e) logarithms,
although the base used is usually not important.
Interpretation of a slope coefficient b when log is used
Case 1: Predicted y = a + ⋯ + b log x + ⋯   (x is log-transformed, y is not log-transformed)
The expected change in y (increase or decrease depending on the sign of b) when x increases by
1% is approximately 0.01b.
Example: Predicted y = 5.67 + 0.34 log x
This regression equation implies that every 1% increase in x (for example, from 200 to 202) is
accompanied by about a (0.01)(0.34) = 0.0034 increase in y.
Case 2: Predicted log y = a + ⋯ + bx + ⋯   (x is not log-transformed, y is log-transformed)
Whenever x increases by 1 unit, the expected value of y changes (increases or decreases
depending on the sign of b) by a constant percentage, approximately equal to b written as a
percentage (that is, 100b%).
Example: Predicted log y = 5.67 + 0.34x
b = 0.34, and written as a percentage it is (100)(0.34) = 34%.
When x increases by 1 unit, the expected value of y increases by approximately 34%.
Case 3: Predicted log y = a + ⋯ + b log x + ⋯   (both x and y are log-transformed)
The expected change in y (increase or decrease depending on the sign of b) when x increases by
1% is approximately b%.
Example: Predicted log y = 5.67 + 0.34 log x
For every 1% increase in x, y is expected to increase by approximately 0.34%.
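These percentage-change rules are first-order approximations. A small Python check of the exact effects, reusing b = 0.34 from the examples above (note that the Case 2 approximation becomes rough when b is this large):

```python
import math

b = 0.34

# Case 1: y = a + b*ln(x). Exact change in y when x rises 1%:
dy = b * math.log(1.01)               # = b*ln(1.01), close to 0.01*b
print(round(dy, 4))                   # -> 0.0034

# Case 2: ln(y) = a + b*x. Exact percent change in y when x rises by 1:
pct2 = (math.exp(b) - 1) * 100        # approx 100*b %; rougher for large b
print(round(pct2, 1))                 # -> 40.5

# Case 3: ln(y) = a + b*ln(x). Exact percent change in y when x rises 1%:
pct3 = (1.01 ** b - 1) * 100          # approx b %
print(round(pct3, 4))                 # -> 0.3389
```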


Example 4
Fuel Consumption
The file Fuel_Consumption.xlsx contains data on the fuel consumption in gallons per capita for
each of the 50 states and Washington, DC. Here is part of the data.
State             FuelCon   Population   Area     Density
Alabama           547.92    4486508      50750    88.4041
Alaska            440.38    643786       570374   1.1287
...               ...       ...          ...      ...
Wyoming           715.55    498703       97105    5.1357
Washington D.C.   289.99    570898       61       9358.9836

The goal is to develop a regression equation to predict fuel consumption based on the population
density (defined as population/area).
The scatterplot of FuelCon versus Density is shown below.
[Figure: Scatterplot of FuelCon versus Density]
Looking at the scatterplot, it is clear that this is not a linear relationship.


One thing to note about this plot is how the values spread out on the x axis. At the left-hand side
of the x axis, the values are clumped together. Moving from left to right, the values become
progressively more spread out. This suggests the use of a log transformation of Density.
The log transformation evens out the successively larger distances between the values.
The scatterplot of FuelCon versus the natural logarithm of Density (LogDensity) is shown below.

[Figure: Scatterplot of FuelCon versus LogDensity]
The relationship appears to be linear.


On the next page are the regression results using Density and using LogDensity as an
explanatory variable. The following table provides summary statistics for the two models.

Model                  r²      se
y = α + βx + ε         20.6%   65.17
y = α + β log x + ε    27.8%   62.16

The regression results indicate that using LogDensity as the explanatory variable produces a
better model fit than the regression using Density.
The estimated regression equation for the logarithmic model is
Predicted FuelCon = 597.1867 − 24.5308 LogDensity
This equation shows that if the population density increases by 1%, the predicted fuel
consumption will decrease by approximately (0.01)(24.5308) = 0.2453 gallons per capita.
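As a sketch, the exact change implied by the logarithmic model can be compared with this 0.01b approximation:

```python
import math

b = -24.5308  # fitted LogDensity coefficient

# Exact change in predicted FuelCon when Density rises 1%:
# b*ln(1.01*d) - b*ln(d) = b*ln(1.01), independent of the current density d.
exact = b * math.log(1.01)
print(round(exact, 4))  # -> -0.2441  (the 0.01*b approximation gives -0.2453)
```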


Regression of FuelCon on Density

Regression Statistics
Multiple R          0.4538
R Square            0.2059
Adjusted R Square   0.1897
Standard Error      65.1675
Observations        51

ANOVA
             df    SS            MS           F         Significance F
Regression    1     53960.7466    53960.7466   12.7062   0.0008
Residual     49    208093.4001     4246.8041
Total        50    262054.1466

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   495.6283       9.4811           52.2752   0.0000    476.5752    514.6814
Density      -0.0251       0.0070           -3.5646   0.0008     -0.0392     -0.0109

Regression of FuelCon on LogDensity

Regression Statistics
Multiple R          0.5269
R Square            0.2776
Adjusted R Square   0.2629
Standard Error      62.1561
Observations        51

ANOVA
             df    SS            MS           F         Significance F
Regression    1     72748.3136    72748.3136   18.8302   0.0001
Residual     49    189305.8330     3863.3843
Total        50    262054.1466

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    597.1867       26.9612          22.1499   0.0000    543.0062    651.3671
LogDensity   -24.5308        5.6531          -4.3394   0.0001    -35.8911    -13.1705

Example 5
Imports and GDP
The gross domestic product (GDP) and dollar amount of total imports (Imports), both in billions
of dollars, for 25 countries are saved in Imports_and_GDP.xlsx.
Country          Imports    GDP
Argentina        20.300     391.000
Australia        68.000     528.000
...              ...        ...
United Kingdom   330.100    1520.000
United States    1148.000   10082.000

The objective is to find an equation showing the relationship between Imports (y) and GDP (x).
The scatterplot of Imports versus GDP shows that this is not a linear relationship.
[Figure: Scatterplot of Imports versus GDP]

At the left-hand side of the x axis and the bottom of the y axis, the values are clumped together.
Moving from left to right on the x axis, the values become more spread out. The same thing
happens moving up the y axis: the values become progressively more spread out.
This suggests the use of a log transformation for both the x and y variables.
As the scatterplot of LogImports versus LogGDP below shows, the relationship appears much
closer to linear.


[Figure: Scatterplot of LogImports versus LogGDP]

The results for the regression of LogImports on LogGDP are shown below.
The regression of Imports on GDP is not shown for comparison purposes, because the response
variable y has been transformed to log y and the usual comparisons are not valid. In particular,
the interpretations of se and r2 are different because the units of the response variable are
completely different. For example, increases in r 2 when the natural logarithm transformation is
applied to y do not necessarily suggest an improved model. Because of the above, it is difficult
to compare this regression to any model using y as the response variable.
Note that transformations of the explanatory variables do not create this type of problem.
It is only when the y variable is transformed that comparison becomes more difficult.
Regression Statistics
Multiple R          0.9168
R Square            0.8404
Adjusted R Square   0.8335
Standard Error      0.9142
Observations        25

ANOVA
             df    SS         MS         F          Significance F
Regression    1    101.2551   101.2551   121.1527   0.0000
Residual     23     19.2226     0.8358
Total        24    120.4777

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   -1.1275        0.4346           -2.5941   0.0162    -2.0265     -0.2284
LogGDP       0.8670        0.0788           11.0069   0.0000     0.7041      1.0300

The regression model is

log y = α + β log x + ε

The estimated regression equation is


Predicted LogImports = −1.1275 + 0.8670 LogGDP
The slope coefficient 0.8670 indicates that if the GDP increases by 1%, then the Imports are
expected to increase by approximately 0.8670% (about 1%).
If the estimated regression equation is used for forecasting, natural logs of the y values are
forecasted, not the y values themselves. For example, what is the forecast of Imports for a
country with GDP = 500 billion dollars?
Using the estimated regression equation,
LogImports = −1.1275 + 0.8670 LogGDP
           = −1.1275 + 0.8670 Log(500)
           = −1.1275 + 0.8670(6.2146)
           = 4.2606
The forecast value for y (Imports) must be computed as
Imports = e^4.2606 = 70.85 ≈ 71 billion dollars
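A minimal Python sketch of this forecast and back-transformation (Log here denotes the natural logarithm):

```python
import math

# Forecast Imports for GDP = 500 with the log-log equation,
# then undo the natural-log transform with exp().
log_imports = -1.1275 + 0.8670 * math.log(500)
imports = math.exp(log_imports)
print(round(log_imports, 4), round(imports, 2))  # -> 4.2606 70.85
```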

