Assignment 2
Date Assigned: May 24th, 2018 Date Due: May 27th, 2018
(Questions: 1)
Instructions:
Provide all numerical results with two digits of precision only. The label of each figure and
table appears below it.
Q 1. Consider the datasets provided in Tables 1 and 2, in order to answer the following questions.
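Assume …, and the coefficient estimates $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$, $\hat{\beta}_3$, $\hat{\beta}_4$,
$\hat{\beta}_5$, as obtained using the Least Squares approach.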
a) Calculate F-statistic using Training Data. What can be inferred from the determined value?
(6 points)
b) Implement Forward Selection technique with a Stopping Rule: “Maximum three Predictors”.
(15 points)
c) Suppose an Interaction Effect exists between “Student” and “Annual Income”.
i. Extend your originally developed Multiple Linear Regression model by including the
Interaction Term. (3 points)
ii. Determine the relationship between the Interaction Term and “Credit Limit” in terms of
magnitude and direction. (2 points)
iii. Calculate Standard Error (SE) of the Interaction Term Coefficient Estimate (ITCE)
determined in part (i). Is the determined coefficient a good estimate? (5 points)
iv. Calculate t-statistic for the ITCE determined in part (i). What can be inferred from the
determined value? Will you revert to your originally developed Multiple Linear
Regression model, or will you keep this new model? (6 points)
v. Calculate the $R^2$ statistic for the extended model over Training Data. What can be inferred
from the determined value? (13 points)
vi. Calculate Test MSE using the extended model. Is the model performing well on Test
Data? [Hint: Test MSE on the originally developed Multiple Linear Regression model =
73,281,963.26] (12 points)
Non-Linear Regression:
d) Observe the graph of “Annual Income” vs. “Credit Limit” in Figure 1. Transform your originally
developed Multiple Linear Regression model by including a cubic term for “Annual Income”.
(2 points)
e) What is your opinion on the transformation carried out in part (d) for: (2 points)
i. Quadratic Regression Model
ii. Cubic Regression Model
CS 4701: Data Science Assignment 2 Answer Key
1 1 34 0 1 $14,891 $3,606
2 0 82 1 1 $106,025 $6,645
3 1 71 0 0 $104,593 $7,075
4 0 36 0 0 $148,924 $9,504
5 1 68 0 1 $55,882 $4,897
6 1 77 0 0 $80,180 $8,047
7 0 41 1 1 $71,061 $6,819
Table 1. Credit Card Customers - Training Data
8 0 37 0 0 $20,996 $3,388
9 1 87 0 0 $71,408 $7,114
10 0 66 0 0 $15,125 $3,300
Table 2. Credit Card Customers - Test Data
Figure 1. Graph of “Annual Income” (x-axis, 0 to 160,000) vs. “Credit Limit” (y-axis, 0 to 10,000)
A 1.
a)
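$F = \dfrac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$, with $n = 7$ and $p = 5$
Here, $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{7} (y_i - \bar{y})^2 = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_7 - \bar{y})^2$
where, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{7} \sum_{i=1}^{7} y_i$
Here, $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{7} (y_i - \hat{y}_i)^2 = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_7 - \hat{y}_7)^2$
where, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_5 x_{i5}$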
Since the value of the F-statistic is found to be very low, none of the predictors has a strong
relationship with the response variable.
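The calculation in part (a) can be sketched in Python. This is a minimal sketch under two assumptions: that the last column of Table 1 is “Credit Limit”, and that the model has p = 5 predictors. The RSS passed to `f_statistic` below is a hypothetical placeholder, since the real value depends on the coefficient estimates given in the question.

```python
# Training-data responses; assumes the last column of Table 1 is "Credit Limit".
credit_limit = [3606, 6645, 7075, 9504, 4897, 8047, 6819]


def total_sum_of_squares(y):
    """TSS: sum of squared deviations of y from its mean."""
    mean_y = sum(y) / len(y)
    return sum((value - mean_y) ** 2 for value in y)


def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))."""
    return ((tss - rss) / p) / (rss / (n - p - 1))


tss = total_sum_of_squares(credit_limit)
print(f"TSS = {tss:.2f}")

# Hypothetical RSS, used only to show the call; substitute the RSS
# obtained from the fitted model.
example_rss = 0.5 * tss
print(f"F = {f_statistic(tss, example_rss, n=7, p=5):.2f}")
```

With n = 7 and p = 5 the denominator has only n - p - 1 = 1 degree of freedom, so F simplifies to (TSS/RSS - 1)/5 and exceeds 1 only when RSS is below one sixth of TSS.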
b)
We fit a Simple Linear Regression for each predictor and calculate its RSS.
For each predictor $X_j$ ($j = 1, \ldots, 5$):
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_j x_j$
$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{7} (y_i - \hat{y}_i)^2 = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_7 - \hat{y}_7)^2$
We add to our model the predictor that results in the lowest RSS among all predictors, denoted $X_a$ here:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_a x_a$
We then fit Simple Linear Regressions for each of the four remaining predictors and calculate their respective RSS.
We add to our model the predictor that results in the lowest RSS among the remaining predictors, denoted $X_b$ here (with $X_a$ the predictor selected first):
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_a x_a + \hat{\beta}_b x_b$
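We then fit Simple Linear Regressions for each of the three remaining predictors and calculate their respective RSS.
We add to our model the predictor that results in the lowest RSS among the remaining predictors, denoted $X_c$ here (with $X_a$ and $X_b$ the predictors selected first and second):
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_a x_a + \hat{\beta}_b x_b + \hat{\beta}_c x_c$
With three predictors in the model, the Stopping Rule is met.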
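The Forward Selection procedure with the “Maximum three Predictors” Stopping Rule can be sketched as follows. This is a minimal sketch of the textbook version, in which each candidate is added to the predictors already selected and the model is refit by least squares; it is not necessarily the exact hand calculation used in this answer key. The helper names (`solve`, `rss_of_fit`, `forward_selection`) and the small data set are hypothetical and chosen only so the result is easy to check.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss_of_fit(columns, y):
    """Fit least squares with an intercept on the given columns and return the RSS."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(design[i][a] * design[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(design[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[a] * design[i][a] for a in range(k)) for i in range(n)]
    return sum((y[i] - fitted[i]) ** 2 for i in range(n))


def forward_selection(predictors, y, max_predictors=3):
    """Greedily add the predictor that gives the lowest RSS, up to max_predictors."""
    selected = []
    remaining = list(predictors)
    while remaining and len(selected) < max_predictors:
        best = min(
            remaining,
            key=lambda name: rss_of_fit([predictors[s] for s in selected + [name]], y),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical data: y depends on x1 and x2 only.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7],
    "x2": [1, 0, 1, 0, 1, 0, 1],
    "x3": [3, 1, 4, 1, 5, 9, 2],
}
y = [2 + 3 * a + 0.5 * b for a, b in zip(predictors["x1"], predictors["x2"])]
print(forward_selection(predictors, y, max_predictors=3))
```

The normal equations are solved directly here to keep the sketch dependency-free; for real data with large predictor values, a library least-squares routine is the safer choice numerically.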
c)
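i. $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5 + \hat{\beta}_6 \, (x_{\mathrm{Student}} \times x_{\mathrm{Income}})$
where, $\hat{\beta}_6 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^{7} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{7} (x_i - \bar{x})^2} = \dfrac{(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_7 - \bar{x})(y_7 - \bar{y})}{(x_1 - \bar{x})^2 + \cdots + (x_7 - \bar{x})^2}$
[Don’t calculate $\hat{\beta}_6$ from the other coefficient estimates]
Here, $x_i$ is the value of the Interaction Term for observation $i$, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
and, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{7} \sum_{i=1}^{7} y_i$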
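Building the Interaction Term and applying the ratio-of-sums coefficient formula from part (c)(i) can be sketched as follows. The values below are hypothetical and are not taken from Tables 1 and 2, because the extracted tables do not label which column holds “Student”; the function name `slope` is likewise an illustrative choice.

```python
def slope(x, y):
    """Least-squares slope: sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator


# Hypothetical illustrative values (not the data of Tables 1 and 2).
student = [1, 0, 1, 0]
annual_income = [10.0, 20.0, 30.0, 40.0]
credit_limit = [3.0, 5.0, 7.0, 5.0]

# The Interaction Term is the element-wise product of the two predictors.
interaction = [s * inc for s, inc in zip(student, annual_income)]
print(interaction)
print(slope(interaction, credit_limit))
```

Because “Student” is a 0/1 indicator, the Interaction Term equals “Annual Income” for students and 0 for everyone else.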
ii. The Interaction Term has no relationship with “Credit Limit”, since its coefficient
estimate is zero.
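iii. $\mathrm{SE}\{\hat{\beta}\}^2 = \dfrac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{\sigma^2}{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_7 - \bar{x})^2}$
where $\hat{\beta}$ is the ITCE and $x_i$ is the value of the Interaction Term for observation $i$.
Hence, $\mathrm{SE}\{\hat{\beta}\} = \sqrt{\dfrac{\sigma^2}{\sum_{i=1}^{7} (x_i - \bar{x})^2}}$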
iv. For the ITCE ($\hat{\beta}$):
$t = \dfrac{\hat{\beta} - 0}{\mathrm{SE}(\hat{\beta})}$
There is some relationship between “Interaction Term” and “Credit Limit”. Hence, I will keep
this new model.
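The Standard Error and t-statistic of a coefficient estimate, as asked for in parts (iii) and (iv), can be sketched for a single-predictor fit. This sketch assumes that $\sigma^2$ is estimated by RSS/(n - 2), the usual choice for Simple Linear Regression; the answer key may instead use a value of $\sigma$ supplied in the question. The function name and the data are hypothetical.

```python
import math


def slope_se_and_t(x, y):
    """Return (slope, standard error of the slope, t-statistic) for a simple fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)          # assumed estimate of sigma^2
    se = math.sqrt(sigma2_hat / sxx)    # SE^2 = sigma^2 / sum((x - x_bar)^2)
    t = (b1 - 0) / se                   # tests H0: coefficient = 0
    return b1, se, t


# Hypothetical illustrative values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, se, t = slope_se_and_t(x, y)
print(f"slope = {b1:.2f}, SE = {se:.2f}, t = {t:.2f}")
```

A t-statistic far from zero relative to the t-distribution with n - 2 degrees of freedom is evidence against a zero coefficient; a coefficient estimate of exactly zero gives t = 0.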
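v. $R^2 = 1 - \dfrac{\mathrm{RSS}}{\mathrm{TSS}}$
where, $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
and, $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{7} (y_i - \hat{y}_i)^2 = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_7 - \hat{y}_7)^2$
Hence, $R^2 = 0.82$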
We can infer that 82% of the variance in “Credit Limit” is explained by regressing it onto the
five different predictors. Hence, the relationship between “Credit Limit” and the five different
predictors is quite strong.
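The $R^2$ calculation can be sketched as follows. The responses and fitted values below are hypothetical, chosen so the arithmetic is easy to follow by hand; they are not the fitted values of the extended model.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - rss / tss


# Hypothetical illustrative values: TSS = 5.0 and RSS = 1.0.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]
print(f"R^2 = {r_squared(observed, fitted):.2f}")
```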
vi. Test MSE using the extended model would remain the same as that of the original model
(73,281,963.26, from the hint), since the extended model contains only a new Interaction
Term, whose coefficient is zero.
Since the value of Test MSE is found to be very high, the extended model is not
performing well on Test Data.
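One way to judge whether the Test MSE is “very high” is to take its square root, which is on the same scale as “Credit Limit”. The sketch below assumes the last column of Table 2 is “Credit Limit”; the predictions passed to `mean_squared_error` are hypothetical and only show the call.

```python
import math


def mean_squared_error(y, y_hat):
    """MSE: average squared difference between observed and predicted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)


# Test-data responses; assumes the last column of Table 2 is "Credit Limit".
test_credit_limit = [3388, 7114, 3300]

# Test MSE quoted in the hint for the original model.
hinted_test_mse = 73_281_963.26
rmse = math.sqrt(hinted_test_mse)
print(f"RMSE = {rmse:.2f}")
print(rmse > max(test_credit_limit))

# Hypothetical predictions, used only to illustrate the function.
example_predictions = [3000.0, 7000.0, 3500.0]
print(mean_squared_error(test_credit_limit, example_predictions))
```

The root of the hinted Test MSE is roughly 8,560, which is larger than every “Credit Limit” value in Table 2, so the typical prediction error exceeds the quantity being predicted.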
Non-Linear Regression:
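d) $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_5 x_5 + \hat{\beta}_6 x_{\mathrm{Income}}^2 + \hat{\beta}_7 x_{\mathrm{Income}}^3$
where $\hat{\beta}_6$ is the coefficient of the Quadratic term and $\hat{\beta}_7$ is the coefficient of the Cubic term.
For each new term, with $x_i$ the value of that term (squared or cubed “Annual Income”) for observation $i$:
$\hat{\beta} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^{7} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{7} (x_i - \bar{x})^2} = \dfrac{(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_7 - \bar{x})(y_7 - \bar{y})}{(x_1 - \bar{x})^2 + \cdots + (x_7 - \bar{x})^2}$
[Don’t calculate $\hat{\beta}_6$ or $\hat{\beta}_7$ from the other coefficient estimates]
Here, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
and, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{7} \sum_{i=1}^{7} y_i$
Hence, $\hat{\beta}_6 = 0$ and $\hat{\beta}_7 = 0$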
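Constructing the Quadratic and Cubic terms for “Annual Income” can be sketched as follows. The sketch assumes the second-to-last column of Table 1 is “Annual Income”; the function name `add_polynomial_terms` is a hypothetical helper.

```python
# Assumes the second-to-last column of Table 1 is "Annual Income".
annual_income = [14891, 106025, 104593, 148924, 55882, 80180, 71061]


def add_polynomial_terms(x, max_degree=3):
    """Return {degree: [x_i ** degree, ...]} for degrees 2 up to max_degree."""
    return {degree: [xi ** degree for xi in x] for degree in range(2, max_degree + 1)}


terms = add_polynomial_terms(annual_income)
income_squared = terms[2]  # Quadratic term
income_cubed = terms[3]    # Cubic term
print(max(income_squared), max(income_cubed))
```

The squared and cubed incomes reach the order of 10^10 and 10^15, so a coefficient on these raw terms can round to 0.00 at two digits of precision even when the term contributes to the fit; rescaling “Annual Income” (for example to thousands of dollars) before taking powers avoids this.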
e) Since the coefficients of both the Quadratic and Cubic terms are equal to zero, neither
term has a relationship with “Credit Limit”. Thus, the Quadratic Regression Model and the Cubic
Regression Model are the same as the original model.