
Guanglong Bao

Instructor: Ani Aghababyan


ALY 6015 Intermediate Analytics
Spring term A
05/08/2018
Week 3 Assignment
Part 1
Exercise 1
Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003)
“Least Angle Regression” Annals of Statistics). This has patient level data on the progression of
diabetes. Next, load the glmnet package that will be used to implement LASSO.
Input:
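A minimal sketch of this step, assuming the packages are already installed:

library(lars)    # provides the diabetes dataset
data(diabetes)   # patient-level data: matrices x, x2 and the response y
library(glmnet)  # will be used to implement LASSO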

Output:

Result:
In this exercise, we load the lars package, the glmnet package, and the diabetes dataset. We can
then implement LASSO.

Exercise 2
Inspect the relationship of each of the predictors with the dependent variable.
Input:
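A possible sketch of these scatterplots, assuming the ten predictor columns of diabetes$x:

attach(diabetes)
par(mfrow = c(2, 5))  # one panel per predictor
for (i in 1:10) {
  plot(x[, i], y, xlab = colnames(x)[i], ylab = "y")
  abline(lm(y ~ x[, i]), col = "red")  # line of best fit
}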

Output:
Result:
There are three matrices in the dataset: x, x2, and y. While x has a smaller set of independent
variables, x2 contains the full set with quadratic and interaction terms. y is the dependent
variable, a quantitative measure of the progression of diabetes. Generating separate
scatterplots with the line of best fit for each predictor in x, with y on the vertical axis, shows
the basic relationship between the predictors and y. Most of the predictors are positively
correlated with y; only one independent variable is negatively correlated with y.

Exercise 3
Regress y on the predictors in x using OLS. We will use this result as a benchmark for
comparison.
Input:
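A minimal sketch of the OLS benchmark, using the matrix x directly in the model formula:

model.ols <- lm(y ~ x, data = diabetes)  # OLS on the smaller predictor set
summary(model.ols)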

Output:

Result:
The OLS regression of y on the predictors in x gives the baseline coefficient estimates against
which the LASSO fits in the following exercises are compared.

Exercise 4
Use the glmnet function to plot the path of each of x’s variable coefficients against the L1 norm
of the beta vector.
Input:
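A minimal sketch of the coefficient-path plot; glmnet fits the LASSO by default (alpha = 1), and xvar = "norm" puts the L1 norm on the horizontal axis:

fit.lasso <- glmnet(diabetes$x, diabetes$y)   # LASSO path over a grid of lambda values
plot(fit.lasso, xvar = "norm", label = TRUE)  # coefficients against the L1 norm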

Output:

Result:
This graph indicates at which stage each coefficient shrinks to zero. As the L1 norm decreases,
the coefficients shrink toward zero. The axis at the top of the graph indicates the number of
non-zero coefficients selected at the corresponding value of λ.

Exercise 5
Use the cv.glmnet function to get the cross validation curve and the value of lambda that
minimizes the mean cross validation error.

Input:
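A minimal sketch of the cross-validation step; the seed is a hypothetical choice, added only because cv.glmnet assigns folds at random:

set.seed(123)  # hypothetical seed for reproducible folds
cv.fit <- cv.glmnet(diabetes$x, diabetes$y)
plot(cv.fit)       # mean cross-validation error against log(lambda)
cv.fit$lambda.min  # lambda that minimizes the mean CV error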

Output:
Result:
From the cross-validation curve, we can see that the log of λ ranges roughly between 0 and 2.

Exercise 6
Using the minimum value of lambda from the previous exercise, get the estimated beta matrix.

Input:
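A minimal sketch, reusing the cv.fit object from the previous exercise:

coef(cv.fit, s = "lambda.min")  # estimated beta matrix at the minimizing lambda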

Output:
Result:
According to the estimated beta matrix, at the minimum value of λ the coefficients for age and
ldl shrink to zero.

Exercise 7
To get a more parsimonious model we can use a higher value of lambda that is within one
standard error of the minimum. Use this value of lambda to get the beta coefficients. Note that
more coefficients are now shrunk to zero.
Input:
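A minimal sketch; lambda.1se is glmnet's built-in rule for the largest lambda within one standard error of the minimum:

coef(cv.fit, s = "lambda.1se")  # a more parsimonious beta matrix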

Output:

Result:
When λ is one standard error above the minimum, the coefficients for age, tc, ldl, tch, and glu
shrink to zero; more coefficients are shrunk to zero than at the minimum λ.

Exercise 8
As mentioned earlier, x2 contains a wider variety of predictors. Using OLS, regress y on x2 and
evaluate the results.
Input:
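A minimal sketch of the OLS fit on the full predictor set:

model.ols2 <- lm(y ~ x2, data = diabetes)  # OLS with quadratic and interaction terms
summary(model.ols2)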

Output:
Result:
The OLS regression of y on the full predictor set x2 provides the benchmark for the LASSO
model fitted in the next exercises.

Exercise 9
Repeat exercise-4 for the new model.
Input:
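A minimal sketch, repeating the path plot with x2 in place of x:

fit.lasso2 <- glmnet(diabetes$x2, diabetes$y)
plot(fit.lasso2, xvar = "norm", label = TRUE)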

Output:

Result:
Although x2 contains a wider variety of predictors, the result is similar: as the L1 norm
decreases, the predictor coefficients shrink to zero.

Exercise 10
Repeat exercises 5 and 6 for the new model and see which coefficients are shrunk to zero. This is
an effective way to narrow down on important predictors when there are many candidates.
Input:
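A minimal sketch, repeating the cross-validation and coefficient extraction for x2:

cv.fit2 <- cv.glmnet(diabetes$x2, diabetes$y)
plot(cv.fit2)                    # CV curve for the larger model
coef(cv.fit2, s = "lambda.min")  # which of the many candidate predictors survive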

Output:

Result:
The log of λ ranges between 1 and 2. Looking at the beta matrix at the minimum λ, bmi and ltg
still have the largest influence on the dependent variable y. Therefore, bmi and ltg are the most
important variables in the model.
Part 2
2.1 Preparing the dataset
Input:
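A minimal sketch of the preparation step; the data frame name people and its column names are assumptions, since the original data source is not named:

sex <- factor(people$sex)                       # factor with each individual's sex
cols <- ifelse(sex == "M", "blue", "red")       # colour vector: male vs. female
vars <- people[, c("height", "weight", "age")]  # the continuous variables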

Output:

Result:
We create a factor containing each individual's sex and generate a colour vector to distinguish
male and female. We then extract all the continuous variables from the dataset: the height, the
weight, and the age of each person.

2.2 Visualizing the data in a 3-dimensional space


Input:
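A minimal sketch of the 3-D plot, using the hypothetical vars and cols objects from section 2.1:

library(rgl)
plot3d(vars$height, vars$weight, vars$age,
       xlab = "height", ylab = "weight", zlab = "age",
       col = cols, type = "s")  # one sphere per person, coloured by sex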

Output:
Result:
Each person is characterized by three variables. The plot3d() function of the rgl package allows
us to view the data in a 3-D space.

2.3 Centring and reducing the variables


Input:
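One way to put all three axes on the same scale in rgl is aspect3d("iso"), which keeps one data unit equal on every axis; the variable names are the hypothetical ones from section 2.1:

plot3d(vars$height, vars$weight, vars$age, col = cols, type = "s")
aspect3d("iso")  # identical scale on all axes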

Output:

Result:
In this plot, all the variables are represented using the same scale.

2.3.1 Centring the data


Input:
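A minimal sketch of the centring step, again with the hypothetical vars from section 2.1:

vars.c <- scale(vars, center = TRUE, scale = FALSE)  # subtract each variable's mean
plot3d(vars.c[, 1], vars.c[, 2], vars.c[, 3],
       col = cols, type = "s")  # cloud now centred on the origin (0, 0, 0)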

Output:
Result:
This operation consists of subtracting the mean of each variable from each point. The scale()
function computes the centring of all variables in one command. The plot definitely looks
better: the cloud of points is now centred around the origin (0, 0, 0). However, we can now see
that the variability is greater for the height than for the two other variables.

2.3.2 Reducing the data


Input:
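A minimal sketch of the reducing step, centring and then dividing by each variable's standard deviation:

vars.cr <- scale(vars, center = TRUE, scale = TRUE)  # centre and reduce in one call
plot3d(vars.cr[, 1], vars.cr[, 2], vars.cr[, 3],
       col = cols, type = "s")  # all variables now on comparable scales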

Output:

Result:
We can reduce the centered data, by dividing each point by the standard deviation of each
variable. The scale() function still allows this operation.
