LUSI YANG
SUPERVISOR: DR. SHELLEY BULL
UNIVERSITY OF TORONTO
MARCH 16, 2017
Abstract
This report examines gene-based association of SNPs with high-density lipoprotein (HDL) in a type 1 diabetes cohort. Low HDL is a risk factor for cardiovascular disease in the general population.
The two objectives of the study are: (1) gene-based testing using multiple linear regression of
multiple SNPs in a gene and (2) assessing the full multiple regression model using variance
inflation factor (VIF) and ridge regression. The gene of interest in this analysis is the CETP gene
because it has a known association with HDL in the general population, and has been reported in
previous analysis of the type 1 diabetes population (Teslovich et al., 2010; Yoo et al., 2017).
When SNPs within a gene are correlated, multiple linear regression may suffer from
multicollinearity. To assess multicollinearity, VIFs were calculated for each SNP, and the SNPs
that had large VIFs were removed from the full model that included all SNPs of CETP as the
covariates. Ridge regression is another method for assessing multicollinearity, which penalizes
the correlated SNPs in the gene. By using cross-validation in R, we found the tuning parameter λ with the minimum mean squared error (MSE). For VIF, the global hypothesis tests of
association in the full and reduced models for CETP are not very different and inference from the
full model (i.e. the global test) appears to be unaffected by the level of multicollinearity among
the SNPs in CETP. For ridge regression, the λ with the minimum MSE had approximately the same MSE as the full model, which implies that the full model was unaffected by
multicollinearity.
Introduction
In genetic association studies, genetic variants are genotyped as single-nucleotide polymorphisms (SNPs) and tested one by one to detect the association between each SNP and the disease traits of interest. This is called the single-SNP approach.
Alternatively, gene-based analysis, or the analysis of multiple SNPs, studies the association at the gene level rather than at the SNP level. To test for the significance of a gene, a multi-SNP global
statistic is constructed to test if there is an association between the SNPs of a gene and a
particular disease trait. One advantage of gene-level testing is the reduction in the number of tests of association. The analysis of multiple SNPs within a gene is usually conducted by multiple linear regression.
Global statistics have been developed to test the combined effects of SNPs with certain
quantitative traits. In the classical regression case, the multi-SNP global statistic has degrees of
freedom (df) that corresponds to the number of SNPs in a gene. In a recent study of gene
association, multiple linear combination (MLC) regression for regional testing has been proposed
to reduce the df of the global test statistics (Yoo et al., 2017). The reduced df statistics can
improve both power and robustness in gene association testing, as found by Yoo et al. (2017) for the MLC test statistic. The MLC test adapts to the linkage disequilibrium structure of SNPs in
a gene by partitioning and recoding SNPs into bins of positively correlated SNPs. The df of the MLC test then equals the number of bins.
To illustrate, the bins of the SNPs from the study are presented in Figure 1. Each colour
represents a bin and each bin contains the positively correlated SNPs from the CETP gene. The
number above each line is the correlation between the adjacent SNPs. Since there are five bins, the df used in the MLC test equals 5. Figure 1 is a network representing the correlations among the SNPs, in which the edges between SNPs with correlation less than 0.5 have been removed.
Figure 1. Clustering of SNPs in DCCT/EDIC CETP gene data (Yoo et al., 2017)
The purpose of this report is to apply global hypothesis testing of gene-based association
using the methods of variance inflation factor (VIF) and ridge regression. These results will then
be compared to the result from the gene-based MLC test statistic. The motivation for using these two methods is to assess the consequences of multicollinearity in the multiple regression
models because SNPs from a gene can be correlated. Each method may suggest different
importance of SNPs in a gene, and may provide different information on the association of the
gene with the quantitative trait of interest. Because the df is defined as the number of SNPs in the model, the df of the VIF approach depends on which SNPs are removed from the reduced regression due to
multicollinearity. This method uses the typical likelihood ratio test (comparison between the null
model and the reduced model) to investigate the association between the gene and the
quantitative trait of interest. On the other hand, ridge regression does not reduce the number of df.
It rather penalizes the coefficients of highly correlated SNPs in the model by shrinking them
close to zero, but does not remove them from the model. By cross-validation, ridge regression chooses the tuning parameter λ that gives the smallest mean squared error (MSE). If this MSE is close to the MSE of the multiple regression with all SNPs as covariates when λ is close to zero, then this suggests the multiple regression is not affected by correlation among the SNPs.
Materials and Methods
Stage 1. The Diabetes Control and Complications Trial (DCCT): a randomized clinical trial.
Stage 2. Epidemiology of Diabetes Interventions and Complications (EDIC): a follow-up of DCCT patients.
Stage 3. DCCT/EDIC Genetic Study: a study of the association of SNPs in a large set of candidate genes with complications of T1D.
Figure 2. The flow chart of the data collection stages
The dataset used in this study was collected in several stages. The first stage was The
Diabetes Control and Complications Trial (DCCT) from 1983 to 1993, which was a randomized
clinical trial that was designed to compare intensive to conventional insulin therapy in type 1
diabetes patients and determine whether the complications of type 1 diabetes (T1D) could be prevented or delayed (Al-Kateb et al., 2008; Nathan, 2013). In this stage, 1,441 subjects with T1D were enrolled. In the second stage, the Epidemiology of Diabetes Interventions and
Complications (EDIC) was a follow-up study of 90 percent of the DCCT cohort, designed to
study the durability of the DCCT effects on more advanced stages of diabetes complications
including cardiovascular disease (NIH, 2008; Nelson, 2013). DCCT/EDIC Genetic Study was
then derived from the previous two stages. This study investigated the association of SNPs in a
large set of candidate genes with complications of T1D of 1,362 white probands (Yoo et al.,
2017). The participants were genotyped by a custom Illumina GoldenGate Beadarray assay,
which consists of 1,213 SNPs in 183 candidate genes. Each candidate gene contains more than one SNP. The inclusion criterion of the Genetic Study was white patients of the DCCT. The
exclusion criterion of the Genetic Study was those individuals with missing genotype data or
discrepant sex based on standard quality control procedures (Al-Kateb et al., 2008).
For this study, we are interested in the trait high-density lipoprotein (HDL) because
having low HDL is a risk factor for cardiovascular disease in the general population. There is a
known association between HDL and the gene CETP in the general population, which has been reported in previous analyses of the same T1D data (Teslovich et al., 2010; Yoo et al., 2017). The
HDL data was obtained at the DCCT baseline in stage 1 of the flow chart. The SNP data for the gene CETP were obtained from the custom Illumina GoldenGate Beadarray assay in stage 3 of
the flow chart. There are 10 SNPs in the gene CETP, and each SNP has genotype coding 0, 1, or 2.¹ HDL is the dependent variable and the 10 SNPs are the covariates of this study. Figure 3 presents histograms of the 10 SNPs from the CETP gene, with each SNP taking the value 0, 1, or 2. Most of the histograms are right skewed, as expected from the definition of minor alleles.¹
1. Genotype classification: 0 = 0 copies of the minor allele (i.e. AA); 1 = 1 copy of the minor allele (i.e. AT); 2 = 2 copies of the minor allele (i.e. TT). The minor allele is the allele with frequency less than 50% in the sample.
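As a toy sketch of this coding (the helper name and the allele letters are illustrative, not taken from the study data):

```python
def minor_allele_count(genotype, minor="T"):
    """Code a two-letter genotype as the number of copies of the minor allele."""
    return genotype.count(minor)

# With T as the minor allele: AA -> 0 copies, AT -> 1, TT -> 2
codes = [minor_allele_count(g) for g in ["AA", "AT", "TT"]]
print(codes)  # [0, 1, 2]
```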
Figure 4. The box plots of HDL and log(HDL)
The original data for HDL was not log transformed, and it was slightly right skewed, as shown in the box plot on the left, with more points at the upper tail than the log-transformed HDL on the right.
There are several outliers in the box plots for each SNP. Most of the box plots are symmetric.
Regression Models. Suppose there are m SNPs in a gene; they can be analyzed through a multiple regression. We denote the genotypes of the m SNPs as X1, X2, X3, ..., Xm. Each Xi represents the count of minor alleles for the ith SNP and takes the value 0, 1, or 2. Let y be the quantitative trait of interest, such as HDL. The multiple linear regression can be set up in matrix form as

y = Xβ + ε,  (1)

where y is the vector of observations on the dependent variable (HDL), X is the design matrix containing an intercept column and the m independent variables (the SNPs in a gene), β is the vector of regression coefficients, and ε is the vector of errors. The coefficients β are estimated by minimizing the sum of squared residuals. To find the least squares estimates, we used the function lm() in R.
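As a minimal illustration of the least squares criterion that lm() solves, here is the single-covariate case in plain Python (the function name is illustrative):

```python
def least_squares(x, y):
    """Least squares for one covariate: minimizes the sum of squared residuals."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope  # (intercept, slope)

# Data on the exact line y = 1 + 2x is recovered exactly.
print(least_squares([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```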
For the global hypothesis of gene-based association, we can write

H0: β1 = β2 = ... = βm = 0 versus Ha: at least one βj ≠ 0.

To test the global hypothesis, we fitted an intercept-only model and then a full model with all the SNPs. We then used the function anova() in R to conduct the likelihood ratio test.2 If the p-value is significant, we reject the null hypothesis that the intercept-only model is true.
A linear regression model assumes that there is a linear relationship between the response and the explanatory variables, that the errors are i.i.d. normally distributed, and that the error variance is constant (homoscedasticity). In addition, influential points and multicollinearity can affect inference on the coefficients. Since SNPs are from
2. For instance, we can use anova(fitnull, fit, test="Chisq"), where the likelihood ratio test is a chi-squared test.
the same gene, they may be correlated and cause multicollinearity. Multicollinearity can inflate the variances of the estimated coefficients and make inference on them unreliable. To assess multicollinearity, we propose the variance inflation factor (VIF) and ridge regression.
Variance Inflation Factor.3 Suppose the model has only one covariate x_j such that

y_i = β_0 + β_j x_ij + ε_i,  (3)

where ε_i are i.i.d. normally distributed errors. The variance of the estimated coefficient β̂_j is then at its minimum:

var(β̂_j)_min = σ² / Σ_{i=1}^{n} (x_ij − x̄_j)².  (4)

Suppose now the model is a multiple linear regression with covariates that are correlated with the jth covariate:

y_i = β_0 + β_1 x_i1 + ... + β_m x_im + ε_i.  (5)

Then

var(β̂_j) = [σ² / Σ_{i=1}^{n} (x_ij − x̄_j)²] × [1 / (1 − R_j²)],  (6)

where R_j² is obtained from regressing the jth covariate on the remaining covariates. The greater the linear dependence, the larger R_j² and the more inflated the variance of β̂_j. The VIF is the ratio of these two variances:

VIF_j = var(β̂_j) / var(β̂_j)_min = 1 / (1 − R_j²).  (7)
VIFs can be calculated for all the covariates in the linear regression model.
Researchers determine the thresholds used to reduce the multicollinearity in the model. There are no universally correct thresholds; however, they generally range from 4 to 10 (O'Brien, 2007). This implies an R_j² ranging from 0.75 to 0.9, which indicates a high linear dependence among the covariates. A function to calculate VIF values can be found in R's car library: vif().
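The correspondence between these thresholds and R_j² follows directly from VIF_j = 1/(1 − R_j²); a quick numeric check (function name illustrative):

```python
def vif_from_r2(r2):
    """VIF_j = 1 / (1 - R_j^2), from the VIF definition."""
    return 1.0 / (1.0 - r2)

# R_j^2 = 0.75 corresponds to a VIF of 4; R_j^2 = 0.9 corresponds to a VIF of 10.
print(vif_from_r2(0.75), round(vif_from_r2(0.9), 6))  # 4.0 10.0
```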
Step 1. Fit a full model with all SNPs in the linear regression by using the lm() function in R.
Step 2. Apply the vif() function for fit: vif(fit). Check which covariate has the largest VIF.
Step 3. Delete the covariate that has the largest VIF, fit the model again, and call this new model model1. Then check which covariate has the largest VIF. This deletion continues until the VIFs of all the remaining covariates are less than a reasonable threshold.
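The quantity computed in each step relies only on the identity VIF_j = 1/(1 − R_j²). As a plain-Python sketch, independent of the car package (helper names and the toy genotype matrix are made up for illustration):

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def r_squared(y, X):
    """R^2 from an OLS regression of y on the columns of X (intercept added)."""
    n = len(y)
    Z = [[1.0] + row for row in X]  # prepend intercept column
    p = len(Z[0])
    XtX = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    yhat = [sum(Z[i][a] * beta[a] for a in range(p)) for i in range(n)]
    ybar = sum(y) / n
    ss_res = sum((y[i] - yhat[i]) ** 2 for i in range(n))
    ss_tot = sum((v - ybar) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def vif(j, X):
    """VIF of column j: regress x_j on the remaining columns, then 1 / (1 - R_j^2)."""
    y = [row[j] for row in X]
    others = [[v for k, v in enumerate(row) if k != j] for row in X]
    return 1.0 / (1.0 - r_squared(y, others))

# Toy genotype matrix (rows = subjects, columns = SNPs coded 0/1/2);
# columns 0 and 1 are highly correlated, column 2 much less so.
X = [[0, 0, 1], [1, 1, 0], [2, 2, 1], [0, 1, 2],
     [1, 1, 1], [2, 2, 0], [0, 0, 2], [1, 2, 1]]
print(vif(0, X) > vif(2, X))  # True: column 0's variance is inflated by column 1
```

Applying the same deletion rule as in Steps 1 to 3, one would drop the column with the largest VIF and recompute the remaining VIFs.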
Ridge Regression. The linear regression model can be set up as in equation (1). Under ordinary least squares, the estimate of β, β̂ = (XᵀX)⁻¹Xᵀy, is unbiased. In ridge regression, instead of minimizing the sum of squared residuals, we minimize the penalized residual sum of squares

Σ_{i=1}^{n} (y_i − x_iᵀβ)² + λ Σ_{j=1}^{m} β_j²,  (11)

so the estimate of the coefficient β changes to

β̂_ridge = (XᵀX + λI)⁻¹ Xᵀy,  (12)

where equation (12) shows that β̂_ridge is biased. There exists a λ that can minimize the mean squared error. There are several steps to implement the selection of λ in R.4
Step 1. Install the glmnet package, and use the model.matrix() function to construct an x matrix. This x matrix consists of all the columns of the covariates in the dataset and an intercept column. Also, define the response vector y.
Step 2. Use the glmnet() function in the package and specify that a ridge regression is performed by setting alpha=0. Setting standardize=TRUE allows the glmnet() function to standardize the x variables in the dataset. If we only want 20 lambdas, we can set nlambda=20. For instance, fit.ridge = glmnet(x, y, alpha=0, standardize = TRUE).
Step 3. Typing fit.ridge then prints, for each lambda in descending order, the number of degrees of freedom and %Dev (the percentage of deviance explained). If one wants to see how each coefficient behaves for different lambdas, one can input plot(fit.ridge). The general trend is that as lambda gets bigger, there is more penalty on the coefficients, which shrink towards zero.
Step 4. Cross-validation is a general method to choose the tuning parameter λ in R. The function is cv.glmnet(), and by default the function performs ten-fold cross-validation. This function divides the dataset into 10 blocks. For each split, 9 blocks of the data become the training set and 1 block becomes the validation set. The coefficients are estimated from the 9 training blocks, and the estimated coefficients are then used to predict the validation set, giving a prediction error. There are 10 ways of dividing the data into training and validation sets; thus, 10 errors are generated. R calculates the average of these errors for each λ, which is called the cross-validation error or mean squared error (MSE).
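The ten-fold logic can be sketched in plain Python, with a trivial constant-mean predictor standing in for the ridge fit (names and data are illustrative, not part of glmnet):

```python
import random

def ten_fold_cv_mse(y, fit_and_predict, seed=0):
    """Ten-fold cross-validation: average validation MSE over the 10 splits.

    fit_and_predict takes the training responses and returns one predicted
    value used for every validation point (a constant predictor keeps the
    sketch model-free; glmnet would fit a ridge model here instead).
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[k::10] for k in range(10)]  # 10 roughly equal blocks
    errors = []
    for k in range(10):
        valid = folds[k]
        train = [i for i in idx if i not in valid]
        pred = fit_and_predict([y[i] for i in train])
        errors.append(sum((y[i] - pred) ** 2 for i in valid) / len(valid))
    return sum(errors) / len(errors)  # the cross-validated MSE

# Example: the "model" is just the training mean.
data = [float(v % 5) for v in range(50)]
mse = ten_fold_cv_mse(data, lambda tr: sum(tr) / len(tr))
```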
As the default, this is done for 100 λs in R. We want to choose the λ that corresponds to the smallest cross-validated error, which is obtained by cv.ridge$lambda.min, where cv.ridge is the object returned by the function cv.glmnet(). If one wants to locate where lambda.min is in the λ sequence, one can type cv.ridge$lambda. Then type cv.ridge$cvm to see each λ's corresponding cross-validated error or MSE. One can also verify this λ's cross-validated error with min(cv.ridge$cvm).
Step 5. After obtaining the λ, analyze the coefficients by coef(cv.ridge, s = "lambda.min"). The coefficients that are shrunk close to zero correspond to the less important covariates.
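Equation (12) can also be checked directly with a closed-form ridge solve on centred toy data; this is only a sketch of the computation, ignoring glmnet's standardization details (names and data are made up):

```python
def ridge_coefficients(X, y, lam):
    """Ridge estimate (X'X + lam*I)^(-1) X'y on centred data, as in equation (12)."""
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    yc = [v - ybar for v in y]                                 # centre the response
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centre each column
    A = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in range(p)] for a in range(p)]                # X'X + lam*I
    c = [sum(Xc[i][a] * yc[i] for i in range(n)) for a in range(p)]
    # solve A b = c by Gaussian elimination with partial pivoting
    M = [A[r][:] + [c[r]] for r in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for k in range(col, p + 1):
                M[r][k] -= f * M[col][k]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (M[r][p] - sum(M[r][k] * b[k] for k in range(r + 1, p))) / M[r][r]
    return b

# Toy data: 6 subjects, 2 correlated covariates.
X = [[0, 1], [1, 0], [2, 1], [0, 2], [1, 1], [2, 0]]
y = [0.1, 0.9, 2.2, 0.0, 1.1, 1.8]
b_ols = ridge_coefficients(X, y, 0.0)   # lam = 0 recovers ordinary least squares
b_pen = ridge_coefficients(X, y, 5.0)   # a positive lam shrinks the coefficients
print(sum(b * b for b in b_pen) < sum(b * b for b in b_ols))  # True
```

With λ = 0 the estimate coincides with ordinary least squares, which is the comparison used later in the Results.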
Results
The main interest of this study is to use the variance inflation factor (VIF) and ridge regression to assess the consequences of multicollinearity in the multiple regression models.
To test for the global hypothesis of gene-based association, we compared the model without any
covariates and the full model. The result was significant (p-value = 2.181E-11), which implies
that we reject the null hypothesis that all βs are 0. Thus, we conclude there is a strong association between
the CETP and the quantitative trait HDL. In Table 1, a full model with all SNPs of the gene
CETP is presented.
To check the validity of the model, we plotted the diagnostic plots in Figure 6 from R by using the plot function. The residuals in the Residuals vs Fitted plot bounce randomly around the 0 line, which indicates that the covariates and the response variable have a linear relationship. These residuals roughly form a horizontal band around the 0 line, which suggests that the variances of the error terms are equal. This plot also suggests that there are no outliers. The Q-Q plot is close to the diagonal, which means that the residuals are normally distributed. The Scale-Location plot checks for homoscedasticity. Since the residuals are spread equally and have no discernible pattern, the homoscedasticity assumption holds. The Cook's distance plot shows there are no influential cases because the Cook's distances of all observations are relatively small. Thus, this model passed the assumption checks of a multiple linear regression. However, this model may be affected by multicollinearity.
Variance Inflation Factor (VIF).
In Table 2, Model 1 consists of all the SNPs from the CETP gene; Model 2 consists of the
remaining SNPs after detecting high VIF in SNP 2 in Model 1; Model 3 consists of the remaining
covariates from Model 2 except SNP 9 because of its high VIF in Model 2.
Since SNP 2 had the highest VIF in Model 1, we regressed SNP 2 on the remaining SNPs
in R. The resulting R2 is 0.8771, and this means there is a high linear dependence among the
predictor SNP 2 and remaining SNPs. Thus, we removed SNP 2 from the model. For Model 2,
we fitted a linear regression model without SNP 2, and all the VIFs decreased. However, SNP 9
had the largest VIF, and the R2 for regressing SNP 9 on the remaining SNPs was 0.8537. This
also suggested that SNP 9 had a high linear dependence with the other SNPs; thus, it was
removed. For Model 3, all VIFs were relatively small. Hence, multicollinearity was reduced, and the reduced model includes only the remaining 8 SNPs. The diagnostic plots of the
reduced model were similar to the diagnostic plots of the full model. Thus, the reduced model
also passed all the linear regression assumptions. The estimates of the reduced model are
presented in Table 3.
Table 3. Regression Analysis Results: The Reduced Model
Coefficients Estimate Standard Error 95% CI t-value
Intercept -0.039 0.018 (-0.074, -0.004) -2.170
SNP 1 -0.032 0.022 (-0.075, 0.011) -1.502
SNP 3 -0.030 0.015 (-0.0594, -0.0006) -2.056
SNP 4 -0.039 0.017 (-0.072, -0.005) -2.253
SNP 5 -0.067 0.019 (-0.104, -0.030) -3.503
SNP 6 -0.014 0.016 (-0.045, 0.017) -0.881
SNP 7 0.015 0.016 (-0.016, 0.046) 0.895
SNP 8 0.056 0.025 (0.007, 0.105) 2.277
SNP 10 -0.029 0.018 (-0.064, 0.006) -1.596
To test for the global hypothesis of gene-based association, we compared the model
without any covariates and the final model. The result was significant (p-value = 5.603E-12),
which implies that we reject the null hypothesis that all βs are 0.
Ridge Regression. R, by default, generates 100 lambdas for both the glmnet and cv.glmnet functions. Applying plot() to the glmnet fit, the following plot was generated.
The above plot shows that as the lambda penalty increases, there is more penalty on the coefficients. Another way of saying this is that the coefficients shrink towards zero as the lambda value increases. Applying plot() to the cv.glmnet fit, the following plot was generated.
Figure 8 is a plot of the mean-squared errors for each lambda calculated by R. The plot suggests that around log(λ) = -3, lambda has the smallest mean-squared error. To find this lambda, cv.ridge$lambda.min was applied, and the resulting λ was 0.052. The corresponding mean cross-validated error (MSE) is 0.055. The resulting regression coefficients are shown below.
The coefficient estimates of SNPs 3, 6, 9 and 10 above are close to zero. In ridge regression, they
are the smallest coefficients in the model. This indicates that the ridge regression down-weighted the variables that were less important or had no association with HDL.
Discussion
The coefficient estimates of the three models are presented in Table 5. The full model has
all 10 SNPs in the CETP gene. It was compared to a null model with no covariates, and the likelihood ratio test was significant with a p-value of 2.181E-11. This implies that there is a
strong association between CETP and the quantitative trait HDL. In the reduced model, after
using VIF to detect multicollinearity, SNP 2 and SNP 9 were both eliminated from the model due
to their high VIFs. This model was compared to the null model, and the likelihood ratio test was significant with a p-value of 5.603E-12. Essentially, after removing highly correlated
covariates, the reduced model has a similarly small p-value. Lastly, in the ridge regression model,
the estimates for SNP 3, SNP 6, SNP 9, and SNP 10 are close to zero, and MSE is minimized,
which implies that these SNPs do not improve the prediction. Referring back to Figure 1, SNPs 3
and 6 are correlated with each other and they are both penalized. SNPs 9 and 10 both are
correlated with some other SNPs in the gene; thus, they are also penalized by the ridge. The
lambda of 0.052 (log(λ) = -2.96) had the smallest mean squared error. Based on Figure 8, as λ approaches 0, the mean squared error remains roughly constant. This indicates that the MSE of the
ridge with lambda 0.052 is roughly the same as the MSE of an ordinary linear regression. Thus,
the full model is adequate to test for the association between the gene and HDL.
Yoo et al. (2017) presented the MLC regression method. This method had fewer degrees
of freedom than other approaches in testing the gene association. The method divided the 10
SNPs into 5 clusters as shown in Figure 1. The MLC test combined the SNPs that are correlated
by putting them into the same cluster and then averaging the regression coefficients. The null
hypothesis of the test was that there is no association of any cluster and the alternative is that at
least one cluster has an association with HDL. The small p-value of the MLC test statistic
indicates a strong association between the gene and HDL (p-value of 3.69E-12). As a result, the full model, the VIF-reduced model, the ridge regression, and the MLC test all suggest that there is an association between the CETP gene and HDL.
The contribution of this study is an assessment of linear regression methods for gene-based association
analysis. VIF and ridge regression are techniques to assess multicollinearity in the multiple
linear regression model. Since SNPs within a gene are likely to be correlated with each other, multiple linear regression may suffer from multicollinearity. However, the global hypothesis tests of association in the full and reduced models for CETP are not very different, and
inference from the full model (i.e. the global test) appears to be unaffected by the level of
multicollinearity among the set of SNPs in CETP. Similarly, ridge regression is a shrinkage method that shrinks the coefficients of highly correlated SNPs in the model. Ridge regression will choose a λ with the smallest MSE, and if this MSE is close to the MSE of a linear regression (i.e. when λ
is close to zero), then this implies that the full model is adequate to test for association. Thus,
both methods serve as a tool to confirm the validity of the full multiple regression in testing
genetic association.
One of the major weaknesses of this study is that the full model only includes the SNPs of
the CETP gene. Other variables such as age and sex are also associated with HDL and could have
been included in this analysis to explain variation in HDL. According to a former study of HDL
cholesterol, HDL increases with age in men but not in women (Ferrara et al., 1997). Another weakness of this study is that there are another 182 genes with 1,203 SNPs that could be studied.
To conclude, the VIF and ridge regression have shown that the full model with 10 CETP
SNPs is appropriate to study the association of the CETP gene and HDL. These are methods that
can be used to evaluate the validity of full multiple regression for a gene-based genetic
association analysis. For further studies, other patient characteristic data such as age and sex
could be incorporated to explain variation in HDL. Furthermore, we could also investigate the least absolute shrinkage and selection operator (LASSO) method, because this method also deals with highly correlated predictors by selecting only one of them and shrinking the others to zero. LASSO also chooses a λ that minimizes the MSE through cross-validation. If this MSE is close to the MSE of a linear regression (i.e. when λ is close to zero), this would imply that the full model is adequate to test for association.
Appendices
References
Al-Kateb, H., Boright, A. P., Mirea, L., Xie, X., Sutradhar, R., Mowjoodi, A., . . . Paterson, A. D.
(2007). Multiple Superoxide Dismutase 1/Splicing Factor Serine Alanine 15 Variants Are
Associated With the Development and Progression of Diabetic Nephropathy: The
Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and
Complications Genetics Study. Diabetes, 57(1), 218-228. doi:10.2337/db07-1059
Ferrara, A., Barrett-Connor, E., & Shan, J. (1997, July 01). Total, LDL, and HDL Cholesterol
Decrease With Age in Older Men and Women. Retrieved March 17, 2017, from
http://circ.ahajournals.org/content/96/1/37
Friedman, J., Hastie, T., & Tibshirani, R. (2016, March 17). Package 'glmnet'. Retrieved from https://cran.r-project.org/web/packages/glmnet/glmnet.pdf
Hastie, T., & Qian, J. (2014, June 26). Glmnet Vignette. Retrieved from
https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2015). An introduction to statistical learning
with applications in R. New York, NY: Springer.
Nathan, D. M., & Group, F. T. (2014, January). The Diabetes Control and Complications
Trial/Epidemiology of Diabetes Interventions and Complications Study at 30 Years:
Overview. Retrieved March 23, 2017, from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3867999/
National Institutes of Health. (2008, May). DCCT and EDIC: The Diabetes Control and Complications Trial and Follow-up Study. Retrieved from
https://www.niddk.nih.gov/about-niddk/research-areas/diabetes/dcct-edic-diabetes
control-complications-trial-follow-up-study/Documents/DCCT-EDIC_508.pdf
O'Brien, R. M. (2007). A Caution Regarding Rules of Thumb for Variance Inflation Factors.
Quality & Quantity, 41(5), 673-690. doi:10.1007/s11135-006-9018-6
Teslovich, T. M., Musunuru, K., Edmondson, A. V., & Stylianou, A. C. (2010). Biological,
clinical and population relevance of 95 loci for blood lipids. Nature, 707-713.
Yoo, Y. J., Sun, L., Poirier, J. G., Paterson, A. D., & Bull, S. B. (2017). Multiple linear
combination (MLC) regression tests for common variants adapted to linkage
disequilibrium structure. Genetic Epidemiology. 108-121. doi: 10.1002/gepi.22024
Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/gepi.22024/epdf
The R Code
tol=1e-32
#dir="C:/Users/yyoo/Dropbox/genebased/Rprog/"
dir="M:/lyang/Practicum Dataset/HDL_Candidate_genes/taleban_QT_example/single genes/"
pedfile="example_clb_hdl_3.ped"
mapfile="example_3.map"
#LDfile="mydata.LD"
#ped<-ped_a[,c(1:58,61:64)]
#map<-map_a[c(1:26,28:29),]
ped<-ped_a[,c(1:12,15:64)]
map<-map_a[c(1:3,5:29),]
ped<-ped_a
map<-map_a
##############################
# Data Analysis
# Histograms
par(mfrow = c(2,5))
i=1
for(i in 1:10) {
barplot(table(datanew[i+1]), main = paste("Histogram of SNP", i, sep = ""), ylim=range(0,
1200), xlab="Genotype")
}
# Box Plot
# Summary for HDL
par(mfrow = c(1,2))
boxplot(exp(datanew$pheno), xlab="", ylab="HDL", main="HDL Box plot")
boxplot(datanew$pheno, xlab="", ylab="log(HDL)", main="log(HDL) Box plot")
summary(datanew$pheno)
par(mfrow = c(2,5))
i=1
for (i in 1:10)
{
boxplot(datanew$pheno[datanew[i+1]==0],
datanew$pheno[datanew[i+1]==1],datanew$pheno[datanew[i+1]==2],boxwex=0.5, main =
paste("log(HDL) SNP", i, sep=""),ylab= "log(HDL)", xlab="Genotype", names=c("0","1","2"))
}
# Model Selection
# Multiple Linear Regression
nth = c(1:10)
colnames(datanew) = c("y", paste("SNP", nth, sep=""))
fit = lm(y~SNP1+SNP2+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10,
data=datanew)
fitnull = lm(y~1, data=datanew)
anova(fitnull, fit, test="Chisq")
#Diagnostic Plots
par(mfrow=c(2,2))
plot(fit,which = 1:4)
# Ridge Regression
library(glmnet)
# We choose alpha=0 because it is Ridge Regression.
# glmnet automatically standardizes the x variables
SNP1 = SNP20; SNP2 = SNP21; SNP3 = SNP22; SNP4 = SNP23; SNP5 = SNP24;
SNP6 = SNP25; SNP7 = SNP26; SNP8 = SNP27; SNP9 = SNP28; SNP10 = SNP29; y = pheno;
x = model.matrix(y~SNP1+SNP2+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10,
data=datanew)
fit.ridge=glmnet(x,y,alpha=0, standardize = TRUE)
lbs_fun <- function(fit.ridge, ...) {
L <- length(fit.ridge$lambda)
x <- log(fit.ridge$lambda[L])
y <- fit.ridge$beta[, L][-1]
labs <- names(y)
text(x, y, labels=labs)
legend('topright', legend=labs, col=1:length(labs), lty=1, cex = 0.65)
}
plot(fit.ridge, xvar="lambda", col = 1:10)
lbs_fun(fit.ridge)
# Cross-validation
# By default, the function performs ten-fold cross-validation
cv.ridge=cv.glmnet(x, y, alpha=0, standardize = TRUE)
plot(cv.ridge, main="Cross-validation")
bestlam = cv.ridge$lambda.min
min(cv.ridge$cvm)
coef(cv.ridge, s = "lambda.min")