
GENE-BASED ANALYSIS OF GENETIC ASSOCIATION WITH HDL: APPLICATION OF VARIANCE INFLATION FACTORS AND RIDGE REGRESSION

LUSI YANG
SUPERVISOR: DR. SHELLEY BULL
UNIVERSITY OF TORONTO
MARCH 16, 2017

Abstract

The paper reports a gene-based analysis of genetic association with high-density

lipoprotein (HDL). Low HDL is a risk factor for cardiovascular disease in the general population.

The two objectives of the study are: (1) gene-based testing using multiple linear regression of

multiple SNPs in a gene and (2) assessing the full multiple regression model using variance

inflation factor (VIF) and ridge regression. The gene of interest in this analysis is the CETP gene

because it has a known association with HDL in the general population, an association that has also been reported in

a previous analysis of the type 1 diabetes population (Teslovich et al., 2010; Yoo et al., 2017).

When SNPs within a gene are correlated, multiple linear regression may suffer from

multicollinearity. To assess multicollinearity, VIFs were calculated for each SNP, and the SNPs

that had large VIFs were removed from the full model that included all SNPs of CETP as the

covariates. Ridge regression is another method for assessing multicollinearity, which penalizes

the correlated SNPs in the gene. By using cross-validation in R, we found the tuning parameter λ

that minimized the mean squared error (MSE). For VIF, the global hypothesis tests of

association in the full and reduced models for CETP are not very different and inference from the

full model (i.e. the global test) appears to be unaffected by the level of multicollinearity among

the SNPs in CETP. For ridge regression, the λ that minimized the MSE had approximately the

same MSE as the full model, which implies that the full model was unaffected by

multicollinearity.


Introduction

In genome-wide association studies (GWAS), researchers scan a large number of

single-nucleotide polymorphisms (SNPs) and test them one by one to detect the association

between each SNP and the disease traits of interest. This is called the single-SNP approach.

Alternatively, gene-based analysis, or the analysis of multiple SNPs, studies the association at the

gene level rather than at the SNP level. To test for the significance of a gene, a multi-SNP global

statistic is constructed to test if there is an association between the SNPs of a gene and a

particular disease trait. One advantage of gene-level testing is the reduction in the number of tests of

association. The analysis of multiple SNPs within a gene is usually conducted by multiple linear

regression, with the SNPs as the covariates.

Global statistics have been developed to test the combined effects of SNPs with certain

quantitative traits. In the classical regression case, the multi-SNP global statistic has degrees of

freedom (df) that correspond to the number of SNPs in a gene. In a recent study of gene

association, multiple linear combination (MLC) regression for regional testing has been proposed

to reduce the df of the global test statistics (Yoo et al., 2017). The reduced df statistics can

improve both power and robustness in gene association testing, as Yoo et al. (2017) found for the

MLC test statistic. The MLC test adapts to the linkage disequilibrium structure of SNPs in

a gene by partitioning and recoding SNPs into bins of positively correlated SNPs. The df of the

MLC test statistic equals the number of bins constructed.
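As a rough sketch of the partitioning idea (a simplification, not the exact MLC recoding; the adjacent-SNP correlations below are made up for illustration and shown in Python rather than R), adjacent SNPs can be grouped greedily:

```python
# Hypothetical correlations between adjacent SNPs (NOT the real CETP values).
adj_corr = [0.9, 0.3, 0.8, 0.7, 0.2, 0.6, 0.1, 0.9, 0.4]

def bin_adjacent_snps(adj_corr, threshold=0.5):
    """Greedily group adjacent SNPs whose correlation meets the threshold;
    each resulting bin contributes one df to the global test."""
    bins = [[1]]                    # SNP 1 starts the first bin
    for snp, r in enumerate(adj_corr, start=2):
        if r >= threshold:
            bins[-1].append(snp)    # correlated: extend the current bin
        else:
            bins.append([snp])      # uncorrelated: start a new bin
    return bins

bins = bin_adjacent_snps(adj_corr)
df = len(bins)                      # df of the reduced global statistic
```

With these illustrative correlations, 10 SNPs collapse into 5 bins, so the global test would use 5 df instead of 10.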

To illustrate, the bins of the SNPs from the study are presented in Figure 1. Each colour

represents a bin and each bin contains the positively correlated SNPs from the CETP gene. The

number above each line is the correlation between the adjacent SNPs. Since there are five bins,

the df used in the MLC test equals 5. Figure 1 is a network representing the correlation

among the SNPs, but the edges between SNPs with correlation less than 0.5 have been removed.


Figure 1. Clustering of SNPs in DCCT/EDIC CETP gene data (Yoo et al., 2017)

The purpose of this report is to apply global hypothesis testing of gene-based association

using the methods of variance inflation factor (VIF) and ridge regression. These results will then

be compared to the result from the gene-based MLC test statistic. The motivation of using these

two methods is to assess the consequences of the multicollinearity in the multiple regression

models because SNPs from a gene can be correlated. Each method may suggest different

importance of SNPs in a gene, and may provide different information on the association of the

gene with the quantitative trait of interest. Because the df equals the number of SNPs in the model,

the df of the VIF-based test depends on which SNPs are removed from the reduced regression due to

multicollinearity. This method uses the typical likelihood ratio test (comparison between the null

model and the reduced model) to investigate the association between the gene and the

quantitative trait of interest. On the other hand, ridge regression does not reduce the number of df.

It rather penalizes the coefficients of highly correlated SNPs in the model by shrinking them

close to zero, but does not remove them from the model. By cross-validation, ridge regression

chooses the tuning parameter λ that gives the smallest mean squared error (MSE). If this MSE is

close to the MSE of the multiple regression with all SNPs as covariates when λ is close to zero,

then this suggests the multiple regression is not affected by correlation among the SNPs,

and is adequate for gene-based association analysis.


Materials and Methods

The Sample and the Quality of the Data.

[Figure 2 flow chart: Stage 1 — The Diabetes Control and Complications Trial (DCCT), a randomized clinical trial; Stage 2 — Epidemiology of Diabetes Interventions and Complications (EDIC), a follow-up of DCCT patients; Stage 3 — DCCT/EDIC Genetic Study, a study of the association of SNPs in a large set of candidate genes with complications of T1D.]

Figure 2. The flow chart of the data collection stages

The dataset used in this study was collected in several stages. The first stage was The

Diabetes Control and Complications Trial (DCCT) from 1983 to 1993, which was a randomized

clinical trial that was designed to compare intensive to conventional insulin therapy in type 1

diabetes patients and determine whether the complications of type 1 diabetes (T1D) could be

prevented or delayed (Al-Kateb et al., 2008; Nathan, 2013). In this stage, 1,441 subjects with

T1D were enrolled. In the second stage, the Epidemiology of Diabetes Interventions and

Complications (EDIC) was a follow-up study of 90 percent of the DCCT cohort, designed to

study the durability of the DCCT effects on more advanced stages of diabetes complications,

including cardiovascular disease (NIH, 2008; Nathan, 2014). The DCCT/EDIC Genetic Study was

then derived from the previous two stages. This study investigated the association of SNPs in a

large set of candidate genes with complications of T1D of 1,362 white probands (Yoo et al.,

2017). The participants were genotyped by a custom Illumina GoldenGate Beadarray assay,

which consists of 1,213 SNPs in 183 candidate genes. Each candidate gene contains more

than one SNP. The inclusion criterion of the Genetic Study was white patients from the DCCT. The

exclusion criterion of the Genetic Study was individuals with missing genotype data or

discrepant sex based on standard quality control procedures (Al-Kateb et al., 2008).


For this study, we are interested in the trait high-density lipoprotein (HDL) because

having low HDL is a risk factor for cardiovascular disease in the general population. There is a

known association between HDL and the gene CETP in the general population, which has also been

reported in a previous analysis of the same T1D data (Teslovich et al., 2010; Yoo et al., 2017). The

HDL data were obtained at the DCCT baseline in stage 1 of the flow chart. The SNP data for the

gene CETP were obtained from the custom Illumina GoldenGate Beadarray assay in stage 3 of

the flow chart. There are 10 SNPs in the gene CETP, and each SNP has genotype coding 0, 1, or

2.¹ HDL is the dependent variable and the 10 SNPs are the covariates of this study. There are

1,362 individuals in the study, and there is no missing data.

Descriptive Analysis of the Data

Figure 3. The histograms of the 10 SNPs of the gene CETP

Figure 3 presents the 10 SNPs from the CETP gene, and each SNP takes value 0, 1, or 2. Most of

the histograms are right skewed as expected based on the definition of minor alleles.¹

1. Genotype classification: 0 = 0 copies of the minor allele (e.g. AA); 1 = 1 copy of the minor allele (e.g. AT); 2 = 2
copies of the minor allele (e.g. TT). A minor allele is the allele with frequency <50% in the sample.
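The coding can be sketched as follows (hypothetical genotypes at a single SNP, in Python for illustration; the counting mirrors the R code in the appendix):

```python
from collections import Counter

# Hypothetical allele pairs for five subjects at one SNP.
genotypes = [("A", "A"), ("A", "T"), ("T", "T"), ("A", "A"), ("A", "T")]

# The minor allele is the less frequent allele in the sample.
allele_counts = Counter(a for pair in genotypes for a in pair)
minor = min(allele_counts, key=allele_counts.get)   # "T": 4 of 10 copies

# Code each subject as the number of minor-allele copies (0, 1, or 2).
coded = [pair.count(minor) for pair in genotypes]
```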



Figure 4. The box plots of HDL and log(HDL)

The original HDL data were not log transformed and were slightly right skewed, as shown in

the box plot on the left, with more points in the upper tail than the log-transformed HDL on the

right. The log-transformation made the distribution of HDL roughly symmetric.

Figure 5. The Box plots of HDL for 10 SNPs


There are several outliers in the box plots for each SNP. Most of the box plots are symmetric and

they indicate that the log transformation of HDL is adequate.

Regression Models. Suppose there are m SNPs in a gene, and they can be analyzed through a

multiple regression. We denote the genotypes of the m SNPs as X1, X2, X3, …, Xm. Each Xi represents the

count of minor alleles for the ith SNP, and can be classified as 0, 1, or 2. Let y be the quantitative

trait of interest, such as HDL. The multiple linear regression can be set up in matrix form as

y = Xβ + ε. (1)

X is an n × (m+1) design matrix, n is the number of observations, m is the number of

independent variables (the number of SNPs in a gene), and the X matrix includes the intercept. β is

an (m+1) × 1 vector of coefficients we want to estimate. ε is an n × 1 error vector. y is an n × 1

vector of observations on the dependent variable (HDL). The coefficients β are estimated by

minimizing the sum of squared residuals. To find the least squares estimates, we used the

function lm() in R.
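As a numerical sketch of the least squares step (synthetic data, not the DCCT/EDIC genotypes; Python rather than R, for illustration):

```python
import numpy as np

# Tiny synthetic example: n = 6 subjects, m = 2 SNPs coded 0/1/2.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 2, 1],
              [1, 0, 0],
              [1, 1, 2],
              [1, 2, 2]], dtype=float)      # first column is the intercept
beta_true = np.array([0.5, -0.2, 0.1])
y = X @ beta_true                           # noise-free, so the fit is exact

# Least squares solves the normal equations (X'X) beta = X'y;
# lm() in R computes the same estimate (via a QR decomposition).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```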

For the global hypothesis of gene-based association, we can write the following,

H₀ : β₁ = β₂ = … = β_m = 0 vs. H₁ : at least one β_j ≠ 0. (2)

To test the global hypothesis, we fitted an intercept-only model and then a full model with all the

SNPs. We then used the function anova() in R to conduct the likelihood ratio test.² If the p-

value is significant, we reject the null that the intercept-only model is true.
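The global test can be sketched numerically. The report uses anova(..., test="Chisq") in R; the sketch below uses the equivalent F-statistic form on simulated data (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
# Simulated 0/1/2 genotypes plus an intercept column (illustrative only).
X = np.column_stack([np.ones(n), rng.integers(0, 3, size=(n, m))]).astype(float)
beta = np.array([0.0, 0.4, 0.0, -0.3])
y = X @ beta + rng.normal(scale=0.5, size=n)

def rss(X, y):
    """Residual sum of squares of the least squares fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

rss_null = rss(X[:, :1], y)     # intercept-only model
rss_full = rss(X, y)            # model with all m SNPs
# F statistic for H0: beta_1 = ... = beta_m = 0 (df = m and n - m - 1)
F = ((rss_null - rss_full) / m) / (rss_full / (n - m - 1))
```

A large F (small p-value) rejects the intercept-only model, exactly as the chi-squared likelihood ratio comparison does.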

A linear regression model assumes that there is a linear relationship between response and

explanatory variables, that the errors are i.i.d. normal, and that the error variance is constant. In addition, influential

points and multicollinearity can affect inference of coefficients. Since SNPs are from

2. For instance, we can use anova(fitnull, fit, test="Chisq"), where the likelihood ratio test is a chi-squared test.


the same gene, they may be correlated and cause multicollinearity. Multicollinearity can produce

unstable parameter estimates with inflated standard errors. To reduce the severity of

multicollinearity, we propose the variance inflation factor (VIF) and ridge regression.

Variance Inflation Factor.³ Suppose the model has only one covariate x_j, such that

Y_i = α + β_j x_ij + ε_i , (3)

where the ε_i are i.i.d. normally distributed errors. The variance of the estimated coefficient β̂_j is then at its minimum,

var(β̂_j)_min = σ² / Σ_i (x_ij − x̄_j)² . (4)

Suppose now the model is a multiple linear regression with covariates that are correlated with the

covariate x_j:

Y_i = α + β_1 x_i1 + β_2 x_i2 + … + β_j x_ij + … + β_m x_im . (5)

The variance of the estimated coefficient β̂_j becomes

var(β̂_j) = [σ² / Σ_i (x_ij − x̄_j)²] · [1 / (1 − R_j²)] , (6)

where R_j² is obtained from regressing the jth covariate on the remaining covariates. The greater

the linear dependence, the larger R_j² and the more inflated the variance of β̂_j.

The VIF is simply the ratio of the two variances,

VIF_j = var(β̂_j) / var(β̂_j)_min = 1 / (1 − R_j²) . (7)

VIFs can be calculated for all the covariates in the linear regression model.

3. VIF explained based on the online lecture notes at https://onlinecourses.science.psu.edu/stat501/node/347


Researchers choose the thresholds used to decide when multicollinearity must be reduced. There are

no universally correct thresholds; however, they generally range from 4 to 10 (O'Brien, 2007). These

correspond to R² values from 0.75 to 0.9, which indicate a high linear dependence between the covariate of

interest and the remaining covariates.
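A numerical sketch of equation (7) (simulated covariates, Python for illustration; the report itself uses vif() from R's car package):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                    # unrelated covariate
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j
    on the remaining columns plus an intercept."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    b = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    resid = X[:, j] - others @ b
    r2 = 1.0 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

Here the collinear pair x1, x2 gets large VIFs while the independent x3 stays near 1.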

A function to calculate VIF values can be found in R's car package, and it is denoted vif().

To use vif() to assess multicollinearity, several steps are involved.

Step 1. Fit a full model with all SNPs in the linear regression by using the lm() function in R.

Let's call this linear regression fit.

Step 2. Apply the vif() function for fit: vif(fit). Check which covariate has the largest VIF.

Step 3. Delete the covariate that has the largest VIF, and fit the model again and call this new

model model1. Then check which covariate has the largest VIF. This deletion will continue until

VIFs of all the remaining covariates are less than a reasonable threshold.
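The three steps can be sketched as a loop (simulated data; a simplified Python stand-in for the repeated lm()/vif() calls in R, with an assumed threshold of 4):

```python
import numpy as np

def vif_all(X):
    """VIF of every column of X, each from an auxiliary regression
    of that column on the others (with an intercept)."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        b = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ b
        r2 = 1.0 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

def drop_high_vif(X, names, threshold=4.0):
    """Steps 1-3: repeatedly drop the covariate with the largest VIF
    until all remaining VIFs fall below the threshold."""
    X, names = X.copy(), list(names)
    while True:
        v = vif_all(X)
        if v.max() < threshold:
            return X, names
        worst = int(v.argmax())
        X = np.delete(X, worst, axis=1)
        del names[worst]

rng = np.random.default_rng(2)
n = 400
a = rng.normal(size=n)
X = np.column_stack([a, a + 0.05 * rng.normal(size=n), rng.normal(size=n)])
kept_X, kept = drop_high_vif(X, ["SNP1", "SNP2", "SNP3"])
```

One of the two collinear columns is dropped; the independent column survives.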

Ridge Regression. The linear regression model can be set up in matrix form as in equation (1). The

regression coefficients can be estimated by using the formula

β̂ = (XᵀX)⁻¹XᵀY, (8)

E(β̂) = β, (9)

where the ordinary least squares estimate of β is unbiased. In ridge regression, however, instead of

minimizing the sum of squared residuals, we minimize the penalized residual sum of squares

(equation 11); thus, the coefficient estimate changes to

β̂_ridge = (XᵀX + λI)⁻¹XᵀY. (10)

PRSS(β) = (Y − Xβ)ᵀ(Y − Xβ) + λβᵀβ. (11)

E(β̂_ridge) = (XᵀX + λI)⁻¹XᵀX β ≠ β, (12)


where equation 12 shows that the ridge estimator β̂_ridge is biased. There nevertheless exists a λ that can minimize

the mean squared error. There are several steps to implement the selection of λ in R.⁴
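Before turning to glmnet, equation (10) can be checked directly; a minimal numerical sketch (synthetic collinear data, no intercept, Python for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])  # collinear pair
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=n)

def ridge(X, y, lam):
    """beta_ridge = (X'X + lam * I)^{-1} X'y, as in equation (10)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_ols = ridge(X, y, 0.0)        # lam = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 100.0)    # a larger lam shrinks the coefficients
```

As λ grows, the coefficient vector shrinks toward zero, which is the behaviour glmnet displays across its λ path.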

Step 1. Install the glmnet package, and use the model.matrix() function to construct an x matrix. This x

matrix consists of all the columns of the covariates in the dataset and an intercept column. Also,

y, the dependent variable, needs to be specified from the data.

Step 2. Use the glmnet() function in the package to specify that a ridge regression is performed by

setting alpha=0. Setting standardize=TRUE allows the glmnet() function to standardize the x

variables in the dataset. If we only want 20 lambdas, we should set nlambda=20. For instance,

fit.ridge=glmnet(x, y, alpha=0, nlambda=20, standardize = TRUE).

Step 3. Then, typing fit.ridge as in the example above, one obtains the number of degrees of freedom,

%dev (the percentage of deviance explained), and the lambdas in descending order. If one wants to see

how each coefficient behaves based on different lambdas, one can input plot(fit.ridge). The

general trend is that as lambda gets bigger, there is more penalty on the coefficients, which

shrink closer to zero, but never exactly to zero.

Step 4. Cross-validation is the general method to choose the tuning parameter λ in R. The

function is cv.glmnet(), and by default the function performs ten-fold cross-validation. This

function will divide the dataset into 10 blocks. Then 9 blocks of the data become the training

data, and 1 block of the data becomes the validation set. The coefficients are estimated from the

9 training blocks, and the estimated coefficients are then used to predict

the validation set. The prediction error for this λ is then estimated. There are 10 ways of dividing

the training data and the validation set; thus, 10 errors will be generated. R will calculate the

average error for this λ, which is called the cross-validated mean squared error (MSE).

4. Information on the glmnet package is available at https://cran.r-project.org/web/packages/glmnet/glmnet.pdf


By default, this is done for 100 λs in R. We want to choose the λ that corresponds to the

minimized cross-validated error, or MSE. In R, the best λ is chosen with

cv.ridge$lambda.min, where cv.ridge is the object returned by the function cv.glmnet(). If one wants to

locate where lambda.min falls in the λ sequence, one can type cv.ridge$lambda. Then type cv.ridge$cvm

to see this λ's corresponding cross-validated error, or MSE. One can also verify this λ's cross-

validated error by using min(cv.ridge$cvm).

Step 5. After obtaining the λ, inspect the coefficients with coef(cv.ridge, s = "lambda.min"). The

final λ varies slightly between runs because the fold assignment in the cross-validation procedure in R is random.
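The cross-validation loop behind cv.glmnet can be sketched by hand (simulated data; the λ grid, fold count, and coefficients below are illustrative, in Python rather than R):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=10):
    """k-fold cross-validated MSE for one lambda (cf. cv.glmnet)."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for hold in folds:
        train = np.setdiff1d(np.arange(len(y)), hold)
        b = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[hold] - X[hold] @ b) ** 2))
    return float(np.mean(errors))       # average over the k validation sets

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
mses = [cv_mse(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(mses))]   # analogue of lambda.min
```

The heavily penalized λ = 100 shrinks the true signal away and loses badly, so the selected λ is a smaller value, mirroring what cv.ridge$lambda.min reports.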

Results

The main interest of this study is to use the variance inflation factor (VIF) and ridge

regression to assess the consequences of the multicollinearity in the multiple regression models.

To test for the global hypothesis of gene-based association, we compared the model without any

covariates and the full model. The result was significant (p-value = 2.181E-11), which implies

that we reject the null that all βs are 0. Thus, we conclude there is a strong association between

the CETP and the quantitative trait HDL. In Table 1, a full model with all SNPs of the gene

CETP is presented.

Table 1. Regression Analysis Results: The Full Model


Coefficients Estimate Standard Error 95% CI* t-value
Intercept -0.108 0.053 (-0.212, -0.004) -2.025
SNP 1 -0.040 0.022 (-0.083, 0.003) -1.784
SNP 2 0.035 0.026 (-0.016, 0.086) 1.365
SNP 3 0.0005 0.027 (-0.052, 0.053) 0.019
SNP 4 -0.008 0.028 (-0.063, 0.047) -0.296
SNP 5 -0.063 0.019 (-0.100, -0.026) -3.280
SNP 6 -0.011 0.016 (-0.042, 0.020) -0.680
SNP 7 0.011 0.028 (-0.044, 0.066) 0.400
SNP 8 0.048 0.025 (-0.001, 0.097) 1.896
SNP 9 0.003 0.025 (-0.046, 0.052) 0.135
SNP 10 -0.018 0.031 (-0.079, 0.043) -0.588
*CI = Confidence Interval


To check the validity of the model, we plotted the diagnostic plots in Figure 6 from R by

using the plot function. The residuals in the Residuals vs Fitted plot bounce randomly around the 0 line, which

indicates that the covariates and the response variable have a linear relationship. The residuals

roughly form a horizontal band around the 0 line, which suggests that the variances of the error

terms are equal. This plot also suggests that there are no outliers. The Q-Q plot is close to

diagonal, which means that the residuals are normally distributed. The scale-location plot checks

for homoscedasticity. Since the residuals are spread equally and show no discernible pattern, the

homoscedasticity assumption holds. The Cook's distance plot shows there are no influential

cases because the Cook's distance of each observation is relatively small. Thus, this model

passed the assumption tests of a multiple linear regression. However, this model may be affected

by multicollinearity. We used VIF to help evaluate this.

Figure 6. The Diagnostic plots of the Final Model


Variance Inflation Factor (VIF).

Table 2. The VIFs


Covariate Model 1 Model 2 Model 3
SNP 1 1.89 1.78 1.78
SNP 2 8.137 - -
SNP 3 7.462 2.25 2.24
SNP 4 6.244 2.358 2.343
SNP 5 2.579 2.530 2.530
SNP 6 2.887 2.827 2.825
SNP 7 4.011 4.01 1.312
SNP 8 2.046 1.938 1.936
SNP 9 6.838 6.837 -
SNP 10 6.984 6.26 2.29

In Table 2, Model 1 consists of all the SNPs from the CETP gene; Model 2 consists of the

remaining SNPs after detecting high VIF in SNP 2 in Model 1; Model 3 consists of the remaining

covariates from Model 2 except SNP 9 because of its high VIF in Model 2.

Since SNP 2 had the highest VIF in Model 1, we regressed SNP 2 on the remaining SNPs

in R. The resulting R² is 0.8771, which means there is a high linear dependence between the

predictor SNP 2 and the remaining SNPs. Thus, we removed SNP 2 from the model. For Model 2,

we fitted a linear regression model without SNP 2, and all the VIFs decreased. However, SNP 9

had the largest VIF, and the R² for regressing SNP 9 on the remaining SNPs was 0.8537. This

also suggested that SNP 9 had a high linear dependence with the other SNPs; thus, it was

removed. For Model 3, all VIFs were relatively small. Hence, the multicollinearity was

reduced, and the reduced model includes only the remaining 8 SNPs. The diagnostic plots of the

reduced model were similar to the diagnostic plots of the full model. Thus, the reduced model

also passed all the linear regression assumptions. The estimates of the reduced model are

presented in Table 3.


Table 3. Regression Analysis Results: The Reduced Model
Coefficients Estimate Standard Error 95% CI t-value
Intercept -0.039 0.018 (-0.074, -0.004) -2.170
SNP 1 -0.032 0.022 (-0.075, 0.011) -1.502
SNP 3 -0.030 0.015 (-0.0594, -0.0006) -2.056
SNP 4 -0.039 0.017 (-0.072, -0.005) -2.253
SNP 5 -0.067 0.019 (-0.104, -0.030) -3.503
SNP 6 -0.014 0.016 (-0.045, 0.017) -0.881
SNP 7 0.015 0.016 (-0.016, 0.046) 0.895
SNP 8 0.056 0.025 (0.007, 0.105) 2.277
SNP 10 -0.029 0.018 (-0.064, 0.006) -1.596

To test for the global hypothesis of gene-based association, we compared the model

without any covariates and the final model. The result was significant (p-value = 5.603E-12),

which implies that we reject the null that all βs are 0.

Ridge Regression. R, by default, generates 100 lambdas for both the glmnet and cv.glmnet

functions. Applying plot() for the glmnet function, the following plot was generated.

Figure 7. Coefficients at different lambda values


The plot above shows that as lambda increases, there is more penalty on the

coefficients. Another way of saying this is that the coefficients shrink towards zero as lambda

value increases. Applying plot() for the cv.glmnet, the following plot was generated.

Figure 8. Mean-Squared Error for the 100 lambda values

Figure 8 is a plot with all the mean-squared errors of each lambda calculated by R. The plot

suggests that around log(λ) = −3, lambda has the smallest mean-squared error. To find this lambda,

cv.ridge$lambda.min was applied, and the result was 0.052. The corresponding mean cross-

validated error, or mean-squared error, is 0.055. The resulting regression coefficients are:

Table 4. The Coefficient Estimates at λ = 0.052


Coefficients Estimate
Intercept -0.104
SNP 1 -0.026
SNP 2 0.025
SNP 3 -0.006
SNP 4 -0.020
SNP 5 -0.047
SNP 6 -0.006
SNP 7 0.015
SNP 8 0.031
SNP 9 0.003
SNP 10 -0.010


The coefficient estimates of SNPs 3, 6, 9 and 10 above are close to zero. In ridge regression, they

are the smallest coefficients in the model. This indicates that the ridge regression down-weighted

the variables that were less important or had no association with HDL.

Discussion

Table 5. The Coefficient Estimates Comparison


Coefficients Full Model Reduced Model Ridge Regression
Intercept -0.108 -0.039 -0.1038
SNP 1 -0.040 -0.032 -0.026
SNP 2 0.035 - 0.025
SNP 3 0.0005 -0.030 -0.00565
SNP 4 -0.008 -0.039 -0.02
SNP 5 -0.063 -0.067 -0.047
SNP 6 -0.011 -0.014 -0.0057
SNP 7 0.011 0.015 0.015
SNP 8 0.048 0.056 0.031
SNP 9 0.003 - 0.0026
SNP 10 -0.018 -0.029 -0.0095

The coefficient estimates of the three models are presented in Table 5. The full model has

all the 10 SNPs in the CETP gene. It was compared to a null model with no covariates, and the

likelihood ratio test was significant with p-value equal to 2.181E-11. This implies that there is a

strong association between CETP and the quantitative trait HDL. In the reduced model, after

using VIF to detect multicollinearity, SNP 2 and SNP 9 were both eliminated from the model due

to their high VIFs. This model was compared to the null model, and the likelihood ratio test was

significant with p-value equals to 5.603E-12. Essentially, after removing highly correlated

covariates, the reduced model has a similarly small p-value. Lastly, in the ridge regression model,

the estimates for SNP 3, SNP 6, SNP 9, and SNP 10 are close to zero at the MSE-minimizing λ,

which implies that these SNPs do not improve the prediction. Referring back to Figure 1, SNPs 3

and 6 are correlated with each other and they are both penalized. SNPs 9 and 10 both are

correlated with some other SNPs in the gene; thus, they are also penalized by the ridge. The


lambda 0.052 (log(λ) = −2.96) had the smallest mean squared error. Based on Figure 8, as λ

approaches 0, the mean squared error remains roughly constant. This indicates that the MSE of the

ridge with lambda 0.052 is roughly the same as the MSE of an ordinary linear regression. Thus,

the full model is adequate to test for the association between the gene and HDL.

Yoo et al. (2017) presented the MLC regression method. This method had fewer degrees

of freedom than other approaches in testing the gene association. The method divided the 10

SNPs into 5 clusters as shown in Figure 1. The MLC test combined the SNPs that are correlated

by putting them into the same cluster and then averaging the regression coefficients. The null

hypothesis of the test was that there is no association of any cluster and the alternative is that at

least one cluster has an association with HDL. The small p-value of the MLC test statistic

indicates a strong association between the gene and HDL (p-value of 3.69E-12). As a result, the

reduced model, VIF final model, the ridge regression, and the MLC all suggest that there is an

association between the gene CETP and the HDL.

The impact of this study is to assess linear regression methods for gene-based association

analysis. VIF and ridge regression are techniques to assess multicollinearity in the multiple

linear regression model. Because SNPs within a gene are likely to be correlated with each

other, multiple linear regression may suffer from multicollinearity. However, the global

hypothesis test of association in the full and reduced models for CETP are not very different and

inference from the full model (i.e. the global test) appears to be unaffected by the level of

multicollinearity among the set of SNPs in CETP. Similarly, ridge regression is a shrinkage

method that penalizes the coefficients of highly correlated covariates in the model. Ridge regression chooses

the λ with the smallest MSE, and if this MSE is close to the MSE of a linear regression (i.e. when λ

is close to zero), then this implies that the full model is adequate to test for association. Thus,


both methods serve as a tool to confirm the validity of the full multiple regression in testing

genetic association.

One of the major weaknesses of this study is that the full model only includes the SNPs of

the CETP gene. Other variables such as age and sex are also associated with HDL and could have

been included in this analysis to explain variation in HDL. According to a former study of HDL

cholesterol, HDL increases with age in men but not in women (Ferrara et al., 1997). Another

limitation of this study is that the remaining 182 genes, with 1,203 SNPs, could be studied

similarly from the candidate genes dataset.

To conclude, the VIF and ridge regression have shown that the full model with 10 CETP

SNPs is appropriate to study the association of the CETP gene and HDL. These are methods that

can be used to evaluate the validity of full multiple regression for a gene-based genetic

association analysis. For further studies, other patient characteristic data such as age and sex

could be incorporated to explain variation in HDL. Furthermore, we could also investigate the least

absolute shrinkage and selection operator (LASSO) method, because this method also deals with

highly correlated predictors by selecting only one of them and shrinking the others to zero.

LASSO also chooses the λ that minimizes the MSE through cross-validation. If this MSE is close to

the MSE of a linear regression (i.e. when λ is close to zero), this would imply that the full model

is adequate to test for association.
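Under an orthonormal design, the contrast between the two penalties reduces to a one-line operator; a small sketch (illustrative numbers, not fitted to any data):

```python
def soft_threshold(z, lam):
    """Lasso update for one coefficient under an orthonormal design:
    shrink by lam and clip small values to exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

ols = [0.9, -0.2, 0.05]                      # hypothetical OLS coefficients
lasso = [soft_threshold(b, 0.3) for b in ols]
ridge = [b / (1.0 + 0.3) for b in ols]       # ridge rescales, never zeroes
```

The lasso sets the small coefficients exactly to zero (variable selection), while ridge only scales all of them toward zero.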


Appendices

References

Al-Kateb, H., Boright, A. P., Mirea, L., Xie, X., Sutradhar, R., Mowjoodi, A., . . . Paterson, A. D.
(2007). Multiple Superoxide Dismutase 1/Splicing Factor Serine Alanine 15 Variants Are
Associated With the Development and Progression of Diabetic Nephropathy: The
Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and
Complications Genetics Study. Diabetes, 57(1), 218-228. doi:10.2337/db07-1059

Detecting Multicollinearity Using Variance Inflation Factors. (2017). Retrieved from


https://onlinecourses.science.psu.edu/stat501/node/347

Ferrara, A., Barrett-Connor, E., & Shan, J. (1997, July 01). Total, LDL, and HDL Cholesterol
Decrease With Age in Older Men and Women. Retrieved March 17, 2017, from
http://circ.ahajournals.org/content/96/1/37

Friedman, J., Hastie, T., & Tibshirani, R. (2016, March 17). Package 'glmnet'. Retrieved from
https://cran.r-project.org/web/packages/glmnet/glmnet.pdf

Hastie, T., & Qian, J. (2014, June 26). Glmnet Vignette. Retrieved from
https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2015). An introduction to statistical learning
with applications in R. New York, NY: Springer.

Nathan, D. M., & Group, F. T. (2014, January). The Diabetes Control and Complications
Trial/Epidemiology of Diabetes Interventions and Complications Study at 30 Years:
Overview. Retrieved March 23, 2017, from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3867999/

National Institutes of Health. (2008, May). DCCT and EDIC: The Diabetes Control and
Complications Trial and Follow-up Study. Retrieved from
https://www.niddk.nih.gov/about-niddk/research-areas/diabetes/dcct-edic-diabetes-control-complications-trial-follow-up-study/Documents/DCCT-EDIC_508.pdf

O'Brien, R. M. (2007). A Caution Regarding Rules of Thumb for Variance Inflation Factors.
Quality & Quantity, 41(5), 673-690. doi:10.1007/s11135-006-9018-6

Teslovich, T. M., Musunuru, K., Edmondson, A. V., & Stylianou, A. C. (2010). Biological,
clinical and population relevance of 95 loci for blood lipids. Nature, 707-713.

Yoo, Y. J., Sun, L., Poirier, J. G., Paterson, A. D., & Bull, S. B. (2017). Multiple linear
combination (MLC) regression tests for common variants adapted to linkage
disequilibrium structure. Genetic Epidemiology. 108-121. doi: 10.1002/gepi.22024
Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/gepi.22024/epdf


The R Code

The Dataset Code

tol=1e-32
#dir="C:/Users/yyoo/Dropbox/genebased/Rprog/"
dir="M:/lyang/Practicum Dataset/HDL_Candidate_genes/taleban_QT_example/single genes/"
pedfile="example_clb_hdl_3.ped"
mapfile="example_3.map"
#LDfile="mydata.LD"

# sep="\t" for tab delimited files


ped_a=read.table(paste(dir,pedfile,sep=""),header=F)
map_a=read.table(paste(dir,mapfile,sep=""),header=F,
col.names=c("chr","rs","cM","bp","gene"))
#LD=read.table(paste(dir,LDfile,sep=""),header=F, col.names=c("SNP.1","SNP.2","r"))
#LD=cbind(LD,(LD$r)^2)
#dimnames(LD)[[2]]=c("SNP.1","SNP.2","r","r2")
#instead of reading LD file, now this code calculates r2 from haplotype data

#ped<-ped_a[,c(1:58,61:64)]
#map<-map_a[c(1:26,28:29),]

ped<-ped_a[,c(1:12,15:64)]
map<-map_a[c(1:3,5:29),]

ped<-ped_a
map<-map_a

##############################

#loop for each gene


genelist=unique(map$gene)
result=NULL
# 1:length(genelist)
for(i in 1:length(genelist)){
#for(i in 1:1){
snpindex=c(1:length(map$gene))[map$gene==genelist[i]]
snpall=as.character(map$rs[map$gene==genelist[i]])
datacolumn=c(6,((2*snpindex[1]+5):(2*tail(snpindex,1)+6)))
dataraw=ped[,datacolumn]
genodata=matrix(0,length(dataraw[,1]),length(snpindex))

#minor allele coding


mALvector=NULL
MALvector=NULL
for(j in 1:length(snpindex)){

#get minor alleles
alleles=factor(c(as.character(dataraw[,2*j]),as.character(dataraw[,(2*j+1)])))
acount=as.data.frame(table(alleles))
mAL=as.character(acount$alleles[order(acount$Freq)==1] )
MAL=as.character(acount$alleles[order(acount$Freq)==2] )
mALvector=c(mALvector,mAL)
MALvector=c(MALvector,MAL)
#counting of alleles
for(n in 1:length(dataraw[,1])){
if(dataraw[n,2*j]==mAL)genodata[n,j]=genodata[n,j]+1
if(dataraw[n,2*j+1]==mAL)genodata[n,j]=genodata[n,j]+1
}
}
SNPs=paste("SNP",snpindex,sep="")
#######datanew=data.frame(cbind(dataraw[,2],genodata))
datanew=data.frame(cbind(dataraw[,1],genodata))
dimnames(datanew)[[2]]=c("pheno",SNPs)
}
attach(datanew)
y = datanew$pheno

Data Analysis

# Histograms
par(mfrow = c(2,5))
i=1
for(i in 1:10) {
barplot(table(datanew[i+1]), main = paste("Histogram of SNP", i, sep = ""), ylim=range(0,
1200), xlab="Genotype")
}

# Table for Genotype Proportion and Minor Allele Frequency


store = matrix(nrow = 10, ncol = 4)
for(i in 1:10){
z = datanew[2:11][i]
t = prop.table(table(z))
MAF = t[3] +t[2] * 0.5
store[i,] = cbind(t(t), t(MAF))
}
nth = c(1:10)
rownames(store) = paste("SNP", nth, sep="")
colnames(store) = c("0","1", "2", "MAF")

# Correlation of the SNPs


cor(datanew[2:11])

# Box Plot

# Summary for HDL
par(mfrow = c(1,2))
boxplot(exp(datanew$pheno), xlab="", ylab="HDL", main="HDL Box plot")
boxplot(datanew$pheno, xlab="", ylab="log(HDL)", main="log(HDL) Box plot")
summary(datanew$pheno)
par(mfrow = c(2,5))
i=1
for (i in 1:10)
{
boxplot(datanew$pheno[datanew[i+1]==0],
datanew$pheno[datanew[i+1]==1],datanew$pheno[datanew[i+1]==2],boxwex=0.5, main =
paste("log(HDL) SNP", i, sep=""),ylab= "log(HDL)", xlab="Genotype", names=c("0","1","2"))
}

# Model Selection
# Multiple Linear Regression
nth = c(1:10)
colnames(datanew) = c("y", paste("SNP", nth, sep=""))
fit = lm(y~SNP1+SNP2+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10,
data=datanew)
fitnull = lm(y~1, data=datanew)
anova(fitnull, fit, test="Chisq")
#Diagnostic Plots
par(mfrow=c(2,2))
plot(fit,which = 1:4)

# Multicollinearity treatment with VIF


library(car)
vif(fit)
model1 = lm(y~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10, data=datanew)
vif(model1)
checksnp2 = lm(SNP2~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10,
data=datanew)
model2 = lm(y~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP10, data=datanew)
vif(model2)
checksnp9 = lm(SNP9~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP10, data=datanew)
finalfit = lm(y~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP10, data=datanew)
anova(fitnull, finalfit, test="Chisq")

# Ridge Regression
library(glmnet)
# We choose alpha=0 because it is Ridge Regression.
# glmnet automatically standardizes the x variables
SNP1 = SNP20; SNP2 = SNP21; SNP3 = SNP22; SNP4 = SNP23; SNP5 = SNP24;
SNP6 = SNP25; SNP7 = SNP26; SNP8 = SNP27; SNP9 = SNP28; SNP10 = SNP29; y = pheno;
x = model.matrix(y~SNP1+SNP2+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10,
data=datanew)

fit.ridge=glmnet(x,y,alpha=0, standardize = TRUE)
lbs_fun <- function(fit.ridge, ...) {
L <- length(fit.ridge$lambda)
x <- log(fit.ridge$lambda[L])
y <- fit.ridge$beta[, L][-1]
labs <- names(y)
text(x, y, labels=labs)
legend('topright', legend=labs, col=1:length(labs), lty=1, cex = 0.65)
}
plot(fit.ridge, xvar="lambda", col = 1:10)
lbs_fun(fit.ridge)

# Cross-validation
# By default, the function performs ten-fold cross-validation
cv.ridge=cv.glmnet(x, y, alpha=0, standardize = TRUE)
plot(cv.ridge, main="Cross-validation")
bestlam = cv.ridge$lambda.min
min(cv.ridge$cvm)
coef(cv.ridge, s = "lambda.min")


