
GENE-BASED ANALYSIS OF GENETIC ASSOCIATION WITH HDL: APPLICATION OF VARIANCE INFLATION FACTORS AND RIDGE REGRESSION

LUSI YANG
SUPERVISOR: DR. SHELLEY BULL
UNIVERSITY OF TORONTO
MARCH 16, 2017

Abstract

The paper reports a gene-based analysis of genetic association with high-density

lipoprotein (HDL). Low HDL is a risk factor for cardiovascular disease in the general population.

The two objectives of the study are: (1) gene-based testing using multiple linear regression of

multiple SNPs in a gene and (2) assessing the full multiple regression model using variance

inflation factor (VIF) and ridge regression. The gene of interest in this analysis is the CETP gene

because it has a known association with HDL in the general population, an association that has also been reported in

a previous analysis of the type 1 diabetes population (Teslovich et al., 2010; Yoo et al., 2017).

When SNPs within a gene are correlated, multiple linear regression may suffer from

multicollinearity. To assess multicollinearity, VIFs were calculated for each SNP, and the SNPs

that had large VIFs were removed from the full model that included all SNPs of CETP as the

covariates. Ridge regression is another method for assessing multicollinearity, which penalizes

the correlated SNPs in the gene. By using cross-validation in R, we found the tuning parameter λ

that minimized the mean squared error (MSE). For VIF, the global hypothesis tests of

association in the full and reduced models for CETP are not very different and inference from the

full model (i.e. the global test) appears to be unaffected by the level of multicollinearity among

the SNPs in CETP. For ridge regression, the λ that minimized the MSE had approximately the

same MSE as the full model, which implies that the full model was unaffected by

multicollinearity.


Introduction

In genome-wide association studies (GWAS), researchers scan a large number of

single-nucleotide polymorphisms (SNPs) and test them one by one to detect the association

between each SNP and the disease traits of interest. This is called the single-SNP approach.

Alternatively, gene-based analysis, or the analysis of multiple SNPs, studies the association at the

gene level rather than at the SNP level. To test for the significance of a gene, a multi-SNP global

statistic is constructed to test if there is an association between the SNPs of a gene and a

particular disease trait. One advantage of gene-level testing is the reduction in the number of tests of

association. The analysis of multiple SNPs within a gene is usually conducted by multiple linear

regression, with the SNPs as the covariates.

Global statistics have been developed to test the combined effects of SNPs with certain

quantitative traits. In the classical regression case, the multi-SNP global statistic has degrees of

freedom (df) that correspond to the number of SNPs in a gene. In a recent study of gene

association, multiple linear combination (MLC) regression for regional testing has been proposed

to reduce the df of the global test statistics (Yoo et al., 2017). The reduced df statistics can

improve both power and robustness in gene association testing, as Yoo et al. (2017) found for the

MLC test statistic. The MLC test adapts to the linkage disequilibrium structure of SNPs in

a gene by partitioning and recoding SNPs into bins of positively correlated SNPs. The df of the

MLC test statistic equals the number of bins constructed.
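As a rough sketch of the partitioning idea (a simplification, not the exact MLC recoding; the adjacent-SNP correlations below are made up for illustration and shown in Python rather than R), adjacent SNPs can be grouped greedily:

```python
# Hypothetical correlations between adjacent SNPs (NOT the real CETP values).
adj_corr = [0.9, 0.3, 0.8, 0.7, 0.2, 0.6, 0.1, 0.9, 0.4]

def bin_adjacent_snps(adj_corr, threshold=0.5):
    """Greedily group adjacent SNPs whose correlation meets the threshold;
    each resulting bin contributes one df to the global test."""
    bins = [[1]]                    # SNP 1 starts the first bin
    for snp, r in enumerate(adj_corr, start=2):
        if r >= threshold:
            bins[-1].append(snp)    # correlated: extend the current bin
        else:
            bins.append([snp])      # uncorrelated: start a new bin
    return bins

bins = bin_adjacent_snps(adj_corr)
df = len(bins)                      # df of the reduced global statistic
```

With these illustrative correlations, 10 SNPs collapse into 5 bins, so the global test would use 5 df instead of 10.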

To illustrate, the bins of the SNPs from the study are presented in Figure 1. Each colour

represents a bin and each bin contains the positively correlated SNPs from the CETP gene. The

number above each line is the correlation between the adjacent SNPs. Since there are five bins,

the df used in the MLC test equals 5. Figure 1 is a network representing the correlation

among the SNPs, but the edges between SNPs with correlation less than 0.5 have been removed.


Figure 1. Clustering of SNPs in DCCT/EDIC CETP gene data (Yoo et al., 2017)

The purpose of this report is to apply global hypothesis testing of gene-based association

using the methods of variance inflation factor (VIF) and ridge regression. These results will then

be compared to the result from the gene-based MLC test statistic. The motivation of using these

two methods is to assess the consequences of the multicollinearity in the multiple regression

models because SNPs from a gene can be correlated. Each method may suggest different

importance of SNPs in a gene, and may provide different information on the association of the

gene with the quantitative trait of interest. Because the df equals the number of SNPs in the model,

the df of the VIF-based test depends on which SNPs are removed from the reduced regression due to

multicollinearity. This method uses the typical likelihood ratio test (comparison between the null

model and the reduced model) to investigate the association between the gene and the

quantitative trait of interest. On the other hand, ridge regression does not reduce the number of df.

It rather penalizes the coefficients of highly correlated SNPs in the model by shrinking them

close to zero, but does not remove them from the model. By cross-validation, ridge regression

chooses the tuning parameter λ that gives the smallest mean squared error (MSE). If this MSE is

close to the MSE of the multiple regression with all SNPs as covariates when λ is close to zero,

then this suggests the multiple regression is not affected by correlation among the SNPs,

and is adequate for gene-based association analysis.


Materials and Methods

The Sample and the Quality of the Data.

[Figure 2 flow chart: Stage 1 — The Diabetes Control and Complications Trial (DCCT), a randomized clinical trial; Stage 2 — Epidemiology of Diabetes Interventions and Complications (EDIC), a follow-up of DCCT patients; Stage 3 — DCCT/EDIC Genetic Study, a study of the association of SNPs in a large set of candidate genes with complications of T1D.]

Figure 2. The flow chart of the data collection stages

The dataset used in this study was collected in several stages. The first stage was The

Diabetes Control and Complications Trial (DCCT) from 1983 to 1993, which was a randomized

clinical trial that was designed to compare intensive to conventional insulin therapy in type 1

diabetes patients and determine whether the complications of type 1 diabetes (T1D) could be

prevented or delayed (Al-Kateb et al., 2008; Nathan, 2013). In this stage, 1,441 subjects with

T1D were enrolled. In the second stage, the Epidemiology of Diabetes Interventions and

Complications (EDIC) was a follow-up study of 90 percent of the DCCT cohort, designed to

study the durability of the DCCT effects on more advanced stages of diabetes complications,

including cardiovascular disease (NIH, 2008; Nathan, 2014). The DCCT/EDIC Genetic Study was

then derived from the previous two stages. This study investigated the association of SNPs in a

large set of candidate genes with complications of T1D of 1,362 white probands (Yoo et al.,

2017). The participants were genotyped by a custom Illumina GoldenGate Beadarray assay,

which consists of 1,213 SNPs in 183 candidate genes. Each candidate gene contains more

than one SNP. The inclusion criterion of the Genetic Study was white patients from the DCCT. The

exclusion criterion of the Genetic Study was individuals with missing genotype data or

discrepant sex based on standard quality control procedures (Al-Kateb et al., 2008).


For this study, we are interested in the trait high-density lipoprotein (HDL) because

having low HDL is a risk factor for cardiovascular disease in the general population. There is a

known association between HDL and the gene CETP in the general population, which has also been

reported in a previous analysis of the same T1D data (Teslovich et al., 2010; Yoo et al., 2017). The

HDL data were obtained at the DCCT baseline in stage 1 of the flow chart. The SNP data for the

gene CETP were obtained from the custom Illumina GoldenGate Beadarray assay in stage 3 of

the flow chart. There are 10 SNPs in the gene CETP, and each SNP has genotype coding 0, 1, or

2.¹ HDL is the dependent variable and the 10 SNPs are the covariates of this study. There are

1,362 individuals in the study, and there is no missing data.

Descriptive Analysis of the Data

Figure 3. The histograms of the 10 SNPs of the gene CETP

Figure 3 presents the 10 SNPs from the CETP gene, and each SNP takes value 0, 1, or 2. Most of

the histograms are right skewed as expected based on the definition of minor alleles.¹

1. Genotype classification: 0 = 0 copies of the minor allele (e.g. AA); 1 = 1 copy of the minor allele (e.g. AT); 2 = 2
copies of the minor allele (e.g. TT). A minor allele is the allele with frequency <50% in the sample.
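The coding can be sketched as follows (hypothetical genotypes at a single SNP, in Python for illustration; the counting mirrors the R code in the appendix):

```python
from collections import Counter

# Hypothetical allele pairs for five subjects at one SNP.
genotypes = [("A", "A"), ("A", "T"), ("T", "T"), ("A", "A"), ("A", "T")]

# The minor allele is the less frequent allele in the sample.
allele_counts = Counter(a for pair in genotypes for a in pair)
minor = min(allele_counts, key=allele_counts.get)   # "T": 4 of 10 copies

# Code each subject as the number of minor-allele copies (0, 1, or 2).
coded = [pair.count(minor) for pair in genotypes]
```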



Figure 4. The box plots of HDL and log(HDL)

The original HDL data were not log transformed and were slightly right skewed, as shown in

the box plot on the left, with more points in the upper tail than the log-transformed HDL on the

right. The log-transformation made the distribution of HDL roughly symmetric.

Figure 5. The Box plots of HDL for 10 SNPs


There are several outliers in the box plots for each SNP. Most of the box plots are symmetric and

they indicate that the log transformation of HDL is adequate.

Regression Models. Suppose there are m SNPs in a gene, and they can be analyzed through a

multiple regression. We denote the genotypes of the m SNPs as X1, X2, X3, …, Xm. Each Xi represents the

count of minor alleles for the ith SNP, and can be classified as 0, 1, or 2. Let y be the quantitative

trait of interest, such as HDL. The multiple linear regression can be set up in matrix form as

y = Xβ + ε. (1)

X is an n × (m+1) design matrix, n is the number of observations, m is the number of

independent variables (the number of SNPs in a gene), and the X matrix includes the intercept. β is

an (m+1) × 1 vector of coefficients we want to estimate. ε is an n × 1 error vector. y is an n × 1

vector of observations on the dependent variable (HDL). The coefficients β are estimated by

minimizing the sum of squared residuals. To find the least squares estimates, we used the

function lm() in R.
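As a numerical sketch of the least squares step (synthetic data, not the DCCT/EDIC genotypes; Python rather than R, for illustration):

```python
import numpy as np

# Tiny synthetic example: n = 6 subjects, m = 2 SNPs coded 0/1/2.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 2, 1],
              [1, 0, 0],
              [1, 1, 2],
              [1, 2, 2]], dtype=float)      # first column is the intercept
beta_true = np.array([0.5, -0.2, 0.1])
y = X @ beta_true                           # noise-free, so the fit is exact

# Least squares solves the normal equations (X'X) beta = X'y;
# lm() in R computes the same estimate (via a QR decomposition).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```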

For the global hypothesis of gene-based association, we can write the following,

H₀ : β₁ = β₂ = … = β_m = 0 vs. H₁ : at least one β_j ≠ 0. (2)

To test the global hypothesis, we fitted an intercept-only model and then a full model with all the

SNPs. We then used the function anova() in R to conduct the likelihood ratio test.² If the p-

value is significant, we reject the null that the intercept-only model is true.
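The global test can be sketched numerically. The report uses anova(..., test="Chisq") in R; the sketch below uses the equivalent F-statistic form on simulated data (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
# Simulated 0/1/2 genotypes plus an intercept column (illustrative only).
X = np.column_stack([np.ones(n), rng.integers(0, 3, size=(n, m))]).astype(float)
beta = np.array([0.0, 0.4, 0.0, -0.3])
y = X @ beta + rng.normal(scale=0.5, size=n)

def rss(X, y):
    """Residual sum of squares of the least squares fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

rss_null = rss(X[:, :1], y)     # intercept-only model
rss_full = rss(X, y)            # model with all m SNPs
# F statistic for H0: beta_1 = ... = beta_m = 0 (df = m and n - m - 1)
F = ((rss_null - rss_full) / m) / (rss_full / (n - m - 1))
```

A large F (small p-value) rejects the intercept-only model, exactly as the chi-squared likelihood ratio comparison does.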

A linear regression model assumes that there is a linear relationship between response and

explanatory variables, that the errors are i.i.d. normal, and that the error variance is constant. In addition, influential

points and multicollinearity can affect inference of coefficients. Since SNPs are from

2. For instance, we can use anova(fitnull, fit, test="Chisq"), where the likelihood ratio test is a chi-squared test.


the same gene, they may be correlated and cause multicollinearity. Multicollinearity can produce

unstable parameter estimates with inflated standard errors. To reduce the severity of

multicollinearity, we propose the variance inflation factor (VIF) and ridge regression.

Variance Inflation Factor.³ Suppose the model has only one covariate x_j, such that

Y_i = α + β_j x_ij + ε_i , (3)

where the ε_i are i.i.d. normally distributed errors. The variance of the estimated coefficient β̂_j is then at its minimum,

var(β̂_j)_min = σ² / Σ_i (x_ij − x̄_j)² . (4)

Suppose now the model is a multiple linear regression with covariates that are correlated with the

covariate x_j:

Y_i = α + β_1 x_i1 + β_2 x_i2 + … + β_j x_ij + … + β_m x_im . (5)

The variance of the estimated coefficient β̂_j becomes

var(β̂_j) = [σ² / Σ_i (x_ij − x̄_j)²] · [1 / (1 − R_j²)] , (6)

where R_j² is obtained from regressing the jth covariate on the remaining covariates. The greater

the linear dependence, the larger R_j² and the more inflated the variance of β̂_j.

The VIF is simply the ratio of the two variances,

VIF_j = var(β̂_j) / var(β̂_j)_min = 1 / (1 − R_j²) . (7)

VIFs can be calculated for all the covariates in the linear regression model.

3. VIF explained based on the online lecture notes at https://onlinecourses.science.psu.edu/stat501/node/347


Researchers choose the thresholds used to decide when multicollinearity must be reduced. There are

no universally correct thresholds; however, they generally range from 4 to 10 (O'Brien, 2007). These

correspond to R² values from 0.75 to 0.9, which indicate a high linear dependence between the covariate of

interest and the remaining covariates.
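A numerical sketch of equation (7) (simulated covariates, Python for illustration; the report itself uses vif() from R's car package):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                    # unrelated covariate
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j
    on the remaining columns plus an intercept."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    b = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    resid = X[:, j] - others @ b
    r2 = 1.0 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

Here the collinear pair x1, x2 gets large VIFs while the independent x3 stays near 1.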

A function to calculate VIF values can be found in R's car package, and it is denoted vif().

To use vif() to assess multicollinearity, several steps are involved.

Step 1. Fit a full model with all SNPs in the linear regression by using the lm() function in R.

Let's call this linear regression fit.

Step 2. Apply the vif() function for fit: vif(fit). Check which covariate has the largest VIF.

Step 3. Delete the covariate that has the largest VIF, and fit the model again and call this new

model model1. Then check which covariate has the largest VIF. This deletion will continue until

VIFs of all the remaining covariates are less than a reasonable threshold.
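The three steps can be sketched as a loop (simulated data; a simplified Python stand-in for the repeated lm()/vif() calls in R, with an assumed threshold of 4):

```python
import numpy as np

def vif_all(X):
    """VIF of every column of X, each from an auxiliary regression
    of that column on the others (with an intercept)."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        b = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ b
        r2 = 1.0 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

def drop_high_vif(X, names, threshold=4.0):
    """Steps 1-3: repeatedly drop the covariate with the largest VIF
    until all remaining VIFs fall below the threshold."""
    X, names = X.copy(), list(names)
    while True:
        v = vif_all(X)
        if v.max() < threshold:
            return X, names
        worst = int(v.argmax())
        X = np.delete(X, worst, axis=1)
        del names[worst]

rng = np.random.default_rng(2)
n = 400
a = rng.normal(size=n)
X = np.column_stack([a, a + 0.05 * rng.normal(size=n), rng.normal(size=n)])
kept_X, kept = drop_high_vif(X, ["SNP1", "SNP2", "SNP3"])
```

One of the two collinear columns is dropped; the independent column survives.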

Ridge Regression. The linear regression model can be set up in matrix form as in equation (1). The

regression coefficients can be estimated by using the formula

β̂ = (XᵀX)⁻¹XᵀY, (8)

E(β̂) = β, (9)

where the ordinary least squares estimate of β is unbiased. In ridge regression, however, instead of

minimizing the sum of squared residuals, we minimize the penalized residual sum of squares

(equation 11); thus, the coefficient estimate changes to

β̂_ridge = (XᵀX + λI)⁻¹XᵀY. (10)

PRSS(β) = (Y − Xβ)ᵀ(Y − Xβ) + λβᵀβ. (11)

E(β̂_ridge) = (XᵀX + λI)⁻¹XᵀX β ≠ β, (12)


where equation 12 shows that the ridge estimator β̂_ridge is biased. There nevertheless exists a λ that can minimize

the mean squared error. There are several steps to implement the selection of λ in R.⁴
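Before turning to glmnet, equation (10) can be checked directly; a minimal numerical sketch (synthetic collinear data, no intercept, Python for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])  # collinear pair
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=n)

def ridge(X, y, lam):
    """beta_ridge = (X'X + lam * I)^{-1} X'y, as in equation (10)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_ols = ridge(X, y, 0.0)        # lam = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 100.0)    # a larger lam shrinks the coefficients
```

As λ grows, the coefficient vector shrinks toward zero, which is the behaviour glmnet displays across its λ path.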

Step 1. Install the glmnet package, and use the model.matrix() function to construct an x matrix. This x

matrix consists of all the columns of the covariates in the dataset and an intercept column. Also,

y, the dependent variable, needs to be specified from the data.

Step 2. Use the glmnet() function in the package to specify that a ridge regression is performed by

setting alpha=0. Setting standardize=TRUE allows the glmnet() function to standardize the x

variables in the dataset. If we only want 20 lambdas, we should set nlambda=20. For instance,

fit.ridge=glmnet(x, y, alpha=0, nlambda=20, standardize = TRUE).

Step 3. Then, typing fit.ridge as in the example above, one obtains the number of degrees of freedom,

%dev (the percentage of deviance explained), and the lambdas in descending order. If one wants to see

how each coefficient behaves based on different lambdas, one can input plot(fit.ridge). The

general trend is that as lambda gets bigger, there is more penalty on the coefficients, which

shrink closer to zero, but never exactly to zero.

Step 4. Cross-validation is the general method to choose the tuning parameter λ in R. The

function is cv.glmnet(), and by default the function performs ten-fold cross-validation. This

function will divide the dataset into 10 blocks. Then 9 blocks of the data become the training

data, and 1 block of the data becomes the validation set. The coefficients are estimated from the

9 training blocks, and the estimated coefficients are then used to predict

the validation set. The prediction error for this λ is then estimated. There are 10 ways of dividing

the training data and the validation set; thus, 10 errors will be generated. R will calculate the

average error for this λ, which is called the cross-validated mean squared error (MSE).

4. Information on the glmnet package is available at https://cran.r-project.org/web/packages/glmnet/glmnet.pdf


By default, this is done for 100 λs in R. We want to choose the λ that corresponds to the

minimized cross-validated error, or MSE. In R, the best λ is chosen with

cv.ridge$lambda.min, where cv.ridge is the object returned by the function cv.glmnet(). If one wants to

locate where lambda.min falls in the λ sequence, one can type cv.ridge$lambda. Then type cv.ridge$cvm

to see this λ's corresponding cross-validated error, or MSE. One can also verify this λ's cross-

validated error by using min(cv.ridge$cvm).

Step 5. After obtaining the λ, inspect the coefficients with coef(cv.ridge, s = "lambda.min"). The

final λ varies slightly between runs because the fold assignment in the cross-validation procedure in R is random.
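The cross-validation loop behind cv.glmnet can be sketched by hand (simulated data; the λ grid, fold count, and coefficients below are illustrative, in Python rather than R):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=10):
    """k-fold cross-validated MSE for one lambda (cf. cv.glmnet)."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for hold in folds:
        train = np.setdiff1d(np.arange(len(y)), hold)
        b = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[hold] - X[hold] @ b) ** 2))
    return float(np.mean(errors))       # average over the k validation sets

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
mses = [cv_mse(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(mses))]   # analogue of lambda.min
```

The heavily penalized λ = 100 shrinks the true signal away and loses badly, so the selected λ is a smaller value, mirroring what cv.ridge$lambda.min reports.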

Results

The main interest of this study is to use the variance inflation factor (VIF) and ridge

regression to assess the consequences of the multicollinearity in the multiple regression models.

To test for the global hypothesis of gene-based association, we compared the model without any

covariates and the full model. The result was significant (p-value = 2.181E-11), which implies

that we reject the null that all βs are 0. Thus, we conclude there is a strong association between

the CETP and the quantitative trait HDL. In Table 1, a full model with all SNPs of the gene

CETP is presented.

Table 1. Regression Analysis Results: The Full Model


Coefficients Estimate Standard Error 95% CI* t-value
Intercept -0.108 0.053 (-0.212, -0.004) -2.025
SNP 1 -0.040 0.022 (-0.083, 0.003) -1.784
SNP 2 0.035 0.026 (-0.016, 0.086) 1.365
SNP 3 0.0005 0.027 (-0.052, 0.053) 0.019
SNP 4 -0.008 0.028 (-0.063, 0.047) -0.296
SNP 5 -0.063 0.019 (-0.100, -0.026) -3.280
SNP 6 -0.011 0.016 (-0.042, 0.020) -0.680
SNP 7 0.011 0.028 (-0.044, 0.066) 0.400
SNP 8 0.048 0.025 (-0.001, 0.097) 1.896
SNP 9 0.003 0.025 (-0.046, 0.052) 0.135
SNP 10 -0.018 0.031 (-0.079, 0.043) -0.588
*CI = Confidence Interval


To check the validity of the model, we plotted the diagnostic plots in Figure 6 from R by

using the plot function. The residuals in the Residuals vs Fitted plot bounce randomly around the 0 line, which

indicates that the covariates and the response variable have a linear relationship. The residuals

roughly form a horizontal band around the 0 line, which suggests that the variances of the error

terms are equal. This plot also suggests that there are no outliers. The Q-Q plot is close to

diagonal, which means that the residuals are normally distributed. The scale-location plot checks

for homoscedasticity. Since the residuals are spread equally and show no discernible pattern, the

homoscedasticity assumption holds. The Cook's distance plot shows there are no influential

cases because the Cook's distance of each observation is relatively small. Thus, this model

passed the assumption tests of a multiple linear regression. However, this model may be affected

by multicollinearity. We used VIF to help evaluate this.

Figure 6. The Diagnostic plots of the Final Model


Variance Inflation Factor (VIF).

Table 2. The VIFs


Covariate Model 1 Model 2 Model 3
SNP 1 1.89 1.78 1.78
SNP 2 8.137 - -
SNP 3 7.462 2.25 2.24
SNP 4 6.244 2.358 2.343
SNP 5 2.579 2.530 2.530
SNP 6 2.887 2.827 2.825
SNP 7 4.011 4.01 1.312
SNP 8 2.046 1.938 1.936
SNP 9 6.838 6.837 -
SNP 10 6.984 6.26 2.29

In Table 2, Model 1 consists of all the SNPs from the CETP gene; Model 2 consists of the

remaining SNPs after detecting high VIF in SNP 2 in Model 1; Model 3 consists of the remaining

covariates from Model 2 except SNP 9 because of its high VIF in Model 2.

Since SNP 2 had the highest VIF in Model 1, we regressed SNP 2 on the remaining SNPs

in R. The resulting R² is 0.8771, which means there is a high linear dependence between the

predictor SNP 2 and the remaining SNPs. Thus, we removed SNP 2 from the model. For Model 2,

we fitted a linear regression model without SNP 2, and all the VIFs decreased. However, SNP 9

had the largest VIF, and the R² for regressing SNP 9 on the remaining SNPs was 0.8537. This

also suggested that SNP 9 had a high linear dependence with the other SNPs; thus, it was

removed. For Model 3, all VIFs were relatively small. Hence, the multicollinearity was

reduced, and the reduced model includes only the remaining 8 SNPs. The diagnostic plots of the

reduced model were similar to the diagnostic plots of the full model. Thus, the reduced model

also passed all the linear regression assumptions. The estimates of the reduced model are

presented in Table 3.


Table 3. Regression Analysis Results: The Reduced Model
Coefficients Estimate Standard Error 95% CI t-value
Intercept -0.039 0.018 (-0.074, -0.004) -2.170
SNP 1 -0.032 0.022 (-0.075, 0.011) -1.502
SNP 3 -0.030 0.015 (-0.0594, -0.0006) -2.056
SNP 4 -0.039 0.017 (-0.072, -0.005) -2.253
SNP 5 -0.067 0.019 (-0.104, -0.030) -3.503
SNP 6 -0.014 0.016 (-0.045, 0.017) -0.881
SNP 7 0.015 0.016 (-0.016, 0.046) 0.895
SNP 8 0.056 0.025 (0.007, 0.105) 2.277
SNP 10 -0.029 0.018 (-0.064, 0.006) -1.596

To test for the global hypothesis of gene-based association, we compared the model

without any covariates and the final model. The result was significant (p-value = 5.603E-12),

which implies that we reject the null that all βs are 0.

Ridge Regression. R, by default, generates 100 lambdas for both the glmnet and cv.glmnet

functions. Applying plot() for the glmnet function, the following plot was generated.

Figure 7. Coefficients at different lambda values


The plot above shows that as lambda increases, there is more penalty on the

coefficients. Another way of saying this is that the coefficients shrink towards zero as lambda

value increases. Applying plot() for the cv.glmnet, the following plot was generated.

Figure 8. Mean-Squared Error for the 100 lambda values

Figure 8 is a plot with all the mean-squared errors of each lambda calculated by R. The plot

suggests that around log(λ) = −3, lambda has the smallest mean-squared error. To find this lambda,

cv.ridge$lambda.min was applied, and the result was 0.052. The corresponding mean cross-

validated error, or mean-squared error, is 0.055. The resulting regression coefficients are:

Table 4. The Coefficient Estimates at λ = 0.052


Coefficients Estimate
Intercept -0.104
SNP 1 -0.026
SNP 2 0.025
SNP 3 -0.006
SNP 4 -0.020
SNP 5 -0.047
SNP 6 -0.006
SNP 7 0.015
SNP 8 0.031
SNP 9 0.003
SNP 10 -0.010


The coefficient estimates of SNPs 3, 6, 9 and 10 above are close to zero. In ridge regression, they

are the smallest coefficients in the model. This indicates that the ridge regression down-weighted

the variables that were less important or had no association with HDL.

Discussion

Table 5. The Coefficient Estimates Comparison


Coefficients Full Model Reduced Model Ridge Regression
Intercept -0.108 -0.039 -0.1038
SNP 1 -0.040 -0.032 -0.026
SNP 2 0.035 - 0.025
SNP 3 0.0005 -0.030 -0.00565
SNP 4 -0.008 -0.039 -0.02
SNP 5 -0.063 -0.067 -0.047
SNP 6 -0.011 -0.014 -0.0057
SNP 7 0.011 0.015 0.015
SNP 8 0.048 0.056 0.031
SNP 9 0.003 - 0.0026
SNP 10 -0.018 -0.029 -0.0095

The coefficient estimates of the three models are presented in Table 5. The full model has

all the 10 SNPs in the CETP gene. It was compared to a null model with no covariates, and the

likelihood ratio test was significant with p-value equal to 2.181E-11. This implies that there is a

strong association between CETP and the quantitative trait HDL. In the reduced model, after

using VIF to detect multicollinearity, SNP 2 and SNP 9 were both eliminated from the model due

to their high VIFs. This model was compared to the null model, and the likelihood ratio test was

significant with p-value equals to 5.603E-12. Essentially, after removing highly correlated

covariates, the reduced model has a similarly small p-value. Lastly, in the ridge regression model,

the estimates for SNP 3, SNP 6, SNP 9, and SNP 10 are close to zero at the MSE-minimizing λ,

which implies that these SNPs do not improve the prediction. Referring back to Figure 1, SNPs 3

and 6 are correlated with each other and they are both penalized. SNPs 9 and 10 both are

correlated with some other SNPs in the gene; thus, they are also penalized by the ridge. The


lambda 0.052 (log(λ) = −2.96) had the smallest mean squared error. Based on Figure 8, as λ

approaches 0, the mean squared error remains roughly constant. This indicates that the MSE of the

ridge with lambda 0.052 is roughly the same as the MSE of an ordinary linear regression. Thus,

the full model is adequate to test for the association between the gene and HDL.

Yoo et al. (2017) presented the MLC regression method. This method had fewer degrees

of freedom than other approaches in testing the gene association. The method divided the 10

SNPs into 5 clusters as shown in Figure 1. The MLC test combined the SNPs that are correlated

by putting them into the same cluster and then averaging the regression coefficients. The null

hypothesis of the test was that there is no association of any cluster and the alternative is that at

least one cluster has an association with HDL. The small p-value of the MLC test statistic

indicates a strong association between the gene and HDL (p-value of 3.69E-12). As a result, the

reduced model, VIF final model, the ridge regression, and the MLC all suggest that there is an

association between the gene CETP and the HDL.

The impact of this study is to assess linear regression methods for gene-based association

analysis. VIF and ridge regression are techniques to assess multicollinearity in the multiple

linear regression model. Because SNPs within a gene are likely to be correlated with each

other, multiple linear regression may suffer from multicollinearity. However, the global

hypothesis test of association in the full and reduced models for CETP are not very different and

inference from the full model (i.e. the global test) appears to be unaffected by the level of

multicollinearity among the set of SNPs in CETP. Similarly, ridge regression is a shrinkage

method that penalizes the coefficients of highly correlated covariates in the model. Ridge regression chooses

the λ with the smallest MSE, and if this MSE is close to the MSE of a linear regression (i.e. when λ

is close to zero), then this implies that the full model is adequate to test for association. Thus,


both methods serve as a tool to confirm the validity of the full multiple regression in testing

genetic association.

One of the major weaknesses of this study is that the full model only includes the SNPs of

the CETP gene. Other variables such as age and sex are also associated with HDL and could have

been included in this analysis to explain variation in HDL. According to a former study of HDL

cholesterol, HDL increases with age in men but not in women (Ferrara et al., 1997). Another

limitation of this study is that the remaining 182 genes, with 1,203 SNPs, could be studied

similarly from the candidate genes dataset.

To conclude, the VIF and ridge regression have shown that the full model with 10 CETP

SNPs is appropriate to study the association of the CETP gene and HDL. These are methods that

can be used to evaluate the validity of full multiple regression for a gene-based genetic

association analysis. For further studies, other patient characteristic data such as age and sex

could be incorporated to explain variation in HDL. Furthermore, we could also investigate the least

absolute shrinkage and selection operator (LASSO) method, because this method also deals with

highly correlated predictors by selecting only one of them and shrinking the others to zero.

LASSO also chooses the λ that minimizes the MSE through cross-validation. If this MSE is close to

the MSE of a linear regression (i.e. when λ is close to zero), this would imply that the full model

is adequate to test for association.
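Under an orthonormal design, the contrast between the two penalties reduces to a one-line operator; a small sketch (illustrative numbers, not fitted to any data):

```python
def soft_threshold(z, lam):
    """Lasso update for one coefficient under an orthonormal design:
    shrink by lam and clip small values to exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

ols = [0.9, -0.2, 0.05]                      # hypothetical OLS coefficients
lasso = [soft_threshold(b, 0.3) for b in ols]
ridge = [b / (1.0 + 0.3) for b in ols]       # ridge rescales, never zeroes
```

The lasso sets the small coefficients exactly to zero (variable selection), while ridge only scales all of them toward zero.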


Appendices

References

Al-Kateb, H., Boright, A. P., Mirea, L., Xie, X., Sutradhar, R., Mowjoodi, A., . . . Paterson, A. D.
(2007). Multiple Superoxide Dismutase 1/Splicing Factor Serine Alanine 15 Variants Are
Associated With the Development and Progression of Diabetic Nephropathy: The
Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and
Complications Genetics Study. Diabetes, 57(1), 218-228. doi:10.2337/db07-1059

Detecting Multicollinearity Using Variance Inflation Factors. (2017). Retrieved from


https://onlinecourses.science.psu.edu/stat501/node/347

Ferrara, A., Barrett-Connor, E., & Shan, J. (1997, July 01). Total, LDL, and HDL Cholesterol
Decrease With Age in Older Men and Women. Retrieved March 17, 2017, from
http://circ.ahajournals.org/content/96/1/37

Friedman, J., Hastie, T., & Tibshirani, R. (2016, March 17). Package 'glmnet'. Retrieved from
https://cran.r-project.org/web/packages/glmnet/glmnet.pdf

Hastie, T., & Qian, J. (2014, June 26). Glmnet Vignette. Retrieved from
https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2015). An introduction to statistical learning
with applications in R. New York, NY: Springer.

Nathan, D. M., & Group, F. T. (2014, January). The Diabetes Control and Complications
Trial/Epidemiology of Diabetes Interventions and Complications Study at 30 Years:
Overview. Retrieved March 23, 2017, from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3867999/

National Institutes of Health. (2008, May). DCCT and EDIC: The Diabetes Control and
Complications Trial and Follow-up Study. Retrieved from
https://www.niddk.nih.gov/about-niddk/research-areas/diabetes/dcct-edic-diabetes-control-complications-trial-follow-up-study/Documents/DCCT-EDIC_508.pdf

O'Brien, R. M. (2007). A Caution Regarding Rules of Thumb for Variance Inflation Factors.
Quality & Quantity, 41(5), 673-690. doi:10.1007/s11135-006-9018-6

Teslovich, T. M., Musunuru, K., Edmondson, A. V., & Stylianou, A. C. (2010). Biological,
clinical and population relevance of 95 loci for blood lipids. Nature, 707-713.

Yoo, Y. J., Sun, L., Poirier, J. G., Paterson, A. D., & Bull, S. B. (2017). Multiple linear
combination (MLC) regression tests for common variants adapted to linkage
disequilibrium structure. Genetic Epidemiology. 108-121. doi: 10.1002/gepi.22024
Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/gepi.22024/epdf


The R Code

The Dataset Code

tol=1e-32
#dir="C:/Users/yyoo/Dropbox/genebased/Rprog/"
dir="M:/lyang/Practicum Dataset/HDL_Candidate_genes/taleban_QT_example/single genes/"
pedfile="example_clb_hdl_3.ped"
mapfile="example_3.map"
#LDfile="mydata.LD"

# sep="\t" for tab delimited files


ped_a=read.table(paste(dir,pedfile,sep=""),header=F)
map_a=read.table(paste(dir,mapfile,sep=""),header=F,
col.names=c("chr","rs","cM","bp","gene"))
#LD=read.table(paste(dir,LDfile,sep=""),header=F, col.names=c("SNP.1","SNP.2","r"))
#LD=cbind(LD,(LD$r)^2)
#dimnames(LD)[[2]]=c("SNP.1","SNP.2","r","r2")
#instead of reading LD file, now this code calculates r2 from haplotype data

#ped<-ped_a[,c(1:58,61:64)]
#map<-map_a[c(1:26,28:29),]

ped<-ped_a[,c(1:12,15:64)]
map<-map_a[c(1:3,5:29),]

ped<-ped_a
map<-map_a

##############################

#loop for each gene


genelist=unique(map$gene)
result=NULL
# 1:length(genelist)
for(i in 1:length(genelist)){
#for(i in 1:1){
snpindex=c(1:length(map$gene))[map$gene==genelist[i]]
snpall=as.character(map$rs[map$gene==genelist[i]])
datacolumn=c(6,((2*snpindex[1]+5):(2*tail(snpindex,1)+6)))
dataraw=ped[,datacolumn]
genodata=matrix(0,length(dataraw[,1]),length(snpindex))

#minor allele coding


mALvector=NULL
MALvector=NULL
for(j in 1:length(snpindex)){

#get minor alleles
alleles=factor(c(as.character(dataraw[,2*j]),as.character(dataraw[,(2*j+1)])))
acount=as.data.frame(table(alleles))
mAL=as.character(acount$alleles[order(acount$Freq)==1] )
MAL=as.character(acount$alleles[order(acount$Freq)==2] )
mALvector=c(mALvector,mAL)
MALvector=c(MALvector,MAL)
#counting of alleles
for(n in 1:length(dataraw[,1])){
if(dataraw[n,2*j]==mAL)genodata[n,j]=genodata[n,j]+1
if(dataraw[n,2*j+1]==mAL)genodata[n,j]=genodata[n,j]+1
}
}
SNPs=paste("SNP",snpindex,sep="")
#######datanew=data.frame(cbind(dataraw[,2],genodata))
datanew=data.frame(cbind(dataraw[,1],genodata))
dimnames(datanew)[[2]]=c("pheno",SNPs)
}
attach(datanew)
y = datanew$pheno

Data Analysis

# Histograms
par(mfrow = c(2,5))
i=1
for(i in 1:10) {
barplot(table(datanew[i+1]), main = paste("Histogram of SNP", i, sep = ""), ylim=range(0,
1200), xlab="Genotype")
}

# Table for Genotype Proportion and Minor Allele Frequency


store = matrix(nrow = 10, ncol = 4)
for(i in 1:10){
z = datanew[2:11][i]
t = prop.table(table(z))
MAF = t[3] +t[2] * 0.5
store[i,] = cbind(t(t), t(MAF))
}
nth = c(1:10)
rownames(store) = paste("SNP", nth, sep="")
colnames(store) = c("0","1", "2", "MAF")

# Correlation of the SNPs


cor(datanew[2:11])

# Box Plot

# Summary for HDL
par(mfrow = c(1,2))
boxplot(exp(datanew$pheno), xlab="", ylab="HDL", main="HDL Box plot")
boxplot(datanew$pheno, xlab="", ylab="log(HDL)", main="log(HDL) Box plot")
summary(datanew$pheno)
par(mfrow = c(2,5))
i=1
for (i in 1:10)
{
boxplot(datanew$pheno[datanew[i+1]==0],
datanew$pheno[datanew[i+1]==1],datanew$pheno[datanew[i+1]==2],boxwex=0.5, main =
paste("log(HDL) SNP", i, sep=""),ylab= "log(HDL)", xlab="Genotype", names=c("0","1","2"))
}

# Model Selection
# Multiple Linear Regression
nth = c(1:10)
colnames(datanew) = c("y", paste("SNP", nth, sep=""))
fit = lm(y~SNP1+SNP2+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10,
data=datanew)
fitnull = lm(y~1, data=datanew)
anova(fitnull, fit, test="Chisq")
#Diagnostic Plots
par(mfrow=c(2,2))
plot(fit,which = 1:4)

# Multicollinearity treatment with VIF


library(car)
vif(fit)
model1 = lm(y~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10, data=datanew)
vif(model1)
checksnp2 = lm(SNP2~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10,
data=datanew)
model2 = lm(y~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP10, data=datanew)
vif(model2)
checksnp9 = lm(SNP9~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP10, data=datanew)
finalfit = lm(y~SNP1+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP10, data=datanew)
anova(fitnull, finalfit, test="Chisq")

# Ridge Regression
library(glmnet)
# We choose alpha=0 because it is Ridge Regression.
# glmnet automatically standardizes the x variables
SNP1 = SNP20; SNP2 = SNP21; SNP3 = SNP22; SNP4 = SNP23; SNP5 = SNP24;
SNP6 = SNP25; SNP7 = SNP26; SNP8 = SNP27; SNP9 = SNP28; SNP10 = SNP29; y = pheno;
x = model.matrix(y~SNP1+SNP2+SNP3+SNP4+SNP5+SNP6+SNP7+SNP8+SNP9+SNP10,
data=datanew)

fit.ridge=glmnet(x,y,alpha=0, standardize = TRUE)
lbs_fun <- function(fit.ridge, ...) {
L <- length(fit.ridge$lambda)
x <- log(fit.ridge$lambda[L])
y <- fit.ridge$beta[, L][-1]
labs <- names(y)
text(x, y, labels=labs)
legend('topright', legend=labs, col=1:length(labs), lty=1, cex = 0.65)
}
plot(fit.ridge, xvar="lambda", col = 1:10)
lbs_fun(fit.ridge)

# Cross-validation
# By default, the function performs ten-fold cross-validation
cv.ridge=cv.glmnet(x, y, alpha=0, standardize = TRUE)
plot(cv.ridge, main="Cross-validation")
bestlam = cv.ridge$lambda.min
min(cv.ridge$cvm)
coef(cv.ridge, s = "lambda.min")


