
The Application of

Naive Bayes Model Averaging


to Predict Alzheimer's Disease
from Genome-Wide Data
Wei Wei, Shyam Visweswaran and Gregory F. Cooper

Background
° Genome-wide association studies (GWASs)
° Single-nucleotide polymorphism (SNP)
° High-throughput genotyping technologies
° Alzheimer's disease (AD):
  ° AD afflicts about 10% of persons over 65 and almost half of those over 85
  ° ~5.5 million cases currently in the U.S.
  ° 95% of all AD cases are Late-Onset AD (LOAD)
Data

° Source
  ° TGEN dataset by Reiman et al.*
° Cases
  ° 1411 individuals
  ° 861 LOAD and 550 controls
° SNPs
  ° 312,316 SNPs
  ° Two additional SNPs (rs429358 and rs7412) genotyped separately (these determine APOE status)
____________________________________________________________________
* Reiman E, Webster J, Myers A, Hardy J, Dunckley T, Zismann V, et al. GAB2 alleles modify Alzheimer's risk in APOE epsilon4 carriers. Neuron. 2007;54(5):713-20.
Methods

° Bayesian Model Averaging
  ° Represents uncertainty about the correctness of any given model
  ° Performs inference by weighting the prediction of each model by our uncertainty in that model
° Model-Averaged Naive Bayes (MANB)
  ° MANB efficiently averages over all naive Bayes models (on a given set of variables) in making a prediction for an individual patient case
Naive Bayes (NB)

[Figure: NB model structure. LOAD is the parent node, with an arc to each of SNP 1, SNP 2, SNP 3, ..., SNP 312,318.]
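To make the NB structure concrete, here is a minimal inference sketch (not the authors' implementation); the array shapes, the genotype coding in {0, 1, 2}, and the function names are assumptions for illustration:

```python
import numpy as np

def nb_predict(log_prior, log_cpt, snps):
    """Naive Bayes inference: P(LOAD | snps) is proportional to
    P(LOAD) * prod_i P(snp_i | LOAD), computed in log space.

    log_prior: shape (2,), log P(LOAD = c) for c in {0, 1}
    log_cpt:   shape (n_snps, 2, 3), log P(snp_i = v | LOAD = c)
               for genotype values v in {0, 1, 2} (coding is an assumption)
    snps:      shape (n_snps,), one individual's observed genotypes
    """
    idx = np.arange(len(snps))
    # For each class c, sum log P(snp_i = observed value | LOAD = c) over i.
    loglik = log_cpt[idx[:, None], np.arange(2)[None, :], snps[:, None]].sum(axis=0)
    log_post = log_prior + loglik
    post = np.exp(log_post - log_post.max())   # subtract max for stability
    return post / post.sum()                   # [P(LOAD=0 | snps), P(LOAD=1 | snps)]
```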
Feature Selection Naive Bayes (FSNB)

Perform feature selection using a greedy, forward-stepping search that optimizes the prediction of LOAD (a sketch of such a search follows below).

[Figure: FSNB model structure. LOAD is the parent node, with arcs only to the selected SNPs, e.g., SNP 25,920, SNP 276,455, SNP 104,582, and SNP 1,100.]
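The following is a hedged sketch of a greedy, forward-stepping search of this kind; the scoring function (validation AUC), the stopping rule, and the use of scikit-learn's CategoricalNB are assumptions, not the authors' exact procedure:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import roc_auc_score

def fsnb_select(X_tr, y_tr, X_val, y_val, max_features=10):
    """Greedily add the SNP that most improves validation AUC; stop when
    no single SNP helps. Genotypes are assumed coded as 0/1/2."""
    selected, best_auc = [], 0.5
    while len(selected) < max_features:
        best_j = None
        for j in range(X_tr.shape[1]):          # one pass over every SNP...
            if j in selected:
                continue
            cols = selected + [j]
            nb = CategoricalNB(min_categories=3).fit(X_tr[:, cols], y_tr)
            auc = roc_auc_score(y_val, nb.predict_proba(X_val[:, cols])[:, 1])
            if auc > best_auc:
                best_auc, best_j = auc, j
        if best_j is None:                      # no improvement: stop
            break
        selected.append(best_j)
    return selected
# Each greedy step rescans all ~312,000 SNPs, which is one reason FSNB's
# training time is far higher than NB's or MANB's (see the run-time results).
```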
Model-Averaged Naive Bayes (MANB)

[Figure: MANB model structure. LOAD is the parent node; each of SNP 1, SNP 2, ..., SNP 312,318 may or may not have an arc from LOAD.]
Model-Averaged Naive Bayes (MANB)

MANB averages over all 2^312,318 NB models (Model 1, ..., Model i, ..., Model 2^312,318), one for each subset of arcs from LOAD to the SNPs:

  P(LOAD | data) = Σ_i P(LOAD | data, Model_i) P(Model_i | data)
Model-Averaged Naive Bayes (MANB)

° We can take advantage of the conditional independence relationships in NB models to make it efficient to model average over all of these models.
° The computational "trick" is as follows* (a code sketch follows below):
  ° For each SNP_i, we construct a model-averaged conditional probability, P(SNP_i | LOAD), by averaging over whether or not there is an arc from LOAD to SNP_i. This step can be viewed as a "soft" form of feature selection.
  ° We use these model-averaged conditional probabilities to define a new NB model M over which we now perform NB inference.
  ° Performing inference with M is the same as model averaging over the exponential number of NB models discussed previously.
____________________________________________________________________
* Dash D, Cooper G. Exact model averaging with naive Bayesian classifiers. International Conference on Machine Learning (2002) 91-98.
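A minimal sketch of this trick, in the spirit of Dash and Cooper (2002): for each SNP, weight the "arc present" conditional table against the "arc absent" marginal by the arc's posterior probability. The uniform Dirichlet priors, the Laplace smoothing, and the binary LOAD coding here are simplifying assumptions, not the paper's exact derivation:

```python
import numpy as np
from scipy.special import gammaln

def log_ml(counts, alpha=1.0):
    """Log marginal likelihood of one or more rows of counts under a
    uniform Dirichlet(alpha, ..., alpha) prior."""
    counts = np.atleast_2d(counts).astype(float)
    k = counts.shape[1]
    return np.sum(gammaln(k * alpha) - gammaln(k * alpha + counts.sum(axis=1))
                  + np.sum(gammaln(alpha + counts) - gammaln(alpha), axis=1))

def manb_mixed_cpt(x, y, p_arc, n_values=3):
    """Model-averaged P(x | y): mix the 'arc present' CPT with the
    'arc absent' marginal, weighted by the arc's posterior probability.
    x: one SNP's genotypes (0/1/2); y: LOAD status (0/1)."""
    counts_y = np.array([np.bincount(x[y == c], minlength=n_values)
                         for c in (0, 1)])               # per-class counts
    counts_m = counts_y.sum(axis=0)                      # marginal counts
    # Posterior odds of the arc LOAD -> X: prior odds times the Bayes factor.
    log_odds = (np.log(p_arc) - np.log1p(-p_arc)
                + log_ml(counts_y) - log_ml(counts_m))
    w = 1.0 / (1.0 + np.exp(-log_odds))                  # P(arc | data)
    cpt_arc = (counts_y + 1.0) / (counts_y + 1.0).sum(axis=1, keepdims=True)
    cpt_none = (counts_m + 1.0) / (counts_m + 1.0).sum()
    return w * cpt_arc + (1 - w) * cpt_none              # shape (2, n_values)
```

A SNP whose arc posterior w is near 0 contributes an (almost) class-independent table, which is what makes this a "soft" form of feature selection.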
Priors

° Structure priors
  ° FSNB and MANB assume each arc is present with some probability p, independent of the status of other arcs in the model.
  ° Informed by the literature, we chose a value of p that yields an expected number of arcs of 20, i.e., p = 20 / 312,318 ≈ 6.4 × 10^-5 (the arithmetic is sketched below).
° Parameter priors
  ° If we think of P(SNP_i | LOAD) as defining a table of probabilities, then we assume that every way of filling in that table (consistent with the axioms of probability) is equally likely.
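As a quick check of the structure-prior arithmetic (a sketch using the slide's own numbers; the variable names are illustrative):

```python
# Expected number of arcs = p * (number of candidate arcs LOAD -> SNP_i),
# so solve for p given the target of 20 expected arcs.
n_candidate_arcs = 312318
expected_arcs = 20
p = expected_arcs / n_candidate_arcs
print(f"p = {p:.2e}")   # ~6.40e-05
```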
Evaluation Methods

° Five-fold cross-validation (a sketch of the evaluation loop follows below)
° Performance measures
  ° Area under the ROC curve (AUC) as a measure of discrimination
  ° Calibration plots and Hosmer-Lemeshow goodness-of-fit statistics
  ° Run time
° Control algorithms
  ° NB
  ° FSNB
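A hedged sketch of the evaluation loop described above; `fit_model` stands in for training any of NB, FSNB, or MANB and is an assumption, not the authors' code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_validate_auc(X, y, fit_model, n_splits=5, seed=0):
    """Five-fold cross-validation; returns the mean and per-fold AUCs.
    fit_model(X, y) must return an object exposing predict_proba()."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for tr, te in cv.split(X, y):
        model = fit_model(X[tr], y[tr])
        p_load = model.predict_proba(X[te])[:, 1]   # P(LOAD = 1) per test case
        aucs.append(roc_auc_score(y[te], p_load))
    return float(np.mean(aucs)), aucs
```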
Results: Run Time

[Figure: bar chart of training time in seconds. MANB: 16.1, NB: 15.6, FSNB: 1684.2.]

Machine parameters: CPU 2.33 GHz, RAM 2 GB. Training time was the average over the five cross-validation folds. Time for loading data into memory is not included, but was about XYZ seconds.
º " º  "m
  
 "m
 ! 
@"(95%
confidence interval of
their AUC difference is
-0.008 to 0.029). Their
performance is strongly
influenced by several
APOE SNPs.
 "m
 @"
 


 (p<0.00001).
Results: Calibration

[Figure: calibration plots.]

MANB and NB were poorly calibrated, with almost all the test cases having probability predictions near 0 or 1. Such extreme predictions occur because there is such a large number of features in the model.
Results: Calibration (cont.)

FSNB was the best calibrated algorithm among the three we evaluated. This result is likely due to the FSNB models containing only a few SNP features (< 4).
Results: Calibration (cont.)

According to the Hosmer-Lemeshow goodness-of-fit statistics, FSNB was well calibrated, whereas MANB and NB were not. We believe this result may be due to FSNB having such a small number of features in its models.
Conclusions

[Summary table comparing NB, FSNB, and MANB on discrimination (AUC), calibration, and run time.]
° A full description of the MANB algorithm is available in the appendix of our paper.
° It provides all the details needed to readily implement the algorithm.
Future Work

° Apply the MANB algorithm to additional datasets
° Predict additional clinical outcomes
° Use both genomic and clinical data to predict clinical outcomes
° Explore the use of additional genome-wide measurement platforms, including next-generation sequencing data
° Include additional control algorithms in future evaluations
"  

° We thank Mr. Kevin Bui for his help in data


preparation, software development, and the preparation
of the appendix. We thank Dr. Pablo Hennings-
Yeomans, Dr. Michael Barmada, and the other members
of our research group for helpful discussions.
° The research reported here was funded by NLM grant
R01-LM010020 and NSF grant IIS-0911032.
Thank you

Questions?
