Lecture 2: Population Structure

02-715 Advanced Topics in Computational Genomics

What is population structure?

•  Population structure
–  A set of individuals characterized by some measure of genetic distinction
–  A “population” is usually characterized by a distinct distribution over genotypes
–  Example
[Figure: distributions over the genotypes aa, aA, and AA in Population 1 and Population 2]

1000 Genomes Project

Motivation

•  Reconstructing individual ancestry: The Genographic Project
–  https://genographic.nationalgeographic.com/genographic/index.html
•  Studying human migration
–  Out of Africa
–  Multi-regional hypothesis
•  Study of various traits
–  Lactose intolerance
–  Origins in Europe?
–  Infer from
•  Migration studies
•  Mutation studies in populations

[Figure: human migration timeline with stages labeled 200,000, 50,000, 30,000, and 10,000 years ago. Source: https://genographic.nationalgeographic.com/genographic/index.html]

Overview

•  Background
–  Hardy-Weinberg equilibrium
–  Genetic drift
–  Wright’s FST

•  Inferring population structure from genotype data
–  Structure (Falush et al., 2003)
–  Matrix factorization/dimensionality reduction methods (Engelhardt & Stephens, 2010)

Hardy-Weinberg Equilibrium

•  Hardy-Weinberg equilibrium
–  Under random mating, both allele and genotype frequencies in a population remain constant over generations.
–  Assumptions of the standard random-mating model
•  Diploid organism
•  Sexual reproduction
•  Nonoverlapping generations
•  Random mating
•  Large population size
•  Equal allele frequencies in the sexes
•  No migration/mutation/selection
–  Chi-square test for Hardy-Weinberg equilibrium

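The chi-square test for Hardy-Weinberg equilibrium can be sketched as follows for a single biallelic locus. This is a minimal illustration, not the course's implementation: the function name `hwe_chi_square` and the example counts are assumptions, and the p-value assumes one degree of freedom (three genotype classes, minus one, minus one estimated allele frequency).

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test against Hardy-Weinberg proportions.

    Takes observed genotype counts and returns (p, chi2, p_value), where p is
    the estimated frequency of allele A. Assumes both alleles are present,
    so that no expected count is zero.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # each AA carries two A copies, each Aa one
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value

# Hypothetical counts with a deficit of heterozygotes.
p, chi2, p_value = hwe_chi_square(30, 40, 30)
print(f"p = {p:.3f}, chi2 = {chi2:.3f}, p-value = {p_value:.4f}")
```

A small p-value suggests that the genotype counts deviate from the proportions expected under the assumptions listed above.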
Hardy-Weinberg Equilibrium

•  D, H, R: genotype frequencies for AA, Aa, aa, respectively.
•  p, q: allele frequencies of A and a

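With D, H, R the genotype frequencies of AA, Aa, aa and p, q the allele frequencies of A and a, each heterozygote carries one copy of each allele, so the allele frequencies follow from the genotype frequencies. The standard relations, supplied here for reference:

```latex
\[
p = D + \tfrac{1}{2}H, \qquad
q = R + \tfrac{1}{2}H, \qquad
D + H + R = 1 \;\Rightarrow\; p + q = 1 .
\]
```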
Hardy-Weinberg Equilibrium

•  The genotype and allele frequencies of the offspring

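A standard derivation, written with D, H, R for the genotype frequencies of AA, Aa, aa and p, q for the allele frequencies of A and a. Primes mark the offspring generation, which is a notational assumption made here. Under random mating, each offspring draws its two alleles independently from the parental allele pool:

```latex
\[
D' = p^2, \qquad H' = 2pq, \qquad R' = q^2,
\]
\[
p' = D' + \tfrac{1}{2}H' = p^2 + pq = p\,(p + q) = p, \qquad q' = 1 - p' = q .
\]
```

The allele frequencies are therefore unchanged, and the genotype frequencies reach p², 2pq, q² after a single generation of random mating.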
Genetic Drift

•  The change in allele frequencies in a population due to random sampling

•  Neutral process, unlike natural selection
–  But genetic drift can eliminate an allele from the given population.

•  The effect of genetic drift is larger in a small population

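The dependence of drift on population size can be illustrated with a small simulation. The sketch below assumes a Wright-Fisher model (fixed population size, non-overlapping generations, no selection), which the slides do not name; the function name and parameter values are illustrative.

```python
import random

def wright_fisher(p0, pop_size, generations, seed=0):
    """Simulate one allele's frequency in a diploid population of fixed size.

    Each generation, the 2 * pop_size allele copies are drawn at random,
    with replacement, from the previous generation's allele pool.
    Returns the list of frequencies, starting with p0.
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    p = p0
    trajectory = [p]
    for _ in range(generations):
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory

small = wright_fisher(0.5, pop_size=10, generations=200)
large = wright_fisher(0.5, pop_size=1000, generations=200)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

In runs of this kind the small population typically reaches frequency 0 or 1 (loss or fixation of the allele), while the large population tends to stay closer to its starting frequency.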
Population Divergence

•  Wright’s FST
–  A statistic used to quantify the extent of divergence among multiple populations relative to the overall genetic diversity
–  Summarizes the average deviation of a collection of populations away from the mean
–  FST = Var(pk) / [p’(1 - p’)]
•  p’: the overall frequency of an allele across all subpopulations
•  pk: the allele frequency within population k

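The ratio FST = Var(pk) / [p’(1 - p’)] can be computed directly for one allele. This sketch assumes equally weighted subpopulations and uses the population (not sample) variance; both choices, and the function name `fst`, are assumptions rather than part of the slides.

```python
def fst(subpop_freqs):
    """Wright's FST for one allele, given its frequency p_k in each subpopulation."""
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k                          # p': overall frequency
    var = sum((p - p_bar) ** 2 for p in subpop_freqs) / k  # Var(p_k)
    if p_bar in (0.0, 1.0):
        raise ValueError("allele is absent or fixed everywhere; FST is undefined")
    return var / (p_bar * (1.0 - p_bar))

print(fst([0.5, 0.5]))   # identical populations: 0.0
print(fst([0.0, 1.0]))   # completely diverged populations: 1.0
print(fst([0.4, 0.6]))   # mild divergence
```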
Scenarios of How Populations Evolve

Methods for Learning Population Structure from Genetic Markers

•  Low-dimensional projection
–  PCA-based methods (Patterson et al., PLoS Genetics 2006)

•  Clustering
–  Distance-based (Bowcock et al., Nature 1994)
–  Model-based
•  STRUCTURE (Pritchard et al., Genetics 2000)
•  mStruct (Shringarpure & Xing, Genetics 2008)

Probabilistic Models for Population Structure

•  Mixture model
–  Cluster individuals into K populations

•  Admixture model
–  The genotypes of each individual are an admixture of multiple ancestral populations
–  Assumes alleles are in linkage equilibrium

•  Linkage model
–  Models recombination and the resulting correlation in alleles along the chromosome

•  F model
–  Models correlation in allele frequencies among populations that share ancestry

Mixture Model

•  K populations

•  z(i): population of origin of individual i

•  For each of the K populations
–  pklj: the frequency of allele j at locus l in population k

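Given the allele frequencies pklj, the mixture model assigns each individual to a single population of origin z(i). The sketch below computes the posterior over z(i) for one individual, assuming a uniform prior over populations and independence across loci and across the two allele copies (Hardy-Weinberg and linkage equilibrium within populations). The data layout and the function name are illustrative assumptions.

```python
import math

def population_posterior(genotype, freqs):
    """Posterior probability of each population of origin for one individual.

    genotype: list over loci of (allele, allele) index pairs.
    freqs[k][l][j]: frequency of allele j at locus l in population k,
                    assumed strictly positive for every observed allele.
    """
    log_lik = []
    for pop in freqs:
        ll = 0.0
        for l, (a1, a2) in enumerate(genotype):
            ll += math.log(pop[l][a1]) + math.log(pop[l][a2])
        log_lik.append(ll)
    # Normalize in log space to avoid underflow with many loci.
    top = max(log_lik)
    weights = [math.exp(v - top) for v in log_lik]
    total = sum(weights)
    return [w / total for w in weights]

# Two populations, two biallelic loci (hypothetical frequencies).
freqs = [
    [[0.9, 0.1], [0.8, 0.2]],   # population 0
    [[0.1, 0.9], [0.3, 0.7]],   # population 1
]
print(population_posterior([(0, 0), (0, 1)], freqs))
```

In the full model the frequencies are unknown and are estimated jointly with the assignments; this sketch only shows the assignment step with the frequencies held fixed.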
Admixture Model

•  Relaxes the mixture model's assumption of one ancestral population per individual

•  Individuals can have ancestors in multiple different populations

•  qk(i): proportion of individual i’s genome derived from population k

•  Alleles at different loci can come from different populations

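The generative process of the admixture model can be sketched as follows: for every allele copy at every locus, a population of origin is drawn according to the individual's admixture proportions, and the allele is then drawn from that population's allele frequencies. The function name and data layout are assumptions made for illustration.

```python
import random

def sample_admixed_genotype(q, freqs, seed=0):
    """Draw one diploid genotype under the admixture model.

    q[k]: proportion of the individual's genome derived from population k.
    freqs[k][l][j]: frequency of allele j at locus l in population k.
    Returns (origins, alleles); each is a list over loci of 2-tuples.
    """
    rng = random.Random(seed)
    n_pops, n_loci = len(q), len(freqs[0])
    origins, alleles = [], []
    for l in range(n_loci):
        z_pair, x_pair = [], []
        for _ in range(2):                       # two allele copies per locus
            z = rng.choices(range(n_pops), weights=q)[0]
            x = rng.choices(range(len(freqs[z][l])), weights=freqs[z][l])[0]
            z_pair.append(z)
            x_pair.append(x)
        origins.append(tuple(z_pair))
        alleles.append(tuple(x_pair))
    return origins, alleles

freqs = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],   # population 0
    [[0.1, 0.9], [0.3, 0.7], [0.2, 0.8]],   # population 1
]
origins, alleles = sample_admixed_genotype([0.6, 0.4], freqs)
print(origins)
print(alleles)
```

Because the origin is drawn independently for each allele copy, this sketch also reflects the assumption that alleles at different loci are in linkage equilibrium.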
Structure Model

•  Hypothesis: modern populations are created by an intermixing of ancestral populations.
•  An individual’s genome contains contributions from one or more ancestral populations.
•  The contributions of populations can be different for different individuals.
•  Other assumptions
–  Hardy-Weinberg equilibrium
–  No linkage disequilibrium
–  Markers are i.i.d. (independent and identically distributed)

Linkage Model

•  Starting from the admixture model, replace the assumption that the ancestry labels zil for individual i, locus l are independent with the assumption that adjacent zil are correlated.

•  Use a Poisson process to model the correlation between neighboring alleles
–  dl: distance between locus l and locus l+1
–  r: recombination rate

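One way to write the resulting Markov chain over ancestry labels follows the form used by Falush et al. (2003). Here dl is the distance between locus l and locus l+1, r is the recombination rate, and q^(i)_k is taken to be the admixture proportion of individual i for population k (the index conventions are an assumption). With probability e^{-d_l r} no breakpoint falls between the two loci and the label is copied; otherwise a fresh label is drawn from the admixture proportions:

```latex
\[
\Pr\bigl(z_{i,l+1} = k' \mid z_{il} = k,\; r,\; q^{(i)}\bigr) =
\begin{cases}
e^{-d_l r} + \bigl(1 - e^{-d_l r}\bigr)\, q^{(i)}_{k'} & \text{if } k' = k,\\[4pt]
\bigl(1 - e^{-d_l r}\bigr)\, q^{(i)}_{k'} & \text{otherwise.}
\end{cases}
\]
```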
Linkage Model

•  As the recombination rate r goes to infinity, all loci become independent and the linkage model becomes the admixture model.

•  The recombination rate r can be viewed as being related to the number of generations since admixture occurred.

•  Use an MCMC algorithm to fit the unknown parameters.

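The limiting behavior in r can be checked numerically. The sketch below assumes the transition rule of Falush et al. (2003), in which the ancestry label is kept with probability exp(-d r) and otherwise redrawn from the individual's admixture proportions q; the function name is illustrative.

```python
import math

def transition_matrix(q, d, r):
    """Transition probabilities between ancestry labels at adjacent loci.

    q: admixture proportions of the individual (sums to 1).
    d: distance between the two loci; r: recombination rate.
    Entry [k][k2] is Pr(next label = k2 | current label = k).
    """
    stay = math.exp(-d * r)                  # probability of no breakpoint
    n = len(q)
    return [[(stay if k == k2 else 0.0) + (1.0 - stay) * q[k2]
             for k2 in range(n)]
            for k in range(n)]

q = [0.3, 0.7]
for r in (0.0, 1.0, 1e6):
    print(r, transition_matrix(q, d=1.0, r=r))
```

At r = 0 the matrix is the identity (labels are copied along the chromosome); as r grows, every row approaches q, so the label at each locus becomes an independent draw from q, as in the admixture model.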
F Model

•  Introduce correlations in allele frequencies among ancestral populations
–  pAl: allele frequencies in the ancestral population, modeled with a symmetric Dirichlet distribution

–  Subpopulations of the ancestral population go through genetic drift at different rates Fk

–  Individuals are admixtures of those K populations, which went through genetic drift from the common ancestral population

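The drift step can be sketched by sampling a subpopulation's allele frequencies around the ancestral ones. The parameterization below, a Dirichlet distribution with parameters pAl · (1 - Fk) / Fk, is the one given by Falush et al. (2003) and is an assumption here, since the slides do not show the formula; the function name is illustrative.

```python
import random

def sample_drifted_freqs(p_anc, F_k, seed=0):
    """Sample allele frequencies at one locus for a population that drifted
    from the ancestral frequencies p_anc at rate F_k (0 < F_k < 1).

    Draws from Dirichlet(p_anc[j] * (1 - F_k) / F_k) using the standard
    construction from independent gamma variables. Assumes every entry of
    p_anc is strictly positive.
    """
    rng = random.Random(seed)
    scale = (1.0 - F_k) / F_k
    gammas = [rng.gammavariate(p * scale, 1.0) for p in p_anc]
    total = sum(gammas)
    return [g / total for g in gammas]

p_anc = [0.5, 0.3, 0.2]
print("little drift:", sample_drifted_freqs(p_anc, F_k=0.001))
print("strong drift:", sample_drifted_freqs(p_anc, F_k=0.5))
```

Small values of Fk keep the sampled frequencies close to the ancestral ones, while larger values let them wander further, mirroring stronger genetic drift.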
F Model

•  Relationship between Fk and FST

•  Designed to distinguish between closely related populations with similar allele frequencies

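A short derivation of why Fk behaves like an FST value, assuming the Dirichlet parameterization of Falush et al. (2003); the slide's own equation is not reproduced here, so the symbols below are assumptions. If the frequencies in population k at locus l follow a Dirichlet distribution with parameters p_{Alj}(1 - F_k)/F_k, the parameters sum to (1 - F_k)/F_k, and the Dirichlet mean and variance give:

```latex
\[
\mathbb{E}\,[p_{klj}] = p_{Alj}, \qquad
\operatorname{Var}(p_{klj})
  = \frac{p_{Alj}\,(1 - p_{Alj})}{\frac{1 - F_k}{F_k} + 1}
  = F_k \, p_{Alj}\,(1 - p_{Alj}),
\]
\[
\text{so that} \qquad
F_k = \frac{\operatorname{Var}(p_{klj})}{p_{Alj}\,(1 - p_{Alj})},
\]
```

which has the same form as FST = Var(pk) / [p’(1 - p’)], with the ancestral frequency in place of the overall mean frequency.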
Scenarios of How Populations Evolve

Unknown Parameters To Be Estimated

•  qi: the admixture proportions of individual i
•  pk: allele frequencies of population k
•  zi: population labels for each locus of individual i
•  r: recombination rate
•  Fk: estimate of population divergence from the ancestral population

Population Structure from Ancestry Proportion of Each Individual

•  How to display population structure?

[Figure: ancestral proportion of each individual, grouped by region: Africa, Europe, Mid-East, Cent./S. Asia, East Asia, Oceania. Source: Genetic Structure of Human Populations (Rosenberg et al., Science 2002)]

Population of Origin Assignments of a Single Individual

[Figure: three panels showing true origin, estimated origin (phased data), and estimated origin (unphased data)]

Admixture vs Divergence

Posterior Distribution of Recombination Rate

•  Using the original dataset

•  After permuting the genotype loci

Distinguishing Between Two Closely Related Populations

Three Sources of Linkage Disequilibrium

•  Mixture LD
–  Due to variation in ancestry across individuals, which induces correlation among markers at different loci
–  Modeled by the admixture model

•  Admixture LD
–  Due to unbroken chunks of DNA derived from an ancestral population
–  Modeled by the linkage model

•  Background LD
–  Due to LD within populations
–  Decays at a smaller scale

Low-dimensional Projections

•  Genetic data is very large
–  The number of markers may range from a few hundred to hundreds of thousands
–  Thus each individual is described by a high-dimensional vector of marker configurations
–  A low-dimensional projection allows easy visualization

•  Techniques used
–  Factor analysis
–  Many statistical methods exist: ICA, PCA, NMF, etc.
–  Principal component analysis (next slide)

•  Allows projection of individuals into a low-dimensional space

•  Usually projected to 2 dimensions to allow visualization

Principal Component Analysis

•  Most common form of factor analysis

•  The new variables/dimensions ...
–  Are linear combinations of the original ones
–  Are uncorrelated with one another
•  Orthogonal in the original dimension space
–  Capture as much of the original variance in the data as possible
–  Are called principal components

•  Demo at http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html

What are the new axes?

[Figure: scatter plot over Original Variable A and Original Variable B, with the PC 1 and PC 2 axes overlaid]

•  Orthogonal directions of greatest variance in the data
•  Projections along PC 1 discriminate the data most along any one axis

Principal Components

•  The first principal component is the direction of greatest variability (variance) in the data
•  The second is the next orthogonal (uncorrelated) direction of greatest variability
–  So first remove all the variability along the first component, and then find the next direction of greatest variability
•  And so on ...

Dimensionality Reduction
Can ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don’t lose much
–  n dimensions in original data
–  calculate n eigenvectors and eigenvalues
–  choose only the first p eigenvectors, based on their eigenvalues
–  final data set has only p dimensions

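The reduction from n dimensions to p can be sketched in plain Python. To stay dependency-free, this example finds the leading eigenvectors of the sample covariance matrix by power iteration with orthogonalization instead of a full eigendecomposition; that substitution, the function name `pca`, and the toy genotype matrix are all assumptions made for illustration.

```python
import math
import random

def pca(X, p, iters=200, seed=0):
    """Top-p principal components of X (rows: individuals, columns: markers).

    Returns (eigenvalues, components, scores). Assumes p <= number of columns.
    """
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]    # center columns
    C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(m)]
         for a in range(m)]                                       # sample covariance

    def orthogonalize(w, basis):
        # Remove from w its projection on each component already found.
        for u in basis:
            dot = sum(wa * ua for wa, ua in zip(w, u))
            w = [wa - dot * ua for wa, ua in zip(w, u)]
        return w

    rng = random.Random(seed)
    values, components = [], []
    for _ in range(p):
        v = orthogonalize([rng.random() + 0.1 for _ in range(m)], components)
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
        for _ in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
            w = orthogonalize(w, components)
            norm = math.sqrt(sum(a * a for a in w))
            if norm < 1e-12:          # no variance left in the remaining directions
                break
            v = [a / norm for a in w]
        values.append(sum(v[a] * C[a][b] * v[b] for a in range(m) for b in range(m)))
        components.append(v)
    scores = [[sum(r[j] * u[j] for j in range(m)) for u in components] for r in Xc]
    return values, components, scores

# Toy genotype matrix: 4 individuals x 3 markers, coded as allele counts 0/1/2.
G = [[0, 0, 1], [0, 1, 1], [2, 2, 1], [2, 1, 0]]
values, components, scores = pca(G, p=2)
print(values)
print(scores)
```

Each row of `scores` gives one individual's coordinates in the reduced p-dimensional space, which is what a 2D population-structure plot displays.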
PCA Analysis (Cavalli-Sforza, 1978)

•  Plot of the geographical distribution of 3 PCs (intensity proportional to the value of each component)
–  First: blue
–  Second: green
–  Third: red

Matrix Factorization and Population Structure

•  Matrix factorization for learning population structure

Genotype data (N x P matrix) = Individuals’ ancestry proportions (N x K matrix) x Subpopulation allele frequencies (K x P matrix)

N: number of samples
K: number of subpopulations
P: number of genotypes

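The shapes in this factorization can be made concrete with a tiny example. The names `Lambda` (N x K ancestry proportions) and `F` (K x P subpopulation allele frequencies) and all numbers are illustrative assumptions; the product gives each individual's expected allele frequency at each marker, not observed genotypes.

```python
def matmul(A, B):
    """Multiply an N x K matrix by a K x P matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# N = 3 individuals, K = 2 subpopulations, P = 4 markers.
Lambda = [
    [1.0, 0.0],    # individual entirely from subpopulation 0
    [0.5, 0.5],    # evenly admixed individual
    [0.0, 1.0],    # individual entirely from subpopulation 1
]
F = [
    [0.2, 0.8, 0.5, 0.9],    # allele frequencies in subpopulation 0
    [0.6, 0.4, 0.5, 0.1],    # allele frequencies in subpopulation 1
]

expected = matmul(Lambda, F)    # N x P
for row in expected:
    print([round(x, 3) for x in row])
```

The admixed individual's row is the average of the two subpopulation rows. Methods of this family differ mainly in the constraints they place on the two factors.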
Unifying Framework of Matrix Factorization

•  Admixture
–  Based on probability models: rows of Λ and columns of F should sum to 1.
–  Works well if the individuals are admixtures of discretely separated populations

•  PCA
–  Based on eigendecomposition: columns of Λ are orthogonal, rows of F are orthonormal.
–  Works well for the case of isolation-by-distance (continuous variation of populations among individuals)

•  Sparse factor model
–  Sparsity via automatic relevance determination prior

Discrete/Admixed Populations

[Figure: Loading 1, Loading 2, and Loading 3 for each of SFA, PCA, and Admixture]

Isolation-by-Distance Models

Clustered Populations in 1d Habitat

[Figure: results for SFA (assuming two and five populations), Admixture (assuming two and five populations), and PCA]

Analysis of European Genotype Data

[Figure: three panels showing PCA, SFAm, and Admixture results]

Comparison of Different Methods

PCA
•  Advantages
–  Statistical tests for significance of results (Patterson et al. 2006)
–  Easy visualization
•  Disadvantages
–  No intuition about underlying processes

Model-based clustering
•  Advantages
–  Generative process that explicitly models admixture
–  Clustering is probabilistic: it is possible to assign a confidence level to clusters
•  Disadvantages
–  Computationally more demanding
–  Based on assumptions of evolutionary models:
•  Structure: no models of mutation, recombination
•  Mutation added in mStruct
•  Recombination added in extension by Falush et al.
