Lecture 2: Population Structure
02-715 Advanced Topics in Computational Genomics
What is population structure?
•  Population structure
–  A set of individuals characterized by some measure of genetic distinction
–  A “population” is usually characterized by a distinct distribution over genotypes
–  Example: [Figure: distributions over the genotypes aa, aA, and AA in Population 1 and Population 2]
1000 Genomes Project
Motivation
•  Reconstructing individual ancestry: The Genographic Project
–  https://genographic.nationalgeographic.com/genographic/index.html
•  Studying human migration
–  Out of Africa
–  Multi-regional hypothesis
•  Study of various traits
–  Lactose intolerance
–  Origins in Europe?
–  Infer from migration studies and mutation studies in populations
[Figure: map of human migration with waves dated 200,000, 50,000, 30,000, and 10,000 years ago. Source: https://genographic.nationalgeographic.com/genographic/index.html]
Overview
•  Background
–  Hardy-Weinberg equilibrium
–  Genetic drift
–  Wright’s F_ST
•  Inferring population structure from genotype data
–  Structure (Falush et al., 2003)
–  Matrix factorization/dimensionality reduction methods (Engelhardt & Stephens, 2010)
Hardy-Weinberg Equilibrium
•  Hardy-Weinberg equilibrium
–  Under random mating, both allele and genotype frequencies in a population remain constant over generations.
–  Assumptions of the standard random mating model: diploid organism, sexual reproduction, nonoverlapping generations, random mating, large population size, equal allele frequencies in the sexes, no migration/mutation/selection
–  Chi-square test for Hardy-Weinberg equilibrium
•  D, H, R: genotype frequencies for AA, Aa, aa, respectively
•  p, q: allele frequencies of A and a, with p = D + H/2 and q = R + H/2
•  The genotype and allele frequencies of the offspring: AA, Aa, aa occur with frequencies p^2, 2pq, q^2, and the allele frequencies remain p and q
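The Hardy-Weinberg bookkeeping above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names are hypothetical, and the chi-square statistic is the usual sum of (observed - expected)^2 / expected over the three genotype classes.

```python
def allele_frequencies(D, H, R):
    """Allele frequencies (p, q) of A and a from genotype frequencies of AA, Aa, aa."""
    total = D + H + R
    if abs(total - 1.0) > 1e-9:
        raise ValueError("genotype frequencies must sum to 1")
    p = D + H / 2.0   # every AA carries two A alleles, every Aa carries one
    q = R + H / 2.0
    return p, q


def offspring_genotype_frequencies(p, q):
    """Genotype frequencies (AA, Aa, aa) after one round of random mating."""
    return p * p, 2.0 * p * q, q * q


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing observed counts with Hardy-Weinberg expectations.

    Assumes both alleles are present, so that no expected count is zero.
    """
    n = n_AA + n_Aa + n_aa
    p, q = allele_frequencies(n_AA / n, n_Aa / n, n_aa / n)
    expected = [n * f for f in offspring_genotype_frequencies(p, q)]
    observed = [n_AA, n_Aa, n_aa]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For a biallelic locus the statistic is typically compared against a chi-square distribution with 1 degree of freedom (critical value about 3.84 at the 5% level).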
Genetic Drift
•  The change in allele frequencies in a population due to random sampling
•  A neutral process, unlike natural selection
–  But genetic drift can eliminate an allele from the given population.
•  The effect of genetic drift is larger in a small population
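One way to see these points is a small simulation. The sketch below assumes a Wright-Fisher-style model (each of the 2N allele copies in the next generation is drawn at random from the current pool); the slides do not name a specific model, so this choice and the function names are assumptions for illustration.

```python
import random


def wright_fisher(p0, pop_size, generations, rng):
    """Track the frequency of allele A under pure drift (no selection, mutation, or migration)."""
    n_copies = 2 * pop_size          # diploid population: 2N allele copies
    p = p0
    trajectory = [p]
    for _generation in range(generations):
        # Each copy in the next generation is drawn at random from the current pool.
        count_A = sum(1 for _copy in range(n_copies) if rng.random() < p)
        p = count_A / n_copies
        trajectory.append(p)
    return trajectory


def fraction_fixed_or_lost(p0, pop_size, generations, replicates, seed=0):
    """Fraction of replicate populations in which one allele has been eliminated."""
    rng = random.Random(seed)
    done = 0
    for _replicate in range(replicates):
        final = wright_fisher(p0, pop_size, generations, rng)[-1]
        if final == 0.0 or final == 1.0:
            done += 1
    return done / replicates
```

Comparing `fraction_fixed_or_lost(0.5, 5, 100, 100)` with `fraction_fixed_or_lost(0.5, 100, 100, 100)` illustrates the last bullet: alleles are eliminated far more often in the smaller population over the same number of generations.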
Population Divergence
•  Wright’s F_ST
–  Statistic used to quantify the extent of divergence among multiple populations relative to the overall genetic diversity
–  Summarizes the average deviation of a collection of populations away from the mean
–  F_ST = Var(p_k) / (p'(1 - p')), where p' is the overall frequency of an allele across all subpopulations and p_k is the allele frequency within population k
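As a worked instance of the formula, the following sketch computes F_ST for one biallelic locus. It assumes equally weighted subpopulations and uses the population (not sample) variance; both are simplifications chosen here, not stated on the slide.

```python
def fst(subpop_freqs):
    """Wright's F_ST for one biallelic locus from per-population allele frequencies.

    Assumes equally weighted subpopulations and the population (not sample) variance.
    """
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k                      # p': overall allele frequency
    var_p = sum((p - p_bar) ** 2 for p in subpop_freqs) / k
    if p_bar in (0.0, 1.0):
        raise ValueError("F_ST is undefined when the allele is fixed or absent everywhere")
    return var_p / (p_bar * (1.0 - p_bar))
```

Identical frequencies give F_ST = 0, and two populations fixed for different alleles (frequencies 0 and 1) give F_ST = 1.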
Scenarios of How Populations Evolve
Methods for Learning Population Structure from Genetic Markers
•  Low-dimensional projection
–  PCA-based methods (Patterson et al., PLoS Genetics 2006)
•  Clustering
–  Distance-based (Bowcock et al., Nature 1994)
–  Model-based: STRUCTURE (Pritchard et al., Genetics 2000), mStruct (Shringarpure & Xing, Genetics 2008)
Probabilistic Models for Population Structure
•  Mixture model
–  Clusters individuals into K populations
•  Admixture model
–  The genotypes of each individual are an admixture of multiple ancestral populations
–  Assumes alleles are in linkage equilibrium
•  Linkage model
–  Models recombination and the resulting correlation in alleles along the chromosome
•  F model
–  Models correlation in allele frequencies among populations that share ancestry
Mixture Model
•  K populations
•  z(i): population of origin of individual i
•  For each of the K populations
–  p_klj: the frequency of allele j at locus l in population k
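Given the allele frequencies p_klj, the population of origin z(i) can be scored by likelihood. The sketch below is a minimal illustration that assumes Hardy-Weinberg equilibrium within each population, independent loci, and a uniform prior over populations unless one is supplied; the data layout (`genotype[l]` as a pair of allele indices, `p[k][l][j]` as a nested list) is hypothetical.

```python
import math


def log_likelihood(genotype, p_k):
    """log P(genotype | population k), assuming HWE and independent loci.

    genotype[l] is a pair of allele indices; p_k[l][j] is the frequency of
    allele j at locus l. Frequencies of observed alleles must be nonzero.
    """
    return sum(math.log(p_k[l][a]) for l, pair in enumerate(genotype) for a in pair)


def posterior_origin(genotype, p, prior=None):
    """Posterior P(z = k | genotype) over the K populations."""
    K = len(p)
    prior = prior or [1.0 / K] * K
    logs = [math.log(prior[k]) + log_likelihood(genotype, p[k]) for k in range(K)]
    m = max(logs)                                    # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]
```

An individual is then assigned to the population with the largest posterior, or the posterior itself is reported as a measure of confidence.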
Admixture Model
•  Relax the mixture model's assumption of one ancestral population per individual
•  Individuals can have ancestors in multiple different populations
•  q_k(i): proportion of individual i’s genome derived from population k
•  Alleles at different loci can come from different populations
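The generative process described above can be sketched as follows: each allele copy first draws a population of origin from the individual's admixture proportions, then draws an allele from that population's frequencies. This is an illustrative sketch under the assumption of diploid individuals and independent loci; the nested-list layout `p[k][l][j]` and the function names are hypothetical.

```python
import random


def sample_admixed_genotype(q_i, p, rng):
    """Draw one diploid genotype under the admixture model.

    q_i[k] is the admixture proportion of population k for this individual;
    p[k][l][j] is the frequency of allele j at locus l in population k.
    Returns (z, x): ancestry labels and allele indices, one pair per locus.
    """
    K = len(p)
    L = len(p[0])
    z, x = [], []
    for l in range(L):
        labels, alleles = [], []
        for _copy in range(2):                       # two allele copies per locus
            k = rng.choices(range(K), weights=q_i)[0]          # population of origin of this copy
            j = rng.choices(range(len(p[k][l])), weights=p[k][l])[0]
            labels.append(k)
            alleles.append(j)
        z.append(tuple(labels))
        x.append(tuple(alleles))
    return z, x


def allele_probability(q_i, p, l, j):
    """Marginal P(allele j at locus l) = sum_k q_k * p_klj."""
    return sum(q_k * p[k][l][j] for k, q_k in enumerate(q_i))
```

Setting `q_i` to put all its weight on a single population recovers the mixture model as a special case.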
Structure Model
•  Hypothesis: Modern populations are created by an intermixing of ancestral populations.
•  An individual’s genome contains contributions from one or more ancestral populations.
•  The contributions of populations can be different for different individuals.
•  Other assumptions
–  Hardy-Weinberg equilibrium
–  No linkage disequilibrium
–  Markers are i.i.d. (independent and identically distributed)
Linkage Model
•  Starting from the admixture model, replace the assumption that the ancestry labels z_il for individual i, locus l are independent with the assumption that adjacent z_il are correlated.
•  Use a Poisson process to model the correlation between neighboring alleles
–  d_l: distance between locus l and locus l+1
–  r: recombination rate
Linkage Model
•  As the recombination rate r goes to infinity, all loci become independent and the linkage model becomes the admixture model.
•  The recombination rate r can be viewed as being related to the number of generations since admixture occurred.
•  Use an MCMC algorithm to fit the unknown parameters.
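The transition formula itself did not survive in the extracted slide text. The sketch below assumes the standard consequence of a Poisson process of breakpoints with rate r per unit distance: with probability exp(-d_l r) no breakpoint falls between the two loci and the ancestry label is copied, otherwise a fresh label is drawn from the individual's admixture proportions. The function name is hypothetical.

```python
import math


def ancestry_transition(q_i, d_l, r):
    """K x K matrix T with T[k][k2] = P(z_{l+1} = k2 | z_l = k).

    With probability exp(-d_l * r) no breakpoint falls between the loci and the
    label is copied; otherwise a new label is drawn from q_i.
    """
    stay = math.exp(-d_l * r)
    K = len(q_i)
    return [[(stay if k2 == k else 0.0) + (1.0 - stay) * q_i[k2] for k2 in range(K)]
            for k in range(K)]
```

With r = 0 the matrix is the identity (labels never change along the chromosome), and for very large r every row equals `q_i`, which is the admixture-model limit mentioned above.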
F Model
•  Introduce correlations in allele frequencies among ancestral populations
–  p_Al: allele frequencies at locus l in the common ancestral population, modeled with a symmetric Dirichlet distribution
–  Subpopulations of the ancestral population go through genetic drift at different rates F_k
–  Individuals are admixtures of those K populations, which went through genetic drift from the common ancestral population
F Model
•  Relationship between F_k and F_ST
•  Designed to distinguish between closely related populations with similar allele frequencies
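The slide's formula linking F_k to the allele frequencies is missing from the extracted text. The sketch below assumes the parameterization in which population k's frequencies at a locus are drawn from Dirichlet(p_Al (1 - F_k) / F_k); this is an assumption made for illustration. Under it the mean is p_Al and, for a biallelic locus, the variance is F_k p(1 - p), which has the same form as the F_ST definition given earlier.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_j, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_drifted_frequencies(p_A, F_k, rng):
    """Allele frequencies of population k at one locus after drift away from p_A.

    Assumed parameterization: Dirichlet(p_A * (1 - F_k) / F_k), so the mean is p_A
    and a larger F_k gives a wider spread around it. Requires 0 < F_k < 1 and
    strictly positive ancestral frequencies.
    """
    scale = (1.0 - F_k) / F_k
    return sample_dirichlet([a * scale for a in p_A], rng)
```

Small F_k keeps a population close to the ancestral frequencies, which is why the model suits closely related populations.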
Scenarios of How Populations Evolve
Unknown Parameters To Be Estimated
•  q_i: the admixture proportions of individual i
•  p_k: allele frequencies of population k
•  z_i: population labels for each locus of individual i
•  r: recombination rate
•  F_k: estimate of population divergence from the ancestral population
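To make the estimation problem concrete, here is a sketch of a single sampling step for one ancestry label, holding q_i and p_k fixed. It assumes the admixture model without linkage, where the conditional probability of label k for an observed allele is proportional to q_k times that allele's frequency in population k. It is one building block of a Gibbs-style MCMC sampler, not a full fitting procedure, and the function names are hypothetical.

```python
import random


def ancestry_posterior(q_i, p_l, allele):
    """P(z = k | allele, q_i, p) for one allele copy at one locus.

    q_i[k]: admixture proportion of population k; p_l[k][j]: frequency of allele j
    at this locus in population k.
    """
    weights = [q_k * p_l[k][allele] for k, q_k in enumerate(q_i)]
    total = sum(weights)
    return [w / total for w in weights]


def gibbs_update_z(q_i, p_l, allele, rng):
    """Resample the ancestry label of one allele copy from its conditional distribution."""
    probs = ancestry_posterior(q_i, p_l, allele)
    return rng.choices(range(len(probs)), weights=probs)[0]
```

A full sampler would alternate such updates for every label with updates of q_i, p_k, and (in the linkage and F models) r and F_k.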
Population Structure from Ancestry Proportion of Each Individual
•  How to display population structure?
[Figure: ancestral proportion of each individual, grouped by region: Africa, Europe, Mid-East, Cent./S. Asia, East Asia, Oceania. Source: "Genetic Structure of Human Populations" (Rosenberg et al., Science 2002)]
Population of Origin Assignments of a Single Individual
[Figure: true origin compared with estimated origin from phased data and estimated origin from unphased data]
Admixture vs Divergence
Posterior Distribution of Recombination Rate
•  Using the original dataset
•  After permuting the genotype loci
Distinguishing Between Two Closely Related Populations
Three Sources of Linkage Disequilibrium
•  Mixture LD
–  Due to variation in ancestry across individuals, which induces correlation among markers at different loci
–  Modeled by the admixture model
•  Admixture LD
–  Due to unbroken chunks of DNA derived from an ancestral population
–  Modeled by the linkage model
•  Background LD
–  Due to LD within populations
–  Decays at a smaller scale
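Mixture LD can be checked with a short calculation. The sketch below pools haplotypes from two populations that are each in linkage equilibrium and computes the usual LD coefficient D = P(AB) - P(A)P(B); the two-population setup and the function name are assumptions chosen for illustration.

```python
def pooled_ld(m, pA, pB):
    """LD coefficient D = P(AB) - P(A)P(B) in a pool of two populations.

    m: fraction of haplotypes from population 1; pA[k], pB[k]: frequencies of allele A
    at locus 1 and allele B at locus 2 in population k. Each population is assumed
    to be in linkage equilibrium, so P(AB | k) = pA[k] * pB[k].
    """
    w = [m, 1.0 - m]
    p_AB = sum(w[k] * pA[k] * pB[k] for k in range(2))
    p_A = sum(w[k] * pA[k] for k in range(2))
    p_B = sum(w[k] * pB[k] for k in range(2))
    return p_AB - p_A * p_B
```

Algebraically this equals m(1 - m)(pA[0] - pA[1])(pB[0] - pB[1]), so the pooled sample shows LD whenever both loci differ in frequency between the populations, even though neither population has any LD of its own.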
Low-dimensional Projections
•  Genetic data is very large
–  Number of markers may range from a few hundred to hundreds of thousands
–  Thus each individual is described by a high-dimensional vector of marker configurations
–  A low-dimensional projection allows easy visualization
•  Technique used
–  Factor analysis
–  Many statistical methods exist: ICA, PCA, NMF, etc.
–  Principal Components Analysis (next slide): allows projection of individuals into a low-dimensional space, usually 2 dimensions to allow visualization
Principal Component Analysis
•  Most common form of factor analysis
•  The new variables/dimensions ...
–  Are linear combinations of the original ones
–  Are uncorrelated with one another (orthogonal in the original dimension space)
–  Capture as much of the original variance in the data as possible
–  Are called Principal Components
•  Demo at http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html
What are the new axes?
[Figure: data plotted against Original Variable A and Original Variable B, with the axes PC 1 and PC 2 overlaid]
•  Orthogonal directions of greatest variance in data
•  Projections along PC1 discriminate the data most along any one axis
Principal Components
•  First principal component is the direction of greatest variability (covariance) in the data
•  Second is the next orthogonal (uncorrelated) direction of greatest variability
–  So first remove all the variability along the first component, and then find the next direction of greatest variability
•  And so on …
Dimensionality Reduction
Can ignore the components of lesser significance.
You do lose some informa8on, but if the eigenvalues are small, you don’t lose much
–  n dimensions in original data
–  calculate n eigenvectors and eigenvalues
–  choose only the first p eigenvectors, based on their eigenvalues
–  final data set has only p dimensions
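The four steps above map directly onto an eigendecomposition of the covariance matrix. The following is a minimal NumPy sketch, assuming a samples-by-features data matrix; the function name `pca_project` is hypothetical, and for very large marker sets an SVD of the centered data would usually be preferred over forming the n x n covariance matrix.

```python
import numpy as np


def pca_project(X, p):
    """Project the rows of X (samples x n features) onto the first p principal components."""
    Xc = X - X.mean(axis=0)                        # center each feature
    cov = np.cov(Xc, rowvar=False)                 # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]              # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :p]                    # keep only the first p eigenvectors
    explained = eigvals[:p].sum() / eigvals.sum()  # fraction of variance retained
    return Xc @ components, components, explained
```

The returned `explained` value quantifies how much information is lost: when the discarded eigenvalues are small, it is close to 1.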
PCA Analysis (Cavalli-Sforza, 1978)
•  Plot of geographical distribution of 3 PCs (intensity proportional to value of each component)
–  First: blue
–  Second: green
–  Third: red
Matrix Factorization and Population Structure
•  Matrix factorization for learning population structure
–  Genotype data (N×P matrix) = Individuals’ ancestry proportions (N×K matrix) × Subpopulation allele frequencies (K×P matrix)
–  N: number of samples
–  K: number of subpopulations
–  P: number of genotypes
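The shapes in this factorization can be illustrated with simulated data. In the sketch below the sizes, the random draws, and the names `Lam` and `F` are all hypothetical; coding genotypes as 0/1/2 counts of two independent allele copies is an added assumption, since the slide only states the matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, P = 6, 2, 5                                  # hypothetical sizes for illustration

# Lam: N x K ancestry proportions; each row is drawn from a Dirichlet, so it sums to 1.
Lam = rng.dirichlet(np.ones(K), size=N)
# F: K x P allele frequencies, one row per subpopulation, entries in [0, 1].
F = rng.uniform(0.05, 0.95, size=(K, P))

# Each individual's expected allele frequency at each marker is a weighted
# average of the subpopulation frequencies.
expected_freq = Lam @ F                            # N x P

# Simulated 0/1/2 genotype counts: two independent allele copies per marker.
G = rng.binomial(2, expected_freq)
```

Inference runs in the opposite direction: given only `G`, the methods in this lecture estimate the two factors under different constraints.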
Unifying Framework of Matrix Factorization
•  Admixture
–  Based on probability models: rows of Λ and columns of F should sum to 1.
–  Works well if the individuals are admixtures of discretely separated populations
•  PCA
–  Based on eigendecomposition: columns of Λ are orthogonal, rows of F are orthonormal.
–  Works well for the case of isolation-by-distance (continuous variation of populations among individuals)
•  Sparse factor model
–  Sparsity via automatic relevance determination prior
Discrete/Admixed Populations
[Figure: Loading 1, Loading 2, and Loading 3 as estimated by SFA, PCA, and Admixture]
Isolation-by-Distance Models
Clustered Populations in 1d Habitat
[Figure: results for SFA and Admixture, each assuming two populations and assuming five populations, and for PCA]
Analysis of European Genotype Data
[Figure: results from PCA, SFAm, and Admixture]
Comparison of Different Methods
•  PCA: advantages
–  Statistical tests for significance of results (Patterson et al. 2006)
–  Easy visualization
•  PCA: disadvantages
–  No intuition about underlying processes
•  Model-based clustering: advantages
–  Generative process that explicitly models admixture
–  Clustering is probabilistic: it is possible to assign a confidence level to clusters
•  Model-based clustering: disadvantages
–  Computationally more demanding
–  Based on assumptions of evolutionary models. Structure has no models of mutation or recombination; mutation was added in mStruct, and recombination was added in the extension by Falush et al.
