What is population structure?
• Population Structure
– A set of individuals characterized by some measure of genetic distinction
– A “population” is usually characterized by a distinct distribution over genotypes
– Example
[Figure: distribution over the genotypes aa, aA, AA in Population 1 and Population 2]
1000 Genomes Project
Motivation
200,000 years ago
https://genographic.nationalgeographic.com/genographic/index.html
Overview
• Background
– Hardy-Weinberg Equilibrium
– Genetic drift
– Wright’s FST
Hardy-Weinberg Equilibrium
• Hardy-Weinberg Equilibrium
– Under random mating, both allele and genotype frequencies in a population remain constant over generations.
– Assumptions of the standard random mating model
• Diploid organism
• Sexual reproduction
• Nonoverlapping generations
• Random mating
• Large population size
• Equal allele frequencies in the sexes
• No migration/mutation/selection
– Chi-square test for Hardy-Weinberg equilibrium
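The chi-square test listed above can be sketched as follows. This is a minimal illustration, assuming a single biallelic locus with observed counts of the genotypes aa, aA and AA and both alleles present in the sample. The function name `hwe_chi_square` and the counts are hypothetical; 3.841 is the usual 5% critical value for a chi-square distribution with one degree of freedom.

```python
def hwe_chi_square(n_aa, n_aA, n_AA):
    """Chi-square statistic for Hardy-Weinberg equilibrium at a biallelic locus."""
    n = n_aa + n_aA + n_AA
    # Estimate the frequency of allele A from the observed genotype counts.
    p = (2 * n_AA + n_aA) / (2 * n)
    q = 1.0 - p
    # Expected genotype counts under random mating: q^2, 2pq, p^2.
    expected = [n * q * q, n * 2 * p * q, n * p * p]
    observed = [n_aa, n_aA, n_AA]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts, for illustration only.
CRITICAL_5PCT_1DF = 3.841
stat_equilibrium = hwe_chi_square(250, 500, 250)  # exact HWE proportions
stat_deficit = hwe_chi_square(400, 200, 400)      # strong heterozygote deficit
```

A statistic above the critical value, as in the second call, would suggest that the genotype counts depart from Hardy-Weinberg proportions.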
Genetic Drift
• The effect of genetic drift is larger in a small population
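One way to see the dependence on population size is a small Wright-Fisher style simulation. This is only a sketch under stated assumptions: a diploid population of constant size N, a single biallelic locus, and no mutation, migration or selection. The function names, the starting frequency 0.5, and the population sizes are all chosen for illustration.

```python
import random

def simulate_drift(pop_size, p0, generations, rng):
    """Resample 2N allele copies each generation and return the final frequency."""
    p = p0
    copies = 2 * pop_size
    for _ in range(generations):
        # Each copy in the next generation is drawn from the current frequency.
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
    return p

def spread_of_final_frequency(pop_size, replicates=100, generations=20, seed=0):
    """Variance of the final allele frequency across independent replicates."""
    rng = random.Random(seed)
    finals = [simulate_drift(pop_size, 0.5, generations, rng)
              for _ in range(replicates)]
    mean = sum(finals) / len(finals)
    return sum((x - mean) ** 2 for x in finals) / len(finals)

var_small = spread_of_final_frequency(pop_size=10)    # hypothetical small population
var_large = spread_of_final_frequency(pop_size=500)   # hypothetical large population
```

Across replicates, the allele frequency should wander much further from 0.5 in the small population than in the large one.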
Population Divergence
• Wright’s FST
– Statistic used to quantify the extent of divergence among multiple populations relative to the overall genetic diversity
– Summarizes the average deviation of a collection of populations away from the mean
– FST = Var(pk) / [p’(1 − p’)]
• p’: the overall frequency of an allele across all subpopulations
• pk: the allele frequency within population k
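The formula above can be evaluated directly. The sketch below assumes equally weighted subpopulations, takes p’ as the unweighted mean of the pk, and uses the population variance (dividing by the number of subpopulations); those choices, the function name `wright_fst`, and the example frequencies are assumptions made for illustration.

```python
def wright_fst(subpop_freqs):
    """FST = Var(p_k) / (p'(1 - p')) for one allele, equally weighted subpopulations."""
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k                       # p': overall frequency
    var_p = sum((p - p_bar) ** 2 for p in subpop_freqs) / k
    return var_p / (p_bar * (1.0 - p_bar))

# Hypothetical allele frequencies, for illustration only.
fst_identical = wright_fst([0.3, 0.3, 0.3])  # no divergence
fst_diverged = wright_fst([0.1, 0.9])        # strong divergence
fst_fixed = wright_fst([0.0, 1.0])           # alternate alleles fixed
```

Under these assumptions FST ranges from 0, when all subpopulations share the same frequency, to 1, when each subpopulation is fixed for a different allele.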
Scenarios of How Populations Evolve
Methods for Learning Population Structure from Genetic Markers
• Low-dimensional projection
– PCA-based methods (Patterson et al., PLoS Genetics 2006)
• Clustering
– Distance-based (Bowcock et al., Nature 1994)
– Model-based
• STRUCTURE (Pritchard et al., Genetics 2000)
• mStruct (Shringarpure & Xing, Genetics 2008)
Probabilistic Models for Population Structure
• Mixture model
– Cluster individuals into K populations
• Admixture model
– The genotypes of each individual are an admixture of multiple ancestor populations
– Assumes alleles are in linkage equilibrium
• Linkage model
– Model recombination, correlation in alleles across the chromosome
• F model
– Model correlation in alleles in ancestry
Mixture Model
• K populations
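Under the mixture model each individual belongs to exactly one of the K populations. A minimal sketch of the assignment step follows, assuming the allele frequencies of each population are already known and lie strictly between 0 and 1, loci are independent, genotypes follow Hardy-Weinberg proportions within a population, and the prior over populations is uniform. The function names and all numbers are hypothetical, and this is not the inference procedure of any particular software package.

```python
import math

def genotype_log_prob(g, p):
    """Log-probability of genotype g (copies of allele A: 0, 1 or 2) under HWE."""
    probs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    return math.log(probs[g])

def membership_posterior(genotypes, pop_freqs):
    """Posterior over K populations for one individual, with a uniform prior."""
    log_liks = [
        sum(genotype_log_prob(g, p) for g, p in zip(genotypes, freqs))
        for freqs in pop_freqs
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_liks)
    weights = [math.exp(ll - m) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical allele-A frequencies at three loci in K = 2 populations.
pop_freqs = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]
posterior = membership_posterior([2, 2, 1], pop_freqs)
```

An individual carrying mostly A alleles receives nearly all of its posterior mass on the first population, where A is common.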
Admixture Model
Structure Model
Linkage Model
• Recombination rate r can be viewed as being related to the number of generations since admixture occurred.
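To make the role of r concrete, ancestry along a chromosome can be sketched as a Markov chain in which the ancestry of a marker is copied from the previous marker with probability exp(-r·d) and otherwise redrawn from the individual's admixture proportions. This transition rule is one common formulation and is assumed here rather than taken from the slides; the names `simulate_ancestry` and `count_switches`, the proportions `q`, and the marker spacing are hypothetical.

```python
import math
import random

def simulate_ancestry(q, distances, r, rng):
    """Ancestry labels along one chromosome, modeled as a Markov chain."""
    populations = list(range(len(q)))
    z = [rng.choices(populations, weights=q)[0]]
    for d in distances:
        # With probability exp(-r * d) no breakpoint falls between the two
        # markers and the ancestry is copied; otherwise it is redrawn from q.
        if rng.random() < math.exp(-r * d):
            z.append(z[-1])
        else:
            z.append(rng.choices(populations, weights=q)[0])
    return z

def count_switches(z):
    """Number of positions where the ancestry label changes."""
    return sum(1 for a, b in zip(z, z[1:]) if a != b)

rng = random.Random(1)
q = [0.5, 0.5]              # hypothetical admixture proportions
distances = [0.01] * 999    # hypothetical spacing between 1000 markers
recent = simulate_ancestry(q, distances, r=1.0, rng=rng)
ancient = simulate_ancestry(q, distances, r=100.0, rng=rng)
```

A small r leaves long unbroken ancestry chunks, as expected shortly after admixture, while a large r breaks the chromosome into many short chunks.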
F Model
– Subpopulations of the ancestral population go through genetic drift at different rates Fk
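One way to picture drift at rate Fk is to draw each subpopulation's allele frequency from a Beta distribution centered on the ancestral frequency, with variance Fk·pA(1 − pA). This Beta parameterization is an assumption made for the sketch and is not stated on the slide; the function names and the values of pA and Fk are hypothetical.

```python
import random

def drifted_frequency(p_ancestral, f_k, rng):
    """Draw a subpopulation allele frequency given a drift parameter F_k."""
    scale = (1.0 - f_k) / f_k
    # Beta(a, b) with a + b = (1 - F_k) / F_k has mean p_ancestral and
    # variance F_k * p_ancestral * (1 - p_ancestral).
    return rng.betavariate(p_ancestral * scale, (1.0 - p_ancestral) * scale)

def sample_variance(p_ancestral, f_k, n_draws=20000, seed=0):
    """Empirical variance of drifted frequencies over many draws."""
    rng = random.Random(seed)
    draws = [drifted_frequency(p_ancestral, f_k, rng) for _ in range(n_draws)]
    mean = sum(draws) / n_draws
    return sum((x - mean) ** 2 for x in draws) / n_draws

var_weak = sample_variance(0.3, 0.01)    # little drift
var_strong = sample_variance(0.3, 0.2)   # more drift
```

A larger Fk spreads the subpopulation frequency further from the ancestral value, which matches the variance-based definition of FST.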
Scenarios of How Populations Evolve
Unknown Parameters To Be Estimated
Population Structure from Ancestry Proportion of Each Individual
• How to display population structure?
[Figure: ancestral proportion of each individual; panels: True origin, Estimated origin (phased data), Estimated origin (unphased data)]
Admixture vs Divergence
Posterior Distribution of Recombination Rate
• Using the original dataset
Distinguishing Between Two Closely Related Populations
Three Sources of Linkage Disequilibrium
• Mixture LD
– Due to variation in ancestry across individuals, which induces correlation among markers at different loci
– Modeled by the admixture model
• Admixture LD
– Due to unbroken chunks of DNA derived from an ancestor population
– Modeled by the linkage model
• Background LD
– Due to LD within populations
– Decays at a smaller scale
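Mixture LD can be illustrated with a short calculation. The sketch assumes two loci that are in linkage equilibrium inside each source population and measures LD in the pooled sample with the coefficient D = P(AB) − P(A)P(B). The function names and the allele frequencies are hypothetical.

```python
def ld_coefficient(p_a, p_b, p_ab):
    """D = P(AB) - P(A) P(B) for alleles A and B at two loci."""
    return p_ab - p_a * p_b

def pooled_ld(pops, weights):
    """LD in a pooled sample when each source population is in equilibrium."""
    p_a = sum(w * a for w, (a, b) in zip(weights, pops))
    p_b = sum(w * b for w, (a, b) in zip(weights, pops))
    # Within each population the loci are independent, so P(AB) = a * b there.
    p_ab = sum(w * a * b for w, (a, b) in zip(weights, pops))
    return ld_coefficient(p_a, p_b, p_ab)

# Hypothetical (frequency of A at locus 1, frequency of B at locus 2).
pops = [(0.9, 0.8), (0.1, 0.2)]
d_pooled = pooled_ld(pops, [0.5, 0.5])   # equal mix of both populations
d_single = pooled_ld(pops[:1], [1.0])    # first population alone
```

Neither population has LD on its own, yet pooling them produces a clearly nonzero D because allele frequencies differ between the populations at both loci.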
Low-dimensional Projections
• Technique used
– Factor analysis
– Many statistical methods exist: ICA, PCA, NMF, etc.
– Principal Components Analysis (next slide)
Principal Component Analysis
– Capture as much of the original variance in the data as possible
What are the new axes?
[Figure: new axes PC 1 and PC 2 drawn over the original axes, Original Variable A and Original Variable B]
Dimensionality Reduction
Can ignore the components of lesser significance.
You do lose some information, but if the eigenvalues are small, you don’t lose much
– n dimensions in original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– final data set has only p dimensions
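The four steps above can be sketched with NumPy. This is a minimal illustration, assuming the data are arranged as samples by variables and that the eigenvectors are taken from the sample covariance matrix; the function name `pca_reduce` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_reduce(X, p):
    """Project rows of X (samples x n variables) onto the top p principal components."""
    Xc = X - X.mean(axis=0)                   # center each variable
    cov = np.cov(Xc, rowvar=False)            # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals[:p].sum() / eigvals.sum()
    return Xc @ eigvecs[:, :p], explained

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Hypothetical data: three variables driven mostly by one hidden factor t.
X = np.column_stack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))
reduced, explained = pca_reduce(X, p=1)
```

Here `explained` is the fraction of total variance carried by the retained components, which gives a rough sense of how much information is lost by dropping the rest.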
PCA Analysis (Cavalli-Sforza, 1978)
• Plot of geographical distribution of 3 PCs (intensity proportional to value of each component)
– First: blue
– Second: green
– Third: red
Matrix Factorization and Population Structure
• Matrix factorization for learning population structure
Unifying Framework of Matrix Factorization
• Admixture
– Based on probability models: rows of Λ and columns of F should sum to 1.
– Works well if the individuals are admixtures of discretely separated populations
• PCA
– Based on eigendecomposition: columns of Λ are orthogonal, rows of F are orthonormal.
– Works well for the case of isolation-by-distance (continuous variation of populations among individuals)
[Figure panels: SFA, PCA, Admixture]
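The PCA constraints on Λ and F can be checked numerically. The sketch below assumes the genotype matrix is centered by column and factorized as ΛF with a singular value decomposition; the random genotype matrix, the choice K = 3, and the variable names are hypothetical. The last lines only illustrate the row-sum constraint on admixture-style loadings by drawing them from a Dirichlet distribution, and do not fit an admixture model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical genotype matrix: 50 individuals x 20 markers, entries 0/1/2.
G = rng.integers(0, 3, size=(50, 20)).astype(float)
Gc = G - G.mean(axis=0)

# PCA as a factorization Gc ~ Lambda @ F via the singular value decomposition.
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
K = 3
Lambda = U[:, :K] * s[:K]          # loadings: columns are mutually orthogonal
F = Vt[:K, :]                      # factors: rows are orthonormal

gram_lambda = Lambda.T @ Lambda    # diagonal if the columns are orthogonal
gram_f = F @ F.T                   # identity if the rows are orthonormal

# Admixture-style loadings live on the simplex instead: each row sums to 1.
Q = rng.dirichlet(np.ones(K), size=50)
```

The two Gram matrices make the difference between the constraint sets visible: PCA fixes the geometry of the factors, while the admixture model fixes the loadings to be proportions.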
Isolation-by-Distance Models
Clustered Populations in 1d Habitat
• SFA
– Assume two populations
– Assume five populations
• Admixture
– Assume two populations
– Assume five populations
• PCA
Analysis of European Genotype Data