Patrik D'haeseleer
Sequence motifs are becoming increasingly important in the analysis of gene regulation. How do we define sequence motifs,
and why should we use sequence logos instead of consensus sequences to represent them? Do they have any relation with
binding affinity? How do we search for new instances of a motif in this sea of DNA?
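Figure 1 (illustration: Bob Crimi)

a  Known Rox1 binding sites
HEM13  CCCATTGTTCTC
HEM13  TTTCTGGTTCTC
HEM13  TCAATTGTTTAG
ANB1   CTCATTGTTGTC
ANB1   TCCATTGTTCTC
ANB1   CCTATTGTTCTC
ANB1   TCCATTGTTCGT
ROX1   CCAATTGTTTTG

Consensus: YCHATTGTTCTC

Counts (position 1 to 12, 5' to 3')
A  0 0 2 7 0 0 0 0 0 0 1 0
C  4 6 4 1 0 0 0 0 0 5 0 5
G  0 0 0 0 0 1 8 0 0 1 1 2
T  4 2 2 0 8 7 0 8 8 2 6 1

Remaining panels: stacked-letter plots drawn 5' to 3', one scaled by counts (0 to 8) and two scaled in bits (0 to 2).

PRIMER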
TTGACA motif centered around −35, forms
the binding site for the σ70 subunit of the
core RNA polymerase. However, despite the
high degree of conservation at each position
(ranging from 54% to 82% for each base), it is
actually extremely rare to find a promoter that
matches this consensus sequence exactly, with
most promoters matching only 7 to 9 out of the
12 bases. Rather than representing a typical
binding sequence, the consensus sequence in
this case is instead a highly unusual sequence.
It turns out that the activity of each promoter
is related to how well it matches the consensus
sequence, so the activity level of each gene
can be fine-tuned by how much its −10 and
−35 regions deviate from the consensus.
A better description of the binding
sequence in this case is through a Position
Frequency Matrix (PFM). Rather than only
keeping track of the most common base at
each position, we record how often each
base occurs in known sites. For example,
the Rox1 transcription factor is known to
bind at least eight sites in three genes in the
Saccharomyces cerevisiae genome. Figure 1
shows the multiple alignment of these eight
binding sites, with a consensus sequence of
YCHATTGTTCTC. (Conventionally, a single
base is shown if it occurs in more than half
the sites and at least twice as often as the
second most frequent base. Otherwise, a
double-degenerate symbol is used if two bases occur
in more than 75% of the sites, or a
triple-degenerate symbol when one base does not
occur at all.) The frequency matrix and its
graphic representation in Figure 1 clearly
show a core motif of ATTGTT, with much
lower conservation in the flanking bases.
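The counting step and the consensus convention described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the function names and the IUPAC lookup table are introduced here for the example, and the fallback to N when no rule applies is an assumption.

```python
from collections import Counter

# The eight Rox1 binding sites from Figure 1.
SITES = [
    "CCCATTGTTCTC",
    "TTTCTGGTTCTC",
    "TCAATTGTTTAG",
    "CTCATTGTTGTC",
    "TCCATTGTTCTC",
    "CCTATTGTTCTC",
    "TCCATTGTTCGT",
    "CCAATTGTTTTG",
]
BASES = "ACGT"

# IUPAC ambiguity codes for sets of two or three bases (keys sorted A<C<G<T).
IUPAC = {
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V",
}


def position_frequency_matrix(sites):
    """Return one {base: count} dict per alignment column."""
    pfm = []
    for column in zip(*sites):
        counts = Counter(column)
        pfm.append({b: counts.get(b, 0) for b in BASES})
    return pfm


def consensus_symbol(counts):
    """Apply the convention quoted in the text to one column of counts."""
    total = sum(counts.values())
    ranked = sorted(BASES, key=lambda b: counts[b], reverse=True)
    first, second = ranked[0], ranked[1]
    # Single base: more than half the sites and at least twice the runner-up.
    if counts[first] > total / 2 and counts[first] >= 2 * counts[second]:
        return first
    # Double-degenerate: the two most common bases cover more than 75%.
    if counts[first] + counts[second] > 0.75 * total:
        return IUPAC["".join(sorted((first, second)))]
    # Triple-degenerate: exactly one base never occurs.
    present = [b for b in BASES if counts[b] > 0]
    if len(present) == 3:
        return IUPAC["".join(present)]
    return "N"  # assumption: fall back to N when no rule applies


pfm = position_frequency_matrix(SITES)
consensus = "".join(consensus_symbol(col) for col in pfm)
print(consensus)  # YCHATTGTTCTC
```

Position 3 shows why the thresholds matter: C occurs in exactly half the sites (not more than half), and C plus A cover exactly 75% (not more than 75%), so the rule falls through to the triple-degenerate symbol H.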
Sequence logos
By scaling each stack of letters in Figure 1d
with some measure of the conservation at
each base, we get a much clearer view of the
binding sequence. In a sequence logo, developed
by Schneider and Stephens¹, each stack
is scaled with the information content of the
base frequencies at that position:
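$$I_i = 2 + \sum_{b} f_{b,i} \log_2 f_{b,i} \qquad (1)$$
where f_{b,i} is the frequency of base b at position i.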
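As a rough sketch of this scaling, the snippet below computes the per-position information content from the Figure 1 counts. It assumes the standard Schneider and Stephens form, 2 bits minus the Shannon entropy of the base frequencies (that is, 2 + Σ f log2 f), and leaves out any small-sample correction. The names `COUNTS`, `information_content` and `letter_heights` are illustrative only.

```python
import math

# Count matrix from Figure 1 (rows: base, columns: positions 1 to 12).
COUNTS = {
    "A": [0, 0, 2, 7, 0, 0, 0, 0, 0, 0, 1, 0],
    "C": [4, 6, 4, 1, 0, 0, 0, 0, 0, 5, 0, 5],
    "G": [0, 0, 0, 0, 0, 1, 8, 0, 0, 1, 1, 2],
    "T": [4, 2, 2, 0, 8, 7, 0, 8, 8, 2, 6, 1],
}


def information_content(counts):
    """Return the information content (bits) of each column.

    Assumed form: I_i = 2 + sum_b f_{b,i} * log2(f_{b,i}),
    with 0 * log2(0) treated as 0.
    """
    n_pos = len(next(iter(counts.values())))
    bits = []
    for i in range(n_pos):
        column = [counts[b][i] for b in counts]
        total = sum(column)
        entropy = 0.0
        for c in column:
            if c > 0:
                f = c / total
                entropy -= f * math.log2(f)
        bits.append(2.0 - entropy)
    return bits


def letter_heights(counts, bits):
    """Height of each letter in a logo stack: frequency times stack height."""
    n_sites = sum(counts[b][0] for b in counts)
    return [
        {b: counts[b][i] / n_sites * bits[i] for b in counts}
        for i in range(len(bits))
    ]


bits = information_content(COUNTS)
heights = letter_heights(COUNTS, bits)
for pos, value in enumerate(bits, start=1):
    print(pos, round(value, 2))
```

Fully conserved columns such as position 5 come out at 2 bits, while position 1 (half C, half T) comes out at 1 bit, which is why the ATTGTT core dominates the logo.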
(2)
(3)
that the probability of binding to the known
binding sites (versus the more abundant
background DNA) is maximized. The optimal weight matrix is then given by:
$$W(b,i) = \log_2 \frac{f_{b,i}}{p_b} \qquad (4)$$
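A weight matrix of this form, W(b,i) = log2(f_{b,i} / p_b), can be built and used to scan a sequence roughly as follows. This is a hedged sketch rather than the article's procedure: the uniform background, the pseudocount (added so that unseen bases do not produce log2(0)) and the function names are all assumptions made for the example.

```python
import math

# Count matrix from Figure 1 (rows: base, columns: positions 1 to 12).
COUNTS = {
    "A": [0, 0, 2, 7, 0, 0, 0, 0, 0, 0, 1, 0],
    "C": [4, 6, 4, 1, 0, 0, 0, 0, 0, 5, 0, 5],
    "G": [0, 0, 0, 0, 0, 1, 8, 0, 0, 1, 1, 2],
    "T": [4, 2, 2, 0, 8, 7, 0, 8, 8, 2, 6, 1],
}
# Assumption: uniform background; a real genome would need measured p_b.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Assumption: pseudocount per base, not taken from the article.
PSEUDOCOUNT = 0.5


def weight_matrix(counts, background, pseudocount):
    """Return W as a list of {base: log2(f / p)} dicts, one per position."""
    n_pos = len(next(iter(counts.values())))
    n_sites = sum(counts[b][0] for b in counts)
    denom = n_sites + len(counts) * pseudocount
    weights = []
    for i in range(n_pos):
        column = {}
        for b in counts:
            f = (counts[b][i] + pseudocount) / denom
            column[b] = math.log2(f / background[b])
        weights.append(column)
    return weights


def score(site, weights):
    """Sum the weights of the bases observed at each position."""
    return sum(weights[i][base] for i, base in enumerate(site))


def best_hit(sequence, weights):
    """Return (score, offset) of the highest-scoring window."""
    width = len(weights)
    return max(
        (score(sequence[s:s + width], weights), s)
        for s in range(len(sequence) - width + 1)
    )


W = weight_matrix(COUNTS, BACKGROUND, PSEUDOCOUNT)
hit_score, hit_offset = best_hit("GGGG" + "TCCATTGTTCTC" + "GGGG", W)
print(hit_offset, round(hit_score, 2))
```

Summing per-position weights treats each position as contributing independently to the score, which is the usual simplification behind weight-matrix scanning.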