
RE-MuSiC: A Tool for Multiple Sequence Alignment with Regular Expression Constraints


Yun-Sheng Chung† Wei-Hsun Lee‡ Chuan Yi Tang† Chin Lung Lu‡,¶,∗

† Department of Computer Science


National Tsing Hua University
Hsinchu 300, Taiwan
yschung@algorithm.cs.nthu.edu.tw; cytang@cs.nthu.edu.tw
‡ Institute of Bioinformatics, and ¶ Department of Biological Science and Technology
National Chiao Tung University
Hsinchu 300, Taiwan
richard.bi94g@nctu.edu.tw; cllu@mail.nctu.edu.tw

1 Problem Formulation
The input consists of $\sigma$ sequences $S_1, \ldots, S_\sigma$ over an alphabet $\Sigma$, and a
sequence of constraints $R_1, \ldots, R_m$, where each $R_j$ is a regular expression. The goal is
to find an alignment with the highest possible score such that the constraints are satisfied.
An alignment $A$ of $S_1, \ldots, S_\sigma$ is said to satisfy the constraints if $A$ contains $m$
regions with the following property. Let the $\ell$th region, corresponding to the $\ell$th
constraint, be composed of consecutive columns $k_\ell, k_\ell + 1, \ldots, k'_\ell$,
$\ell = 1, \ldots, m$, and require it to precede the $(\ell + 1)$st region without overlapping.
Then the substring of each $S_i$ corresponding to region $\ell$ is required to match the regular
expression $R_\ell$. For an illustration, see Fig. 1.
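The definition above can be checked mechanically. The following is a minimal sketch, assuming an alignment is stored as a list of equal-length gapped strings and that the candidate regions are supplied as 0-based inclusive column ranges; the helper names `satisfies` and `match_score` are hypothetical and not part of RE-MuSiC. It verifies a given set of regions only and does not search for them.

```python
import re

def satisfies(rows, regexes, regions):
    """Return True if `regions` witness that the alignment `rows` satisfies `regexes`.

    rows:    gapped sequences of equal length ('-' is the gap symbol)
    regexes: one regular expression per constraint, in order
    regions: one (first_column, last_column) pair per constraint, 0-based, inclusive
    """
    if len(regions) != len(regexes):
        return False
    prev_end = -1
    for (start, end), pattern in zip(regions, regexes):
        # Region l must lie strictly after region l-1 (no overlap).
        if start <= prev_end or end < start:
            return False
        for row in rows:
            residues = row[start:end + 1].replace("-", "")
            if re.fullmatch(pattern, residues) is None:
                return False
        prev_end = end
    return True

def match_score(rows):
    """Pairwise score used in Fig. 1: 1 per matching column, 0 otherwise."""
    top, bottom = rows
    return sum(1 for x, y in zip(top, bottom) if x == y and x != "-")

# The two alignments of Fig. 1, written in lowercase.
unconstrained = ["-cgacgta", "acg-cgta"]
constrained = ["cgacg--ta", "--acgcgta"]
```

With the regions `[(2, 2), (7, 7)]`, the constrained alignment of Fig. 1 passes the check for `["a", "t"]`, at the price of a lower match score than the unconstrained one.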

2 Method
First consider the case of pairwise alignment. For simplicity, let the two input sequences
both have length $n$. Let $A_h = (Q_h, \Sigma, \delta_h, q_h, F_h)$ be an $\epsilon$-free NFA
equivalent to $R_h$ ($q_h$ is the initial state of $A_h$), and let
$M_h = (Q^{M_h}, W^{M_h}, \Sigma^{M_h}, \delta^{M_h}, q_0^{M_h}, F^{M_h})$ be the weighted
automaton corresponding to $A_h$, where $Q^{M_h} = Q_h \times Q_h$,
$\Sigma^{M_h} = ((\Sigma \cup \{-\}) \times (\Sigma \cup \{-\})) \setminus \{(-, -)\}$,
$q_0^{M_h} = (q_h, q_h)$, $F^{M_h} = F_h \times F_h$, and, for $(p, q) \in Q^{M_h}$ and
$(a, b) \in \Sigma^{M_h}$,

$$
\delta^{M_h}((p, q), (a, b)) =
\begin{cases}
\delta_h(p, a) \times \delta_h(q, b) & \text{if } (p, q) \notin F^{M_h} \cup \{q_0^{M_h}\}, \\
(\delta_h(p, a) \times \delta_h(q, b)) \cup \{(p, q)\} & \text{otherwise.}
\end{cases}
$$

∗ To whom correspondence should be addressed.

Input
    S1 = cgacgta, S2 = acgcgta
    R1 = a, R2 = t

Optimal unconstrained alignment
    - c g a c g t a
    a c g - c g t a

Optimal constrained alignment
    c g A c g - - T a
    - - A c g c g T a

Figure 1: An illustration of a constrained alignment. Here a match has a score of 1, while
all other cases are scored 0. Capital letters in the constrained alignment represent the
columns responsible for the satisfaction of the constraints.

On each pair $(i_1, i_2)$ of indices of the sequences, denote the corresponding weighted
automaton as $M_h^{i_1,i_2}$. Also let $M$ be an NFA obtained by “concatenating”
$M_1, \ldots, M_m$, that is, by adding an “empty move” from each final state of $M_h$ to the
initial state of $M_{h+1}$, $1 \le h < m$. The initial state of $M$ is set to be the initial
state of $M_1$, and the final states of $M$ are the final states of $M_m$. Hence $M$ accepts
all feasible constrained alignments (see Fig. 2 for an example). The score held in state
$(p, q)$ of $M_h^{i_1,i_2}$ (resp. $M^{i_1,i_2}$) is denoted as $W_h^{i_1,i_2}(p, q)$ (resp.
$W^{i_1,i_2}(p, q)$). The score $W_h^{i_1,i_2}(p, q)$ represents the score of a best alignment
$A$ of $S_1[1..i_1]$ and $S_2[1..i_2]$ such that all of $R_1, \ldots, R_{h-1}$ are satisfied
and state $(p, q)$ of $M_h$ is reached when $A$ is given to $M_h$ as input. It is equal to
$W^{i_1,i_2}(p, q)$, the score of optimally aligning $S_1[1..i_1]$ and $S_2[1..i_2]$ such that
the alignment leads from the initial state of $M^{i_1,i_2}$ to state $(p, q)$. The goal of the
algorithm is to compute $W_m^{n,n}$, since $\max_{(p,q) \in F_m \times F_m} W_m^{n,n}(p, q)$ is
the optimal score of aligning $S_1$ and $S_2$ with all $m$ regular expression constraints
satisfied.
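The transition function of $M_h$ used above can be sketched directly from its definition. This is a minimal illustration, assuming the NFA $A_h$ is stored as a dictionary from (state, symbol) to a set of states; the names `step` and `product_delta` are hypothetical. The text does not spell out $\delta_h(p, -)$, so the sketch assumes that a gap leaves the state of $A_h$ unchanged.

```python
from itertools import product

GAP = "-"

def step(delta, state, symbol):
    """delta_h(state, symbol) for the single-sequence NFA A_h."""
    # Assumption (not stated in the text): a gap consumes no input,
    # so the state of A_h is unchanged.
    if symbol == GAP:
        return {state}
    return delta.get((state, symbol), set())

def product_delta(delta, initial, finals, pair_state, pair_symbol):
    """delta^{M_h}((p, q), (a, b)) as a set of pair states."""
    p, q = pair_state
    a, b = pair_symbol
    if (a, b) == (GAP, GAP):
        raise ValueError("(-, -) is not in the alphabet of M_h")
    targets = set(product(step(delta, p, a), step(delta, q, b)))
    is_initial = (p, q) == (initial, initial)
    is_final = p in finals and q in finals
    if is_initial or is_final:
        # Initial and final pair states also loop on every edit operation.
        targets.add((p, q))
    return targets

# A_1 for the constraint R1 = a of Fig. 1: p1 --a--> p2, with p2 final.
delta_a = {("p1", "a"): {"p2"}}
```

For example, `product_delta(delta_a, "p1", {"p2"}, ("p1", "p1"), ("-", "a"))` yields the pair states `("p1", "p2")` and `("p1", "p1")`, which correspond to the arc labeled (-, a) and the self-loop on the initial state in Fig. 2(a).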
The algorithm iterates over all pairs $(i_1, i_2)$ of indices of sequences $S_1$ and $S_2$,
row by row. It computes $W_h^{i_1,i_2}$ for all $1 \le i_1, i_2 \le n$ and $1 \le h \le m$
throughout its execution. Let $V_h$ and $E_h$ be the set of states and the set of arcs of
automaton $A_h$, respectively. Denote as $L_h^{i_1,i_2}$ a $|V_h| \times |V_h|$ table used to
hold $W_h^{i_1,i_2}$. On each sequence index pair $(i_1, i_2)$, the algorithm computes
$L_1^{i_1,i_2}, \ldots, L_m^{i_1,i_2}$, in this order. For all $(p, q) \in Q^{M_1}$,
$L_1^{i_1,i_2}[p, q]$ is computed using the (first) algorithm we proposed in [3]. The same is
performed for $L_h^{i_1,i_2}$, $h > 1$, except that $L_h^{i_1,i_2}[q_h, q_h]$ is initialized to
the maximum of $L_{h-1}^{i_1,i_2}[p, q]$ rather than $-\infty$ as the other states are, the
maximum taken over all final states $(p, q)$ of $M_{h-1}$.
The above procedure takes $O(\sum_{h=1}^{m} |V_h||E_h| n^2)$ time, since each $L_h^{i_1,i_2}$
takes $O(|E_h||V_h|)$ time by [3]. In what follows we present how to reconstruct the optimal
alignment in

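(a) Automaton $M^{6,5}$ (score $W^{6,5}$ held in each state):
    States of $M_1^{6,5}$: p11: 4, p12: −∞, p21: −∞, p22: 3
    States of $M_2^{6,5}$: r11: 3, r12: −∞, r21: 3, r22: −∞
    Arcs of $M_1^{6,5}$: p11 → p12 on (-, a); p11 → p21 on (a, -); p11 → p22 on (a, a);
        p12 → p22 on (a, -); p21 → p22 on (-, a); self-loops on p11 and p22 on all edit operations
    Arcs of $M_2^{6,5}$: r11 → r12 on (-, t); r11 → r21 on (t, -); r11 → r22 on (t, t);
        r12 → r22 on (t, -); r21 → r22 on (-, t); self-loops on r11 and r22 on all edit operations
    Empty move: p22 → r11

(b) Alignment underlying $W^{6,5}[r21]$:
    c g A c g - - t
    - - A c g c g -
with
    $\beta_{1,2}^{6,5}[r21] = (2, 0)$, $\eta_{1,2}^{6,5}[r21] = (3, 1)$,
    $\beta_{2,2}^{6,5}[r21] = (5, 5)$, $\eta_{2,2}^{6,5}[r21] = (6, 5)$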

Figure 2: An illustration of the weighted automaton $M$ corresponding to the example in
Fig. 1. State $(p_2, p_1)$, etc., is denoted as p21, etc., for brevity. (a) Automaton
$M^{6,5}$, which is the concatenation of $M_1^{6,5}$ and $M_2^{6,5}$. Numbers in the states are
the scores $W^{6,5}[p11]$, etc. Initial states of $M_1^{6,5}$ and $M_2^{6,5}$ are p11 and r11,
respectively, while the final states are p22 and r22, respectively. The initial and final
states of $M^{6,5}$ are p11 and r22, respectively. (b) The alignment underlying $W^{6,5}[r21]$
is the optimal alignment of $S_1[1..6]$ and $S_2[1..5]$ such that state r21 is reached.

$O(\sum_{h=1}^{m} h|V_h|^2 n)$ space without affecting the time complexity, by a
generalization of the method we proposed in [3].
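The score computation described in this section (tables $L_h^{i_1,i_2}$ filled row by row, with the initial state of $M_h$ seeded from the best final state of $M_{h-1}$) can be illustrated by a direct evaluation of the recurrence. This is only a sketch under stated assumptions: it is not the faster per-cell algorithm of [3], it keeps every cell in memory, it assumes a gap leaves the NFA state unchanged, and the names `constrained_score`, `nfa_a`, `nfa_t` and `unit` are hypothetical.

```python
from itertools import product

GAP, NEG_INF = "-", float("-inf")

def step(delta, state, symbol):
    # Assumption: a gap leaves the NFA state unchanged.
    return {state} if symbol == GAP else delta.get((state, symbol), set())

def constrained_score(s1, s2, nfas, gamma):
    """Best score of an alignment of s1 and s2 accepted by the concatenated automaton M.

    nfas: one (states, delta, initial, finals) tuple per constraint, epsilon-free.
    gamma: scoring function on edit operations (x, y).
    """
    table = {}  # table[(i1, i2)][h][(p, q)] plays the role of L_h^{i1,i2}[p, q]
    for i1, i2 in product(range(len(s1) + 1), range(len(s2) + 1)):  # row by row
        # Edit operations ending at (i1, i2): predecessor cell and column (x, y).
        moves = []
        if i1 > 0 and i2 > 0:
            moves.append(((i1 - 1, i2 - 1), s1[i1 - 1], s2[i2 - 1]))
        if i1 > 0:
            moves.append(((i1 - 1, i2), s1[i1 - 1], GAP))
        if i2 > 0:
            moves.append(((i1, i2 - 1), GAP, s2[i2 - 1]))
        cell = []
        for h, (states, delta, init, finals) in enumerate(nfas):
            scores = {pq: NEG_INF for pq in product(states, repeat=2)}
            if h == 0:
                if (i1, i2) == (0, 0):
                    scores[(init, init)] = 0
            else:
                # Empty move from the best final state of M_{h-1} at the same cell.
                prev_finals = product(nfas[h - 1][3], repeat=2)
                scores[(init, init)] = max(cell[h - 1][pq] for pq in prev_finals)
            for prev, x, y in moves:
                for (p, q), w in table[prev][h].items():
                    if w == NEG_INF:
                        continue
                    targets = set(product(step(delta, p, x), step(delta, q, y)))
                    if (p in finals and q in finals) or (p, q) == (init, init):
                        targets.add((p, q))  # self-loop on initial and final states
                    for t in targets:
                        scores[t] = max(scores[t], w + gamma(x, y))
            cell.append(scores)
        table[(i1, i2)] = cell
    last_finals = product(nfas[-1][3], repeat=2)
    return max(table[(len(s1), len(s2))][-1][pq] for pq in last_finals)

# Fig. 1: R1 = a and R2 = t as two-state NFAs; a match scores 1, everything else 0.
nfa_a = ({"p1", "p2"}, {("p1", "a"): {"p2"}}, "p1", {"p2"})
nfa_t = ({"r1", "r2"}, {("r1", "t"): {"r2"}}, "r1", {"r2"})
unit = lambda x, y: 1 if x == y else 0
best = constrained_score("cgacgta", "acgcgta", [nfa_a, nfa_t], unit)
```

On the input of Fig. 1, `best` equals the score of the constrained alignment shown there. Because the sketch enumerates pairs of arcs for every cell, it is slower than the bound quoted above and is meant only to make the recurrence concrete.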
Consider an optimal alignment $A$ satisfying all constraints. Intuitively, if we know
the substrings of $S_1$ and $S_2$ responsible for $A$'s satisfaction of the constraints, then
$A$ (or another optimal solution) can be reconstructed efficiently. Suppose that the
substrings of $S_1$ and $S_2$ responsible for $A$'s satisfying the $\ell$th regular expression
are $S_1[b_1^\ell..e_1^\ell]$ and $S_2[b_2^\ell..e_2^\ell]$, $1 \le \ell \le m$. Then we can
align $S_1[b_1^\ell..e_1^\ell]$ and $S_2[b_2^\ell..e_2^\ell]$ for all $\ell = 1, \ldots, m$,
align $S_1[1..b_1^1 - 1]$ and $S_2[1..b_2^1 - 1]$, align
$S_1[e_1^\ell + 1..b_1^{\ell+1} - 1]$ and $S_2[e_2^\ell + 1..b_2^{\ell+1} - 1]$ for
$\ell = 1, \ldots, m - 1$, and align $S_1[e_1^m + 1..n]$ and $S_2[e_2^m + 1..n]$. These
alignments can be concatenated in the proper order to obtain $A$. Each alignment can be
computed in linear space by Hirschberg's celebrated divide-and-conquer algorithm [4]. In the
example of Fig. 1, $b_1^1 = 3$, $e_1^1 = 3$, $b_2^1 = 1$, $e_2^1 = 1$, $b_1^2 = 6$,
$e_1^2 = 6$, $b_2^2 = 6$ and $e_2^2 = 6$.
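The piecewise reconstruction described above can be sketched as follows, assuming the boundary indices are already known. For brevity the sketch aligns each piece with a plain quadratic-space dynamic program rather than Hirschberg's linear-space algorithm [4]; the names `global_align` and `reconstruct` are hypothetical, and ties between equally good alignments are broken arbitrarily.

```python
def global_align(a, b, gamma):
    """Optimal global alignment of a and b (quadratic space, stand-in for [4])."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gamma(a[i - 1], "-")
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gamma("-", b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + gamma(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gamma(a[i - 1], "-"),
                              score[i][j - 1] + gamma("-", b[j - 1]))
    # Trace back from the bottom-right corner.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + gamma(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gamma(a[i - 1], "-"):
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

def reconstruct(s1, s2, bounds, gamma):
    """bounds: one ((b1, e1), (b2, e2)) pair per constraint, 1-based and inclusive."""
    pieces, c1, c2 = [], 0, 0  # c1, c2: characters of s1, s2 consumed so far
    for (b1, e1), (b2, e2) in bounds:
        pieces.append((s1[c1:b1 - 1], s2[c2:b2 - 1]))  # stretch before the region
        pieces.append((s1[b1 - 1:e1], s2[b2 - 1:e2]))  # the region itself
        c1, c2 = e1, e2
    pieces.append((s1[c1:], s2[c2:]))                  # stretch after the last region
    aligned = [global_align(x, y, gamma) for x, y in pieces]
    return "".join(t for t, _ in aligned), "".join(b for _, b in aligned)

# Boundaries of the Fig. 1 example; a match scores 1, everything else 0.
unit = lambda x, y: 1 if x == y else 0
top, bottom = reconstruct("cgacgta", "acgcgta", [((3, 3), (1, 1)), ((6, 6), (6, 6))], unit)
```

The resulting rows may differ from the alignment printed in Fig. 1 in where the gaps between the two regions are placed, but the total score is the same.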
Two types of $|V_h| \times |V_h|$ tables, $\beta_{\ell,h}^{i_1,i_2}$ and
$\eta_{\ell,h}^{i_1,i_2}$, $1 \le \ell \le h$, are maintained for this purpose. Let $A$ be the
alignment underlying the score $W_h^{i_1,i_2}(p, q)$. Then $\beta_{\ell,h}^{i_1,i_2}[p, q]$
keeps the indices of $S_1$ and $S_2$ corresponding to the column immediately before the first
column of $A$ that leaves the initial state of $M_\ell$. If $\ell = h$ (resp. $\ell < h$), then
$\eta_{\ell,h}^{i_1,i_2}[p, q]$ keeps the indices of $S_1$ and $S_2$ corresponding to the first
column of $A$ that leads us to state $(p, q)$ (resp. a final state) of $M_\ell$. It can be seen
that, if $(p, q)$ is the final state of $M_m$ with the best $W_m^{n,n}(p, q)$ value, then
$\beta_{\ell,m}^{n,n}[p, q]$ stores $b_1^\ell - 1$ and $b_2^\ell - 1$, and
$\eta_{\ell,m}^{n,n}[p, q]$ stores $e_1^\ell$ and $e_2^\ell$, where the meanings of these $b$
and $e$ values are as discussed in the last paragraph.¹
The computation of $\beta_{h,h}^{i_1,i_2}$ and $\eta_{h,h}^{i_1,i_2}$, $h = 1, \ldots, m$,
proceeds in the same manner as in [3].² As to $\beta_{\ell,h}^{i_1,i_2}[q_h, q_h]$ and
$\eta_{\ell,h}^{i_1,i_2}[q_h, q_h]$, $\ell \ne h$, we initialize them to
$\beta_{\ell,h-1}^{i_1,i_2}[p, q]$ and $\eta_{\ell,h-1}^{i_1,i_2}[p, q]$, respectively, where
$(p, q)$ is the final state of $M_{h-1}^{i_1,i_2}$ with the best $W_{h-1}^{i_1,i_2}(p, q)$
score among all final states. In general, when $\ell \ne h$, for an (initial or non-initial)
state $(p, q)$, each time $L_h^{i_1,i_2}[p, q]$ is updated to, say,
$L_h^{i_1',i_2'}[p', q'] + \gamma(x, y)$, where
$(i_1', i_2') \in \{(i_1 - 1, i_2 - 1), (i_1 - 1, i_2), (i_1, i_2 - 1)\}$ and $(x, y)$ is the
corresponding edit operation (e.g., $(x, y) = (S_1[i_1], -)$ if $i_1' = i_1 - 1$ and
$i_2' = i_2$), $\beta_{\ell,h}^{i_1,i_2}[p, q]$ and $\eta_{\ell,h}^{i_1,i_2}[p, q]$ are also
updated to $\beta_{\ell,h}^{i_1',i_2'}[p', q']$ and $\eta_{\ell,h}^{i_1',i_2'}[p', q']$,
respectively.
When row $i_1$ is being considered, the tables for all rows before $i_1 - 1$ are no longer
needed and can be discarded. Therefore the overall space requirement, including the
reconstruction of the optimal alignment, is $O(\sum_{h=1}^{m} h|V_h|^2 n)$.
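The row-discarding idea can be seen in isolation on the unconstrained score, where only the previous and the current row are ever read. The sketch below is a simplification for illustration, assuming the Fig. 1 scoring (1 for a match, 0 otherwise) and ignoring the automaton tables and the $\beta$ and $\eta$ bookkeeping; the name `match_score_two_rows` is hypothetical.

```python
def match_score_two_rows(s1, s2):
    """Optimal unconstrained score (match = 1, otherwise 0) using two rows only."""
    prev = [0] * (len(s2) + 1)          # row i1 - 1
    for i1 in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)      # row i1, built from row i1 - 1
        for i2 in range(1, len(s2) + 1):
            diag = prev[i2 - 1] + (1 if s1[i1 - 1] == s2[i2 - 1] else 0)
            curr[i2] = max(diag, prev[i2], curr[i2 - 1])
        prev = curr                     # rows before i1 are no longer referenced
    return prev[-1]
```

For the sequences of Fig. 1 this returns the score of the optimal unconstrained alignment while storing a number of cells linear in the sequence length.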
In [1], an algorithm extending the one in [2] to support multiple constraints and
multiple sequences is proposed. In the pairwise case, the algorithm in [1] has time
complexity $O(\sum_{h=1}^{m} |E_h|^2 n^2)$ and space complexity
$O(\sum_{h=1}^{m} |E_h|^2 n)$. It is clear that $|V_h| = O(|E_h|)$, hence the algorithm
presented here never takes more time than the one in [1]. In the worst case, $|E_h|$ can be
proportional to $|V_h|^2$, and the algorithm in [1] takes $O(\sum_{h=1}^{m} |V_h|^4 n^2)$
time, while the one presented here takes $O(\sum_{h=1}^{m} |V_h|^3 n^2)$ time. Hence
the time complexity of the algorithm presented here compares favorably with the one
in [1]. As to the space complexity, if $|E_h| = O(|V_h|)$, then the algorithm here may
take more space than the one in [1]. However, if
$\sum_{h=1}^{m} |E_h|^2 = \Omega(\sum_{h=1}^{m} h|V_h|^2)$, then the algorithm here is more
space efficient. More importantly, the stated space complexity for the algorithm in [1] does
not include the reconstruction of the alignment; only the alignment score can be computed.
Reporting the alignment is clearly an important issue for a web server. If a naïve
backtracking method is used to augment the algorithm in [1] to reconstruct the alignment, the
space requirement would be $O(\sum_{h=1}^{m} |E_h|^2 n^2)$, which is too high for practical
use.
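The worst-case time comparison above follows by substituting $|E_h| = \Theta(|V_h|^2)$ into both bounds. The numbers in the last line ($m = 1$, $|V_1| = 10$, $|E_1| = 100$, $n = 1000$) are purely illustrative assumptions, not measurements, and ignore constant factors.

```latex
\begin{align*}
\text{algorithm in [1]:}\quad
  & \sum_{h=1}^{m} |E_h|^2 n^2
    = \Theta\Bigl(\sum_{h=1}^{m} |V_h|^4 n^2\Bigr), \\
\text{algorithm here:}\quad
  & \sum_{h=1}^{m} |V_h|\,|E_h|\, n^2
    = \Theta\Bigl(\sum_{h=1}^{m} |V_h|^3 n^2\Bigr), \\
\text{illustration:}\quad
  & 100^2 \cdot 1000^2 = 10^{10}
    \quad\text{versus}\quad
    10 \cdot 100 \cdot 1000^2 = 10^{9}.
\end{align*}
```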
To deal with multiple sequences, it is easy to adopt a progressive method that uses the
above algorithm for pairwise alignment as its kernel. Although the resulting solution may not
be mathematically optimal, its running time is far more practical for implementation in a
web server.
¹ When $(p, q)$ is not a final state, the values of $\eta_{h,h}^{i_1,i_2}[p, q]$ are actually
not relevant (but the $\beta_{h,h}^{i_1,i_2}[p, q]$ values are still critical).
² If $(p, q)$ is not a final state, then the value held in $\eta_{h,h}^{i_1,i_2}[p, q]$ may not
be as described in the last paragraph. But as just mentioned, this does not affect the
correctness.

References
[1] A. N. Arslan. Multiple sequence alignment containing a sequence of regular expressions.
In Proc. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational
Biology (CIBCB 2005), pages 1–7, 2005.

[2] A. N. Arslan. Regular expression constrained sequence alignment. In Proc. 16th Annual
Symposium on Combinatorial Pattern Matching (CPM 2005), volume 3537 of Lecture Notes in
Computer Science, pages 322–333. Springer, 2005.

[3] Y.-S. Chung, C. L. Lu, and C. Y. Tang. Efficient algorithms for regular expression
constrained sequence alignment. In Proc. 17th Annual Symposium on Combinatorial Pattern
Matching (CPM 2006), volume 4009 of Lecture Notes in Computer Science, pages 389–400.
Springer, 2006.

[4] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences.
Communications of the ACM, 18:341–343, 1975.
