
Motifs Recognition in DNA Sequences: Comparing the Motif Finding Automaton Algorithm against a Traditional Approach

Yazmín Magallanes, 2Iván Olmos, 1Mauricio Osorio, 1Luis O. Peredo, 1Christian Sarmiento.
1 Universidad de las Américas, Puebla. 2 Benemérita Universidad Autónoma de Puebla.
luisoscar@gmail.com

Abstract

This paper presents a comparison between two algorithms used to find inexact motifs (transformed into a set of exact sub-sequences) in a DNA sequence. The MFA algorithm searches for the set of exact sub-sequences by building a finite automaton in a way similar to the KMP algorithm. It is compared against the traditional automaton based on the basic idea of the subset construction. This traditional algorithm is implemented with some characteristics that increase its performance, yielding an algorithm that can be proven to produce the optimal number of states.

1. Introduction

The DNA motif search problem (MSP for short) consists of finding a pattern P (the motif to search for) in a text T (the DNA database built over the nucleotide alphabet: A for adenine, C for cytosine, G for guanine, and T for thymine), outputting the positions in T where P appears (the instances of P in T). The problem's complexity arises because T is much longer than P, P may have variations, and P may appear repeated times (instances may even overlap) to be considered interesting [1,2]. The importance of finding this type of pattern (called a motif) is related to biological problems such as finding gene functionality or gene regulatory subsequences [1,3]. In the MSP, T is defined over the alphabet B = {A, C, G, T}. However, P is a string whose letters are drawn from the union of two alphabets: the main alphabet B (as described before) and an extended alphabet E based on the IUPAC alphabet, where E = {R, Y, K, M, S, W, B, D, H, V, N}, with R = {G, A}, Y = {T, C}, K = {G, T}, M = {A, C}, S = {G, C}, W = {A, T}, B = {G, T, C}, D = {G, A, T}, H = {A, C, T}, V = {G, C, A}, and N = {A, C, G, T}. This alphabet is used to represent ambiguities in the pattern. As an example, if P = AKM, then this pattern produces the set of strings P = {AGA, ATA, AGC, ATC}. As a consequence, we need to find all instances in T of every string of P. Since P represents a set of patterns to search for, the MSP is equivalent to the exact set matching problem [1]. The patterns being searched will often involve letters with multiple values, meaning that an inappropriate algorithm could lead to an exponentially growing automaton. It must also be considered that the databases used are always substantially big, ranging from 10,000 characters up to millions of characters to be compared. Therefore, to improve search time, an automaton with the minimal number of states will most likely perform better. There are different approaches to solving this problem; this paper presents two: the MFA algorithm, which is explained in the next section, and the traditional approach using the well-known subset construction, taking some special considerations into account to improve its performance. Section 2 gives a brief description of the MFA algorithm and section 3 one of the traditional algorithm. Section 4 discusses the comparison between both algorithms, with the empirical results explained in section 4.1. Section 5 summarizes the most important outcomes.
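The expansion of a degenerate pattern into its set of exact strings can be sketched as follows (a minimal illustration; the function name `expand` is ours, not from the paper):

```python
from itertools import product

# IUPAC degenerate-letter codes from the extended alphabet E, together
# with the main alphabet B = {A, C, G, T}.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "ACGT",
}

def expand(pattern):
    """Expand a degenerate pattern into the set of exact strings it denotes."""
    return {"".join(p) for p in product(*(IUPAC[c] for c in pattern))}

print(sorted(expand("AKM")))   # ['AGA', 'AGC', 'ATA', 'ATC']
```

Note how the size of the expanded set is the product of the letter cardinalities, which is the source of the exponential growth discussed later.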

2. The MFA algorithm


The Motif Finding Automaton [4] (MFA for short) implements a strategy that stores knowledge about how matches of the patterns overlap with themselves. The automaton is constructed in three phases: first, a phase called the expansion of P is implemented, where the set of exact strings is generated from P; in a second phase, a matrix that stores the states of the MFA automaton is computed (the total number of states is computed from the cardinality of each character in P); finally, in a third phase, the transition matrix of the MFA automaton is generated. The key to this automaton is the construction of the transition matrix, because it is generated by a function that computes each transition row by row. This function finds the state representing the prefix that matches the longest suffix of the current state. This phase is computed very fast, because the states were generated in a specific order during the expansion process. However, the main disadvantage of this approach is the total number of states generated, which in the worst case is exponential with respect to the cardinality of P. Based on domain experts, though, in the DNA motif search problem the cardinality of P is usually small, between 6 and 32 nucleotides.
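The MFA construction itself is given in [4] and not reproduced here. As a generic illustration of the exact set matching problem it solves, and of the idea of precomputing how pattern prefixes overlap with suffixes, the following sketch builds a trie with failure links (an Aho-Corasick-style automaton; this is not the MFA or ETA construction, and all names are ours):

```python
from collections import deque

def build_automaton(patterns):
    # Trie with failure links over a set of exact strings (Aho-Corasick
    # style).  A failure link points to the state whose prefix is the
    # longest proper suffix of the current state's prefix.
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())          # depth-1 states fail to the root
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]       # inherit matches ending here
    return goto, fail, out

def search(text, patterns):
    # Report (end_position, pattern) for every occurrence of any pattern.
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((i, p) for p in sorted(out[s]))
    return hits

print(search("CAGATAGC", {"AGA", "ATA", "AGC", "ATC"}))
# [(3, 'AGA'), (5, 'ATA'), (7, 'AGC')]
```

The search loop itself is the same for any such automaton, which is why the paper compares only the construction (pre-processing) phases.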

3. ETA Algorithm
The Enhanced Traditional Automaton (ETA) is an algorithm based on the model used in the classical subset construction of automata. The traditional approach is used with some considerations in mind to improve efficiency and response time. The traditional automaton uses the finite state machine (FSM) mathematical model to change states according to the corresponding transition function. The procedure is carried out in two phases. The first phase pre-processes the given information to create the corresponding state table, which is then used in the second phase to search for matching strings. In the pre-processing, the alphabet (the possible letters to be found) is used to create the different states, by taking the possible combinations and assigning each of them a given state.

The special characteristics that the algorithm takes into consideration are: 1) the pre-processing works in an ordered way to avoid the creation of repeated states; 2) it does not have to create a non-deterministic model, since it uses the pattern to create the deterministic one directly.

The pseudo code used is the following:

Function automata(pattern)
  A table is defined as a state matrix with one row (the initial state)
  and four columns (A, C, G, and T); the initial states are set to zero
  // A state can be 0, 1, 32, 46, ... where a sub-state is a number
  // among them (32, for example)
  For each state in the table {
    For each sub-state in the state {
      For each column in the row {
        If the letter in the column is in the position of the
        sub-state in the pattern {
          Concatenate the value in (column, row) with (sub-state + 1)
        }
      }
    }
    For each new state in the table {
      A new row is added to the table with the same initial state
    }
  }
  Return table

A very important characteristic of this algorithm is the number of states it creates. This can be proven using the following theorem.

Greibach's theorem [5]: Let M be an NFA, A = L(M). Starting with M, do the following: (a) reverse the transitions and interchange start and final states to get an NFA for rev A; (b) make the resulting NFA deterministic by the subset construction, omitting inaccessible states; (c) do the above again. The resulting DFA is minimal with respect to the number of states.

Theorem 1: Given a pattern P, the ETA algorithm constructs a deterministic finite automaton with the minimum number of states.

Using Greibach's theorem it can be shown that the way the algorithm creates the automaton guarantees that it always produces a deterministic automaton which is also the minimal-state automaton. Theorem 1 is illustrated with the following example (the original figure, whose subset states are {1}, {1,2}, and {1,3}, is not reproduced here): (a) the transitions are reversed and the start and final states are interchanged; (b) the resulting automaton is already deterministic, which means there is no need to change it again.

(c-a) Repeating step (a), the resulting automaton is the same as the one used at the beginning of the theorem. (c-b) This step is the one the traditional algorithm executes, which is why this algorithm is optimized.

4. Comparison
The linear automaton (the MFA) was tested against the traditional automaton (the ETA, working in an ordered way). Some considerations have to be made in order to properly analyze the results. First of all, the search algorithms used by the two were not the same; however, since both share the same search criteria, it can be assumed that either one could use the other's search algorithm, so this factor is discarded. Two important times were considered: the pre-processing time, which accounts for the time each algorithm needs to create its automaton, and the search time, which defines which automaton is faster at finding the required strings. Results comparing these two times are shown in the next section. The most critical difference between the two algorithms, which directly affects the search time, is how they create their states. The linear algorithm creates an independent state for each of its possible solutions, which produces an exponential growth of the final automaton. However, the linear algorithm is designed to always end in the same final state, which helps when finding solutions. The traditional algorithm works as explained before; working in an ordered way is the key to reducing the pre-processing as well as the search time. This is based on the idea that if a new state happens to be smaller than the current one, it is not even considered. This, along with not allowing the same state to be written twice, helps create the smallest possible automaton, which was proven above. Having a small state table means that the search time is greatly reduced, since internally fewer states have to be compared. This has a direct impact on the memory used: it was also noticed that the traditional algorithm could still handle bigger patterns without clogging the computer even when the linear automaton had already failed.

In all the tests made, the traditional automaton showed faster processing times, except for a single case in which the linear automaton was 1 ms faster. This case has still to be studied, but it is believed that it could be associated with an external process that consumed processor capacity while the tests were taking place.

4.1. Analysis of the empirical results

The empirical analysis of both algorithms shows only the generation time for each pattern, since, as already stated, the search time is essentially the same regardless of the pattern being used. Graphs 1.1 to 1.4 present the empirical results of the MFA algorithm (dark bars) and the traditional algorithm (light bars) using a real database from the Asperguidos family. The x axis represents the strings used; the y axis shows the pre-processing execution time in milliseconds. Graphs 1.1 to 1.3 show the behavior of both algorithms based on the number of characters of the extended alphabet in the search pattern. It can easily be seen that for small patterns the behavior of both algorithms is similar; however, as the pattern grows, the execution time of the MFA algorithm grows exponentially. The difference between the two algorithms is quite substantial. In graph 1.4, the results are based on the number of characters of the extended alphabet for a fixed-size pattern of 32 letters. The same behavior is observed: by incrementing the number of characters from the extended alphabet, the creation time of the automaton in the MFA algorithm increases exponentially. Another important factor is that the kind of pattern used directly determines the time required for its creation: in patterns with multi-valued letters, such as R, which has 2 possibilities, the time required for processing the multiple possibilities must be considered. Regarding the number of created states, tests were carried out with two different patterns. The first one has 10 consecutive Ns and a single A at the end; the results are quite different, since the MFA automaton generated a total of 2,446,676 different states while the traditional automaton generated just 12. The second pattern tested was NNNNNA, which generated 7 different states with the traditional automaton and 2,388 with the MFA algorithm. Using this as a reference, it can be said that the traditional algorithm creates a total of x+2 states (x being the number of Ns) while the MFA algorithm creates (7*4^x)/3 - 4/3 states.
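The state counts reported above can be checked with a small sketch. This is a generic on-the-fly subset construction of the searching NFA, in the spirit of the sub-state sets of the ETA pseudo code but not the paper's actual implementation; all names are ours:

```python
ALPHABET = "ACGT"
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
         "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "ACGT"}

def eta_like_dfa(pattern):
    """Determinize the searching NFA for a degenerate pattern.  NFA state i
    means 'the first i pattern letters are matched'; state 0 has a self-loop
    on every letter (keep searching).  Each DFA state is a frozenset of NFA
    states, mirroring the sub-state sets of the ETA pseudo code."""
    m = len(pattern)
    start = frozenset([0])
    states, trans, work = {start: 0}, {}, [start]
    while work:
        s = work.pop()
        for ch in ALPHABET:
            # advance every active sub-state whose next letter matches ch
            t = frozenset({0} | {i + 1 for i in s
                                 if i < m and ch in IUPAC[pattern[i]]})
            if t not in states:
                states[t] = len(states)
                work.append(t)
            trans[(states[s], ch)] = states[t]
    return states, trans

def mfa_states(x):
    # Closed form quoted in the paper for N...NA with x Ns: (7*4^x)/3 - 4/3
    return (7 * 4**x - 4) // 3

for x in (5, 10):
    states, _ = eta_like_dfa("N" * x + "A")
    print(x, len(states), mfa_states(x))
# 5 7 2388
# 10 12 2446676
```

The construction reproduces both reported comparisons: x+2 states for the deterministic automaton against 2,388 and 2,446,676 states from the paper's MFA formula.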

Graph 1.1

Graph 1.4

Graph 2 shows the empirical results from the linear algorithm using real databases. X axis: strings used; Y axis: pre-process execution time in milliseconds; Z axis: database used. The objective of graph 2 is to show how the execution time increases exponentially when the size of the pattern is incremented; however, there is no real change when another database is used (all of the databases shown have different sizes).

Graph 1.2

Graph 2

4.2. Comparison Data


Graph 1.3

The tests were run on a Vaio VGN-CS170F laptop: Core 2 Duo 2.27 GHz, 4.00 GB of RAM, 64-bit Windows Vista.

The traditional algorithm was programmed in C#, while the MFA algorithm was programmed in C++.

5. Conclusions
Throughout this paper the results of the two approaches were shown. It was demonstrated that the traditional approach always returns the minimal-state automaton, something the MFA does not do. This directly affects the efficiency of the created automaton, under the basic premise that having fewer states means having fewer things to compare. The real difference can be seen in big patterns where many characters from the extended alphabet are used; in some cases the difference in automaton generation time is over 17,000 times. It was also shown that having big databases does not substantially affect the search time, since it increases in a linear way, and it does not affect the automaton generation time. In a test with an A and 15 Ns, the results showed how the traditional algorithm always creates an automaton with a minimal number of states, which proves to be much more efficient than the MFA algorithm, which generates over 200,000 times more states. This exaggerated number of states directly affects the computer memory and, as was seen, results in the MFA algorithm not being able to handle patterns like ANNNNNNNNNNNNNNN.

6. References
[1] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, Vol. 33, Issue 1, pp. 31-88 (2001).
[2] R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Communications of the ACM, Vol. 35, No. 10 (1992).
[3] S. Aluru. Handbook of Computational Molecular Biology. Chapman & Hall/CRC Computer and Information Science Series. ISBN 1584884061 (2005).
[4] G. Perez et al. An automaton for motifs recognition in DNA sequences. To appear in the proceedings of the MICAI 2009 conference. Springer-Verlag.
[5] D. C. Kozen. Automata and Computability. Springer, New York, 1997.
