
Proceedings of the International Conference on Computer and Communication Engineering 2008

May 13-15, 2008 Kuala Lumpur, Malaysia

Parameterized Shift-And String Matching Algorithm Using Super Alphabet


Rajesh Prasad (1), Suneeta Agarwal (2)
(1) Motilal Nehru National Institute of Technology, Allahabad, India-211004
(2) Motilal Nehru National Institute of Technology, Allahabad, India-211004
rajesh_ucer@yahoo.com
978-1-4244-1692-9/08/$25.00 ©2008 IEEE

Abstract

In parameterized string matching, a given pattern P is said to match a substring t of the text T if there exists a bijection from the symbols of P to the symbols of t. This problem has an important application in software maintenance, where it is required to find equivalence between two sections of code. Two sections of code are said to be equivalent if one can be transformed into the other by renaming identifiers and variables only. In this paper, we extend the shift-and string matching algorithm to find all such occurrences of P in T. The new algorithm is named the parameterized shift-and (PSA) string matching algorithm. We further extend PSA by using the concept of super alphabets. Implementation results show that by using a super alphabet of size s, the algorithm (PSA) is sped up by a factor of s, where s is the size of the super alphabet (i.e. s is the number of characters processed simultaneously). We also show the performance of super alphabet PSA with respect to the duplicity present in the code. However, these algorithms are applicable only when the pattern length (m) is less than or equal to the word length (w) of the computer used (i.e. m ≤ w).

Keywords: Algorithm, finite automata, prev-encoding, bit-parallelism, shift-and, parameterized matching.

I. INTRODUCTION

In the traditional string matching problem, all occurrences of a given pattern P[0…m-1] in the given text T[0…n-1] are to be reported. Many algorithms for solving this type of problem exist [1, 2, 3, 4]. In parameterized string matching [5], we have two disjoint alphabets: Σ for fixed symbols and Π for parameter symbols. The symbols of P and T are taken from Σ ∪ Π. In this type of matching, while looking for occurrences of P in T, the symbols of Σ must match exactly, whereas the symbols of Π can be renamed. A given pattern P is said to match a substring t of the text T if there exists a bijection from the symbols of P to the symbols of t. This problem has important applications in software maintenance and plagiarism detection. In software maintenance, it is required to find equivalence between two sections of code. Two sections of code are said to be equivalent if one can be transformed into the other via a one-to-one correspondence.

A number of algorithms have already been developed for the exact string matching problem, but only a few exist for parameterized matching. In [5], a parameterized on-line matching algorithm for a single pattern was developed. This algorithm runs in O(n log min(m, |Π|)) worst-case time. Another algorithm was given in [7] that achieves the same time bound both in the average and in the worst case. In [8], a shift-or string matching algorithm for the parameterized string matching problem was developed.

In this paper, we generalize the shift-and string matching algorithm [9] to the parameterized string matching problem. The new algorithm is named the parameterized shift-and (PSA) string matching algorithm. We further extend this new algorithm (PSA) by using the concept of super alphabets [10]. By using a super alphabet of size s, algorithm PSA is sped up by a factor of s, where s is the size of the super alphabet (i.e. s is the number of characters processed simultaneously). We also show the performance of super alphabet PSA with respect to the duplicity present in the code. To the best of our knowledge, these generalizations of the shift-and string matching algorithm have not been studied.

The paper is organized as follows. In Section 2 we explain the parameterized string matching problem. In Section 3 we present the shift-and string matching algorithm for the exact string matching problem. In Section 4 we develop the parameterized shift-and string matching algorithm. In Section 5, we present the super alphabet

approach to the parameterized shift-and string matching algorithm. We then present experimental results in Section 6. In Section 7, we give conclusions and future work.

II. PARAMETERIZED STRING MATCHING PROBLEM

We assume that the pattern is P[0…m-1] and the text is T[0…n-1]. All the symbols of P and T are taken from Σ ∪ Π, where Σ is the fixed symbol alphabet of size σ and Π is the parameter symbol alphabet of size π. The pattern P matches the text substring T[j…j+m-1] if and only if ∀i ∈ {0, 1, 2, …, m-1}, fj(P[i]) = T[j+i], where fj(.) is a bijective mapping on Σ ∪ Π. It must be the identity on Σ but need not be the identity on Π. For example, let Σ = {A, B}, Π = {X, Y, Z, W} and P = XYABX. P matches the text substring ZWABZ with the bijective mapping X → Z and Y → W.

This mapping can be simplified by prev encoding [2]. For any string S, prev(S) maps each parameter symbol s of S to a non-negative integer p, where p is the number of symbols since the last occurrence of s in S. The first occurrence of any parameter symbol is encoded as 0, and if s ∈ Σ it is mapped to itself (i.e. to s). For example, prev(P) = 00AB4 and prev(ZWABZ) = 00AB4. With this scheme of prev encoding, the problem of parameterized string matching can be transformed into the traditional string matching problem, where prev(P) is matched against prev(T[j…j+m-1]) for all j = 0, 1, 2, …, n-m. The prev(P) and prev(T[j…j+m-1]) can be recursively updated as j increases with the help of the following lemma [5].

Lemma 1: Let S' = prev(S). Then for S'' = prev(S[j…j+m-1]), ∀i such that S[i] ∈ Π it holds that S''[i] = S'[i] if and only if S'[i] < m, otherwise S''[i] = 0.

III. SHIFT-AND STRING MATCHING ALGORITHM

Before presenting the shift-and string matching algorithm [9], we define the following terms:
- b(m-1)…b(1)b(0) denotes the bits of a computer word of length m.
- Exponentiation is used to denote bit repetition (e.g. 0^4 1 = 00001).
- C-like syntax is used for operations on the bits of computer words: | is bit-wise or, & is bit-wise and, ~ complements all the bits.
- The shift left operation, << r, moves all bits to the left and enters r zeros from the right, i.e. b(m-1)b(m-2)…b(1)b(0) << r = b(m-r-1)…b(1)b(0) 0^r.
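Before turning to the automaton, the prev encoding of Section II can be made concrete with a small Python sketch. The function names `prev_encode` and `window_encoding` are illustrative and do not come from the paper. The window helper assumes the position-wise reading of Lemma 1 that Section IV relies on: an offset that points before the start of the window is reset to 0.

```python
def prev_encode(s, params):
    """Return the prev encoding of s as a list.

    Fixed symbols are kept as they are; each parameter symbol becomes the
    distance to its previous occurrence, or 0 on its first occurrence.
    """
    last = {}   # last position seen for each parameter symbol
    out = []
    for i, ch in enumerate(s):
        if ch in params:
            out.append(i - last[ch] if ch in last else 0)
            last[ch] = i
        else:
            out.append(ch)
    return out


def window_encoding(full, j, m):
    """Prev encoding of the window s[j:j+m], derived from full = prev(s).

    Assumption: an offset larger than the symbol's position inside the
    window refers to an occurrence outside the window, so it becomes 0.
    """
    out = []
    for k in range(m):
        v = full[j + k]
        if isinstance(v, int) and v > k:
            v = 0
        out.append(v)
    return out


if __name__ == "__main__":
    print(prev_encode("XYABX", "XYZW"))   # [0, 0, 'A', 'B', 4]
    print(prev_encode("ZWABZ", "XYZW"))   # [0, 0, 'A', 'B', 4]
```

Both strings of the example in Section II receive the same encoding, which is what reduces parameterized matching to exact matching on encoded strings.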

The shift-and algorithm is a variant of shift-or (it is slightly less efficient than shift-or). In the shift-and algorithm, an automaton is constructed as follows. The automaton has states 0, 1, 2, …, m. State 0 is the initial state, state m is the final state, and for i = 0, 1, …, m-1 there is a transition from state i to state i+1 on character P[i]. In addition, there is a transition for every c ∈ Σ from the initial state to itself, which makes the automaton non-deterministic. For example, for the pattern P = ba, the shift-and automaton is shown in Figure 1.

[Figure 1: states 0, 1 and 2; a self-loop on state 0 labelled a, b; a transition from state 0 to state 1 on b; a transition from state 1 to state 2 on a.]

Figure 1. A shift-and automaton for P = ba

The preprocessing algorithm builds a table B having one bit mask entry for each c ∈ Σ. For 0 ≤ i ≤ m-1, the mask B[c] has its ith bit set to 1 iff P[i] = c; otherwise it is 0. If the ith bit in B[c] is 1, then in the automaton, on character c, there is a transition from state i to state i+1. For example, consider the pattern P = abca with m = 4 and Σ = {a, b, c, d}. According to the above definition, B[a] = 1001, B[b] = 0010, B[c] = 0100 and B[d] = 0000 (assuming only m bits are needed to represent each bit mask). Bits 0 and 3 in B[a] are 1, which indicates that in the automaton, on input a, there is a transition from state 0 to state 1 and also from state 3 to state 4. Bit 1 in B[b] is 1, which indicates that on input b there is a transition from state 1 to state 2. Bit 2 in B[c] is 1, which indicates that on input c there is a transition from state 2 to state 3.

For the search, the algorithm needs a bit mask D such that the ith bit of this mask is set to 1 if and only if state i+1 of the automaton is active (the initial state 0 is always active). For example, if D = 0001, then state 1 is active. Initially, D = 0^m (i.e. no state other than the initial state is active). For each text symbol c the state vector D is updated by D ← ((D << 1) | 0^(m-1) 1) & B[c]. If after the ith step (after processing the ith symbol of the text) the (m-1)th bit of D is 1, then there is an occurrence of P with shift i-m. If m ≤ w, then the running time of the algorithm is O(n). In general, the running time of the algorithm is O(n⌈m/w⌉). The following example illustrates the algorithm.

Example 1. Consider the pattern P = abca and the text T = babca on Σ = {a, b, c, d}. Here the pattern length is m = 4 and the text length is n = 5. The bit masks are B[a] = 1001, B[b] = 0010, B[c] = 0100 and B[d] = 0000. The states of the automaton are numbered from 0 to 4. State 0 is the initial state and state 4 is the final state. Initially D = 0000. For each text character c, D is updated by D ← ((D << 1) | 0^(m-1) 1) & B[c].

    i   Text   B[c]   D
    1   b      0010   0000
    2   a      1001   0001
    3   b      0010   0010
    4   c      0100   0100
    5   a      1001   1001
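The update rule and the trace of Example 1 can be reproduced with a short Python sketch. The name `shift_and` and the returned trace are illustrative choices, not part of the paper. Python integers are unbounded, so the sketch masks D to m bits to imitate an m-bit word.

```python
def shift_and(pattern, text):
    """Return (shifts, trace): 0-based match shifts and D after each step."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)   # bit i set iff pattern[i] == ch
    full = (1 << m) - 1       # keep D within m bits
    accept = 1 << (m - 1)     # bit m-1 signals a complete match
    d = 0
    shifts, trace = [], []
    for step, ch in enumerate(text, start=1):
        d = (((d << 1) | 1) & full) & masks.get(ch, 0)
        trace.append(format(d, "0{}b".format(m)))
        if d & accept:
            shifts.append(step - m)   # shift i-m for the 1-based step i
    return shifts, trace


if __name__ == "__main__":
    shifts, trace = shift_and("abca", "babca")
    print(trace)    # ['0000', '0001', '0010', '0100', '1001']
    print(shifts)   # [1]
```

The printed trace agrees with the D column of the table in Example 1.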

After the 5th step D = 1001; bit 3 of D is set to 1, which indicates that the pattern occurs with shift 5-4 = 1.

IV. PARAMETERIZED SHIFT-AND STRING MATCHING ALGORITHM

In this section, we discuss the parameterized shift-and string matching algorithm. The algorithm is a generalization of the algorithm explained in Section 3. We can generalize it in the following way:

(i) The pattern P is encoded by prev-encoding and stored as prev(P). To compute prev(P), an array pos[c] is formed, which for each symbol c ∈ Π stores the position of its last occurrence in P. For example, let the pattern be P = XAYBX on the fixed alphabet Σ = {A, B} and the parameter alphabet Π = {X, Y}. Here prev(P) = 0A0B4.

(ii) For all j = 0, 1, 2, …, n-m, prev(T[j…j+m-1]) can be efficiently prev-encoded by Lemma 1.

(iii) The table B is built such that all parameterized pattern prefixes can be searched in parallel. To simplify indexing into the B array, it is assumed that Σ = {0, 1, …, σ-1}, and the prev-encoded parameter offsets are mapped into the range {σ, …, σ+m-1}. For this purpose, an array A[0…σ+m-1] is formed, in which the positions 0…σ-1 are occupied by the elements of Σ and the remaining positions are occupied by the prev-encoded offsets. For example, if we take the pattern P = XAYBX on Σ = {A, B} and Π = {X, Y} with σ = 2 and m = 5, then prev(P) = 0A0B4. The array A looks as follows:

    Element   A   B   0   1   2   3   4
    Index     0   1   2   3   4   5   6

Let P' be prev(P) and T' be prev(T). Searching for P' in T' cannot be done directly, as explained below. Let P = XAXX and T = ZZAZZAZZ; then P' = 0A21 and T' = 01A21A21. Clearly, P has two overlapping parameterized occurrences in T (one with shift 1 and the other with shift 4), but P' does not have any occurrences in T' at all. The problem occurs because, when searching for all the length-m substrings of the text in parallel, some non-zero encoded offset p in T' should be interpreted as zero in some cases. For example, when searching for P' in T'[1…4] = 1A21, the first 1 (from the left) should be zero. The solution to this problem is to apply Lemma 1 in parallel to all substrings of T of length m. This is achieved as follows. The bit vector B[A[i]] is the match vector for A[i], for 0 ≤ i ≤ σ+m-1. If the jth bit of this vector is 1, it means that P'[j] = A[i]. If any of the i least significant bits of B[A[σ]] is 1, the corresponding bit of B[A[σ+i]] is also set to 1:

    B[A[σ+i]] ← B[A[σ+i]] | (B[A[σ]] & ~(~0 << i))

This signifies that, for σ ≤ i ≤ σ+m-1, A[i] is treated as A[i] for prefixes whose length is greater than A[i] and as zero for shorter prefixes, thus satisfying Lemma 1. Algorithm 1 and Example 2 explain the algorithm.

Algorithm 1 (Parameterized Shift-And)

    Parameterized_Shift_And(T, n, P, m)
     1  P' ← Prev-encode(P, m)
     2  for i ← 0 to π-1
     3      do pos[Π[i]] ← -∞
     4  for i ← 0 to σ+m-1
     5      do B[A[i]] ← 0
     6  for i ← 0 to m-1
     7      do B[P'[i]] ← B[P'[i]] | (1 << i)
     8  for i ← 1 to m-1
     9      do B[A[σ+i]] ← B[A[σ+i]] | (B[A[σ]] & ~(~0 << i))
    10  D ← 0
    11  mm ← 1 << (m-1)
    12  for i ← 0 to n-1
    13      do c ← T[i]
    14         if c ∈ Π
    15            then c ← i-pos[T[i]]
    16                 if c > m-1
    17                    then c ← 0
    18                 pos[T[i]] ← i
    19         D ← ((D << 1) | 0^(m-1) 1) & B[c]
    20         if (D & mm) = mm
    21            then report match at position i-m+1   // i is 0-based here

Obviously, the running time of the algorithm is O(n⌈m/w⌉). The following algorithm evaluates the prev-encoding of the pattern P, while the prev-encoding of the text is embedded in the search code.

Algorithm 2 (Algorithm for prev-encoding)


    Prev-encode(P, m)
    // P[0…m-1] is a pattern and m is the length of the pattern.
    // The algorithm returns prev(P).
     1  for i ← 0 to π-1
     2      do pos[Π[i]] ← -∞
     3  for i ← 0 to m-1
     4      do c ← P[i]
     5         if c ∈ Π
     6            then c ← i-pos[P[i]]
     7                 if c > m-1
     8                    then c ← 0
     9                 pos[P[i]] ← i
    10         prev[i] ← c
    11  return prev   // prev now holds prev(P)

Example 2. Let P = XAXX and T = ZZAZAZZAZZ on Σ = {A} and Π = {X, Z}. Here σ = 1 and π = 2. prev(P) is P' = 0A21 and prev(T) is T' = 01A2A21A21. On processing P' as in Section 3 (lines 4-7 of Algorithm 1), we get B[0] = 0001, B[1] = 1000, B[A] = 0010, B[2] = 0100 and B[3] = 0000. Initially, D = 0000 and mm = 1000. From the preceding discussion, the elements of Σ and the prev-encoded symbols are placed in the array A as follows:

    Element   A   0   1   2   3
    Index     0   1   2   3   4

On processing lines 8-9 of Algorithm 1, we get B[A[2]] = B[A[2]] | (B[A[1]] & ~(~0 << 1)), i.e.

    B[1] = 1000 | (0001 & 0001) = 1000 | 0001 = 1001

Similarly,

    B[2] = 0100 | (0001 & 0011) = 0101
    B[3] = 0000 | (0001 & 0111) = 0001

// Begin the loop at line 12 of the algorithm

    Step 1:  D ← ((D << 1) | 0^(m-1) 1) & B[T'[0]]
             D = (0000 | 0001) & B[0] = (0000 | 0001) & 0001 = 0001
    Step 2:  D = (0010 | 0001) & 1001 = 0011 & 1001 = 0001
    Step 3:  D = (0010 | 0001) & 0010 = 0010
    Step 4:  D = (0100 | 0001) & 0101 = 0101
    Step 5:  D = (1010 | 0001) & 0010 = 0010
    Step 6:  D = (0100 | 0001) & 0101 = 0101
    Step 7:  D = (1010 | 0001) & 1001 = 1001
             Bit 3 is set to 1; therefore the pattern occurs with shift 7-4 = 3.
    Step 8:  D = (0010 | 0001) & 0010 = 0010
    Step 9:  D = (0100 | 0001) & 0101 = 0101
    Step 10: D = (1010 | 0001) & 1001 = 1001
             Bit 3 is set to 1; therefore the pattern occurs with shift 10-4 = 6.

V. PARAMETERIZED SUPER ALPHABET SHIFT-AND ALGORITHM

We first explain the concept of the super alphabet [10] and then explain the parameterized shift-and algorithm that uses it. In the super alphabet approach, s symbols of the text T are processed at a time (i.e. we use a super alphabet of size σ^s, packing s symbols of T to form a single super symbol, where σ is the size of the fixed alphabet Σ). There are two techniques to form super alphabets:
- The first technique uses non-overlapping characters.
- The second technique uses overlapping characters.

For example, let T = abcabda and s = 2. Using the first technique, T[0] = ab, T[1] = ca, T[2] = bd, T[3] = a, and using the second technique, T[0] = ab, T[1] = bc, T[2] = ca, T[3] = ab, T[4] = bd, T[5] = da. In the present paper, we use the first technique of forming super alphabets. In both techniques the original automaton is not modified; the super alphabet is used only to simulate the finite automaton faster.

Algorithm 1 given in Section 4 is sped up by a factor of s by using the concept of super alphabets. In this algorithm, P is encoded in the same way as in Section 4, but s symbols of the text T are encoded at a time, and this encoding is also embedded into the search code (in Algorithm 3 given below). In the parameterized super alphabet shift-and string matching algorithm, prev(T) is transformed into T', where each symbol of T' packs s consecutive symbols of prev(T), and the vector D is updated as:

    D ← ((D << s) | 0^(m-1) 1^s) & B[T'[i]]

Let C = {c1, c2, c3, …, cs} be a super symbol consisting of s symbols of prev(T); then

    B[C] = ((B[c1] & 1^m) << s-1) | ((B[c2] & 1^m) << s-2) | … | (B[cs] & 1^m).

If after the lth step, any of the bits m-1 to m+s-2 (from right to


left) is 1, then the pattern may occur at position ls-d-1 (where the dth bit from the right is a 1), and hence this needs to be verified. For verification, we use Lemma 1. The simulation algorithm works in time O(⌈n/s⌉⌈(m+s-1)/w⌉ + t), where t is the number of matches reported. If we select s = O(w/log₂ σ), then the bound becomes O(nm log₂ σ / w² + t). The pseudocode of the algorithm follows.

Algorithm 3 (Super Alphabet Parameterized Shift-And)

    Parameterized_Super_Alphabet_Shift_And(T, n, P, m, s)
    // prev[] is a global array, which stores the prev-encoded
    // text for verification purposes
     1  P' ← Prev-encode(P, m)
     2  for i ← 0 to π-1
     3      do pos[Π[i]] ← -∞
     4  for i ← 0 to σ+m-1
     5      do B[A[i]] ← 0
     6  for i ← 0 to m-1
     7      do B[P'[i]] ← B[P'[i]] | (1 << i)
     8  for i ← 1 to m-1
     9      do B[A[σ+i]] ← B[A[σ+i]] | (B[A[σ]] & ~(~0 << i))
    10  D ← 0
    11  i ← 0, l ← 0
    12  while i < n
    13      do l ← l+1, p ← 1, C ← 0
    14         for j ← i to i+s-1
    15             do c ← T[j]
    16                if c ∈ Π
    17                   then c ← j-pos[T[j]]
    18                        if c > m-1
    19                           then c ← 0
    20                        pos[T[j]] ← j
    21                C ← C | ((B[c] & 1^m) << s-p)
    22                p ← p+1
    23                prev[j] ← c
    24         D ← ((D << s) | 0^(m-1) 1^s) & C
    25         for d ← m-1 to m+s-2
    26             do mm ← 1 << d
    27                if (D & mm) = mm
    28                   then Verify(ls-d-1)
    29         i ← i+s

Algorithm 4 (Algorithm for Verification)

    Verify(j)
    // Verify the match at position j
    1  D ← 0, mm ← 1 << (m-1)
    2  for i ← j to j+m-1
    3      do D ← ((D << 1) | 0^(m+s-2) 1) & B[prev[i]]
    4  if (D & mm) = mm
    5      then pattern occurs at position j

The following example explains Algorithm 3.

Example 3. The minimum number of bits required in this algorithm is equal to m+s-1. Let P = XAXX and T = ZZAZAZZAZZ on Σ = {A} and Π = {X, Z}. Here σ = 1, π = 2, m = 4 and n = 10. prev(P) is P' = 0A21. After running the algorithm, the content of the prev-encoded array is T' = 01A2A21A21. Let the size of the super alphabet be s = 2; B and D then need m+s-1 = 5 bits. On processing P' as in Section 3 (lines 4-7 of Algorithm 3), we get B[0] = 00001, B[1] = 01000, B[A] = 00010, B[2] = 00100 and B[3] = 00000. Initially, D = 00000. After processing P' as in Section 4 (Example 2), the content of table B is B[0] = 00001, B[1] = 01001, B[2] = 00101 and B[3] = 00001.

// Begin the loop at line 12 of Algorithm 3

    Step 1:  B[ZZ] = ((B[0] & 1^4) << 1) | (B[1] & 1^4)
                   = ((00001 & 01111) << 1) | (01001 & 01111) = 01011
             The vector D is given by D ← ((D << 2) | 0^3 1^2) & B[01].
             Therefore, D = (00000 | 00011) & 01011 = 00011
    Step 2:  B[AZ] = ((B[A] & 1^4) << 1) | (B[2] & 1^4)
                   = ((00010 & 01111) << 1) | (00101 & 01111) = 00101
             D = (01100 | 00011) & 00101 = 00101
    Step 3:  B[AZ] = ((B[A] & 1^4) << 1) | (B[2] & 1^4) = 00101
             D = (10100 | 00011) & 00101 = 00101
    Step 4:  B[ZA] = ((B[1] & 1^4) << 1) | (B[A] & 1^4)
                   = ((01001 & 01111) << 1) | (00010 & 01111) = 10010
             D = (10100 | 00011) & 10010 = 10010
             Bit 4 is 1, therefore the pattern may occur with shift 4·2-4-1 = 3, and hence this needs to be verified. On verification, we find that the pattern occurs.
    Step 5:  B[ZZ] = ((B[2] & 1^4) << 1) | (B[1] & 1^4)
                   = ((00101 & 01111) << 1) | (01001 & 01111) = 01011
             D = (01000 | 00011) & 01011 = 01011


Bit 3 is 1, therefore the pattern may occur with shift 5·2-3-1 = 6, and hence this needs to be verified. On verification, we find that the pattern occurs.

VI. EXPERIMENTAL RESULTS

We have implemented Algorithm 1 and Algorithm 3 in C and compiled them with the Borland C++ version 3.1 compiler. We performed the experiments on a Pentium(R) D 2.80 GHz machine with 512 MB RAM, running Microsoft Windows XP. A text file 264244 characters long is taken as input. The symbols of the text and the patterns are taken from Σ ∪ Π, which is fixed and equal to the set {a, b, c, d, e, f, g, h, i, j, k, l}. The algorithms can also be tested for different sets of symbols of different sizes. We experimented with the parameterized super alphabet shift-and algorithm for s = 1 (the original algorithm), s = 3, s = 5, s = 7 and s = 8. We also experimented with the parameterized super alphabet shift-and algorithm by increasing the size of the parameter symbol alphabet (Π) and therefore decreasing the size of the fixed symbol alphabet (Σ) (because Σ ∪ Π is fixed), while keeping the text size, the pattern length (m) and the super alphabet size (s) fixed. The algorithm only counts the number of matches. Table 1 gives the timings for the parameterized super alphabet shift-and algorithm. Figure 2 gives the effect on the processing time of Algorithm 3 (PSSA) of increasing the size of the alphabet Π (i.e. increasing the duplicity in the code).

Table 1. Parameterized super alphabet shift-and execution time in seconds for various super alphabets (for fixed pattern length m = 5)

    Super alphabet   Time (s)
    s = 1            12.209190
    s = 3            10.175824
    s = 5             8.020879
    s = 7             7.012452
    s = 8             6.453011
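For readers who want to check the behaviour of Algorithms 1, 3 and 4 on small inputs, the following Python sketch may help. It is not the C implementation timed above, and the names `prev_encode`, `build_masks` and `pssa_search` are illustrative. It assumes that table B can be held in a dictionary keyed by fixed symbols (strings) and prev offsets (integers) instead of the array A, and it emulates the (m+s-1)-bit word by masking. With s = 1 the block step reduces to the update of Algorithm 1.

```python
def prev_encode(s, params, cap=None):
    """Prev-encode s; offsets above cap (if given) are reset to 0."""
    last, out = {}, []
    for i, ch in enumerate(s):
        if ch in params:
            c = i - last[ch] if ch in last else 0
            if cap is not None and c > cap:
                c = 0
            last[ch] = i
            out.append(c)
        else:
            out.append(ch)
    return out


def build_masks(pattern, params):
    """Table B: str keys are fixed symbols, int keys are prev offsets."""
    m = len(pattern)
    masks = {p: 0 for p in range(m)}
    for i, sym in enumerate(prev_encode(pattern, params)):
        masks[sym] = masks.get(sym, 0) | (1 << i)
    for p in range(1, m):
        # Offset p acts as a first occurrence (0) at pattern positions < p.
        masks[p] |= masks[0] & ~(~0 << p)
    return masks


def pssa_search(pattern, text, params, s=1):
    """Return ascending 0-based shifts of parameterized matches (block size s)."""
    m, n = len(pattern), len(text)
    masks = build_masks(pattern, params)
    enc = prev_encode(text, params, cap=m - 1)
    accept = 1 << (m - 1)

    def verify(j):
        # Algorithm 4: plain shift-and over the window enc[j:j+m].
        if j < 0 or j + m > n:
            return False
        d = 0
        for sym in enc[j:j + m]:
            d = ((d << 1) | 1) & masks.get(sym, 0)
        return bool(d & accept)

    full, fill = (1 << (m + s - 1)) - 1, (1 << s) - 1
    d, shifts = 0, []
    for l, start in enumerate(range(0, n, s), start=1):
        c = 0
        for p, sym in enumerate(enc[start:start + s], start=1):
            c |= masks.get(sym, 0) << (s - p)      # composed mask B[C]
        d = (((d << s) | fill) & full) & c
        for bit in range(m + s - 2, m - 2, -1):    # candidate bits, highest first
            if (d >> bit) & 1 and verify(l * s - bit - 1):
                shifts.append(l * s - bit - 1)
    return shifts


if __name__ == "__main__":
    print(pssa_search("XAXX", "ZZAZAZZAZZ", "XZ", s=1))   # [3, 6]
    print(pssa_search("XAXX", "ZZAZAZZAZZ", "XZ", s=2))   # [3, 6]
```

The two calls reproduce the shifts found in Example 2 and Example 3. Because the composed mask combines the per-symbol masks with a bit-wise or, the block step acts only as a filter, which is why each candidate position is passed to the verification routine.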

VII. CONCLUSIONS

We have shown that the simple trick of moving to a super alphabet decreases the search time of the algorithms. The parameterized super alphabet shift-and algorithm (s = 3, s = 5, s = 7 and s = 8) is faster than parameterized shift-and (s = 1). The experimental results show that when we increase the duplicity present in the code (i.e. increase the size of Π, and hence decrease the size of Σ), Algorithm 3 performs better than Algorithm 1. However, these algorithms work only when the pattern length (m) is less than or equal to the word length of the computer used (i.e. m ≤ w). The super alphabet parameterized shift-and algorithm can be extended further to larger pattern sizes (i.e. m > w) and to multiple patterns.

REFERENCES

[1] R.A. Baeza-Yates and G.H. Gonnet, A new approach to text searching, Communications of the ACM, 35(10), pp. 74-82, 1992.
[2] R.S. Boyer and J.S. Moore, A fast string-searching algorithm, Communications of the ACM, 20(10), pp. 762-772, 1977.
[3] A.V. Aho and M.J. Corasick, Efficient string matching: An aid to bibliographic search, Communications of the ACM, 18(6), pp. 333-340, 1975.
[4] P.D. Michailidis and K.G. Margaritis, On-line string matching algorithms: Survey and experimental results, International Journal of Computer Mathematics, 76(4), pp. 411-434, 2001.
[5] B.S. Baker, Parameterized pattern matching by Boyer-Moore type algorithms, In Proceedings of the 6th ACM-SIAM Annual Symposium on Discrete Algorithms, San Francisco, CA, pp. 541-550, 1995.
[6] B.S. Baker, Parameterized duplication in string: algorithm and application in software maintenance, SIAM J. Computing, 26(5), pp. 1343-1362, 1997.
[7] A. Amir, M. Farach, and S. Muthukrishnan, Alphabet dependence in parameterized matching, Information Processing Letters, 49(3), pp. 111-115, 1994.
[8] K. Fredriksson and M. Mozgovoy, Efficient parametrized string matching, Information Processing Letters, 100(3), pp. 91-96, 2006.
[9] G. Navarro and M. Raffinot, Fast and flexible string matching by combining bit-parallelism and suffix automata, ACM Journal of Experimental Algorithms, 5(4), 2000.
[10] K. Fredriksson, Shift-or string matching with super alphabets, Information Processing Letters, 87(4), pp. 201-204, 2003.

[Figure 2: plot of time (s), axis from 0 to 18, against the alphabet sizes (Sigma, Pi) = (6, 6), (5, 7), (4, 8), (3, 9), (2, 10), (1, 11), for PSSA (s = 2).]

Figure 2. Effect on the processing time of Algorithm 3 of increasing the parameter symbol alphabet Π.
