A Fast Multiple String-Pattern Matching Algorithm

Sun Kim
Sun.Kim@usa.dupont.edu
Central Research & Development, DuPont

Yanggon Kim
ykim@towson.edu
Computer and Information Sciences Department, Towson University

Overview of the Talk

1. Review of pattern matching algorithms (single pattern)


2. Motivation
3. Compact encoding scheme
4. A new multiple string pattern matching algorithm
5. Experimental results
6. Discussion
The String Matching Problem

Given a pattern P = a1a2...an,


find all occurrences of P in a text T = b1b2...bm.

Extension to multipattern cases:


Given a set of patterns P1, P2, ..., Pl,
find all occurrences of every Pi in a text T = b1b2...bm.
A Brute Force Algorithm

Pattern   h a s h i n g

Text      C o m p a c t   e n c o d i n g   a n d   h a s h i n g ...
          h a s h i n g
            h a s h i n g
              h a s h i n g
                h a s h i n g
                  h a s h i n g
          ..............
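
The sliding comparison pictured above can be sketched in C as follows. This is a minimal illustration, not code from the talk; the function name brute_force_count and its interface are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Try the pattern at every text position, one shift at a time.
   Returns the number of occurrences; *first receives the 0-based start
   of the first occurrence, or -1 if there is none.
   (Hypothetical helper, not part of the original slides.) */
size_t brute_force_count(const char *text, const char *pat, long *first)
{
    size_t n = strlen(pat), m = strlen(text), count = 0;

    *first = -1;
    if (n == 0 || n > m)
        return 0;
    for (size_t i = 0; i + n <= m; i++) {
        size_t j = 0;
        while (j < n && text[i + j] == pat[j])   /* compare left to right */
            j++;
        if (j == n) {                            /* all n characters agreed */
            if (count == 0)
                *first = (long)i;
            count++;
        }
    }
    return count;
}
```

Every alignment is tested, so the worst case costs on the order of n*m character comparisons; the algorithms reviewed next avoid most of these alignments.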
String Matching Algorithms (A Brief Review)

Knuth-Morris-Pratt Algorithm
- SIAM Journal on Computing, 1977
- linear – read characters exactly once.

Boyer-Moore style Algorithm


- Communications of the ACM, 1977
- sublinear due to text shifting heuristics (in practice).

Text-partitioning Algorithm
- Submitted to ACM Journal of Experimental Algorithmics, 1998
- sublinear due to text partitioning scheme (in practice).

Seminumerical Algorithm
- Numeric representation of a string of characters
- linear
Boyer-Moore style algorithms and the text partitioning algorithm run in sublinear
time in practice thanks to their heuristics.
Boyer-Moore Style Algorithms
Horspool and Sunday Algorithms

Pattern: C C (there are two occurrences of character C)

Sunday’s algorithm
[Figure: text window from Text[i] to Text[i+Plen]; the pattern is matched from left to right; the new current text position after the shift is marked.]

Horspool’s algorithm
[Figure: text window at Text[i]; the pattern is matched from right to left; the new current text position after the shift is marked.]
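
As a rough illustration of the shifting heuristic, here is a sketch of Horspool's algorithm in its common textbook form, which may differ in indexing from the figure. The name horspool_find is an assumption for this example. Sunday's variant differs mainly in taking the shift from the character just past the window, Text[i+Plen].

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Horspool search (textbook formulation, assumed here for illustration).
   Returns the 0-based start of the first occurrence of pat, or -1. */
long horspool_find(const char *text, const char *pat)
{
    size_t n = strlen(pat), m = strlen(text);
    size_t shift[UCHAR_MAX + 1];

    if (n == 0 || n > m)
        return -1;
    /* Characters absent from the pattern allow a full shift of n. */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        shift[c] = n;
    /* Otherwise shift so the rightmost occurrence (excluding the last
       pattern position) lines up with the window's last character. */
    for (size_t j = 0; j + 1 < n; j++)
        shift[(unsigned char)pat[j]] = n - 1 - j;

    for (size_t i = 0; i + n <= m; ) {
        size_t j = n;
        while (j > 0 && text[i + j - 1] == pat[j - 1])   /* right to left */
            j--;
        if (j == 0)
            return (long)i;
        i += shift[(unsigned char)text[i + n - 1]];
    }
    return -1;
}
```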
Text Partitioning Algorithm

P a r t i t i o n i n g t e x t a n d h a s h i n g

h a s h i n g
The character n aligns
the pattern and a match is found.

No alignment with
character d

No alignment with
character t

h a s h i n g
The character i aligns
the pattern but no match is found.
Existing Multiple String Pattern Matching Algorithms

Aho-Corasick Algorithm, 1975


automata approach

Commentz-Walter, 1979
Aho-Corasick technique + Boyer-Moore technique

Baeza-Yates, 1989
Aho-Corasick technique + Boyer-Moore-Horspool technique

Wu-Manber, 1994
Boyer-Moore technique
Motivation

Character shifting heuristics:


we can skip to the next position in the text without missing any pattern occurrence.

Thanks to the heuristics, Boyer-Moore style algorithms and the text partitioning
algorithm run much faster than other algorithms, including KMP.

However, the heuristics become much less effective as more patterns are being searched
simultaneously.
Compact Encoding Scheme (DNA Characters)
By exploiting the small alphabet size of four,
we can encode each character in two bits;
00 for a,
01 for t,
10 for g, and
11 for c.

Consider a DNA pattern P = gatca and a DNA sequence S = gtacgatcagac.

P = 10 00 01 11 00 and
S = 10 01 00 11 10 00 01 11 00 10 00 11
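
The two-bit packing can be sketched in C as below. The helper names dna_code and dna_pack are assumptions for this example, and the sketch assumes lowercase input containing only a, t, g and c, with at most 32 symbols per word.

```c
#include <stddef.h>
#include <stdint.h>

/* 2-bit codes from the slide: a = 00, t = 01, g = 10, c = 11.
   Assumes the input contains only these four lowercase letters. */
static unsigned dna_code(char ch)
{
    switch (ch) {
    case 'a': return 0u;
    case 't': return 1u;
    case 'g': return 2u;
    default:  return 3u;   /* 'c' */
    }
}

/* Pack a DNA string of at most 32 symbols into one 64-bit word;
   the last symbol ends up in the two lowest bits. */
uint64_t dna_pack(const char *s)
{
    uint64_t w = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        w = (w << 2) | dna_code(s[i]);
    return w;
}
```

With this packing, P = gatca becomes 0x21C and S = gtacgatcagac becomes 0x938723, which are the bit strings shown above written in hexadecimal.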
Compact Encoding Scheme (General Case)

1. Scan input pattern P and determine how many bits E are needed for compact
encoding.
2. Define the encoding function Encode for each symbol in P and any symbol that
does not occur in P.
3. Encode each symbol in P and T by function Encode.
Compact Encoding Scheme (Example)

Consider a pattern P = encoding and T = “Compact encoding can ...”.

Then, we can encode each character in three bits:


Encode(e) = 001, Encode(n) = 010, Encode(c) = 011, Encode(o) = 100,
Encode(d) = 101, Encode(i) = 110, Encode(g) = 111, and
Encode(-) = 000 for any character - that does not occur in P.

P = 001 010 011 100 101 110 010 111


T = 000 100 000 000 000 011 000 000 001 010 011 100 101 110 010 111
000 011 000 010
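
One possible way to derive Encode and E from a pattern is sketched below. The struct encoder type and encoder_init are hypothetical names, and assigning codes in order of first appearance is an assumption that happens to reproduce the codes of this example.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Encoding table: code[c] is the compact code of byte c, and 0 is
   reserved for every symbol that does not occur in the pattern.
   (Hypothetical type, not from the slides.) */
struct encoder {
    unsigned code[UCHAR_MAX + 1];
    unsigned bits;              /* E: bits per encoded symbol */
};

/* Assign codes 1, 2, 3, ... in order of first appearance in pat,
   then choose the smallest E that can hold all codes including 0. */
void encoder_init(struct encoder *enc, const char *pat)
{
    unsigned next = 1;

    memset(enc->code, 0, sizeof enc->code);
    for (size_t i = 0; pat[i] != '\0'; i++) {
        unsigned char c = (unsigned char)pat[i];
        if (enc->code[c] == 0)
            enc->code[c] = next++;
    }
    /* next now equals the number of distinct codes, counting code 0. */
    enc->bits = 1;
    while ((1u << enc->bits) < next)
        enc->bits++;
}
```

For P = encoding there are seven distinct characters plus the reserved code 0, so eight codes fit exactly in E = 3 bits.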
Encoding Patterns in Compact Encoding Scheme

In the compact encoding scheme, each encoded input pattern is stored in a
variable, say W.

1. Initialize a variable W with all 0’s.


2. Perform logical OR on W and Encode(e) = 001,
3. Shift W left 3 bits and perform logical OR on W and
Encode(n) = 010,
4. Repeat step 3 for each following character until the last pattern character is processed.

Then the word W will be 00000000001010011100101110010111, where the lower
24 bits are the bit representation of the input pattern encoding.
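
The shift-and-OR steps above can be sketched as follows. The function names encode3 and pack_pattern are assumptions for this example, and the encoding is hard-coded to the one used for P = encoding (E = 3).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode() for the example pattern "encoding" (E = 3); any other
   character maps to 000. Hard-coded here purely for illustration. */
static uint32_t encode3(char ch)
{
    switch (ch) {
    case 'e': return 1; case 'n': return 2; case 'c': return 3;
    case 'o': return 4; case 'd': return 5; case 'i': return 6;
    case 'g': return 7;
    default:  return 0;
    }
}

/* Build W: start from all 0's, then for each pattern character shift W
   left by 3 bits and OR in its code. Assumes at most 10 characters so
   that the result fits in 32 bits. */
uint32_t pack_pattern(const char *pat)
{
    uint32_t w = 0;
    for (size_t i = 0; pat[i] != '\0'; i++)
        w = (w << 3) | encode3(pat[i]);
    return w;
}
```

For the pattern encoding this yields 0x0029CB97, which is the 32-bit word W shown above.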
Pattern Mask and Scanning Text
Single Pattern Searching

We simply scan the text with a scanning variable T.


To determine pattern occurrences, we need a mask, PMASK.

P 00000000001010011100101110010111
PMASK 00000000111111111111111111111111

While scanning the text with T:


1. Perform logical AND on T and PMASK.
2. Perform logical XOR on the result from Step 1 and P.
3. Test whether the above result is all 0’s;
if so, the pattern occurs; otherwise, it does not.
Single Pattern Searching in Compact Encoding Scheme
Example

T = Compact encoding can ...


P = encoding

T = 00000000000100000000000011000000
(T AND PMASK) = 00000000000100000000000011000000.
((T AND PMASK) XOR P) = 00000000001110011100101101010111

T = Compact encoding can ...


P = encoding

T = 11000000001010011100101110010111
(T AND PMASK) = 00000000001010011100101110010111
((T AND PMASK) XOR P) = 00000000000000000000000000000000
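
The AND/XOR test can be put together into a small single-pattern search, sketched below under stated assumptions: the encoding is hard-coded to the one for P = encoding, the pattern uses only characters of that encoding, and it fits in a 32-bit word. The names encode3 and compact_find are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { E = 3 };   /* bits per symbol in the running example */

/* Encode() for the example pattern "encoding"; other characters map to 0. */
static uint32_t encode3(char ch)
{
    switch (ch) {
    case 'e': return 1; case 'n': return 2; case 'c': return 3;
    case 'o': return 4; case 'd': return 5; case 'i': return 6;
    case 'g': return 7;
    default:  return 0;
    }
}

/* Returns the 0-based start of the first occurrence of pat in text,
   or -1. Assumes strlen(pat) * E <= 32. */
long compact_find(const char *text, const char *pat)
{
    uint32_t p = 0, pmask = 0, t = 0;
    size_t n = 0;

    for (; pat[n] != '\0'; n++) {
        p = (p << E) | encode3(pat[n]);            /* encoded pattern P */
        pmask = (pmask << E) | ((1u << E) - 1u);   /* PMASK: n*E low 1's */
    }
    if (n == 0)
        return -1;
    for (size_t i = 0; text[i] != '\0'; i++) {
        t = (t << E) | encode3(text[i]);           /* scanning variable T */
        if (i + 1 >= n && ((t & pmask) ^ p) == 0)  /* steps 1 to 3 */
            return (long)(i + 1 - n);
    }
    return -1;
}
```

Because every pattern character has a distinct nonzero code and all other characters map to 0, an all-zero XOR result corresponds to an exact character match in this setting.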
Finding Multiple Patterns

A separate pattern mask, PMASKi, is needed for each pattern Pi.


Then, perform the pattern testing procedure for each pattern at every
text position.
⇒ expensive for many input patterns

To avoid unnecessary pattern occurrence tests,
we use a hashing scheme:
1. Prepare a hash table with 2^H entries, indexed by an H-bit key, and
2. a hash mask, HMASK, whose lower H bits are 1’s and whose other bits are
0’s.
A Multiple String-Pattern Matching Algorithm
(MULTI1)

1. Scan the input patterns to determine the number of bits, E, for each character
encoding and define the encoding function Encode
2. Encode each pattern Pi and set the associated mask PMASKi .
3. Set the hash mask HMASK according to the hash table size.
4. Initialize the text scanning variable T to 0.
5. Scan the text one character at a time, shifting T left by E bits and ORing in
the encoding of the next character. At each text position:
(a) compute the hash entry position by logically ANDing T and HMASK;
(b) if the hash entry at that position is empty, skip the pattern testing procedure
and scan the next text character;
(c) otherwise, perform the pattern testing procedure for all patterns at the hash
entry.
A New String Matching Algorithm (MULTI1)
(A Partial C Code)

struct hash_entry {PAT p; PATMASK pmask; struct hash_entry *next;};

struct hash_entry *HTBL[HASH_SZ]; /* chained buckets; HASH_SZ = 2^H */
PAT HMASK; /* a mask for hashing: lower H bits are 1's */

for (i=1; i<=n; i++) insert_pattern_into_hash_table(P[i]);

T = encode_ncharacters(text, S); /* encode the first S text characters */
i = S+1;
for (;;) {
    candidate = HTBL[T&HMASK]; /* NULL if the hash entry is empty */
    while (candidate) {
        /* all 0's after AND and XOR means the pattern occurs here */
        if (!(candidate->p ^ (T & candidate->pmask)))
            report_pattern_match(candidate);
        candidate = candidate->next;
    }
    if (i > Tlen) break; /* window ending at the last character was tested */
    T = (T<<E) | ENCODE(text[i]); /* shift in the next text character */
    i++;
}
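
For readers who want something compilable, the following is a self-contained sketch in the spirit of MULTI1, not the authors' implementation. All names (struct multi1, multi1_build, multi1_scan, multi1_free) are hypothetical, H = 8 is an arbitrary choice, and the sketch assumes non-empty patterns whose encodings fit in 64 bits. It also narrows HMASK when the shortest pattern has fewer than H encoded bits, a detail the slides do not spell out.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define H 8                         /* assumed hash key width in bits */

struct hash_entry {
    uint64_t p, pmask;              /* encoded pattern and its mask */
    size_t len;                     /* pattern length in characters */
    int id;                         /* index of the pattern in the input set */
    struct hash_entry *next;        /* chain of patterns sharing a hash key */
};

struct multi1 {
    unsigned code[UCHAR_MAX + 1];   /* Encode(); 0 = symbol in no pattern */
    unsigned e;                     /* E: bits per encoded symbol */
    uint64_t hmask;                 /* HMASK */
    struct hash_entry *htbl[1u << H];
};

/* Release all chained entries. */
void multi1_free(struct multi1 *m)
{
    for (size_t k = 0; k < (1u << H); k++)
        while (m->htbl[k] != NULL) {
            struct hash_entry *dead = m->htbl[k];
            m->htbl[k] = dead->next;
            free(dead);
        }
}

/* Steps 1 to 3. Assumes non-empty patterns with len * E <= 64.
   Returns 0 on success, -1 if allocation fails. */
int multi1_build(struct multi1 *m, const char *const *pats, int npats)
{
    unsigned next = 1;
    size_t minbits = 64;

    memset(m, 0, sizeof *m);
    for (int k = 0; k < npats; k++)             /* define Encode() */
        for (const char *s = pats[k]; *s != '\0'; s++)
            if (m->code[(unsigned char)*s] == 0)
                m->code[(unsigned char)*s] = next++;
    for (m->e = 1; (1u << m->e) < next; m->e++) /* smallest sufficient E */
        ;
    for (int k = 0; k < npats; k++)
        if (strlen(pats[k]) * m->e < minbits)
            minbits = strlen(pats[k]) * m->e;
    /* Hash on no more bits than the shortest pattern provides. */
    m->hmask = (minbits >= H) ? (1u << H) - 1u : (1u << minbits) - 1u;
    for (int k = 0; k < npats; k++) {
        struct hash_entry *h = calloc(1, sizeof *h);
        if (h == NULL) {
            multi1_free(m);
            return -1;
        }
        h->len = strlen(pats[k]);
        h->id = k;
        for (size_t j = 0; j < h->len; j++) {
            h->p = (h->p << m->e) | m->code[(unsigned char)pats[k][j]];
            h->pmask = (h->pmask << m->e) | ((1u << m->e) - 1u);
        }
        h->next = m->htbl[h->p & m->hmask];     /* push onto its bucket */
        m->htbl[h->p & m->hmask] = h;
    }
    return 0;
}

/* Steps 4 and 5: scan the text once. counts[id] is incremented for each
   occurrence of pattern id; returns the total number of matches. */
size_t multi1_scan(const struct multi1 *m, const char *text, size_t *counts)
{
    uint64_t t = 0;
    size_t total = 0;

    for (size_t i = 0; text[i] != '\0'; i++) {
        t = (t << m->e) | m->code[(unsigned char)text[i]];
        for (const struct hash_entry *c = m->htbl[t & m->hmask]; c; c = c->next)
            if (i + 1 >= c->len && ((t & c->pmask) ^ c->p) == 0) {
                counts[c->id]++;
                total++;
            }
    }
    return total;
}
```

A caller would zero a counts array with one slot per pattern, call multi1_build once, call multi1_scan on the text, and finally call multi1_free. Most text positions hit an empty bucket, so the per-pattern test runs only for the few patterns that share the current hash key.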
Application of Our Technique to Other Algorithms
Our technique can be applied to other types of string matching algorithms
by considering more than one character at a time.
The partitioning algorithm with the technique (MULTI2)

P a r t i t i o n i n g t e x t a n d h a s h i n g

h a s h i n g
The character pair "in" aligns
the pattern and a match is found.

No alignment with
character pair "nd"

No alignment with
character pair " t"
No alignment with
character pair "ti"
Experiment with An English Text
Prepare N random patterns from /usr/dict/words and find pattern occurrences
in the King James Bible (4,441,849 characters), obtained from
Project Gutenberg (ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/).

No. Patterns   Min. Pattern Length   grep    agrep   MULTI1   MULTI2
10             3                     12.8    0.5     0.8      0.4
50             4                     64.7    0.6     0.8      0.5
100            3                     155.7   1.3     0.9      1.0
200            3                     -       1.3     0.9      1.1
500            3                     -       1.8     1.1      1.2
1000           3                     -       2.3     1.2      1.6
2000           3                     -       3.3     1.6      2.1
5000           3                     -       5.5     2.9      3.8
10000          3                     -       7.6     4.8      5.2
20000          3                     -       12.3    11.3     8.6
Experiment with DNA Sequences
Prepare N random DNA patterns and
find pattern occurrences in a large file of DNA sequences (18,617,116 characters) from
the completed genomes of 7 organisms (Archaeoglobus fulgidus, Aquifex aeolicus,
Bacillus subtilis, Borrelia burgdorferi, Chlamydia trachomatis, Escherichia
coli, and Haemophilus influenzae Rd), obtained from GenBank (ftp://ncbi.nlm.nih.gov).

No. Patterns   Min. Pattern Length   grep    agrep   MULTI1   MULTI2
10             10                    4.2     1.6     3.4      0.8
50             10                    21.8    2.1     3.5      2.5
100            10                    47.7    4.0     3.5      2.6
200            10                    108.1   5.6     3.5      2.7
500            10                    -       9.2     3.4      3.0
1000           10                    -       15.8    3.6      3.4
2000           10                    -       29.6    3.8      4.2
5000           10                    -       73.2    4.6      5.1
10000          10                    -       151.9   5.9      10.0
Discussion

1. The compact encoding and hashing scheme improves the performance of multiple
string pattern matching algorithms:
(a) a simple left-to-right scanning algorithm (MULTI1),
(b) the text partitioning algorithm (MULTI2).
2. The compact encoding and hashing scheme can also be applied to single pattern
searching by considering multiple characters at a time.
3. The encoding function can be defined dynamically with respect to the set of
characters that appear in the input patterns:
adaptive string pattern matching.
