Sun.Kim@usa.dupont.edu, Central Research & Development, DuPont
ykim@towson.edu, Computer and Information Sciences Department, Towson University
Overview of the Talk
[Figure: the pattern "hashing" shown at successive alignments against the text "Compact encoding and hashing ..."]
String Matching Algorithms (A Brief Review)
Knuth-Morris-Pratt Algorithm
- SIAM Journal on Computing, 1977
- linear – read characters exactly once.
Text-partitioning Algorithm
- Submitted to ACM Journal of Experimental Algorithmics, 1998
- sublinear due to text partitioning scheme (in practice).
Seminumerical Algorithm
- Numeric representation of a string of characters
- linear
Boyer-Moore style algorithms and the text partitioning algorithm run in sublinear
time in practice, thanks to their heuristics.
Boyer-Moore Style Algorithms
Horspool and Sunday Algorithms
Sunday’s algorithm
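[Figure: the pattern aligned with the text window starting at Text[i]; the character Text[i+Plen] lies just past the window. Labels: "pattern matches from right to left" and "new current text position after shift".]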
Text Partitioning Algorithm
[Figure: the pattern "hashing" tested against the text "Partitioning text and hashing". Annotations: the character n aligns with the pattern and a match is found; no alignment with character d; no alignment with character t; the character i aligns with the pattern but no match is found.]
Existing Multiple String Pattern Matching Algorithms
Commentz-Walter, 1979
Aho-Corasick technique + Boyer-Moore technique
Baeza-Yates, 1989
Aho-Corasick technique + Boyer-Moore-Horspool technique
Wu-Manber, 1994
Boyer-Moore technique
Motivation
Thanks to their heuristics, Boyer-Moore style algorithms and the text partitioning
algorithm run much faster than other algorithms, including KMP.
However, the heuristics become much less effective as more patterns are searched
simultaneously.
Compact Encoding Scheme (DNA Characters)
By exploiting the small alphabet size of four,
we can encode each character in two bits;
00 for a,
01 for t,
10 for g, and
11 for c.
For example, the pattern P = gatca and the text S = gtacgatcagac are encoded as
P = 10 00 01 11 00 and
S = 10 01 00 11 10 00 01 11 00 10 00 11
Compact Encoding Scheme (General Case)
1. Scan input pattern P and determine how many bits E are needed for compact
encoding.
2. Define the encoding function Encode for each symbol in P and any symbol that
does not occur in P.
3. Encode each symbol in P and T by function Encode.
Compact Encoding Scheme (Example)
P     = 00000000001010011100101110010111
PMASK = 00000000111111111111111111111111
No match (the result is nonzero):
T                     = 00000000000100000000000011000000
(T AND PMASK)         = 00000000000100000000000011000000
((T AND PMASK) XOR P) = 00000000001110011100101101010111
Match (the result is zero):
T                     = 11000000001010011100101110010111
(T AND PMASK)         = 00000000001010011100101110010111
((T AND PMASK) XOR P) = 00000000000000000000000000000000
Finding Multiple Patterns
1. Scan the input patterns to determine the number of bits, E, for each character
encoding and define the encoding function Encode
2. Encode each pattern Pi and set the associated mask PMASKi .
3. Set the hash mask HMASK according to the hash table size.
4. Initialize the text scanning variable T to 0.
5. Scan the text one character at a time, shifting T left by E bits and appending
the encoded character. At each text position:
(a) compute the hash entry position by logically ANDing T and HMASK;
(b) if the hash entry at that position is empty, skip the pattern testing procedure
and scan the next text character;
(c) otherwise, perform the pattern testing procedure for all patterns at the hash
entry.
A New String Matching Algorithm (MULTI1)
(Partial C Code)
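/* HTBL, HMASK, E, ENCODE(), encode_ncharacters() and report_pattern_match()
   are defined elsewhere; text is indexed from 0 here. */
T = encode_ncharacters(text, S);    /* encode the first S characters */
i = S;                              /* index of the next unread character */
for (;;) {
    /* Test every pattern chained at the hash entry selected by T. */
    for (candidate = HTBL[T & HMASK]; candidate; candidate = candidate->next) {
        if (!(candidate->p ^ (T & candidate->pmask)))
            report_pattern_match(candidate);
    }
    if (i >= Tlen)
        break;                      /* the last window has been tested */
    T = (T << E) | ENCODE(text[i]); /* slide the window by one character */
    i++;
}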
Application of Our Technique to Other Algorithms
Our technique can be applied to other types of string matching algorithms
by considering more than one character at a time.
The partitioning algorithm with the technique (MULTI2)
[Figure: the pattern "hashing" tested against the text "Partitioning text and hashing", two characters at a time. Annotations: the characters "in" align with the pattern and a match is found; no alignment with the characters "nd"; no alignment with the characters " t"; no alignment with the characters "ti".]
Experiment with An English Text
Prepare N random patterns from /usr/dict/words and find pattern occurrences
in the King James Bible (4,441,849 characters), obtained from
Project Gutenberg (ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/).