
String Processing

There are many practical applications of string processing: processing encoded information in information systems, processing genetic codes encoded with four characters (A, C, T, and G) in computational biology, processing data exchange in communication systems, and describing sets of strings in the theory of formal languages used in programming systems.

1. String Sorts
This section discusses how to solve the sorting problem for strings. Unlike the general-purpose sorting algorithms we already know, string sorting uses different methods to achieve better performance. We should therefore gain some insight about the strings before running any sorting algorithm: useful properties include randomness, longest common prefixes, the encoding used in the strings, and so on. Without enough insight about the strings we are dealing with, our sorting algorithm may not perform as well as expected.

Key-Indexed Counting
Before we discuss any string algorithms, we need to understand how to index a set of string-integer pairs; an example of such pairs is illustrated at the left. Intuitively, the idea of key-indexed counting is to sort strings by their keys, which is why we can use this method as a building block for the more complex string sorts discussed later. Below is an implementation of key-indexed counting in Java:
int N = a.length;
String[] aux = new String[N];
int[] count = new int[R+1]; // R is the radix

// Compute frequency counts
for (int i = 0; i < N; i++)
    count[a[i].key() + 1]++;   // key() returns the item's key

// Transform counts to indices
for (int r = 0; r < R; r++)
    count[r+1] += count[r];

// Distribute the records
for (int i = 0; i < N; i++)
    aux[count[a[i].key()]++] = a[i];

// Copy back
for (int i = 0; i < N; i++)
    a[i] = aux[i];
In the first step, we use index key+1 when computing frequency counts, so that the counts can later become offsets rather than just tallies of how many of each key we found. In the example above, the resulting count array is [0, 0, 3, 5, 6, 6]; the first and second entries will always be zero.
In the second step, we transform the count array into an array of start indices, so the count array becomes [0, 0, 3, 8, 14, 20]. What this array tells us is that strings with key 1 start at index 0, strings with key 2 start at index 3, strings with key 3 start at index 8, and so on for the rest of the array. Some may find this method similar to the offset method in pagination. Note that the first and second entries still hold zero.
In the third step, we distribute each string based on the index stored for its key in the count array. Because each placement increments the corresponding entry, the count array becomes [0, 3, 8, 14, 20, 20] after every string has been placed in the auxiliary array.
The last step copies all sorted strings from the auxiliary array back into the original array.
The running time of key-indexed counting is 8N+3R+1 array accesses, which follows from N+R+1 accesses for initialization, 2N for the first step, 2R for the second, 3N for the third, and 2N for the fourth.
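To see the count-array transformations concretely, here is a minimal runnable sketch of key-indexed counting on single-character keys (the class name, the 'a'..'z' key range, and the sample input are demo assumptions, not from the text):

```java
public class KeyIndexedDemo {
    // Stably sorts characters assumed to be in 'a'..'z' (so R = 26).
    public static char[] sort(char[] a) {
        int N = a.length, R = 26;
        char[] aux = new char[N];
        int[] count = new int[R + 1];
        for (int i = 0; i < N; i++)                       // frequencies, stored at key+1
            count[a[i] - 'a' + 1]++;
        for (int r = 0; r < R; r++)                       // counts -> start indices
            count[r + 1] += count[r];
        for (int i = 0; i < N; i++)                       // distribute
            aux[count[a[i] - 'a']++] = a[i];
        return aux;
    }

    public static void main(String[] args) {
        System.out.println(new String(sort("dacba".toCharArray()))); // prints aabcd
    }
}
```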

Least-Significant-Digit (LSD) Radix Sort


The term least significant means we examine the characters of each string from right to left, while the term digit means we use a standard string encoding, such as base-256 ASCII. Since it scans from right to left, LSD is useful for sorting fixed-length strings, such as car license plates, IP addresses, bank account numbers, telephone numbers, etc.
LSD is quite simple because we can adapt the key-indexed counting method mentioned before. Here is the implementation of LSD in Java:
int N = a.length;
int R = 256;
String[] aux = new String[N];

for (int d = W-1; d >= 0; d--)
{   // W is the fixed length of all strings
    int[] count = new int[R+1];
    for (int i = 0; i < N; i++)
        count[a[i].charAt(d) + 1]++;
    for (int r = 0; r < R; r++)
        count[r+1] += count[r];
    for (int i = 0; i < N; i++)
        aux[count[a[i].charAt(d)]++] = a[i];
    for (int i = 0; i < N; i++)
        a[i] = aux[i];
}
The trace of the LSD algorithm is illustrated in the image below:
Image 1: Trace of LSD

The running time of LSD is ~7WN+3WR, with extra space proportional to N+R.
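As a usage sketch, the LSD loop above can be wrapped in a small class and applied to fixed-length keys such as license plates (the class name and the sample plates are demo assumptions):

```java
public class LSDDemo {
    // Sorts an array of strings, each exactly W characters long.
    public static void sort(String[] a, int W) {
        int N = a.length, R = 256;
        String[] aux = new String[N];
        for (int d = W - 1; d >= 0; d--) {       // one key-indexed counting pass per column
            int[] count = new int[R + 1];
            for (int i = 0; i < N; i++) count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++) count[r + 1] += count[r];
            for (int i = 0; i < N; i++) aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++) a[i] = aux[i];
        }
    }

    public static void main(String[] args) {
        String[] plates = { "4PGC938", "2IYE230", "3CIO720", "1ICK750" };
        sort(plates, 7);
        for (String p : plates) System.out.println(p);  // prints in sorted order
    }
}
```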

Most-Significant-Digit (MSD) Radix Sorting


Most practical cases in string processing deal with strings of non-fixed length. So, to achieve a general-purpose string sort, we need a method capable of scanning the characters of a string from left to right. For this we implement MSD.
The key idea of MSD is similar to quicksort, but instead of using two or three partitions, MSD uses key-indexed counting on the current character to split the array into one subarray per character value, then recursively sorts each subarray on the next character. The length of each partition keeps shrinking over time, which is how MSD copes with strings of non-fixed length.
Below is the implementation of MSD in Java:
public class MSD {
    private static int R = 256;      // radix
    private static final int M = 5;  // cutoff for small subarrays
    private static String[] aux;     // auxiliary array for distribution

    private static int charAt(String s, int d) {
        // Return the character's encoding index, or -1 if d is past the end of the string
        if (d < s.length()) return s.charAt(d);
        else return -1;
    }

    public static void sort(String[] a) {
        int N = a.length;
        aux = new String[N];
        sort(a, 0, N-1, 0);
    }

    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo + M) {
            Insertion.sort(a, lo, hi, d); return;   // insertion sort for small subarrays
        }
        int[] count = new int[R+2];
        for (int i = lo; i <= hi; i++)              // compute frequency counts
            count[charAt(a[i], d) + 2]++;           // +2 because charAt may return -1
        for (int r = 0; r < R+1; r++)               // transform counts to indices
            count[r+1] += count[r];
        for (int i = lo; i <= hi; i++)              // distribute
            aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++)              // copy back
            a[i] = aux[i-lo];
        for (int r = 0; r < R; r++)                 // sort each subarray recursively
            sort(a, lo + count[r], lo + count[r+1] - 1, d+1);
    }
}
The image below illustrates MSD on a small set of strings with R=15 (LOWERCASE encoding):

Image 2: Trace of MSD

We maintain a cutoff value to identify small subarrays, so we can improve MSD's running time by switching to insertion sort for them. Another aspect we should be concerned about is the radix range in MSD: it is fine for standard ASCII, which has R=256, but not for Unicode, which has R=65536.
The running time of MSD is between 8N+3R and ~7wN+3wR, where w is the average string length.
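The MSD code above relies on Insertion.sort(a, lo, hi, d), an insertion sort that compares strings starting from character d, since the earlier characters are known to be equal. That helper is not shown in the text; here is a minimal sketch under that assumption (the class layout is hypothetical):

```java
public class Insertion {
    // Insertion sort a[lo..hi], comparing only from character d onward.
    public static void sort(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j-1], d); j--) {
                String t = a[j]; a[j] = a[j-1]; a[j-1] = t;  // exchange a[j] and a[j-1]
            }
    }

    // Compare the suffixes starting at position d.
    private static boolean less(String v, String w, int d) {
        return v.substring(d).compareTo(w.substring(d)) < 0;
    }

    public static void main(String[] args) {
        String[] a = { "she", "sells", "seashells" };
        sort(a, 0, 2, 1);                    // all three share 's' at position 0
        for (String s : a) System.out.println(s);
    }
}
```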

Three-Way String Radix Quicksort


As mentioned before, MSD can be slow for huge encoding sizes, such as Unicode with its 65536 code points. To solve that problem, we need to remove the need for the auxiliary array. Three-way string quicksort does that by maintaining a three-way partition: a first part whose current characters are less than the partitioning character, a second part equal to it, and a third part greater than it. We can say that this algorithm is a hybrid that combines normal quicksort and MSD.
Below is the implementation of three-way string quicksort in Java:
public class Quick3string {
    private static int charAt(String s, int d) {
        // Return the character's encoding index, or -1 if d is past the end of the string
        if (d < s.length()) return s.charAt(d);
        else return -1;
    }

    private static void exch(String[] a, int i, int j) {
        String t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void sort(String[] a) {
        sort(a, 0, a.length - 1, 0);
    }

    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo) return;

        int lt = lo, gt = hi;
        int v = charAt(a[lo], d);
        int i = lo + 1;
        while (i <= gt) {
            int t = charAt(a[i], d);
            if      (t < v) exch(a, lt++, i++);
            else if (t > v) exch(a, i, gt--);
            else            i++;
        }

        sort(a, lo, lt-1, d);
        if (v >= 0) sort(a, lt, gt, d+1);
        sort(a, gt+1, hi, d);
    }
}
The image below illustrates a trace of three-way string quicksort on a small set of strings, without a cutoff:

Image 3: Trace of 3-way string quicksort


Since at its core three-way string quicksort makes extensive use of normal quicksort, its running time is about ~2N ln N on average.
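Here is a runnable, condensed version of the sort above, with the exch helper inlined, so it can be tried directly (the class name and sample strings are demo assumptions):

```java
public class Quick3Demo {
    private static int charAt(String s, int d) { return d < s.length() ? s.charAt(d) : -1; }
    private static void exch(String[] a, int i, int j) { String t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void sort(String[] a) { sort(a, 0, a.length - 1, 0); }

    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo) return;
        int lt = lo, gt = hi, v = charAt(a[lo], d), i = lo + 1;
        while (i <= gt) {                       // three-way partition on character d
            int t = charAt(a[i], d);
            if      (t < v) exch(a, lt++, i++);
            else if (t > v) exch(a, i, gt--);
            else            i++;
        }
        sort(a, lo, lt - 1, d);                 // strings with smaller character d
        if (v >= 0) sort(a, lt, gt, d + 1);     // equal part: recurse on next character
        sort(a, gt + 1, hi, d);                 // strings with larger character d
    }

    public static void main(String[] args) {
        String[] a = { "she", "sells", "seashells", "by", "the", "sea" };
        sort(a);
        System.out.println(String.join(" ", a)); // prints: by sea seashells sells she the
    }
}
```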

2. Tries
In this section, we deal with the string search problem. Classic structures such as the BST or red-black BST serve as the baseline performance for searching sets of integers or characters. For strings, we look for improvement opportunities in the data's patterns: some strings may share a long common prefix, or the lengths may be uniformly distributed. The trie is pattern-aware and thus offers a data structure that optimizes string search running time. It is a bit like a compression method, but here we rely heavily on graph theory.
To avoid confusion with the term 'tree', the term trie comes from the word retrieval and was introduced by E. Fredkin in 1960. It is a bit of wordplay, much like the term 'dynamic programming', which refers to a mathematical concept rather than to computer programming as such.
Image 4 illustrates the anatomy of a trie corresponding to a small set of words. To reproduce the trie construction illustrated in Image 4, we need to follow these rules:
1. The root is a node with a null value.
2. All leaves are nodes that have no children, i.e. they are followed only by null links.
3. A trie corresponds to a map of key-value pairs; here the key is a string and the value is an integer identifier of that string.
4. A string may be a prefix of another string, and a key's value is stored in the node where its path from the root ends. For example, the word 'she' is a prefix of the word 'shells'; their paths terminate with ids 0 and 3 respectively.

Image 4: Anatomy of a trie

R-Way Trie
The idea of the R-way trie is that each node holds R links, one per possible character value, so it can handle arbitrary input strings. For example, to index ASCII-based words whose characters are distributed uniformly, we must be able to accommodate on the order of R^L possible combinations, where L is the average string length. Image 5 illustrates the construction of an R-way trie over the 256 ASCII characters for the words sea, shells, and she.

Image 5: R-Way tries construction

Below is the implementation of the R-way trie in Java:


public class TrieST<Value>
{
    private static int R = 256;
    private Node root;

    private static class Node
    {
        private Object val;
        private Node[] next = new Node[R];
    }

    public Value get(String key)
    {
        Node x = get(root, key, 0);
        if (x == null) return null;
        return (Value) x.val;
    }

    private Node get(Node x, String key, int d)
    {
        if (x == null) return null;
        if (d == key.length()) return x;
        char c = key.charAt(d);
        return get(x.next[c], key, d+1);
    }

    public void put(String key, Value val)
    {
        root = put(root, key, val, 0);
    }

    private Node put(Node x, String key, Value val, int d)
    {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d+1);
        return x;
    }
}
Since each level of the trie consists of nodes whose links are drawn from an R-length array, each character examined narrows the search by a factor of R. Thus the search time in an R-way trie is ~log_R N, and the space is between RN and RNw, where R is the radix, N is the number of keys, and w is the average key length. On speed this is good, but not on space: on a typical system, an R-way trie is not feasible for millions of Unicode strings.
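As a usage sketch, here is a compact, self-contained variant of the R-way trie above with int identifiers as values (the class name and test keys are demo assumptions):

```java
public class TrieDemo {
    static final int R = 256;
    static class Node { Integer val; Node[] next = new Node[R]; }

    // Insert key with value val, creating nodes along the path as needed.
    static Node put(Node x, String key, int val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    // Follow the key's path; return its value, or null on a miss.
    static Integer get(Node x, String key, int d) {
        if (x == null) return null;
        if (d == key.length()) return x.val;
        return get(x.next[key.charAt(d)], key, d + 1);
    }

    public static void main(String[] args) {
        Node root = null;
        String[] keys = { "she", "sells", "sea", "shells" };
        for (int i = 0; i < keys.length; i++) root = put(root, keys[i], i, 0);
        System.out.println(get(root, "sea", 0));   // prints 2
        System.out.println(get(root, "shell", 0)); // prints null (prefix node has no value)
    }
}
```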

Ternary Search Tries (TST)


As mentioned before, the implementation of R-way tries requires a huge amount of space; from a practical point of view, the R-way trie is often unsuitable. To cope with practical big-data cases, we need to understand that typical input is not random. Non-random strings exhibit patterns and repetition, and knowing this fact makes it possible to reduce the R-way trie to a K-way trie where K is as small as possible.
As its name suggests, the Ternary Search Trie (TST) is a reduced version of the R-way trie that has at most three links per node. Looking at the TST construction illustrated in Image 6, we see that a TST requires a different representation: each character appears explicitly in a node, and every comparison branches one of exactly three ways, regardless of the size of the encoding. The search operation is illustrated in Image 7.

Image 6: TST construction


Image 7: Search on TST

Below is the implementation of TST in Java:


public class TST<Value>
{
    private Node root;

    private class Node
    {
        char c;
        Node left, mid, right;
        Value val;
    }

    public Value get(String key)
    {
        Node x = get(root, key, 0);
        if (x == null) return null;
        return x.val;
    }

    private Node get(Node x, String key, int d)
    {
        if (x == null) return null;
        char c = key.charAt(d);
        if      (c < x.c) return get(x.left, key, d);
        else if (c > x.c) return get(x.right, key, d);
        else if (d < key.length() - 1) return get(x.mid, key, d+1);
        else return x;
    }

    public void put(String key, Value val)
    {
        root = put(root, key, val, 0);
    }

    private Node put(Node x, String key, Value val, int d)
    {
        char c = key.charAt(d);
        if (x == null) { x = new Node(); x.c = c; }
        if      (c < x.c) x.left = put(x.left, key, val, d);
        else if (c > x.c) x.right = put(x.right, key, val, d);
        else if (d < key.length() - 1) x.mid = put(x.mid, key, val, d+1);
        else x.val = val;
        return x;
    }
}
Since each node branches three ways regardless of the alphabet size, the search time in a TST is approximately ~ln N, and the space is between 3N and 3Nw.

3. Substring Search
Let's make a leap: since the brute-force implementation is out of consideration because of its ~NM running time, our attention here goes to more efficient algorithms.

Knuth-Morris-Pratt (KMP) Substring Search


The idea of the KMP algorithm is that by removing the backup over the text that the brute-force algorithm needs, we can implement substring search in linear time, proportional to the N characters of the text. For example, suppose we need to find the pattern ABABC in the text ABABABC. When there is a mismatch at position 4, KMP chooses a better restart position, 2, to continue the search. To make this happen, KMP needs preprocessing whose running time is proportional to M, the pattern length. A proper preprocessing structure for KMP is a DFA (Deterministic Finite Automaton).
The graphical construction of the DFA for the pattern ABABC is illustrated in Images 8 and 9.

Image 8: Construction of DFA ABABC (1)


Image 9: Construction of DFA ABABC (2)
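The DFA construction sketched in the images can be checked with a small stand-alone helper; this is a hypothetical demo class that builds the same table as the KMP constructor:

```java
public class DFADemo {
    // Build the KMP DFA: dfa[c][j] is the next state after reading c in state j.
    public static int[][] build(String pat) {
        int M = pat.length(), R = 256;
        int[][] dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][X];           // mismatch: copy the restart state's column
            dfa[pat.charAt(j)][j] = j + 1;       // match: advance to the next state
            X = dfa[pat.charAt(j)][X];           // update the restart state
        }
        return dfa;
    }

    public static void main(String[] args) {
        int[][] dfa = build("ABABC");
        // In state 4 (ABAB matched), reading 'A' falls back to state 3 instead of 0.
        System.out.println(dfa['A'][4]); // prints 3
    }
}
```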

The implementation of the KMP algorithm in Java looks like this:


import edu.princeton.cs.algs4.StdOut;

public class KMP


{
private String pat;
private int[][] dfa;

public KMP(String pat)


{ // Construct DFA from pattern
this.pat = pat;
int M = pat.length();
int R = 256;
dfa = new int[R][M];
dfa[pat.charAt(0)][0] = 1;
for (int X = 0, j = 1; j < M; j++) {
// Compute dfa[][]
for (int c = 0; c < R; c++)
dfa[c][j] = dfa[c][X];
dfa[pat.charAt(j)][j] = j+1;
X = dfa[pat.charAt(j)][X];
}
}

public int search(String txt)


{ // Simulate operation of DFA on txt
int i, j, N = txt.length(), M = pat.length();
for (i = 0, j = 0; i<N && j<M; i++)
j = dfa[txt.charAt(i)][j];
if (j == M) return i-M; // Found
else return N; // Not found
}

/**
* Will print:
* % java KMP AACAA AABRAACADABRAACAADABRA
* text: AABRAACADABRAACAADABRA
* pattern: AACAA
*/
public static void main(String[] args)
{
String pat = args[0];
String txt = args[1];
KMP kmp = new KMP(pat);
StdOut.println("text: " + txt);
int offset = kmp.search(txt);
StdOut.print("pattern: ");
for (int i = 0; i < offset; i++)
StdOut.print(" ");
StdOut.println(pat);
}
}

The worst-case running time of KMP is N+M: M time for preprocessing plus N time for searching.

Boyer-Moore Substring Search


The Boyer-Moore algorithm uses a heuristic to handle mismatched characters, scanning the pattern from right to left. Step by step, the mismatched-character heuristic works as follows:
1. Let i = 0 and let s = 0 be the skip value.
2. Let j = (length of pattern) - 1.
3. Compare pattern[j] with text[i+j].
4. If they do not match, set s to j minus the rightmost index of the mismatched character in the pattern (at least 1), increment i by s, and go back to step 2.
5. Otherwise, decrement j by 1 and go back to step 3.
The image below illustrates how the Boyer-Moore algorithm finds the pattern NEEDLE in the text
FINDINAHAYSTACKNEEDLE.

Image 10: Example of mismatched character heuristic


Obviously, we can support this heuristic in constant time per mismatch by constructing a lookup array of indices into the pattern. Let's call that array right[], since we use it during the right-to-left scan. We simply allocate an array of size R, fill it with -1 for every character that does not appear in the pattern, and, for each character that does appear, store the index of its rightmost occurrence (a value between 0 and M-1). The construction of the right[] array for the pattern NEEDLE is illustrated in Image 11.

Image 11: Right array example
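As a quick check of this construction, here is a small sketch that computes right[] for NEEDLE (the demo class name is an assumption):

```java
public class RightArrayDemo {
    // Build the right[] skip table: rightmost index of each character in pat, or -1.
    public static int[] right(String pat) {
        int R = 256;
        int[] right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;                 // character not found in the pattern
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;      // later occurrences overwrite earlier ones
        return right;
    }

    public static void main(String[] args) {
        int[] r = right("NEEDLE");
        System.out.println(r['N'] + " " + r['E'] + " " + r['D'] + " " + r['L'] + " " + r['X']);
        // prints: 0 5 3 4 -1
    }
}
```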

The implementation of the Boyer-Moore algorithm in Java looks like this:


import edu.princeton.cs.algs4.StdOut;

class BoyerMoore
{
private int[] right;
private String pat;

BoyerMoore(String pat)
{ // Compute skip table
this.pat = pat;
int M = pat.length();
int R = 256;
right = new int[R];
for (int c=0; c<R; c++)
right[c] = -1;
for (int j=0; j<M; j++)
right[pat.charAt(j)] = j;
}

public int search(String txt)


{ // Search for pattern in txt
int N = txt.length();
int M = pat.length();
int skip;
for (int i=0; i<=N-M; i+=skip) {
skip = 0;
for (int j=M-1; j>=0; j--) {
if (pat.charAt(j) != txt.charAt(i+j)) {
skip = j-right[txt.charAt(i+j)];
if (skip<1) skip = 1;
break;
}
}
if (skip == 0) return i; // Found
}
return N; // Not found
}

/**
* Will print:
* % java BoyerMoore AACAA AABRAACADABRAACAADABRA
* text: AABRAACADABRAACAADABRA
* pattern: AACAA
*/
public static void main(String[] args)
{
String pat = args[0];
String txt = args[1];
BoyerMoore boyerMoore = new BoyerMoore(pat);
StdOut.println("text: " + txt);
int offset = boyerMoore.search(txt);
StdOut.print("pattern: ");
for (int i = 0; i < offset; i++)
StdOut.print(" ");
StdOut.println(pat);
}
}

The typical implementation of Boyer-Moore, like the code above, only guarantees a worst-case running time of NM. A full implementation of Boyer-Moore provides a linear-time worst-case guarantee by adding a KMP-like table. Looking at Image 10, we could choose a better skip value with such a KMP-like array rather than simply decrementing the index j.

Rabin-Karp Fingerprint Search


The Rabin-Karp algorithm uses a hash function to encode the pattern and the substrings of the text, so a match occurs if and only if the encoded value of the pattern equals that of a substring. We use Horner's method to do efficient modular hashing on substrings of length M. A simple hash of the substring starting at position i is formulated by:

x_i = t_i R^(M-1) + t_(i+1) R^(M-2) + ... + t_(i+M-1) R^0

To avoid recalculation, we can derive from the formula above a way to compute the value of the next substring: subtract the leftmost term, multiply what remains by R (raising the order of each remaining term), and add the new rightmost character:

x_(i+1) = (x_i - t_i R^(M-1)) R + t_(i+M)
To avoid exhaustive computation and to keep the numbers small (a Java int only holds values up to 2^31 - 1), we keep only the remainder of each hash value. This method is called modular hashing:

H(x_i) = x_i mod Q

where Q is a prime number. This calculation is similar to the one used in a hash table, but in this case we do not store each hash value in a table; we only care about the hash value of each substring. Since choosing the right value of Q is not a trivial problem, we settle for the guarantee that, for a large enough Q, the probability of a collision is about 1/Q. This approach is called Monte Carlo correctness.
But if we take a defensive approach, we should ensure that a match is always correct; in that case we use the approach called Las Vegas correctness, which requires backing up to test each candidate match character by character. A graphical illustration of the Rabin-Karp algorithm appears in the image below.
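To make Horner's method with modular hashing concrete, here is a small self-contained sketch; the decimal radix R = 10, the prime Q = 997, and the digit-valued input are demo assumptions, not from the text:

```java
public class HornerDemo {
    // Hash of the M leading digits of key, treated as a base-R number mod Q.
    public static long hash(String key, int M, int R, long Q) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (R * h + (key.charAt(j) - '0')) % Q;  // one Horner step per digit
        return h;
    }

    public static void main(String[] args) {
        // 26535 mod 997 = 613, computed one digit at a time without overflow
        System.out.println(hash("26535", 5, 10, 997)); // prints 613
    }
}
```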

Image 12: Rabin-Karp implementation

The implementation of the Rabin-Karp algorithm in Java looks like this:


import java.math.BigInteger;
import java.util.Random;

public class RabinKarp


{
private String pat; // Only needed for Las Vegas
private long patHash;
private int M;
private long Q;
private int R = 256;
private long RM;

public RabinKarp(String pat)


{
this.pat = pat;
this.M = pat.length();
Q = longRandomPrime();
RM = 1;
for (int i = 1; i <= M; i++)
RM = (R * RM) % Q;
patHash = hash(pat, M);
}

private static long longRandomPrime()
{ // A random 31-bit prime; large enough to make collisions unlikely
long prime = BigInteger.probablePrime(31, new Random()).longValue();
return prime;
}

// Monte carlo
// public boolean check(int i)
// {
// return true;
// }

// Las Vegas
public boolean check(String txt, int i)
{
for (int j = 0; j < M; j++)
if (pat.charAt(j) != txt.charAt(i+j))
return false;
return true;
}

private long hash(String key, int M)


{
long h = 0;
for (int j = 0; j < M; j++)
h = (R * h + key.charAt(j)) % Q;
return h;
}

public int search(String txt)


{
int N = txt.length();
long txtHash = hash(txt, M);
if (patHash == txtHash) return 0;
for (int i = M; i < N; i++) {
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
txtHash = (txtHash*R + txt.charAt(i)) % Q;

int offset = i - M + 1;
if (patHash == txtHash && check(txt, offset))
return offset; // Match
}
return N; // No match found
}
}

Rabin-Karp substring search is known as fingerprint search because it uses a small amount of
information to represent a pattern. Then it looks for this fingerprint (the hash value) in the text. The
algorithm is efficient because the fingerprints can be efficiently computed and compared.
Summary
Algorithm     Version                          Guarantee   Typical   Backup?   Correct?   Extra space
Brute force   -                                MN          1.1N      yes       yes        1
KMP           Full DFA                         2N          1.1N      no        yes        MR
KMP           Mismatch transitions only        3N          1.1N      no        yes        M
Boyer-Moore   Full algorithm                   3N          N/M       yes       yes        R
Boyer-Moore   Mismatch char heuristic only     MN          N/M       yes       yes        R
Rabin-Karp    Monte Carlo                      7N          7N        no        yes*       1
Rabin-Karp    Las Vegas                        7N*         7N        yes       yes        1
