
Mathematical model for a string pattern matching algorithm (Boyer-Moore algorithm)

Notations: A string is a finite sequence of symbols taken from some alphabet, and is represented by a finite array. A string p1, ..., pn, for some natural number n, is represented by an array p of length n with entries p[i] = pi for all natural numbers i with 1 ≤ i ≤ n. Hereafter, |p| denotes the length of string p, i.e., the number of symbols in p. In the rest of the explanation, we use the expression p[i:j] to denote the substring p[i], ..., p[j], for natural numbers i and j satisfying 1 ≤ i ≤ j ≤ |p|.

The String-Preprocessing Algorithm

The problem statement: The problem of pattern matching can intuitively be described as follows:

Given some string p, called the pattern, and another string t, called the text, find the first occurrence of p in t, provided that such an occurrence exists; otherwise, return some value indicating that the pattern does not occur in the text. If the pattern occurs in the text, one can return the index in the text where the first occurrence of the pattern starts. (If the pattern does not occur in the text, one can return, say, -1.)

At a high level of abstraction, a pattern matching algorithm first aligns the pattern with the left end of the text. If the pattern matches there, we are done. Otherwise, we shift the pattern to the right and repeat the process until an occurrence of the pattern is found or it is determined that the pattern does not occur in the text.

In the Boyer-Moore pattern matching algorithm, when the pattern is aligned with some part of the text, the pattern and the corresponding part of the text are compared character by character from right to left. If a mismatch occurs, i.e., a character in the pattern differs from the corresponding character in the searched text, the pattern is shifted to the right, and the process of comparing characters in pattern and text is repeated until the algorithm terminates.

The motivation for the string preprocessing is to determine how far to shift the pattern when a mismatch occurs. This is done by using the information that some, possibly empty, suffix of the pattern has already been matched in the text. Note that the string-preprocessing algorithm is performed on the pattern only, without knowledge of the searched text. Let p denote an arbitrary non-empty pattern, and let m denote the length of p. (If the pattern is empty, there is no need for any preprocessing on that pattern.) Intuitively, when a mismatch has occurred at position i, 1 ≤ i ≤ m, it means that p[i+1:m] has been found to occur in the (fictitious) text. The algorithm then determines the rightmost re-occurrence p[r - m + i + 1 : r] with r < m in pattern p such that (1) p[i+1:m] is the same string as p[r - m + i + 1 : r] and (2) the characters p[i] and p[r - m + i] differ.

NOTE: The reason for searching for such a re-occurrence is that it lets one shift the pattern by as many positions as is safe. Shifting the pattern by fewer positions would be wasteful, since the pattern cannot be found at those intermediate positions of the text. A shift by more positions is not an option, because it could miss an occurrence of the pattern in the text.
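The high-level scheme described above (align, compare from right to left, shift on mismatch) can be sketched as follows. This is only an illustrative baseline, not the Boyer-Moore algorithm itself: it always shifts by one position, which is exactly the wasteful behaviour the preprocessing is meant to avoid. The function name find_first, the 0-based indexing and the sentinel -1 are assumptions made for this sketch.

```python
def find_first(p: str, t: str) -> int:
    """Return the 0-based index of the first occurrence of p in t, or -1.

    Hypothetical helper for illustration: compares right to left, as the
    text describes, but always shifts the pattern by a single position.
    """
    m, n = len(p), len(t)
    l = 0  # left end of the current alignment of p against t
    while l + m <= n:
        j = m
        # Compare pattern and aligned text from right to left.
        while j > 0 and p[j - 1] == t[l + j - 1]:
            j -= 1
        if j == 0:
            return l  # every character of the pattern matched
        l += 1  # naive shift; Boyer-Moore replaces this with a larger safe shift
    return -1  # the pattern does not occur in the text
```

For example, find_first("abc", "xxabcabc") inspects alignments 0, 1 and 2 in turn and reports the occurrence starting at index 2.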

Having discussed the background, let us now formally define the input, the output and the mathematical equations that define the transition from the input to the output.

INPUT: A string p = p1, p2, ..., pm, called the pattern, and another string t, called the text.

OUTPUT: Although we described this before as shifts, the algorithm actually computes the distance between the index in the text where the character mismatch occurred and the index where the next character comparison will take place.

NOTE: It may be the case, however, that no such rightmost re-occurrence exists in the pattern. In that case, instead of a so-called complete (rightmost) re-occurrence in the pattern, the algorithm determines the longest possible partial re-occurrence of a suffix of the pattern.

BASIS OF TRANSITION (input to output):

Formally, the rightmost complete re-occurrence or the partial re-occurrence with respect to index i is characterized by the function Match, defined as:

Match(p, i) = max{ r | (m - i < r < m ∧ p[i] ≠ p[r - m + i] ∧ ∀j.(i + 1 ≤ j ≤ m ⇒ p[j] = p[r - m + j])) ∨ (0 ≤ r ≤ m - i ∧ ∀j.(1 ≤ j ≤ r ⇒ p[j] = p[m - r + j])) }

Thus, Match(p, i) is the maximal index r in p such that the rightmost complete re-occurrence of p[i+1:m] ends at index r, or the partial re-occurrence ends at index r. Once the value of r is computed, the value of l (the jump/shift value) is computed. (The formulae are stated after the pseudocode.)
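As a sanity check on the definition of Match, the set can be enumerated by brute force. The sketch below assumes the definition is read as two alternatives: a complete re-occurrence (m - i < r < m, p[i] differs from p[r - m + i], and p[i+1:m] equals p[r - m + i + 1 : r]) or a partial re-occurrence (0 ≤ r ≤ m - i and the prefix p[1:r] equals the suffix p[m - r + 1 : m]). The name match and the helper at are hypothetical; indices are 1-based as in the notation section.

```python
def match(p: str, i: int) -> int:
    """Brute-force Match(p, i) for a non-empty pattern p and 1 <= i <= len(p)."""
    m = len(p)
    if not 1 <= i <= m:
        raise ValueError("i must satisfy 1 <= i <= len(p)")

    def at(k: int) -> str:
        return p[k - 1]  # 1-based access, matching the paper's p[k]

    candidates = []
    for r in range(m):  # both alternatives force 0 <= r < m
        complete = (
            m - i < r < m
            and at(i) != at(r - m + i)
            and all(at(j) == at(r - m + j) for j in range(i + 1, m + 1))
        )
        partial = (
            0 <= r <= m - i
            and all(at(j) == at(m - r + j) for j in range(1, r + 1))
        )
        if complete or partial:
            candidates.append(r)
    # r = 0 always satisfies the partial case, so the set is never empty.
    return max(candidates)
```

For the pattern abcab and i = 3, the matched suffix ab has no complete re-occurrence ending at r = 3 or r = 4, but the prefix ab is a partial re-occurrence, so match("abcab", 3) evaluates to 2.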

===========================================================
PSEUDOCODE: Let us now take a look at the pseudocode.

Notations used in the pseudocode: At any moment, imagine that the pattern is aligned with a portion of the text of the same length, though only a part of the aligned text may have been matched with the pattern. Henceforth, "alignment" refers to the substring of t that is aligned with p, and l is the index of the left end of the alignment; i.e., p[0] is aligned with t[l] and, in general, p[i], 0 ≤ i < m, with t[l + i]. (Indices in the pseudocode are 0-based, and p[j..m] denotes p[j], ..., p[m - 1].) Whenever there is a mismatch, the pattern is shifted to the right, i.e., l is increased.

The variable j: We introduce variable j, 0 ≤ j ≤ m, with the meaning that the suffix of p starting at position j matches the corresponding portion of the alignment:

Q2: 0 ≤ j ≤ m ∧ p[j..m] = t[l + j..l + m]

Thus, the whole pattern is matched when j = 0, and no part has been matched when j = m.

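Pseudocode:

    j := m;
    while j > 0 ∧ p[j - 1] = t[l + j - 1] do
        j := j - 1
    endwhile
    if j = 0 then
        { Q1 ∧ Q2 ∧ j = 0 }
        record a match at l; l := l'
    else
        { Q1 ∧ Q2 ∧ j > 0 ∧ p[j - 1] ≠ t[l + j - 1] }
        l := l''
    endif

(figure: 1.1)

Here l' and l'' denote the new values of l in the two branches, and the assertions in braces state what holds at that point of the program.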

Computation of l': This turns out to be essentially a special case of the computation of l''.

Computation of l'': The precondition for the computation of l'' is j > 0 ∧ p[j - 1] ≠ t[l + j - 1]. Let s denote the matched suffix. According to the minimum shift rule, the amount b(s) by which the pattern is shifted is

b(s) = min{ |p| - |r| : r ∈ R }
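The minimum shift rule can be illustrated with a brute-force computation of b(s). The set R is not defined in this section, so the sketch makes an explicit assumption: R is taken to be the set of proper prefixes r of p such that r is a suffix of s or s is a suffix of r, which is one common way to state this rule. The function name good_suffix_shift is hypothetical.

```python
def good_suffix_shift(p: str, s: str) -> int:
    """Brute-force b(s) = min{ |p| - |r| : r in R } for a matched suffix s of p.

    Assumption (not stated in the text): R is the set of proper prefixes r
    of p such that r is a suffix of s or s is a suffix of r.
    """
    if not p or not p.endswith(s):
        raise ValueError("p must be non-empty and s must be a suffix of p")
    best = len(p)  # the empty prefix is always in R, giving a shift of |p|
    for k in range(len(p)):  # k = |r|; k < |p| keeps r a proper prefix
        r = p[:k]
        if s.endswith(r) or r.endswith(s):
            best = min(best, len(p) - k)
    return best
```

Under this assumption, good_suffix_shift("abcab", "ab") is 3: the longest qualifying prefix is ab, so the pattern moves by 5 - 2 positions. Because r is a proper prefix, the shift is always at least 1.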

Updating l: In the algorithm outlined earlier, there are two assignments to l:

  l := l', when the whole pattern has matched
  l := l'', when p[j..m] = t[l + j..l + m] and p[j - 1] ≠ t[l + j - 1]

These assignments are implemented as follows:

  l := l' is implemented by l := l + b(p)
  l := l'' is implemented by l := l + max(b(s), j - 1 - rt(h)), where s = p[j..m], h = t[l + j - 1], and rt(h) is the index of the rightmost occurrence of h in p (or -1 if h does not occur in p)
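Putting the two update rules together gives a complete, if unoptimized, search loop. The sketch below is a minimal illustration under stated assumptions, not a tuned implementation: b is recomputed by brute force on every shift instead of being tabulated in a preprocessing pass, R is assumed to be the set of proper prefixes r of p with r a suffix of s or s a suffix of r, and the names b, rt and search_all are chosen for this example.

```python
def b(p: str, s: str) -> int:
    # Brute-force minimum shift for matched suffix s (assumed definition of R:
    # proper prefixes r of p with r a suffix of s or s a suffix of r).
    return min(
        len(p) - k
        for k in range(len(p))
        if s.endswith(p[:k]) or p[:k].endswith(s)
    )


def rt(p: str, h: str) -> int:
    # Index of the rightmost occurrence of character h in p, or -1 if absent.
    return p.rfind(h)


def search_all(p: str, t: str) -> list[int]:
    """Record every 0-based index l at which p occurs in t."""
    if not p:
        raise ValueError("the pattern must be non-empty")
    m, n = len(p), len(t)
    matches = []
    l = 0
    while l + m <= n:
        j = m
        while j > 0 and p[j - 1] == t[l + j - 1]:
            j -= 1
        if j == 0:
            matches.append(l)  # whole pattern matched
            l += b(p, p)       # shift after a full match
        else:
            h = t[l + j - 1]   # mismatching text character
            l += max(b(p, p[j:]), j - 1 - rt(p, h))  # shift after a mismatch
    return matches
```

For example, search_all("ab", "abcabab") records the alignments 0, 3 and 5. Since b never returns less than 1, the max in the mismatch branch guarantees progress even when j - 1 - rt(h) is zero or negative.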

P, NP, NP-COMPLETE, NP-HARD ANALYSIS:

DEFINITIONS:

P is the set of all decision problems solvable by deterministic algorithms in polynomial time.
-- An algorithm A is of polynomial time complexity if there exists a polynomial p() such that the computing time of A is O(p(n)) for every input of size n.
-- Deterministic algorithms are those in which the result of every operation is uniquely defined.

NP is the set of all decision problems solvable by non-deterministic algorithms in polynomial time.

PROPOSITION: The proposed string pre-processing algorithm belongs to class P. Since P is a subset of NP, the algorithm also has a corresponding non-deterministic (NP) formulation.

Proof: We are required to prove that the algorithm is deterministic and runs in polynomial time (ref: definition of P). In the pseudocode (ref: figure 1.1), the value of j ranges from 0 to m (0 ≤ j ≤ m), where m is the length of the pattern. The value of l (the jump value) is determined by the value of j, i.e., by the cases j = 0 and j > 0. In both cases the value is uniquely defined, so all operations are uniquely defined. The while loop performs at most m character comparisons per alignment, so the algorithm has polynomial time complexity. Hence, the pre-processing algorithm belongs to class P.

An NP algorithm for this follows:

    j := Choice(0, m);
    while j > 0 do
        if (p[j - 1] = t[l + j - 1]) then failure();
        j := j - 1
    endwhile
    write(j); Success();

(figure: 1.2)

Explanation: The above NP algorithm uses the function Choice(), which non-deterministically chooses a value between 0 and m and assigns it to j. The while loop verifies whether the choice of j is correct. The time complexity of Choice(), Success() and failure() is assumed to be O(1). The problem is therefore solvable in polynomial time by a non-deterministic algorithm.

KEYWORDS: String pattern matching algorithms, Boyer-Moore algorithm, P/NP class.

CONCLUSION: Thus, the Boyer-Moore string pre-processing algorithm calculates the most appropriate number of positions by which to shift the pattern p after a mismatch, which results in a significant reduction in the running time of the search.
