
Hash: A String Matching Algorithm

Author: Le Khac Minh Tue

1. Introduction

a. Context: String processing is a class of problems that receives a great deal of attention in computer science in general and in competitive programming in particular. Within this class, one problem comes up again and again: string searching.

b. Problem statement:
i. Given a text of $m$ characters.
ii. Given a pattern of $n$ characters.
iii. The computer must answer the question: how many times does the pattern occur in the text, and at which positions?

c. Algorithms: Many algorithms solve this problem. Here is a brief summary of two that are common in programming contests:
i. Brute-force approach: The direct approach yields a working algorithm, but its worst-case complexity is very high. The brute-force algorithm tries to match the pattern at every possible position in the text, so its complexity is $O(m \times n)$.
ii. Knuth-Morris-Pratt algorithm: Abbreviated KMP, it was conceived in the early 1970s by James H. Morris, Vaughan Pratt and Donald Knuth, who published it jointly in 1977. It uses a correction array (often called the failure function) and is very efficient, with complexity $O(m + n)$.

d. Purpose of this article: This article concentrates on a single algorithm, which the author will call Hash. In the author's own assessment it is a very effective algorithm, especially in contests, for three reasons: execution speed, flexibility of use (it applies well to many problems), and simplicity of implementation. The article first presents the algorithm, then shows several applications and ways to use and extend Hash in programming problems.

2. The Hash algorithm

a. Notation:
i. The alphabet in use: $\Sigma$.


ii. The text: $T[1..m]$.
iii. The pattern: $P[1..n]$.
iv. The substring from position $i$ to position $j$ of a string $S$: $S[i..j]$.
v. We need to find all positions $i$ ($1 \le i \le m - n + 1$) such that $T[i..i+n-1] = P$.

b. Description of the algorithm:

For simplicity, assume that $\Sigma = \{a, b, \ldots, z\}$, that is, $\Sigma$ contains only lowercase Latin letters. To represent a string we switch from letters to numbers. For example, the string aczd is written as a sequence of 4 numbers: (0,2,25,3). A string is thus represented as a number in base 26. It follows that two strings (of the same length, as in every comparison made here) are equal if and only if their base-10 representations are the same. This is the idea behind the algorithm: convert both strings from base 26 to base 10, then compare the base-10 values.

However, the base-10 representation of a string can be very large and fall outside the range of the machine's integer types. To work around this, we compare the two base-10 representations after taking the remainder modulo a large integer. More precisely: if the decimal value of string $a$ is $x$ and the decimal value of string $b$ is $y$, we treat $a$ as equal to $b$ "if and only if" $x \bmod base = y \bmod base$, where $base$ is a sufficiently large integer.

It is easy to see that comparing $x \bmod base$ with $y \bmod base$ and then concluding whether $a$ equals $b$ is not logically sound: $x \bmod base = y \bmod base$ is only a necessary condition for $a = b$, not a sufficient one. The Hash algorithm nevertheless accepts this flawed reasoning and treats the necessary condition as if it were sufficient. In practice this occasionally makes a string comparison wrong and the program then produces a wrong result. Practice also shows that when $base$ is a large integer such cases are very rare, so Hash can be regarded as a correct algorithm.

To simplify the rest of the presentation, we call the decimal value of a string, reduced modulo $base$, the hash code of that string. To repeat: two strings are equal "if and only if" their hash codes are equal.

Back to the original problem: we need to report the positions at which $P$ occurs in $T$. To do this we simply try every possible starting position of $P$ in $T$. Suppose the position is $i$; we check whether $T[i..i+n-1]$ equals $P$. For that check we need the hash code of $T[i..i+n-1]$ and the hash code of $P$. The hash code of $P$ is computed simply as follows:
hashP = 0
for (i : 1 .. n)
    hashP = (hashP * 26 + P[i] - 'a') mod base
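As a concrete check of the base-26 encoding, the example string aczd with digits (0, 2, 25, 3) can be evaluated by hand. The worked numbers below are an illustration added here, not part of the original listing; the loop above reaches the same value through the intermediate results 0, 2, 77, 2005.

```latex
\[
0 \cdot 26^{3} + 2 \cdot 26^{2} + 25 \cdot 26 + 3
  = 0 + 1352 + 650 + 3
  = 2005
\]
```

Since 2005 is smaller than any large modulus, the hash code of this short string is the value itself; the reduction modulo base only starts to matter for longer strings.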


The harder part of the Hash algorithm is computing the hash code of a substring $T[i..j]$ of $T$ ($1 \le i \le j \le m$).

To picture it, consider an example. Take a string $s$ whose base-26 representation is (4,1,2,5,1,7,8). We want the hash code of the substring from the 3rd to the 6th element, that is, of (2,5,1,7). To obtain $s[3..6]$, it suffices to take the number for $s[1..6]$, which is (4,1,2,5,1,7), and subtract the number formed by $s[1..2]$ followed by (0,0,0,0), which is (4,1,0,0,0,0); the result is (2,5,1,7). In the same way, the hash code of $s[3..6]$ is the hash code of $s[1..6]$ minus the hash code of $s[1..2]$ multiplied by $26^4$.

To implement this idea, we precompute $26^x \bmod base$ ($0 \le x \le m$) and the hash codes of all prefixes of $T$, namely of the strings $T[1..i]$ ($1 \le i \le m$).
pow[0] = 1
for (i : 1 .. m)
    pow[i] = (pow[i-1] * 26) mod base
hashT[0] = 0
for (i : 1 .. m)
    hashT[i] = (hashT[i-1] * 26 + T[i] - 'a') mod base

The code above produces the array pow[] (storing $26^x \bmod base$) and the array hashT[] (storing the hash code of $T[1..i]$). To obtain the hash code of $T[i..j]$, we write the following function:
function getHashT(i, j):
    return (hashT[j] - hashT[i - 1] * pow[j - i + 1] + base * base) mod base
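To see why this formula returns the hash code of $T[i..j]$, the prefix hashes can be written out as polynomials in 26. The symbol $t_k$ (the numeric value of $T[k]$) is notation introduced only for this derivation, and every congruence is taken modulo base.

```latex
\begin{align*}
hashT[j] &\equiv \sum_{k=1}^{j} t_k \cdot 26^{\,j-k} \\
hashT[i-1] \cdot 26^{\,j-i+1}
  &\equiv \sum_{k=1}^{i-1} t_k \cdot 26^{\,i-1-k} \cdot 26^{\,j-i+1}
   = \sum_{k=1}^{i-1} t_k \cdot 26^{\,j-k} \\
hashT[j] - hashT[i-1] \cdot 26^{\,j-i+1}
  &\equiv \sum_{k=i}^{j} t_k \cdot 26^{\,j-k}
\end{align*}
```

The last sum is exactly the base-26 value of $T[i..j]$. The term base * base in the function does not change the result modulo base; it only keeps the intermediate value non-negative, because the subtracted product is always smaller than base * base. This relies on base * base fitting in the integer type used, which holds for a modulus near $10^9$ and 64-bit integers.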

The main problem is now solved, and here is the main program:


for (i : 1 .. m - n + 1)
    if hashP = getHashT(i, i + n - 1):
        print("Match position: ", i)

c. Program code: The following program, which I wrote in C++, is a solution to the problem SUBSTR on the VOJ online judge.
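#include <cstdio>
#include <iostream>
#include <string>
using namespace std;

typedef long long ll;
const ll base = 1000000003LL;  // modulus for the hash codes
const int maxn = 1000111;      // maximum text length, with some slack

ll POW[maxn], hashT[maxn];     // POW[x] = 26^x mod base, hashT[i] = hash of T[1..i]

// Hash code of T[i..j]; adding base*base keeps the value non-negative.
ll getHashT(int i, int j) {
    return (hashT[j] - hashT[i - 1] * POW[j - i + 1] + base * base) % base;
}

int main() {
    string T, P;
    cin >> T >> P;
    int m = T.size(), n = P.size();
    T = " " + T;               // shift both strings to 1-based indexing
    P = " " + P;

    POW[0] = 1;
    for (int i = 1; i <= m; i++) POW[i] = (POW[i - 1] * 26) % base;
    for (int i = 1; i <= m; i++) hashT[i] = (hashT[i - 1] * 26 + T[i] - 'a') % base;

    ll hashP = 0;
    for (int i = 1; i <= n; i++) hashP = (hashP * 26 + P[i] - 'a') % base;

    // Print every start position whose window has the same hash code as P.
    for (int i = 1; i <= m - n + 1; i++)
        if (hashP == getHashT(i, i + n - 1)) printf("%d ", i);
    return 0;
}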

d. Evaluation: The complexity of the algorithm is $O(m + n)$. More importantly, we can check whether two strings are equal in $O(1)$. This is what gives the Hash algorithm its flexibility. Besides flexibility and speed, the implementation is also very simple compared with other string-processing algorithms.

3. Applications

As mentioned above, this algorithm can sometimes give a wrong answer. Of course, besides Hash there are many other string-processing algorithms that are always exact. I will call these the "standard algorithms". Implementing a standard algorithm may give a faster and more accurate program. The price, however, is the complexity of implementing it. Using Hash not only makes implementation easier; more importantly, Hash can do things that a "standard algorithm" cannot. Below I look at a few examples to demonstrate this.

a. Longest palindrome substring

The problem is as follows: you are given a string $s$ of length $n$ ($n \le 50\,000$). Find the length of the longest palindrome made of consecutive characters of $s$. (A palindrome is a string that reads the same in both directions.)

One standard algorithm that cannot be applied to this problem is KMP. Apart from KMP, two standard algorithms do apply. The first is Manacher's algorithm, which computes the palindrome radius at every position of the string. The second uses a Suffix Array and LCP (Longest Common Prefix) on the string formed by concatenating $s$ with $s$ written in reverse order. Neither is easy to implement and both are outside the scope of this article, so I only mention them without going into detail.

Now consider the "non-standard" algorithm, Hash. For simplicity we treat the case where the palindrome has odd length (the even case is handled in exactly the same way). Suppose the longest odd-length palindrome has length $l$. Clearly $s$ contains palindromes of lengths $l - 2, l - 4, \ldots$, but no palindromes of odd length $l + 2, l + 4, \ldots$ So $l$ has the monotonic property that binary search requires, and we binary search for the largest possible $l$.

For each candidate length $l$ we need to check whether $s$ contains a palindromic substring of length $l$. To do this we go through all substrings of length $l$ in $s$. What remains is to check whether $s[i..j]$ ($1 \le i \le j \le n$; $(j - i + 1) \bmod 2 = 1$) is a palindrome. This is done as follows. Let $t$ be $s$ written in reverse order. With the Hash algorithm we can check whether a given substring of $t$ equals a given substring of $s$. So we check whether $s[i..k]$ equals $t[n - j + 1..n - k + 1]$, where $k$ is the center of symmetry, in other words $k = \frac{i + j}{2}$. The problem is thus solved. The complexity of this approach is $O(n \log n)$.

b. k-th alphabetical cyclic

The problem is as follows: you are given a sequence $a_1, a_2, \ldots, a_n$ ($n \le 50\,000$). Sort the $n$ cyclic shifts of this sequence in lexicographic order. Concretely, the cyclic shifts are $(a_1, a_2, \ldots, a_n)$, $(a_2, a_3, \ldots, a_n, a_1)$, $(a_3, a_4, \ldots, a_n, a_1, a_2)$, ... One sequence is lexicographically smaller than another if, at the first position where they differ, its element is the smaller one. The task is to print the sequence that is $k$-th in lexicographic order.

The direct approach generates all cyclic shifts, sorts them lexicographically with a sorting algorithm, and prints the $k$-th sequence after sorting. Its complexity is far too high to meet the time limit: it is $O(n^2 \log n)$, the product of the cost of sorting and the cost of each sequence comparison.

Keeping the idea of sorting all cyclic shifts and printing the one at position $k$, we try to reduce the cost of comparing two sequences lexicographically. Recall the definition of lexicographic order: take two sequences $a$ and $b$ with the same number of elements, and let $i$ be the first position from the left at which $a_i \ne b_i$. Then $a < b \Leftrightarrow a_i < b_i$. So we must find the longest common prefix of $a$ and $b$ and then compare the next element. To find the longest common prefix, we can use Hash combined with binary search.
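Finding the longest common prefix of two blocks with Hash and binary search, and then comparing the next element, can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the author's code: the names `PrefixHash`, `lcp` and `lessThan` are hypothetical, and for brevity the sequence is taken to be a lowercase string hashed with multiplier 26 and the modulus from the article's listing (a sequence of larger integers would need a larger multiplier).

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef long long ll;
const ll base = 1000000003LL;  // same modulus as the article's listing

// Prefix hashes of a lowercase string, with 1-based positions as in the article.
struct PrefixHash {
    std::vector<ll> pw, h;
    explicit PrefixHash(const std::string& s) : pw(s.size() + 1), h(s.size() + 1) {
        pw[0] = 1;
        h[0] = 0;
        for (size_t i = 1; i <= s.size(); i++) {
            pw[i] = pw[i - 1] * 26 % base;
            h[i] = (h[i - 1] * 26 + (s[i - 1] - 'a')) % base;
        }
    }
    // Hash code of s[i..j], 1-based and inclusive.
    ll get(int i, int j) const {
        assert(1 <= i && i <= j && j < (int)h.size());
        return (h[j] - h[i - 1] * pw[j - i + 1] + base * base) % base;
    }
};

// Length of the longest common prefix of s[i..i+len-1] and s[j..j+len-1],
// found by binary search on the prefix length (equal hash codes are trusted).
int lcp(const PrefixHash& ph, int i, int j, int len) {
    int lo = 0, hi = len;  // invariant: the first lo characters agree
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (ph.get(i, i + mid - 1) == ph.get(j, j + mid - 1)) lo = mid;
        else hi = mid - 1;
    }
    return lo;
}

// True if s[i..i+len-1] is lexicographically smaller than s[j..j+len-1].
bool lessThan(const std::string& s, const PrefixHash& ph, int i, int j, int len) {
    int k = lcp(ph, i, j, len);
    if (k == len) return false;          // the two blocks are equal
    return s[i - 1 + k] < s[j - 1 + k];  // s is 0-based, i and j are 1-based
}
```

Each comparison costs $O(\log n)$ instead of $O(n)$, so passing a comparator built on `lessThan` to a standard sort gives $O(n \log^2 n)$ overall.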


To solve this problem, one more small trick is needed: instead of generating all cyclic shifts, we simply double the sequence $a$. The new sequence has $2n$ elements $(a_1, a_2, \ldots, a_n, a_1, a_2, \ldots, a_n)$, and every cyclic shift is a contiguous block of length $n$ in this doubled sequence.

c. Longest substring and appear at least k times

The problem is as follows: you are given a string $s$ of length $n$ ($n \le 50\,000$). Find the longest substring of $s$ that occurs at least $k$ times.

This problem can be solved with a Suffix Array, but the implementation is complicated and is not the focus of this article, so I will not present it here. We continue with the Hash algorithm as a replacement for the standard one.

Observe that if the largest length found is $l$, then for every $l' \le l$ there is a string of length $l'$ that occurs at least $k$ times, while for every $l' > l$ there is no string of length $l'$ that occurs at least $k$ times (because $l$ is the largest). So $l$ has the monotonic property that binary search requires, and we can binary search for the largest $l$.

Now, for each $l$ tried during the binary search, we must check whether some substring of that length occurs at least $k$ times. This is very simple: generate the hash codes of all substrings of length $l$ in $s$, sort these hash codes in increasing order, and check whether there is a run of equal consecutive hash codes of length $\ge k$. The binary search takes $O(\log n)$ steps and each sort costs $O(n \log n)$, so the complexity of the whole solution is $O(n \log^2 n)$.

4. Summary

a. Algorithm: The idea of the Hash algorithm is to convert from a large base to the decimal system and to compare two large decimal numbers by comparing their remainders modulo a sufficiently large number.

b. Advantages: The Hash algorithm is very easy to implement. It is flexible in its applications and can replace other, more heavyweight standard algorithms.

c. Disadvantages: The weakness of the Hash algorithm is its accuracy. Although it is very hard to generate a test that makes the algorithm fail, it is not impossible. Therefore, to improve accuracy, people often compare hash codes under several different moduli (for example, 3 moduli at once).
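The multiple-moduli idea can be sketched as follows, here with two moduli rather than three. This is only an illustration under stated assumptions: the struct name `DoubleHash`, the array `MODS` and the second modulus 1000000007 are choices made for this sketch and do not come from the article; the first modulus is the one used in the article's listing.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

typedef long long ll;
// Two moduli: the first is taken from the article's listing,
// the second is an arbitrary extra choice for this sketch.
const ll MODS[2] = {1000000003LL, 1000000007LL};

struct DoubleHash {
    std::vector<ll> pw[2], h[2];  // powers of 26 and prefix hashes, one set per modulus

    explicit DoubleHash(const std::string& s) {
        for (int t = 0; t < 2; t++) {
            pw[t].assign(s.size() + 1, 1);
            h[t].assign(s.size() + 1, 0);
            for (size_t i = 1; i <= s.size(); i++) {
                pw[t][i] = pw[t][i - 1] * 26 % MODS[t];
                h[t][i] = (h[t][i - 1] * 26 + (s[i - 1] - 'a')) % MODS[t];
            }
        }
    }

    // Pair of hash codes of s[i..j], 1-based and inclusive.
    std::pair<ll, ll> get(int i, int j) const {
        assert(1 <= i && i <= j && j < (int)h[0].size());
        ll r[2];
        for (int t = 0; t < 2; t++)
            r[t] = (h[t][j] - h[t][i - 1] * pw[t][j - i + 1] + MODS[t] * MODS[t]) % MODS[t];
        return std::make_pair(r[0], r[1]);
    }
};
```

Two substrings of equal length are treated as equal only when both components of the pair match, so a false positive requires a collision under both moduli at the same time.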


5. References:
http://en.wikipedia.org/wiki/String_searching_algorithm
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
http://vn.spoj.com/problems/SUBSTR/
http://vn.spoj.com/problems/PALINY/
http://acm.sgu.ru/problem.php?contest=0&problem=426
http://vn.spoj.com/problems/DTKSUB/
http://en.wikipedia.org/wiki/Alphabetical_order
