

Introduction

Over the past decade or so, data mining has become one of the main research directions in computer science and knowledge engineering. A long series of studies and proposals, tested and applied successfully in practice over more than ten years, shows that data mining is a stable research field with a solid theoretical foundation, not the "short-lived fad" that a few computer scientists suspected in its early days. Data mining covers many approaches. Most of its main techniques are inherited from databases, machine learning, artificial intelligence, information theory, probability and statistics, and high-performance computing. Its principal problems are classification/prediction, clustering, association rule mining, sequence mining, and so on. The field is also a meeting point of many other disciplines. Data mining has been, and is being, applied successfully in commerce, finance and the stock market, biology, medicine, education, telecommunications, and other areas.

Aware that this is a promising research field, I chose "Parallel mining of fuzzy association rules" as the topic of my thesis. The thesis builds on existing research in association rule mining since 1993. I also venture several proposals of my own, two of which are the relationship between fuzzy association rules and fuzzy set theory, and a parallel algorithm for mining fuzzy association rules.

The thesis is organized into five chapters.

Chapter I gives an overview of data mining: what data mining and knowledge discovery in databases are, and the main steps of the knowledge discovery process. It also covers the main techniques and approaches in data mining and classifies data mining systems by several criteria. The chapter closes with a sketch of the main applications of the field and of the research directions that are receiving, or will receive, attention in the near future.

Chapter II presents the association rule mining problem. To prepare for the specific studies in the next two chapters, it provides the necessary background on this problem. The chapter closes with a summary of the main proposals made during the more than ten years in which the problem has existed and developed.

Chapter III presents fuzzy association rule mining. The first part restates the problem of mining association rules with quantitative and categorical attributes, together with the data discretization methods used for it. This kind of rule, and the discretization methods that accompany it, have several limitations, such as the semantics of the rules and the "sharp boundary" problem. Fuzzy association rules were proposed as a way to overcome these weaknesses. Besides summarizing previous research on this kind of rule, the thesis describes the relationship between association rules and fuzzy set theory and addresses the question of why the algebraic product and the min operation are chosen for the T-norm operator. The chapter closes with a proposal for converting fuzzy association rules into association rules with quantitative attributes, based on a threshold wf associated with the fuzzy set f of each fuzzy attribute.

Chapter IV focuses on parallel mining of association rules. The first part summarizes the algorithms that have been proposed and tested successfully. These algorithms share one trait: they must synchronize, to a greater or lesser degree, throughout the computation, and this is the weakness to be overcome. Exploiting the properties of fuzzy association rules, the thesis proposes a new algorithm in which the processors of a parallel system keep data exchange and synchronization to a minimum. This parallel algorithm for mining fuzzy association rules is considered close to ideal because, besides avoiding the communication weakness, it balances the load among processors through a suitable strategy for partitioning the set of candidate attributes.

Chapter V concludes the thesis by restating the work done and the results obtained. It also discusses the issues that remain unsolved, or not thoroughly solved, in the thesis, as well as future work and research directions.

Acknowledgements

First, I would like to express my deepest gratitude to my scientific supervisor and teacher, Dr. Hà Quang Thụy, who inspired me to do scientific research, introduced me to this field, and taught and guided me with great dedication over the past four years.

I would like to thank the teachers who taught me during the past two academic years, including Prof. Huỳnh Hữu Tuệ, Prof. Dr.Sc. Nguyễn Xuân Huy, Assoc. Prof. Dr. Ngô Quốc Tạo, Dr. Vũ Đức Thi, Dr. Nguyễn Kim Anh, and others.

I also respectfully thank the scientists, who are at the same time the teachers on the managing board of graduate class K8T1: Prof. Acad. Nguyễn Văn Hiệu, Prof. Dr.Sc. Bạch Hưng Khang, Assoc. Prof. Dr. Hồ Sĩ Đàm, Prof. Dr.Sc. Phạm Trần Nhu, and Assoc. Prof. Dr. Đỗ Đức Giáo.

I would also like to thank the members of the seminar group on "Data mining and parallel computing": Dr. Đỗ Văn Thành, MSc. Phạm Thọ Hoàn, MSc. Đoàn Sơn, BSc. Bùi Quang Minh, MSc. Nguyễn Trí Thành, BSc. Nguyễn Thành Trung, BSc. Tào Thị Thu Phượng, BSc. Vũ Bội Hằng, and others. They are the teachers and friends who stood beside me in this research field, and their professional comments and moral support are deeply appreciated.

I gratefully acknowledge the kindness and the professional and personal help of the teachers and colleagues in the Department of Information Systems, Faculty of Technology, Vietnam National University, Hanoi. The care shown by teachers such as Dr. Nguyễn Tuệ, Assoc. Prof. Dr. Trịnh Nhật Tiến, MSc. Nguyễn Quang Vinh, MSc. Vũ Bá Duy, MSc. Lê Quang Hiếu, and others has encouraged me greatly.

Finally, I would like to express my deep gratitude to all my family and friends. They are truly an endless source of encouragement in my life.

Thesis author
Phan Xuân Hiếu

Table of contents

Introduction
Table of contents
List of figures
List of tables
Abbreviations
Chapter I. Overview of data mining
  1.1 Data mining
    1.1.1 Why data mining?
    1.1.2 Definition of data mining
    1.1.3 Main steps in knowledge discovery (KDD)
  1.2 Approaches and techniques applied in data mining
    1.2.1 Main approaches and techniques in data mining
    1.2.2 Kinds of data that can be mined
  1.3 Applications of data mining
    1.3.1 Applications of data mining
    1.3.2 Classification of data mining systems
  1.4 Issues of current interest in data mining
Chapter II. Association rules
  2.1 Why association rules?
  2.2 Statement of the association rule mining problem
  2.3 Main approaches in association rule mining
Chapter III. Mining fuzzy association rules
  3.1 Association rules with quantitative attributes
    3.1.1 Association rules with quantitative attributes
    3.1.2 Discretization methods
  3.2 Fuzzy association rules
    3.2.1 Discretizing attributes using fuzzy sets
    3.2.2 Fuzzy association rules
    3.2.3 Algorithm for mining fuzzy association rules
    3.2.4 Converting fuzzy association rules into association rules with quantitative attributes
    3.2.5 Experiments and conclusions
Chapter IV. Parallel mining of fuzzy association rules
  4.1 Some parallel algorithms for mining association rules
    4.1.1 Count (support) distribution algorithm
    4.1.2 Data distribution algorithm
    4.1.3 Candidate distribution algorithm
    4.1.3 Parallel rule generation algorithm
    4.1.4 Other algorithms
  4.2 Parallel algorithm for fuzzy association rules
    4.2.1 Approach
    4.2.2 Parallel algorithm for fuzzy association rules
  4.3 Experiments and conclusions
Chapter V. Conclusion
  Issues addressed in this thesis
  Future research
References


List of figures

Figure 1 - The volume of accumulated data grows rapidly over time
Figure 2 - Steps in the knowledge discovery (KDD) process
Figure 3 - Illustration of association rules
Figure 4 - Example of the "sharp boundary" problem when discretizing data
Figure 5 - Membership functions of the fuzzy sets "Age_Young", "Age_MiddleAged", and "Age_Old"
Figure 6 - Membership functions of the two fuzzy sets "Cholesterol_Low" and "Cholesterol_High"
Figure 7 - Count (support) distribution algorithm on a system of 3 processors
Figure 8 - Data distribution algorithm on 3 processors


List of tables

Table 1 - Example of a transactional database
Table 2 - Frequent itemsets in the database of Table 1 with a minimum support of 50%
Table 3 - Association rules generated from the frequent itemset ACW
Table 4 - Database of heart disease examination and diagnosis for 17 patients
Table 5 - Discretizing a finite discrete quantitative attribute or a categorical attribute
Table 6 - Discretizing the quantitative attribute "Blood cholesterol level"
Table 7 - Discretizing the quantitative attribute "Age"
Table 8 - Database of heart disease examination and diagnosis for 13 patients
Table 9 - Notation used in the fuzzy association rule mining algorithm
Table 10 - Fuzzy association rule mining algorithm
Table 11 - TF: attribute values of the records after fuzzification
Table 12 - C1: the set of all attribute sets of cardinality 1
Table 13 - F2: the set of frequent attribute sets of cardinality 2
Table 14 - Fuzzy rules generated from the database in Table 8
Table 15 - Sequential association rule generation algorithm
Table 16 - The set of fuzzy attributes after fuzzifying the database in Table 8
Table 17 - Algorithm supporting the partition of the fuzzy attribute set among processors


Abbreviations

Vietnamese term      Abbreviation   English term
Cơ sở dữ liệu        CSDL           Database
Khai phá dữ liệu     KPDL           Data mining
Bộ xử lý             BXL            Processor

-9-

Chapter I. Overview of data mining


1.1 Data mining

1.1.1 Why data mining?

Over the past decade or so, the amount of information stored on electronic devices (hard disks, CD-ROMs, magnetic tapes, and so on) has grown without pause. This accumulation of data is happening at an explosive rate. It is estimated that the amount of information worldwide doubles roughly every two years, and the number and size of databases are growing rapidly along with it [AR95].

Figure 1 - The volume of accumulated data grows rapidly over time

We are truly "drowning" in data, yet "starving" for knowledge and useful information. This enormous volume of data is a very valuable resource, because information is a key factor in business: it gives executives and managers a deep, accurate, and objective view of business processes before they make decisions. Data mining, the extraction of hidden, predictive information from large databases, is a new approach that helps companies focus on the most meaningful information in their large historical data collections (databases, data warehouses, data repositories). Data mining tools can predict future trends and thus allow businesses to make timely decisions guided by the knowledge that data mining provides. The automated, predictive analysis offered by data mining has a clear advantage over the conventional analysis of past events offered by traditional decision support systems (DSSs). Data mining tools can also answer business questions that were previously considered too time-consuming to resolve. With all these advantages, data mining has proved its usefulness in today's highly competitive business environment.

Data mining has become, and continues to be, one of the main research directions in computer science and knowledge engineering. Its initial applications were limited to commerce (retail) and finance (the stock market). Today it is widely applied in other fields such as bioinformatics, medical treatment, telecommunications, education, and so on.

1.1.2 Definition of data mining

Before giving some definitions of data mining, I would like to offer a brief explanation so that the reader is not confused by terminology. From what has been presented above, data mining can be understood, roughly, as the process of searching for useful, hidden, and predictive information (knowledge) in large data sets. By that description the process ought to be called knowledge discovery in databases (KDD) rather than data mining. Nevertheless, researchers in the field agree that the two terms are equivalent and interchangeable. Their reasoning is that the main goal of the knowledge discovery process is useful information and knowledge, but the object that has to be handled most throughout the process is data. On the other hand, when dividing the knowledge discovery process into steps, some researchers hold that data mining is just one step of that process [FSSU96]. Thus, at a general level the two terms are equivalent, but in a specific sense data mining is regarded as one step of the knowledge discovery process.

There are many definitions of data mining, and all of them are descriptive. I quote a few in the original English to convey the authors' meaning unchanged and to avoid subjective errors:

Definition 1. William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus, 1991 [FSSU96]:


Knowledge discovery in databases, also known as data mining, is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Definition 2. Marcel Holsheimer and Arno Siebes (1994):


Data Mining is the search for relationships and global patterns that exist in large databases but are hidden among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database.

1.1.3 Main steps in knowledge discovery (KDD)

The knowledge discovery process is usually divided into the following steps [AR95] [MM00] [HK02]:

- Data selection: selecting the data sets to be mined from the original large data collections (databases, data warehouses, data repositories) according to given criteria.
- Data preprocessing: cleaning the data (handling incomplete, noisy, and inconsistent data), reducing the data (grouping and aggregation functions, data compression methods, histograms, sampling, and so on), and discretizing the data (based on histograms, entropy, binning, and so on). After this step the data is consistent, complete, reduced, and discretized.
- Data transformation: normalizing and smoothing the data to bring it into the form best suited to the mining techniques of the next step.
- Data mining: applying mining techniques (mostly machine learning techniques) to extract patterns and notable relationships from the data. This is regarded as the most important and most time-consuming step of the whole KDD process.
- Knowledge representation and evaluation: the patterns and relationships mined in the previous step are converted and presented in a form close to the user, such as graphs, trees, tables, and rules. This step also evaluates the discovered knowledge against given criteria.
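The five steps can be read as a chain of functions, each consuming the output of the previous one. The following Python fragment is only a toy sketch under stated assumptions: the records, the age bands, and all function names are made up for illustration, and the "mining" step is plain counting standing in for a real technique.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, age, item); None marks a missing value.
RAW = [
    (1, 34, "beer"), (2, None, "beer"), (3, 51, "milk"),
    (4, 29, "beer"), (5, 47, "milk"), (6, 38, " Beer "),
]

def select(records):
    """Data selection: keep only the fields relevant to the task."""
    return [(age, item) for _, age, item in records]

def preprocess(records):
    """Preprocessing: drop incomplete records, then discretize age into two bands."""
    complete = [(age, item) for age, item in records if age is not None]
    return [("under_40" if age < 40 else "40_plus", item) for age, item in complete]

def transform(records):
    """Transformation: normalize item labels into one canonical form."""
    return [(band, item.strip().lower()) for band, item in records]

def mine(records):
    """Mining: count (age band, item) pairs, a stand-in for a real mining technique."""
    return Counter(records)

def evaluate(patterns, min_count=2):
    """Evaluation and presentation: keep patterns seen at least min_count times."""
    return sorted((pair, n) for pair, n in patterns.items() if n >= min_count)

patterns = evaluate(mine(transform(preprocess(select(RAW)))))
for (band, item), n in patterns:
    print(f"{band} buys {item}: {n} records")
```

Keeping each step as a separate function mirrors the description above, where the output of one step is the cleaned or reshaped input of the next.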

Figure 2 - Steps in the knowledge discovery (KDD) process

1.2 Approaches and techniques applied in data mining


1.2.1 Main approaches and techniques in data mining

Data mining approaches can be divided by function or by the class of problem they address. The main approaches are the following [HK02].

- Classification and prediction: assigning an object to one of a set of classes known in advance, for example classifying geographic regions by weather data. This approach usually relies on machine learning techniques such as decision trees, artificial neural networks, and so on. Classification is also called supervised learning.
- Association rules: rules that represent knowledge in a fairly simple form. For example: of the 60% of male supermarket shoppers who buy beer, as many as 80% also buy beef jerky. Association rules are widely applied in business, medicine, bioinformatics, finance and the stock market, and so on.
- Sequential/temporal patterns: similar to association rule mining, but with ordering and time taken into account. This approach is widely applied in finance and the stock market because of its high predictive value.
- Clustering/segmentation: grouping objects into clusters whose number and names are not known in advance. Clustering is also called unsupervised learning.
- Concept description and summarization: oriented toward describing, synthesizing, and summarizing concepts, for example text summarization.

1.2.2 Kinds of data that can be mined

Because data mining is applied so widely, it works with many different kinds of data [HK02]. Typical kinds are:

- Relational databases
- Multidimensional databases (multidimensional structures, data warehouses)
- Transactional databases
- Object-relational databases
- Spatial and temporal data
- Time-series data
- Multimedia databases, such as audio, image, and video
- Text and Web data (text databases and the WWW)


1.3 Applications of data mining


1.3.1 Applications of data mining

Although data mining is a new field, it has attracted a great deal of attention from researchers thanks to its practical applications. Some typical applications are:

- Data analysis and decision support
- Medical treatment: relationships among symptoms, diagnoses, and treatment methods (diet, medication, surgery, and so on)
- Text mining and Web mining: classifying documents and web pages, summarizing text, and so on
- Bioinformatics: searching and comparing genomes and genetic information, relationships between certain genes and certain hereditary diseases, and so on
- Finance and the stock market: analyzing financial conditions and forecasting stock prices, and so on
- Insurance
- and so on

1.3.2 Classification of data mining systems

Data mining is a knowledge technology related to many research fields, such as databases, machine learning, algorithms, visualization, and so on. Data mining systems can be classified by different criteria.

- By the kind of data mined: relational databases, data warehouses, transactional databases, object-oriented databases, spatial databases, multimedia databases, text and WWW databases, and so on.
- By the kind of knowledge discovered: summarization and description, association rules, classification, clustering, sequential mining, and so on.
- By the technique applied: database-oriented, OnLine Analytical Processing (OLAP), machine learning (decision trees, artificial neural networks, k-means, genetic algorithms, support vector machines (SVM), rough sets, fuzzy sets, and so on), visualization, and so on.
- By the field of application: retail, telecommunications, bioinformatics, medical treatment, finance and the stock market, Web mining, and so on.

1.4 Issues of current interest in data mining


Data mining is a new field, so many issues have not yet been studied thoroughly. The following research directions have attracted, and continue to attract, the attention of computer scientists.

- OLAM (OnLine Analytical Mining): the integration of databases, data warehouses, and data mining. Some database management systems, such as Oracle, MS SQL Server, and DB2, already integrate data warehousing and online analytical processing (OLAP). These features are offered as add-on tools, and users must pay extra to use them. Database researchers do not want to stop there; they want an integration of databases, data warehouses, and data mining [HK02].
- Discovering many kinds of knowledge from many kinds of data [HK02] [KV01].
- Efficiency, accuracy, computational complexity, scalability and integration, handling of noise and incomplete data, and the usefulness (meaningfulness) of the knowledge [HK02].
- Combining data mining with background knowledge [KV01] [PDD99].
- Parallelizing and distributing the data mining process [AS96] [AM95] [MHT02] [HHMT02] [HKK97] [PCY95] [JPO01] [ZHL98] [DP01].
- Data Mining Query Language (DMQL): giving users a convenient query language, similar to SQL for relational databases [HK02].
- Representing and visualizing the mined knowledge in a form close to the user (human-readable expression). Knowledge can be represented along multiple dimensions and at multiple levels so that users can exploit it more effectively [HK02].


Chapter II. Association rules


2.1 Why association rules?

Association rules are rules of the form "70% of customers who buy beer also buy beef jerky, and 20% of transactions contain both beer and beef jerky" or "75% of patients who smoke and live near a polluted area have lung cancer, and 25% of all patients smoke, live near a polluted area, and have lung cancer" [AIS93]. Here "buy beer" and "smoke and live near a polluted area" are the left-hand side (antecedent) of the rule, while "buy beef jerky" and "have lung cancer" are the right-hand side (consequent). The figures 20% and 25% are the support of the rule (the percentage of transactions that contain both the left-hand and the right-hand side), while 70% and 75% are the confidence of the rule (the percentage of transactions satisfying the left-hand side that also satisfy the right-hand side).

Figure 3 - Illustration of association rules

The knowledge carried by association rules of this kind differs fundamentally from the information returned by ordinary data queries (in SQL, for instance). It usually consists of knowledge and relationships that were not known in advance, that are predictive, and that lie hidden in the data. Such knowledge is not simply the result of grouping, summing, or sorting operations; it is the result of a fairly complex and time-consuming computation.

Although association rules are a fairly simple kind of rule, they carry a great deal of meaning. The information they provide is considerable and gives substantial support to decision making. Finding "rare and precious", information-rich association rules in operational databases is one of the main approaches of data mining, and it is a strong motivation behind the research effort of many computer scientists.

2.2 Statement of the association rule mining problem


Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of $n$ items (an item is also called an attribute). Let $T = \{t_1, t_2, \ldots, t_m\}$ be a set of $m$ transactions (a transaction is also called a record), each identified by a TID (Transaction IDentification). A database $D$ is a binary relation $\delta$ on $I$ and $T$, that is, $\delta \subseteq I \times T$. If item $i$ occurs in transaction $t$, we write $(i, t) \in \delta$ or $i\,\delta\,t$. In effect, a database is a set of transactions, and each transaction $t$ is a set of items: $t \in 2^I$, where $2^I$ is the set of all subsets of $I$ [AIS93] [ZH99].

The following is an example of a (transactional) database: $I = \{A, C, D, T, W\}$, $T = \{1, 2, 3, 4, 5, 6\}$, with the transactions given in the table below:
Transaction identifier (TID)   Itemset
1                              A C T W
2                              C D W
3                              A C T W
4                              A C D W
5                              A C D T W
6                              C D T

Table 1 - Example of a transactional database
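As a small illustration of these definitions, the database of Table 1 can be written in Python both as a set of transactions and as a binary relation between items and transactions. This is only a sketch; the names `DB`, `RELATION`, and `tidset` are illustrative and do not come from the cited papers.

```python
# Table 1 as a "horizontal" database: TID -> set of items in that transaction.
DB = {
    1: set("ACTW"), 2: set("CDW"), 3: set("ACTW"),
    4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT"),
}

# The item universe I, recovered as the union of all transactions.
ITEMS = sorted(set().union(*DB.values()))

# The same database as a binary relation: all pairs (item, tid) such that
# the item occurs in the transaction.
RELATION = {(item, tid) for tid, items in DB.items() for item in items}

def tidset(item):
    """'Vertical' view: the set of TIDs of the transactions that contain `item`."""
    return {tid for (i, tid) in RELATION if i == item}

print(ITEMS)
print({item: sorted(tidset(item)) for item in ITEMS})
```

The horizontal view (transaction to items) and the vertical view (item to transactions) carry the same information; which one is convenient depends on the mining algorithm.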

A set $X \subseteq I$ is called an itemset. The support of an itemset $X$, denoted $s(X)$, is the percentage of transactions in the database that contain $X$. An itemset $X$ is called frequent if its support is greater than or equal to a user-specified threshold minsup: $s(X) \geq minsup$ [AIS93]. The following table lists all frequent itemsets in the database of Table 1 for minsup = 50%.
Frequent itemsets                              Support
C                                              100% (6)
W, CW                                          83% (5)
A, D, T, AC, AW, CD, CT, ACW                   67% (4)
AT, DW, TW, ACT, ATW, CDW, CTW, ACTW           50% (3)

Table 2 - Frequent itemsets in the database of Table 1 with a minimum support of 50%
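The contents of Table 2 can be reproduced with a short brute-force sketch in Python. It simply tests every non-empty subset of the items, which is exponential in the number of items; it is meant only to make the definition of support concrete and is not Apriori or any other algorithm from the cited literature. The function names are illustrative.

```python
from itertools import combinations

# The six transactions of Table 1.
DB = [set("ACTW"), set("CDW"), set("ACTW"), set("ACDW"), set("ACDTW"), set("CDT")]

def support(itemset, db):
    """Fraction of the transactions in `db` that contain every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def frequent_itemsets(db, minsup):
    """Brute force: return {itemset name: support} for all itemsets with support >= minsup."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(set(combo), db)
            if s >= minsup:
                result["".join(combo)] = s
    return result

freq = frequent_itemsets(DB, 0.5)
# Print by decreasing support, then by size and name, as in Table 2.
for name, s in sorted(freq.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0])):
    print(f"{name:5s} {s:.0%}")
```

With minsup = 50% the sketch finds 19 frequent itemsets, the same ones listed in Table 2.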

An association rule has the form $X \xrightarrow{c} Y$, where $X$ and $Y$ are itemsets with $X \cap Y = \emptyset$ and $c$ is the confidence of the rule, $c = s(X \cup Y) / s(X)$. In probabilistic terms, the confidence $c$ of a rule is the (conditional) probability that $Y$ occurs given that $X$ has occurred. A rule is considered confident if its confidence $c$ is greater than or equal to a user-specified threshold minconf: $c \geq minconf$ [AIS93].

The association rule mining problem, in its simplest form, is stated as follows: given a database $D$, a minimum support minsup, and a minimum confidence minconf, find all association rules of the form $X \rightarrow Y$ with support $s(X \cup Y) \geq minsup$ and confidence $c(X \rightarrow Y) = s(X \cup Y) / s(X) \geq minconf$.

Most algorithms proposed for mining association rules split this problem into two phases [AS94] [MTV94] [AM95] [AS96] [ZH99] [AG00]:

Phase 1: Find all frequent itemsets in the database, that is, all itemsets $X$ with $s(X) \geq minsup$. This phase consumes a great deal of CPU time (CPU-bound) and disk I/O time (I/O-bound).

Phase 2: Generate the confident rules from the frequent itemsets found in the first phase. This phase is relatively simple and takes little time compared with the first. If $X$ is a frequent itemset, the association rules generated from $X$ have the form $X' \xrightarrow{c} X \setminus X'$, where $X'$ is a non-empty proper subset of $X$, $X \setminus X'$ is the set difference, and $c$ is the confidence of the rule, which must satisfy $c \geq minconf$.

For example, with the frequent itemset ACW, which has support 67% in Table 2, and minconf = 70%, we can generate the following association rules:
Association rule   Confidence   Satisfies minconf 70%?
A → CW             100%         Yes
C → AW             67%          No
W → AC             80%          Yes
AC → W             100%         Yes
AW → C             100%         Yes
CW → A             80%          Yes

Table 3 - Association rules generated from the frequent itemset ACW
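The second phase can be sketched in Python for the itemset ACW. The fragment below enumerates every split of a frequent itemset into a non-empty left-hand side and a non-empty right-hand side and computes the confidence from transaction counts on the database of Table 1. It is a naive illustration with hypothetical function names, not the rule generation procedure of any cited paper.

```python
from itertools import combinations

# The six transactions of Table 1.
DB = [set("ACTW"), set("CDW"), set("ACTW"), set("ACDW"), set("ACDTW"), set("CDT")]

def count(itemset, db):
    """Number of transactions in `db` that contain every item of `itemset`."""
    return sum(itemset <= t for t in db)

def rules_from(itemset, db, minconf):
    """Return (lhs, rhs, confidence, passes) for every rule lhs -> itemset \\ lhs."""
    items = sorted(itemset)
    n_all = count(set(items), db)
    rules = []
    for k in range(1, len(items)):          # proper, non-empty left-hand sides only
        for lhs in combinations(items, k):
            rhs = [i for i in items if i not in lhs]
            conf = n_all / count(set(lhs), db)
            rules.append(("".join(lhs), "".join(rhs), conf, conf >= minconf))
    return rules

rules = rules_from("ACW", DB, 0.7)
for lhs, rhs, conf, ok in rules:
    print(f"{lhs} -> {rhs}: {conf:.0%} {'yes' if ok else 'no'}")
```

Only C → AW falls below the 70% threshold, because C occurs in all six transactions while ACW occurs in four of them.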


2.3 Main approaches in association rule mining


Since it was proposed by R. Agrawal in 1993 [AIS93], association rule mining has been studied and developed in many directions. Some proposals aim to speed up the algorithms, others aim to find more meaningful rules, and so on. The main directions are the following.

- Binary association rules (also called boolean association rules): the first research direction in association rules. Most early studies concern binary association rules [AIS93] [AS94] [MTV94]. In this kind of rule, the only thing that matters about an item (attribute) is whether or not it appears in a transaction, not "how much" of it appears. Buying 20 bottles of beer and buying 1 bottle of beer are treated as the same. The most representative algorithm for mining this kind of rule is Apriori and its variants [AS94]. This is the simplest kind of rule and, as shown later, other kinds of rules can be reduced to it by methods such as discretization and fuzzification. An example: "Buy bread = 'yes' AND buy sugar = 'yes' => buy milk = 'yes' AND buy butter = 'yes', with support 20% and confidence 80%".

- Quantitative and categorical association rules: the attributes of real databases have very diverse types (binary, quantitative, categorical, and so on). To discover association rules over such attributes, researchers have proposed discretization methods that convert these rules to binary form so that existing algorithms can be applied [SA96] [MY98]. An example: "Sex = 'Male' AND Age ∈ '50..65' AND Weight ∈ '60..80' AND Blood sugar > 120 mg/dl => Blood pressure = 'High', with support 30% and confidence 65%".

- Fuzzy association rules: in view of the limitations encountered when discretizing quantitative attributes, researchers have proposed fuzzy association rules to overcome them and to give association rules a more natural form, closer to the user [KFW98] [AG00]. An example: "Dry cough = 'yes' AND high fever AND muscle pain = 'yes' AND difficulty breathing = 'yes' => Infected with SARS = 'yes', with support 4% and confidence 85%". In this rule the condition "high fever" on the left-hand side is a fuzzified attribute.

- Multi-level association rules: besides the kinds above, researchers have proposed multi-level association rules [HF95] [SA95]. With this approach one also looks for rules of the form "Buy a PC => Buy an operating system AND buy office software, ..." rather than only overly specific rules such as "Buy an IBM PC => Buy Microsoft Windows AND buy Microsoft Office, ...". Clearly the first form generalizes the second, and generalization itself has several levels.

- Association rules with weighted items: in practice, the attributes in a database do not all play an equal role. Some attributes receive more emphasis, and we then say that they are more important than others. For example, when studying the likelihood of SARS infection, information about body temperature and the respiratory tract is clearly far more important than information about age. During rule mining we therefore assign the attributes body temperature and respiratory tract larger weights than the attribute age. This is a very interesting research direction, and several researchers have proposed solutions [LHM99] [WYY01] [THH02]. With weighted association rules we can mine rules that are highly meaningful, even "rare" rules (rules with low support that nevertheless carry special meaning).

- Faster algorithms: besides studying variants of association rules, researchers have also proposed algorithms that speed up the search for frequent itemsets. It has been shown that it suffices to find the maximal frequent itemsets, which represent the set of all frequent itemsets [BCJ01] (the MAFIA algorithm), or to find only the closed frequent itemsets, as in [PHM01] (the CLOSET algorithm), [ZH99] (the CHARM algorithm), and [PBTL99]. These algorithms are considerably faster because they apply more "sophisticated" pruning strategies than earlier algorithms.

- Parallel mining of association rules: alongside sequential algorithms, computer scientists have also studied parallel algorithms for discovering association rules. Parallel and distributed processing is needed because data sizes keep growing, which places demands on both the processing speed and the memory capacity of the system. Many different parallel algorithms have been proposed [AM95] [PCY95] [AS96] [HKK97] [ZHL98] [ZPO01] [DP01]; they may depend on, or be independent of, the hardware platform.

- Mining association rules based on rough sets: searching for association rules using rough set theory [MS00].

There are also other research directions in association rule mining, such as online mining of association rules [AY98] and mining association rules with an online connection to multidimensional data warehouses through OLAP (Online Analytical Processing), MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP), ADO (ActiveX Data Object) for OLAP, and so on.


Chapter III. Mining fuzzy association rules


3.1 Association rules with quantitative attributes

3.1.1 Association rules with quantitative attributes

Mining association rules with quantitative and categorical attributes is one of the important approaches in association rule mining (see section 2.3). This kind of rule was first proposed and studied in [SA96]. The following table illustrates a database containing binary, quantitative, and categorical attributes.
Age   Sex   Chest pain   Cholesterol   Blood sugar   Resting ECG   Max heart rate   Heart disease
60    1     4            206           0             2             132              2
54    1     4            239           0             0             126              2
54    1     4            286           0             2             116              2
52    1     4            255           0             0             161              2
68    1     3            274           1             2             150              2
54    1     3            273           0             2             152              1
54    0     2            288           1             2             159              1
67    0     3            277           0             0             172              1
46    0     2            204           0             0             172              1
52    1     2            201           0             0             158              1
40    1     4            167           0             2             114              2
37    1     3            250           0             0             187              1
71    0     2            320           0             0             162              1
74    0     2            269           0             2             121              1
29    1     2            204           0             2             202              1
70    1     4            322           0             2             109              2
67    0     3            544           0             2             160              1

Sex: 1 = female, 0 = male. Chest pain: type 1, 2, 3, or 4. Cholesterol: blood cholesterol level in mg/dl. Blood sugar: 1 = above 120 mg/dl, 0 = below 120 mg/dl. Resting ECG: 0, 1, or 2. Heart disease: 2 = yes, 1 = no.

Table 4 - Database of heart disease examination and diagnosis for 17 patients

In this database, Age, Cholesterol, and Max heart rate are quantitative attributes; Chest pain and Resting ECG are categorical attributes; and the remaining attributes, such as Sex and Heart disease, are binary (boolean) attributes. In fact, a binary attribute is a special case of a categorical attribute. From this database we can derive association rules such as:

<Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>, with support 23.53% and confidence 80%.

<Sex: Male> AND <Resting ECG: 0> AND <Blood sugar ≤ 120> => <Heart disease: No>, with support 17.65% and confidence 100%.

and so on.
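The support and confidence quoted for these two rules can be checked directly against the 17 records of Table 4. The sketch below is only a verification aid: the field names (`age`, `sex`, `chol`, and so on) and the helper `measure` are labels chosen here, and the coding of the columns (1 = female, 2 = has heart disease, and so on) follows the legend of the table.

```python
from collections import namedtuple

Patient = namedtuple(
    "Patient", "age sex chest_pain chol sugar_high rest_ecg max_hr disease")

# The 17 records of Table 4, in table order.
ROWS = [Patient(*r) for r in [
    (60, 1, 4, 206, 0, 2, 132, 2), (54, 1, 4, 239, 0, 0, 126, 2),
    (54, 1, 4, 286, 0, 2, 116, 2), (52, 1, 4, 255, 0, 0, 161, 2),
    (68, 1, 3, 274, 1, 2, 150, 2), (54, 1, 3, 273, 0, 2, 152, 1),
    (54, 0, 2, 288, 1, 2, 159, 1), (67, 0, 3, 277, 0, 0, 172, 1),
    (46, 0, 2, 204, 0, 0, 172, 1), (52, 1, 2, 201, 0, 0, 158, 1),
    (40, 1, 4, 167, 0, 2, 114, 2), (37, 1, 3, 250, 0, 0, 187, 1),
    (71, 0, 2, 320, 0, 0, 162, 1), (74, 0, 2, 269, 0, 2, 121, 1),
    (29, 1, 2, 204, 0, 2, 202, 1), (70, 1, 4, 322, 0, 2, 109, 2),
    (67, 0, 3, 544, 0, 2, 160, 1),
]]

FEMALE, MALE = 1, 0
HAS_DISEASE, NO_DISEASE = 2, 1

def measure(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    matched = [r for r in rows if antecedent(r)]
    both = [r for r in matched if consequent(r)]
    return len(both) / len(rows), len(both) / len(matched)

rule1 = measure(
    ROWS,
    lambda r: 54 <= r.age <= 74 and r.sex == FEMALE and 200 <= r.chol <= 300,
    lambda r: r.disease == HAS_DISEASE)
rule2 = measure(
    ROWS,
    lambda r: r.sex == MALE and r.rest_ecg == 0 and r.sugar_high == 0,
    lambda r: r.disease == NO_DISEASE)

print(f"rule 1: support {rule1[0]:.2%}, confidence {rule1[1]:.0%}")
print(f"rule 2: support {rule2[0]:.2%}, confidence {rule2[1]:.0%}")
```

The first rule matches five records on its left-hand side, four of which have heart disease (4/17 and 4/5); the second matches three records, none of which has heart disease (3/17 and 3/3).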

The approach proposed in [SA96] finds association rules of this form by partitioning the value domains of quantitative and categorical attributes into intervals, converting everything to binary attributes, and then applying the standard algorithms for mining binary association rules [AS94] [MTV94] [ZH99].

3.1.2 Discretization methods

Algorithms for mining binary association rules [AIS93] [AS94] [MTV94] [ZH99] can only be applied to relational databases that contain nothing but binary attributes, or to transactional databases such as the one in Table 1. They cannot be applied directly to databases with quantitative and categorical attributes such as the one in Table 4. To make this possible, the data must first be discretized so that quantitative attributes become binary attributes [SA96] [MY98]. Although the algorithms proposed in [SA96] [MY98] solve this problem completely, the results have not fully satisfied researchers. The problem lies not in the algorithms but in the way the data is discretized. This section presents several discretization methods and assesses their weaknesses.

If A is a discrete quantitative attribute or a categorical attribute with a finite value domain {v1, v2, ..., vk} and k is small enough (< 100), the attribute is transformed into k binary attributes A_V1, A_V2, ..., A_Vk. The value of a record in field A_Vi is True (Yes or 1) if the value of that record in the original attribute A equals vi; otherwise the value of A_Vi is False (No or 0). The attributes Chest pain and Resting ECG in Table 4 are of this kind. Chest pain is then converted into four binary attributes: Chest pain_1, Chest pain_2, Chest pain_3, and Chest pain_4.

Before discretization           After discretization
Chest pain type (1, 2, 3, 4)    Chest pain type_1   Chest pain type_2   Chest pain type_3   Chest pain type_4
4                               0                   0                   0                   1
1                               1                   0                   0                   0
3                               0                   0                   1                   0
2                               0                   1                   0                   0

Table 5 - Discretization of a finite discrete quantitative attribute or a categorical attribute
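The mapping in Table 5 can be sketched in a few lines of Python. This is only an illustration: the function name `binarize_categorical` and the column prefix `ChestPain` are assumptions made here, not names used in the thesis.

```python
def binarize_categorical(values, domain, name):
    """Turn one categorical column into one 0/1 column per domain value."""
    return {
        f"{name}_{v}": [1 if x == v else 0 for x in values]
        for v in domain
    }


# The four records shown in Table 5.
chest_pain = [4, 1, 3, 2]
columns = binarize_categorical(chest_pain, domain=[1, 2, 3, 4], name="ChestPain")

for column_name, bits in columns.items():
    print(column_name, bits)
```

Each record sets exactly one of the k binary attributes to 1, which is why this encoding is practical only when k is small.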

If A is a continuous quantitative attribute, or a discrete quantitative or categorical attribute with value domain {v1, v2, ..., vp} where p is large, we map it to q binary attributes <A: start1..end1>, <A: start2..end2>, ..., <A: startq..endq>. The value of a record in field <A: starti..endi> is True (Yes or 1) if the record's original value of A lies in the interval [starti..endi], and False (No or 0) otherwise. The attributes Age, Serum cholesterol, and Maximum heart rate in the database of Table 4 are of this kind. For example, Cholesterol and Age can be split into binary attributes as in the two tables below:
Serum cholesterol   <Cholesterol: 150..249>   <Cholesterol: 250..349>   <Cholesterol: 350..449>   <Cholesterol: 450..549>
544                 0                         0                         0                         1
206                 1                         0                         0                         0
286                 0                         1                         0                         0
322                 0                         1                         0                         0

Table 6 - Discretization of the quantitative attribute "Serum cholesterol"

Age   <Age: 1..29>   <Age: 30..59>   <Age: 60..120>
74    0              0               1
29    1              0               0
30    0              1               0
59    0              1               0
60    0              0               1

Table 7 - Discretization of the quantitative attribute Age
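Interval-based discretization as in Tables 6 and 7 can be sketched as follows. The helper name `binarize_intervals` is an assumption for illustration; the intervals and ages are those of Table 7.

```python
def binarize_intervals(values, intervals, name):
    """Turn one numeric column into one 0/1 column per closed interval."""
    return {
        f"<{name}: {lo}..{hi}>": [1 if lo <= x <= hi else 0 for x in values]
        for lo, hi in intervals
    }


ages = [74, 29, 30, 59, 60]  # the five records of Table 7
age_columns = binarize_intervals(ages, [(1, 29), (30, 59), (60, 120)], "Age")

for column_name, bits in age_columns.items():
    print(column_name, bits)
```

Ages 29 and 30, and likewise 59 and 60, fall into different columns although they differ by a single year.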

This discretization method suffers from the sharp boundary problem [AG00] [KFW98]. Figure 4 below shows the support distribution of some attribute A whose values range from 1 to 10. If we discretize A into the two intervals [1..5] and [6..10] with a minimum support of 41%, the interval [6..10] fails the minimum support (40% < minsup = 41%) even though the neighborhood of its left boundary has support above minsup: for example, [4..7] has support 55% and [5..8] has support 45%. This partitioning therefore creates a sharp boundary between the values 5 and 6, and with this discretization the algorithms cannot mine any rule involving the values in the interval [6..10].

Figure 4 - Example of the "sharp boundary" problem in data discretization

To overcome the sharp boundary, [SA96] proposed a partitioning in which adjacent intervals overlap along their shared boundary. This resolves the problem above but introduces a new one: the supports of the intervals sum to more than 100%, and the values near the boundaries are weighted more heavily than the other values of the attribute, which is unnatural and somewhat contradictory.

Discretization by intervals also raises a semantic problem. The discretization of Age in Table 7, for example, places 29 and 30 in two different intervals although they differ by only one year. If we take [1..29] as young, [30..59] as middle-aged, and [60..120] as old, then a 59-year-old counts as middle-aged while a 60-year-old counts as old. This is unnatural and does not match human reasoning, since in reality age 60 is only slightly older than age 59.

To overcome these problems, [KFW98] [AG00] proposed a new kind of rule: the fuzzy association rule. It not only addresses the weaknesses of interval partitioning but also yields rules that are semantically more natural and closer to the user. With it, a rule such as <Age: 54..74> AND <Sex: Female> AND <Cholesterol: 200..300> => <Heart disease: Yes>, with support 23.53% and confidence 80%, is expressed as the fuzzy association rule <Age_old> AND <Sex: Female> AND <Cholesterol_high> => <Heart disease: Yes>. Here Age_old and Cholesterol_high are two fuzzified attributes associated with the attributes Age and Cholesterol.

3.2 Fuzzy association rules

3.2.1 Attribute discretization based on fuzzy sets

According to fuzzy set theory [LAZ65] [ZHJ91], an element belongs to a set with a membership value in the interval [0, 1]. This value is determined by the membership function of the fuzzy set in question. For example, let x be an attribute with domain D_x (also called the universe); the membership function that gives the degree to which each value x (∈ D_x) belongs to the fuzzy set f_x has the form:

$$m_{f_x}(x) : D_x \to [0, 1] \qquad (3.1)$$

We now apply the notion of fuzzy sets to data discretization in order to resolve the problems described above. For example, for the attribute Age with domain [0, 120], we assign three fuzzy sets, Age_young, Age_middle_aged, and Age_old, whose membership functions are plotted below:

Figure 5 - Membership functions of the fuzzy sets "Age_young", "Age_middle_aged", and "Age_old"

Using fuzzy sets for discretization overcomes the sharp boundary problem because fuzzy sets produce much smoother boundaries. For example, in the plot of Figure 5, ages 59 and 60 have membership values of 0.85 and 0.90 in the fuzzy set Age_old, and ages 30 and 29 have membership values of 0.70 and 0.75 in the fuzzy set Age_young. Another example is the pair of fuzzy sets Cholesterol_low and Cholesterol_high for the attribute Serum cholesterol.

Figure 6 - Membership functions of the two fuzzy sets "Cholesterol_low" and "Cholesterol_high"
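The contrast between a crisp interval and a fuzzy set can be illustrated with a minimal sketch. The piecewise-linear shapes and the breakpoints (30, 65, 20, 50) below are assumptions chosen only for illustration; they are not the curves of Figures 5 and 6 and do not reproduce the membership values quoted in the text.

```python
def ramp_up(x, a, b):
    """Piecewise-linear ramp: 0 at or below a, 1 at or above b (a < b)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def age_old(age):
    # Hypothetical membership function for the fuzzy set Age_old.
    return ramp_up(age, 30, 65)


def age_young(age):
    # Hypothetical membership function for the fuzzy set Age_young.
    return 1.0 - ramp_up(age, 20, 50)


def crisp_old(age):
    # Crisp counterpart: the interval [60..120] of Table 7.
    return 1.0 if age >= 60 else 0.0


for age in (59, 60):
    print(age, round(age_old(age), 3), crisp_old(age))
```

Between ages 59 and 60 the crisp indicator jumps from 0 to 1, while the fuzzy membership changes by less than 0.03.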

For a categorical attribute with value set {v1, v2, ..., vk} where k is not too large, we associate each value vi with a fuzzy set A_Vi (A being the attribute name) whose membership function is defined as follows: m_A_Vi(x) equals 1 if x = vi and 0 if x ≠ vi. A_Vi is in fact identical to a crisp set, since its membership value is only 0 or 1. When k is too large, we can partition the values into intervals and assign a fuzzy set to each interval, or consult an expert who knows the data being mined.

Discretization with fuzzy sets has the following advantages:

First, it resolves the sharp boundary problem, because the smoothness of the membership functions yields smoother partitions.

Second, interval-based discretization sometimes produces a very large number of intervals and therefore a very large number of binary attributes, whereas the number of fuzzy sets attached to each attribute is small. For example, interval partitioning of Serum cholesterol yields 5 subintervals of the original range [100, 600], while with fuzzy sets two suffice: Cholesterol_low and Cholesterol_high.

Third, fuzzy sets let us express association rules in a form that is more natural and closer to the user.

Fourth, the attribute value after discretization (after applying the membership function) varies over the interval [0, 1] and indicates the degree of membership, whereas a binary attribute takes only the values 0 and 1. This allows a more accurate estimate of the contribution of each record in the database to a given frequent itemset.

Fifth, as the next section shows in more detail, fuzzified attributes retain some properties of binary attributes, so the algorithms for mining binary association rules can be applied to fuzzy association rules with minor modifications. For example, the downward closure property (every non-empty subset of a frequent itemset is frequent, and every superset of an infrequent itemset is infrequent) [AS94] still holds provided we choose a suitable T-norm operator.

A further advantage of fuzzy discretization is that it works well for both kinds of databases: relational databases and transactional databases.

3.2.2 Fuzzy association rules
Age   Cholesterol (mg/ml)   Blood sugar (>120mg/ml)   Heart disease (yes, no)
60    206                   0 (<120mg/ml)             2 (yes)
54    239                   0                         2
54    286                   0                         2
52    255                   0                         2
68    274                   1 (>120mg/ml)             2
54    288                   1                         1 (no)
46    204                   0                         1
37    250                   0                         1
71    320                   0                         1
74    269                   0                         1
29    204                   0                         1
70    322                   0                         2
67    544                   0                         1

Table 8 - Database of examination and diagnosis of cardiovascular disease for 13 patients

Let I = {i1, i2, ..., in} be a set of n attributes, where iu is the u-th attribute in I. Let T = {t1, t2, ..., tm} be a set of m records, where tv is the v-th record in T, and let tv[iu] denote the value of attribute iu in record tv. For example, with the database in Table 8, t5[i2] = t5[Cholesterol] = 274 (mg/ml). Applying the fuzzification method described above, we associate each attribute iu with a set of fuzzy sets F_iu as follows:

$$F_{i_u} = \{ f_{i_u}^{1}, f_{i_u}^{2}, \ldots, f_{i_u}^{k} \} \qquad (3.2)$$

For example, with the database in Table 8, we have:

F_i1 = F_Age = {Age_young, Age_middle_aged, Age_old} (with k = 3)
F_i2 = F_Cholesterol = {Cholesterol_low, Cholesterol_high} (with k = 2)
A fuzzy association rule [AG00] [KFW98] has the form:

X is A ⇒ Y is B    (3.3)

where:

X, Y ⊆ I are itemsets, X = {x1, x2, ..., xp}, Y = {y1, y2, ..., yq}, with xi ≠ xj (if i ≠ j) and yi ≠ yj (if i ≠ j).

A = {fx1, fx2, ..., fxp} and B = {fy1, fy2, ..., fyq} are the sets of fuzzy sets corresponding to the attributes in X and Y, with fxi ∈ Fxi and fyj ∈ Fyj.

We can also write a fuzzy association rule in either of the following two forms:

X = {x1, ..., xp} is A = {fx1, ..., fxp} ⇒ Y = {y1, ..., yq} is B = {fy1, ..., fyq}    (3.4)

or:

(x1 is fx1) ⊗ ... ⊗ (xp is fxp) ⇒ (y1 is fy1) ⊗ ... ⊗ (yq is fyq)    (3.5)

(where ⊗ is a T-norm operator in fuzzy logic)

A fuzzy itemset in a fuzzy association rule is not just X ⊆ I but a pair <X, A>, where A is the set of fuzzy sets corresponding to the attributes in X.

The fuzzy support of the itemset <X, A>, denoted fs(<X, A>), is defined by the formula:

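$$fs(\langle X, A \rangle) = \frac{\sum_{v=1}^{m} \left\{ \alpha_{x_1}(t_v[x_1]) \otimes \alpha_{x_2}(t_v[x_2]) \otimes \cdots \otimes \alpha_{x_p}(t_v[x_p]) \right\}}{|T|} \qquad (3.6)$$

where:

X = {x1, ..., xp}, and tv is the v-th record in T.

⊗ is the T-norm operator of fuzzy logic theory. It plays the role that the logical AND plays in classical logic.

$\alpha_{x_u}(t_v[x_u])$ is defined by the formula:

$$\alpha_{x_u}(t_v[x_u]) = \begin{cases} m_{x_u}(t_v[x_u]) & \text{if } m_{x_u}(t_v[x_u]) \ge w_{x_u} \\ 0 & \text{otherwise} \end{cases} \qquad (3.7)$$

where $m_{x_u}$ is the membership function of the fuzzy set $f_{x_u}$ associated with attribute $x_u$, and $w_{x_u}$ is the (user-defined) threshold of the membership function $m_{x_u}$.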
|T| (the cardinality of T) is the number of records in T, which is m.

Frequent itemset: a fuzzy itemset <X, A> is frequent if its support is greater than or equal to the fuzzy minimum support fminsup specified by the user:

fs(<X, A>) ≥ fminsup    (3.9)

The support of a fuzzy rule is computed by the formula:

fs(<X is A => Y is B>) = fs(<X ∪ Y, A ∪ B>)    (3.10)

A rule is called frequent if its support is greater than or equal to fminsup, that is, fs(<X is A => Y is B>) ≥ fminsup.

The fuzzy confidence of a fuzzy association rule X is A => Y is B is denoted fc(X is A => Y is B) and is defined by the formula:

fc(X is A => Y is B) = fs(<X is A => Y is B>) / fs(<X, A>)    (3.11)

A rule is considered confident if its confidence is greater than or equal to the fuzzy minimum confidence fminconf specified by the user, that is, fc(X is A => Y is B) ≥ fminconf.

T-norm operator (⊗): there are many possible choices of T-norm [PDD99] [DMT03] [LAZ65] [ZHJ91] for formula (3.6), such as:

Minimum: a ⊗ b = min(a, b)
Algebraic product: a ⊗ b = ab
Bounded product: a ⊗ b = max(0, a + b - 1)
Drastic product: a ⊗ b = a (if b = 1), = b (if a = 1), = 0 (if a, b < 1)
Yager intersection: a ⊗ b = 1 - min[1, ((1-a)^w + (1-b)^w)^(1/w)] (with w > 0).

For w = 1 the Yager intersection becomes the bounded product, as w tends to +∞ it becomes the minimum, and as w tends to 0 it becomes the drastic product.

In my experiments, the two most suitable operators were the minimum and the algebraic product, because they are convenient to compute and reflect the close relationship among the attributes of a frequent itemset. With the minimum as the T-norm, formula (3.6) becomes formula (3.12); with the algebraic product, it becomes formula (3.13):

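$$fs(\langle X, A \rangle) = \frac{\sum_{v=1}^{m} \min\left\{ \alpha_{x_1}(t_v[x_1]), \alpha_{x_2}(t_v[x_2]), \ldots, \alpha_{x_p}(t_v[x_p]) \right\}}{|T|} \qquad (3.12)$$

$$fs(\langle X, A \rangle) = \frac{\sum_{v=1}^{m} \prod_{x_u \in X} \alpha_{x_u}(t_v[x_u])}{|T|} \qquad (3.13)$$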
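The T-norm operators listed for formula (3.6) and the resulting fuzzy support can be sketched in Python as follows. All function names are assumptions made for this illustration, and the two-record input at the end is a made-up toy example, not data from the thesis.

```python
def t_min(a, b):
    return min(a, b)


def t_product(a, b):
    return a * b


def t_bounded(a, b):
    return max(0.0, a + b - 1.0)


def t_drastic(a, b):
    if b == 1.0:
        return a
    if a == 1.0:
        return b
    return 0.0


def t_yager(a, b, w):
    """Yager intersection with parameter w > 0."""
    return 1.0 - min(1.0, ((1 - a) ** w + (1 - b) ** w) ** (1.0 / w))


def fuzzy_support(rows, tnorm):
    """Formula (3.6): rows hold memberships already thresholded by (3.7)."""
    total = 0.0
    for row in rows:
        combined = 1.0  # 1 is the neutral element of every T-norm
        for value in row:
            combined = tnorm(combined, value)
        total += combined
    return total / len(rows)


toy_rows = [(0.5, 0.8), (1.0, 0.6)]  # hypothetical thresholded memberships
print(fuzzy_support(toy_rows, t_min))      # formula (3.12)
print(fuzzy_support(toy_rows, t_product))  # formula (3.13)
```

For any pair of memberships the four fixed operators are ordered drastic ≤ bounded ≤ product ≤ min, so the minimum gives the largest support and the drastic product the smallest.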

Another reason for using the minimum and the algebraic product as the T-norm relates to the semantics of fuzzy association rules. In classical logic, the implication (→ or ⇒) connects two propositions P and Q into P → Q, a compound proposition meaning "if P then Q". This is a rather complex logical connective: it is meant to express a causal relationship, that is, it is appropriate only when P and Q are causally dependent. When it is formalized, however, P → Q is assigned a truth value that is a function of the truth values of P and Q, which inevitably makes its semantic interpretation somewhat forced [PDD99].

In fuzzy logic, implication gives compound propositions of the form "if u is P then v is Q", where P is a fuzzy set on the universe U and Q is a fuzzy set on the universe V. We can regard "if u is P then v is Q" as equivalent to (u, v) belonging to some fuzzy set on the universe U × V, which we denote P → Q. Defining the implication relation in fuzzy logic means defining the fuzzy set P → Q (that is, determining the membership function m_P→Q) from the fuzzy sets P and Q (that is, from the membership functions m_P of P and m_Q of Q).

The structure and semantics of implication in fuzzy logic have been studied by many authors. The following are several ways of determining m_P→Q from m_P and m_Q [PDD99]:

Following classical logic: ∀(u, v) ∈ U × V: m_P→Q(u, v) = ⊕(1 - m_P, m_Q), where ⊕ is an S-norm operator (also called a T-conorm). Taking the maximum for ⊕ gives m_P→Q(u, v) = max(1 - m_P, m_Q) (Dienes). Taking the probabilistic sum gives m_P→Q(u, v) = 1 - m_P + m_P.m_Q (Mizumoto). Taking the bounded sum gives m_P→Q(u, v) = min(1, 1 - m_P + m_Q) (Łukasiewicz), and so on.

Alternatively, we can take the fuzzy implication "if u is P then v is Q" to have a large truth value only when the membership values on both sides are large, that is, we can use a T-norm operator: m_P→Q(u, v) = ⊗(m_P, m_Q). Taking the minimum for ⊗ gives m_P→Q(u, v) = min(m_P, m_Q) (Mamdani). Taking the algebraic product gives m_P→Q(u, v) = m_P . m_Q (Mamdani) [DMT03].

A fuzzy association rule is also a kind of fuzzy implication, so it must conform to the semantics of this kind of rule. Under Mamdani's interpretation we can use a T-norm operator, specifically the minimum and the algebraic product. This is one of the reasons why I chose the minimum and the algebraic product for the T-norm operator in formula (3.6).

3.2.3 Algorithm for mining fuzzy association rules

The algorithm for mining fuzzy association rules consists of two phases:

Phase 1: Find all frequent fuzzy itemsets <X, A>, that is, those whose support is at least the minimum support specified by the user: fs(<X, A>) ≥ fminsup.

Phase 2: Generate the confident fuzzy association rules from the frequent itemsets found in the first phase. This phase is simple and takes less time than the first. If <X, A> is a frequent fuzzy itemset, the association rules generated from it have the form X' is A' ⇒ X \ X' is A \ A' with confidence fc, where X' is a non-empty proper subset of X, X \ X' is the difference of the two sets, A' is the subset of A consisting of the fuzzy sets that correspond to the attributes in X', A \ A' is the difference of the two sets, and fc is the confidence of the rule, which must satisfy fc ≥ fminconf (specified by the user).

Inputs of the algorithm: a database D with attribute set I and record set T, the minimum support fminsup, and the minimum confidence fminconf.

Outputs of the algorithm: the set of all confident fuzzy association rules.

Notation:

Symbol     Meaning
D          Database (relational or transactional)
I          Set of items (attributes) in D
T          Set of transactions (or records) in D
DF         Fuzzy database (computed from the original database through the membership functions of the fuzzy sets of each attribute)
IF         Set of items (attributes) in DF; each item or attribute is associated with a fuzzy set. Each fuzzy set f has a threshold wf as in formula (3.7)
TF         Set of transactions (or records) in DF; the attribute values in each transaction or record have been mapped to a value in [0, 1] by the membership functions of the fuzzy sets of each attribute
Ck         Set of candidate itemsets (attribute sets) of size k
Fk         Set of frequent itemsets (attribute sets) of size k
F          Set of all frequent itemsets (attribute sets)
fminsup    Minimum support
fminconf   Minimum confidence

Table 9 - Notation used in the algorithm for mining fuzzy association rules

Algorithm:

BEGIN
    (DF, IF, TF) = FuzzyMaterialization(D, I, T);
    F1 = Counting(DF, IF, TF, fminsup);
    F = F1;
    k = 2;
    while (Fk-1 ≠ ∅) {
        Ck = Join(Fk-1);
        Ck = Prune(Ck);
        Fk = Checking(Ck, DF, fminsup);
        F = F ∪ Fk;
        k = k + 1;
    }
    GenerateRules(F, fminconf);
END

Table 10 - Algorithm for mining fuzzy association rules

The algorithm in Table 10 uses the following subroutines:

Subroutine (DF, IF, TF) = FuzzyMaterialization(D, I, T): this function converts the original database D into the database DF, in which the attributes are associated with fuzzy sets and the attribute values of the records in T are mapped to values in the interval [0, 1] through the membership functions of the corresponding fuzzy sets. For example, for the database D in Table 8, this function yields:

IF = {[Age, Age_young] (1), [Age, Age_middle_aged] (2), [Age, Age_old] (3),
[Cholesterol, Cholesterol_low] (4), [Cholesterol, Cholesterol_high] (5),
[Blood_sugar, Blood_sugar_0] (6), [Blood_sugar, Blood_sugar_1] (7),
[Heart_disease, Heart_disease_no] (8), [Heart_disease, Heart_disease_yes] (9)}

IF thus contains 9 fuzzified attributes, compared with 4 attributes in the original database D. Each new attribute is a pair in square brackets consisting of the name of the original attribute and the name of the fuzzy set associated with it. For example, fuzzifying the original attribute Age gives three new attributes: [Age, Age_young] (1), [Age, Age_middle_aged] (2), and [Age, Age_old] (3). In addition, the subroutine FuzzyMaterialization maps the original attribute values to values in the interval [0, 1] through the membership functions of the fuzzy sets. For example, the following table is computed from the database D in Table 8:

A    1     2     3     C    4     5     S  6  7  H  8  9
60   0.00  0.41  0.92  206  0.60  0.40  0  1  0  2  0  1
54   0.20  0.75  0.83  239  0.56  0.44  0  1  0  2  0  1
54   0.20  0.75  0.83  286  0.52  0.48  0  1  0  2  0  1
52   0.29  0.82  0.78  255  0.54  0.46  0  1  0  2  0  1
68   0.00  0.32  1.00  274  0.53  0.47  1  0  1  2  0  1
54   0.20  0.75  0.83  288  0.51  0.49  1  0  1  1  1  0
46   0.44  0.97  0.67  204  0.62  0.38  0  1  0  1  1  0
37   0.59  0.93  0.31  250  0.54  0.46  0  1  0  1  1  0
71   0.00  0.28  1.00  320  0.43  0.57  0  1  0  1  1  0
74   0.00  0.25  1.00  269  0.53  0.47  0  1  0  1  1  0
29   0.71  0.82  0.25  204  0.62  0.38  0  1  0  1  1  0
70   0.00  0.28  1.00  322  0.43  0.57  0  1  0  2  0  1
67   0.00  0.32  1.00  544  0.00  1.00  0  1  0  1  1  0

Table 11 - TF: the fuzzified attribute values of the records

Note: the letters in the header row of the table above stand for A (Age), C (Cholesterol), S (Blood sugar), and H (Heart disease); the numbers 1 to 9 are the fuzzy attributes of IF.

Because the membership function of each fuzzy set f has a threshold wf, only the values that reach the threshold wf are counted; values below the threshold are treated as 0 (by formula 3.7). The threshold wf depends on each membership function and each attribute. The shaded cells of Table 11 mark the values that reach the threshold (all attributes in Table 11 use wf = 0.5, so these are the values of at least 0.5). The unshaded cells are treated as having the value 0.

Subroutine F1 = Counting(DF, IF, TF, fminsup): this function generates F1, the set of all frequent itemsets of cardinality 1. These frequent itemsets must have support greater than or equal to fminsup. For example, applying formula (3.6) with the algebraic product as the T-norm operator (⊗) and fminsup = 46% gives the following table:
Itemset                                   Support   Frequent? (fminsup = 46%)
{[Age, Age_young]} (1)                    10 %      No
{[Age, Age_middle_aged]} (2)              45 %      No
{[Age, Age_old]} (3)                      76 %      Yes
{[Cholesterol, Cholesterol_low]} (4)      43 %      No
{[Cholesterol, Cholesterol_high]} (5)     16 %      No
{[Blood_sugar, Blood_sugar_0]} (6)        85 %      Yes
{[Blood_sugar, Blood_sugar_1]} (7)        15 %      No
{[Heart_disease, Heart_disease_no]} (8)   54 %      Yes
{[Heart_disease, Heart_disease_yes]} (9)  46 %      Yes

Table 12 - C1: the set of all itemsets of cardinality 1
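The supports in Table 12 can be recomputed from the values of Table 11 with a short sketch. The names `TF`, `alpha`, and `support` are chosen here for illustration; the data, the threshold 0.5, and fminsup = 46% come from the text.

```python
W = 0.5        # threshold w_f used for every fuzzy set in Table 11
FMINSUP = 0.46

# Columns 1..9 of Table 11, one list of 13 record values per fuzzy attribute.
TF = {
    1: [0.00, 0.20, 0.20, 0.29, 0.00, 0.20, 0.44, 0.59, 0.00, 0.00, 0.71, 0.00, 0.00],
    2: [0.41, 0.75, 0.75, 0.82, 0.32, 0.75, 0.97, 0.93, 0.28, 0.25, 0.82, 0.28, 0.32],
    3: [0.92, 0.83, 0.83, 0.78, 1.00, 0.83, 0.67, 0.31, 1.00, 1.00, 0.25, 1.00, 1.00],
    4: [0.60, 0.56, 0.52, 0.54, 0.53, 0.51, 0.62, 0.54, 0.43, 0.53, 0.62, 0.43, 0.00],
    5: [0.40, 0.44, 0.48, 0.46, 0.47, 0.49, 0.38, 0.46, 0.57, 0.47, 0.38, 0.57, 1.00],
    6: [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    7: [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    8: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1],
    9: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
}


def alpha(m, w=W):
    """Formula (3.7): keep a membership value only if it reaches the threshold."""
    return m if m >= w else 0.0


def support(attr):
    """Formula (3.6) for a single fuzzy attribute."""
    column = TF[attr]
    return sum(alpha(m) for m in column) / len(column)


supports = {attr: support(attr) for attr in TF}
F1 = [attr for attr, s in supports.items() if s >= FMINSUP]

for attr, s in supports.items():
    print(attr, f"{100 * s:.0f} %", "frequent" if attr in F1 else "not frequent")
```

Attribute 9 has support 6/13 ≈ 46.15%, which only just reaches fminsup = 46%.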

Thus F1 = {{3}, {6}, {8}, {9}}.

Subroutine Ck = Join(Fk-1): this function generates the set of candidate fuzzy itemsets of cardinality k from Fk-1, the set of frequent fuzzy itemsets of cardinality k-1. The join used in the function Join can be expressed in SQL as follows:

INSERT INTO Ck
SELECT p.i1, p.i2, ..., p.ik-1, q.ik-1
FROM Fk-1 p, Fk-1 q
WHERE p.i1 = q.i1 AND ... AND p.ik-2 = q.ik-2 AND p.ik-1 < q.ik-1 AND p.ik-1.o ≠ q.ik-1.o;

Here p.ij and q.ij are the indices of the j-th fuzzy attribute in p and q, while p.ij.o and q.ij.o are the indices of the original attribute of the j-th fuzzy attribute in p and q. For example, C2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}. The itemset {8, 9} is invalid because (8) and (9) both have the same original attribute, Heart_disease.
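The join condition can be sketched in Python as follows. The dictionary `ORIGIN` and the function name `join` are assumptions for this illustration; itemsets are represented as sorted tuples of fuzzy attribute indices.

```python
# Original attribute of each fuzzy attribute index, following the set IF.
ORIGIN = {
    1: "Age", 2: "Age", 3: "Age",
    4: "Cholesterol", 5: "Cholesterol",
    6: "Blood_sugar", 7: "Blood_sugar",
    8: "Heart_disease", 9: "Heart_disease",
}


def join(prev_frequent):
    """Build size-k candidates from frequent itemsets of size k-1.

    Two itemsets are joined when they agree on their first k-2 items,
    the last item of p is smaller than the last item of q, and those two
    last items come from different original attributes.
    """
    candidates = []
    for p in prev_frequent:
        for q in prev_frequent:
            same_prefix = p[:-1] == q[:-1]
            if same_prefix and p[-1] < q[-1] and ORIGIN[p[-1]] != ORIGIN[q[-1]]:
                candidates.append(p + (q[-1],))
    return candidates


C2 = join([(3,), (6,), (8,), (9,)])
print(C2)
```

Checking only the two last items is enough here, because p and q share their prefix and each is already a valid itemset on its own.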

Subroutine Ck = Prune(Ck): this subroutine uses the downward closure property (every non-empty subset of a frequent itemset is frequent, and every superset of an infrequent itemset is infrequent) to prune from Ck every itemset that has a subset of cardinality k-1 not belonging to Fk-1, the set of frequent itemsets. After pruning, C2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}.

Subroutine Fk = Checking(Ck, DF, fminsup): this subroutine scans the database DF to update the support of the itemsets in Ck. After the scan, Checking keeps only the frequent itemsets (those with support greater than or equal to fminsup) and puts them into Fk. For example, with C2 above, Checking yields F2 = {{3, 6}, {6, 8}}.
Itemset   Support   Frequent?
{3, 6}    62 %      Yes
{3, 8}    35 %      No
{3, 9}    41 %      No
{6, 8}    46 %      Yes
{6, 9}    38 %      No

Table 13 - F2: the frequent itemsets of cardinality 2
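The Checking step for C2 can be reproduced from Table 11 with the algebraic product as the T-norm. This is a minimal sketch; only the four columns that occur in C2 are copied from Table 11, and the names `TF`, `alpha`, and `support` are illustrative.

```python
from math import prod

W = 0.5        # threshold w_f of formula (3.7)
FMINSUP = 0.46

# Columns 3, 6, 8 and 9 of Table 11.
TF = {
    3: [0.92, 0.83, 0.83, 0.78, 1.00, 0.83, 0.67, 0.31, 1.00, 1.00, 0.25, 1.00, 1.00],
    6: [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    8: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1],
    9: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
}


def alpha(m, w=W):
    """Formula (3.7): values below the threshold count as 0."""
    return m if m >= w else 0.0


def support(itemset):
    """Formula (3.13): per-record product of thresholded memberships."""
    n = len(TF[itemset[0]])
    return sum(prod(alpha(TF[a][v]) for a in itemset) for v in range(n)) / n


C2 = [(3, 6), (3, 8), (3, 9), (6, 8), (6, 9)]
F2 = [c for c in C2 if support(c) >= FMINSUP]

for c in C2:
    print(c, f"{100 * support(c):.0f} %")
print("F2 =", F2)
```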

Subroutine GenerateRules(F, fminconf): generates the confident fuzzy association rules from the set of frequent itemsets F. In the example above, after the first phase the set of frequent itemsets is F = F1 ∪ F2 = {{3}, {6}, {8}, {9}, {3, 6}, {6, 8}} (there is no F3 because C3 is empty). The following table lists the fuzzy rules generated from F:

No.   Rule                                           Support   Confidence
1     Old person                                     76 %
2     Blood sugar ≤ 120 mg/ml                        85 %
3     No heart disease                               54 %
4     Heart disease                                  46 %
5     Old person => Blood sugar ≤ 120 mg/ml          62 %      82 %
6     Blood sugar ≤ 120 mg/ml => Old person          62 %      73 %
7     Blood sugar ≤ 120 mg/ml => No heart disease    46 %      54 %
8     No heart disease => Blood sugar ≤ 120 mg/ml    46 %      85 %

Table 14 - The fuzzy rules generated from the database in Table 8

With a minimum confidence of 70%, rule 7 in the table above is eliminated.
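The confidences in Table 14 follow from formula (3.11). The sketch below uses the rounded supports printed in Tables 12 and 13 as input, which is an assumption about how the table was computed; the function name `generate_rules` is likewise illustrative.

```python
from itertools import combinations

# Rounded supports of the frequent itemsets, taken from Tables 12 and 13.
SUPPORT = {
    (3,): 0.76, (6,): 0.85, (8,): 0.54, (9,): 0.46,
    (3, 6): 0.62, (6, 8): 0.46,
}


def generate_rules(support, fminconf):
    """Return (antecedent, consequent, support, confidence) for confident rules."""
    rules = []
    for itemset, fs in support.items():
        # Every non-empty proper subset of the itemset is a possible antecedent.
        for size in range(1, len(itemset)):
            for lhs in combinations(itemset, size):
                rhs = tuple(a for a in itemset if a not in lhs)
                fc = fs / support[lhs]  # formula (3.11)
                if fc >= fminconf:
                    rules.append((lhs, rhs, fs, fc))
    return rules


rules = generate_rules(SUPPORT, fminconf=0.70)
for lhs, rhs, fs, fc in rules:
    print(lhs, "=>", rhs, f"support {100 * fs:.0f} %", f"confidence {100 * fc:.0f} %")
```

The rule {6} => {8} (rule 7 in Table 14) has confidence 0.46 / 0.85 ≈ 54% and is therefore dropped at fminconf = 70%.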

3.2.4 Converting fuzzy association rules to quantitative association rules

By formula (3.7), the membership function of every fuzzy set f has a threshold wf, and values below wf are treated as 0. Using the threshold wf, we can defuzzify a fuzzy association rule into a form close to a quantitative association rule. For example, the rule "Old person => Blood sugar ≤ 120 mg/ml, support 62%, confidence 82%" in Table 14 can be rewritten as "Age ≥ 46 => Blood sugar ≤ 120 mg/ml, support 62%, confidence 82%". The smallest value of the attribute [Age, Age_old] that still exceeds the threshold w_Age_old (= 0.5) is 0.67, and the age corresponding to the fuzzy value 0.67 is 46. For this attribute, anyone aged 46 or older has a membership value of at least 0.67. The rule "Age ≥ 46 => Blood sugar ≤ 120 mg/ml, support 62%, confidence 82%" is a genuine quantitative association rule. Because the derivative of most membership functions rarely changes sign (they are usually monotonic, or their derivative changes sign only a few times), defuzzification is relatively simple.

3.2.5 Experiments and conclusions

Experiments on data size (increasing number of records) versus rule mining time
Experiments on the results when the support and confidence vary
Experiments on the number of rules found when the membership function thresholds of the fuzzy sets vary
Experiments with different T-norm operators (minimum and algebraic product)
Experiments on converting fuzzy association rules to association rules with weighted attributes

Chapter IV. Parallel mining of fuzzy association rules

One of the most important steps in mining association rules is finding all frequent itemsets in the database. This step is relatively complex and costly in both CPU time (CPU-bound) and I/O time (I/O-bound), so researchers have put considerable effort into improving existing algorithms or devising new ones to speed up the search [AS94] [MTV94] [BCJ01] [PHM01] [ZH99] [PBTL99]. These are all sequential algorithms, and they work reasonably well on databases that are not too large (whether a database counts as large or small depends on its number of attributes and number of records). Their efficiency drops considerably, however, on large databases (hundreds of megabytes or more), because of the limited main memory and computing speed of a single machine. With the explosive development of hardware technology, parallel computers with far greater computing power have appeared, opening up a new approach in data mining: parallel data mining. Since 1995, researchers have continually proposed parallel and distributed algorithms for association rule discovery [AM95] [PCY95] [AS96] [HKK97] [ZHL98] [ZPO01] [DP01]. These parallel algorithms are quite diverse, partly because their design depends on the architecture of the particular parallel machine.

In the first part of this chapter I briefly present some parallel algorithms that have been proposed and tested. In the next part I propose a parallel algorithm for mining fuzzy association rules that runs on a PC cluster with the message passing mechanism of MPI (Message Passing Interface) [MPIS95] [EMPI97] [JDMPI97]. It is a near-ideal algorithm in that it minimizes synchronization and data exchange during parallel execution. Its limitation is that it works only with fuzzy association rules and quantitative association rules, so it suits relational databases better than transactional ones.

4.1 Some parallel algorithms for mining association rules

This section presents some parallel algorithms that have been proposed and tested. They are designed for parallel machines with a shared-nothing architecture, which has the following characteristics:

The system has N processors; each processor Pi has its own main memory (RAM) and secondary storage (usually a disk), independent of the other processors in the system.

The N processors can communicate with one another over a high-speed network using message passing.

4.1.1 Count distribution algorithm

The count distribution parallel algorithm is based on the Apriori algorithm [AS94]. In this algorithm, N is the number of processors, Pi is the i-th processor, and Di is the part of the data assigned to processor Pi (the original database D is split into N parts, one per processor). The algorithm consists of the following steps:

(1) Step 1, with k = 1: all N processors obtain Lk, the set of all frequent itemsets of cardinality 1.

(2) Step 2, for each k > 1, the algorithm repeats the following steps:

o (2.1) Each processor Pi generates the set of candidate itemsets Ck by joining the frequent itemsets in Lk-1. Because all processors hold identical information about Lk-1, they generate identical sets Ck.

o (2.2) Each processor Pi scans its own database Di to update the local supports of the candidate itemsets in Ck. This is the part in which the processors run in parallel.

o (2.3) After updating the local supports of the candidate itemsets in Ck, the processors communicate with one another to obtain the global supports. At this step the processors must synchronize.

o (2.4) The processors use the minimum support minsup to select the set of frequent itemsets Lk from the set of candidates Ck.

o (2.5) Each processor may terminate at this step or continue by repeating step 2.1.

The following figure illustrates how this algorithm works.

Figure 7 - Count distribution algorithm on a system of 3 processors
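One pass of step 2 can be simulated in a single Python process. This is only a sketch under stated assumptions: the processors are simulated by list partitions, the exchange of step 2.3 is replaced by summing the local counters (on a real cluster this would be a collective reduction over the network), and the six transactions are made-up toy data.

```python
from collections import Counter


def local_counts(partition, candidates):
    """Step 2.2: count candidate occurrences in one processor's own data."""
    counts = Counter()
    for transaction in partition:
        for c in candidates:
            if set(c) <= transaction:
                counts[c] += 1
    return counts


def count_distribution_pass(partitions, candidates, minsup_count):
    # Every simulated processor scans only its own partition.
    per_processor = [local_counts(p, candidates) for p in partitions]
    # Step 2.3: synchronization point; the sum stands in for the exchange.
    global_counts = sum(per_processor, Counter())
    # Step 2.4: every processor derives the same Lk from the global counts.
    frequent = {c for c in candidates if global_counts[c] >= minsup_count}
    return frequent, global_counts


# Hypothetical toy database split over 3 simulated processors.
D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B"}, {"B"}]
partitions = [D[0:2], D[2:4], D[4:6]]
candidates = [("A", "B"), ("A", "C"), ("B", "C")]

L2, global_counts = count_distribution_pass(partitions, candidates, minsup_count=3)
print(L2, dict(global_counts))
```

Only counts cross processor boundaries in this scheme; the transactions themselves never leave their partition.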

4.1.2 Data distribution algorithm

The main advantage of the count distribution algorithm is that no data needs to be transferred between processors during the computation, so the processors can work independently and asynchronously while scanning the data in local memory or on local disk. Its drawback is that it does not exploit the aggregate memory of the N processors of the system. Suppose each processor has local memory of size |M|; then the number of candidate itemsets whose support can be counted in one pass is bounded by a constant m that depends on |M|. As the number of processors grows from 1 to N, the system has an aggregate memory of size N x |M|, yet with the count distribution algorithm we can still count only m candidate itemsets per pass, because by design all processors hold the same set Ck.

The data distribution algorithm is designed to exploit the aggregate memory of the system as the number of processors grows. In this algorithm, each processor counts the support of its own subset of the candidate itemsets. As the number of processors grows, the algorithm can therefore count the support of many more candidate itemsets in one pass. Its drawback is that every processor must send and receive data in every pass, so it is viable only when the system has a fast and stable communication environment between its nodes.

The data distribution parallel algorithm is also based on the Apriori algorithm [AS94]. In this algorithm, N is the number of processors, Pi is the i-th processor, and Di is the part of the data assigned to processor Pi (the original database D is split into N parts, one per processor). The algorithm consists of the following steps:

(1) Step 1: the same as in the count distribution algorithm.

(2) Step 2: for k > 1:

o (2.1) Each processor Pi generates the set of candidate itemsets Ck from the set of frequent itemsets Lk-1. It does not work on all of Ck but keeps only its own share, Ck being divided evenly among the N processors. The share kept by processor Pi is determined from its process identifier, without any communication between the processes. The shares Cki satisfy $C_k^i \cap C_k^j = \emptyset$ (for all i ≠ j) and $\bigcup_{i=1}^{N} C_k^i = C_k$.

o (2.2) Processor Pi counts the support only of the candidate itemsets in Cki, using its local data Di and the data received from the other processors in the system.

o (2.3) After counting the supports, each processor Pi selects the set of locally frequent itemsets Lki from its Cki. Note that $L_k^i \cap L_k^j = \emptyset$ (for all i ≠ j) and $\bigcup_{i=1}^{N} L_k^i = L_k$.

o (2.4) The processors exchange their sets Lki so that every processor obtains Lk and can generate Ck+1 for the next iteration. This step requires synchronization between the processors. After obtaining Lk, each processor can independently decide to stop or to continue with the next iteration.

The following figure illustrates how this algorithm works.

Figure 8 - Data distribution algorithm on 3 processors
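One pass of the data distribution scheme can be simulated in the same way. Again this is a sketch under assumptions: the round-robin split by rank is one possible way to realize "determined from the process identifier", the data exchange is simulated by concatenating the partitions, and the transactions are made-up toy data.

```python
def partition_candidates(candidates, n):
    """Step 2.1: disjoint shares of Ck, chosen from the rank alone."""
    return [candidates[rank::n] for rank in range(n)]


def data_distribution_pass(partitions, candidates, minsup_count):
    n = len(partitions)
    shares = partition_candidates(candidates, n)
    # Step 2.2: each processor sees its local data plus the data sent by peers,
    # which the simulation models as the concatenation of all partitions.
    all_data = [t for p in partitions for t in p]
    local_frequent = []
    for rank in range(n):
        # Step 2.3: the processor keeps the frequent itemsets of its own share.
        lk_i = [c for c in shares[rank]
                if sum(1 for t in all_data if set(c) <= t) >= minsup_count]
        local_frequent.append(lk_i)
    # Step 2.4: exchanging the local results gives every processor the full Lk.
    lk = [c for lk_i in local_frequent for c in lk_i]
    return shares, lk


# Hypothetical toy database split over 3 simulated processors.
D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B"}, {"B"}]
partitions = [D[0:2], D[2:4], D[4:6]]
candidates = [("A", "B"), ("A", "C"), ("B", "C")]

shares, L2 = data_distribution_pass(partitions, candidates, minsup_count=3)
print(shares, L2)
```

Each processor holds only about |Ck| / N candidates, at the price of every transaction being seen by every processor.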

4.1.3 Candidate distribution algorithm

The limitation of the two algorithms above (count and data distribution) is that any transaction or record in the database may support any candidate itemset, so every transaction or record must be matched against all candidate itemsets. As a result, the count distribution algorithm must keep identical candidate sets on every processor, and the data distribution algorithm must ship data between processors while the supports are counted. Moreover, both algorithms must synchronize at the end of every parallel pass in order to exchange local supports or sets of frequent itemsets. Requiring synchronization throughout the execution reduces the performance of the system, because processors that finish early must wait for those that finish late. The cause is that the two algorithms divide the work among the processors fairly, but not both fairly and cleverly.

The candidate distribution algorithm tries to divide the candidate set so that the processors can work independently and synchronization is minimized. Starting at some pass l (l is chosen from experience), the algorithm divides the set of frequent itemsets Ll-1 among the processors so that each processor Pi can generate its candidate set Cmi (m ≥ l) independently of the other processors (Cmi ∩ Cmj = ∅, i ≠ j). At the same time, the data is repartitioned so that each processor Pi can count the support of the candidate itemsets in Cmi independently of the other processors. Note that the data partitioning depends heavily on the preceding partitioning of the candidate set. If the candidate set is not partitioned carefully, we cannot obtain a true partition of the data among the processors but only a relative division, meaning that parts of the data may be replicated on several processors.

After Lk-1 has been partitioned, the processors work independently of one another. Counting the support of the local candidate sets does not require the processors to communicate. The only dependency between processors is that they must send each other the information needed to prune unnecessary candidates. This information can, however, be sent asynchronously, and a processor does not have to wait until it has received all of it from the other processors. Each processor prunes as much as it can using the information that has arrived from the other processors; information that arrives late is used in the next round of pruning.

The candidate distribution algorithm consists of the following steps:

(1) Step 1 (k < l): use either the count distribution or the data distribution algorithm.

(2) Step 2 (k = l):

o (2.1) Partition Lk-1 among the N processors. The partitioning method is discussed below. The partitioning is identical on all processors and is carried out on them in parallel.

o (2.2) Each processor Pi uses Lik-1 to generate its own Cki.

o (2.3) Pi counts the support of the candidate itemsets in Cki, and the database is repartitioned immediately afterwards.

o (2.4) Pi then works on its local data and on all the data received from the other processors. It posts N-1 asynchronous receive buffers to receive the sets Lkj from the other processors. These Lkj are needed for pruning the candidate itemsets in Cik+1.

o (2.5) Pi generates Lki from Cki and broadcasts it asynchronously to the other N-1 processors.

(3) Step 3 (k > l):

o (3.1) Each processor Pi collects all the frequent itemsets sent by the other processors. The information about these frequent itemsets is used for pruning. The itemsets received from processor j may have length k-1, less than k-1 (if j is a slow processor), or greater than k-1 (if j is a fast processor).

o (3.2) Pi generates Cki from Lik-1. It may happen that Pi has not received Ljk-1 from the other processors, so Pi must be careful when pruning.

o (3.3) Pi scans the data to count the support of the itemsets in Cki. It then computes Lki from Cki and sends the information about Lki asynchronously to the other N-1 processors in the system.

Partitioning strategy: we examine how this algorithm partitions the data through the following simple example. Let L3 = {ABC, ABD, ABE, ACD, ACE, BCD, BCE, BDE, CDE}. Then L4 = {ABCD, ABCE, ABDE, ACDE, BCDE}, L5 = {ABCDE}, and L6 = ∅. Consider the subset {ABC, ABD, ABE}, whose members all share the prefix AB. Note that the itemsets ABCD, ABCE, ABDE, and ABCDE also share the prefix AB. Therefore, assuming that the attributes within an itemset are kept in lexicographic order, we can partition the frequent itemsets in Lk according to their first k-1 attributes, so that the processors can work independently of one another.

Implementing this algorithm in practice is considerably more complex, for two reasons. The first is that a processor may need frequent itemsets computed by other processors for the next pruning step. In the example above, the processor assigned the subset {ABC, ABD, ABE} must know whether BCDE is frequent before it can decide whether to prune ABCDE, but the prefix of BCDE is BC, so BCDE belongs to a different processor. The second is that the load must be balanced among the processors in the system.
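The prefix-based partitioning of the example can be sketched as follows. The function names are assumptions for illustration; itemsets are written as strings of single-letter attributes in lexicographic order, as in the text.

```python
from collections import defaultdict


def partition_by_prefix(itemsets, prefix_len):
    """Group itemsets by their first prefix_len attributes."""
    groups = defaultdict(list)
    for itemset in itemsets:
        groups[itemset[:prefix_len]].append(itemset)
    return dict(groups)


def candidates_from_group(group):
    """Apriori join restricted to one group: members already share the prefix."""
    return [p + q[-1] for p in group for q in group if p[-1] < q[-1]]


L3 = ["ABC", "ABD", "ABE", "ACD", "ACE", "BCD", "BCE", "BDE", "CDE"]
groups = partition_by_prefix(L3, prefix_len=2)

for prefix, members in groups.items():
    print(prefix, members, "->", candidates_from_group(members))
```

The group with prefix AB can generate ABCD, ABCE, and ABDE without looking at any other group, which is what lets a processor that owns this group work on its own. Pruning is not shown here; as the text notes, it may need itemsets such as BCDE that belong to another group.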

4.1.3 Parallel rule generation algorithm

Given a frequent itemset l, the rule generation subroutine produces rules of the form a => (l - a), where a is a non-empty subset of l. The support of the rule is the support of the frequent itemset l (that is, s(l)), and the confidence of the rule is the ratio s(l)/s(a). To generate rules efficiently, we examine the larger subsets of l first and go on to smaller subsets only when the rule just generated satisfies the minimum confidence (minconf). For example, if l is the frequent itemset ABCD and the rule ABC => D does not satisfy the minimum confidence, then the rule AB => CD does not satisfy it either, because the support of AB is always greater than or equal to that of ABC. We therefore do not need to consider rules whose left-hand side is a subset of ABC, since they cannot satisfy the minimum confidence. The sequential rule generation algorithm [AS94] implements this idea as follows:

// Simple algorithm
Forall frequent itemset lk, k > 1 do
    Call gen_rules(lk, lk);

// The gen_rules generates all valid rules a => (l - a), for all a ⊂ am
Procedure gen_rules(lk : frequent k-itemset, am : frequent m-itemset)
Begin
    A = {(m-1)-itemsets am-1 | am-1 ⊂ am}
    Forall am-1 ∈ A do begin
        conf = s(lk)/s(am-1);
        if (conf ≥ minconf) then begin
            output the rule am-1 => (lk - am-1);
            if (m - 1 > 1) then
                Call gen_rules(lk, am-1);
        end
    end
End

Table 15 - Sequential association rule generation algorithm
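The procedure of Table 15 translates almost line by line into Python. In this sketch the support values are made up for illustration (they are only required to be anti-monotone), itemsets are sorted tuples, and the rules are collected in a set because the simple algorithm can reach the same antecedent along different recursion paths.

```python
from itertools import combinations


def gen_rules(lk, am, support, minconf, out):
    """Collect rules a => (lk - a) for subsets a of am, largest subsets first."""
    for a in combinations(am, len(am) - 1):
        conf = support[lk] / support[a]
        if conf >= minconf:
            consequent = tuple(x for x in lk if x not in a)
            out.add((a, consequent, conf))
            # Smaller antecedents are tried only below a confident rule.
            if len(a) > 1:
                gen_rules(lk, a, support, minconf, out)


# Hypothetical supports of one frequent 3-itemset and all of its subsets.
support = {
    ("A",): 0.80, ("B",): 0.50, ("C",): 0.60,
    ("A", "B"): 0.40, ("A", "C"): 0.50, ("B", "C"): 0.35,
    ("A", "B", "C"): 0.30,
}

rules = set()
l3 = ("A", "B", "C")
gen_rules(l3, l3, support, minconf=0.7, out=rules)

for antecedent, consequent, conf in sorted(rules):
    print(antecedent, "=>", consequent, f"confidence {conf:.2f}")
```

With these numbers AC => B fails the threshold, so its sub-antecedents A and C are never tried below it; this pruning is what the text describes for ABC => D and AB => CD.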

To generate rules in parallel, we divide the set of frequent itemsets among all the processors in the system. Each processor generates rules from the frequent itemsets assigned to it, using the algorithm above. In the parallel rule generation algorithm, computing the confidence of a rule may require a processor to look up the support of a frequent itemset held by another processor. For this reason, every processor should have the information about all frequent itemsets before the parallel rule generation algorithm starts.

4.1.4 Some other algorithms

Besides the three algorithms above, researchers in this field have proposed quite a few other parallel algorithms for mining association rules.

The Intelligent Data Distribution algorithm [HKK97] is based on the data distribution algorithm, with an improvement in how data is transferred between processors during the computation. Instead of transferring data between pairs of processors, the processors in this algorithm are organized into a logical ring and pass the data around this ring.

The MLFPT (Multiple Local Frequent Pattern Tree) algorithm [ZHL98] is based on FP-growth. It reduces the number of database scans, does not need to generate candidate sets, and balances the load among the processors in the system.

The parallel association rule mining algorithm proposed in [ZPO01] differs from the others in that it works on a symmetric multiprocessing system (SMP, also called a shared-everything system) rather than on a distributed shared-nothing parallel system.

4.2 A parallel algorithm for fuzzy association rules


The parallel algorithms proposed so far usually have to synchronize the processors, because they must either exchange information about candidate itemsets (the count distribution and candidate distribution algorithms) or exchange data with one another (the data distribution algorithm). Because they communicate and synchronize throughout the computation, these algorithms cannot be regarded as ideally parallel. Building on the approach to fuzzy association rules presented above, I propose a nearly ideal parallel algorithm for mining this kind of rule. "Ideal" here means that the processors hardly need to communicate during the computation; they communicate only once, when the algorithm finishes, to gather the rules mined by all processors in the system.

4.2.1 Approach

In the sequential fuzzy association rule mining problem described above, each attribute $i_u$ in $I$ is associated with a set of fuzzy sets $F_{i_u}$ as follows:

$$F_{i_u} = \{f_{i_u}^{1}, f_{i_u}^{2}, \ldots, f_{i_u}^{k}\}$$


For example, for the database in Table 8 we have:

F_{i_1} = F_{Age} = {Age_young, Age_middle_aged, Age_old} (with k = 3)
F_{i_2} = F_{Cholesterol} = {Cholesterol_low, Cholesterol_high} (with k = 2)
A fuzzy association rule has the form: X is A => Y is B, where:
$X, Y \subseteq I$ are sets of attributes, $X = \{x_1, x_2, \ldots, x_p\}$, $Y = \{y_1, y_2, \ldots, y_q\}$, with $x_i \neq x_j$ and $y_i \neq y_j$ whenever $i \neq j$.
$A = \{f_{x_1}, f_{x_2}, \ldots, f_{x_p}\}$ and $B = \{f_{y_1}, f_{y_2}, \ldots, f_{y_q}\}$ are the sets of fuzzy sets corresponding to the attributes in X and Y, with $f_{x_i} \in F_{x_i}$ and $f_{y_j} \in F_{y_j}$.

A fuzzy attribute is not just an attribute name but a pair [<attribute name>, <name of the corresponding fuzzy set>]. With I = {Age, Cholesterol, Blood_sugar, Heart_disease} as in Table 8, the set of fuzzy attributes is:

IF = {[Age, Age_young] (1), [Age, Age_middle_aged] (2), [Age, Age_old] (3),
[Cholesterol, Cholesterol_low] (4), [Cholesterol, Cholesterol_high] (5),
[Blood_sugar, Blood_sugar_0] (6), [Blood_sugar, Blood_sugar_1] (7),
[Heart_disease, Heart_disease_no] (8), [Heart_disease, Heart_disease_yes] (9)}
Table 16 - The set of fuzzy attributes after fuzzifying the database in Table 8
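As a small illustration of how the numbered fuzzy attribute set in Table 16 arises, the sketch below enumerates the pairs [attribute, fuzzy set] from a mapping of attributes to their fuzzy sets. The function name, the dictionary layout, and the English attribute names are assumptions made for the example.

```python
def build_fuzzy_attributes(fuzzy_sets):
    """Flatten {attribute: [fuzzy set names]} into a list of (attribute, fuzzy set) pairs."""
    return [(attr, fs) for attr, sets in fuzzy_sets.items() for fs in sets]

# Fuzzy sets per attribute, following the example of Table 16.
fuzzy_sets = {
    "Age": ["Age_young", "Age_middle_aged", "Age_old"],
    "Cholesterol": ["Cholesterol_low", "Cholesterol_high"],
    "Blood_sugar": ["Blood_sugar_0", "Blood_sugar_1"],
    "Heart_disease": ["Heart_disease_no", "Heart_disease_yes"],
}

I_F = build_fuzzy_attributes(fuzzy_sets)
for number, (attr, fs) in enumerate(I_F, start=1):
    print(f"({number}) [{attr}, {fs}]")
```

The four source attributes expand into nine fuzzy attributes, numbered in the same order as in Table 16.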

Thus, after fuzzification, IF contains 9 fuzzy attributes, compared with the 4 original attributes. The values of the records for the attributes of the original database are likewise mapped into the interval [0, 1] by the corresponding membership functions. The task is to find all fuzzy association rules over the fuzzy attribute set IF and the record set TF.

As we know, the fuzzy attributes of a fuzzy association rule (left-hand and right-hand sides together) never include two fuzzy attributes derived from the same source attribute (a non-fuzzy attribute in I). For example, the rules Age_old AND Cholesterol_high AND Age_young => Heart_disease_yes and Blood_sugar > 120 AND Heart_disease_no => Heart_disease_yes are invalid: in the first rule, Age_old and Age_young both originate from Age, and in the second, Heart_disease_no and Heart_disease_yes both originate from Heart_disease. There are two reasons for this restriction. First, fuzzy attributes with the same origin usually have mutually exclusive membership values, so if they appear together in an itemset, the support of that itemset is usually small, and can be very small when they truly exclude each other. For example, if a person's membership value in the fuzzy set Age_old is high, then that person's membership value in Age_young is low, since nobody is both old and young. Second, such fuzzy association rules are usually unnatural and carry little meaning.

Consequently, association rules that involve different fuzzy attributes of the same source attribute are completely independent of one another, so we can search for them with a nearly ideal parallel algorithm. Suppose the system has 6 processors. We divide IF into six parts for these 6 processors as follows:

For processor P1: IF1 = {[Age, Age_young] (1),
[Cholesterol, Cholesterol_low] (4), [Blood_sugar, Blood_sugar_0] (6), [Blood_sugar, Blood_sugar_1] (7), [Heart_disease, Heart_disease_no] (8), [Heart_disease, Heart_disease_yes] (9)} = {1, 4, 6, 7, 8, 9}
For processor P2: IF2 = {1, 5, 6, 7, 8, 9}
For processor P3: IF3 = {2, 4, 6, 7, 8, 9}
For processor P4: IF4 = {2, 5, 6, 7, 8, 9}
For processor P5: IF5 = {3, 4, 6, 7, 8, 9}
For processor P6: IF6 = {3, 5, 6, 7, 8, 9}

Thus the 9 fuzzy attributes are divided evenly among the 6 processors, each processor receiving 6 fuzzy attributes. The two attributes used for the split are Age and Cholesterol. This division is optimal because the product of the number of fuzzy sets of Age (3) and the number of fuzzy sets of Cholesterol (2) equals exactly the number of processors in the system (6).

In the optimal case, the fuzzy attribute set is divided evenly among all processors in the system. In other cases we have to settle for an acceptable division, in which some processors stay idle. Below I propose an algorithm for dividing the fuzzy attribute set among the processors. It is based on backtracking and stops as soon as the first solution is found. If no exact solution exists, it returns an acceptable one.

Let D be a database with I = {i_1, i_2, ..., i_n} the set of n attributes and T = {t_1, t_2, ..., t_m} the set of m records. After fuzzy sets are assigned to the attributes (fuzzification), we obtain the database DF, where TF is the set of records whose field values lie in [0, 1] (computed through the membership functions of the fuzzy sets) and the fuzzy attribute set is $I_F = \{[i_1, f_{i_1}^{1}], \ldots, [i_1, f_{i_1}^{k_1}], [i_2, f_{i_2}^{1}], \ldots, [i_2, f_{i_2}^{k_2}], \ldots, [i_n, f_{i_n}^{1}], \ldots, [i_n, f_{i_n}^{k_n}]\}$. Here $f_{i_j}^{u}$ is the u-th fuzzy set of attribute $i_j$, and $k_j$ is the number of fuzzy sets of attribute $i_j$. For example, for the database D in Table 8 we have I = {Age, Cholesterol, Blood_sugar, Heart_disease}, and after fuzzification DF has IF as in Table 16. Then k_1 = 3, k_2 = 2, k_3 = 2, k_4 = 2 are the numbers of fuzzy sets of the 4 attributes in I.

Let $F_N = \{k_1\} \cup \{k_2\} \cup \ldots \cup \{k_n\} = \{s_1, s_2, \ldots, s_v\}$ (with $v \le n$, because some $k_i$ and $k_j$ may be equal), and let N be the number of processors in the system. The partitioning problem is then: find a non-empty subset Fn of FN such that the product of the elements of Fn equals N. If no exact solution exists, the algorithm returns an acceptable solution, i.e. a subset whose product is the largest achievable value below N. This problem can be solved by backtracking. In the example above, FN = {3} ∪ {2} ∪ {2} ∪ {2} = {3, 2}.

Algorithm:

BOOLEAN Subset(FN, N, Idx)
    k = 1;
    Idx[1] = 0;
    S = 1;                              // running product of the chosen elements
    while (k > 0) {
        Idx[k]++;
        if (Idx[k] <= sizeof(FN)) {
            if (S * FN[Idx[k]] <= N) {
                if (S * FN[Idx[k]] == N) {
                    Idx[k + 1] = 0;     // marks the end of the chosen indices Idx[1..k]
                    return TRUE;
                } else {
                    S *= FN[Idx[k]];    // take this element and go one level deeper
                    Idx[k + 1] = Idx[k];
                    k++;
                }
            }
        } else {
            k--;                        // backtrack
            if (k > 0)
                S /= FN[Idx[k]];        // undo the element chosen at level k
        }
    }
    return FALSE;

FindSubset(FN, N, Idx, Fn)
    for (n = N; n > 0; n--)
        if (Subset(FN, n, Idx)) {
            Fn = {FN[i] | i ∈ Idx};
            return;
        }
Table 17 - Algorithm supporting the division of the fuzzy attribute set among the processors
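The search in Table 17 can also be written as a short recursive backtracking routine. The following Python sketch is one possible rendering under stated assumptions: the names `subset_with_product` and `find_subset` are illustrative, FN is passed as a list of distinct values, and an empty list is returned when no subset fits at all.

```python
def subset_with_product(values, target):
    """Return elements of `values` (each position used at most once) whose
    product equals `target`, or None if no such subset exists."""
    def search(start, product, chosen):
        for i in range(start, len(values)):
            p = product * values[i]
            if p > target:
                continue                      # this element overshoots, try the next one
            if p == target:
                return chosen + [values[i]]   # first solution found, stop
            found = search(i + 1, p, chosen + [values[i]])
            if found is not None:
                return found
        return None                           # backtrack
    return search(0, 1, [])

def find_subset(values, n_processors):
    """Try N, N-1, ... until some subset product matches (acceptable solution)."""
    for n in range(n_processors, 0, -1):
        found = subset_with_product(values, n)
        if found is not None:
            return found
    return []

exact = find_subset([3, 2], 6)       # product 6 matches the 6 processors
acceptable = find_subset([3, 2], 5)  # no subset has product 5 or 4, so product 3 is used
```

With FN = {3, 2} and 5 processors, the acceptable solution uses only 3 processors, so 2 processors remain idle.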

In the example above, the computation yields Fn = {3, 2}. The algorithm also finds an acceptable solution when no exact one exists, because FindSubset decreases n step by step until it reaches the largest achievable product below N.

4.2.2 The parallel algorithm for fuzzy association rules

Inputs: a database D with attribute set I and record set T; the number of processors N; the minimum support minsup and the minimum confidence minconf.
Outputs: the set of all fuzzy association rules.

The parallel algorithm for mining fuzzy association rules consists of the following steps:
(1) Fuzzify the database D into DF, with fuzzy attribute set IF and record set TF.
(2) Use the algorithm in Table 17 to divide the set IF among the N processors in the system.
(3) Distribute the data to the processors according to the division of the fuzzy attribute set in step (2). Each processor Pi needs only the fields related to the fuzzy attributes assigned to it.
(4) Each processor Pi uses the sequential fuzzy association rule mining algorithm in Table 10 to generate rules. In this phase the processors work in parallel and independently of one another.
(5) The union of the rules generated on all processors in the system is the output of the algorithm.

This algorithm applies not only to fuzzy association rules but also to association rules with quantitative and categorical attributes (quantitative & categorical association rules).
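Steps (2) and (3) above can be illustrated with a small Python sketch. It assumes that the attributes chosen for the split (here Age and Cholesterol) are already known, for instance from the subset search of Table 17, and it forms one fuzzy attribute set per processor as a Cartesian product over the fuzzy sets of the split attributes. The function name `partition_fuzzy_attributes`, the data layout, and the English attribute names are assumptions made for illustration.

```python
from itertools import product

def partition_fuzzy_attributes(fuzzy_sets, split_attrs):
    """Build one fuzzy attribute list per processor.

    Each processor receives exactly one fuzzy set of every split attribute
    plus all fuzzy sets of the remaining attributes.
    """
    shared = [(a, f) for a, sets in fuzzy_sets.items()
              if a not in split_attrs for f in sets]
    choices = [[(a, f) for f in fuzzy_sets[a]] for a in split_attrs]
    return [list(combo) + shared for combo in product(*choices)]

fuzzy_sets = {
    "Age": ["Age_young", "Age_middle_aged", "Age_old"],
    "Cholesterol": ["Cholesterol_low", "Cholesterol_high"],
    "Blood_sugar": ["Blood_sugar_0", "Blood_sugar_1"],
    "Heart_disease": ["Heart_disease_no", "Heart_disease_yes"],
}

parts = partition_fuzzy_attributes(fuzzy_sets, ["Age", "Cholesterol"])
for i, part in enumerate(parts, start=1):
    print(f"P{i}:", [f for _, f in part])
```

Each of the 6 lists would then be handed to one processor, which runs the sequential mining algorithm on the corresponding fields only. No list contains two fuzzy sets of Age or of Cholesterol, which is why the processors need not exchange candidates during mining.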

4.3 Experiments and conclusions


Experiment: rule mining time as the number of attributes increases
Experiment: rule mining time as the data size (number of records) increases
Experiment: rule mining time as the number of processors increases


Chapter V. Conclusion
Problems addressed in this thesis
Building on existing proposals in data mining research, this thesis synthesizes the main points of data mining in general and association rule mining in particular, together with a few new proposals. The main points addressed in the thesis are as follows.

Chapter one gives a general overview of data mining: its definition, and the goals and motivations that have led computer scientists to focus on this field. It also outlines the main techniques and approaches applied to smaller, more specific problems such as classification and categorization. In short, this chapter gives the reader a general view of the field.

Chapter two restates the association rule mining problem proposed by R. Agrawal in 1993. Besides stating the concepts formally, the chapter sketches several specific research branches, such as association rules with weighted attributes, fuzzy association rules, and parallel association rule mining. Its goal is to present all the basic concepts of association rule mining and the extensions of the problem.

Based on the proposals in [SA96] [MY98] [AG00] [KFW98], chapter three briefly presents association rules with quantitative attributes, together with their advantages and drawbacks. The main goal of the chapter, however, is to present fuzzy association rules, a more flexible and more natural extension of the basic association rules of chapter two. The presentations in [AG00] [KFW98] are too brief and do not bring out the full meaning of fuzzy association rules, in particular the subtle relationship between fuzzy association rules and implication in fuzzy logic. The thesis explains why either the min operator or the algebraic product is used for the T-norm operator in formula (3.6). The chapter also restates the algorithm for finding fuzzy association rules from [AG00] [KFW98], which is based on Apriori with a few small modifications. The chapter closes with a proposal for converting fuzzy association rules into association rules with quantitative attributes.

This proposal highlights an advantage of fuzzy association rules: when needed, they can easily be converted into ordinary association rules.

Chapter four proposes a new parallel algorithm for the fuzzy association rule mining problem. With this algorithm, the processors in the system keep communication and synchronization during the computation to a minimum. The algorithm works close to ideally because the candidate attribute set is divided in a way that is both fair and clever. It is fair in that the candidates are divided evenly among the processors, and clever in that the candidate sets assigned to the individual processors are completely independent of one another. The drawbacks of the algorithm are that it applies only to association rules with quantitative attributes and to fuzzy association rules, and that it runs only on shared-nothing parallel systems.

While working on this thesis, and in the time before it, I tried to concentrate on this problem and consulted a good deal of related literature. Nevertheless, because of limited time and ability, shortcomings are unavoidable. I sincerely welcome comments from readers on both the technical content and the presentation of the thesis.

Future work


Association rule mining attracts many researchers because it is widely applicable and open to many different extensions. In this thesis I have chosen only one narrow direction to study. In the coming time, we will extend our research in the following directions:

Mining fuzzy association rules with weighted attributes. The goal of this problem is to assign weights to the attributes to express their importance for the rules. For example, when mining association rules related to heart disease, information about blood pressure, blood sugar, and cholesterol is more important than information about body weight and age, so those attributes are given larger weights. This problem is not new and has been proposed by several authors, but it has not yet been solved thoroughly.

The parallel data mining algorithm above applies only to shared-nothing parallel systems. In the coming time, we will study how to implement it on shared-memory parallel systems, such as symmetric multiprocessors.

Although the association rule mining problem is independent of the database it operates on, I would like to apply it to a specific database in order to fine-tune it and to derive optimal parameters.


References

Vietnamese-language references:
[1]. [PDD99] Phan Đình Diệu. Lô Gích trong Các Hệ Tri Thức. Khoa Công nghệ, Đại học Quốc gia Hà Nội. Hà Nội, 1999.
[2]. [DMT03] Đinh Mạnh Tường. Trí tuệ nhân tạo. Khoa Công nghệ, Đại học Quốc gia Hà Nội. Hà Nội, 2003.

English-language references:
[3]. [AR95] Alan Rea. Data Mining - An Introduction. The Parallel Computer Centre, Nor of The Queen's University of Belfast.
[4]. [AG00] Attila Gyenesei. A Fuzzy Approach for Mining Quantitative Association Rules. Turku Centre for Computer Science, TUCS Technical Reports, No 336, March 2000.
[5]. [AM95] Andreas Mueller. Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison. Department of Computer Science, University of Maryland-College Park, MD 20742.
[6]. [LHM99] Bing Liu, Wynne Hsu, and Yiming Ma. Mining Association Rules with Multiple Minimum Supports. In ACM SIGKDD International Conference on KDD & Data Mining (KDD-99), August 15-18, 1999, San Diego, CA, USA.
[7]. [KV01] Boris Kovalerchuk and Evgenii Vityaev. Data Mining in Finance - Advances in Relational and Hybrid Methods. Kluwer Academic Publishers, Boston - Dordrecht - London, 2001.
[8]. [MHT02] Bui Quang Minh, Phan Xuan Hieu, and Ha Quang Thuy. Some Parallel Computing Experiments with PC-Cluster. In Proc. of Conference on IT of Faculty of Technology, VNUH. Hanoi, 2002.
[9]. [KFW98] Chan Man Kuok, Ada Fu, and Man Hon Wong. Mining Fuzzy Association Rules in Databases. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong.
[10]. [THH02] Do Van Thanh, Pham Tho Hoan, and Phan Xuan Hieu. Mining Association Rules with Different Supports. In Proc. of the National Conference on Information Technology, Nhatrang, Vietnam, May 2002.
[11]. [BCJ01] Doug Burdick, Manuel Calimlim, and Johannes Gehrke. MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases. Department of Computer Science, Cornell University.
[12]. [HKK97] Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar. Scalable Parallel Data Mining for Association Rules. Department of Computer Science, University of Minnesota, 4-192 EECS Building, 200 Union St. SE, Minneapolis, MN 55455, USA.
[13]. [PHM01] Jian Pei, Jiawei Han, and Runying Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Burnaby, B.C., Canada.
[14]. [HK02] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. University of Illinois, Morgan Kaufmann Publishers, 2002.
[15]. [HF95] Jiawei Han and Yongjian Fu. Discovery of Multiple-Level Association Rules from Large Databases. In Proc. of the 21st International Conference on Very Large Databases, Zurich, Switzerland, Sep 1995.
[16]. [LZDRS99] Jinyan Li, Xiuzhen Zhang, Guozhu Dong, Kotagiri Ramamohanarao, and Qun Sun. Efficient Mining of High Confidence Association Rules without Support Thresholds. Department of CSSE, The University of Melbourne, Parkville, Vic, 3052, Australia.
[17]. [HG00] Jochen Hipp, Ulrich Guntzer, and Gholamreza Nakhaeizadeh. Algorithms for Association Rule Mining - A General Survey and Comparison. ACM SIGKDD, July 2000, Volume 2, Issue 1, pages 58-64.
[18]. [PCY95] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. Efficient Parallel Data Mining for Association Rules. In Fourth International Conference on Information and Knowledge Management, Baltimore, Maryland, Nov 1995.
[19]. [PYC98] Jong Soo Park (Sungshin Women's Univ., Seoul, Korea), Philip S. Yu (IBM T.J. Watson Res. Ctr.), and Ming-Syan Chen (National Taiwan Univ., Taipei, Taiwan). Mining Association Rules with Adjustable Accuracy.
[20]. [MTV94] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient Algorithms for Discovering Association Rules. In KDD-1994: AAAI Workshop on Knowledge Discovery in Databases, pages 181-192, Seattle, Washington, July 1994.
[21]. [LAZ65] L. A. Zadeh. Fuzzy sets. Informat. Control, 338-353, 1965.
[22]. [KMRTV94] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding Interesting Rules from Large Sets of Discovered Association Rules. In Proc. 3rd International Conference on Information and Knowledge Management, pages 401-408, Gaithersburg, Maryland, November 1994.
[23]. [MM00] Manoel Mendonca. Mining Software Engineering Data: A Survey. University of Maryland, Department of Computer Science, A. V. Williams Building #3225, College Park, MD 20742, 2000.
[24]. [ZH99] Mohammed J. Zaki and Ching-Jui Hsiao. CHARM: An Efficient Algorithm for Closed Association Rules Mining. RPI Technical Report 99-10, 1999.
[25]. [ZO98] Mohammed J. Zaki and Mitsunori Ogihara. Theoretical Foundations of Association Rules. In 3rd ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, June 1998.
[26]. [ZPO01] Mohammed J. Zaki, Srinivasan Parthasarathy, and Mitsunori Ogihara. Parallel Data Mining for Association Rules on Shared-Memory Systems. In Knowledge and Information Systems, Vol. 3, Number 1, pages 1-29, February 2001.
[27]. [MPIS95] MPI: A Message-Passing Interface Standard. Message Passing Interface Forum, June 12, 1995.
[28]. [EMPI97] MPI-2: Extensions to the Message-Passing Interface. Message Passing Interface Forum, July 18, 1997.
[29]. [JDMPI97] MPI-2 Journal of Development. Message Passing Interface Forum, July 18, 1997.
[30]. [PBTL99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering Frequent Closed Itemsets for Association Rules. Laboratoire d'Informatique, Université Blaise Pascal - Clermont-Ferrand II, Complexe Scientifique des Cézeaux.
[31]. [ZHL98] Osmar R. Zaiane, Mohammad El-Hajj, and Paul Lu. Fast Parallel Association Rule Mining Without Candidacy Generation. University of Alberta, Edmonton, Alberta, Canada.
[32]. [DP01] Qin Ding and William Perrizo. Using Active Networks in Parallel Mining of Association Rules. Computer Science Department, North Dakota State University, Fargo, ND 58105-5164.
[33]. [AY98] R. Agrawal and P. Yu. Online Generation of Association Rules. In IEEE International Conference on Data Mining, February 1998.
[34]. [AS96] Rakesh Agrawal and John Shafer. Parallel Mining of Association Rules: Design, Implementation and Experience. Research Report RJ 10004, IBM Almaden Research Center, San Jose, California, February 1996.
[35]. [AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th International Conference on Very Large Databases, Santiago, Chile, Sep 1994.
[36]. [AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining Association Rules between Sets of Items in Large Databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, D.C., May 1993.
[37]. [SA95] Ramakrishnan Srikant and Rakesh Agrawal. Mining Generalized Association Rules. In Proc. of the 21st International Conference on Very Large Databases, Zurich, Switzerland, Sep 1995.
[38]. [SA96] Ramakrishnan Srikant and Rakesh Agrawal. Mining Quantitative Association Rules in Large Relational Tables. IBM Almaden Research Center, San Jose, CA 95120.
[39]. [MY98] R. J. Miller and Y. Yang. Association Rules over Interval Data. Department of Computer & Information Science, Ohio State University, USA.
[40]. [PMPIP] Yukiya Aoyama and Jun Nakano. RS/6000 SP: Practical MPI Programming. International Technical Support Organization, www.redbooks.ibm.com
[41]. [MS00] T. Murai and Y. Sato. Association Rules from a Point of View of Modal Logic and Rough Sets. In Proceedings of the Fourth Asian Fuzzy Symposium, May 31 - June 3, 2000, Tsukuba, Japan, pp. 427-432.
[42]. [HHMT02] Tran Vu Ha, Phan Xuan Hieu, Bui Quang Minh, and Ha Quang Thuy. A Model for Parallel Association Rules Mining from The Point of View of Rough Set. In Proc. of International Conference on East-Asian Language Processing and Internet Information Technology, Hanoi, Vietnam, January 2002.
[43]. [FSSU96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press / The MIT Press, 1996.
[44]. [WYY01] Wei Wang, Jiong Yang, and Philip S. Yu. Efficient Mining of Weighted Association Rules (WAR). IBM Watson Research Center.
[45]. [WMPPP] Neil MacDonald, Elspeth Minty, Tim Harding, and Simon Brown. Writing Message-Passing Parallel Programs with MPI. Edinburgh Parallel Computing Centre, The University of Edinburgh.
[46]. [ZKM01] Zijian Zheng, Ron Kohavi, and Llew Mason. Real World Performance of Association Rule Algorithms. Blue Martini Software, 2600 Campus Drive, San Mateo, CA 94403, USA.
[47]. [ZHJ91] Zimmermann H. J. Fuzzy Set Theory and Its Applications. Kluwer Academic Publishers, 1991.
