
VIETNAM MARITIME UNIVERSITY

FACULTY OF INFORMATION TECHNOLOGY

LECTURE NOTES

DATA MINING
INTRODUCTORY LECTURE

OVERVIEW OF DATA MINING

Lecturer: Nguyễn Vương Thịnh, MSc. Department: Information Systems

Hải Phòng, 2011

References

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Elsevier Inc, 2006.

2. Robert Nisbet, John Elder, Gary Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier Inc, 2009.

3. Elmasri, Navathe, Somayajulu, Gupta, Fundamentals of Database Systems (4th Edition), Pearson Education Inc, 2004.

4. Hà Quang Thụy, Phan Xuân Hiếu, Đoàn Sơn, Nguyễn Trí Thành, Nguyễn Thu Trang, Nguyễn Cẩm Tú, Giáo trình khai phá dữ liệu web (Web Data Mining Textbook), NXB Giáo dục, 2009.

OVERVIEW OF DATA MINING

0.1. THE NEED FOR DATA MINING
0.2. WHAT IS DATA MINING?
0.3. THE CONCEPTS OF DATA, PATTERNS AND KNOWLEDGE
0.4. BASIC DATA MINING TASKS
0.5. THE STAGES OF DATA MINING
0.6. TYPICAL ARCHITECTURE OF A DATA MINING SYSTEM
0.7. DATA SOURCES FOR MINING
0.8. APPLICATIONS OF DATA MINING

0.1. THE NEED FOR DATA MINING

THE INFORMATION EXPLOSION!
- More data is being generated:
    - Web, text, images, ...
    - Commercial transactions, phone calls, ...
    - Scientific data: astronomy, biology, ...
- More data is being retained:
    - Storage technology is faster and cheaper.
    - Database management systems can manage larger databases.

The data explosion problem
- Automated data collection tools and mature database technology have led to huge amounts of data being accumulated and/or awaiting analysis in databases, data warehouses and other data repositories.
- We are drowning in data but starving for knowledge!

Solution: data warehousing and data mining
- Building data warehouses and online analytical processing (OLAP).
- Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

0.2. WHAT IS DATA MINING?

According to J. Han and M. Kamber (2006) [1]:
- View 1: Data mining is the process of extracting knowledge from a very large collection of data. Data mining = knowledge discovery from data (KDD: Knowledge Discovery from Data).
- View 2: Data mining is only one important step in the process of knowledge discovery from data (KDD): the step that applies intelligent methods to extract data patterns.

According to Hà Quang Thụy et al. (2009) [4] (pages 11 and 16):
- Definition 1: Knowledge discovery in databases (sometimes also called data mining) is a nontrivial process of discovering patterns in data that are valid, novel, potentially useful and understandable.
- Definition 2: Data mining is one step in the process of knowledge discovery in databases: it runs a data mining algorithm to find patterns in the data in a suitable representation.

0.3. THE CONCEPTS OF DATA, PATTERNS AND KNOWLEDGE

A. Data and patterns
- Data (data set): a set F consisting of finitely many cases (facts). In data mining, the data set F usually has to contain a very large number of cases.
- Pattern: during mining, a language L is used to describe subsets of the facts (data) in F. Each expression E in L describes a corresponding subset F_E of the facts in F. E is called a pattern if it is simpler than enumerating all the facts in F_E.
- Example: the pattern "Income < T".

B. Validity of a pattern
- A discovered pattern must be valid on new data (data that will appear in the future) with some degree of certainty.
- Validity: a measure of validity (certainty) is a function C that maps an expression of the pattern language L to a partially or totally ordered measurement space M_C. An expression E in L that describes a subset F_E ⊆ F can be assigned a certainty measure c = C(E, F).
- For the pattern "INCOME < $t": if the boundary that defines the pattern is shifted to the right (the threshold on INCOME takes a larger value), the certainty decreases, because more good loans are swept into the no-loan region.
- For the pattern a*INCOME + b*DEBT < 0: the borrowers who cannot repay correspond to the upper half-plane, and this pattern gives a higher certainty.
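
The notions of fact set, pattern and certainty can be made concrete with a minimal sketch. It assumes a hypothetical fact set F of six loan cases, writes a pattern E as a predicate, and takes the certainty C(E, F) to be the fraction of the covered cases F_E that really defaulted. The data, the coefficients a = 1 and b = -2, and this particular choice of C are illustrative assumptions, not taken from the lecture.

```python
from typing import Callable, NamedTuple, Sequence

class Case(NamedTuple):
    income: float     # hypothetical units
    debt: float
    defaulted: bool   # True if the borrower failed to repay

# Hypothetical fact set F (not from the lecture).
F = [
    Case(10, 8, True), Case(15, 9, True), Case(20, 4, False),
    Case(25, 20, True), Case(30, 5, False), Case(40, 10, False),
]

Pattern = Callable[[Case], bool]

def covered(E: Pattern, facts: Sequence[Case]) -> list:
    """F_E: the subset of facts described by the expression E."""
    return [f for f in facts if E(f)]

def certainty(E: Pattern, facts: Sequence[Case]) -> float:
    """One possible C(E, F): share of covered cases that defaulted."""
    F_E = covered(E, facts)
    if not F_E:
        return 0.0
    return sum(f.defaulted for f in F_E) / len(F_E)

def income_below(t: float) -> Pattern:
    """The pattern INCOME < t."""
    return lambda f: f.income < t

def linear_pattern(f: Case) -> bool:
    """The pattern a*INCOME + b*DEBT < 0 with assumed a = 1, b = -2."""
    a, b = 1.0, -2.0
    return a * f.income + b * f.debt < 0

narrow = certainty(income_below(18), F)   # covers 2 cases, both defaulted
wide = certainty(income_below(35), F)     # covers 5 cases, 3 defaulted
linear = certainty(linear_pattern, F)     # covers exactly the 3 defaulters
```

On this toy data, moving the income threshold from 18 to 35 lowers the certainty because good borrowers enter the covered region, while the linear pattern separates the defaulters without that loss.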

C. Novelty and potential usefulness
- Novelty: a pattern must be new within some frame of reference, at least for the system under consideration. Novelty can be measured with respect to changes in:
    - Data: comparing current values with past values or expected values.
    - Knowledge: how the new knowledge relates to the knowledge already available.
  In general, this can be measured by a function N(E, F), which is either a measure of novelty or a measure of unexpectedness.
- Potential usefulness: a pattern should lead to useful actions, as measured by a utility function. For example, a function U maps expressions in L to a partially or totally ordered measurement space M_U, with u = U(E, F).

D. Understandability, interestingness and the notion of knowledge
- Understandability: a pattern must be understandable.
    - The goal of data mining is to produce patterns that people understand more easily than the underlying data (the data already held in the system).
    - "Understandable" is hard to measure precisely, so several measures of simplicity are used instead. They range from purely syntactic (the size of the pattern in bits) to semantic (how easily people can comprehend the pattern in some setting).
    - Understandability is assumed to be measured by a function S that maps an expression E in L to a partially or totally ordered measurement space M_S, with s = S(E, F).
- Interestingness: interestingness (regarded as an overall measure of a pattern) combines the criteria of validity, novelty, usefulness and simplicity. Data mining systems usually either:
    - use an interestingness function i = I(E, F, C, N, U, S) that maps expressions in L to a measurement space M_I, or
    - define interestingness indirectly through an ordering of the discovered patterns.

Knowledge: a pattern E ∈ L is called knowledge if, for some class of users, a threshold i ∈ M_I can be specified such that the interestingness I(E, F, C, N, U, S) > i.
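
The lecture does not fix a form for the function I. The sketch below assumes, purely for illustration, that the four measures c, n, u and s have already been computed and scaled to [0, 1], and combines them with a weighted average. The weights, scores and the threshold 0.6 are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatternScores:
    name: str
    c: float  # certainty      c = C(E, F)
    n: float  # novelty        n = N(E, F)
    u: float  # utility        u = U(E, F)
    s: float  # simplicity     s = S(E, F)

# Hypothetical weights; any monotone combination would serve as I.
WEIGHTS = {"c": 0.4, "n": 0.2, "u": 0.3, "s": 0.1}

def interestingness(p: PatternScores) -> float:
    """An assumed I(E, F, C, N, U, S): weighted average of the four measures."""
    return (WEIGHTS["c"] * p.c + WEIGHTS["n"] * p.n
            + WEIGHTS["u"] * p.u + WEIGHTS["s"] * p.s)

def knowledge(patterns, threshold):
    """Patterns whose interestingness exceeds the threshold, best first."""
    kept = [p for p in patterns if interestingness(p) > threshold]
    return sorted(kept, key=interestingness, reverse=True)

# Hypothetical scores for three candidate patterns.
candidates = [
    PatternScores("INCOME < t", c=0.6, n=0.2, u=0.5, s=0.9),
    PatternScores("a*INCOME + b*DEBT < 0", c=1.0, n=0.6, u=0.8, s=0.7),
    PatternScores("list every defaulted case", c=1.0, n=0.0, u=0.1, s=0.0),
]

accepted = knowledge(candidates, threshold=0.6)
```

With these assumed numbers only the linear pattern passes the threshold. The plain enumeration of cases is fully certain but scores poorly on novelty, usefulness and simplicity, which mirrors the requirement that a pattern be simpler than listing F_E.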


0.4. TYPICAL DATA MINING TASKS

The general goals of data mining are description and prediction.
- Descriptive tasks: aim at finding patterns that describe the data.
- Predictive tasks: use some variables (or fields) in the database to predict unknown or future values of other variables.

These goals are realized through the following specific tasks:
1. Concept description
2. Association analysis
3. Classification
4. Clustering
5. Regression
6. Dependency modeling
7. Change and deviation detection

0.4.1. Concept description
Aims at finding the characteristics and properties of a concept. Typical tasks include generalization, summarization and the discovery of constrained data characteristics. Summarization is one of the typical descriptive tasks: it applies methods that find a compact description of a subset of the data. Example: determining the mean and standard deviation of a series of values.

0.4.2. Finding associations
Discovering associations in a data set is an important data mining task. A typical kind of association is the association between data variables, for which association rule mining is the representative task. Association rule mining discovers associations between sets of attributes (sets of variables) of the form X → Y, where X and Y are two attribute sets. The question is: to what extent does the occurrence of X entail the occurrence of Y?
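
How strongly X entails Y is commonly quantified by two standard measures: support (the fraction of transactions that contain both X and Y) and confidence (among the transactions that contain X, the fraction that also contain Y). The sketch below computes both for a single rule. The five baskets are hypothetical and the function names are chosen only for this illustration.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions)

def support(X, Y, transactions):
    """Fraction of all transactions containing X and Y together."""
    return count(set(X) | set(Y), transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    n_x = count(X, transactions)
    if n_x == 0:
        return 0.0
    return count(set(X) | set(Y), transactions) / n_x

# Hypothetical baskets (not from the lecture).
baskets = [frozenset(b) for b in (
    {"tea", "biscuit"},
    {"tea", "biscuit", "sugar"},
    {"tea", "sugar"},
    {"coffee", "biscuit"},
    {"tea", "biscuit", "coffee"},
)]

sup = support({"tea"}, {"biscuit"}, baskets)      # 3 of 5 baskets
conf = confidence({"tea"}, {"biscuit"}, baskets)  # 3 of the 4 tea baskets
```

Here the rule {tea} → {biscuit} has support 60% and confidence 75%, which is the form in which rules are normally reported.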

0.4.3. Classification
Builds (describes) predictive models (functions) that describe or distinguish classes or concepts, so that they can be used for later predictions. Typical methods include decision trees, classification rules and neural networks. In essence, classification is a function that maps data items into one of several known classes (groups). Classification is also called supervised learning.

0.4.4. Clustering
Groups data into clusters (each of which can be regarded as a new class) in order to discover how the data is distributed in the application domain. It aims at identifying a finite set of clusters or categories that describe the data. The goal of clustering is to maximize the similarity between elements of the same cluster and minimize the similarity between elements of different clusters. Clustering is also called unsupervised learning.
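
The difference between the two tasks can be sketched on one-dimensional data. The lecture does not prescribe a method; as an assumption, the sketch uses a nearest-centroid rule for classification (labels are given) and a basic k-means loop for clustering (no labels, only starting centers). All values and names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

# Supervised: class labels are known for the training samples.
def fit_centroids(samples):
    """samples: list of (value, label). Returns the mean value of each class."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def classify(x, centroids):
    """Assign x to the known class whose centroid is nearest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Unsupervised: only the values are given, the groups must be discovered.
def kmeans_1d(values, centers, iterations=10):
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in values:
            nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers, clusters

labeled = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
           (10.0, "high"), (11.0, "high"), (12.0, "high")]
centroids = fit_centroids(labeled)
label_of_4 = classify(4.0, centroids)

unlabeled = [x for x, _ in labeled]
centers, clusters = kmeans_1d(unlabeled, centers=[0.0, 5.0])
```

The classifier reuses classes that someone has already named, whereas the clustering step recovers the same two groups from the values alone and leaves their interpretation to the analyst.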

0.4.5. Regression
A typical task in statistical analysis and forecasting. It predicts the values of one or more dependent variables from the values of a set of independent variables. It can be reduced to learning a function that maps data to a real-valued variable as a function of several other variables.

0.4.6. Dependency modeling
Aims at finding a model that describes significant dependencies between variables. The model has two levels:
- Structural level: usually a graph showing which variables are locally dependent on which others.
- Quantitative level: specifies the strength of the dependencies on a numerical scale.

0.4.7. Change and deviation detection
Focuses on discovering the most significant changes in the data relative to previously measured or normative values, and provides users with knowledge about these changes and deviations. It is often applied in the preprocessing step.
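
Regression and deviation detection can be illustrated together. The sketch below fits a straight line by ordinary least squares and then flags the points whose residual is more than k times the residual standard deviation. The choice of a linear model, the factor k = 2 and the data are assumptions made only for this example.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def deviations(xs, ys, k=2.0):
    """Points whose distance from the fitted line exceeds k residual std devs."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = sqrt(sum(r * r for r in residuals) / len(residuals))
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * sigma]

# Hypothetical series: y = 2x everywhere except one anomalous reading at x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 20, 12, 14, 16]

intercept, slope = fit_line(xs, ys)
flagged = deviations(xs, ys)
```

The fitted line plays the role of the expected (normative) value, and the flagged point is the deviation reported to the user. Note that the outlier itself pulls the fitted slope away from 2, which is one reason such checks are typically run during preprocessing.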

Association rule mining:
{Milk, Coke} → {Sweet} (sup = 30%, conf = 70%)
{Beer} → {Cigar, Coffee} (sup = 35%, conf = 65%)
{Coffee} → {Tea, Biscuit} (sup = 22%, conf = 75%)
...

[Figure: data classification]

[Figure: data clustering]

0.5. THE STAGES OF DATA MINING

1. Data cleaning: remove noise and inconsistent data.
2. Data integration: combine data from different data sources.
3. Data selection: retrieve from the database the data relevant to the analysis task.
4. Data transformation: transform or consolidate the data into forms appropriate for mining by performing operations such as summarization or aggregation.
5. Pattern extraction: apply intelligent methods to extract the truly interesting patterns from the data. This step alone is sometimes also called data mining (in the narrow sense).
6. Pattern evaluation: based on interestingness measures, identify the interesting patterns that represent knowledge.
7. Knowledge presentation: use knowledge representation and visualization techniques to present the mined knowledge to the user.

Note:
Stages 1 to 4 are called the data preprocessing stages. They prepare the data for the mining process (pattern extraction).
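
The seven stages can be sketched as a chain of small functions. Everything below is a toy under stated assumptions: the two sources, the record fields, the rule that a missing or negative amount counts as noise, and the choice of co-occurring item pairs as the "pattern" are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical data sources with the same record layout.
source_a = [
    {"customer": "An", "item": "Milk", "amount": 2.0},
    {"customer": "An", "item": "Diaper", "amount": 5.0},
    {"customer": "Binh", "item": "Milk", "amount": 2.0},
    {"customer": "Binh", "item": "Diaper", "amount": None},   # missing value
]
source_b = [
    {"customer": "Binh", "item": "Coke", "amount": 1.5},
    {"customer": "Chi", "item": "Milk", "amount": 2.0},
    {"customer": "Chi", "item": "Diaper", "amount": 5.0},
    {"customer": "Chi", "item": "Coke", "amount": -1.0},      # inconsistent value
]

def clean(records):                      # 1. data cleaning
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):                 # 2. data integration
    return [r for source in sources for r in source]

def select(records):                     # 3. data selection: keep relevant fields
    return [(r["customer"], r["item"]) for r in records]

def transform(pairs):                    # 4. data transformation: one basket per customer
    baskets = {}
    for customer, item in pairs:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):                       # 5. pattern extraction: co-occurring item pairs
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

def evaluate(counts, min_count):         # 6. pattern evaluation: keep frequent pairs
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]

def present(patterns):                   # 7. knowledge presentation
    return [f"{a} + {b}: bought together by {n} customers" for (a, b), n in patterns]

prepared = transform(select(integrate(clean(source_a), clean(source_b))))
report = present(evaluate(mine(prepared), min_count=2))
```

Stages 1 to 4 produce the `prepared` baskets, and only stage 5 is data mining in the narrow sense. On this toy input the report contains a single line for the pair Diaper and Milk.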


0.6. TYPICAL ARCHITECTURE OF A DATA MINING SYSTEM

1. Database, data warehouse, World Wide Web and other information repositories: one or a set of databases, data warehouses or other information repositories. Data cleaning and data integration techniques may be performed on this data.

2. Database or data warehouse server: responsible for fetching the relevant data, based on the user's mining request.

3. Knowledge base: the domain knowledge used to guide the search or to evaluate the interestingness of the resulting patterns. Such knowledge can include concept hierarchies, which are used to organize attributes and attribute values into different levels of abstraction.

4. Data mining engine: the essential component of a data mining system. It consists of modules for tasks such as characterization, association and correlation analysis, classification, prediction and cluster analysis.

5. Pattern evaluation module: uses interestingness measures and interacts with the data mining engine so that the search focuses on interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Depending on the mining method used and on its implementation, it may be integrated with the mining module.
Recommendation: pattern evaluation should be integrated as tightly as possible with the mining process, so that the search is confined to the interesting patterns and mining becomes more efficient.

6. User interface: this module communicates between the user and the data mining system. It:
- lets the user interact with the system by specifying a query or the desired mining task;
- provides information that helps focus the search;
- performs exploratory data mining based on intermediate mining results;
- lets the user browse databases, data warehouse schemas and data structures, evaluate mined patterns and visualize the patterns in different forms.
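
The concept hierarchies mentioned for the knowledge base (component 3) can be pictured as a child-to-parent mapping that lets attribute values be viewed at a higher level of abstraction. The sketch below is only an illustration: the hierarchy city → region → country and the sample values are assumptions, not part of the lecture.

```python
from collections import Counter

# Hypothetical concept hierarchy for a "location" attribute: child -> parent.
PARENT = {
    "Hai Phong": "Northern Vietnam",
    "Hanoi": "Northern Vietnam",
    "Da Nang": "Central Vietnam",
    "Northern Vietnam": "Vietnam",
    "Central Vietnam": "Vietnam",
}

def generalize(value, levels=1):
    """Climb the hierarchy by the given number of levels (the top stays put)."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value

def roll_up(values, levels):
    """Count attribute values after replacing each by its ancestor."""
    return Counter(generalize(v, levels) for v in values)

cities = ["Hai Phong", "Hanoi", "Hai Phong", "Da Nang"]
by_region = roll_up(cities, levels=1)
by_country = roll_up(cities, levels=2)
```

A mining engine could consult such a mapping to search for patterns at the region level when city-level patterns are too sparse to be interesting.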


0.7. DATA SOURCES FOR MINING

1. RELATIONAL DATABASE

2. DATA WAREHOUSE
- A repository of data collected from multiple sources, stored under a unified schema and residing at a single site.
- Constructed through data cleaning, data integration, data transformation, data loading and periodic data refreshing.
- To facilitate decision making, the data in a data warehouse is usually organized around major subjects of interest, such as customer, item and supplier.
- The data is stored to provide information from an overall view of the enterprise's operational data, typically covering the past 5 to 10 years, and is usually summarized for ease of processing.
- A data warehouse is usually modeled as a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema and each cell stores the value of some aggregate measure.
- The actual physical structure of a data warehouse may be a relational database or a multidimensional data cube.
- A data cube provides a multidimensional view of the data and allows precomputation of, and fast access to, summarized data.
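
A minimal sketch of the data cube idea follows, assuming three dimensions (customer, item, supplier) and one measure (sales amount). Every cell, including those where a dimension is rolled up to `*`, is precomputed once so that summary queries become simple lookups. The fact rows and the `*` convention are assumptions for this illustration.

```python
from collections import defaultdict
from itertools import product

ALL = "*"   # marks a dimension that has been aggregated away
DIMENSIONS = ("customer", "item", "supplier")

# Hypothetical fact rows: one value per dimension, then the measure.
facts = [
    ("An", "Milk", "S1", 10.0),
    ("An", "Bread", "S2", 4.0),
    ("Binh", "Milk", "S1", 6.0),
    ("Binh", "Milk", "S2", 5.0),
]

def build_cube(rows):
    """Precompute the summed measure for every combination of kept/rolled-up dimensions."""
    cube = defaultdict(float)
    for *dims, amount in rows:
        # Each dimension is either kept or replaced by ALL: 2**3 cells per row.
        for mask in product((True, False), repeat=len(DIMENSIONS)):
            cell = tuple(d if keep else ALL for d, keep in zip(dims, mask))
            cube[cell] += amount
    return dict(cube)

cube = build_cube(facts)

milk_total = cube[(ALL, "Milk", ALL)]     # all customers, all suppliers
an_total = cube[("An", ALL, ALL)]         # everything bought by An
grand_total = cube[(ALL, ALL, ALL)]
```

The trade-off is the usual one for precomputation: the cube costs extra storage up front, and in return each summarized view is answered without rescanning the fact rows.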


3. TRANSACTION DATABASE
A transaction database is a collection of transactions. Each transaction consists of a transaction identifier (trans_ID) and the list of items that make up the transaction.

Trans_ID    Item List
T1          Milk, Bread, Coke
T2          Beer, Bread
T3          Beer, Milk, Diaper, Coke
T4          Beer, Milk, Diaper, Bread
T5          Milk, Diaper, Coke
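
The five transactions T1 to T5 are enough to try a simplified level-wise search for frequent itemsets, in the spirit of Apriori but without its full candidate pruning. The minimum count of 3 and all function names are assumptions made for this sketch.

```python
from itertools import combinations

# The transaction database from the table (T1 to T5).
transactions = {
    "T1": {"Milk", "Bread", "Coke"},
    "T2": {"Beer", "Bread"},
    "T3": {"Beer", "Milk", "Diaper", "Coke"},
    "T4": {"Beer", "Milk", "Diaper", "Bread"},
    "T5": {"Milk", "Diaper", "Coke"},
}

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= items for items in transactions.values())

def frequent_itemsets(min_count):
    """Level-wise search: size-(k+1) candidates are built only from frequent size-k sets."""
    all_items = sorted(set().union(*transactions.values()))
    level = [frozenset([i]) for i in all_items if count(frozenset([i])) >= min_count]
    result = {}
    while level:
        for itemset in level:
            result[itemset] = count(itemset)
        k = len(level[0]) + 1
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        level = [c for c in candidates if count(c) >= min_count]
    return result

frequent = frequent_itemsets(min_count=3)
pairs = sorted(sorted(s) for s in frequent if len(s) == 2)
```

With a minimum count of 3, every single item is frequent, the only frequent pairs are {Coke, Milk} and {Diaper, Milk}, and no triple qualifies. Rules such as {Diaper} → {Milk} can then be read off these itemsets.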


4. ADVANCED DATA TYPES
- Text data: structured, semi-structured or unstructured.
- Multimedia data: images, audio, video, ...
- World Wide Web data: web content data, web structure data, web usage data.

0.8. APPLICATIONS OF DATA MINING

Data analysis and decision support
- Market analysis and management: target marketing, customer relationship management (CRM), buying-habit analysis, cross-selling, market segmentation.
- Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
- Fraud detection and detection of unusual patterns (outliers).

Other applications
- Text mining (newsgroups, email, documents) and Web mining.
- Stream data mining.
- DNA and biological data analysis.

Q&A