
Research and Development of a Feature Selection Method to Improve Classification
Performance on High-Dimensional Data

Đồng Thị Ngọc Lan
University of Engineering and Technology
Master's thesis, major: Software Engineering; Code: 60 48 10
Supervisor: Assoc. Prof. Dr. Nguyễn Hà Nam
Year of defense: 2011
Abstract: An overview of data mining and feature selection. The thesis presents the main
ideas of the classification algorithm it uses, Random Forest, together with the genetic
algorithm. It then describes the proposed method and the approach taken to the problem,
and reports the experimental procedure and an evaluation of the experimental results.
Keywords: Information technology; Classification algorithms; Databases
Content
CHAPTER 1: OVERVIEW OF DATA MINING AND FEATURE SELECTION
1.1 Introduction to data mining and feature selection
Data mining is a concept that emerged in the late 1980s. It covers a range of techniques for
discovering valuable information hidden in large data sets. In essence, data mining involves
analyzing data and applying techniques that find regular patterns in a data set. In 1989,
Fayyad, Piatetsky-Shapiro and Smyth used the term Knowledge Discovery in Databases (KDD)
for the whole process of discovering useful knowledge from large data sets [14]. Within that
process, data mining is one particular step, which applies specific algorithms to extract
patterns or models from the data.
In data mining, feature selection plays an important role in data preprocessing. The thesis
concentrates on the following three main tasks:
Dimensionality reduction: Dimensionality reduction shrinks the dimension of the data search
space, lowers the cost of collecting and storing data, improves the efficiency of data mining,
and simplifies the mining results. Within dimensionality reduction, two concepts need to be
distinguished:

Feature extraction:

Feature selection:

Clustering and classification: Classification and clustering are two closely related tasks
in data mining. A class is a set of objects that share certain characteristics or
relationships; all objects in it are given the same class name so that they can be
distinguished from the other classes. A cluster is a set of objects that are similar to one
another in terms of position. Clusters are often formed as a preliminary step before the
objects are classified.
Rule extraction: Rule extraction searches for and presents the data in terms of inferences
or decisions that are themselves built from the knowledge gathered from that data. Users of
data mining results mostly want a simple explanation of why a classification came out as it
did and which features influenced the result. The parameters of a classifier, however, are
very hard to interpret in a form that users can readily understand. Extracting IF-THEN rules
that convey the valuable information is therefore the simplest and most understandable form
of explanation for users.
1.2. Feature selection and the classification problem
The basic task of classification is to assign a set of objects to a finite number n of
classes known in advance. The objects to be classified are characterized by a set of features
that carry the information relevant to the classes, where each such set of features is
represented by a set of feature-value pairs. Given a data set of objects that have already
been classified (usually called the training set), the task is to build from it a classifier
for similar data. The difficulty is that the number of features is usually very large, and
irrelevant or redundant features can harm classification algorithms. Redundant or irrelevant
features and data can cause the algorithm to learn inaccurately. In addition, their presence
can make the classifier more complex, which creates unnecessary difficulty when interpreting
what was learned from the training set. This problem therefore needs to be addressed in
classification tasks.
1.3 Feature selection methods
Feature selection can be defined as the process of finding a subset of features from the
feature set M of the original data set N, which requires a selection criterion to be
specified. A selection algorithm has four basic steps: subset generation, subset evaluation,
stopping condition, and result validation.
Feature selection can be characterized by the model used, the search strategy, the feature
quality measure, and the assessment. There are three types of model: Filter, Wrapper, and
Embedded. Search strategies include forward, backward, floating, branch and bound, and
randomized. Assessing feature selection involves two tasks: first, comparing the two stages
before and after feature selection; second, comparing two feature selection algorithms [3].
In summary, feature selection can be viewed as a combination of three main components:
search, evaluation, and model selection.
1.4 Some feature selection algorithms
Feature selection algorithms are examined according to the search strategy they use:
exhaustive search, heuristic search, and probabilistic search. A few other methods are also
studied: the feature weighting method, the hybrid method, and the incremental method.
1.4.1 Exhaustive search
a. The Focus method
b. The AAB method
1.4.2 Heuristic search
1.4.3 Probabilistic search
(a). The LVF method
(b). The LVF method
1.4.4. Feature weighting method
1.4.5. Hybrid method
1.4.6. Incremental method
CHAPTER 2: THE RANDOM FOREST ALGORITHM AND THE GENETIC ALGORITHM
2.1 Introduction to the Random Forest algorithm
Random Forest is a classification method developed by Leo Breiman at the University of
California, Berkeley. Breiman is also a co-author of CART (Classification and Regression
Trees), which is regarded as one of the ten classic data mining methods. Random Forest is
built on three main components: (1) CART, (2) ensemble learning, committees of experts, and
model combination, and (3) bootstrap aggregation (bagging). Figure 3.1 [30] illustrates the
random forest classification method.
2.2 Bootstrap and Bagging
2.2.1 Bootstrap
The bootstrap is a well-known statistical method introduced by Bradley Efron in 1979. It is
mainly used to estimate standard errors and bias and to compute confidence intervals for
parameters. It works as follows: from an original population, draw a sample
L = (x1, x2, ..., xn) of n elements and compute the desired statistics. Then repeat b times:
create a sample Lb, also of n elements, by resampling from L with replacement, and compute
the desired statistics on it.
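As a minimal sketch of the resampling procedure just described, the following Python snippet estimates the standard error of a statistic with the bootstrap. The function name `bootstrap_standard_error`, the sample values and the default of 1000 replicates are illustrative assumptions and do not come from the thesis.

```python
import random
import statistics


def bootstrap_standard_error(sample, statistic, b=1000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(b):
        # Draw a bootstrap sample L_b of the same size n, with replacement.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    # The spread of the replicates approximates the standard error.
    return statistics.stdev(replicates)


# Hypothetical sample L = (x1, ..., xn); the values are made up for illustration.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
se_mean = bootstrap_standard_error(data, statistics.mean)
```

Fixing the seed makes the estimate reproducible; a larger `b` gives a smoother estimate at a proportional cost in time.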
2.2.2. Bagging
Bagging can be seen as a method for aggregating results obtained from bootstrap samples. The
main idea is as follows. Given a training set D = {(xi, yi): i = 1, 2, ..., n}, suppose we
want a prediction for an input x.
Draw B data sets, each consisting of n elements chosen at random from D with replacement
(as in the bootstrap). The collection (D1, D2, ..., DB) then looks like a set of replicated
training sets;
Train a learner or model on each set Db (b = 1, 2, ..., B) and collect the prediction
obtained from each one;
The final aggregated result is computed by averaging (regression) or by majority vote
(classification).
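The three bagging steps above can be sketched in a few lines of Python. The base learner here is a deliberately simple one-dimensional threshold rule (`train_stump`), chosen only to keep the example self-contained; the names, the toy data and the choice of 25 models are assumptions for illustration, not the thesis's setup.

```python
import random
from collections import Counter


def train_stump(data):
    """Toy base learner: threshold halfway between the two class means (1-D input)."""
    overall = sum(x for x, _ in data) / len(data)
    means = {}
    for label in (0, 1):
        values = [x for x, y in data if y == label]
        # Fall back to the overall mean if a bootstrap sample misses a class.
        means[label] = sum(values) / len(values) if values else overall
    threshold = (means[0] + means[1]) / 2
    high_label = 1 if means[1] >= means[0] else 0
    return lambda x: high_label if x > threshold else 1 - high_label


def bagging_predict(train, x, n_models=25, seed=0):
    """Train one stump per bootstrap sample D_b and return the majority vote for x."""
    rng = random.Random(seed)
    n = len(train)
    votes = []
    for _ in range(n_models):
        d_b = [train[rng.randrange(n)] for _ in range(n)]  # sample with replacement
        votes.append(train_stump(d_b)(x))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical training set D = {(x_i, y_i)} with two well-separated classes.
train = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
low_prediction = bagging_predict(train, 0.8)
high_prediction = bagging_predict(train, 7.2)
```

For regression the vote would be replaced by the mean of the individual predictions.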
2.3 Random forest
The RF algorithm for classification can be summarized as follows:
Draw K bootstrap samples from the training set.
For each bootstrap sample, grow an unpruned classification tree as follows: at each node,
instead of choosing the best split among all predictors, randomly sample m of the
predictors and choose the best split among those.
Make predictions by aggregating the predictions of the K trees.
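The node-level step that distinguishes a random forest from plain bagging can be sketched as follows. This is an illustration under stated assumptions: it uses Gini impurity as the split criterion (the usual choice in CART, not confirmed by the text above), and the names `gini`, `best_split_on_random_features` and `m_try` are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_on_random_features(rows, labels, m_try, rng):
    """Pick m_try predictors at random and return the best split among them only."""
    candidates = rng.sample(range(len(rows[0])), m_try)
    best = None  # (weighted impurity, feature index, threshold)
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send records to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best, candidates


# Toy node data: feature 0 separates the classes, features 1 and 2 are noise.
rows = [[1, 5, 3], [2, 1, 9], [3, 7, 4], [10, 2, 8], [11, 6, 1], [12, 3, 7]]
labels = [0, 0, 0, 1, 1, 1]
best_all, _ = best_split_on_random_features(rows, labels, 3, random.Random(0))
best_one, picked = best_split_on_random_features(rows, labels, 1, random.Random(0))
```

With `m_try` equal to the number of predictors the function reduces to an ordinary exhaustive CART split; smaller values force different trees to rely on different predictors.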
The learning process of Random Forest uses randomly chosen input values, or combinations of
them, at each node while growing each decision tree. Experiments show that Random Forest
results compare favorably with AdaBoost. Random Forest has several strengths:
(1) Its accuracy is similar to that of AdaBoost and in some cases better.
(2) It copes well with problems that contain a lot of noisy data.
(3) It runs faster than bagging or boosting.
(4) It provides internal estimates such as the accuracy of the predictive model and the
strength of, and correlation between, the features.
(5) It is easy to parallelize.
(6) However, achieving these strengths makes the algorithm's execution time fairly long and
consumes considerable system resources.
From this review of the RF algorithm we observe that RF is a good classification method
because: (1) the variance is reduced, since the RF result aggregates many learners; and
(2) the random choice made at each step reduces the correlation between the learners whose
results are aggregated.

We also note that the overall error of a forest of classification trees depends on the error
of each individual tree and on the correlation between the trees.
2.4 Some characteristics of RF
2.4.1 OOB
When the sample used to train a tree is drawn from the training set with replacement
(bagging), roughly 1/3 of the elements are estimated to be left out of that sample [7]. In
other words, only about 2/3 of the elements of the training set take part in building the
tree, and the remaining 1/3 are called out-of-bag (OOB) data. OOB data are used to estimate
the error of the combined predictions of the trees in the random forest and to estimate
variable importance.
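The "roughly 1/3" figure can be checked numerically. A given record is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e, about 0.368, as n grows. The sketch below compares that closed form with one simulated bootstrap sample; the function name `oob_fraction` and the choice n = 1000 are illustrative assumptions.

```python
import math
import random


def oob_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and return the share of records left out."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1 - len(in_bag) / n


n = 1000
expected = (1 - 1 / n) ** n   # probability that a given record is never drawn
limit = math.exp(-1)          # limiting value as n grows
simulated = oob_fraction(n)
```

So the out-of-bag share is slightly above one third (close to 37%), which is consistent with the approximate figure quoted above.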
2.4.2 Variable importance
Computing variable importance in RF is much like using OOB data to compute the RF error.
Suppose we need the importance of the m-th feature. First compute R_OOB, the number of
correct predictions a tree makes on its OOB data. Then randomly permute the values of
feature m in the OOB data, send the permuted records down the tree, and count the correct
predictions; call this count R_perm.
The importance of feature m is the difference between the two counts, averaged over the K
trees of the forest:
$$\mathrm{Imp}(m) = \frac{1}{K}\sum_{t=1}^{K}\left(R_{OOB}^{(t)} - R_{perm}^{(t)}\right)$$
If the importance values obtained on the individual trees are independent, the standard
error of R_OOB - R_perm can also be computed.
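A minimal sketch of the per-tree computation, assuming a tree is available as a plain prediction function. The `toy_tree` rule, the OOB records and the name `permutation_importance` are all hypothetical; in a real forest the returned difference would be averaged over the trees.

```python
import random


def permutation_importance(predict, oob_rows, oob_labels, feature, seed=0):
    """Return R_OOB - R_perm for one tree and one feature."""
    rng = random.Random(seed)
    # R_OOB: correct predictions on the untouched out-of-bag records.
    r_oob = sum(predict(row) == y for row, y in zip(oob_rows, oob_labels))
    # Shuffle only the column of the feature under study.
    shuffled = [row[feature] for row in oob_rows]
    rng.shuffle(shuffled)
    permuted_rows = []
    for row, value in zip(oob_rows, shuffled):
        new_row = list(row)
        new_row[feature] = value
        permuted_rows.append(new_row)
    # R_perm: correct predictions after the permutation.
    r_perm = sum(predict(row) == y for row, y in zip(permuted_rows, oob_labels))
    return r_oob - r_perm


def toy_tree(row):
    """Stand-in for a trained tree: it only looks at feature 0."""
    return 1 if row[0] > 5 else 0


oob_rows = [[1, 9], [2, 3], [3, 7], [4, 1], [6, 8], [7, 2], [8, 6], [9, 4]]
oob_labels = [0, 0, 0, 0, 1, 1, 1, 1]
importance_used = permutation_importance(toy_tree, oob_rows, oob_labels, feature=0)
importance_unused = permutation_importance(toy_tree, oob_rows, oob_labels, feature=1)
```

A feature the tree never consults gets an importance of exactly zero, because permuting it cannot change any prediction.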
2.5 Genetic algorithm
The genetic algorithm [2, 32] is a stochastic optimization algorithm based on the mechanisms
of natural selection and genetic evolution. It was first applied in two main areas:
optimization and machine learning. In optimization it developed quickly and has been applied
to many problems, such as function optimization, image processing, the travelling salesman
problem, system identification, and control. Like evolutionary algorithms in general, the
genetic algorithm rests on the view that natural evolution is the most perfect and most
reasonable process and is in itself optimal. This view can be treated as an axiom: it cannot
be proved, but it is consistent with objective reality. The optimality of evolution shows in
the fact that each generation is better (more developed, more complete) than the previous
one, thanks to inheritance and the struggle for survival [2].
CHAPTER 3: PROPOSED METHOD
3.1 Introduction
This chapter presents a machine learning method for finding an optimal feature subset from
the features of a given data set: a procedure modelled on the genetic algorithm, combined
with the Random Forest algorithm. The proposed method is a wrapper approach that finds the
optimal features and eliminates the redundant ones.

3.2 Rationale
The operators of a genetic algorithm are all random: the algorithm has to fix a population
size, initialize the population at random, and set a crossover probability and a mutation
probability, the latter being low. To limit the drawbacks of random selection, the thesis
proposes the following:
- Generate feature subsets from the original feature set by combining random selection
with an even distribution of the features.
- Do not apply crossover or mutation to create new feature subsets; instead, evaluate the
subsets and, based on that evaluation, pick the features with high fitness.
- Use a learning algorithm as the criterion for assessing the fitness of the feature subsets.
The proposed method is described in detail in the following sections.
3.3 System architecture

Figure 4.3: Basic architecture of the system


3.4 Proposed method
Step 1: Create m feature subsets from the original set of n features.
- Each subset contains 2*n/m features, consisting of:
o n/m evenly distributed features
o n/m randomly chosen features
Step 2: Compute an estimated score for each feature subset
- Use RF to compute the estimated score of each subset.
=> This gives a set of estimates f(i) (i=1,..,m)
Step 3: Rank the features by weight.
- The weight of each feature i is computed by the formula:

kij = 0 if feature i is not selected in feature subset j


kij = 1 if feature i is selected in feature subset j
Step 4: Build a new set consisting of the best p% of the features

Return to Step 1. Stopping conditions: a) the number of features < the allowed threshold.
b) A fixed number of iterations has been reached
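Step 1 can be read in more than one way, so the following Python sketch states its interpretation explicitly: the "evenly distributed" part is taken to be the j-th block of a shuffled partition of the features (so every feature appears in at least one subset when m divides n), and the random part is drawn from the remaining features. The function name `make_feature_subsets` and this reading are assumptions, not the thesis's confirmed procedure.

```python
import random


def make_feature_subsets(n, m, seed=0):
    """Build m subsets of 2*(n//m) feature indices: one even block plus random extras."""
    rng = random.Random(seed)
    features = list(range(n))
    rng.shuffle(features)
    block = n // m  # if m does not divide n, the leftover features get no even block
    subsets = []
    for j in range(m):
        # Evenly distributed part: the j-th block of the shuffled feature list.
        even_part = features[j * block:(j + 1) * block]
        # Random part: drawn without replacement from the features not in the block.
        remaining = [f for f in range(n) if f not in even_part]
        random_part = rng.sample(remaining, block)
        subsets.append(sorted(even_part + random_part))
    return subsets


subsets = make_feature_subsets(n=12, m=4)
```

Each subset would then be scored with RF in Step 2; the scoring itself is omitted here because it depends on the data and on the RF implementation.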
3.5 System operation
The original data set is randomly split into two sets: a training set and a test set.
a. The training set is passed through Part 1
The training data form a table of size m x n, where n is the number of original features and
m is the number of records. When this table is passed through Part 1:
Step 1: the proposed method generates sub-tables of size m x ki, where ki is the number of
columns (features) of the i-th sub-table (i=1,2,...) and ki < n. Each table is a subset of
the features of the original data set.
Step 2: The fitness of each new feature subset is evaluated by applying the random forest
learning algorithm.
Step 3: Then, for each feature of the original feature set, its fitness (weight) w is
computed by the formula:

where j=1,..,n indexes the n original features, Fitnessi is the fitness of the i-th of the
m new feature subsets, and k takes the value 1 if feature j is selected in feature subset i
and the value 0 if it is not.
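The weight formula itself is not reproduced in the text above. One reading consistent with the definitions of k and Fitnessi is the plain sum w_j = sum over i of k_ij * Fitness_i, that is, each feature accumulates the fitness of every subset that contains it. The sketch below implements that assumed form only; a normalized variant (dividing by the number of subsets containing the feature) would be equally compatible with the description. The name `feature_weights` and the numbers are hypothetical.

```python
def feature_weights(subsets, fitness, n):
    """Assumed form: w_j = sum_i k_ij * fitness_i, with k_ij = 1 if feature j is in subset i."""
    weights = [0.0] * n
    for subset, fit in zip(subsets, fitness):
        for j in subset:
            weights[j] += fit  # k_ij = 1 only for the features listed in the subset
    return weights


# Three hypothetical subsets over n = 4 features and their RF-based fitness scores.
example_subsets = [[0, 1], [1, 2], [0, 3]]
example_fitness = [0.80, 0.90, 0.70]
weights = feature_weights(example_subsets, example_fitness, n=4)
```

Under this reading, feature 1 ranks first in the example because it belongs to the two best-scoring subsets.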
Step 4: Sort the n features in descending order of weight wj. Taking the top p% of the
features in that order gives a new feature set of (n*p)/100 features.
The steps above are then repeated on this new feature set until the number of features
reaches a given threshold or a fixed number of iterations has been performed. At the end of
Part 1 we obtain the feature set with the highest fitness, which is passed on to Part 2.
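Step 4 and the stopping rule can be sketched as follows. The helper names, the use of integer division for (n*p)/100 and the example values (p = 80, a threshold of 600 features) are assumptions for illustration; the thesis does not state which p was used.

```python
def select_top_percent(weights, p):
    """Return the indices of the top p% of features, ranked by descending weight."""
    keep = max(1, (len(weights) * p) // 100)
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return ranked[:keep]


def shrink_schedule(n, p, min_features, max_rounds):
    """Feature counts after each round until a stopping condition is met."""
    sizes = [n]
    while sizes[-1] > min_features and len(sizes) - 1 < max_rounds:
        sizes.append(max(1, (sizes[-1] * p) // 100))
    return sizes


top_half = select_top_percent([0.2, 0.9, 0.5, 0.7], p=50)
sizes = shrink_schedule(n=2000, p=80, min_features=600, max_rounds=10)
```

Because the feature count shrinks geometrically, the number of rounds grows only logarithmically with n, but every round still requires one RF evaluation per subset.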
b. Operation of Part 2
Take all records of the original data set, restricted to the features found in Part 1, and
split them into two parts: training and testing. The new training set is used to train RF.
After training, the test set is fed to this RF to assess the quality of the system. To check
the stability of the system, testing is repeated several times; each test run uses a
different random split of the data into training and test sets.
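The repeated random-split evaluation described above can be sketched generically. The `evaluate` callback stands in for "train RF on the training part and measure accuracy on the test part"; here it is replaced by a majority-class baseline so that the example runs without any external package. The 70/30 split ratio and all names are assumptions.

```python
import random
import statistics


def repeated_holdout(records, evaluate, runs=20, train_ratio=0.7, seed=0):
    """Re-split the data at random `runs` times and summarize the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = records[:]
        rng.shuffle(shuffled)  # a fresh random split for every run
        cut = int(len(shuffled) * train_ratio)
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": statistics.stdev(scores),
    }


def majority_baseline(train, test):
    """Placeholder evaluator: predict the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(y == majority for _, y in test) / len(test)


records = [(i, i % 2) for i in range(10)]  # hypothetical (features, label) pairs
summary = repeated_holdout(records, majority_baseline)
```

The spread between the minimum and maximum score, and the standard deviation, give a rough indication of how stable the classifier is across splits.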

CHAPTER 4: EXPERIMENTS AND EVALUATION


4.1 Experimental environment
All experiments were run on a laptop with an Intel(R) Core(TM) i7-2620M CPU @ 2.70 GHz and
4 GB of RAM.
The experimental program was written in R. The random forest package was obtained from
www.r-project.org; all other modules were written from scratch, without reusing or
inheriting code from any other source.
4.2 Program description
The thesis uses R to build the following functions:

Function name: Innit()
Description: Initializes the program parameters, reads the data from the .csv files into
variables, and normalizes the data to a predefined structure.

Function name: RF_Run<-function(train,test,TreeNum)
Description:
- Measures the RF running time, including training time and testing time.
- Computes the percentage of correct predictions (classification accuracy) of RF, with the
optional parameter TreeNum giving the number of trees in the forest.

Function name: fitness<-function(inpData,CrV,ind)
Description: Computes the fitness of the data set inpData with the given validation
factor CrV.

Function name: proccess<-function(InpData,m,p)
Description:
- Selects the optimal set of features from the input data set InpData.
- m is the number of feature subsets generated in each round.
- p determines the number of features removed after each selection round. (This is the
stopping condition of the algorithm.)

Function name: RF_Proccess<-function(train,test,TreeNum,RunNum)
Description:
- Computes the percentage of correct predictions and the running time of RunNum runs of RF
with TreeNum trees, on the original feature set and on the newly found optimal feature set.
- Determines the mean, maximum, minimum, standard deviation and mean running time of the
runs on the two data sets and plots comparison charts.
- Writes the results to the Output file.
4.3 Experimental results
4.3.1 Stomach cancer data set (Stomach)
4.3.1.1 Description of the Stomach data set
The Stomach Cancer data set consists of 137 records, each with 119 features. The records
fall into two classes, labelled normal (healthy patients) and cancer (patients with cancer).
4.3.1.2 Results and analysis of the experiments on the Stomach data set

Figure 4.7: Comparison of the results of 20 RF runs on the new data set and the original
data set, with 100, 300, 500, 800 and 1000 trees.

Figure 4.8: Comparison of the mean running time of 20 RF runs on the new data set and the
original data set, with 100, 300, 500, 800 and 1000 trees.

4.3.1.3 Remarks
The recognition rate of RF on the new feature set clearly increases, by about 5 percentage
points: RF averages 78% on the original features and 83% on the new ones. Training time and
testing time also drop noticeably. The higher recognition rate indicates that the new
feature set has eliminated some noisy and redundant features. The time falls because the
number of features shrinks considerably: from 119 original features to 36 after selection,
a reduction of roughly 70%. This suggests that the method proposed in the thesis is fairly
effective. However, finding the new feature set takes a comparatively long time. For the
Stomach data set it took about 20 minutes to find a better feature set, and the time grows
for larger data sets, but this cost is paid only once. Afterwards, every task that uses the
data set runs on the new feature set with a reduced computation time on every run, so the
overall working time is reduced considerably.
4.3.2 Colon cancer data set (Colon Tumor)
4.3.2.1 Data description
Colon Tumor [1] is a data set of 2000 genes selected from 6500 genes, collected from
62 cancer patients (2000 x 62).
4.3.2.2 Experimental results

Figure 4.13: Results of 20 RF runs on the Colon Tumor feature set before and after
optimization, with 100, 300 and 500 trees.

Figure 4.14: Comparison of the mean training time of 20 RF runs on the new Colon Tumor data
set and the original Colon Tumor data set, with 100, 300 and 500 trees.


Figure 4.15: Comparison of the mean testing time of 20 RF runs on the new Colon Tumor data
set and the original Colon Tumor data set, with 100, 300 and 500 trees.
4.3.2.3 Remarks
The results of 20 experimental runs on the Colon Tumor data set with 100, 300 and 500 trees
resemble those obtained on the Stomach data set. The recognition rate of RF on the new
feature set is clearly higher than on the old one, by on the order of 10 percentage points:
RF averages 78% on the original features and 89% on the new ones. The running time falls
because the number of features shrinks considerably: from 2000 original features to 600
after selection, a reduction of about 70%. This suggests that the method proposed in the
thesis is fairly effective.
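The reduction figures quoted for the two data sets follow directly from the feature counts reported above. The short check below only reproduces that arithmetic; the function name `reduction_percent` is illustrative.

```python
def reduction_percent(before, after):
    """Share of features removed, as a percentage of the original count."""
    return 100.0 * (before - after) / before


stomach = reduction_percent(119, 36)   # Stomach: 119 features reduced to 36
colon = reduction_percent(2000, 600)   # Colon Tumor: 2000 features reduced to 600
```

Both data sets therefore lose about 70% of their features, which is consistent with the shorter training and testing times reported in the figures.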
CONCLUSION
Within the scope of this thesis I studied the theoretical basis and several algorithms for
selecting suitable features by reducing the dimensionality of the data. I also concentrated
on the Random Forest algorithm and on data preprocessing methods. Based on this study I
proposed an improvement aimed at finding the smallest optimal feature set in order to
increase the effectiveness of the classification algorithm.
The experimental results on the Colon Tumor data set show that the results are fairly stable
and good. The method does, however, have the drawback of a rather long running time. If more
accurate predictions are wanted, changing some of the parameters costs even more time.
To address this limitation of the proposed machine learning method, in the near future I
will focus on studying and improving it in order to speed up the classification of the
algorithm. I will also test the method on many different data sets to assess its accuracy
and stability on each specific type of data, and study other classification methods, such
as decision trees or support vector machines (SVM), as replacements for the random forest
algorithm when evaluating the prediction results. These methods will then be compared with
one another. In this way the work may offer application developers one more option when
building applications related to data classification.
References


Vietnamese-language references


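[1] Nguyễn Hà Nam (2009), "Tối ưu hóa KPCA bằng GA để chọn các thuộc tính đặc trưng nhằm
tăng hiệu quả phân lớp của thuật toán Random Forest" [Optimizing KPCA with GA to select
characteristic features and improve the classification performance of Random Forest],
Tạp chí Khoa học ĐHQGHN, Khoa học Tự nhiên và Công nghệ, no. 25, pp. 84-93.
[2] Nguyễn Đình Thúc (2001), Lập trình tiến hóa [Evolutionary Programming], Nhà xuất bản
Giáo dục, Hà Nội.
[3] Huỳnh Phụng Toàn, Nguyễn Hữu Lâm, Nguyễn Minh Trung, Đỗ Thanh Nghị (2012), Rừng ngẫu
nhiên cải tiến cho phân loại dữ liệu gien [An improved random forest for gene data
classification], Tạp chí Khoa học Đại học Cần Thơ 2012:22b 9-17, Cần Thơ.
[4] Nguyễn Văn Tuấn (2007), Phân tích số liệu và tạo biểu đồ bằng R - Hướng dẫn thực hành
[Data analysis and charting with R: a practical guide], NXB KHKT, Hà Nội.
English-language references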
[5] Blum, A. L. and Langley, P. (1997), Selection of Relevant Features and Examples in
Machine Learning, Artificial Intelligence, pp. 245-271.
[6] L. Breiman (2002), Manual On Setting Up, Using, And Understanding Random Forests
V3.1, Available:
http://oz.berkeley.edu/users/breiman/Using_random_forests_V3.1.pdf
[7] L. Breiman (2001), "Random Forests", Machine Learning, vol. 45.
[8] L. Breiman, A. Cutler, Random Forests, Available:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[9] R. O. Duda, P. E. Hart, D. G. Stork (2001), Pattern Classification (2nd Edition), John
Wiley & Sons Inc.
[10] I. H. Witten, E. Frank (2005), Data Mining: Practical Machine Learning Tools and
Techniques, 2nd ed., Morgan Kaufmann Publishers.
[11] Isabelle Guyon (2006), Feature Selection, pp. 12-30.
[12] J. Han, M. Kamber (2006), Data Mining: Concepts and Techniques, 2nd ed.,
Morgan Kaufmann.
[13] Jacek Jarmulak and Susan Craw (1999), Genetic Algorithms for Feature Selection and
Weighting, IJCAI 99 workshop.
[14] Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski, Lukasz A. Kurgan (2007),
Data Mining: A Knowledge Discovery Approach, Springer.
[15] YongSeog Kim and Filippo Menczer (2005), Feature Selection in Data Mining.
[16] Ron Kohavi and George H. John (1996), Wrappers for Feature Subset Selection, AIJ
special issue on relevance.
[17] Huan Liu and Hiroshi Motoda (2008), Computational Methods of Feature Selection,
Chapman & Hall/CRC.


[18] F. Livingston (2005), "Implementation of Breiman's Random Forest Machine Learning
Algorithm", Machine Learning Journal Paper.
[19] Luis Carlos Molina et al. (2000), Feature Selection for Algorithms: A Survey and
Experimental Evaluation.
[20] Ha Nam Nguyen, Syng Yup Ohn (2005), A Learning Algorithm based for Searching
Optimal Combined Kernel Function in Support Vector Machine.
[21] Sancho Salcedo Sanz et al. (2000), Feature Selection via Genetic Optimization.
[22] Padhraic Smyth (2007), Cross-Validation Methods, CS 175, Fall.
[23] P. Spector (2008), Data Manipulation with R, Springer.
[24] M. G. Dan Steinberg, N. Scott Cardell (2004), A Brief Overview to Random Forests,
Salford Systems.
[25] Taylor & Francis Group, Computational Methods of Feature Selection, LLC
CRC Press.
[26] L. Torgo (2003), Data Mining with R: learning by case studies, LIACC-FEP.
[27] Lipo Wang, Xiuju Fu (2005), Data Mining with Computational Intelligence, Springer.
[28] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi
Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael
Steinbach, David J. Hand, Dan Steinberg (2009), The Top Ten Algorithms in Data Mining,
Chapman & Hall/CRC.
[29] X. Su, Bagging and Random Forests, Available:
http://pegasus.cc.ucf.edu/~xsu/CLASS/STA5703/notes11.pdf
[30] Jihoon Yang and Vasant Honavar, Feature Subset Selection Using a Genetic Algorithm,
Artificial Intelligence Research Group.
[31] Dataset Available (2003): http://www.nipsfsc.ecs.soton.ac.uk/datasets/
[32] Genetic Algorithm: http://www.cs.rutgers.edu/~mlittman/courses/ml04/
