two stages: before and after feature selection. The second is to compare two feature
selection algorithms [3].
In summary, feature selection can be viewed as the combination of three main components:
search, evaluation, and model selection.
1.4 Some feature selection algorithms
Feature selection algorithms are examined here according to the search strategy they
use: exhaustive search, heuristic search, and probabilistic search. In addition, we study
several other approaches: the feature weighting method, the hybrid method, and the
incremental method.
1.4.1 Exhaustive search
a. The Focus method
b. The AAB method
1.4.2 Heuristic search
1.4.3 Probabilistic search
(a). The LVF method
(b). The LVF method
1.4.4. Feature weighting method
1.4.5. Hybrid method
1.4.6. Incremental method
Chapter 2: THE RANDOM FOREST ALGORITHM AND THE GENETIC ALGORITHM
2.1 Introduction to the Random Forest algorithm
Random Forest is a classification method developed by Leo Breiman at the University of
California, Berkeley. Breiman is also a co-author of CART (Classification and Regression
Trees), which is regarded as one of the 10 classic data mining methods. Random Forest is
built on three main components: (1) CART, (2) ensemble learning (committees of experts,
combining models), and (3) bootstrap aggregation (bagging). Figure 3.1 [30] below
illustrates the random forest classification method.
2.2 Bootstrap and Bagging
2.2.1 Bootstrap
Bootstrap is a well-known statistical method introduced by Bradley Efron in 1979. It is
mainly used to estimate standard errors, bias, and confidence intervals for parameters.
The method works as follows: from an initial population, draw a sample L = (x1, x2, ..., xn)
of n elements and compute the desired parameters. Then repeat b times: create a sample Lb,
also of n elements, by resampling from L with replacement, and compute the desired
parameters on Lb.
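The resampling procedure above can be sketched in a few lines. This is a minimal illustration, not code from the thesis: the function name, the sample values, and the default of 1000 replicates are all assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_standard_error(sample, statistic, b=1000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(b):
        # Build L_b: n elements drawn from L with replacement.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    # The spread of the b replicates estimates the standard error.
    return statistics.stdev(replicates)


# Illustrative values only (not data from the thesis).
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.8, 2.7, 4.9]
se = bootstrap_standard_error(sample, statistics.mean)
print(f"bootstrap standard error of the mean: {se:.3f}")
```

Any other statistic (median, variance) can be passed in place of `statistics.mean`; the resampling loop stays the same.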
2.2.2. Bagging
Bagging can be seen as a way of aggregating the results obtained from bootstrap samples.
Its main idea is as follows. Given a training set D = {(xi, yi): i = 1, 2, ..., n}, suppose
we want a prediction for the variable x.
- Draw B data sets, each consisting of n elements chosen at random from D with
replacement (as in bootstrap). The collection B = (D1, D2, ..., DB) then looks like a set of
replicated training sets;
- Train a machine or model on each set Db (b = 1, 2, ..., B) and collect the predictions
obtained from each Db in turn;
- Compute the final aggregated result by averaging (regression) or by majority vote
(classification).
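The three bagging steps can be sketched as follows. The base learner here is a one-threshold decision stump on a one-dimensional input, chosen only to keep the example short; it is an assumption, as are the function names and the toy data.

```python
import random
from collections import Counter


def fit_stump(data):
    """Hypothetical base learner: the best single threshold on a 1-D input."""
    best = None
    for threshold, _ in data:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def bagging_fit(data, b=25, seed=0):
    rng = random.Random(seed)
    n = len(data)
    models = []
    for _ in range(b):
        # D_b: n pairs drawn from D with replacement.
        d_b = [data[rng.randrange(n)] for _ in range(n)]
        models.append(fit_stump(d_b))
    return models


def bagging_predict(models, x):
    # Classification: majority vote over the B models.
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy training set D of (x, y) pairs, for illustration only.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 0.7), bagging_predict(models, 4.2))
```

For regression, the vote in `bagging_predict` would be replaced by the mean of the B predictions.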
2.3 Random forest
The RF algorithm for classification can be summarized as follows:
- Draw K bootstrap samples from the training set.
- For each bootstrap sample, grow an unpruned classification tree with the following
modification: at each node, instead of choosing the best split among all predictor
variables, randomly select a sample of m predictor variables and choose the best
split among those variables only.
- Make predictions by aggregating the predictions of the K trees.
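The node-level step that distinguishes RF from plain bagging can be sketched as below. This is a simplified illustration under stated assumptions: Gini impurity is used as the split criterion (as in CART), and the function names and toy data are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_random_subset(rows, labels, m, rng):
    """Pick m predictors at random, then find the best split among them only."""
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), m)
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send records to both children
            # Weighted impurity of the two children; lower is better.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best, candidates


# Toy data: only feature 0 separates the two classes cleanly.
rows = [
    [1.0, 5.0, 0.3, 7.0],
    [2.0, 3.0, 0.9, 7.5],
    [3.0, 4.0, 0.1, 6.0],
    [7.0, 5.5, 0.8, 6.5],
    [8.0, 3.5, 0.2, 7.2],
    [9.0, 4.5, 0.7, 6.8],
]
labels = [0, 0, 0, 1, 1, 1]
best, candidates = best_split_random_subset(rows, labels, m=2, rng=random.Random(0))
print(candidates, best)
```

When the informative feature is not among the m candidates, the node settles for a weaker split; this is the mechanism that decorrelates the trees.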
The learning process of Random Forest uses randomly chosen input values, or combinations
of values, at each node while growing each decision tree. Experiments show that Random
Forest gives better results than the Adaboost algorithm.
Random Forest has several strong properties:
(1) Its accuracy is comparable to Adaboost, and in some cases better.
(2) The algorithm handles problems with a large amount of noisy data well.
(3) The algorithm runs faster than bagging or boosting.
(4) It provides internal estimates, such as the accuracy of the predictive model or the
strength of, and correlation between, attributes.
(5) It is easy to parallelize.
(6) However, achieving the strengths above comes at a cost: the algorithm's execution time
is fairly long and it uses a large share of system resources.
From this study of the RF algorithm, we observe that RF is a good classification method
because: (1) in RF the variance is reduced, since the result is aggregated over many
learners; (2) the random selection at each step reduces the correlation between the
learners whose results are aggregated.
In addition, the overall error of a forest of classification trees depends on the
individual error of each tree in the forest as well as on the correlation between the trees.
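Both observations can be read off a standard identity for the variance of an average, which is not stated in the thesis and is added here only as an illustration. Assume B identically distributed learners T_1, ..., T_B, each with variance \sigma^2 and pairwise correlation \rho (all symbols are notation assumed for this example):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as more learners are aggregated, which corresponds to point (1); the first term does not, so lowering the correlation \rho between trees, as in point (2), is what reduces the remaining variance.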
2.4 Some characteristics of RF
2.4.1 OOB
When the sample for a tree is drawn from the training set with replacement (bagging),
roughly 1/3 of the elements are estimated to be absent from that sample [7]. In other
words, only about 2/3 of the elements of the training set take part in building the tree,
and the remaining 1/3 are called out-of-bag (OOB) data. Out-of-bag data is used to estimate
the error of the combined predictions of the trees in the random forest, and also to
estimate variable importance.
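The "about 1/3" figure follows from the probability that a given element is missed by all n draws, (1 - 1/n)^n, which tends to 1/e (about 0.368) as n grows. The short simulation below is an illustrative sketch; the function name and the choice of n are assumptions.

```python
import random


def oob_fraction(n, seed=0):
    """Fraction of n elements never drawn in one bootstrap sample of size n."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1 - len(in_bag) / n


n = 10_000
simulated = oob_fraction(n)
theoretical = (1 - 1 / n) ** n
print(f"simulated: {simulated:.3f}, theoretical: {theoretical:.3f}")
```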
2.4.2 Variable importance
Computing variable importance in RF is very similar to using OOB data to compute the
error in RF. The procedure is as follows. Suppose we need to determine the importance of
the m-th attribute. First compute ROOB, the number of correct predictions on the untouched
OOB data. Then randomly permute the values of attribute m in the OOB data, pass the
permuted records down the tree, and count the number of correct predictions; this count
for the attribute is called Rperm.
The importance of the attribute is computed from the difference ROOB - Rperm.
If the importance values on the individual trees are independent, we can compute the
standard error of ROOB - Rperm.
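For a single tree, the permutation step can be sketched as follows. The `predict` callable stands in for a trained tree, and the function name and toy OOB records are hypothetical; a full implementation would repeat this over every tree and combine the per-tree differences.

```python
import random


def permutation_importance(predict, oob_rows, oob_labels, m, seed=0):
    """Return R_OOB - R_perm for attribute m on one tree's OOB data."""
    rng = random.Random(seed)
    # R_OOB: correct predictions on the untouched OOB records.
    r_oob = sum(predict(row) == y for row, y in zip(oob_rows, oob_labels))
    # Randomly permute the values of attribute m across the OOB records.
    column = [row[m] for row in oob_rows]
    rng.shuffle(column)
    permuted = [row[:m] + [v] + row[m + 1:] for row, v in zip(oob_rows, column)]
    # R_perm: correct predictions after the permutation.
    r_perm = sum(predict(row) == y for row, y in zip(permuted, oob_labels))
    return r_oob - r_perm


def toy_tree(row):
    # Stand-in for a trained tree: it looks only at attribute 0.
    return 1 if row[0] > 5 else 0


oob_rows = [[1, 9], [2, 1], [3, 8], [7, 2], [8, 7], [9, 3]]
oob_labels = [0, 0, 0, 1, 1, 1]
print(permutation_importance(toy_tree, oob_rows, oob_labels, m=0))
print(permutation_importance(toy_tree, oob_rows, oob_labels, m=1))
```

Attribute 1 is never used by the stand-in tree, so permuting it cannot change any prediction and its importance is zero.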
2.5 Genetic algorithm
The genetic algorithm [2, 32] is a stochastic optimization algorithm based on the
mechanisms of natural selection and genetic evolution. It was first applied in two main
areas: optimization and machine learning. In optimization, the genetic algorithm developed
rapidly and has been applied in many fields, such as function optimization, image
processing, the traveling salesman problem, system identification, and control. The genetic
algorithm, like evolutionary algorithms in general, is founded on the notion that natural
evolution is the most perfect and most reasonable process and is in itself optimal. This
notion can be treated as an axiom: it cannot be proven, but it is consistent with objective
reality. The optimality of evolution shows in the fact that each generation is better
(more developed, more complete) than the previous one, thanks to inheritance and the
struggle for survival [2].
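A generic genetic algorithm over bit strings can be sketched as follows. This is a textbook-style illustration, not the thesis's method: tournament selection, one-point crossover, the parameter defaults, and the OneMax fitness used in the example are all assumptions.

```python
import random


def genetic_algorithm(fitness, n_bits, pop_size=20, generations=30,
                      p_crossover=0.8, p_mutation=0.02, seed=0):
    rng = random.Random(seed)
    # Random initial population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = [population[0][:]]  # elitism: carry the best individual over
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            child = p1[:]
            if rng.random() < p_crossover:
                cut = rng.randrange(1, n_bits)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a low probability.
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, history


# OneMax: fitness is the number of 1 bits (illustrative objective only).
best, history = genetic_algorithm(sum, n_bits=16)
print(history[0], sum(best))
```

Without the elitism line the best fitness can drop from one generation to the next; keeping the best individual is what guarantees that `history` never decreases in this sketch.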
CHAPTER 3: PROPOSED METHOD
3.1 Introduction
This chapter presents a machine learning method for finding an optimal attribute subset
from the attribute set of a given data set. The method uses a genetic-like algorithm
(modeled on the Genetic Algorithm) combined with the Random Forest algorithm.
The chapter describes the proposed method as a wrapper approach for finding the optimal
attributes and removing redundant ones.
3.2 Rationale
The operators of a genetic algorithm are all random in nature: the algorithm must fix a
population size, initialize the population at random, and fix a crossover probability and a
mutation probability. The mutation probability must be low. To overcome the limitations
of random selection, the thesis proposes the following:
- Generate attribute subsets from the original attribute set by combining random
selection with an even distribution of the attributes.
- Do not use crossover or mutation to create new attribute subsets; instead, evaluate
the attribute subsets and, based on that evaluation, select the attributes with high
fitness.
- Use a learning algorithm as the criterion for evaluating the fitness of the attribute subsets.
The proposed method is presented in detail in the next section.
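One way to combine random selection with an even distribution of attributes is sketched below. The thesis does not spell out its exact procedure in this section, so this is only one plausible reading: attributes are shuffled and cut into consecutive chunks, so membership is random but every attribute appears exactly once per round. The function name and parameters are hypothetical.

```python
import random
from collections import Counter


def balanced_random_subsets(n_attributes, subset_size, rounds=1, seed=0):
    """Random attribute subsets in which every attribute is used once per round."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(rounds):
        order = list(range(n_attributes))
        rng.shuffle(order)  # random membership
        # Consecutive chunks of the shuffled order cover each attribute once.
        for start in range(0, n_attributes, subset_size):
            subsets.append(sorted(order[start:start + subset_size]))
    return subsets


subsets = balanced_random_subsets(n_attributes=12, subset_size=4, rounds=3)
usage = Counter(a for subset in subsets for a in subset)
print(len(subsets), dict(usage))
```

Each subset would then be scored by training a classifier on only its columns, as the third item of the proposal describes.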
3.3 System architecture
Return to Step 1. Stopping conditions: a) the number of attributes < the allowed threshold;
b) a fixed number of iterations has been reached.
4.5 System operation
The original data set is randomly split into two sets: a training set and a test set.
a. The training set is passed through Part 1
The training data is a table of size m x n, where n is the number of original attributes and
m is the number of records. When this table is passed through Part 1:
Step 1: the proposed method generates sub-tables of size m x ki, where ki is the number of
columns (attributes) of the i-th sub-table (i = 1, 2, ...) and ki < n. Each sub-table is a
subset of the attributes of the original data set.
Step 2: evaluate the fitness of each new attribute subset by applying the random forest
machine learning algorithm.
Step 3: then, for each attribute of the original attribute set, compute the fitness (weight)
w of that attribute by the formula:
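A random split of this kind can be sketched as follows. The 30% test ratio is an assumption made for the example, since this section does not state the ratio used; the function name is also hypothetical.

```python
import random


def train_test_split(records, test_ratio=0.3, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(137))  # record indices standing in for real rows
train, test = train_test_split(records)
print(len(train), len(test))
```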
RF_Run <- function(train, test, TreeNum)
fitness <- function(inpData, CrV, ind)
proccess <- function(InpData, m, p)
RF_Proccess <- function(train, test, TreeNum, RunNum)
- Compute the mean, maximum, minimum, standard deviation, and average running time of the
runs on the 2 data sets, and plot comparison charts.
- Write the results to the Output file.
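The per-run summary can be sketched with the Python standard library. The thesis's own functions are written in R and their bodies are not shown here, so this is an independent illustration; the function name and the accuracy values are hypothetical.

```python
import statistics


def summarize_runs(values):
    """Mean, maximum, minimum, and sample standard deviation of repeated runs."""
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
        "stdev": statistics.stdev(values),
    }


# Hypothetical accuracies from three runs (not results from the thesis).
accuracies = [0.80, 0.82, 0.84]
summary = summarize_runs(accuracies)
print(summary)
```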
4.3 Experimental results
4.3.1 Stomach cancer data set (Stomach)
4.3.1.1 Description of the Stomach data set
The Stomach Cancer data set consists of 137 records, each with 119 attributes. The records
are divided into two classes, labeled normal (healthy patients) and cancer (patients with
cancer).
4.3.1.2 Results and analysis of the experiments on the Stomach data set
Figure 4.8 Comparison of the average running time of 20 RF runs on the new data set and on
the original data set, with 100, 300, 500, 800, and 1000 trees.
4.3.1.3 Remarks
The recognition rate of RF on the new attribute set rises clearly, by about 5 percentage
points: on average, RF on the original attributes reaches 78%, while RF on the new
attributes reaches 83%. Training time and test time both drop considerably as well. The
higher recognition rate shows that the new attribute set has removed some noisy and
redundant attributes. The time drops because the number of attributes falls substantially:
from 119 original attributes to 36 after selection, a reduction of about 69% of the
original attributes. This shows that the experimental method proposed in the thesis is
fairly effective. However, finding a new attribute set takes a fairly long time. For the
Stomach data set it takes about 20 minutes to find a better attribute set, and the time
grows for larger data sets, but this cost is paid only once. After that, every task that
uses this data set can run on the new attribute set and saves computation time on every
run, so the overall working time is reduced considerably.
4.3.2 Colon cancer data set (Colon Tumor)
4.3.2.1 Data description
Colon Tumor [1] is a data set of 2000 genes selected from 6500 genes, collected from 62
cancer patients (2000 x 62).
4.3.2.2 Experimental results
Figure 4.13 Results of 20 RF runs on the original and the optimized Colon Tumor attribute
sets, with 100, 300, and 500 trees.
Figure 4.14 Comparison of the average training time of 20 RF runs on the new Colon Tumor
data set and on the original Colon Tumor data set, with 100, 300, and 500 trees.
Figure 4.15 Comparison of the average test time of 20 RF runs on the new Colon Tumor data
set and on the original Colon Tumor data set, with 100, 300, and 500 trees.
4.3.2.3 Remarks
The results of 20 runs on the Colon Tumor data set with 100, 300, and 500 trees are
consistent with the results on the Stomach data set. The recognition rate of RF on the new
attribute set is clearly higher than on the old one, by about 10 percentage points: on
average, RF on the original attributes reaches 78%, while RF on the new attributes reaches
89%. The running time drops because the number of attributes falls substantially: from
2000 original attributes to 600 after selection, a reduction of about 70% of the original
attributes. This shows that the experimental method proposed in the thesis is fairly
effective.
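The reduction percentages quoted for the two data sets can be checked with a short calculation. The function name is hypothetical; the attribute counts are the ones reported above.

```python
def reduction_percent(before, after):
    """Share of attributes removed, as a percentage of the original count."""
    return 100 * (before - after) / before


stomach = reduction_percent(119, 36)   # Stomach: 119 -> 36 attributes
colon = reduction_percent(2000, 600)   # Colon Tumor: 2000 -> 600 attributes
print(f"Stomach: {stomach:.1f}%, Colon Tumor: {colon:.1f}%")
```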
CONCLUSION
Within the scope of this thesis, I studied the theoretical foundations and several
algorithms for selecting suitable attributes by reducing the dimensionality of the data. I
also focused on the Random Forest algorithm and on data preprocessing methods. Based on
this study, I proposed an improvement aimed at finding the smallest optimal attribute
subset in order to increase the effectiveness of the classification algorithm.
The experimental results on the Colon Tumor data set are fairly stable and good. However,
the method has the drawback that the program takes rather long to run. If more accurate
predictions are required, changing some of the parameters costs even more time.
To address the limitations of the proposed machine learning method, in the near future I
will focus on studying and improving the algorithm to speed up classification. I will also
test the method on a variety of data sets to assess its accuracy and stability on each
specific kind of data. I will study other classification methods, such as decision trees
or support vector machines (SVM), as replacements for the random forest algorithm when
evaluating prediction results, and then compare these methods with one another. In this
way, the work may offer one more option to developers building applications that involve
data classification.
References