In practice, there are many situations in which we cannot use the available raw material directly: we must transform and pre-process it so that it becomes better suited to the goal we want to reach.

[Diagram: raw material (the original data) enters a model, Output Y ~ (A+B+C+D+E+F); training the model yields the model results that serve the stated goal]
Objectives
Skills: basic operations on data, data splitting, model training, model exploration, and model validation with caret.

A. Independent of the training process: when your aim is only to transform the data for simple statistical analyses (ANOVA, t-test, correlation).
B. Combined with the training and validation workflow of a machine-learning algorithm.

Within each approach there are several possible directions.

As for methods: caret supports up to 15 data transformation methods, divided into 5 main groups.
1. General principles

General principles of the data transformation workflow with caret:

[Diagram: the original data go into preProcess, which builds the transformation function; predict then applies that function, and the output is the transformed data. If the transformation is applied separately to individual variables or groups of variables, the separately transformed results must afterwards be recombined into a new dataset.]

This workflow runs implicitly: it is applied to the whole training set and is carried through the validation step as well (during resampling or confusionMatrix). The original dataset and the Train and Test sets themselves are preserved.
[Diagram: three entry points for the transformation — (a) exploratory transformation of the whole dataset; (b) transformation of the training set, with the test set transformed afterwards for validation; (c) transformation built into training the model, then applied during validation.]
(a) Transform the data BEFORE training, either for the whole dataset or for individual variables.
(b) Transform the data BEFORE training, but only for the training set; then apply the same transformation function to the test set before validation.
(c) Transform the data DURING training and validation: the transformation is fitted on the whole training set, then applied to the Test set during validation.

Hint: each approach has its own strengths and weaknesses:
(a) lets you explore the methods individually, variable by variable, but does not show their consequences for the model;
(b) lets you keep the transformed data, but takes time and does not allow transforming individual variables separately;
(c) is fast and lets you check the effect of the transformation on model quality, but runs implicitly and is necessarily applied to the whole Train set.
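Approach (b) — learn the transformation on the training set only, then reuse the fitted function on the test set — can be sketched outside caret as well. A minimal illustration with scikit-learn on synthetic data (the PowerTransformer, sample sizes and seed are our illustrative choices, not from the lecture):

```python
# Approach (b): fit the transformation on the training set ONLY, then apply
# the same fitted function to the test set before validation.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
train = rng.lognormal(mean=0.0, sigma=1.0, size=(300, 1))  # skewed predictor
test = rng.lognormal(mean=0.0, sigma=1.0, size=(92, 1))

pt = PowerTransformer(method="yeo-johnson", standardize=True)
pt.fit(train)                  # lambda is estimated from the training set only
train_t = pt.transform(train)  # training set: mean 0, sd 1 by construction
test_t = pt.transform(test)    # test set reuses the training lambda

print(train_t.mean(), test_t.mean())  # test mean is close to, but not exactly, 0
```

The key point mirrors caret's behaviour: the test set is never used to estimate the transformation parameters, which avoids information leaking from test to train.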
2. Rules of precedence

The order in which caret applies the data transformation functions.

Notes:
This order is mandatory and fixed for the combined workflow (per the preProcess documentation, the operations always run in a fixed sequence: variance filters, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA, then spatial sign).
PCA and ICA require the data to be rescaled and centred; even if you do not declare the center and scale methods, caret performs them automatically.
The Box-Cox transformation was devised in 1964 by two British statisticians, George E. P. Box (1919-2013) and David Cox (b. 1924) (Journal of the Royal Statistical Society, Series B (1964) 26, 211-252).
Its original application was to transform the outcome variable, but it was later applied to predictors as well.
The purpose of the Box-Cox transformation is to correct abnormal skewness and bring the data close to a normal distribution.
In essence the Box-Cox function is a power function; depending on the value of lambda it is equivalent to the usual transformations such as powers, roots, reciprocals or the logarithm:

    y(λ) = (y^λ − 1)/λ   if λ ≠ 0
    y(λ) = log(y)        if λ = 0

The only drawback of the Box-Cox function is that it does not accept non-positive values (y must be strictly positive). When your data contain negative or zero values, the more suitable method is Yeo-Johnson.
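The two-case definition above can be written out directly. A minimal Python sketch (the helper name box_cox is ours), which can be checked against SciPy's reference implementation:

```python
# A hand-written version of the Box-Cox formula, to make the two cases concrete.
import numpy as np
from scipy import stats

def box_cox(y, lam):
    """Box-Cox transform: (y^lam - 1)/lam if lam != 0, log(y) if lam == 0.
    Requires strictly positive y."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

y = np.array([1.0, 2.0, 5.0, 10.0, 50.0])
print(box_cox(y, 0))    # identical to np.log(y)
print(box_cox(y, 0.5))  # identical to stats.boxcox(y, lmbda=0.5)
```

Setting lambda to 0, 0.5, −1 or 1 recovers the log, square-root-like, reciprocal-like and identity-shift transforms respectively, which is exactly the "family of usual transformations" described above.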
The Yeo-Johnson transformation was established in 2000 by In-Kwon Yeo (Korea) and Richard A. Johnson (USA) (Biometrika Vol. 87, No. 4 (Dec., 2000), pp. 954-959).
Purpose: reduce skewness and approximate a normal distribution.
Content: Yeo-Johnson is an upgrade of the Box-Cox method that also accepts zero and negative values:

    ψ(y, λ) = ((y + 1)^λ − 1)/λ                 if λ ≠ 0, y ≥ 0
    ψ(y, λ) = log(y + 1)                        if λ = 0, y ≥ 0
    ψ(y, λ) = −((−y + 1)^(2−λ) − 1)/(2 − λ)     if λ ≠ 2, y < 0
    ψ(y, λ) = −log(−y + 1)                      if λ = 2, y < 0
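Likewise, the four cases can be coded directly and compared against SciPy's scipy.stats.yeojohnson (the function name yeo_johnson is ours; the fixed lambda = 0.5 is just an example):

```python
# A hand-written Yeo-Johnson transform covering all four cases of the formula.
import numpy as np
from scipy import stats

def yeo_johnson(y, lam):
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    if lam != 0:
        out[pos] = ((y[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(y[pos])          # log(y + 1)
    if lam != 2:
        out[~pos] = -(((-y[~pos] + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-y[~pos])      # -log(-y + 1)
    return out

y = np.array([-3.0, -0.5, 0.0, 1.5, 10.0])   # note: negative values are allowed
print(yeo_johnson(y, 0.5))
```

Unlike Box-Cox, the input may contain zeros and negative values, which is precisely why caret falls back on Yeo-Johnson for such variables.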
knnImpute: imputes missing data by identifying the nearest neighbours (based on Euclidean distance).
bagImpute: imputes missing data from the output of bagged tree models (predicting the value of one variable from the other variables).
These two methods are more accurate, but cost more computing time and memory.
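The idea behind knnImpute can be illustrated outside caret with scikit-learn's KNNImputer (a toy 4×2 matrix; the numbers are made up for the illustration):

```python
# Fill a missing value with the mean of its k nearest neighbours, where
# distance is (NaN-aware) Euclidean distance on the observed features.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # the value to impute
              [3.0, 6.0],
              [4.0, 8.0]])

imp = KNNImputer(n_neighbors=2)
X_filled = imp.fit_transform(X)
print(X_filled[1, 1])  # mean of the 2 nearest rows' second column: (2 + 6) / 2 = 4.0
```

The nearest neighbours of the incomplete row are found using only the features that are observed in both rows, then the missing cell is filled from those neighbours' values.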
Worked example
1. Data

pima=read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data",
                sep=",", na.strings="0.0", strip.white=TRUE, fill=TRUE)
pima$V9=factor(pima$V9, levels=c(1,0), labels=c("Benhly","Binhthuong"))
pima=subset(pima, V2>0 & V3>0 & V4>0 & V5>0)
pima$V2=as.numeric(pima$V2)
pima$V3=as.numeric(pima$V3)
pima$V4=as.numeric(pima$V4)
pima$V5=as.numeric(pima$V5)
pima$V8=as.numeric(pima$V8)
names(pima)=c("pregnant","plasmaglu","DBP","skinfold","seruminsulin","BMI","pedigreefunction","age","chandoan")
data0=na.omit(pima)

Note: some cases in this dataset contain the value 0 or 0.0 where that is biologically impossible, so we simply treat them as missing values and remove them. In the end 392 cases remain.
head(data0)

   pregnant plasmaglu DBP skinfold seruminsulin  BMI pedigreefunction age   chandoan
4         1        89  66       23           94 28.1            0.167  21 Binhthuong
5         0       137  40       35          168 43.1            2.288  33     Benhly
7         3        78  50       32           88 31.0            0.248  26     Benhly
9         2       197  70       45          543 30.5            0.158  53     Benhly
14        1       189  60       23          846 30.1            0.398  59     Benhly
15        5       166  72       19          175 25.8            0.587  51     Benhly
Worked example
2. Visual exploration

library(caret)
featurePlot(x = data0[,1:8],
            y = data0$chandoan,
            plot = "density",
            scales = list(x = list(relation="free"),
                          y = list(relation="free")),
            adjust = 1,
            pch = "|",
            layout = c(4,2),
            auto.key = list(columns = 2))

The density version of caret's featurePlot lets you visually inspect the distribution of every predictor within each level of the outcome variable. The plot reveals the variables with abnormal distributions; in this example they are pregnant, serum insulin, age and pedigree function. We pick serum insulin as the illustrative example: its distribution is strongly skewed.
featurePlot(x = data0[,1:8],
            y = data0$chandoan,
            plot = "box",
            scales = list(y = list(relation="free"),
                          x = list()),
            layout = c(4,2),
            auto.key = list(columns = 2))

The boxplot version of featurePlot, in turn, compares predictor values between the two outcome groups visually. It also exposes outlier problems: the variable chosen as our example has many outliers.
3. Independent, individual transformation

insulin=as.data.frame(data0[,5])
names(insulin)="original"
cen=preProcess(insulin,method="center")
scal=preProcess(insulin,method="scale")
bct=preProcess(insulin,method="BoxCox")
YJ=preProcess(insulin,method="YeoJohnson")
expo=preProcess(insulin,method="expoTrans")
ran=preProcess(insulin,method="range")
yjsc=preProcess(insulin,method=c("YeoJohnson","scale"))
yjcensc=preProcess(insulin,method=c("YeoJohnson","center","scale"))
bccent=preProcess(insulin,method=c("BoxCox","center"))

Syntax:
transformation object <- preProcess(subset to be transformed, method = c("A","B", ...))
where A, B, ... are the names of transformation methods, among:
"BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute",
"bagImpute", "medianImpute", "pca", "ica", "spatialSign", "zv", "nzv" and
"conditionalX"

yjcensc
Created from 392 samples and 1 variables

Pre-processing:
  - centered (1)
  - ignored (0)
  - scaled (1)
  - Yeo-Johnson transformation (1)

Lambda estimates for Yeo-Johnson transformation:
0.03
3. Independent, individual transformation

Note: to compare all versions visually, the original variable and each transformed version are gathered into a single data frame. Each transformed column comes from predict() on the corresponding preProcess object (only the Box-Cox column is shown; the others are added the same way):

insulin2=insulin
insulin2$BoxCox=predict(bct,insulin)$original  # likewise for cen, scal, YJ, expo, ran, yjsc, yjcensc, bccent
insulin2$class=data0$chandoan
featurePlot(x = insulin2[, sapply(insulin2, is.numeric)],
            y = insulin2$class,
            plot = "density",
            scales = list(x = list(relation="free"),
                          y = list(relation="free")),
            adjust = 1,
            pch = "|",
            layout = c(3,4),
            auto.key = list(columns = 2))
Remarks. In this example:
- The Box-Cox lambda is 0, so it is exactly the logarithmic transformation.
- Insulin only takes values > 0, so Box-Cox, Yeo-Johnson and the logarithm produce identical distributions; they improve the skewness of insulin very well, giving a nearly normal distribution (YJ and Box-Cox differ only in scale).
- Applying scale, center or range on top changes nothing about the shape of the distribution; these only change the measurement scale.
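The remark that center/scale leave the shape untouched while Box-Cox removes the skewness can be checked numerically. A sketch with synthetic lognormal data standing in for insulin (sample size and seed are arbitrary):

```python
# Skewness is invariant under linear maps (center + scale) but is removed
# by an appropriate Box-Cox transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(size=5000)        # strongly right-skewed, like serum insulin
z = (x - x.mean()) / x.std()        # center + scale: a linear map
bc, lam = stats.boxcox(x)           # Box-Cox with maximum-likelihood lambda

# Linear maps preserve skewness; Box-Cox (here lam ~ 0, i.e. ~log) removes it.
print(stats.skew(x), stats.skew(z), stats.skew(bc), lam)
```

For lognormal data the estimated lambda is close to 0, reproducing the lecture's observation that Box-Cox reduces to the log transform here.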
featurePlot(x = insulin2[,1:11],
            y = insulin2$class,
            plot = "box",
            scales = list(y = list(relation="free"),
                          x = list()),
            layout = c(4,3),
            auto.key = list(columns = 2))
library(fastICA)
ICAT=preProcess(data0[,-9],method=c("center", "scale", "ica"), n.comp=5)
ICAdata=predict(ICAT,newdata=data0[,-9])
ICAdata$chandoan=data0$chandoan
PCAT=preProcess(data0[,-9],method="pca")
PCAdata=predict(PCAT,newdata=data0[,-9])
PCAdata$chandoan=data0$chandoan

PCAT          # 7 principal components
Created from 392 samples and 8 variables

Pre-processing:
  - centered (8)
  - ignored (0)
  - principal component signal extraction (8)
  - scaled (8)

PCA needed 7 components to capture 95 percent of the variance

ICAT          # 5 independent components
Created from 392 samples and 8 variables

Pre-processing:
  - centered (8)
  - independent component signal extraction (8)
  - ignored (0)
  - scaled (8)

ICA used 5 components
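The same center → scale → PCA/ICA pipeline can be mirrored in scikit-learn; a sketch on synthetic correlated data (the shapes echo the 392 × 8 dataset; the n_components choices are illustrative):

```python
# PCA keeping enough components for 95% of the variance, plus a 5-component ICA,
# both on centred and scaled data (as caret enforces automatically).
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(392, 8)) @ rng.normal(size=(8, 8))  # 8 correlated predictors

Xs = StandardScaler().fit_transform(X)      # center + scale, required by PCA/ICA
pca = PCA(n_components=0.95).fit(Xs)        # keep PCs covering 95% of variance
scores = pca.transform(Xs)
ica = FastICA(n_components=5, random_state=0, max_iter=1000).fit_transform(Xs)

print(pca.n_components_, scores.shape, ica.shape)
```

Passing a fraction to PCA(n_components=...) selects the smallest number of components reaching that variance threshold, the same criterion caret reports as "PCA needed 7 components to capture 95 percent of the variance".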
glm
Generalized Linear Model with Stepwise Feature Selection

300 samples
  8 predictor
  2 classes: 'Benhly', 'Binhthuong'

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 240, 240, 240, 240, 240, 240, ...
Resampling results:

  logLoss    ROC     Accuracy   Kappa      Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Detection_Rate  Balanced_Accuracy
  0.4613382  0.8613  0.7946667  0.5246114  0.64         0.872        0.7147577       0.830181        0.2133333       0.756
glmbct
Generalized Linear Model with Stepwise Feature Selection

300 samples
  8 predictor
  2 classes: 'Benhly', 'Binhthuong'

Pre-processing: Box-Cox transformation (7)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 240, 240, 240, 240, 240, 240, ...
Resampling results:

  logLoss    ROC     Accuracy  Kappa      Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Detection_Rate  Balanced_Accuracy
  0.4348474  0.8717  0.804     0.5480141  0.658        0.877        0.7312402       0.8375498       0.2193333       0.7675
glmPCA
Generalized Linear Model with Stepwise Feature Selection

300 samples
  8 predictor
  2 classes: 'Benhly', 'Binhthuong'

Pre-processing: principal component signal extraction (8), centered (8), scaled (8)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 240, 240, 240, 240, 240, 240, ...
Resampling results:

  logLoss   ROC     Accuracy  Kappa      Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Detection_Rate  Balanced_Accuracy
  0.443819  0.8712  0.804     0.5482512  0.656        0.878        0.7348954       0.836638        0.2186667       0.767
glmICA
Generalized Linear Model with Stepwise Feature Selection

300 samples
  8 predictor
  2 classes: 'Benhly', 'Binhthuong'

Pre-processing: independent component signal extraction (8), centered (8), scaled (8)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 240, 240, 240, 240, 240, 240, ...
Resampling results:

  logLoss    ROC      Accuracy  Kappa      Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Detection_Rate  Balanced_Accuracy
  0.4612039  0.86175  0.794     0.5191909  0.622        0.88         0.7247836       0.8243255       0.2073333       0.751
list=resamples(list(GLM=glm,GLMBCT=glmbct,PCA=glmPCA,ICA=glmICA))
diff1=diff(list,models=c("GLM","GLMBCT","PCA","ICA"),metric=c("logLoss","Kappa","ROC","Accuracy","Sensitivity","Specificity"))
summary(diff1)
Call:
summary.diff.resamples(object = diff1)

p-value adjustment: bonferroni
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

logLoss
       GLM        GLMBCT     PCA        ICA
GLM                0.0264908  0.0175193  0.0001343
GLMBCT 6.847e-06             -0.0089716 -0.0263565
PCA    0.011356   0.583963              -0.0173850
ICA    1.000000   0.007899   0.138403

Kappa
       GLM      GLMBCT      PCA         ICA
GLM             -0.0234027  -0.0236398  0.0054205
GLMBCT 0.46692              -0.0002371  0.0288232
PCA    0.07057  1.00000                 0.0290603
ICA    1.00000  1.00000     0.97180

ROC
       GLM        GLMBCT     PCA       ICA
GLM               -0.01040   -0.00990  -0.00045
GLMBCT 0.0030376             0.00050   0.00995
PCA    0.0009256  1.0000000            0.00945
ICA    1.0000000  0.4779607  0.3285702

Accuracy
       GLM     GLMBCT      PCA         ICA
GLM            -9.333e-03  -9.333e-03  6.667e-04
GLMBCT 0.5381              -8.882e-18  1.000e-02
PCA    0.1185  1.0000                  1.000e-02
ICA    1.0000  1.0000      1.0000

Sensitivity
       GLM     GLMBCT  PCA     ICA
GLM            -0.018  -0.016  0.018
GLMBCT 0.8530          0.002   0.036
PCA    0.4364  1.0000          0.034
ICA    1.0000  0.7320  0.6067

Specificity
       GLM  GLMBCT  PCA     ICA
GLM          -0.005  -0.006  -0.008
GLMBCT 1             -0.001  -0.003
PCA    1     1               -0.002
ICA    1     1       1
Remark: transforming the data can improve the predictive performance of a classification algorithm.
bwplot(list,models=c("GLM","GLMBCT","PCA","ICA"),metric=c("logLoss","Kappa","ROC","Accuracy","Sensitivity","Specificity"))
# Apply each model to the test set (the training/testing split was created earlier, not shown here)
testing$pred=predict(glm,newdata=testing)
confusionMatrix(testing$pred,testing$chandoan)       # GLM

testing$pred=predict(glmbct,newdata=testing)
confusionMatrix(testing$pred,testing$chandoan)       # GLM + BoxCox

testing$pred=predict(glmPCA,newdata=testing)
confusionMatrix(testing$pred,testing$chandoan)       # GLM + PCA
GLM
            Reference
Prediction   Benhly Binhthuong
  Benhly         10          7
  Binhthuong     20         55

               Accuracy : 0.7065
                 95% CI : (0.6024, 0.7969)
    No Information Rate : 0.6739
    P-Value [Acc > NIR] : 0.29231
                  Kappa : 0.2482
 Mcnemar's Test P-Value : 0.02092
            Sensitivity : 0.3333
            Specificity : 0.8871
         Pos Pred Value : 0.5882
         Neg Pred Value : 0.7333
             Prevalence : 0.3261
         Detection Rate : 0.1087
   Detection Prevalence : 0.1848
      Balanced Accuracy : 0.6102

GLM + BoxCox
            Reference
Prediction   Benhly Binhthuong
  Benhly         11          7
  Binhthuong     19         55

               Accuracy : 0.7174
                 95% CI : (0.6139, 0.8064)
    No Information Rate : 0.6739
    P-Value [Acc > NIR] : 0.21975
                  Kappa : 0.283
 Mcnemar's Test P-Value : 0.03098
            Sensitivity : 0.3667
            Specificity : 0.8871
         Pos Pred Value : 0.6111
         Neg Pred Value : 0.7432
             Prevalence : 0.3261
         Detection Rate : 0.1196
   Detection Prevalence : 0.1957
      Balanced Accuracy : 0.6269

GLM + PCA
            Reference
Prediction   Benhly Binhthuong
  Benhly         12          7
  Binhthuong     18         55

               Accuracy : 0.7283
                 95% CI : (0.6255, 0.8158)
    No Information Rate : 0.6739
    P-Value [Acc > NIR] : 0.1584
                  Kappa : 0.3171
 Mcnemar's Test P-Value : 0.0455
            Sensitivity : 0.4000
            Specificity : 0.8871
         Pos Pred Value : 0.6316
         Neg Pred Value : 0.7534
             Prevalence : 0.3261
         Detection Rate : 0.1304
   Detection Prevalence : 0.2065
      Balanced Accuracy : 0.6435
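As a sanity check, the reported GLM test-set figures follow directly from its confusion matrix. A small Python computation using the counts from the table above ("Benhly" taken as the positive class):

```python
# Re-deriving caret's confusionMatrix summary statistics from the raw counts.
tp, fn = 10, 20   # true Benhly predicted as Benhly / as Binhthuong
fp, tn = 7, 55    # true Binhthuong predicted as Benhly / as Binhthuong

n = tp + fn + fp + tn                               # 92 test cases
accuracy = (tp + tn) / n                            # 65/92
sensitivity = tp / (tp + fn)                        # 10/30
specificity = tn / (tn + fp)                        # 55/62
no_information_rate = max(tp + fn, fp + tn) / n     # 62/92

print(round(accuracy, 4), round(sensitivity, 4),
      round(specificity, 4), round(no_information_rate, 4))
```

These reproduce the 0.7065 / 0.3333 / 0.8871 / 0.6739 values in the GLM column, which makes the small accuracy gains from Box-Cox and PCA easy to trace back to the extra true positives in their matrices.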
Take-home messages
1.
2. After transforming your data you should re-examine the result of the transformation; visual plots are a simple but effective way to do so.
3. No rule can predict with certainty whether a given data transformation method will help or hurt model quality. You need to analyse, experiment with several solutions, and draw your own conclusions.
4. Do not over-use data transformation, especially for models built for interpretation and with methods that completely change the structure or internal values of the data, such as Yeo-Johnson or PCA. The risk is that you lose the ability to interpret the results.

See you in the next lesson.