
Data Transformation Using CARET

Dr. Le Ngoc Kha Nhi

Why transform data?

[Diagram: Ingredients → Ingredients after preparation → Build the model → Model output. Model: Maki sushi ~ (rice + avocado + cucumber + radish + nori sheet)]

In practice there are many situations in which we cannot use the available raw ingredients directly: we must transform and prepare them so that they become better suited to the goal we want to reach.

Why transform data?

The goals of data transformation may include:

- Fixing or mitigating distribution problems (skewness, outliers)
- Satisfying the normality assumption of some models
- Standardizing scales and units
- Increasing the contrast between subgroups
- Removing variables that are useless, or even harmful, to the model
- Improving the predictive performance of the model

[Diagram: Original data → After transformation → Model: Output Y ~ (A+B+C+D+E+F) → Train the model → Results]
Objectives

This lesson will help you to:

1. Grasp the general principles of data transformation with the CARET package
2. Know caret's capabilities: the transformation methods it can perform, and the nature and use of each method
3. Perform transformations on a single variable, on a set of variables, and combined with model training
4. Evaluate the effect of a transformation on the distribution of the transformed data and on the quality of the model

Required background knowledge and skills before practising:

Knowledge: distribution characteristics (mean, skewness, density plots), exponential and logarithmic functions, logistic models, PCA and ICA.

Skills: basic data manipulation, data splitting, model training, model inspection, and model validation with caret.

Introduction: what can caret do?

Caret lets you transform data in several ways:

A. Independently of the training process: when your goal is simply to transform data for basic statistical analyses (ANOVA, t tests, correlation)

B. Combined with the training and validation of a machine-learning algorithm

Within each way there are several approaches:

1. Apply a transformation to a single variable
2. Apply it to a subset of several variables
3. Apply it to the whole dataset, either BEFORE or DURING training

As for methods, CARET supports up to 15 data-transformation methods, in 5 main groups:

1. Shifting and standardizing scales and units: centering, scaling, ranging
2. Filtering out variables: zv, nzv, conditionalX
3. Changing skewness with power functions: BoxCox, YeoJohnson, ExpoTrans
4. Imputing missing values (Imputation): knn, bagged trees, median based
5. Factor analysis: PCA, ICA

These methods can be applied individually or combined (in an appropriate order, of course).

Shall we begin the journey?

1. General principles

General principles of a caret data-transformation workflow

A) Without training:

1) Prepare a subset containing one or more variables to be transformed.

2) Apply the preProcess function to this subset, producing an object: the transformation model T. This model can be built from one or several transformation functions.

3) Apply model T to the subset to be transformed (with the predict function). The output is a new, transformed dataset.

[Diagram: original data → preProcess → transformation model T → predict → transformed data. Input: any dataset]

Notes: model T can also be applied to a new subset independent of the original one (for example the Testing subset).

If transformations are applied separately to individual variables or groups of variables, the individual results must afterwards be recombined into one new dataset.

B) Simultaneously with training:

Data transformation can be performed inside the training process itself, via the preProcess option of the trainControl and train functions.

This workflow runs silently: it is applied to the whole Train set and also during the validation step (resampling or confusionMatrix). The original dataset and the Train and Test sets are preserved unchanged.
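The fit-then-apply logic above (build a transformation model T on one subset, then predict with it on any subset, including an independent test subset) can be sketched outside caret. The following Python sketch only illustrates the principle; the class name CenterScale and the toy numbers are invented for the example, and this is not caret itself:

```python
import numpy as np

# A "transformation model": parameters are learned once on the training
# subset, then applied unchanged to any dataset -- the same idea as
# preProcess() followed by predict().
class CenterScale:
    def fit(self, x):
        # Step 2: learn the transformation constants from the training subset.
        self.mean_ = x.mean(axis=0)
        self.sd_ = x.std(axis=0, ddof=1)
        return self

    def transform(self, x):
        # Step 3: apply the stored constants to any dataset (train or test).
        return (x - self.mean_) / self.sd_

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[10.0]])

T = CenterScale().fit(train)   # build the transformation model T
train_t = T.transform(train)   # transform the original subset
test_t = T.transform(test)     # note: T also applies to an independent subset
```

Storing mean_ and sd_ is exactly the point of the preProcess object: the test set is transformed with the training set's constants, never with its own.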

1. General principles

[Diagram: three workflows. (a) Exploratory transformation: the whole dataset, or a subset/single variable, → transformed data. (b) Prior transformation of the Train subset: dataset → training set (transformed) → model, with the same transformation then applied to the test set before validation. (c) Simultaneous transformation: performed inside training and validation.]

(a) Transform the data BEFORE training, for the whole dataset or for each variable separately.

(b) Transform the data BEFORE training, for the training set only; then apply the transformation function to the test set and run the validation.

(c) Transform the data SIMULTANEOUSLY with training and validation: applied to the whole training set, then to the Test set during validation.

Hint: each approach has its own strengths and weaknesses:
(a) lets you explore the methods individually, variable by variable, but does not show their consequences for the model;
(b) lets you keep the transformed data, but takes time and does not allow a separate transformation per variable;
(c) is fast and lets you check the effect of the transformation on model quality, but runs silently and must be applied to the whole Train set.

2. Order of precedence

Order in which caret applies the data-transformation functions:

1. Filter out all variables with variance = 0 (methods zv, nzv) and variables unsuitable for a classification model (conditionalX)
2. Power transformations (BoxCox, YeoJohnson, expoTrans)
3. Shifting and scale standardization (centering, scaling)
4. Rescaling to a common range (range)
5. Missing-value imputation ("knnImpute", "bagImpute", "medianImpute")
6. Principal or independent component analysis (PCA, ICA)

Notes:
This order is mandatory and fixed when methods are combined.

PCA and ICA require the data to be centered and scaled; even if you do not declare the center and scale methods, caret still performs them automatically.

Contents of the transformation procedures

1. Removing variables and improving skewness

Procedures that detect and remove unsuitable variables:

zv: automatically detects and removes all variables with variance = 0 (a single unique value across the whole dataset)
nzv: automatically detects and removes all variables with near-zero variance (values too close together, almost no variation)
conditionalX: automatically detects and removes predictors that take a single unique value within some subgroup of the categorical outcome variable.

The goal of these three procedures is to clean the data before building a predictive model, removing useless predictors that risk harming the model.

Transformation methods in the power-function family:

These are the three transformation functions BoxCox, YeoJohnson and expoTrans.
Their purpose is to improve skewness and outliers and bring the data closer to a normal distribution.
You can choose only one of these three methods.
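The idea behind zv and nzv can be sketched in a few lines. This hypothetical Python helper is only an illustration, not caret's actual implementation (which, for nzv, also uses frequency-ratio and unique-value criteria rather than a plain variance threshold); it simply drops columns whose variance is zero or nearly zero:

```python
import numpy as np

# Drop columns whose variance is zero (zv) or near zero (nzv).
# Illustrative sketch; the tolerance and data are invented.
def drop_low_variance(X, tol=1e-8):
    variances = X.var(axis=0)
    keep = variances > tol          # zv: var == 0; nzv: var ~ 0
    return X[:, keep], keep

X = np.array([[1.0, 5.0, 3.0],
              [2.0, 5.0, 3.0],
              [3.0, 5.0, 3.0 + 1e-9]])   # col 2 is constant, col 3 nearly so
X_clean, kept = drop_low_variance(X)     # only the first column survives
```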

Contents of the transformation procedures

2. Effects on scale and units

Methods that only change the scale:

center: subtract the mean of X from every value of X
The purpose of centering is to shift the mean of the distribution to zero. It does not change the shape of the distribution, but it makes regression coefficients easier to interpret.

scale: divide every value of X by its standard deviation (SD)
The purpose of scaling is to standardize the scale.

range: map the scale onto the interval [0, 1]
The purpose of this procedure is to put several variables in the dataset onto a common scale, which can be useful for some classification models when the predictors share one scale.

Notes:
When several procedures are combined, scale, center and range are performed after the power functions (BoxCox, YeoJohnson and expoTrans).
Center and scale are mandatory before applying PCA or ICA.
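As a concrete illustration of the three operations, here is a minimal Python sketch on a toy vector (the numbers are invented; caret would estimate the constants on the training set):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

centered = x - x.mean()                        # "center": mean becomes 0
scaled = x / x.std(ddof=1)                     # "scale": divide by the SD
ranged = (x - x.min()) / (x.max() - x.min())   # "range": map onto [0, 1]
```

Note that "scale" on its own only divides by the SD; combining it with "center" is what produces the familiar z-score.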

Contents of the transformation procedures

3. Box-Cox

A data-transformation method created in 1964 by the two British statisticians George E. P. Box (1919-2013) and David Cox (1924) (Journal of the Royal Statistical Society, Series B (1964) 26, 211-252).
The original application of the Box-Cox method was to transform the outcome variable, but it was later applied to predictors as well.
The purpose of the Box-Cox transformation is to correct skewness and bring the data closer to a normal distribution.
The Box-Cox function is a power function: depending on the value of lambda it reproduces the usual transformations such as powers, roots, reciprocal or logarithm.

Condition    Transformation
λ ≠ 0        (y^λ − 1)/λ
λ = 0        log(y)

The only drawback of the Box-Cox function is that it does not accept values y ≤ 0 (y must be strictly positive).
So when your data contain negative or zero values, the more suitable method is Yeo-Johnson.
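The formula above transcribes directly into code. This Python sketch takes lambda as an argument (in caret, lambda is estimated from the data); it is an illustration of the formula, not caret's BoxCoxTrans:

```python
import numpy as np

# Box-Cox transform with a given lambda (normally estimated by maximum
# likelihood; passed in explicitly here for illustration).
def box_cox(y, lam):
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)               # limiting case lambda = 0
    return (y ** lam - 1.0) / lam

y = np.array([1.0, 10.0, 100.0])
z = box_cox(y, 0.5)                    # lambda = 0.5: a square-root-type transform
```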

Contents of the transformation procedures

4. Yeo-Johnson

Established in 2000 by In-Kwon Yeo (Korea) and Richard A. Johnson (USA) (Biometrika Vol. 87, No. 4 (Dec., 2000), pp. 954-959).
Purpose: reduce skewness and approximate a normal distribution.
Yeo-Johnson is an extension of the Box-Cox method:

Condition          Transformation
λ ≠ 0, y ≥ 0       ((y + 1)^λ − 1)/λ
λ = 0, y ≥ 0       log(y + 1)
λ ≠ 2, y < 0       −((−y + 1)^(2−λ) − 1)/(2 − λ)
λ = 2, y < 0       −log(−y + 1)

Thus, for y ≥ 0, the YJ method is equivalent to Box-Cox applied to (y + 1);
for y < 0, YJ is equivalent to Box-Cox applied to (−y + 1), but with exponent 2 − λ and a sign change.
Yeo-Johnson is therefore more convenient than Box-Cox because it can also be applied to negative values of Y. On the other hand, if the data contain only Y > 0, Box-Cox and Yeo-Johnson are equivalent and produce two distributions of the same shape.
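The four branches above transcribe directly into code. As with the Box-Cox sketch, this hypothetical Python function takes lambda as a plain argument rather than estimating it, and is only an illustration of the formula:

```python
import numpy as np

# Yeo-Johnson transform: the four cases of the table above.
def yeo_johnson(y, lam):
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    if lam != 0:
        out[pos] = ((y[pos] + 1.0) ** lam - 1.0) / lam      # lam != 0, y >= 0
    else:
        out[pos] = np.log(y[pos] + 1.0)                     # lam == 0, y >= 0
    if lam != 2:
        out[~pos] = -(((-y[~pos] + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log(-y[~pos] + 1.0)                 # lam == 2, y < 0
    return out

y = np.array([-3.0, -0.5, 0.0, 2.0, 9.0])   # works for negative values too
z = yeo_johnson(y, 0.5)
```

Comparing the first two rows of the table with the Box-Cox formula shows the equivalence stated above: for y ≥ 0, yeo_johnson(y, λ) equals the Box-Cox transform of (y + 1).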

Contents of the transformation procedures

5. Imputing missing data and factor analysis

Methods for imputing missing data (Imputation):

Caret can automatically impute missing data (NA) with three methods:

knnImpute: imputes using the nearest neighbours (based on Euclidean distance)
bagImpute: imputes from a bagged-tree model (predicting the value of one variable from the other variables)
These two methods are more accurate, but cost computing time and memory.

medianImpute: imputes using the median of the variable
This method is simple and fast but less accurate.

Factor-analysis procedures:

pca: principal components
ica: independent components; their number is set by the user
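The simplest of the three, medianImpute, can be sketched in a few lines of Python (illustrative only; the function name and the toy matrix are invented, and knnImpute/bagImpute instead predict the missing entry from the other variables):

```python
import numpy as np

# Replace each NA with the median of the observed values in the same column.
def median_impute(X):
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        med = np.nanmedian(col)           # median of the non-missing values
        col[np.isnan(col)] = med
    return X

X = np.array([[1.0,   10.0],
              [np.nan, 20.0],
              [3.0,   np.nan],
              [5.0,   40.0]])
X_imp = median_impute(X)   # NAs become 3.0 (col 1) and 20.0 (col 2)
```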

Worked example
1. The data

In this example we use the well-known Pima Indian Diabetes dataset.

It was collected in an epidemiological study of diabetes in the Pima Indian population of Phoenix, Arizona, in 1987.
The dataset contains 8 continuous numeric predictors:
V1 = pregnant: number of pregnancies
V2 = plasma glucose: glucose tolerance test result
V3 = DBP: diastolic blood pressure
V4 = skinfold: skinfold thickness
V5 = serum insulin: serum insulin concentration
V6 = BMI: body mass index
V7 = pedigree function
V8 = age
and one outcome variable, V9 = diabetes diagnosis (yes/no).
This dataset can be used to practise building classification models, for example diagnosing diabetes from the 8 predictors above.

Worked example
1. The data

pima=read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", sep = ",",na.strings="0.0",strip.white=TRUE, fill = TRUE)
pima$V9=factor(pima$V9, levels = c(1,0), labels = c("Benhly", "Binhthuong"))
pima=subset(pima, V2>0 & V3>0 & V4>0 & V5>0)

pima$V2=as.numeric(pima$V2)
pima$V3=as.numeric(pima$V3)
pima$V4=as.numeric(pima$V4)
pima$V5=as.numeric(pima$V5)
pima$V8=as.numeric(pima$V8)
names(pima)=c("pregnant", "plasmaglu", "DBP", "skinfold", "seruminsulin", "BMI", "pedigreefunction", "age", "chandoan")

data0=na.omit(pima)

Note: some cases in this dataset contain the value 0 or 0.0 where that is biologically implausible, so we simply treat them as missing values and remove them.
In the end 392 cases remain.
head(data0)
   pregnant plasmaglu DBP skinfold seruminsulin  BMI pedigreefunction age   chandoan
4         1        89  66       23           94 28.1            0.167  21 Binhthuong
5         0       137  40       35          168 43.1            2.288  33     Benhly
7         3        78  50       32           88 31.0            0.248  26     Benhly
9         2       197  70       45          543 30.5            0.158  53     Benhly
14        1       189  60       23          846 30.1            0.398  59     Benhly
15        5       166  72       19          175 25.8            0.587  51     Benhly

Worked example
2. Visual exploration

library(caret)
featurePlot(x = data0[,1:8],
y = data0$chandoan,
plot = "density",
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1,
pch = "|",
layout = c(4,2),
auto.key = list(columns = 2))

[Annotation on the plot: the variable chosen as the example shows a strongly skewed distribution]

caret's featurePlot of type "density" lets you explore visually the distribution of every predictor within each subgroup of the outcome variable.

The plot reveals the variables with abnormal distributions; in this example: pregnant, serum insulin, age and pedigree function. We choose serum insulin as the illustrative example.

Worked example
2. Visual exploration

featurePlot(x = data0[,1:8],
y = data0$chandoan,
plot = "box",
scales = list(y = list(relation="free"),
x = list()),
layout = c(4,2),
auto.key = list(columns = 2))

Meanwhile, featurePlot of type "box" allows a visual comparison of predictor values between the two subgroups of the outcome variable. It also helps detect problems with outliers.

[Annotation on the plot: the variable chosen as the example shows many outliers]

3. Independent, individual transformation

Workflow for transforming data individually:

Step 1: create a subset containing the variable(s) to be transformed

# Extract the insulin variable into its own subset and rename the variable "original"
insulin=as.data.frame(data0[,5])
names(insulin)="original"

Step 2: apply the preProcess function and set the transformation method

# Apply the transformation functions, singly or combined, to the insulin subset
cen=preProcess(insulin,method="center")
scal=preProcess(insulin,method="scale")
bct=preProcess(insulin,method="BoxCox")
YJ=preProcess(insulin,method="YeoJohnson")
expo=preProcess(insulin,method="expoTrans")
ran=preProcess(insulin,method="range")
yjsc=preProcess(insulin,method=c("YeoJohnson","scale"))
yjcensc=preProcess(insulin,method=c("YeoJohnson","center","scale"))
bccent=preProcess(insulin,method=c("BoxCox","center"))

Syntax:
transformation object <- preProcess(subset to transform, method = c("A","B", ...))
where A, B, ... are the names of the transformation methods, from:
"BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute",
"bagImpute", "medianImpute", "pca", "ica", "spatialSign", "zv", "nzv", and
"conditionalX"

# Inspect the content of a transformation object, e.g. yjcensc
yjcensc
Created from 392 samples and 1 variables
Pre-processing:
- centered (1)
- ignored (0)
- scaled (1)
- Yeo-Johnson transformation (1)
Lambda estimates for Yeo-Johnson transformation:
0.03

The object can be examined simply by calling it.

3. Independent, individual transformation

Step 3: produce the transformed result with the predict function

# Generate the transformed data with predict
centering=predict(cen,insulin)
boxcox=predict(bct,insulin)
scaling=predict(scal,insulin)
YeoJohnson=predict(YJ,insulin)
expotrans=predict(expo,insulin)
ranging=predict(ran,insulin)
YJscale=predict(yjsc,insulin)
BCcenter=predict(bccent,insulin)
YJcenscal=predict(yjcensc,insulin)

Syntax:
transformed result = predict(transformation object, subset to transform)

Note: the subset to transform can be the original subset or an entirely new one (for example the validation set).

# Build a new subset, insulin2, to compare the transformed results visually
insulin2=insulin
insulin2$Boxcox=boxcox$original

# Logarithmic transformation, for comparison with Box-Cox
insulin2$Logarit=log(insulin$original)
insulin2$Centering=centering$original
insulin2$scaling=scaling$original
insulin2$YeoJohnson=YeoJohnson$original
insulin2$Expotrans=expotrans$original
insulin2$Ranging=ranging$original
insulin2$YJ_scale=YJscale$original
insulin2$BC_center=BCcenter$original
insulin2$YeoJohnson_cent_scale=YJcenscal$original
insulin2$class=data0$chandoan

4. Checking the result after transformation

featurePlot(x = insulin2[,-12],
y = insulin2$class,
plot = "density",
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1,
pch = "|",
layout = c(3,4),
auto.key = list(columns = 2))

Remarks on this example:
The Box-Cox lambda is 0, so here Box-Cox is exactly the logarithmic transformation.
Insulin takes only values > 0, so Box-Cox, Yeo-Johnson and the logarithm produce distributions of essentially the same shape; they correct insulin's skewness very well, yielding a nearly normal distribution (YJ and Box-Cox differ only in scale).
Adding scale, center or range does not change the nature of the distribution; it only changes the scale.

4. Checking the result after transformation

featurePlot(x = insulin2[,1:11],
y = insulin2$class,
plot = "box",
scales = list(y = list(relation="free"),
x = list()),
layout = c(4,3),
auto.key = list(columns = 2))

Remarks:
Ranging, scaling and centering only change the units and the scale; they affect neither the distribution nor the contrast between the two outcome subgroups.
Box-Cox, Yeo-Johnson and the logarithm increase the contrast between the diseased and healthy subgroups.

1. Transforming many variables at once

# Apply the Yeo-Johnson transformation to all 8 predictors in the dataset
YJT=preProcess(data0[,-9],method="YeoJohnson")
YJTdata=predict(YJT,newdata=data0[,-9])
YJTdata$chandoan=data0$chandoan
YJT
Created from 392 samples and 8 variables
Pre-processing:
- ignored (0)
- Yeo-Johnson transformation (8)
Lambda estimates for Yeo-Johnson transformation:
-0.02, -0.07, 1.18, 0.67, 0.03, 0.12, -1.76, -1.76

Note: caret automatically determines the optimal lambda for each variable.

featurePlot(x = YJTdata[,-9],
y = YJTdata$chandoan,
plot = "density",
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1,
pch = "|",
layout = c(4,2),
auto.key = list(columns = 2))

[Annotation on the plot: the variables on which Yeo-Johnson visibly changed the shape of the distribution]

2. Factor analysis

# Apply ICA and PCA to the whole dataset
library(fastICA)
ICAT=preProcess(data0[,-9],method=c("center", "scale", "ica"), n.comp=5)
ICAdata=predict(ICAT,newdata=data0[,-9])
ICAdata$chandoan=data0$chandoan
PCAT=preProcess(data0[,-9],method="pca")
PCAdata=predict(PCAT,newdata=data0[,-9])
PCAdata$chandoan=data0$chandoan

PCAT   # 7 principal components
Created from 392 samples and 8 variables
Pre-processing:
- centered (8)
- ignored (0)
- principal component signal extraction (8)
- scaled (8)
PCA needed 7 components to capture 95 percent of the variance

ICAT   # 5 independent components
Created from 392 samples and 8 variables
Pre-processing:
- centered (8)
- independent component signal extraction (8)
- ignored (0)
- scaled (8)
ICA used 5 components
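The line "PCA needed 7 components to capture 95 percent of the variance" comes from a cumulative-variance criterion (caret's default thresh = 0.95). Here is a Python sketch of that computation, on invented random data standing in for the 392 × 8 predictor matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(392, 8))        # stand-in for the 8 Pima predictors

# Standardize first (center + scale), as caret enforces before PCA.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Singular values of the standardized matrix give the component variances.
s = np.linalg.svd(Z, compute_uv=False)
var_ratio = s**2 / np.sum(s**2)          # variance explained per component
cumulative = np.cumsum(var_ratio)
n_comp = int(np.searchsorted(cumulative, 0.95) + 1)   # components to reach 95%
```

On this uncorrelated random data almost all 8 components are needed; on the correlated Pima predictors the same criterion stops at 7, which is what caret reports above.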

1. Combining data transformation with model training

# Split the original data into Train and Test subsets (76.3% for training)
set.seed(123)
idTrain=createDataPartition(y=data0$chandoan, p=0.763,list=FALSE)
training=data0[idTrain,]
testing=data0[-idTrain,]

# Train 4 logistic models: (1) on the raw Train subset, (2) with a Box-Cox transformation, (3) with PCA, (4) with ICA using 4 components
set.seed(1234)
Control=trainControl(method= "repeatedcv",number=5,repeats=5,classProbs = TRUE,summaryFunction = multiClassSummary)
glm=train(chandoan~.,data=training,method = "glmStepAIC",trControl = Control,tuneLength=5)
set.seed(1234)
Control=trainControl(method= "repeatedcv",number=5,repeats=5,classProbs=TRUE,summaryFunction = multiClassSummary)
glmbct=train(chandoan~.,data=training,method = "glmStepAIC",preProcess="BoxCox",trControl = Control,tuneLength=5)
set.seed(1234)
Control=trainControl(method= "repeatedcv",number=5,repeats=5,classProbs=TRUE,summaryFunction = multiClassSummary)
glmPCA=train(chandoan~.,data=training,method = "glmStepAIC",preProcess="pca",trControl = Control,tuneLength=5)
set.seed(1234)
Control=trainControl(method= "repeatedcv",number=5,repeats=5,classProbs=TRUE,summaryFunction = multiClassSummary,preProcOptions = list(ICAcomp=4))
glmICA=train(chandoan~.,data=training,method = "glmStepAIC",preProcess="ica",trControl = Control,tuneLength=5)

1. Combining data transformation with model training

glm
Generalized Linear Model with Stepwise Feature Selection
300 samples
8 predictor
2 classes: 'Benhly', 'Binhthuong'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 240, 240, 240, 240, 240, 240, ...
Resampling results:

  logLoss    ROC  Accuracy     Kappa Sensitivity Specificity Pos_Pred_Value Neg_Pred_Value Detection_Rate Balanced_Accuracy
0.4613382 0.8613 0.7946667 0.5246114        0.64       0.872      0.7147577       0.830181      0.2133333             0.756

glmbct
Generalized Linear Model with Stepwise Feature Selection
300 samples
8 predictor
2 classes: 'Benhly', 'Binhthuong'
Pre-processing: Box-Cox transformation (7)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 240, 240, 240, 240, 240, 240, ...
Resampling results:

  logLoss    ROC Accuracy     Kappa Sensitivity Specificity Pos_Pred_Value Neg_Pred_Value Detection_Rate Balanced_Accuracy
0.4348474 0.8717    0.804 0.5480141       0.658       0.877      0.7312402      0.8375498      0.2193333            0.7675

1. Combining data transformation with model training

glmPCA
Generalized Linear Model with Stepwise Feature Selection
300 samples
8 predictor
2 classes: 'Benhly', 'Binhthuong'
Pre-processing: principal component signal extraction (8), centered (8), scaled (8)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 240, 240, 240, 240, 240, 240, ...
Resampling results:

 logLoss    ROC Accuracy     Kappa Sensitivity Specificity Pos_Pred_Value Neg_Pred_Value Detection_Rate Balanced_Accuracy
0.443819 0.8712    0.804 0.5482512       0.656       0.878      0.7348954       0.836638      0.2186667             0.767

glmICA
Generalized Linear Model with Stepwise Feature Selection
300 samples
8 predictor
2 classes: 'Benhly', 'Binhthuong'
Pre-processing: independent component signal extraction (8), centered (8), scaled (8)
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 240, 240, 240, 240, 240, 240, ...
Resampling results:

  logLoss     ROC Accuracy     Kappa Sensitivity Specificity Pos_Pred_Value Neg_Pred_Value Detection_Rate Balanced_Accuracy
0.4612039 0.86175    0.794 0.5191909       0.622        0.88      0.7247836      0.8243255      0.2073333             0.751

2. Comparing model quality with and without data transformation

list=resamples(list(GLM=glm,GLMBCT=glmbct,PCA=glmPCA,ICA=glmICA))
diff1=diff(list,models=c("GLM","GLMBCT","PCA","ICA"),metric=c("logLoss","Kappa","ROC","Accuracy","Sensitivity","Specificity"))
summary(diff1)
Call:
summary.diff.resamples(object = diff1)
p-value adjustment: bonferroni
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

logLoss
       GLM       GLMBCT    PCA        ICA
GLM              0.0264908  0.0175193  0.0001343
GLMBCT 6.847e-06           -0.0089716 -0.0263565
PCA    0.011356  0.583963             -0.0173850
ICA    1.000000  0.007899  0.138403

Kappa
       GLM     GLMBCT     PCA        ICA
GLM            -0.0234027 -0.0236398  0.0054205
GLMBCT 0.46692            -0.0002371  0.0288232
PCA    0.07057 1.00000                0.0290603
ICA    1.00000 1.00000    0.97180

ROC
       GLM       GLMBCT    PCA       ICA
GLM              -0.01040  -0.00990  -0.00045
GLMBCT 0.0030376            0.00050   0.00995
PCA    0.0009256 1.0000000            0.00945
ICA    1.0000000 0.4779607 0.3285702

Accuracy
       GLM    GLMBCT     PCA        ICA
GLM           -9.333e-03 -9.333e-03 6.667e-04
GLMBCT 0.5381            -8.882e-18 1.000e-02
PCA    0.1185 1.0000                1.000e-02
ICA    1.0000 1.0000     1.0000

Sensitivity
       GLM    GLMBCT PCA    ICA
GLM           -0.018 -0.016 0.018
GLMBCT 0.8530         0.002 0.036
PCA    0.4364 1.0000         0.034
ICA    1.0000 0.7320 0.6067

Specificity
       GLM GLMBCT PCA    ICA
GLM        -0.005 -0.006 -0.008
GLMBCT 1          -0.001 -0.003
PCA    1   1             -0.002
ICA    1   1      1

Remark: data transformation can help improve the predictive performance of a classification algorithm.

2. Comparing model quality with and without data transformation

bwplot(list,models=c("GLM","GLMBCT","PCA","ICA"),metric=c("logLoss","Kappa","ROC","Accuracy","Sensitivity","Specificity"))

2. Comparing model quality with and without data transformation

GLM:
testing$pred=predict(glm,newdata=testing)
confusionMatrix(testing$pred,testing$chandoan)

GLM + BoxCox:
testing$pred=predict(glmbct,newdata=testing)
confusionMatrix(testing$pred,testing$chandoan)

GLM + PCA:
testing$pred=predict(glmPCA,newdata=testing)
confusionMatrix(testing$pred,testing$chandoan)

GLM:
Confusion Matrix and Statistics
            Reference
Prediction   Benhly Binhthuong
  Benhly         10          7
  Binhthuong     20         55

               Accuracy : 0.7065
                 95% CI : (0.6024, 0.7969)
    No Information Rate : 0.6739
    P-Value [Acc > NIR] : 0.29231
                  Kappa : 0.2482
 Mcnemar's Test P-Value : 0.02092
            Sensitivity : 0.3333
            Specificity : 0.8871
         Pos Pred Value : 0.5882
         Neg Pred Value : 0.7333
             Prevalence : 0.3261
         Detection Rate : 0.1087
   Detection Prevalence : 0.1848
      Balanced Accuracy : 0.6102
       'Positive' Class : Benhly

GLM + BoxCox:
Confusion Matrix and Statistics
            Reference
Prediction   Benhly Binhthuong
  Benhly         11          7
  Binhthuong     19         55

               Accuracy : 0.7174
                 95% CI : (0.6139, 0.8064)
    No Information Rate : 0.6739
    P-Value [Acc > NIR] : 0.21975
                  Kappa : 0.283
 Mcnemar's Test P-Value : 0.03098
            Sensitivity : 0.3667
            Specificity : 0.8871
         Pos Pred Value : 0.6111
         Neg Pred Value : 0.7432
             Prevalence : 0.3261
         Detection Rate : 0.1196
   Detection Prevalence : 0.1957
      Balanced Accuracy : 0.6269
       'Positive' Class : Benhly

GLM + PCA:
Confusion Matrix and Statistics
            Reference
Prediction   Benhly Binhthuong
  Benhly         12          7
  Binhthuong     18         55

               Accuracy : 0.7283
                 95% CI : (0.6255, 0.8158)
    No Information Rate : 0.6739
    P-Value [Acc > NIR] : 0.1584
                  Kappa : 0.3171
 Mcnemar's Test P-Value : 0.0455
            Sensitivity : 0.4000
            Specificity : 0.8871
         Pos Pred Value : 0.6316
         Neg Pred Value : 0.7534
             Prevalence : 0.3261
         Detection Rate : 0.1304
   Detection Prevalence : 0.2065
      Balanced Accuracy : 0.6435
       'Positive' Class : Benhly

Take-home messages

1. Data transformation is easy to overlook, yet it may be a necessary step to fix problems in the raw data and bring them into a form better suited to your goal. Think about it proactively as early as the data-exploration stage.

2. After transforming the data, re-examine the result of the transformation; visualisation is a simple but effective way to do so.

3. No rule can reliably predict whether a given transformation method will help or hurt model quality. You need to analyse, experiment with several solutions, and draw your own conclusions.

4. Do not overuse data transformation, especially for models intended for interpretation and with methods that completely change the data structure or its internal values, such as Yeo-Johnson or PCA. The risk is that you lose the ability to interpret the results.

See you in the next lesson.
