
Vietnamese Part-of-Speech Tagging Based on Statistical Machine Learning Methods

Phan Xuân Hiếu (1), Lê Minh Hoàng (2), Nguyễn Cẩm Tú (3)
(1) Graduate School of Information Sciences, Tohoku University, Japan
(2) Hanoi National University of Education
(3) College of Technology, Vietnam National University, Hanoi

Abstract
In recent years, strong demand for searching, mining, and processing Vietnamese information has drawn growing attention to Vietnamese language processing from researchers both inside and outside Vietnam [Socbay, Bamboo, Xalo, VLSP, Biocaster, ...]. Part-of-speech (POS) tagging is one of the important steps in processing and mining Vietnamese data. This report summarizes several recent research results on Vietnamese POS tagging. It also compares and evaluates tagging quality for two statistical machine learning methods: maximum entropy (MaxEnt) and Conditional Random Fields. These results should help guide the construction of an effective POS tagger for the Vietnamese information mining community in general and for Vietnamese language processing in particular.

Keywords: part-of-speech tagging, Vietnamese, machine learning, Maximum Entropy, Conditional Random Fields, POS tagging

1) Introduction
POS tagging is the task of determining the grammatical function of each word in a sentence. It is a basic step that precedes deeper grammatical analysis and other, more complex language processing problems. A word can often take several grammatical functions. For example, in the sentence "con ngựa đá đá con ngựa đá" (roughly, "the stone horse kicks the stone horse"), the same word "đá" occurs three times: the first and third occurrences act as nouns, while the second acts as a verb.

The main approaches to English POS tagging [Dinh Dien] include: tagging based on hidden Markov models (HMM); memory-based models (Daelemans, 1996); rule-based models (Transformation Based Learning, Brill, 1995); Maximum Entropy; decision trees (Schmid, 1994a); neural networks (Schmid, 1994b); and others. Among these, machine-learning-based methods are rated very highly.

Vietnamese POS tagging faces many difficulties [Nguyen Huyen, Vu Luong]. Besides the difficulties that come from the particular characteristics of the language, Vietnamese POS tagging still badly lacks standard corpora, comparable to Brown or Penn Treebank for English, that could be used for comparison and evaluation.

Our study has three main goals: (1) to survey related work on Vietnamese POS tagging; (2) to assess how well Vietnamese POS tagging can be handled by two statistical machine learning methods (Maximum Entropy and CRFs), an approach that is rated very highly for English; and (3) to assess how the distribution of tags in the corpus affects tagging quality.

The rest of the paper is organized as follows: Section 2 surveys related work on Vietnamese POS tagging; Section 3 presents the main ideas behind Maximum Entropy and CRFs; Section 4 reports the experiments and analyzes their results; Section 5 concludes the paper.

2) Vietnamese POS tagging: related work


In this study we focus on two representative POS tagging efforts: the first by Dinh Dien and colleagues, and the second by Nguyen Huyen, Vu Luong and colleagues.

The first group [Dinh Dien] built a POS tagging system for Vietnamese by transferring and mapping POS information from English. The approach rests on two observations: (1) English POS tagging has reached high accuracy (above 97% at the word level), and (2) word alignment methods between language pairs have recently been successful. Specifically, the group built an English-Vietnamese bilingual corpus of up to 5 million words (English and Vietnamese combined). They then tagged the English side (using Transformation-based Learning, TBL [Brill 1995]) and aligned the two languages (with accuracy of about 87%) in order to transfer the POS tags from English to Vietnamese. Finally, the Vietnamese data with the newly obtained POS information is corrected by hand and used as training data for a Vietnamese POS tagger. The advantage of this method is that it avoids fully manual POS annotation by exploiting POS information from another language.

How successful the method is, however, needs closer examination. Here we give a few subjective observations about the difficulties it faces.
1) The linguistic differences between English and Vietnamese are substantial: differences in word formation, word order, and the grammatical function of words in the sentence make alignment difficult.
2) Errors accumulate over two stages, (a) POS tagging of English and (b) alignment between the two languages, and the accumulated error significantly affects the final accuracy.
3) A tagset transferred directly from English to Vietnamese lacks flexibility and is unlikely to be a representative tagset for Vietnamese: because the two languages differ in nature, converting English POS tags to Vietnamese is somewhat forced and will not be fully consistent with a tagset designed from the linguistic properties of Vietnamese.

Because the authors published their results only as scientific papers and did not share the data itself, we could not study the details of the implementation and the results more closely. This is also an obstacle to learning from and building on each other's work, and to converging on a common standard that would lay the groundwork for future Vietnamese language processing.

The second group [Nguyen Huyen, Vu Luong] approaches the problem from the linguistic foundations and properties of Vietnamese. They propose building a Vietnamese tagset on top of MULTEXT, a fairly general description standard for Western European languages, so that the tagset is modularized into two layers: (1) a basic or core layer (kernel layer) and (2) a language-specific layer (private layer). The kernel layer gives the most general specification shared across languages, while the private layer extends and refines it for a particular language based on that language's properties. Specifically, the kernel-layer categories proposed by this group are: noun (N), verb (V), adjective (A), pronoun (P), determiner (D), adverb (R), adposition (S), conjunction (C), numeral (M), interjection (I), and non-Vietnamese words (residual X, such as foreign words, ...). The private layer is developed per category, for example countable versus uncountable for nouns, masculine versus feminine for pronouns, and so on. With this classification, the tagging scheme can be scaled between the general (kernel) level and the specific (detailed) level fairly easily.

Even so, the tagset proposed by the second group is still not truly optimal for Vietnamese. The two main authors of the group are currently key members in building VietTreeBank within the VLSP project. From discussions with the VietTreeBank team, we learned that its members have continued working toward a better, more systematic design with the participation of many related groups. Agreed results on the tagset and the data, combined with research on methods and on the language itself, will form the foundation for processing and mining Vietnamese data.

3) The Maximum Entropy (Maxent) method and Conditional Random Fields (CRFs)


a) The Maximum Entropy method
The main idea of Maximum Entropy is that, apart from satisfying a given set of constraints, the model should be as uniform as possible. To make this concrete, consider a classification problem with 4 classes. The only constraint we know is that, on average, 40% of the documents containing the word "professor" belong to the class faculty. Intuitively, if a document contains the word "professor", we can say it has a 40% chance of belonging to the faculty class and a 20% chance of belonging to each of the other 3 classes. Although maximum entropy can be used to estimate any probability distribution, we consider it here for labeling sequence data. In other words, we focus on learning the conditional distribution of the label sequence given an input sequence (string).
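The four-class intuition above can be checked numerically. The following is a minimal sketch, not part of the original study: it fixes P(faculty) = 0.4, spreads the remaining mass uniformly, and compares the entropy with one arbitrary non-uniform alternative. The class names other than faculty and the skewed distribution are made-up illustrations.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution given as a dict."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def max_entropy_with_one_constraint(classes, fixed_class, fixed_prob):
    """Fix P(fixed_class) and spread the remaining mass uniformly over the rest."""
    others = [c for c in classes if c != fixed_class]
    rest = (1.0 - fixed_prob) / len(others)
    dist = {c: rest for c in others}
    dist[fixed_class] = fixed_prob
    return dist

# "student", "course", "project" are hypothetical class names.
classes = ["faculty", "student", "course", "project"]
uniform_rest = max_entropy_with_one_constraint(classes, "faculty", 0.4)

# An arbitrary alternative that also satisfies P(faculty) = 0.4.
skewed = {"faculty": 0.4, "student": 0.5, "course": 0.05, "project": 0.05}

print(uniform_rest)
print(entropy(uniform_rest), entropy(skewed))
```

Both distributions satisfy the constraint, but the one that treats the three unconstrained classes equally has the higher entropy, which is the one maximum entropy selects.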
Constraints and features

In maximum entropy, the training data is used to set constraints on the conditional distribution. Each constraint expresses some characteristic of the training data. Any real-valued function of the input sequence and the label sequence can be regarded as a feature $f_i(o, s)$. Maximum entropy restricts the model distribution so that its expected values for these features match those observed in the training data $D$. The probability $P(s \mid o)$ is therefore modeled as follows (here $o$ is the input sequence and $s$ is the output label sequence):

\[ P(s \mid o) = \frac{1}{Z(o)} \exp\Big( \sum_i \lambda_i f_i(o, s) \Big) \tag{2.1} \]

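Here $f_i(o, s)$ is a feature, $\lambda_i$ is a parameter to be estimated, and $Z(o)$ is simply the normalization factor that keeps the probability well defined (the probabilities sum to 1 over the whole space of candidate labelings $s'$):

\[ Z(o) = \sum_{s'} \exp\Big( \sum_i \lambda_i f_i(o, s') \Big) \]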

Methods for training the model from data include IIS (improved iterative scaling), GIS, and L-BFGS, among others.
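To make the exponential form and its normalization factor concrete, here is a small sketch that evaluates the model for a single token rather than a whole sequence. The two feature functions, their weights, and the three-label set are hypothetical and are not taken from the experiments in this paper.

```python
import math

def maxent_probs(features, weights, labels, observation):
    """P(s | o) = exp(sum_i lambda_i * f_i(o, s)) / Z(o) for every label s."""
    scores = {
        s: sum(w * f(observation, s) for f, w in zip(features, weights))
        for s in labels
    }
    # Subtract the max score before exponentiating for numerical stability;
    # this does not change the normalized probabilities.
    m = max(scores.values())
    exp_scores = {s: math.exp(v - m) for s, v in scores.items()}
    z = sum(exp_scores.values())  # normalization factor (up to the shift by m)
    return {s: v / z for s, v in exp_scores.items()}

# Hypothetical binary features over (observation, label) pairs.
def f_capital_noun(o, s):
    return 1.0 if o[:1].isupper() and s == "N" else 0.0

def f_digits_numeral(o, s):
    return 1.0 if o.isdigit() and s == "Nn" else 0.0

features = [f_capital_noun, f_digits_numeral]
weights = [1.2, 2.5]          # made-up values standing in for trained lambdas
labels = ["N", "V", "Nn"]

probs = maxent_probs(features, weights, labels, "2008")
print(probs)
```

With no feature firing for N or V, those two labels receive equal probability, which mirrors the "as uniform as possible" principle; training with IIS, GIS, or L-BFGS would replace the made-up weights.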

b) The Conditional Random Fields method


CRFs are undirected linear-chain state models (conditionally trained finite-state machines) that obey the first-order Markov property. CRFs have proved very successful on sequence labeling problems such as word segmentation, phrase chunking, named entity recognition, and noun phrase chunking. Let $o = (o_1, o_2, \ldots, o_T)$ be an observed data sequence to be labeled. Let $S$ be the set of states, each associated with a label $l \in L$. Let $s = (s_1, s_2, \ldots, s_T)$ be some state sequence. CRFs define the conditional probability of a state sequence given the observation sequence as follows:
\[ p(s \mid o) = \frac{1}{Z(o)} \exp\Big( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \Big) \tag{1} \]

Here $Z(o) = \sum_{s'} \exp\big( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s'_{t-1}, s'_t, o, t) \big)$ is the normalization factor over all possible label sequences. $f_k$ denotes a feature function and $\lambda_k$ is the weight associated with feature $f_k$. The goal of learning with CRFs is to estimate these weights. There are two kinds of features $f_k$: per-state features and transition features.

\[ f_k^{(\text{per-state})}(s_t, o, t) = \delta(s_t, l)\, x_k(o, t) \tag{2} \]

\[ f_k^{(\text{transition})}(s_{t-1}, s_t, t) = \delta(s_{t-1}, l')\, \delta(s_t, l) \tag{3} \]

Here $\delta$ is the Kronecker delta. Each per-state feature (2) combines the label $l$ of the current state $s_t$ with a context predicate, a binary function $x_k(o, t)$ that captures an important context of the observation $o$ at position $t$. A transition feature (3) represents sequential dependency by combining the label $l'$ of the previous state $s_{t-1}$ with the label $l$ of the current state $s_t$. CRFs are usually trained by maximizing the likelihood of the training data with optimization techniques such as L-BFGS. Inference (with the learned model) means finding the label sequence that corresponds to an input observation sequence. For CRFs this is typically done with dynamic programming, most commonly the Viterbi algorithm, when labeling new data.
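The Viterbi inference step described above can be sketched as follows. This is an illustrative implementation under simplifying assumptions, not the FlexCRF code: the weighted per-state features are assumed to be already summed into `state_scores[t][l]`, the weighted transition features into `transition_scores[l_prev][l]`, and all numbers are made up.

```python
def viterbi(state_scores, transition_scores):
    """Return the highest-scoring label sequence and its (unnormalized) score.

    state_scores[t][l]        : summed weights of per-state features for label l at position t
    transition_scores[lp][l]  : weight of the transition from label lp to label l
    """
    labels = list(state_scores[0])
    # best[t][l] = best score of any label path that ends in label l at position t
    best = [{l: state_scores[0][l] for l in labels}]
    back = []  # back[t-1][l] = best previous label for label l at position t
    for t in range(1, len(state_scores)):
        best_t, back_t = {}, {}
        for l in labels:
            prev = max(labels, key=lambda lp: best[t - 1][lp] + transition_scores[lp][l])
            best_t[l] = best[t - 1][prev] + transition_scores[prev][l] + state_scores[t][l]
            back_t[l] = prev
        best.append(best_t)
        back.append(back_t)
    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda l: best[-1][l])
    path = [last]
    for back_t in reversed(back):
        path.append(back_t[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Toy example with two labels and three positions (all scores are hypothetical).
state_scores = [
    {"N": 2.0, "V": 0.5},
    {"N": 1.0, "V": 1.2},
    {"N": 1.5, "V": 0.3},
]
transition_scores = {
    "N": {"N": 0.2, "V": 1.0},
    "V": {"N": 1.0, "V": -1.0},
}

path, score = viterbi(state_scores, transition_scores)
print(path, score)
```

Because the normalization factor Z(o) is the same for every candidate labeling of a given input, maximizing the unnormalized score is enough to find the most probable sequence. The cost grows linearly with sentence length and quadratically with the number of labels, instead of exponentially as in exhaustive search.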

4) Experiments
a) Experimental data
To build the prototype systems, we use the same dataset that was used in [Nguyen Huyen, Vu Luong]. The dataset contains about 6400 sentences and is annotated at two levels: level 1 has 11 basic tags and level 2 has a refined, detailed tagset. The detailed level-2 tags can easily be collapsed to the basic level-1 tags. The basic tags are: N (noun); A (adjective); V (verb); P (pronoun); Cc (conjunction); Cm (preposition); J (adjunct word, adverb); E (interjection); I (modal particle); Nn (numeral); X (unclassified). In addition there are 11 tags for punctuation marks, special characters, and opening and closing brackets, each of which is tagged with the character itself. The specific tagset (level 2) has 49 tags plus the same 11 tags for punctuation and special characters. For testing and evaluation, we split the dataset into 4 equal parts (4 folds), train on 3 parts in turn, and measure accuracy on the remaining part (4-fold cross validation).
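The 4-fold protocol can be sketched as below. This is an assumed, simplified version: the source does not say whether the sentences were shuffled before splitting, so the sketch keeps them in order, and the integer list stands in for the roughly 6400 annotated sentences.

```python
def k_fold_splits(sentences, k=4):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    fold_size = len(sentences) // k
    folds = [sentences[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # If len(sentences) is not divisible by k, the leftovers go to the last fold.
    folds[-1].extend(sentences[k * fold_size:])
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Stand-in data: integers play the role of tagged sentences.
sentences = list(range(6400))
splits = list(k_fold_splits(sentences, k=4))

for fold_id, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_id}: train={len(train)} test={len(test)}")
```

Each sentence is used for testing once and for training three times, so the per-fold scores can be averaged into a single figure.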

b) Feature selection
To train the classifiers, we extract features from the data as follows. To classify the POS of each word in a sentence, we use a sliding window that spans from 2 words before to 2 words after the current word. Within that window, the following features are selected:
1. The words in the window at positions -2, -1, 0 (the current position), +1, +2
2. The combination of the two words before the current word: -2-1
3. The combination of the two words after the current word: +1+2
4. The combination of the previous word and the current word: -10
5. The combination of the current word and the next word: 0+1
6. Does the current word consist entirely of digits?
7. Does the current word contain a digit?
8. Does the current word contain the character "-"?
9. Is the current word written entirely in capitals?
10. Is the first character of the current word capitalized?
11. Is the current word a punctuation mark or special character (that is, one of the characters .,!,?,;,/,...)?

This feature set is still very simple because we have only just begun experimenting. In particular, we have not yet used any POS information looked up from a dictionary. We plan to run more experiments to find the most suitable feature sets.
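The eleven feature templates listed above could be extracted along the following lines. This is only a sketch: the feature-string format, the "<PAD>" token for positions outside the sentence, the punctuation inventory, and the example sentence are all assumptions, not the format used by FlexCRF or Jmaxent.

```python
import string

# Assumed punctuation inventory; the source only lists ".,!,?,;,/,...".
PUNCTUATION = set(string.punctuation) | {"..."}

def extract_features(words, i):
    """Return the context predicates for words[i] using a -2..+2 window."""
    def w(offset):
        j = i + offset
        # "<PAD>" marks positions outside the sentence (an assumption).
        return words[j] if 0 <= j < len(words) else "<PAD>"

    cur = words[i]
    feats = [f"w[{k}]={w(k)}" for k in (-2, -1, 0, 1, 2)]  # 1. single words
    feats.append(f"w[-2,-1]={w(-2)}|{w(-1)}")              # 2. two preceding words
    feats.append(f"w[+1,+2]={w(1)}|{w(2)}")                # 3. two following words
    feats.append(f"w[-1,0]={w(-1)}|{cur}")                 # 4. previous + current
    feats.append(f"w[0,+1]={cur}|{w(1)}")                  # 5. current + next
    if cur.isdigit():
        feats.append("all_digits")                         # 6. only digits
    if any(c.isdigit() for c in cur):
        feats.append("has_digit")                          # 7. contains a digit
    if "-" in cur:
        feats.append("has_hyphen")                         # 8. contains "-"
    if cur.isupper():
        feats.append("all_caps")                           # 9. fully capitalized
    if cur[:1].isupper():
        feats.append("init_cap")                           # 10. initial capital
    if cur in PUNCTUATION:
        feats.append("is_punct")                           # 11. punctuation or special character
    return feats

# Hypothetical segmented sentence; syllables of one word are joined with "_".
words = ["Hà_Nội", "có", "36", "phố", "."]
for i in range(len(words)):
    print(words[i], extract_features(words, i))
```

Each returned string acts as a binary context predicate, the role played by x_k(o, t) in the CRF formulation; the trainer then pairs it with candidate labels.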

c) Experimental settings
We ran the POS tagging experiments with two tools, FlexCRF and Jmaxent. For each method (Maxent or CRFs) we ran experiments at two levels: (1) level-1 tagging with 9 general lexical tags (N, V, J, ...) and 10 tags for symbols; (2) level-2 tagging with 48 detailed lexical tags (Nt, Vtn, ...) and 10 tags for symbols.

The parameter settings for FlexCRF and Jmaxent are given in the following table:

FlexCRF
  order = 1                 Experiments use a first-order CRF
  f_rare_threshold=1        Discard features that occur fewer than 1 time
  Cp_rare_threshold=1       Discard context predicates that occur fewer than 1 time
  init_lamda_val=0.5        Initialize the model parameters to 0.5
Jmaxent
  cpRareThreshold=3         Discard context predicates that occur fewer than 2 times
  fRareThreshold=2          Discard features that occur fewer than 3 times

d) Results and evaluation
Summary of the experimental tagging results with Maxent and CRFs

Table 4.1. Tagging results at the general level (11 lexical tags and 11 punctuation tags) and at the specific level (48 lexical tags and 11 punctuation tags)

            F1-measure (general)      F1-measure (specific)
            Maxent      CRFs          Maxent      CRFs
Fold 1      91.33       91.55         83.82       84.21
Fold 2      91.18       91.56         83.82       84.12
Fold 3      90.22       91.98         82.04       84.01
Fold 4      91.00       91.59         83.70       83.84
Average     90.93       91.67         83.35       84.05

Table 4.2. Comparison of running time between Maximum Entropy and Conditional Random Fields

            Average time per iteration (s)      Iteration at which the optimum is reached (average)
            General level   Specific level      General level   Specific level
Maxent      ~3              ~8                  ~35             ~40
CRFs        ~48             ~353                ~36             ~40

Table 4.3. Comparison of tagging quality across POS tags at the general level (experiment on fold 3, general level, CRFs)

Tag     Precision   Recall   F1-measure
Nn      98.41       97.01    97.7
N       93.09       94       93.54
P       96.48       95.48    95.98
V       89.13       88.74    88.94
Cc      93.59       93.2     93.4
Cm      87.97       90.01    88.98
A       81.09       78.15    79.59
J       92.44       90.22    91.32
E       30.77       70.59    42.98
I       67.07       67.07    67.07
X       81          66.94    73.3
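For readers unfamiliar with the metric, the F1-measure column is the harmonic mean of precision and recall. The sketch below assumes that standard definition and recomputes three rows from the precision and recall values reported in Table 4.3; small differences from the published column can come from rounding in the reported inputs.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from Table 4.3 for three tags.
rows = {"Nn": (98.41, 97.01), "I": (67.07, 67.07), "X": (81.0, 66.94)}

for tag, (p, r) in rows.items():
    print(tag, round(f1_score(p, r), 2))
```

The harmonic mean is pulled toward the lower of the two values, which is why a tag with very unequal precision and recall, such as E, ends up with a low F1.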

[Figure: bar chart of Precision, Recall, and F1-measure for each general-level tag (Nn, N, P, V, Cc, Cm, ...); vertical axis from 0 to 120.]

Figure 1. Comparison of tagging quality across POS tags at the general level (experiment on fold 3, general level, CRFs)
[Figure: bar chart of F1-measure for each specific-level tag; vertical axis from 0 to 120.]

Figure 2. Comparison of tagging quality across POS tags at the specific level (experiment on fold 1, specific level, CRFs)

e) Remarks
The experiments show that approaches based on CRFs and Maxent are promising for Vietnamese POS tagging. Although CRFs take more time for training and tagging, they improve tagging quality noticeably (on average about 0.7 F1 points better than Maxent). An advantage of both methods is that they can integrate a large number of rich, useful features from the data. Even with only a few simple features (no lexical dictionary integrated, no regular expressions used, ...), the results are still noteworthy (the best is 91.98% at the general level with CRFs). The experiments also confirm the observation in [Nguyen Huyen, Vu Luong] that tagging at the specific level is usually less accurate than tagging at the general level. Figures 1 and 2 compare tagging quality per tag at the general and specific levels. Figure 1 shows that tagging for important lexical tags such as N, V, P, A achieves very good results compared with less frequent tags such as E and I. We believe that building a corpus with broad coverage and a balanced tag distribution could reduce this gap considerably.

5) Conclusion
Although we have not yet been able to optimize the feature set for machine-learning-based Vietnamese POS tagging, we sincerely hope that this study will benefit the Vietnamese language processing community. Our contributions consist of 3 main points: (1) a summary of several representative works on Vietnamese POS tagging; (2) confirmation that CRFs yield better tagging quality than Maxent; and (3) the observation that tags with low tagging quality are usually the less frequent tags in the dataset, which highlights the importance of building a corpus with good coverage and a distribution that is not too skewed across the lexical tags.

Acknowledgments
This study is part of the project "Building representative and essential products for Vietnamese speech and text processing", a scientific research and technology development project funded by the Ministry of Science & Technology, Vietnam. We thank the project leader, the related parties, and the management levels for supporting us and making this research possible.

References


Dien Dinh and Kiem Hoang, POS-tagger for English-Vietnamese bilingual corpus. HLT-NAACL Workshop on Building and using parallel texts: data driven machine translation and beyond, 2003.
Thi Minh Huyen Nguyen, Laurent Romary, and Xuan Luong Vu, A Case Study in POS Tagging of Vietnamese Texts. The 10th annual conference TALN 2003.
Thi Minh Huyen Nguyen, Laurent Romary, Mathias Rossignol, and Xuan Luong Vu, A lexicon for Vietnamese language processing. Language Resources and Evaluation, 2007.
Nguyễn Thị Minh Huyền, Vũ Xuân Lương, Lê Hồng Phương, Sử dụng bộ gán nhãn từ loại xác suất QTAG cho văn bản tiếng Việt, ICT 2003.
Nguyễn Quang Châu, Phan Thị Tươi, Cao Hoàng Trụ, Gán nhãn từ loại cho tiếng Việt dựa trên văn phong và tính toán xác suất, Tạp chí Phát triển KH&CN, vol. 9, no. 2, 2006.
Phan, X.H, JTextPro: A Java-based Text Processing Toolkit, http://jtextpro.sourceforge.net/
Xuan-Hieu Phan, Le-Minh Nguyen, and Cam-Tu Nguyen, "FlexCRFs: Flexible Conditional Random Field Toolkit", http://flexcrfs.sourceforge.net, 2005.
