UNIVERSITY OF NATURAL SCIENCES
FACULTY OF INFORMATION TECHNOLOGY
DEPARTMENT OF INFORMATION SYSTEMS

TSN QU HNG 0112385 V H BO KHANH 0112387

BUILDING A VIETNAMESE TEST COLLECTION AND A SUPPORT PROGRAM FOR EVALUATING INFORMATION RETRIEVAL SYSTEMS

BACHELOR'S THESIS IN INFORMATICS

SUPERVISOR
Dr. H BO QUC

ACADEMIC YEAR 2001 - 2005

Thesis: Evaluating information retrieval systems

REVIEWER'S COMMENTS

Reviewer's confirmation


DETAILED PROPOSAL


General information about the project:
Title: Building a Vietnamese test collection and a support program for evaluating information retrieval systems
Supervisor: Dr. H Bo Quc
Students:
1. Student ID: 0112385, full name: Tsn Qu Hng
2. Student ID: 0112387, full name: V H Bo Khanh

Summary of the thesis: the project has two parts:
1. Building a test collection for evaluating Vietnamese information retrieval systems. Building the collection involves three tasks:
_ Building a Vietnamese sample corpus
_ Building a set of Vietnamese sample queries
_ Building a relevance judgment table by hand
2. Building a program that supports the evaluation of information retrieval systems. Its inputs are the sample corpus, the sample queries, and the information retrieval system; its outputs are the query results, the evaluation results, the contents of the document set, and the queries.

Main keywords related to the project: information retrieval systems evaluation
Field of application: evaluation of Vietnamese information retrieval systems.

Main algorithms, methods, and procedures studied and applied in the project:
_ Studying information retrieval and information retrieval systems evaluation
_ Studying the structure of the TREC (Text REtrieval Conference) test collection and the method used to build it
_ Studying and using the retrieval systems SMART, IOTA, Lucene, and Terrier
_ Building a Vietnamese test collection
_ Building a program for testing and evaluating information retrieval systems. The program must run on two operating systems, Windows and Linux, and is written in Java

Main tools and technologies studied and applied in the project: Borland JBuilder X, Visual Studio .NET, Microsoft Visio 2003, Rational Rose, Microsoft Word, PowerPoint

Supervisor's confirmation


Acknowledgments

We sincerely thank the lecturers of the Faculty of Information Technology for guiding and teaching us so wholeheartedly throughout our four years at the University of Natural Sciences. The knowledge we gained in the lecture halls will be a precious asset on the road ahead.

We thank our supervisor, Dr. H Bo Quc, for giving us the opportunity to study information retrieval for Vietnamese, a relatively new and attractive field in Vietnam. We thank him once more for his dedicated guidance on our thesis, "Building a Vietnamese test collection and a support program for evaluating information retrieval systems".

We thank our families, our seniors, and our friends, whose encouragement and help allowed us to complete this thesis.

The student team
Tsn Qu Hng
V H Bo Khanh


CONTENTS

INTRODUCTION

Chapter 1: OVERVIEW
1.1. Overview of information retrieval and information retrieval systems
1.2. Overview of the evaluation of information retrieval systems
1.2.1. Reasons for evaluating information retrieval systems
1.2.2. Criteria used for evaluation
1.2.3. Evaluation models
1.2.4. Measures used for evaluation
1.2.5. Methods for building a test collection
1.2.6. The chosen method for building the test collection
1.2.7. Method for assessing the significance of the returned results

Chapter 2: THEORETICAL BACKGROUND
2.1. Information retrieval and information retrieval systems
2.1.1. History of information retrieval and information retrieval systems
2.1.2. Information retrieval systems
2.1.2.1. The concept of an information retrieval system
2.1.2.2. How an information retrieval system works
2.1.2.3. Search engines
2.1.3. Comparison of classical information retrieval and Web information retrieval
2.1.4. Comparison of information retrieval and data retrieval
2.1.5. Abstract formulation of information retrieval
2.1.6. Classical information retrieval models for ranking by relevance
2.1.6.1. The Boolean model
2.1.6.2. The vector space model
2.2. Evaluating information retrieval systems
2.2.1. Foundations for evaluating information retrieval systems
2.2.2. The system-oriented evaluation model
2.2.2.1. From Cranfield to TREC
2.2.2.2. Evaluation procedure
2.2.2.3. Relevance assessment
2.2.3. Measuring retrieval capability
2.2.3.1. Concepts of measures and relevance
2.2.3.2. Computing recall (R) and precision (P)
2.2.3.3. Computing precision at the 11 standard recall points
2.2.3.3.1. Graphs of retrieval system performance
2.2.3.3.2. The recall-precision (RP) curve
2.2.3.3.3. The RP curve for a query set
2.2.3.3.4. Evaluating an information retrieval system from the graph
2.2.3.4. Relevance between questions and documents
2.2.3.4.1. Degrees of relevance
2.2.3.4.2. Issues with relevance
2.2.3.4.3. Evaluation with multi-level relevance
2.2.3.4.4. Measuring recall (R) and precision (P) with multi-level relevance

2.2.4. TREC and evaluation according to the TREC standard
2.2.4.1. What is TREC?
2.2.4.2. How TREC builds its test collection
2.2.4.2.1. Building the document collection
2.2.4.2.2. Building the topics
2.2.4.2.3. Building the standard relevance judgments
2.3. Vietnamese language data
2.3.1. Words
2.3.1.1. The notion of a word
2.3.1.2. The notion of a morpheme
2.3.1.3. The concept of word formation
2.3.2. Word boundaries

Chapter 3: DESIGN AND IMPLEMENTATION
3.1. Building the test collection
3.1.1. Building the Vietnamese corpus
3.1.1.1. Normalizing the corpus
3.1.1.1.1. Normalizing the form of the corpus
3.1.1.1.2. Formatting the corpus
3.1.2. Building the Vietnamese question set
3.1.3. Vietnamese word segmentation
3.1.4. Building the relevance judgment table
3.1.4.1. The SMART system
3.1.4.1.1. Introduction to SMART
3.1.4.1.2. The retrieval process of SMART
3.1.4.1.3. The vector model of SMART
3.1.4.1.4. Using the vector model
3.1.4.2. The Search4Vn system
3.1.4.3. The TERRIER system
3.1.4.4. The X-IOTA system
3.1.4.5. The LUCENE system
3.2. Analysis of the system for evaluating information retrieval systems
3.2.1. Description of the evaluation support system
3.2.1.1. Problem statement
3.2.1.2. Objectives
3.2.1.3. Scope
3.2.1.4. Functions
3.2.1.5. Usability
3.2.1.6. Performance
3.2.1.7. Security
3.2.2. Analysis of the evaluation system
3.2.2.1. Functions of the system
3.2.2.2. Required functions
3.2.2.2.1. Evaluating a single IR system
3.2.2.2.2. Comparing several IR systems
3.2.2.2.3. Use case diagram
3.2.2.2.4. Sequence diagrams for the use cases

3.3. Design of the evaluation system
3.3.1. Functions of the program
3.3.1.1. "Format the document database" function
3.3.1.2. "Format the returned results" function
3.3.1.3. "Format the index file" function
3.3.1.4. "Run the IR system" function
3.3.1.5. "Process the returned results" function
3.3.1.6. "Evaluate a single IR system" function
3.3.1.7. "Evaluate several IR systems" function
3.3.2. System design
3.3.2.1. Overall architecture diagram
3.3.2.1.1. List of object classes
3.3.2.1.2. Presentation classes
3.3.2.1.3. Processing classes
3.3.2.1.4. Storage classes
3.3.2.2. General architecture diagram for each function of the program
3.3.2.2.1. "Format documents" function
3.3.2.2.2. "Format questions" function
3.3.2.2.3. "Run the system" function
3.3.2.2.4. "Format results" function
3.3.2.2.5. "Format the index file" function
3.3.2.2.6. "Evaluate and display the evaluation results" function
3.3.2.2.7. "Compare the IR systems that have been run" function
3.3.2.3. Data design for storage
3.3.2.3.1. Data model
3.3.2.3.2. Logical data diagram
3.3.2.4. Data storage organization
3.3.2.4.1. System
3.3.2.4.2. Topic
3.3.2.4.3. Index_topic
3.3.2.4.4. Document
3.3.2.4.5. Index_Doc
3.3.2.4.6. relevant_TT
3.3.2.4.7. relevant_LT
3.3.2.4.8. evaluation
3.3.2.5. User interface design
3.3.2.5.1. Diagram of the relationships between screens
3.3.2.6. Screen design
3.3.2.6.1. Main screen (TH_Main)
3.3.2.6.2. Document formatting screen (TH_DDTaiLieu)
3.3.2.6.3. Document attribute creation screen (TH_TTTaiLieu)
3.3.2.6.4. Question formatting screen (TH_DDCauHoi)
3.3.2.6.5. Question attribute creation screen (TH_TTCauHoi)
3.3.2.6.6. Screen for handling the conditions for running the IR system
3.3.2.6.7. System execution screen (TH_ThucThiHT)
3.3.2.6.8. Result formatting screen (TH_DDKetQua)

3.3.2.6.9. Index information formatting screen (TH_DDIndex)
3.3.2.6.10. System evaluation screen (TH_KqDanhGia)
3.3.2.6.11. Screen for viewing the system's graph
3.3.2.6.12. Detail view screen (TH_XemChiTiet)
3.3.2.6.13. System comparison screen (TH_SoSanhHT)
3.3.2.7. Object class design
3.3.2.7.1. Processing classes
3.3.2.7.2. Storage classes

Chapter 4: EVALUATION RESULTS
4.1. Evaluation threshold
4.2. Evaluating the search4VN information retrieval system
4.3. Comparing the search4VN system with the Lucene system
4.4. Remarks on the evaluation support program
4.4.1. Strengths
4.4.2. Weaknesses

Chapter 5: CONCLUSION
Chapter 6: FUTURE WORK
APPENDIX
References


INTRODUCTION
Finding information is a practical need for everyone. In the current information explosion, marked by the rise of the Internet and of digital library initiatives, that need keeps growing. With the help of information technology, people can satisfy it easily: many computer-based information retrieval systems (Information Retrieval systems, or IR systems) exist to assist them. The retrieval capabilities of these systems, however, are certainly not the same. Evaluating information retrieval systems (Evaluation of Information Retrieval systems) is therefore indispensable for identifying which systems are effective.

Evaluation matters greatly for the survival and development of IR systems. It determines the retrieval capability of a system so that the organizations, companies, and schools that built it can develop or change it to achieve the best possible retrieval. Identifying effective IR systems is also useful to users, who can then trust the results a system returns. Going further, evaluation would bring about a revolution in the field by helping to carry information retrieval into everyday life. For example, as advanced IR systems move from research into the real world of commercial competition, the designers, developers, vendors, and sales representatives of new information products such as electronic books and search engines want to know whether their products give users and potential buyers a competitive advantage, and whether they satisfy information needs easily and accurately.

The retrieval capability of an IR system is studied at several levels: first, processing capability, that is, search time and storage space, also called performance; second, retrieval capability, that is, the effectiveness of the returned results; third, system-level capability, that is, whether the system satisfies the user's information need.

Many systems for evaluating IR systems exist worldwide today, but they mainly evaluate English and French IR systems. For Vietnamese, as far as we know, no system has yet been used to evaluate Vietnamese IR systems. Given the country's development and the demand for information retrieval, however, Vietnamese IR systems must exist and grow. Vietnam therefore badly needs systems for evaluating the performance and effectiveness of Vietnamese IR systems.

Because of the significance of this research area, we chose the topic "Evaluating information retrieval systems". We believe our evaluation system can serve as a basis for evaluating all IR systems, Vietnamese ones in particular. We also hope it will contribute to the development of IR systems, of information retrieval, and of information technology in our country.

In evaluating retrieval capability, we focus on the effectiveness of the returned search results (the second level above). The effectiveness of the returned results is defined as the system's ability to find the relevant documents (Relevant Documents) and to discard the non-relevant ones (Irrelevant Documents). This is the system-oriented model in information retrieval research, the most widely used and most effective evaluation model in the world.

To build a system for evaluating Vietnamese IR systems under the system-oriented model, we first need a Vietnamese test collection (a Vietnamese Test collection). The test collection consists of a Vietnamese sample corpus (a Vietnamese Corpus, or a set of Vietnamese documents), a set of Vietnamese sample queries (a set of Vietnamese queries), and a table of standard relevance judgments (Relevance Judgment). We studied and built the test collection following the standards of the Text REtrieval Conference (TREC) of the United States, one of the world's leading conferences on information retrieval.

Next, we built a program that supports the evaluation of IR systems and lets users carry out evaluations easily. The output of the evaluation program is obtained from the sample test collection. It comprises the query results of the IR system and the evaluation results. The evaluation results are computed from a combination of two measures: recall and precision. From this output we can determine the retrieval capability of each individual IR system and compare systems with one another.


Chapter 1: OVERVIEW


1.1. Overview of information retrieval and information retrieval systems

Information retrieval concerns the representation, storage, organization of, and access to information items (a document may contain one or more information items) [1]. In theory, there is no limit on the kinds of information items in information retrieval. In practice, the kinds of information items grow more diverse as society develops. Moreover, a collection of information items is called useful if and only if it is complete and always kept up to date. "Complete" here means that the collection must contain a large proportion of the information items considered potentially relevant to the given domains. Furthermore, the representation and organization of the information items should give users the easiest possible access to the information they are interested in.

Unfortunately, the nature of a user's information need is not simple. Consider an example of a plain information need in the context of searching the World Wide Web, or simply the Web: "Find all pages or documents that contain information about lung cancer and the causes of lung cancer; documents considered relevant must discuss both the symptoms of lung cancer and the causes of the disease, including the harm of smoking and environmental pollution."

The example shows clearly that a full description of the user's information need cannot be used directly to search with current Web search engines or IR systems. Instead, the user must translate the information need into a query that a search engine or IR system can process. The result is a set of keywords that summarizes the information need, called the query. Given the user's query, the main purpose of the IR system is to find information that is useful or relevant to the user.


In general, then, an IR system is a system that lets users find documents satisfying an information need in a large corpus. To do so, the system performs the following steps. First, it processes the raw documents into tokenized documents and builds an index based on word positions. When the user submits a query, the system likewise converts the query into the indexing language describing the information items sought, and matches it against the document index to find the relevant documents. Finally, the relevant documents are returned to the user as a list ranked in decreasing order of estimated relevance (ranked list).

1.2. Overview of the evaluation of information retrieval systems

1.2.1. Reasons for evaluating information retrieval systems

As the demand for information retrieval has grown, many retrieval models, algorithms, and systems have appeared. Evaluating these models, algorithms, and systems is therefore a necessity. We compare a system (possibly a new one) with existing systems in terms of effectiveness, cost, time, processing speed, and so on.

An IR system usually performs two processes: indexing and searching. Each process can be carried out by many methods, and system evaluation can also be used to determine which of these methods is best.

Another reason for evaluation is to compare the components of a system. Because a system consists of many components, evaluation determines how each component performs and how replacing one component with another affects the system as a whole; from this we can decide whether the component should be changed.
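To illustrate the two processes (indexing and searching) and what replacing one component with another means, the following is a minimal sketch rather than the implementation of any system evaluated in this thesis. It assumes whitespace tokenization (not adequate for Vietnamese, where one word may consist of several syllables), raw term frequencies as weights, and a toy in-memory index. The names `tokenize`, `build_index`, `doc_norms`, and `search` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    return text.lower().split()


def build_index(docs):
    """Indexing process: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def doc_norms(index):
    """Euclidean length of each document's term-frequency vector."""
    squares = defaultdict(float)
    for postings in index.values():
        for doc_id, tf in postings.items():
            squares[doc_id] += tf * tf
    return {doc_id: math.sqrt(v) for doc_id, v in squares.items()}


def search(index, query, ranking="dot"):
    """Searching process: score matching documents, return ranked doc ids.

    ranking="dot" uses the dot product of term-frequency vectors;
    ranking="cosine" additionally divides by the vector lengths.
    """
    q = Counter(tokenize(query))
    scores = defaultdict(float)
    for term, qtf in q.items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += qtf * tf
    if ranking == "cosine":
        norms = doc_norms(index)
        qnorm = math.sqrt(sum(v * v for v in q.values()))
        scores = {d: s / (norms[d] * qnorm) for d, s in scores.items()}
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

With a long document that repeats the query terms and a short document that contains only the query terms, the dot product ranks the long document first while cosine ranks the short one first. Swapping the ranking component changes the ranked list, and evaluation is what tells us which choice serves the queries better.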


Evaluation identifies which component works best as the ranking function (dot product, cosine); which works best for term selection (stopword removal, stemming, and so on); and which works best as the term weighting method, such as TF and IDF (these components are discussed in more detail in the next chapter).

Comparison shows how long the returned ranked list should be for users to scan it most easily.

Evaluation shows which system is really good, so that users can trust the returned results.

1.2.2. Criteria used for evaluation

Three criteria are currently used worldwide to evaluate IR systems. The first is effectiveness: the accuracy and completeness of the returned results with respect to the user's search goal, together with predictability in other situations, meaning that the system still finds accurate results when given other queries and other document sets. The second is performance: the search speed of the algorithm, storage capacity, response time to the user, indexing time, index size, and so on. The third is usability: the system can be studied and learned, and both people without computing knowledge and computing experts can use it.

1.2.3. Evaluation models

As far as we know, there are four models for evaluating IR systems: glass box evaluation, black box evaluation, system-oriented evaluation, and user-oriented evaluation, also called user studies evaluation [2].

Glass box evaluation: the system is evaluated by evaluating every one of its components. That is, once the components of the system are known, we evaluate those components.


Black box evaluation: the system is evaluated as a single unified entity, without evaluating its internal components individually.

System-oriented evaluation has been the main evaluation trend since automatic indexing and retrieval systems were developed in the 1960s. One of its main goals is to test how automatic systems, as well as manual procedures, perform. The model also compares approaches to indexing languages and search processing across different systems, and compares different automatic indexing schemes. An advantage of system-oriented evaluation is that the test environment is tightly controlled. It uses batch evaluation, also called query-set-based evaluation: the IR system runs the queries one by one against the prepared document collection and records which documents it returns for each query, and these results are then compared with the prepared standard relevance judgments (Relevance judgment). For each query, precision and recall are computed from the returned results and the standard relevance judgments in order to assess the retrieval effectiveness of the IR system. This approach is very common in projects and conferences on IR research such as Cranfield, MEDLARS, SMART, STAIRS and TREC.
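The batch procedure described above can be sketched as follows. This is a minimal illustration, not the thesis program: it assumes the usual set-based definitions (precision is the number of relevant documents returned divided by the number of documents returned; recall is the number of relevant documents returned divided by the number of relevant documents), and the names `precision_recall`, `evaluate_batch`, `run`, and `qrels` are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids returned by the system.
    relevant:  set of document ids judged relevant for the query.
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate_batch(run, qrels):
    """Evaluate every judged query.

    run:   {query_id: [doc ids returned by the system]}
    qrels: {query_id: set of relevant doc ids} (the relevance judgments)
    A query missing from the run counts as an empty result list.
    """
    return {
        qid: precision_recall(run.get(qid, []), relevant)
        for qid, relevant in qrels.items()
    }
```

For example, if the system returns four documents for a query, two of which are among the three judged relevant, `precision_recall` gives a precision of 0.5 and a recall of 2/3.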

nh gi hng ngi dng (user studies evaluation): Hng nghin cu ngi dng ra i vo nhng nm 1970 khi m nhiu h thng tm kim thng tin thng mi ra i. Mc ch chnh ca hng nghin cu ny l nhm xc nh cch thc tm kim ca ngi s dng [ 3]. Hng nh gi ny cn cho php xem xt h thng kha cnh ngi dng; tc l nh gi v mt tng tc vi ngi s dng nh giao din ca h thng tm kim thng tin, thi gian h thng tm kim i vi mt cu truy vn,
Trang 16

Lun vn : nh gi cc h thng tm kim thng tin

mc hi lng ca ngi s dng Hng nghin cu ny cho rng nhu cu ca ngi dng c tho mn tng ng vi hiu qu ca h thng. Ch khi nhu cu thng tin ngi dng c tha mn, khi y tm kim thng tin mi c gi l c ch. Hi ngh quc t v Tm kim Thng tin trong Ng cnh (Information Seeking in Context) c t chc nh l mt din n cho cc nh nghin cu lnh vc ny khm ph cc phng php v cc kt qu nghin cu. Mt hi ngh khc mi c thnh lp tn l Nhm Quan tm c bit (Special Interest Group - SIG) n tm kim, nhu cu v s dng thng tin ca X hi Hoa K v Khoa hc Thng tin (American Society of Information Science). Nhng hi ngh ny cng tng t nh TREC trong vic c gng khuyn khch nghin cu hng ngi dng, pht trin mi lin h gia cc nh nghin cu trong k thut, gio dc v chnh ph, v xc nh, ci tin cc k thut tm kim thch hp. Nhng cc hi ngh ny khc nhau ch cc hi ngh mi cha c phng php lun nh gi chun no c xc tin. nh gi hng ngi dng c ng gp rt ln n lnh vc tm kim thng tin. ng gp ny gm c vic xc nh cch thc tm kim thng tin ca con ngi, ni lin khong cch gia nhu cu thng tin gia cc c nhn v cc h thng tm kim thng tin, dn n mt th h mi ca cc h thng tm kim thng tin bao gm cc giao din ho my tnh-ngi s dng. Hin nay, trong s bn m hnh trn th hai m hnh nh gi hng h thng v hng ngi dng ang c s dng chnh v rng ri nht. Trong phm vi ti ca chng ti, chng ti ch s dng m hnh nh gi hng h thng v m hnh nh gi hng ngi dng cn c s hp tc ca rt nhiu ngi dng ly thng tin phn hi sau khi s dng h thng tm kim thng tin hoc cn phi tham gia trao i v hiu nng tm kim ti cc hi ngh. Nhng cc hi ngh dnh cho m hnh nh gi hng ngi dng a s cha c mt phng php lun c th no dng nh gi. Ngoi ra, vi m hnh hng h thng, chng

Trang 17

Lun vn : nh gi cc h thng tm kim thng tin

we can build an application that automatically evaluates multiple information retrieval systems.

1.2.4. Evaluation measures

Recall and precision are the two most basic measures of the quality of an information retrieval system [4]. Recall is the ratio of the relevant documents returned to the total number of truly relevant documents. Precision is the ratio of the relevant documents returned to the total number of documents returned. Many evaluation methods use one or both of these measures. For example, Mean Average Precision (MAP) is computed from precision values only and does not report recall separately. Single-value measures such as Swets' measure, the E-measure, or average search length likewise reduce the evaluation to one value. The method of computing precision at 11 standard recall points uses both recall and precision.

We evaluate with precision at 11 standard recall points because the method is simple and easy to compute, measure, and interpret. It is also intuitive: plotting the recall and precision points as a graph makes it easy to see the retrieval effectiveness of each system and to compare the evaluated systems with one another.

1.2.5. Methods for building a test collection

Under the system-oriented model, a test collection must be built first. A test collection consists of a set of sample documents, a set of sample queries, and a table of standard relevance judgments.

The document set is gathered from different sources and covers many topics. It should span as many domains as possible and reflect a variety of subjects and writing styles. This means that the sample document set must be large, which is why it is also called the sample corpus.

The sample queries are questions written to fit the sample documents; they are later used to run searches.

The table of standard relevance judgments records, for each query number, the documents that are truly relevant to that query. It serves as the reference for computing recall and precision.

There are several ways to build the table of standard relevance judgments:

_ Exhaustive judging. This is usually infeasible because the number of (query, document) pairs to judge is too large, which makes it very costly.
_ Pooling, in which a number of the most relevant documents are selected to form the standard relevance judgments. It works well when many information retrieval systems are evaluated, and it requires a diverse set of systems. First, each system retrieves the documents it finds relevant; different systems find different documents. Next, the results of all systems are combined and the intersection of the systems' relevance tables is taken. This intersection may contain only a limited number of documents, those most likely to be relevant. Evaluation based on this method is truly objective when the systems being evaluated were not among those used to form the standard relevance judgments.
_ Interactive, topic-guided searching and judging, which sometimes gives good results. It lets the assessor iterate between query formulation, searching, and judging, and refine the judgments by reviewing, adjusting, and re-judging. In general, the assessor has to do a great deal of manual work, checking whether each returned document is really relevant before adding it to the standard relevance judgments.
_ Known-item judging, which is the cheapest method. It allows the query to be modified until a known document is found.

1.2.6. The chosen method for building the test collection

The system-oriented model really took off in 1992, when the Text REtrieval Conference (TREC) was launched in the United States. Every year TREC issues a call for participation in the evaluation of information retrieval systems, specifically under the system-oriented model. As a result, the size of the test collections has grown considerably each year, along with the number of organizations and universities taking part. TREC is regarded as the world's largest conference on the evaluation of information retrieval systems and one of the most prestigious conferences in the field.

TREC builds its standard relevance judgments by pooling, and it defines clear, easy-to-follow standards and formats for test data. We therefore chose to build our collection according to TREC's standards and practice. We format queries and documents in the TREC format and build the standard relevance judgments by pooling, as TREC does, because this yields objective judgments without excessive time or cost.

For Vietnamese, however, building a test collection is more complicated than for English or French when widely used systems designed for English or French are applied to Vietnamese text, because the languages differ typologically. In English and French, each word is a single unit separated from the next by a space, whereas a Vietnamese word may consist of one or more space-separated syllables. The collection therefore has to be normalized to the input format that each retrieval system expects. This same normalization makes our collection flexible enough to evaluate many information retrieval systems built for different languages. It also matters in practice, because it lets effective foreign retrieval systems be used to search Vietnamese text.

1.2.7. Method for assessing the significance of the results

Performance measures for a whole retrieval system are mostly averaged over the set of queries. Because queries vary widely and the computed measures fluctuate strongly from query to query, an appropriate statistical analysis is needed to decide whether a measured difference between systems is statistically significant at a given confidence level. We therefore assess the significance of the results with a statistical method.
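As an illustration of the kind of statistical check described in Section 1.2.7, the sketch below applies an exact two-sided sign test to paired per-query scores of two systems. The choice of the sign test is an assumption made for this example, since the text does not name a specific test; the function name `sign_test_p_value` and the score lists are hypothetical.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Exact two-sided sign test on paired per-query scores.

    Queries on which both systems score the same are dropped,
    which is the usual convention for this test.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no query separates the two systems
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-query average precision for two systems on 8 queries.
system_a = [0.52, 0.61, 0.70, 0.81, 0.90, 0.43, 0.35, 0.66]
system_b = [0.48, 0.55, 0.62, 0.80, 0.71, 0.40, 0.30, 0.59]
p_value = sign_test_p_value(system_a, system_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests that the difference between the two systems is unlikely to be due to query variation alone. Tests that also use the size of the per-query differences are a common alternative when that information is trusted.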


Chapter 2: THEORETICAL BACKGROUND

2.1. Information retrieval and information retrieval systems

2.1.1. History of information retrieval and information retrieval systems

Information retrieval has a long history tied to libraries and information centers. Before computers and the Internet, people who needed information could ask friends and relatives or go to a library or an information center. The way a library represents, stores, organizes, and disseminates information is regarded as the traditional form of an information retrieval system.

When a library receives a new information item or document, it first analyzes the item. Suitable descriptors are then chosen to describe and reflect its content. Based on these descriptors, the item is classified according to established procedures and merged into the existing collection. These procedures are also used to formalize requests (a request stands in for an information need) and to compare requests or queries with the descriptions of the stored items. This comparison is the basis for deciding which items suit a given query. Finally, a retrieval and dissemination mechanism delivers the required items to the user.

A problem arises, however, concerning where a newly added item actually belongs in the collection. There are various approaches to this problem, but all of them involve the physical or logical organization of the items. In a library, the physical organization is the indexing of documents, that is, the arrangement of books by call number, following a numbering scheme usually defined by major libraries. Books are shelved at fixed positions determined by these numbers. A logical organization must be added on top of the physical one to help users search more easily. For example, books on information retrieval can be located by looking up the term "information retrieval" in the library's subject catalog. Once the right term is found, the consecutive cards under it identify the books on that subject. These books are tied to their numbers and are found at the corresponding positions. Moreover, changing the subject term of a book does not require moving it on the shelf: items can be reorganized logically by changing the catalog without changing the physical arrangement.

As society develops, information becomes ever more varied and abundant, and the challenge is to manage this huge volume effectively. This creates the need to reduce the number of items to a manageable size, keeping those considered most relevant to the search domain. It is also hard to predict the pattern and future growth of information, and any such prediction carries a high risk. A further difficulty in organizing information effectively is the wish to keep related items close together. For example, a subject that spans several fields, such as systems analysis (which touches computer science, operations research, engineering, management science, education, and information systems), cannot be kept in one place and must be split across fields. Many other difficulties remain, for instance in classifying and comparing documents and items, and in indexing and numbering documents. These difficulties could not have been overcome without the computer.

Indeed, computers have made storing and retrieving information much easier. A computer can process all kinds of information and quickly store huge amounts of it. Retrieval on a computer can also be very fast and effective, depending on the implementation model and the algorithm used. The mechanism closely resembles that of a library. First, based on an index language and on the items that represent document content, the document collection is represented as a set of index entries. The information need, in turn, is represented as a structured or unstructured query that the machine can process. The machine then compares the two representations, that of the documents and that of the query, to determine which documents match which query. After the comparison, the machine locates the physical position of the required item and delivers it to the user. This mechanism is common to all information retrieval systems.

Until no more than 20 years ago, however, information retrieval systems were used mainly in laboratories to search collections of books and documents. Although these systems did not involve complex mathematical methods, once the Internet developed, the main search techniques used on the World Wide Web were precisely information retrieval techniques. The algorithms and techniques of information retrieval systems have advanced steadily thanks to the Internet. Because searching for information on the Internet is a common, practical, and indispensable need, developers of retrieval systems must work ever harder to deliver performance and effectiveness to users.

Information retrieval research has traditionally focused on text retrieval or document retrieval. For a long time, information retrieval was almost synonymous with document or text retrieval. More recently, new application scenarios such as question answering, topic detection, and tracking have become active areas of research. The boundary between the information retrieval community and the natural language processing and database research communities is increasingly blurred as these communities develop areas of common interest, such as question answering, summarization, and retrieval from structured documents.

Another growing area that builds on information retrieval techniques is non-text retrieval, also called multimedia information retrieval. This kind of retrieval relies on automatically extracting the textual or spoken parts of multimedia documents, which are then processed by text-based IR techniques. There is, however, growing interest in techniques that expose media-specific information and integrate it with established retrieval methods, rather than relying on the extraction approach just described. Within the scope of this thesis, we restrict ourselves to text retrieval.

2.1.2. Information retrieval systems

2.1.2.1. The concept of an information retrieval system

In theory, an information retrieval system is an information system used to store, process, look up, retrieve, and disseminate information items to users. It usually operates on textual data, and there is no restriction on the information items contained in the text.

An information system comprises a set of information items, a set of requests, and some retrieval mechanism that decides which items are relevant to which requests. In principle, the relationship between queries and documents is obtained by direct comparison. In practice, relevance between queries and documents is not determined directly but indirectly: documents and information items are first converted into an index language, and only then is their degree of relevance determined.

2.1.2.2. How an information retrieval system works

Figure 1 illustrates the structure and basic operation of a classical information retrieval system.

Figure 1.

In the first stage, preprocessing, the raw documents of the collection are processed into tokenized documents and then indexed into a list of postings per term.

In the second stage, the user submits a query (unstructured, in natural language) describing an information need. The system represents this query as a structured or unstructured query that the machine can process, and then matches it against the collection to find the documents and information items that can answer the query and are relevant to it. The procedures that decide which items are relevant are based on the representations of the queries and of the items, both expressed in the index language.

Finally, the documents and items found are displayed as a list ranked by relevance (ranked retrieved documents), usually with the most relevant items above the less relevant ones. Different systems display the relevant information in different ways. Some show only the title and the path to the document; some show the title, the path, and a short excerpt related to the query; and some Web search systems add links to other web pages.

Many information systems also let the user give feedback on the quality of the results. Using this feedback, the system tries to adapt and find the best results for the query.

The indexing performed in the preprocessing stage is the same in principle for every system, but the algorithms and procedures differ.

Indexing principle: when a new unstructured document or information item is added, the system converts it into a special form, the index language. This conversion is done manually, automatically, or both, and is called the indexing process. It is based on the information items that represent the content of the document, so its result is a set of index entries representing that document.

2.1.2.3. Search engines

Figure 2 illustrates the basic structure of a search engine. A search engine is an information retrieval system, but it is not identical to the classical system described above. The difference stems from the source of the data: a well-defined closed repository on the one hand, the World Wide Web on the other. Because there is no direct access to documents on the Web (as there is in a library collection), a search engine needs a crawler component. This software component fetches web pages and stores them in a local repository. Crawling raises technical challenges concerning the performance of the process and the relevance of the documents: because web pages are dynamic, the crawler must keep the local repository up to date every day.

Crawling documents from the Web is not enough, because web data contains a great deal of redundant information. Global analysis is responsible for removing unimportant data such as duplicate pages and pages with objectionable content. It is also responsible for computing the global measures used by retrieval systems, such as page ranking (the rank of a page is determined mostly by the pages that link to it and the pages it links to).

Figure 2.
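Returning to the preprocessing stage of Section 2.1.2.2 (tokenization followed by indexing into postings per term), the idea can be sketched roughly as follows. This is a minimal illustration, not the indexing code of any system discussed in this thesis: the function names `tokenize` and `build_inverted_index`, the whitespace tokenizer, and the two sample documents are assumptions made for the example. Whitespace splitting in particular would not segment multi-syllable Vietnamese words correctly.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each term to a sorted list of (doc_id, positions) postings."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    # Sort postings by document id so that lists can be merged efficiently.
    return {term: sorted(postings.items()) for term, postings in index.items()}


# Two hypothetical raw documents keyed by document id.
documents = {
    1: "information retrieval systems",
    2: "evaluation of retrieval systems",
}
index = build_inverted_index(documents)
print(index["retrieval"])
```

Looking up a query term in `index` gives the documents that contain it together with the token positions, which is the information a matching procedure needs.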


2.1.3. Comparison of classical information retrieval and Web information retrieval

The table below shows the differences between classical information retrieval systems (classical IR) and Web information retrieval systems (Web IR).

                                Classical IR                 Web IR
    Size                        Large                        Huge
    Data quality                Clean, no duplicates         Messy, duplicated
    Data change rate            Rare                         Continuous
    Data accessibility          Accessible                   Partially accessible
    Format diversity            Homogeneous, same origin     Highly diverse
    Documents                   Text                         HTML
    Number of relevant items    Small                        Large
    IR technique                Content-based                Link-based

The volume of data in a classical IR system is fairly large, whereas in a Web IR system it is huge. This difference, which spans orders of magnitude, is the largest one. It affects the hardware required (a single computer is never enough, and memory cannot hold all the data) and the algorithms (the notions of time and space efficiency change).

Another difference concerns data quality. In a classical IR system the data is clean, whereas Web data is messy, both because of duplication and because of spam used to raise a page's rank or simply to create clutter.

As noted above, data in classical IR changes infrequently, so it is usually indexed once. Web data, by contrast, changes frequently, so the index has to be updated as well. Moreover, access to the data cannot be taken for granted in Web IR, where only part of the data is reachable.

Documents in classical IR are usually homogeneous in format, whereas documents in Web IR are of many kinds: anyone can create a web page in any format and any language. Another important difference is that web documents are usually not written as plain text, as documents in classical IR are. Web pages are usually written in HTML (Hypertext Markup Language), which has both advantages and drawbacks for a retrieval system. On the one hand, it provides structured data that makes analysis easier; on the other hand, a page often contains little text (which is what an IR system relies on) and is therefore harder to classify.

The number of results returned in Web IR is also larger than in classical IR, so ranking the result list is harder. Finally, classical IR ranks by content only (content-based ranking). This was the prevailing technique until Google introduced a new link-based ranking technique, but on its own it does not carry over well to Web IR. Link-based ranking uses the hyperlinks between web documents to rank pages more effectively and more reliably.

2.1.4. Comparison of information retrieval and data retrieval

An information retrieval system is not a data retrieval system. The table below lists some properties that distinguish the two.

                        Information retrieval            Data retrieval
    Data                Free text, unstructured          Database tables, structured
    Query               Keywords, natural language       SQL, relational algebra
    Result matching     Approximate relevance            Exact match
    Result ordering     Sorted by relevance              Unordered
    Access              Non-expert users                 Knowledgeable users or automated processes

An information retrieval system retrieves documents based on the user's information need. Queries run over unstructured data (usually free text) and use keywords or natural language, so they can be written by non-expert users. Because the query syntax is not precisely defined, the results may include inexact matches, and their ranking by relevance is only approximate.

A data retrieval system retrieves the set of documents that syntactically match the user's query. Queries run over structured data (usually database tables) and typically use a fully defined query language such as SQL or relational algebra. The user must be familiar with the syntax and understand the semantics of the query language, so queries are usually written by knowledgeable users or generated by an automated process. The result consists of all documents that exactly match the semantics of the query, in arbitrary order.

2.1.5. An abstract formulation of information retrieval

Let D be the set of documents and Q the set of queries. The function $f: D \times Q \to \mathbb{R}$ assigns to each (document, query) pair the degree of relevance of the document to the query. For each query q in Q, f induces a (partial) order $\le_q$ on D.

The operation of an information retrieval system has two main phases. In the first phase, D is preprocessed and an index I is built. In the second phase, given a query in Q, the index I is used to output a permutation of D.

The main goal of an information retrieval system is to output a permutation close to $\le_q$ while using a compact index and responding quickly. For example, we do not want to achieve high accuracy by using a huge index that stores a permutation of D for every possible query, or by scanning the entire index for every query.

We use the notion of tokens to represent documents. Let T be the token space. The token space may consist of, for example, all English words, a set of phrases, or a set of URLs. We define a document as a real vector $d \in \mathbb{R}^k$, where k is the number of tokens in the token space. Let $d_i$ be the weight of token $t_i$ in d. There are many ways to compute $d_i$; the simplest is to count the occurrences of $t_i$ in d.
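The abstract formulation of Section 2.1.5 can be made concrete with a small sketch: documents become count vectors over a token space, and a relevance function f orders them for a query. Everything named here is an assumption for illustration: the three-token space, the sample texts, the helper names `to_vector` and `rank`, and the use of a plain dot product as f.

```python
from collections import Counter


def to_vector(text, token_space):
    """Represent a text as occurrence counts d_i of each token t_i."""
    counts = Counter(text.lower().split())
    return [counts[token] for token in token_space]


def dot(d, q):
    """A very simple relevance function f(d, q): the dot product."""
    return sum(d_i * q_i for d_i, q_i in zip(d, q))


def rank(documents, query, f):
    """Return document ids as a permutation of D, most relevant first."""
    return sorted(documents, key=lambda doc_id: f(documents[doc_id], query),
                  reverse=True)


# Hypothetical token space T with k = 3 tokens.
token_space = ["java", "coffee", "island"]
docs = {
    "d1": to_vector("java java coffee", token_space),
    "d2": to_vector("island coffee", token_space),
    "d3": to_vector("java island island", token_space),
}
query = to_vector("java", token_space)
ranking = rank(docs, query, dot)
print(ranking)
```

A real system would not score every document for every query as `rank` does here; the index exists precisely to avoid that full scan.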

2.1.6. Classical information retrieval models for relevance ranking

Information retrieval research draws on many different models. Here we present the two most widely used.

2.1.6.1. The Boolean model

The most basic relevance model in classical information retrieval is the Boolean model. A document is defined as a Boolean vector $d \in \{0,1\}^k$ (Boolean weights), where $d_i = 1$ when $t_i$ occurs in d. A query is defined as a Boolean formula q over the tokens, $q: \{0,1\}^k \to \{0,1\}$. In other words, q is a function that, given a vector in $\{0,1\}^k$ representing a document, returns a Boolean value that depends on the relevance of the document to the query. The relevance function is defined simply by applying this function to a document: $f(d, q) = q(d)$. For example, a query in the Boolean model could be "Michael Jordan" AND (NOT basketball). The main benefit of the Boolean model is its simplicity for the user; however, its relevance function is too coarse, since it returns only a Boolean value.

2.1.6.2. The vector space model

The model commonly used for relevance ranking in classical information retrieval is the vector space model (VSM). A document is a real vector $d \in \mathbb{R}^k$ (real weights), where $d_i$ is determined by a weighting function, usually the TF-IDF score (discussed later in this section). Like a document, a query is a real vector $q \in \mathbb{R}^k$, where $q_i$ is the weight of $t_i$ in q. The relevance function is $f(d, q) = sim(d, q)$, where $sim(d, q)$ is the similarity between d and q. Below we first present how to measure the similarity between a document vector and a query vector, and then the TF-IDF score used to weight the tokens in a document.

Intuition might lead us to define the similarity between a document vector and a query vector through their difference vector (see the figure below).

This approach gives considerable weight to tokens that occur in the document but not in the query. Query vectors are usually much sparser than document vectors, so a better measure should remove the effect of tokens that do not occur in the query.

Figure: Similarity between the document vector d and the query vector q. (a) Difference vector. (b) Cosine.


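The cosine similarity measure (Figure 5b), which follows from the observation above, is the commonly used measure of similarity between a document vector and a query vector:

$$sim(d, q) = \cos(d, q) = \frac{\sum_{i=1}^{k} d_i q_i}{\sqrt{\sum_{i=1}^{k} d_i^2}\,\sqrt{\sum_{i=1}^{k} q_i^2}}$$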
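A small numerical sketch may help to see why the difference vector is a poor similarity measure for sparse queries while the cosine is not. The vectors below are invented for illustration, and the function names `difference_norm` and `cosine` are assumptions; the length of the difference vector is taken here as the Euclidean norm.

```python
import math


def difference_norm(d, q):
    """Length of the difference vector d - q (smaller means more similar)."""
    return math.sqrt(sum((d_i - q_i) ** 2 for d_i, q_i in zip(d, q)))


def cosine(d, q):
    """Cosine of the angle between d and q (larger means more similar)."""
    dot = sum(d_i * q_i for d_i, q_i in zip(d, q))
    norm_d = math.sqrt(sum(d_i * d_i for d_i in d))
    norm_q = math.sqrt(sum(q_i * q_i for q_i in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_d * norm_q)


query = [1, 1, 0, 0, 0]      # sparse: only two tokens are used
on_topic = [1, 1, 2, 2, 2]   # contains both query tokens plus other tokens
off_topic = [0, 0, 1, 0, 0]  # shares no token with the query

for name, doc in (("on_topic", on_topic), ("off_topic", off_topic)):
    print(name, round(difference_norm(doc, query), 3), round(cosine(doc, query), 3))
```

With these vectors the difference norm puts the off-topic document closer to the query, because the extra tokens of the on-topic document dominate the difference, whereas the cosine ranks the on-topic document first.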

Note that when the angle between the two vectors is small, the cosine approaches 1, the maximum similarity. When the two vectors are nearly perpendicular, the cosine approaches 0, the minimum similarity.

TF-IDF is a common method for weighting the terms in a document. The basic idea is to consider how frequent a term is in one document compared with how frequent it is in other documents. For example, consider a document $d_1$ with 100 terms, 10 of which are "java", and a document $d_2$ with 100000 terms, 10 of which are "java". Because the frequency of "java" in $d_1$ is much higher than in $d_2$, the weight of "java" in $d_1$ should be higher than in $d_2$. Now suppose that the term "the" also occurs 10 times in $d_1$. Because "the" is common across documents, it should not receive the same weight as "java", even though both are equally frequent in $d_1$.

The TF-IDF score of a term in a document is formally defined as follows. Let $n(d, t_i)$ be the number of occurrences of $t_i$ in d, and let $N = \sum_i n(d, t_i)$ be the total number of tokens in d. Let $D_i$ denote the number of documents containing $t_i$, and D the total number of documents in the collection.

The term frequency $TF(d, t_i)$ is the frequency of $t_i$ in d. There are several ways to compute it. The two most common divide the number of occurrences of the token in the document either by the total number of tokens in the document or by the number of occurrences of the most frequent token in the document:

$$TF(d, t_i) = \frac{n(d, t_i)}{N} \qquad \text{or} \qquad TF(d, t_i) = \frac{n(d, t_i)}{\max_j n(d, t_j)}$$


In either case, a term that occurs more often has a high TF score (at most 1), and a term that occurs rarely has a TF score close to 0.

By contrast, $IDF(t_i)$ (inverse document frequency) is the inverse frequency of $t_i$ across all documents in the collection. It is usually measured as the logarithm of the ratio between the total number of documents in the collection and the number of documents that contain $t_i$:

$$IDF(t_i) = \log\left(\frac{D}{D_i}\right)$$

Note that the logarithm is applied for numerical reasons only. A term that occurs in nearly every document, such as "the", therefore has an IDF close to 0, while a rare term has a high IDF.

The TF-IDF score is the product of the TF score and the IDF score:

$$TF\text{-}IDF(d, t_i) = TF(d, t_i) \cdot IDF(t_i)$$

The formula shows that TF-IDF scores a term higher when it occurs frequently in one document and infrequently in the other documents.

The vector space model, which typically uses TF-IDF to weight terms and the cosine as the similarity function, is a more reliable way to compute the relevance of a document to a query than the Boolean model above. In addition, the VSM has efficient implementations and performs well in practice. Its main weakness is the assumption that terms are independent of one another. In practice, terms are often related, and taking this into account can lead to better relevance estimates.
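The "java" and "the" example from the TF-IDF discussion can be worked through numerically. The sketch below uses the first TF variant (count divided by the total number of tokens) and the natural logarithm. The collection size of 1000 documents and the document frequencies of the two terms are assumptions added for illustration, since the text gives only the within-document counts.

```python
import math


def tf(count, total_tokens):
    """Term frequency: occurrences divided by the document length N."""
    return count / total_tokens


def idf(num_docs, docs_with_term):
    """Inverse document frequency: log(D / D_i)."""
    return math.log(num_docs / docs_with_term)


def tf_idf(count, total_tokens, num_docs, docs_with_term):
    return tf(count, total_tokens) * idf(num_docs, docs_with_term)


# Assumed collection statistics (not taken from the text).
NUM_DOCS = 1000
DOCS_WITH = {"java": 10, "the": 1000}

# d1 has 100 tokens and d2 has 100000 tokens; "java" occurs 10 times in each.
java_d1 = tf_idf(10, 100, NUM_DOCS, DOCS_WITH["java"])
java_d2 = tf_idf(10, 100_000, NUM_DOCS, DOCS_WITH["java"])
# "the" also occurs 10 times in d1 but appears in every document.
the_d1 = tf_idf(10, 100, NUM_DOCS, DOCS_WITH["the"])

print(f"java in d1: {java_d1:.4f}")
print(f"java in d2: {java_d2:.6f}")
print(f"the  in d1: {the_d1:.4f}")
```

Under these assumptions "java" gets a much higher weight in the short document than in the long one, and "the" gets weight 0 in spite of having the same term frequency as "java" in $d_1$, which matches the intuition stated in the text.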


2.2. Evaluation of information retrieval systems

2.2.1. Foundations of information retrieval evaluation

One of the best introductions to the evaluation of information retrieval systems is Chapter 7 of [5]. Here we concentrate on what can be evaluated in information retrieval:

_ Coverage of the collection: the extent to which the system includes relevant material. Coverage is therefore tied to the quality of the collection. This matters in Web retrieval, because each search engine is known to cover only about 16% of the Web.
_ Performance: the average interval between the time a request is made and the time an answer is returned. Performance is understood as search time, memory use, and so on.
_ The presentation of the returned results.
_ The effort required of the user to obtain an answer to a request.
_ The recall of the system: the proportion of relevant documents that are returned.
_ The precision of the system: the proportion of returned documents that are actually relevant.

Both recall and precision relate to retrieval effectiveness. In this thesis we concentrate on the last two aspects (the recall and precision of the system) because they dominate the evaluation of information retrieval systems. These two aspects are part of the system-oriented evaluation model mentioned in the overview, which we now describe in more detail.


2.2.2. The system-oriented evaluation model

The techniques of system-oriented retrieval evaluation and the associated performance measures were developed over a number of long-running research projects: Cranfield, MEDLARS, SMART, STAIRS, and TREC. The main idea is to measure the performance of an information retrieval system by running a set of queries against a test collection indexed by the system and recording the results. For each query, the precision and recall of the recorded result set can then be computed. As defined in Chapter 1, precision is the ratio of the relevant documents returned to the documents returned, and recall is the ratio of the relevant documents returned to the total number of relevant documents. More precise definitions of these and related measures are given in Section 2.2.3.

2.2.2.1. From Cranfield to TREC

The Cranfield project, led by Cleverdon, is usually regarded as the main model for TREC. Cleverdon set up the Cranfield tests, which were used mainly to examine how different indexing functions perform. The main goal was to determine which algorithm is best with respect to the chosen criteria and measures. This gave rise to the tradition of system-oriented experimental research.

Salton, in the United States, was the first to extend the experimental method to the evaluation of retrieval algorithms based on the vector space model [6]. He began his research on information retrieval at Harvard University in 1961. He wanted to develop a framework for comparing the indexing capabilities and retrieval techniques of systems. The framework was implemented as a suite of algorithms and became known as the SMART system. The SMART project is probably the longest-running information retrieval research effort to date: from 1961 until Salton's death in 1995, the SMART group experimented with many aspects of information retrieval, including term weighting, query expansion, relevance feedback, and clustering. All experiments were based on the SMART retrieval system, which is described in detail in Section 3.1.4.1. The project's best-known result is the intuitive and effective vector space model.

The ongoing TREC program was motivated by the Cranfield and SMART studies. TREC began in 1992 with two main tasks: ad hoc retrieval and routing. Since then, many new tasks have been studied in different "tracks". TREC's main advantages are that its test collections are of a more realistic size than those of earlier projects and that the evaluation is open to any research group. The number of participants has grown quickly over the years. A significant number of groups take part every year, which ensures stability and comparability across years. TREC relies on assessors from the US National Institute of Standards and Technology (NIST) to make the relevance judgments.

The STAIRS study was one of the first to develop a new procedure for measuring recall, because the large size of the collection made building complete standard relevance judgments too costly. TREC likewise bases its recall measurements on the judgment of a small subset of documents (the pool), but uses a different method to build it. The pool is created from a sample of runs of different retrieval systems (the more diverse, the better). For each query, the lists of documents returned are merged and duplicate documents are removed, giving a single unified list. Finally, the assessors review the documents in this list (one list per query) and judge whether each is really relevant to the corresponding query.

TREC's influence on information retrieval is considerable, because the quality of its test collections is very good, because many systems contribute to the pools, and because of the continuity of the program. TREC has created a large asset of test collections that can be used in a great number of controlled experiments. The major advantage of controlled experiments is that they can be repeated. Before TREC, there were many small test collections, and it was very hard to compare methods across groups. This situation held back the development of information retrieval evaluation. TREC's goal is to build a number of large test collections for information retrieval, mainly so that evaluation can be carried out under controlled conditions and repeated. The results achieved by TREC participants since the program began show considerable progress. TREC is described in detail in Section 2.2.4.

2.2.2.2. Evaluation procedure

The system-oriented experimental method proceeds through the following distinct steps:

_ First, build a test collection. A test collection consists of a set of sample documents, a set of sample queries, and a table of standard relevance judgments. In theory, every (query, document) combination is checked for relevance. In practice, only part of the document set is examined for each query.
_ The retrieval systems run searches on the test collection: index the document set, create queries from the topics, and produce a ranked list of documents for each query.
_ Compute the performance measures. The classical measures are recall and precision, but there are many others, for example mean average precision.
_ Assess the significance of the results with a statistical method. Performance measures for a whole retrieval system are mostly averaged over the set of queries. Because queries vary widely and the computed measures fluctuate strongly from query to query, an appropriate statistical analysis is needed to decide whether a measured difference between systems is statistically significant at a given confidence level.

2.2.2.3. Relevance judgments

TREC-style evaluation of information retrieval systems rests on two important assumptions that do not hold in real-world settings:

_ Relevance is binary: a document is either relevant or not relevant.
_ The relevance of a document is completely independent of the other documents.

These assumptions simplify the measurement of retrieval systems. Many researchers have experimented with different relevance scales; these are presented in more detail in Section 2.2.3.4. The assumption that the relevance of a document is independent of the other documents is unrealistic in most cases. In most basic search situations, such as searching the Web, users want an answer to a specific question or a few references. If we assume that the user browses the documents found starting from the most relevant ones, then the value of the less relevant documents depends on the relevant documents already read, and the probability of reading a new document decreases down the list. This dependence is usually ignored by information retrieval researchers.

There is also much concern about the subjectivity of the judging procedure. People often disagree about relevance, and this has cast doubt on TREC's judgments. However, several studies have addressed the issue and found that the effect on the relative ranking of a set of systems is negligible. A recent study on the TREC test collections examined several factors:

_ judgments by the topic author versus judgments by a non-author;

Trang 40

Lun vn : nh gi cc h thng tm kim thng tin

_ judgments by a single assessor versus judgments by a group of assessors;
_ judgments made in the same environment versus judgments made in different environments.

These factors affect the absolute values of the performance measures, but the relative ranking of the systems remains stable.

2.2.3. Measuring retrieval performance

The classical measures of retrieval performance in information system experiments are recall and precision. In the following sections we describe the procedures for measuring precision and recall and for computing them for systems that return ranked results, in the situation where not all documents in the test collection can be judged. From recall and precision, we compute precision at 11 recall points. The result of this method is a table of precision against recall, that is, precision as a function of recall. This function can be plotted to show the retrieval effectiveness of a system visually and to compare several systems on the same graph.

2.2.3.1. Concepts of measures and relevance

_ Relevance of a document: a document is said to be relevant when its content addresses the subject that the user's query is about.
_ Recall (R): indicates the ability of the system to find the relevant documents.
_ Precision (P): indicates the ability of the system to return only relevant documents.
_ Fallout (F): indicates the ability of the system to reject non-relevant documents.


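2.2.3.2. Computing recall (R) and precision (P)

Let S be the document collection, A the set of relevant documents, B the set of returned documents, and $A \cap B$ the set of returned documents that are relevant. $\bar{A}$ and $\bar{B}$ denote complements in S.

                       Relevant              Non-relevant
    Retrieved          $A \cap B$            $\bar{A} \cap B$
    Not retrieved      $A \cap \bar{B}$      $\bar{A} \cap \bar{B}$

Recall (R):

$$R = \frac{|A \cap B|}{|A|} \qquad (1)$$

Precision (P):

$$P = \frac{|A \cap B|}{|B|} \qquad (2)$$

Fallout (F):

$$F = \frac{|\bar{A} \cap B|}{|\bar{A}|} \qquad (3)$$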


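Relationship between R, P, and F:

$$P = \frac{R \cdot G}{R \cdot G + F \cdot (1 - G)} \qquad (4)$$

G is the generality factor, which measures the density of relevant documents in the collection. It indicates whether the proportion of documents relevant to the query is high or low:

$$G = \frac{|A|}{|S|} \qquad (5)$$

where A is the set of relevant documents and S is the document collection.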
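The four measures and the relationship between them can be checked on a small invented example. In the sketch below, S is the collection, A the set of relevant documents, and B the set of returned documents; the document ids and the function name `evaluate` are hypothetical.

```python
def evaluate(collection, relevant, retrieved):
    """Return (recall, precision, fallout, generality) for one query."""
    hits = relevant & retrieved            # relevant documents returned
    non_relevant = collection - relevant   # complement of A in S
    recall = len(hits) / len(relevant)
    precision = len(hits) / len(retrieved)
    fallout = len(retrieved & non_relevant) / len(non_relevant)
    generality = len(relevant) / len(collection)
    return recall, precision, fallout, generality


S = set(range(1, 21))       # a collection of 20 documents
A = {1, 2, 3, 4, 5}         # documents judged relevant to the query
B = {1, 2, 4, 9, 10, 11}    # documents returned by the system

R, P, F, G = evaluate(S, A, B)
# Precision recomputed from recall, fallout and generality.
P_from_relation = (R * G) / (R * G + F * (1 - G))
print(R, P, F, G, P_from_relation)
```

For these sets the directly computed precision and the value obtained from recall, fallout, and generality agree, as the relationship requires.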
The problem of measuring recall: computing recall is a difficult part of evaluating an information retrieval system, because it requires manually judging the total number of relevant documents in the collection for every query (the problem of building the theoretical relevance table). Such judging is very expensive when the collection is large. To solve this problem the pooling method was introduced. The idea of pooling is to take only the first n documents of each returned list; n is called the pool depth. The theoretical relevance table is built with pooling as follows: the search is run on several systems, on the assumption that the documents ranked highly by a system are likely to be relevant; the top n documents returned by each system are then merged. Because each result set is ranked, precision and recall can be computed at every cut-off position i in the ranking.

The problem of the actual relevance table: for the computation above, relevance must be treated on two levels only: a document is either relevant or not relevant. This convention simplifies the evaluation. In practice, relevance is not limited to two levels and may have many.

2.2.3.3. Computing precision at 11 standard recall points

2.2.3.3.1. Graph of system retrieval performance

Each query executed by the system yields a specific recall (Ri) and precision (Pi). Each pair (Ri, Pi) corresponds to one point in the coordinate system ROP (R and P axes with origin O). Plotting the results for the whole query set on these axes gives a curve that describes the performance of the system. The curve has the following shape:

[Figure: performance curve of the system on the R-P axes]

From the graph we can conclude that recall and precision are roughly inversely related: when R increases, P is likely to decrease, and vice versa. Suppose we try to raise R by increasing the number of returned documents (N). By formula (1), as N grows the number of relevant documents returned is likely to grow, while the total number of documents relevant to the query in the standard relevance table stays constant, so R can increase. By formula (2), on the other hand, a larger N means more returned documents; the number of relevant documents among them grows too, but only slightly compared with the number returned, so P decreases. In other words, when the system returns more documents for a query, the result contains more useful documents, but the number of non-relevant documents (junk documents) also increases.

2.2.3.3.2. The recall-precision (RP) curve

The table of values for the RP curve is computed from the theoretical relevance table and the ranked list of documents returned by the information retrieval system (also called the actual relevance table). Consider the following example: an information retrieval system is tested with a set of questions. For question k, the computation is as follows. The relevant documents returned are the intersection of the theoretically relevant documents and the documents actually returned => total number of relevant documents returned: 5.

Table of R and P values computed after n returned documents:

 n   Doc ID   Relevant in theory?   Relevant docs returned   Docs returned   Recall (R)   Precision (P)
 1   588      true                  1                         1              1/5 = 0.2    1/1 = 1.00
 2   589      true                  2                         2              2/5 = 0.4    2/2 = 1.00
 3   576      false                 2                         3              2/5 = 0.4    2/3 = 0.67
 4   590      true                  3                         4              3/5 = 0.6    3/4 = 0.75
 5   986      false                 3                         5              3/5 = 0.6    3/5 = 0.60
 6   592      true                  4                         6              4/5 = 0.8    4/6 = 0.67
 7   984      false                 4                         7              4/5 = 0.8    4/7 = 0.57
 8   988      false                 4                         8              4/5 = 0.8    4/8 = 0.50
 9   578      false                 4                         9              4/5 = 0.8    4/9 = 0.44
10   985      false                 4                        10              4/5 = 0.8    4/10 = 0.40
11   103      false                 4                        11              4/5 = 0.8    4/11 = 0.36
12   591      false                 4                        12              4/5 = 0.8    4/12 = 0.33
13   772      true                  5                        13              5/5 = 1.0    5/13 = 0.38
14   990      false                 5                        14              5/5 = 1.0    5/14 = 0.36
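The table for question k can be reproduced with a short script. This is only a sketch: the function name is an assumption, and the ranked list and the set of relevant documents are read off the example above.

```python
def precision_recall_at_ranks(ranked_doc_ids, relevant_ids):
    """Return one (n, doc_id, recall, precision) row per rank position."""
    relevant_ids = set(relevant_ids)
    rows = []
    hits = 0  # relevant documents seen so far
    for n, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
        rows.append((n, doc_id, hits / len(relevant_ids), hits / n))
    return rows


# Ranked list returned for question k and its 5 theoretically relevant documents.
ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}

rows = precision_recall_at_ranks(ranked, relevant)
for n, doc_id, recall, precision in rows:
    print(f"{n:2d} {doc_id:4d} R={recall:.1f} P={precision:.2f}")
```

Each printed row corresponds to one row of the table: recall divides the running count of relevant documents by 5, precision divides it by the rank n.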

Looking at the table above, we see that at R = 0.6 there are two values of P (P = 0.75 and P = 0.6) and, conversely, at P = 1.0 there are two values of R (R = 0.2 and R = 0.4). To build the curve for one query we interpolate precision at the 11 standard recall points: consider R at the standard points 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0 and at each of them compute P by the following formula:

\[ P(R_i) = \max_{j \ge i} P(R_j) \]

This gives the following table of interpolated P values for question k:

 n   Doc ID   Recall (R)   Precision (P)   Standard recall   Interpolated precision
 1   588      1/5 = 0.2    1/1 = 1.00      0.0               1.00
 2   589      2/5 = 0.4    2/2 = 1.00      0.1               1.00
 3   576      2/5 = 0.4    2/3 = 0.67      0.2               1.00
 4   590      3/5 = 0.6    3/4 = 0.75      0.3               1.00
 5   986      3/5 = 0.6    3/5 = 0.60      0.4               1.00
 6   592      4/5 = 0.8    4/6 = 0.67      0.5               0.75
 7   984      4/5 = 0.8    4/7 = 0.57      0.6               0.75
 8   988      4/5 = 0.8    4/8 = 0.50      0.7               0.67
 9   578      4/5 = 0.8    4/9 = 0.44      0.8               0.67
10   985      4/5 = 0.8    4/10 = 0.40     0.9               0.38
11   103      4/5 = 0.8    4/11 = 0.36     1.0               0.38
12   591      4/5 = 0.8    4/12 = 0.33
13   772      5/5 = 1.0    5/13 = 0.38
14   990      5/5 = 1.0    5/14 = 0.36
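The interpolation rule can be sketched as follows. The function name is an assumption, and the convention of returning 0.0 when no observed recall reaches a level is a common choice rather than something stated in the text; the (R, P) pairs are those of question k.

```python
def interpolate_11_points(points):
    """points: (recall, precision) pairs observed down the ranked list.

    Returns the interpolated precision at recall 0.0, 0.1, ..., 1.0:
    the maximum precision observed at any recall >= the standard level.
    """
    levels = [i / 10 for i in range(11)]
    curve = []
    for level in levels:
        # The small tolerance guards against floating-point noise.
        reachable = [p for r, p in points if r >= level - 1e-9]
        curve.append(max(reachable) if reachable else 0.0)  # 0.0 is an assumed convention
    return curve


# (R, P) pairs for question k, one per returned document.
points_k = [
    (0.2, 1 / 1), (0.4, 2 / 2), (0.4, 2 / 3), (0.6, 3 / 4), (0.6, 3 / 5),
    (0.8, 4 / 6), (0.8, 4 / 7), (0.8, 4 / 8), (0.8, 4 / 9), (0.8, 4 / 10),
    (0.8, 4 / 11), (0.8, 4 / 12), (1.0, 5 / 13), (1.0, 5 / 14),
]

curve_k = interpolate_11_points(points_k)
print([round(p, 2) for p in curve_k])
# prints [1.0, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.67, 0.67, 0.38, 0.38]
```

The printed values match the "Interpolated precision" column of the table above.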

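RP graph for question k:

[Figure: interpolated RP curve for question k; axes: Recall, Precision]

2.2.3.3.3. RP curve for a query set

Consider a query set of N queries.
- For each query in turn, compute the interpolated RP table as above (P at the 11 standard points of R).
- Compute the average value of P at each standard point of R as follows:

\[ \bar{P}(R_i) = \frac{1}{N} \sum_{q=1}^{N} P_q(R_i) \]

where P_q(R_i) is the interpolated precision of query q at the standard recall point R_i.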
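Averaging over a query set is a point-wise mean of the per-query interpolated curves. The sketch below assumes every curve holds exactly 11 values; the function name and the second curve are hypothetical, only the first curve comes from question k.

```python
def average_curves(curves):
    """Average per-query interpolated precision at each of the 11 recall levels."""
    n_queries = len(curves)
    return [sum(curve[i] for curve in curves) / n_queries for i in range(11)]


curve_k = [1.00, 1.00, 1.00, 1.00, 1.00, 0.75, 0.75, 0.67, 0.67, 0.38, 0.38]
# Hypothetical curve for a second query, for illustration only.
curve_other = [1.00, 0.80, 0.80, 0.60, 0.60, 0.50, 0.50, 0.40, 0.30, 0.20, 0.10]

mean_curve = average_curves([curve_k, curve_other])
print([round(p, 3) for p in mean_curve])
```

The resulting 11 values are what would be plotted as the RP curve of the system for the whole query set.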

Remark: evaluating a system from the interpolated RP table does not measure its retrieval performance exactly, because the R and P values are interpolated values.

2.2.3.3.4. Evaluating information retrieval systems from the graph

We test two systems with the same sample document collection and the same sample query set. Suppose the graphs of the two systems are as follows:

[Figure: RP curves of systems A and B]

On the graph:
- Curve A represents the performance of system A.
- Curve B represents the performance of system B.
- Because curve A lies above curve B, system A performs better than system B.

In general, the closer a curve lies to the upper right corner of the coordinate system (where precision and recall are both at their maximum), the better the performance it represents. With this graphical representation we can evaluate several systems, or one system under different operating conditions.

2.2.3.4. Relevance between questions and documents

2.2.3.4.1. Types of relevance

The types of relevance are discussed in detail in [7].
- Binary relevance: relevance takes only two values, either relevant (1) or not relevant (0).
- Multiple-degree relevance (multiple degree relevance, multiple level relevance): relevance is judged on several levels and takes several values. Example with 3 levels:
  Relevant: 2
  Partially relevant: 1
  Not relevant: 0

2.2.3.4.2. Issues concerning relevance

The basis for evaluating an information retrieval system is:
- a representative set of documents
- a representative set of topics
- a few queries for each topic
- a table of relevance judgments of each document for each topic

The fundamental issue in evaluation is therefore to agree on what the degree of relevance means. Relevance is a multifaceted, multidimensional concept, and it remains a difficult problem in information science. Recent studies focus on the factors that influence relevance judgments and on the dimensions (or criteria) of relevance. There are several kinds of relevance: algorithmic relevance, topical relevance, cognitive relevance, situational relevance and motivational relevance.

Relevance is inherently subjective, and relevance judgments are often inconsistent because of personal and temporal factors:
- A document judged relevant to a certain degree by one person will be judged differently by another => relevance depends on the individual.
- A document judged relevant to a certain degree at time t will be judged differently at time t' => relevance depends on time. This variation is acceptable, however, because it is relatively small.

Most experiments on the evaluation of information retrieval systems (including those of TREC) use binary relevance: a document is judged either relevant (1) or not relevant (0). The advantage of binary relevance is that R and P are simple to compute; the disadvantage is that it cannot reflect the fact that, in reality, documents are relevant to different degrees.

In TREC evaluation, relevance is an absolute notion: a document is either relevant or not relevant. This assumption simplifies the computation of the measures. Many other evaluations use multiple degrees of relevance. Three-level relevance was used at NTCIR 1999 (NII-NACSIS Test Collection for IR systems) and in the WEB track of TREC-9. Four-level relevance was used at NTCIR 2000. The relevance of a document at position N in the list is discounted; this reflects the fact that the further down the list a document appears, the less valuable it is to the user: even if its degree of relevance is no lower, its overlap with the information in the documents above makes it less valuable.

Assuming that the relevance of a document is independent of the other documents is unrealistic in most cases. In most basic retrieval tasks, such as searching the web, looking for the answer to a particular question or looking for a few references, a user skimming the returned documents starts with the most visible ones (at the top of the list), so the relevance of documents further down depends on the documents already read. The chance that a document contains new information decreases towards the end of the list. This dependence is usually ignored in information retrieval studies.

Relevance judgments are also subjective: people often disagree about the degree of relevance. Relevance tables are therefore distinguished according to whether:
- the judgments were made by the author of the document or by someone else
- the judgments were made by one person or by a group of assessors
- the judgments were made under the same conditions or under different conditions

2.2.3.4.3. Evaluation with multiple-degree relevance (multiple degree relevance or non-binary relevance)

Among the experiments that collected multiple-degree relevance judgments, only a few actually demonstrated a benefit from the different levels. Recall (R) and precision (P) are the classical measures of IR performance and are usually computed from binary judgments. Multiple-degree judgments are therefore used only in the first step; the levels are then reduced to the two values 0 and 1 for evaluation.

Example: relevance is judged on 3 levels:
o relevant => denoted A
o partially relevant => denoted B
o not relevant => denoted C

The levels are reduced to two values in order to compute R and P. There are two ways to do this:
- A and B take the value 1 (relevant), C takes the value 0 (not relevant); or
- A takes the value 1 (relevant), B and C take the value 0 (not relevant).

To preserve the original level of each document, the relevance judgment file is formatted as follows:

topic-ID dummy doc-ID relevant assessment

where:
- topic-ID: the identifier of the topic
- dummy: a field recording the relevance level of the document (A, B or C)
- doc-ID: the identifier of the document
- relevant assessment: 0 or 1, the judgment after reduction to binary relevance

Another example, measuring document relevance on 4 levels:
o highly relevant
o fairly relevant
o marginally relevant
o irrelevant
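The reduction of the three-level judgments (A, B, C) to binary values and the four-field judgment line described above can be sketched as follows. The function names, the `strict` flag and the use of a single space as field separator are assumptions made for illustration; the thesis does not specify the separator.

```python
def to_binary(grade, strict=False):
    """Map a three-level grade to a binary judgment.

    Lenient (default): A and B count as relevant (1), C as not relevant (0).
    Strict: only A counts as relevant.
    """
    relevant_grades = {"A"} if strict else {"A", "B"}
    return 1 if grade in relevant_grades else 0


def judgment_line(topic_id, grade, doc_id, strict=False):
    """Build one line: topic-ID, original grade, doc-ID, binary assessment."""
    return f"{topic_id} {grade} {doc_id} {to_binary(grade, strict)}"


# Hypothetical judgments for topic 12.
print(judgment_line(12, "A", 588))               # 12 A 588 1
print(judgment_line(12, "B", 589))               # 12 B 589 1
print(judgment_line(12, "B", 589, strict=True))  # 12 B 589 0
print(judgment_line(12, "C", 576))               # 12 C 576 0
```

Keeping the original grade in the second field means the binary column can be regenerated under either convention without re-judging the documents.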

In recent conferences on the evaluation of information systems, however, binary relevance is still regarded as the standard; even when judgments are collected on several levels, they are often reduced to binary judgments to compute recall and precision. The drawback of this practice is that it cannot examine each specific level of relevance. Some researchers hold that measuring R and P from binary judgments should be avoided, because it ignores the variability and complexity of relevance and distorts its natural, realistic character. One solution is to generalize R and P.

Theory, experiment and research all show that the degree of relevance of documents clearly varies: some documents are more relevant, others less. It is hard to determine the degree of relevance during judging, and it also depends on the situation in which the system is evaluated.

2.2.3.4.4. Measuring recall (R) and precision (P) with multiple-degree relevance

Measurement based on recall (R) and precision (P) is the traditional method, but R and P are defined only for binary relevance. For multiple-degree relevance there are two options:
- reduce all levels to the two values 0 and 1 (that is, fall back to binary relevance) => according to Schamber this should be avoided
- generalize R and P

Generalized recall and generalized precision (generalized, non-binary recall and precision):

Let R be the set of n documents retrieved from the document database D = {d1, d2, ..., dN}, R ⊆ D, for a query belonging to some topic. Let r(di) be the relevance score of document di in the database. Generalized recall gR and generalized precision gP are computed as follows:

\[ gP = \frac{\sum_{d \in R} r(d)}{n} \]

\[ gR = \frac{\sum_{d \in R} r(d)}{\sum_{d \in D} r(d)} \]
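A minimal sketch of the generalized measures, assuming the relevance scores r(d) are given as a dictionary from document id to a value between 0 and 1, with documents missing from the dictionary scored 0. The function name and the scores are hypothetical.

```python
def generalized_precision_recall(retrieved, relevance):
    """Return (gP, gR) for one query.

    retrieved -- list of retrieved document ids (the set R, n = len(retrieved))
    relevance -- dict: document id -> r(d) in [0, 1]; unlisted documents count as 0
    """
    gained = sum(relevance.get(doc_id, 0.0) for doc_id in retrieved)
    g_precision = gained / len(retrieved)         # sum over R divided by n
    g_recall = gained / sum(relevance.values())   # sum over R divided by sum over D
    return g_precision, g_recall


# Hypothetical graded judgments for one query.
relevance = {588: 0.75, 589: 0.50, 590: 0.25, 592: 0.75}
gP, gR = generalized_precision_recall([588, 576, 590], relevance)
print(round(gP, 3), round(gR, 3))
```

With binary scores (every r(d) equal to 0 or 1) the same function reduces to ordinary precision and recall.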

This computation is analogous to traditional binary R and P. It likewise allows computing average R and average P over a query set, computing P as a function of R or at a cut-off on the number of returned documents, and drawing the PR curve.

Note: r(d) is a real number in the interval [0.0, 1.0].

Example with relevance level 4, computing r(d):
o Highly relevant: 3 => r(d) = 3/4
o Fairly relevant: 2 => r(d) = 2/4
o Marginally relevant: 1 => r(d) = 1/4
o Not relevant: 0 => r(d) = 0

2.2.4. TREC and evaluation by the TREC standard

2.2.4.1. What is TREC?

TREC stands for Text REtrieval Conference. It is held annually at the U.S. National Institute of Standards and Technology (NIST) [8]. TREC is a series of conferences that provides the infrastructure for large-scale testing and evaluation of retrieval technology (mainly text retrieval). TREC was created to encourage research on information retrieval technologies. Its main goals are:
- To encourage research in information retrieval based on large-scale test collections.
- To increase communication among industry, academia and government by providing an open forum for the exchange of research ideas.
- To support the transfer of technology from research laboratories into commercial products.
- To substantially improve retrieval methodologies on real-world problems, and the measures used in information retrieval.
- To create a series of test collections covering different aspects of information retrieval.
- To develop and make available appropriate evaluation techniques for use by industry and academia, including new evaluation techniques better suited to current systems.

The annual TREC conference cycle:

[Figure: the annual TREC cycle, with the stages: call for participation, task definition, document procurement, topic development, information retrieval experiments, relevance assessments, results evaluation, results analysis, TREC conference, proceedings publication]

TREC is organised into different focus areas called tracks. Each track concentrates on a sub-problem of text retrieval. The tracks invigorate TREC and keep it developing, because they provide:
+ Specialised test collections that support research in new areas.
+ Large-scale experiments that expose the problems a task runs into.
+ Evidence of how evaluation technology is progressing.

However, the set of tracks in a particular TREC depends on:
+ The response of the participants.
+ Whether the tasks proposed by TREC are appropriate.
+ The needs of the sponsors.
+ Constraints on data sources.

Evaluation by the TREC standard means building a test collection according to the TREC standard and judging relevance according to the TREC standard, together with evaluating results at the 11 standard recall points. Because relevance judging and the 11-point evaluation have been presented in the previous sections, the next section mainly describes how TREC builds its collections.

2.2.4.2. How TREC builds its collections

As mentioned, TREC evaluates information retrieval systems under the system-oriented model. Under this model the evaluation must carry out the tasks described in section 2.2.2. Building the test collection is the most important of them, and TREC does it very well: its collections are very large and realistic. In addition, the evaluation is open to many research groups, and the number of TREC participants has grown rapidly over the years. A significant number of groups take part each year, which ensures stability and comparability across years.

TREC builds a test collection in three parts: the document set (sample corpus), the query set, and the standard relevance judgment table. For each part TREC defines good standards for construction and format. We therefore also format our collection according to the TREC standard. The following describes how TREC builds its collections [9].

2.2.4.2.1. Building the document set

Depending on the purpose and needs of the evaluators, a specific set of documents is chosen. The set must be a sample of the kinds of text of interest. Its genre, size, and whether it consists of full texts or abstracts must all be considered. It is also important that the chosen set reflects diversity of subject matter, word choice, style and form.

The document set usually has to be very large. The main TREC collection contains 3 gigabytes of text (over 1,000,000 documents). The documents used in the different tracks are smaller or larger depending on the needs of the track and the data available. The main TREC document sets consist mostly of newspaper and newswire articles; there are some other documents, but very few.

High-level structures in each document are tagged with SGML, and each document is assigned a unique identifier tag called DOCNO (document number). To preserve authenticity, the text is kept as close to the original as possible. Consequently, no attempt is made in the TREC collections to correct spelling errors, sentence fragments, strange formatting around tables, or similar faults.

2.2.4.2.2. Building the topics

TREC distinguishes a statement of information need (topic) from the data structure that is actually given to a retrieval system (query). The TREC collections provide topics so that a wide range of query construction methods can be evaluated; a topic also includes a clear statement of the criteria that make a document relevant to it. A topic statement generally consists of four parts: an identifier (number), a title, a description and a narrative.

The identifier distinguishes topics from one another. The description states the content of the topic title more precisely. The narrative is written as the standard for deciding which documents are truly relevant to the topic. The different parts of a TREC topic also let researchers study the effect of query length on retrieval performance. Several different queries can be created from one topic, and it is these queries that are used for searching.

TREC normally creates 50 new topics each year. Topics are created by the assessors. They write candidate topics on any subject that interests them and send them to NIST. NIST then searches with the candidate topics using the TREC PRISE retrieval system. Finally, NIST selects the topics whose number of returned results is close to the target number of relevant documents, and the selected topics must be divided evenly among the assessors.

2.2.4.2.3. Building the standard relevance judgment table

The standard relevance judgment table lists the topics and the documents that are truly relevant to each topic. After the retrieval systems have been run, the assessors use this table to determine which returned documents are truly relevant to which topic. TREC almost always uses binary relevance judgments (a document is either relevant to the topic or not).

To decide whether a document is truly relevant, the assessor consults the narrative of the topic, which describes in detail what kind of document counts as relevant. The assessor then marks each document as relevant or not relevant.

Example: for a topic on the knowledge economy, the narrative reads: relevant documents are those that discuss the knowledge economy, what a knowledge economy is, and the impact of the knowledge economy in countries around the world.

Manual judging by people is not feasible when the number of documents is very large, as it is for TREC. TREC therefore uses the pooling method, described above, to build the standard relevance judgment table. Participants who register their retrieval systems with NIST must run the NIST topics on their own systems. For each topic, the result is a ranked list of documents from the test collection. NIST selects a number of these results and merges them; the more systems are selected, the more accurate the standard relevance judgment table. From each run, the top X documents (usually X = 100) are added to the pool, the list of documents to be judged for each topic. Many documents appear in the top X of more than one run, so the pool is usually smaller than the theoretical size X * number_of_selected_result_lists.

A test collection built by pooling is not entirely fair to systems that did not contribute to the pool, but the method gives reliable results: the recall and precision values differ little from those obtained with exhaustive manual judging.

2.3. The Vietnamese corpus

We build a Vietnamese test collection and a program for evaluating information retrieval systems, in particular Vietnamese retrieval systems, so a discussion of the Vietnamese corpus is indispensable.

Unlike English and French (the languages most commonly handled by popular information retrieval systems), Vietnamese has its own characteristics, especially in the identification of words.

2.3.1. Words

2.3.1.1. The notion of a word

According to [10], which synthesises general linguistics textbooks, grammars and lexicology texts, some typical definitions of a word are:
- A word is a minimal free form.
- A word is a linguistic unit with two sides: sound and meaning.
- A word can stand alone syntactically when used in speech.
- A word is the smallest meaningful unit of a language that is used independently and reproduced freely in speech to build sentences. This is the definition most often used in general linguistics.

From these definitions we derive the main characteristics of a word:
- Form: a word must be a single block in its structure (orthography, phonetics).
- Content: a word must have a complete meaning.
- Ability: a word can function freely and independently in syntax.

General linguistics also uses several other terms that S.E. Jakhontov [11] introduced for identifying words:
1. Phonetic word: a unit unified by some phonetic phenomenon. In Vietnamese these are the syllables, also called "tiếng".
2. Orthographic word: the stretch between two blanks in writing, that is, a unit written as one block. In Vietnamese this is the written syllable ("chữ").
3. Complete word: a stable structure whose components cannot be separated or permuted.

4. Lexicographic word: a unit which, by its meaning, must be given its own entry in a dictionary.
5. Inflectional word: a unit that always consists of two parts: a stem (expressing the object meaning) and an affix (expressing the relation to other words in the sentence). This is also called the grammatical word.

For automatic processing by computer, the orthographic word and the dictionary word are the two kinds that are easiest to identify, and they are the ones used most in this document.

2.3.1.2. The notion of a morpheme

In traditional grammar the morpheme is regarded as the immediate constituent of a word. It is therefore treated as the basic cell of grammar and is also called a word element. Identifying morphemes must accordingly be the first step in identifying words. To identify morphemes, Jakhontov proposes analysing a sentence down to its minimal units, called sentence-words or morphemes; linguists also commonly use Greenberg's square method for comparison. Example: comparing "c l" with "c" yields three morphemes.

First let us recall the notion of the morpheme in general linguistics: according to Baudouin de Courtenay a morpheme is the smallest meaningful part of a word, while according to Bloomfield a morpheme is the smallest meaningful linguistic unit. The view most often found in general linguistics is that a morpheme is the smallest linguistic unit that has meaning and/or grammatical value (function).

A word is formed from one morpheme or from several morphemes combined according to definite rules. Example: anti-virus. Morphemes are of two kinds: free morphemes (such as work, home) and bound morphemes (such as -ed, -less). Bound morphemes comprise inflectional morphemes (as in work-ed) and derivational morphemes (as in home-less).

2.3.1.3. The notion of word formation

Words are formed from morphemes.

Example: anti + poison = antipoison.

A morpheme is the smallest linguistic unit that has meaning and/or grammatical value (function). Morphemes are of the following kinds:
- Free morpheme: appears on its own as an independent word, for example: house, man, black, nhà, người, đen...
- Bound morpheme: appears in a word only together with, and dependent on, another morpheme; this kind includes inflectional and derivational morphemes. Example: -ing, -ed, -s, -ness, ...

Words are formed in the following ways:
- Using a single morpheme.
- Combining two or more morphemes.
- Adding affixes (prefixes, infixes, suffixes).
- Reduplication.

2.3.2. Word boundaries

Word boundary identification, also called word segmentation, is a prerequisite step for most natural language processing systems. In inflecting languages (English, Russian, ...) word boundaries are determined mainly by spaces and punctuation, whereas in isolating languages (including Vietnamese) a space cannot serve as the criterion for identifying a word. To determine word boundaries in these languages we must rely on higher-level information such as morphology, word formation, syntax, semantics and even pragmatics.

Chapter 3: DESIGN AND IMPLEMENTATION

3.1. Building the test collection

As presented in the previous sections, we build the test collection according to the TREC standard. The construction has the following parts:

3.1.1. Building the Vietnamese corpus

We build the corpus by collecting documents from online newspapers such as www.tuoitre.com.vn, www.thanhnien.com.vn and www.vnexpress.net. The documents cover many different fields, including science and technology, economics, education, culture and current affairs. Our corpus currently holds nearly 15,000 documents, with a storage size of 34 MB. These documents are in raw form and not yet normalised, so the most important step is corpus normalisation.

3.1.1.1. Corpus normalisation

Corpus normalisation means bringing the corpus into a single form and a single standard. It comprises the following tasks:

3.1.1.1.1. Normalising the form of the corpus

Character-level normalisation: bringing the text into the proper electronic form, the proper file format (the various source formats are converted to txt, and everything that is not text is removed) and the proper character encoding (converted to Unicode).

File normalisation: each corpus file contains a number of sentences (about 2000 words); each sentence lies on one line, with a hard line break after the sentence-final punctuation. Each sentence begins with an identifier code giving information about the text: language (English, Vietnamese, French, Chinese, ...), field of the file (literature, computing, economics, sport, ...), subgenre (for example, literature includes short stories, novels, poetry, memoirs, ...), the sentence number (position of the sentence in the file) and the text number (position of the text within the subgenre or field).

Spelling normalisation: handling orthographic variants such as the rule for placing tone marks (for Vietnamese: on the main vowel following the aesthetic principle, or on the main vowel following the phonological principle) and spelling variants such as i/y in Vietnamese ("hoá lý" versus "hóa lí").

3.1.1.1.2. Formatting the corpus

After collecting the corpus and normalising it to text form, we convert all of it to XML with an identifier tag like the DOCNO of TREC. Our identifier tag is DOCID. We convert to XML because this makes it easy to convert the corpus into the input format of every retrieval system. Our corpus format is given in the appendix. We have a program that converts text to XML in the same format as our documents; it is very useful for adding new documents to the corpus.

3.1.2. Building the Vietnamese question set

The question set is built by skimming a number of documents and writing a set of X questions. We then format the questions according to the TREC standard: each question must have an identifier, a title, a description and a narrative. The questions are also formatted in XML. Next, we run the X questions on the retrieval systems and review the results to select the most suitable questions, which form the official set of Y questions. Y is therefore always less than or equal to X.

Both the questions and the Vietnamese corpus must be word-segmented when retrieval systems built for English are evaluated on Vietnamese, because an English system cannot otherwise be used on Vietnamese text. We therefore also build a word segmentation program for the Vietnamese corpus.

3.1.3. Vietnamese word segmentation

To identify word boundaries we use several models: MM (Maximum Matching, forward or backward), that is, LRMM (left to right) and RLMM (right to left), and the MMSEG method (Maximum Matching Segmentation).

With LRMM, to segment a Vietnamese phrase or sentence we move from left to right and choose the word with the most syllables that is present in the dictionary, then continue in the same way with the next word until the end of the sentence. In this way phrases such as the following are segmented correctly: "hợp tác xã | mua bán"; "thành lập | nước | Việt Nam | dân chủ | cộng hoà".

RLMM works in the opposite direction: we move from right to left and choose the word with the most syllables that is present in the dictionary, then continue with the next word until the sentence is exhausted.

MMSEG combines LRMM and RLMM and therefore gives better results than either method alone.

3.1.4. Building the judgment table

We build the judgment table by the pooling method, running several different systems. We studied the following systems:
- SMART [12], developed at Cornell University, a classic system based on the vector model.
- XIOTA [13], a system that accepts corpora formatted in XML, developed in France.
- Terrier [14], from the University of Glasgow, Scotland. This system has been used to run the Terabyte and Robust tracks of TREC.
- Lucene [15], developed by the Apache Jakarta group, a widely used search engine.

- The Vietnamese search system Search4Vn, developed by a thesis group of the 2001 class for Vietnamese information retrieval.

Most of these systems were built for English, however, so their character handling does not support Vietnamese (even though Vietnamese is encoded in Unicode). To run them on Vietnamese we had to change the encoding handling in the search programs. The systems are written in many different languages, including C on Linux, BASH shell, Java, JSP and .NET, and the documentation of their source components is incomplete, so we could hardly modify all of them. We did our best to read the source code and adapt it, but we could run only a few of the systems.

After running the systems, we merged their relevance tables into one standard relevance table. We then read through it again to produce the final standard relevance judgment table.

3.1.4.1. The SMART system

3.1.4.1.1. Introduction to SMART

SMART is an information retrieval system based on the vector model proposed by Salton in the late 1960s. Its primary purpose is to provide a framework for building retrieval, indexing and retrieval evaluation. Its secondary purpose is to give end users a small-scale retrieval system suited to their needs.

SMART has its strengths and weaknesses. It is designed to be very flexible: it allows code to be added and modified, and it runs on any UNIX system with modest memory requirements.

3.1.4.1.2. The retrieval process of SMART

SMART works through 4 procedures:
1. Automatic indexing: extracting and identifying the information elements, that is, the words or phrases (terms), of the documents and the queries.
2. Document classification: grouping related documents into classes with a common topic. This lets the system find more documents on similar topics and also speeds up processing (see the explanation below).
3. Determining the documents to return, by computing the similarity between the stored information elements and the information elements extracted from the newly entered query, and sorting the results by decreasing similarity. For this step SMART uses the vector model.
4. Improving the search statement (query): rebuilding the query from information taken from the results of the previous retrieval.

3.1.4.1.3. The vector model of SMART

In this model each document is characterised by a vector over a set of terms. The set of terms is determined by the indexing process. That is, each document DOCi is defined by the terms TERM1, TERM2, ..., TERMt. (A term here can be understood more broadly as an information element, since it may be a word or phrase extracted from the documents or taken from a thesaurus.)

A set of documents DOC1, DOC2, ..., DOCn can be represented as a matrix in which each row is a document and each column an information element of the documents:

          TERM1     TERM2     ...   TERMt
DOC1      TERM11    TERM12    ...   TERM1t
DOC2      TERM21    TERM22    ...   TERM2t
...
DOCn      TERMn1    TERMn2    ...   TERMnt

TERMij is called the weight of information element TERMj in document DOCi; it is the frequency of TERMj in DOCi. TERMij = 0 means that TERMj does not occur in DOCi.

Queries submitted to the system are likewise represented as vectors with t components over the terms of the documents. The components of a query vector, however, are not weights but binary values:
- 0: the word or phrase of the query is not in the set of information elements of the documents
- 1: the word or phrase of the query is in the set of information elements of the documents

Geometric representation of the document vectors: the collection consists of n documents DOC1, DOC2, ..., DOCn and t information elements TERM1, TERM2, ..., TERMt. In the vector model a document is a vector in t-dimensional space, so there are n document vectors:

DOC1 (TERM11, TERM12, ..., TERM1t)
DOC2 (TERM21, TERM22, ..., TERM2t)
...
DOCn (TERMn1, TERMn2, ..., TERMnt)

The cosine of the angle between two document vectors DOCi and DOCj is computed by the following formula:

\[ \cos(DOC_i, DOC_j) = \frac{\sum_{k=1}^{t} TERM_{ik} \cdot TERM_{jk}}{\sqrt{\sum_{k=1}^{t} TERM_{ik}^2} \cdot \sqrt{\sum_{k=1}^{t} TERM_{jk}^2}} \]

The smaller the angle between DOCi and DOCj, the closer the two vectors are, that is, the closer the weights of the information elements in the two documents => DOCi and DOCj share the same topic. This observation leads to the notion of similarity.

The similarity of two documents is the cosine of the angle between their vectors DOCi and DOCj. This is how the system decides how to classify documents.

The similarity between a document and a query is defined in the same way. Consider a specific query Qj, represented as the vector:

Qj (QTERMj1, QTERMj2, ..., QTERMjt)

The vector Qj lies in the same t-dimensional space as the documents. The similarity of the query to document DOCi is the cosine of the angle between Qj and DOCi:

\[ \cos(DOC_i, Q_j) = \frac{\sum_{k=1}^{t} TERM_{ik} \cdot QTERM_{jk}}{\sqrt{\sum_{k=1}^{t} TERM_{ik}^2} \cdot \sqrt{\sum_{k=1}^{t} QTERM_{jk}^2}} \]
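The cosine measure for document pairs and for a document and a query can be sketched as follows. The function names, the three-term vocabulary and the weights are illustrative assumptions; returning 0.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an empty vector
    return dot / (norm_u * norm_v)


def rank_documents(query, documents):
    """Return (name, similarity) pairs sorted by decreasing similarity to the query."""
    scored = [(name, cosine(vector, query)) for name, vector in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection over t = 3 terms; document weights are term frequencies.
documents = {
    "DOC1": [2, 0, 1],
    "DOC2": [0, 3, 0],
    "DOC3": [1, 1, 1],
}
query = [1, 0, 1]  # binary query vector

ranking = rank_documents(query, documents)
for name, score in ranking:
    print(name, round(score, 4))
```

Sorting by decreasing cosine is the ordering described for procedure 3 of SMART: the documents whose vectors point closest to the query vector come first.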

Because the components of the vectors Qj and DOCi are always greater than or equal to 0, cos θ >= 0, so θ lies in the interval [0, π/2]. Since the cosine function is decreasing on [0, π/2], the larger cos θ is, the smaller θ is. In other words, the closer two vectors are, the greater the similarity, and the more the content of DOCi is related to the request expressed by Qj.

3.1.4.1.4. Using the vector model

Document classification: documents can be classified by computing the similarity between document vectors; documents with close similarity are placed in the same class.

The purpose of classifying documents is to create a cluster document file. An example of a cluster file is shown below:

[Figure: document vectors (x) grouped into circles (classes), each with its centroid (o)]

Each point x denotes a document vector. The distance between two points x is inversely related to similarity (the greater the distance between two points, the smaller the similarity between the two documents, and vice versa). Each circle represents a class of documents. To characterise a class, a special vector called the centroid vector is defined; it acts like the centre of gravity of the points x in the class and is shown in the figure as o.

Computing the centroid vector

Suppose m documents belong to class p. The centroid vector of class p is:

CENTROIDp = (CTERMp1, CTERMp2, ..., CTERMpt)

where

\[ CTERM_{pk} = \frac{1}{m} \sum_{i=1}^{m} TERM_{ik} \]

with TERMik the weight of term k in document i of class p.
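The centroid formula is a component-wise mean of the document vectors of a class. A minimal sketch, with a hypothetical function name and made-up weights:

```python
def centroid(doc_vectors):
    """Component-wise mean of the m document vectors of one class.

    doc_vectors -- non-empty list of equal-length term-weight vectors
    """
    m = len(doc_vectors)
    t = len(doc_vectors[0])
    return [sum(vector[k] for vector in doc_vectors) / m for k in range(t)]


# Hypothetical class p with m = 2 documents over t = 3 terms.
class_p = [
    [2, 0, 1],
    [4, 2, 1],
]
print(centroid(class_p))  # [3.0, 1.0, 1.0]
```

When a document is added to a class, the centroid of that class has to be recomputed from the enlarged list of vectors.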

Purpose of the centroid vector: Each query is first compared with the centroid vectors, so that similarity is computed between the query vector and each centroid instead of against every document vector. If the similarity is high (meaning the class is a suitable one), the query vector is then compared with the document vectors of the class that the centroid represents. Documents with high similarity are retrieved.

Suppose the document collection contains n documents divided into x classes (so there are x centroid vectors), each class holding about n/x documents. The query is compared with the centroid vectors x times. After these x comparisons, the centroid with the highest similarity is selected and the query is compared with the n/x documents of the class it represents. The total number of comparisons is

x + n/x (*)

Without a cluster file (that is, without centroid vectors), a query must be compared with all n documents, for a total of n comparisons. Applying Cauchy's inequality to (*):

$$x + \frac{n}{x} \ge 2\sqrt{x \cdot \frac{n}{x}} = 2\sqrt{n}$$

Equality holds when $x = n/x$, that is, when $x = \sqrt{n}$. The minimum number of comparisons is therefore $2\sqrt{n}$, reached when the number of clusters in the collection is $x = \sqrt{n}$.

For a large collection whose documents cover many different, heterogeneous topics, the number of clusters (classes) is large, and so is the number of comparisons between the query vector and the centroid vectors. To handle this case, the same similarity computation is applied once more, this time between centroid vectors, in order to classify the set of centroids. In short, computing similarity between documents to classify them yields one representative vector per class, the centroid vector; likewise, classifying the centroid vectors yields one representative vector per class of centroids, called the supercentroid vector, and that class is called a superclass. Document search is therefore carried out in 3 steps:
i. First, compare the query vector with the supercentroid vectors of the superclasses.
ii. Next, compare the query with the centroid vectors of the superclasses that satisfy step 1.
iii. Finally, compare the query with the document vectors of the classes whose centroid vectors satisfy step 2.
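The comparison count x + n/x for the single-level case can be checked numerically. The Java sketch below is only an illustration: it assumes classes of equal size (rounded up when x does not divide n) and finds the best cluster count by brute force; the names are hypothetical.

```java
final class ClusterCost {
    /**
     * Comparisons needed for one query with x clusters over n documents:
     * x centroid comparisons plus one class of about n/x documents.
     */
    static long comparisons(long n, long x) {
        if (n <= 0 || x <= 0 || x > n) {
            throw new IllegalArgumentException("require 1 <= x <= n");
        }
        return x + (n + x - 1) / x; // x + ceil(n / x)
    }

    /** Smallest cluster count that minimizes comparisons(n, x). */
    static long bestClusterCount(long n) {
        long best = 1;
        for (long x = 2; x <= n; x++) {
            if (comparisons(n, x) < comparisons(n, best)) {
                best = x;
            }
        }
        return best;
    }
}
```

For n = 10000 the minimum is reached at x = 100 with 200 comparisons, in agreement with the bound 2√n, against 10000 comparisons without clustering.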

An example of the structure of a cluster file: Cluster files must be organized to accommodate the growth of the document collection, because a new document added to the collection is compared for similarity in the same way as a query. The items of the document are compared with the existing supercentroids and centroids, and the document is added to the appropriate clusters, namely those with which its similarity is high. The system must then recompute the supercentroid and centroid vectors of the clusters that have just received the new document.
[Figure: layout of the cluster file]
SUPERCENTROIDk: SCTERMk1, SCTERMk2, ..., SCTERMkt; CENTROIDPOINTERi, ..., CENTROIDPOINTERj
CENTROIDi: CTERMi1, CTERMi2, ..., CTERMit; DOCPOINTERi1, DOCPOINTERi2, ...
CENTROIDj: CTERMj1, CTERMj2, ..., CTERMjt; DOCPOINTERj1, DOCPOINTERj2, ...
DOCi1: TERMi11, TERMi12, ..., TERMi1t
DOCi2: TERMi21, TERMi22, ..., TERMi2t
DOCj1: TERMj11, TERMj12, ..., TERMj1t
DOCj2: TERMj21, TERMj22, ..., TERMj2t

Determining the relevant documents to return

Query improvement: The query improvement procedure of the SMART system is also known as relevance feedback. The user gives the system relevance judgments on the documents retrieved for the previous query, and the system uses them to build a new query vector. The aim of this process is to construct a new query that performs better. The procedure runs as follows:
i. Terms that appear in documents the user has marked as relevant are added to the original query vector, or the weights of these terms are increased.
ii. Terms that appear in documents the user has marked as not relevant are removed from the original query, or their weights are decreased.
The query improvement procedure is executed automatically from the user's feedback. It can be repeated several times to find the best query.

3.1.4.2. The Search4Vn system
This system also uses the vector space model for searching. Its purpose is to retrieve information in Vietnamese. The word segmentation model used is Longest Matching. The system is written in C#.

3.1.4.3. The TERRIER system
This system also uses the vector space model for searching. The search results include TF and IDF so that experts can tell whether the system's retrieval model works well. However, the system does not yet support Unicode, and the classes written for Unicode belong to a standard library (the antlr library for Java), so transcoding is very difficult. Transcoding would require changing how the system's program is written. To date, we have not been able to adapt it to search Vietnamese. The system is written in Java and JSP and is therefore platform independent.

3.1.4.4. The X-IOTA system
XIOTA, or the IOTA system, is an open XML framework for information retrieval experiments. Because it uses XML, XIOTA is very flexible in processing corpora and supports rapid implementation of many different experimental components that use automatic natural language processing. XIOTA also searches according to the vector model. X-IOTA is written in C++ and runs on Linux. However, the system is still experimental, so its source code is not yet stable enough for information retrieval.

3.1.4.5. The LUCENE system
This system also uses the vector space model for searching. It is written in Java. Lucene is an open-source system, a search tool for which users can develop their own search interface. We added an interface and modified the Lucene code so that it can search Vietnamese.

3.2. Analysis of the system for evaluating information retrieval systems
3.2.1. Description of the evaluation support system
3.2.1.1. Problem statement
As mentioned earlier, we evaluate the results returned by information retrieval systems (IR systems for short) using the system-oriented model. For the evaluation to be clear, visual and, above all, automated, a system that supports the evaluation of IR systems is required. The evaluation support system consists of a program that supports automatic evaluation of retrieval systems and a test collection used for the evaluation.

3.2.1.2. Objectives
The evaluation support program makes it possible to run any information retrieval system and to inspect how it works. To run a search on an arbitrary IR system, the program must be able to convert its own test collection into a collection that the retrieval system can understand and search. Inspecting how a retrieval system works mainly concerns the indexing of questions and documents: the evaluator can see how the system builds its index and compare the indexing of different systems. The most important part the program must support, however, is computing the performance of the IR systems, to determine whether a retrieval system is good or not. Performance is computed from the recall and precision of the results the retrieval system returns. The performance of each system, and the comparison between systems, are shown as graphs so that the evaluator can easily judge the retrieval capability of one system and compare several systems with one another.

3.2.1.3. Scope
The evaluation system supports only IR systems for which:
- the result files and index files are in XML form;
- the test data (document set and question set) may be in XML or text file form.

3.2.1.4. Functions
- Format the program's document set and question set to match the structure required by the IR system.
- Run an IR system (provided that the IR system has an executable file).
- View how the IR system works (for example, how it indexes the test collection).
- Interpret the results returned by the IR system, then compute and evaluate the IR system.
- View the evaluation results of a specific system.
- Compare IR systems.
- View the graph of the normalized RP curve.

3.2.1.5. Usability
- Compatible with, and runnable on, both Windows and Linux.
- Easy-to-use user interface that supports formatting, watching the retrieval system run, and visual evaluation through graphs.

3.2.1.6. Performance
- Able to format a large test collection quickly.

3.2.1.7. Security (none)

3.2.2. Analysis of the evaluation system
3.2.2.1. System functions
The evaluation support system has the following main functions:
- Evaluate the query results of one IR system.
- Compare the performance of several IR systems.


3.2.2.2. Required functions
3.2.2.2.1. Evaluating one IR system

3.2.2.2.2. Comparing several IR systems

3.2.2.2.3. Use case diagram

[Use case diagram. Actors: NguoiSuDung (user), He thong IR (IR system), Tap du lieu kiem tra (test data set). Use cases: Dinh dang tai lieu, Dinh dang cau hoi, Thuc thi he thong IR, Dinh dang ket qua, Dinh dang index file, Xem ket qua danh gia, So sanh nhieu he thong IR]

Use case descriptions:
- Dinh dang tai lieu (format documents): lets the user convert the structure of the program's document set into the document structure of the IR system.
- Dinh dang cau hoi (format questions): lets the user convert the structure of the program's question set into the question structure of the IR system.
- Thuc thi he thong IR (run the IR system): runs an external IR system.
- Dinh dang ket qua (format results): lets the user convert the structure of the IR system's result file into the result file structure defined by the program, and processes the result information in order to evaluate the IR system.
- Dinh dang index file (format index file): lets the user convert the structure of the IR system's index file into the index file structure defined by the program.
- Xem ket qua danh gia (view evaluation results): lets the user view the evaluation results of an IR system.
- So sanh nhieu he thong IR (compare several IR systems): compares several IR systems with one another.

3.2.2.2.4. Sequence diagrams of the use cases
Dinh dang tai lieu:

[Sequence diagram. Participants: NguoiSuDung, TH_DDTaiLieu, XL_Doc, XL_XML, XL_Text, LT_XML, LT_Text. Messages: open the screen; enter the format information; request conversion to XML; format the documents; convert to XML; write the XML file; request conversion to text; convert to text; request text formatting; write the text file]

Dinh dang cau hoi:

[Sequence diagram. Participants: NguoiSuDung, TH_DDCauHoi, XL_Topic, XL_XML, XL_Text, LT_XML, LT_Text. Messages: open the screen; enter the format information; request conversion to XML; format the questions; convert to XML; write the XML file; request conversion to text; format the questions as text; convert to text; write the text file]

Thuc thi he thong IR:

[Sequence diagram. Participants: NguoiSuDung, TH_ThucThiHT, TH_DKThucThi, XL_HeThongIR, He thong IR. Messages: open the screen; request to run the IR system; check whether the test data set is ready. If not ready: ask for the storage location of the data set; enter the storage location; copy the data set to the requested location; run the system; display the IR system. If ready: run the system; display the IR system]

Dinh dang ket qua

[Sequence diagram. Participants: NguoiSuDung, TH_DDKetQua, XL_KetQua, XL_XML, LT_XML. Messages: open the screen; enter the format information; request formatting; format the IR results; get the data of the result file; read the result file; result information; create a result file in the program's structure; write the XML file]

Dinh dang index file:

[Sequence diagram. Participants: NguoiSuDung, TH_DDIndex, XL_Index, XL_XML, LT_XML. Messages: open the screen; enter the format information; request conversion; convert the document index file; get the document index file information; document index file information; create the document index file in the program's structure; write the XML file; convert the question index file; get the question index file information; question index file information; write the XML file]

Xem ket qua danh gia

[Sequence diagram. Participants: NguoiSuDung, TH_Kq_DanhGia, TH_XemChiTiet, TH_DoThi_HeThong, XL_Topic, XL_KetQua, XL_Doc, XL_Index, XL_HeThongIR, XL_XML, XL_DoThi, LT_XML. Messages:
- Listing systems: open the screen; request information on the evaluated systems; get the requested tag contents from the XML file; read the system file; list of systems; display the list of systems.
- Listing questions: choose the system to view; get the list of tested questions; get the requested tag contents; read the evaluation file; list of questions.
- Viewing one question: view the information related to a question; get the documents related to that question and the evaluation results; get the requested tag contents; read the evaluation file; related documents and evaluation information; display the evaluation results.
- Viewing details: request the detail view; open the detail screen; get the question content (read the corresponding question file); get the content and similarity of the related documents (read the corresponding document file); get the index information of the document and the question (read their index files); display the information relating the question to a document.
- Viewing the graph: request the system graph; get the precision at the 11 standard recall points; get the requested tag contents; read the system file; draw the system graph; display the RP curve on the graph]

So sanh nhieu he thong IR

[Sequence diagram. Participants: NguoiSuDung, TH_SoSanhHT, XL_HeThongIR, XL_DoThi, XL_XML, LT_XML. Messages: open the screen; get the list of systems; get the requested tag information; read the system file; requested information; list of IR systems; display the IR systems; choose the IR systems to compare; get the standard R and P values of the systems; get the requested tag information; read the system file; requested information; R and P at the 11 standard points; request the graph; display the RP curves on the graph]

3.3. Design of the evaluation system
3.3.1. Program functions

[Function diagram: Evaluate several IR systems; Evaluate one IR system; Run the IR system; Process returned results; Format the test database; Format the result set; Format the index file]

3.3.1.1. Formatting the document database
The program must build the database used for testing IR systems. This database consists of the documents and the set of queries. (There is also a standard relevance judgment table, used to compare an IR system with reference IR systems.) With this function, the program lets users declare the data format, for both documents and queries, that their IR system requires. Based on this format, the program generates a data set whose content is the program's own data set but whose structure is that of the IR system.

3.3.1.2. Formatting returned results
After running all the queries of the test database, the IR system sends its results back to the program. Each IR system formats its returned results differently; the result the program is interested in is the result file that records the relevance of each question to the document set.

This function records the result format information in order to create a result file with the program's structure, and records the result information of the IR system in order to build the actual relevance table (supplied by the IR system under evaluation).

3.3.1.3. Formatting the index file
The IR system stores indexing information for documents and for questions. To help users assess the indexing of the IR system, the program displays this index information. This function therefore lets users declare the structure of the index file so that the program can extract the information.

3.3.1.4. Running the IR system
Invokes the IR system.

3.3.1.5. Processing returned results
Based on the standard judgment table (the theoretical relevance table) and the relevance table obtained from the system, this function computes recall, precision, and the precision values at the 11 standard recall points, as well as average recall and average precision.

3.3.1.6. Evaluating one IR system
Based on the returned results, configured in the program's format, the program computes the performance of the system from recall and precision.

3.3.1.7. Evaluating several IR systems
Based on the stored results of each system it has evaluated, the program compares the systems.
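The precision values at the 11 standard recall points can be derived from a ranked result list as sketched below. The thesis text here does not spell out the interpolation rule, so this Java sketch assumes the common convention: the precision at recall level r is the highest precision observed at any rank whose recall is at least r. The class name and signature are hypothetical.

```java
import java.util.List;
import java.util.Set;

final class ElevenPoint {
    /**
     * Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.
     * ranked:   document ids in the order returned by the IR system
     * relevant: document ids judged relevant for the question
     */
    static double[] interpolatedPrecision(List<String> ranked, Set<String> relevant) {
        double[] result = new double[11];
        if (relevant.isEmpty()) {
            return result; // no relevant documents: every level stays 0
        }
        int n = ranked.size();
        double[] recall = new double[n];
        double[] precision = new double[n];
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
            recall[i] = (double) hits / relevant.size();
            precision[i] = (double) hits / (i + 1);
        }
        for (int level = 0; level <= 10; level++) {
            double r = level / 10.0;
            double best = 0.0;
            for (int i = 0; i < n; i++) {
                // small tolerance so that recall 0.5 counts for level 0.5
                if (recall[i] >= r - 1e-12 && precision[i] > best) {
                    best = precision[i];
                }
            }
            result[level] = best;
        }
        return result;
    }
}
```

Averaging these 11 values over all questions of a run would give one curve per system, which is the kind of data an RP graph needs.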

3.3.2. System design
3.3.2.1. Overall architecture diagram

[Architecture diagram with three layers:
Presentation layer: TH_Main, TH_DDTaiLieu, TH_TTTaiLieu, TH_DDCauHoi, TH_TTCauHoi, TH_DKThucThi, TH_DDKetQua, TH_ThucThiHT, TH_DDIndex, TH_KqDanhGia, TH_DoThi_HeThong, TH_XemChiTiet, TH_SoSanhHT
Processing layer: CFormat, CDocument, CHeThongIR, CTopic, CKetQua, CRelevant, CIndex, XL_Doc, XL_Topic, XL_HeThongIR, XL_XML, XL_Text, XL_KetQua, XL_Index, XL_DoThi
Storage layer: LT_XML, LT_Text]

3.3.2.1.1. List of classes
3.3.2.1.2. Presentation-layer classes

No. Name: Meaning. Notes.
1. TH_Main: Main screen of the program. Notes: all functions of the program can be run from the main screen.
2. TH_DDTaiLieu: Document formatting. Notes: all documents of the program are stored as XML files; this function converts them into XML documents with a different format or into text documents, matching the document file format of the external IR system.
3. TH_TTTaiLieu: Formatting of attributes for document tags, if any. Notes: when the program's XML document files (F1) are converted into another XML file (F2), the tags of F1 are mapped to the tags of F2; if tags of F2 have attributes that correspond to tags of F1 or to new tags, the program shows the TH_TTTaiLieu screen to define those attributes. This screen is not needed when converting from an XML file (the program's document file) to a text file (a document file suited to the IR system).
4. TH_DDCauHoi: Question formatting. Notes: similar to TH_DDTaiLieu.
5. TH_TTCauHoi: Formatting of attributes for question tags, if any. Notes: similar to TH_TTTaiLieu; lets the user define the attributes.
6. TH_DKThucThi: Screen for entering the conditions required before running the system. Notes: used only when the user has not formatted the documents and questions and wants to run the IR system. The screen asks the user for the storage location of the document set and question set of the IR system.
7. TH_ThucThiHT: Running the external IR system. Notes: this screen invokes the IR system to be run, after the test data (document set and question set) has been converted to suit the IR system.
8. TH_DDKetQua: Result formatting. Notes: once the IR system has run, the format of the result files it returns must be obtained so that the program can evaluate the system from those files.
9. TH_DDIndex: Screen for formatting the index files of the document set and question set. Notes: formatting the index files is optional.
10. TH_Kq_DanhGia: Display of the evaluation results of one IR system. Notes: this screen presents the evaluation from the system point of view only.
11. TH_XemChiTiet: Screen showing the details of one document related to one query. Notes: from the evaluation result screen (TH_Kq_DanhGia), the program calls this screen when the user wants to see in detail how a specific document relates to a given query.
12. TH_DoThi_HeThong: Screen showing the RP curve of a question.
13. TH_SoSanhHT: Comparison of IR systems.

3.3.2.1.3. Processing-layer classes

No. Name: Meaning. Notes.
1. CFormat: Declares the structure of the formats of documents and questions. Notes: the format structure is as follows:
- oldId: position of the tag in the old document
- newId: position of the tag in the new document
- oldTag: name of the old tag (section)
- newTag: name of the new tag (section)
- haveAttr: flag set if the tag has attributes
- attrArray: array storing the attribute values (of type CFormat)
(The fields haveAttr and attrArray are used only for XML files, not for text files.) The format information of the tags and their attributes (for an XML file), or of the sections (for a text file), is recorded through the CFormat class and passed down to the XL_Doc class, which performs the conversion.
2. CDocument: Declares the structure of a document, to hold the content of the tags (or sections) of the document. Notes: the structure of a document is as follows:
- DocID: distinguishes the documents
- Title: subject of the document
- Author: author of the document
- Date: creation date of the document
- News: where the document was taken from
- Content: content of the document
The content of the tags or sections of a document is returned to the program through the CDocument class when needed.
3. XL_Doc: Handles the operations on documents. Notes: through the XL_Doc class the program:
- converts documents into the format of an arbitrary IR system
- gets the list of documents
- gets the information of any document from its identifier (DocID); the information is returned in the structure of the CDocument class
4. CTopic: Declares the structure of a question, to hold the content of the tags (or sections) of the question. Notes: the structure of a question (topic) is:
- TopID: distinguishes the questions
- Title: content of the question
- Description: explanation of the question
- Narrative: the degree of relevance required for a document to be considered related to the question
The content of the tags or sections of a question is returned to the program through the CTopic class when needed.
5. XL_Topic: Handles the operations on questions. Notes: through the XL_Topic class the program:
- converts the question file into the format of an arbitrary IR system
- gets the list of questions
- gets the information of any question from its identifier (TopID); the information is returned in the structure of the CTopic class
6. CHeThongIR: Declares the structure of an IR system, to hold the content of the tags (or sections) of the IR system. Notes: the structure of an IR system is as follows:
- TenHT: name of the system
- ID: identifier distinguishing the systems
- Test date: date and time at which the IR system was tested
- Average R: average recall of the IR system (over all executed questions)
- Average P: average precision of the IR system (over all executed questions)
7. XL_HeThongIR: Handles the operations related to IR systems. Notes: the functions are:
- run the IR system
- get the list of evaluated IR systems
- get the information of any IR system from its identifier (ID); the information is returned in the structure of the CHeThongIR class
8. CKetQua: Declares the format structure of the result files returned by the system. Notes: a result file stores the relation between questions and documents together with its degree of relevance (similarity). The structure of CKetQua is:
- TopicID: name of the tag corresponding to TopicID in the result file returned by the IR system
- DocID: name of the tag corresponding to DocID in the result file
- ThuocTinh_TopicID: name of the attribute of the TopicID tag (if that tag has an attribute)
- ThuocTinh_DocID: name of the attribute of the DocID tag (if that tag has an attribute)
- TagSim: tag corresponding to the tag that defines the similarity
- ThuocTinh_Sim: name of the attribute of the TagSim tag (if that tag has an attribute)
The program uses the CKetQua class to define the format of the result file returned by an arbitrary IR system.
9. CRelevant: Declares the structure describing the relevance of the document at rank n in the list of documents returned for a question. Notes: the structure is as follows:
- DocID: the document at rank n in the sorted list of returned documents
- Relevant: whether the returned document is relevant or not
- Precision: precision at rank n
- Recall: recall at rank n
10. XL_KetQua: Handles the functions related to the results returned by the IR system and the evaluation of the IR system. Notes:
Processing of the results returned by the IR system:
- converts the format of the result file of the external IR system into the format defined by the program, so that the program can interpret and use it
Evaluation of the IR system:
- computes recall and precision for each question, and the precision at the 11 standard recall points
- records the evaluation results
- automatically generates an Id for each tested system
- computes the average recall and average precision of the IR system
- computes the precision at the 11 standard recall points
- returns the evaluation information to the program
11. CIndex: Declares the format structure of the index files returned by the IR system. Notes: the program displays the indexing parameters of the IR system, so that the user can judge the indexing method of the IR system alongside the evaluation results. The structure of CIndex is:
- ID: tag in the index file that gives the identifier of the indexed query or document
- ThuocTinh_ID: attribute of the ID tag, if any (that is, if the identifier of the document or question is stored in an attribute of the ID tag rather than in its content)
- Term: tag that describes a meaningful word of the indexed document or question
- ThuocTinh_Term: attribute of the Term tag (if the word is stored in an attribute of the Term tag)
- Weigh: tag that describes the weight of the term above
- ThuocTinh_Weigh: attribute of the Weigh tag (if the weight is stored in an attribute of the Weigh tag)
The user supplies the tag names used in the index file for each of the CIndex entries defined above, so that the program can perform the mapping and extract the indexing information of the IR system.
12. XL_Index: Handles the functions related to the index files returned by the IR system. Notes: the functions of the XL_Index class are:
- converts the format of the index file of the IR system
- returns to the program the formatted index information for any question (or document)
13. XL_DoThi: Handles everything related to drawing graphs.
14. XL_XML: Creates XML files and converts an arbitrary XML file into an XML file with a different format.
15. XL_Text: Creates text files and converts an arbitrary XML file into a text file with a different format.

3.3.2.1.4. Storage-layer classes

No. Name: Meaning.
1. LT_XML: Reads and writes XML files; works directly on the XML files.
2. LT_Text: Reads and writes text files; works directly on the text files.

3.3.2.2. General architecture diagram for each function of the program
3.3.2.2.1. Formatting documents
Procedure: to format the documents, the following steps are carried out.
Through TH_DDTaiLieu:
- Choose the type of file to which the documents are exported (XML or text).
- For a text file: the program asks the user to map the document tags to the corresponding sections prescribed by the IR system. If no section corresponds to one of the program's document tags, the program lets the user define a new section.
- For an XML file: the program asks the user to map the document tags to the corresponding tags prescribed by the IR system. As above, the user may also define new tags.
Through TH_TTTaiLieu (used only when the exported file is XML):
- If the corresponding tags of the IR system can have attributes, these attributes are mapped to the program's document tags or defined as new ones.

Diagram: [TH_DDTaiLieu, TH_TTTaiLieu; CFormat, XL_Doc, XL_XML, XL_Text, CDocument; LT_XML, LT_Text]
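The tag-to-section mapping described in this procedure can be illustrated with a small Java sketch for the text export case. It is only an illustration under stated assumptions: `TagMapping` keeps just the oldTag, newTag and newId fields of CFormat, a document is modeled as a map from tag name to content, and the section markers used in the usage note (`.I`, `.T`, `.W`) are hypothetical examples, not the format of any particular IR system.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for CFormat: one program tag mapped to one target section. */
final class TagMapping {
    final String oldTag; // tag name in the program's document
    final String newTag; // section name expected by the IR system
    final int newId;     // position of the section in the exported document

    TagMapping(String oldTag, String newTag, int newId) {
        this.oldTag = oldTag;
        this.newTag = newTag;
        this.newId = newId;
    }
}

final class TextFormatter {
    /**
     * Writes one document as text sections ordered by newId.
     * Each section is its marker on one line followed by the content on the next;
     * tags that have no value in the document are skipped.
     */
    static String toText(Map<String, String> document, List<TagMapping> mappings) {
        List<TagMapping> ordered = new ArrayList<>(mappings);
        ordered.sort(Comparator.comparingInt((TagMapping m) -> m.newId));
        StringBuilder out = new StringBuilder();
        for (TagMapping m : ordered) {
            String value = document.get(m.oldTag);
            if (value == null) {
                continue;
            }
            out.append(m.newTag).append('\n').append(value).append('\n');
        }
        return out.toString();
    }
}
```

For example, mapping DOCID to `.I` (position 1), TITLE to `.T` (position 2) and CONTENT to `.W` (position 3) exports each document as three sections in that order, whatever the order of the mapping list.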

3.3.2.2.2. Formatting questions
Procedure: to format the questions, the following steps are carried out.
Through TH_DDCauHoi:
- Choose the type of file to which the questions are exported (XML or text).
- For a text file: the program asks the user to map the question tags to the corresponding sections prescribed by the IR system. If no section corresponds to one of the program's question tags, the program lets the user define a new section.
- For an XML file: the program asks the user to map the question tags to the corresponding tags prescribed by the IR system. As above, the user may also define new tags.
Through TH_TTCauHoi (used only when the exported file is XML):
- If the corresponding tags of the IR system can have attributes, these attributes are mapped to the program's question tags or defined as new ones.

Diagram: [TH_DDCauHoi, TH_TTCauHoi; CFormat, XL_Topic, XL_XML, XL_Text, CTopic; LT_XML, LT_Text]

(Note: the new sections or tags may appear in a different order from the document tags or question tags.)

3.3.2.2.3. Running the system
Procedure: the program calls the external IR system, which performs the search on the formatted data store.

Diagram: [TH_ThucThiHT; XL_HeThongIR]

3.3.2.2.4. Formatting results
Procedure: after the IR system has finished searching the data store supplied by the program, the program asks for the format of the IR system's result file. Formatting the result file involves the following steps:
- The program accepts the result file of the IR system only in XML form.
- The user maps the tags of the IR system's result file to the corresponding tags of the program's convention. If a result value is stored in a tag attribute, the user must declare both the tag and its attribute.
- The program reads the result values using the format information supplied by the user and writes them to a file in the structure defined by the program.

Diagram: [TH_DDKetQua; CKetQua; XL_KetQua, XL_XML; LT_XML]

3.3.2.2.5. Formatting the index file
Procedure: the user may declare the tags of the index file (this is optional; without this declaration the program cannot show the indexing information to the user). Formatting the index file involves the following steps:
- The program accepts the index file of the IR system only in XML form.
- The user maps the tags of the IR system's index file to the corresponding tags of the program's convention. If an index value is stored in a tag attribute, the user must declare both the tag and its attribute.
- The program reads the index values using the format information supplied by the user and writes them to a file in the structure defined by the program.

Diagram: [TH_DDIndex; CIndex; XL_Index, XL_XML; LT_XML]

3.3.2.2.6. Evaluating and displaying the evaluation results
Procedure:
- Choose the system whose evaluation results are to be viewed.
- The program displays the evaluation results of the system.
- The program lets the user view detailed information about each question and its related documents (content, indexing information).

Diagram: [TH_ChiTietCauHoi, TH_KqDanhGia, TH_ChiTietTaiLieu; XL_Topic, XL_HeThongIR, XL_KetQua, XL_Topic, XL_XML; LT_XML]

3.3.2.2.7. Comparing the executed IR systems
Procedure: the IR systems that have been run and evaluated are recorded, and the program lets the user choose among them for comparison.
- Choose the systems to compare.
- Run the comparison.

Diagram: [TH_KqDanhGia; XL_HeThongIR; XL_KetQua; XL_XML; LT_XML]

General procedure for evaluating one IR system:
- Format the test data, consisting of the document set and the question set.
- Run the IR system on the formatted data: the IR system runs the questions of the question set, searches the document set, and must then return to the program a result file describing which documents relate to which questions, together with their similarity.
- Format the result file returned by the IR system so that the program can interpret the results. The result information consists of the identifiers of the questions, the identifiers of the documents related to each question, and their similarity.
- Format the index file, if the index information of the document set and question set is recorded in an XML file.
- View the evaluation information of the IR system. It consists of: the executed question, the set of documents returned by the IR system, the set of documents that are relevant in theory (from the relevance table built beforehand by the program), the precision and recall of the system for that question, the precision at the 11 standard recall points, and the RP curve.
After several systems have been evaluated, the program also lets the user compare them on the basis of the standard RP curve.

3.3.2.3. Data design and storage organization
3.3.2.3.1. Data model
a) ER model:

b) Explanation:
The program keeps the evaluation values of each IR system that has been tested. An IR system is an entity (System).
After searching the data set supplied by the program (document set and question set), each system reports to the program which documents relate to which questions and to what degree. The program records this information in what is called the actual relevance table (relevant_TT). To carry out the test, the program has a relevance table for the data set available in advance, called the theoretical relevance table (relevant_LT).
After running the question set on the test document set, each system has an evaluation table per question (evaluation). The evaluation table contains: the executed questions; the number of truly relevant documents returned (the intersection of the documents in the theoretical relevance table and in the actual relevance table of the IR system); the number of documents relevant in theory; the number of documents returned by the IR system; the recall and precision of the system for the question; and the precision at the 11 standard recall points. The program computes these values from the actual relevance table of the given IR system and the theoretical relevance table.
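The per-question values of the evaluation table follow from the two relevance tables by a set intersection. The Java sketch below assumes the usual definitions, recall = Ret_Rel / Rel and precision = Ret_Rel / Ret, and represents each table row as a set of document identifiers; the class name is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class TopicEvaluation {
    final int retRel;       // relevant documents actually returned
    final int ret;          // documents returned by the IR system
    final int rel;          // documents relevant in theory
    final double recall;    // retRel / rel, or 0 when rel is 0
    final double precision; // retRel / ret, or 0 when ret is 0

    /**
     * relevantLT: documents judged relevant for the question (relevant_LT)
     * returnedTT: documents returned by the IR system for it (relevant_TT)
     */
    TopicEvaluation(Set<String> relevantLT, Set<String> returnedTT) {
        Set<String> both = new HashSet<>(relevantLT);
        both.retainAll(returnedTT); // intersection of the two tables
        this.retRel = both.size();
        this.ret = returnedTT.size();
        this.rel = relevantLT.size();
        this.recall = rel == 0 ? 0.0 : (double) retRel / rel;
        this.precision = ret == 0 ? 0.0 : (double) retRel / ret;
    }
}
```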


Every question and every document is indexed by some method of a specific IR system (Index_Topic, Index_Doc). This index information is reported to the program so that it can be displayed, allowing the user to judge whether the indexing method is really good.

3.3.2.3.2. Logical data diagram
Logical diagram:

Explanation:
System: sysID, Name, Date, AvgRecall, AvgPrecision, R00, R01, R02, R03, R04, R05, R06, R07, R08, R09, R10
- Each system is distinguished from the others by its sysID. The system-level evaluation information that is stored consists of the name, the test date, the average recall, the average precision, and the average precision at the 11 standard recall points (R00 to R10).
Terms: termID, Term
- The list of meaningful words.
Topic: TopID, Title, Des, Narr
- The questions (topics) are stored and distinguished by TopID. The information the program uses about a question is its content (title). The additional information, the description and the relevance constraint (narrative), is meaningful only to the user, for deciding when a document counts as relevant to the question; it has no meaning for the program.
Index_topic: topID, sysID, size
- Each question is indexed by the indexing method of a specific IR system (identified by sysID). The index information of a question belongs to that question only and consists of the size of the index (the total number of indexed words of the question), the indexed words, and the weight of each word.
Topic_term: termID, topID, sysID, weigh
- Each index_topic is built from the table of meaningful words (term), with a weight (weigh) assigned to each word.
Document: DocID, Title, Author, Date, News, Content
- The documents are stored and distinguished by DocID. The information the program uses about a document is its content (content), subject (title), author (author), creation date (date), and source (news).
Index_Doc: docID, sysID, size
- Each document is indexed by the indexing method of a specific IR system (identified by sysID). The index information of a document belongs to that document only and consists of the size of the index (the total number of indexed words of the document), the indexed words, and the weight of each word.
Doc_term: termID, docID, sysID, weigh
- Each index_doc is built from the table of meaningful words (term), with a weight (weigh) assigned to each word.
relevant_TT: topID, DocID, sysID, similarity
- The relation between a question and a document as returned by a specific external IR system (with a given sysID); the program records it for the evaluation.
relevant_LT: topID, DocID
- The relevance of a document to a question in theory. This theoretical relevance is produced externally by testing several systems.
evaluation: sysID, TopID, Ret_Rel, Ret, Rel, R, P, R00, R01, R02, R03, R04, R05, R06, R07, R08, R09, R10
- Each topic executed on a system (with a given sysId) is evaluated by the program, which stores: the number of relevant documents returned (RET_REL), the number of documents relevant in theory (REL), the number of documents actually returned (RET), the recall, the precision, and the precision at the 11 standard recall points (R00 to R10).


3.3.2.4. Data storage organization
All data is stored in XML files.

3.3.2.4.1. System
- After an IR system has been tested, it is recorded in the file system.xml.
- In addition, to supply the table of values used to draw the RP (recall-precision) curve, the program computes the precision at the standard recall points. All of these values are stored.

The DTD of system.xml is as follows:
<!ELEMENT COMPARE (SYSTEM*)>
<!ELEMENT SYSTEM (NAME, DATE, AVGRECALL, AVGPRECISION, R00, R01, R02, R03, R04, R05, R06, R07, R08, R09, R10)>
<!ATTLIST SYSTEM SYSID CDATA #REQUIRED>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT AVGRECALL (#PCDATA)>
<!ELEMENT AVGPRECISION (#PCDATA)>
<!ELEMENT R00 (#PCDATA)>
<!ELEMENT R01 (#PCDATA)>
<!ELEMENT R02 (#PCDATA)>
<!ELEMENT R03 (#PCDATA)>
<!ELEMENT R04 (#PCDATA)>
<!ELEMENT R05 (#PCDATA)>
<!ELEMENT R06 (#PCDATA)>
<!ELEMENT R07 (#PCDATA)>
<!ELEMENT R08 (#PCDATA)>
<!ELEMENT R09 (#PCDATA)>
<!ELEMENT R10 (#PCDATA)>


<COMPARE>
  <SYSTEM SYSID="">
    <NAME> </NAME>
    <DATE> </DATE>
    <AVGRECALL> </AVGRECALL>
    <AVGPRECISION> </AVGPRECISION>
    <R00> </R00> <R01> </R01> <R02> </R02> <R03> </R03>
    <R04> </R04> <R05> </R05> <R06> </R06> <R07> </R07>
    <R08> </R08> <R09> </R09> <R10> </R10>
  </SYSTEM>
</COMPARE>

Explanation:
<SYSTEM SYSID="">: each system is assigned a unique identifier, sysID
<NAME>: system name
<DATE>: date and time at which the system was tested
<AVGRECALL>: average recall
<AVGPRECISION>: average precision
<Ri>: precision at standard recall point i
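As an illustration of how a file with this layout could be read back, for example to draw the RP curve, the following Java sketch uses the standard DOM parser. The class SystemXmlReader and its method names are hypothetical, and the XML is passed as a string only to keep the sketch self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the SYSTEM records of system.xml. */
final class SystemXmlReader {
    /** Text of the first descendant element named tag, or "" if there is none. */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Precision values R00..R10 of the system with the given SYSID, or null if absent. */
    static double[] precisionCurve(String xml, String sysId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList systems = doc.getElementsByTagName("SYSTEM");
            for (int i = 0; i < systems.getLength(); i++) {
                Element system = (Element) systems.item(i);
                if (!sysId.equals(system.getAttribute("SYSID"))) continue;
                double[] curve = new double[11];
                for (int k = 0; k <= 10; k++) {
                    // Element names follow the DTD: R00, R01, ..., R10.
                    curve[k] = Double.parseDouble(childText(system, String.format("R%02d", k)));
                }
                return curve;
            }
            return null;
        } catch (Exception e) {
            // Malformed XML or a missing/non-numeric Ri value ends up here.
            throw new IllegalArgumentException("cannot read system.xml content", e);
        }
    }
}
```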


3.3.2.4.2. Topic
The topics used to test an IR system are stored in files with the following structure.

DTD:
<!ELEMENT TOPIC (TOP*)>
<!ELEMENT TOP (TOPID, TITLE, DES, NARR)>
<!ELEMENT TOPID (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT DES (#PCDATA)>
<!ELEMENT NARR (#PCDATA)>

<TOPIC>
  <TOP>
    <TOPID> </TOPID>
    <TITLE> </TITLE>
    <DES> </DES>
    <NARR> </NARR>
  </TOP>
</TOPIC>

Explanation:
<TOPID>: identifier of the topic
<TITLE>: content of the topic
<DES>: description of the topic
<NARR>: requirements on the relevance of documents to the topic
(DES and NARR are meaningful only for studying the relevance of documents to the topic.)


3.3.2.4.3. Index_topic
Each topic is indexed with the indexing method of the external IR system. The program stores the topic index information of one system as a single XML file: when the IR system with sysID1 is tested, its index information for the program's topic set is saved in one XML file, and the index information of another system, sysID2, is saved in a different XML file. To make the files easy to read back, the index file is named by the following rule:
Topic index file name of the IR system with sysID1 = "idx_topic_" + sysID1 + ".xml"

The DTD of the file is as follows:
<!ELEMENT MATRIX (INDEX*)>
<!ATTLIST MATRIX SIZE CDATA #REQUIRED>
<!ELEMENT INDEX (TERM*)>
<!ATTLIST INDEX ID CDATA #REQUIRED SIZE CDATA #REQUIRED>
<!ELEMENT TERM (#PCDATA)>
<!ATTLIST TERM WORD CDATA #REQUIRED WEIGH CDATA #REQUIRED>

The structure of the file is as follows:
<MATRIX SIZE="">
  <INDEX ID="" SIZE="">
    <TERM WORD="" WEIGH=""> </TERM>
  </INDEX>
</MATRIX>

Explanation:
<MATRIX SIZE="">: SIZE is the total number of indexed topics


<INDEX ID="" SIZE="">: ID is the topicID of a topic; SIZE is the total number of indexed terms in that topic
<TERM WORD="" WEIGH="">: WORD is an indexed term of the topic topicID; WEIGH is its weight

3.3.2.4.4. Document
The documents used to test an IR system are stored in files with the following structure.

DTD:
<!ELEMENT DOCUMENT (DOC*)>
<!ELEMENT DOC (DOCID, TITLE, AUTHOR, DATE, NEWS, CONTENT)>
<!ELEMENT DOCID (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT NEWS (#PCDATA)>
<!ELEMENT CONTENT (#PCDATA)>

<DOCUMENT>
  <DOC>
    <DOCID> </DOCID>
    <TITLE> </TITLE>
    <AUTHOR> </AUTHOR>
    <DATE> </DATE>
    <NEWS> </NEWS>
    <CONTENT> </CONTENT>
  </DOC>
</DOCUMENT>


Explanation:
<DOCID>: identifier of the document
<TITLE>: subject of the document
<AUTHOR>: author of the document
<DATE>: creation date of the document
<NEWS>: source of the document
<CONTENT>: content of the document

3.3.2.4.5. Index_Doc
As with Index_Topic, the document index information of a specific system sysID is stored in a separate XML file. The structure and meaning of this file are the same as those of the topic index file; only the naming rule differs:
File name = "idx_doc_" + sysID + ".xml"

3.3.2.4.6. relevant_TT
After searching the program's data store, each system returns a file that represents the relationship between the topics and the documents. The program records the relevance information of a specific IR system (identified by sysID) in one XML file: the actual topic-document relevance produced by IR system sysID1 is stored in one file, and that of another system, sysID2, in a different file. The program therefore names the file as follows:
File name = "rel_" + sysID + ".xml"

The file is organized as follows:
<!ELEMENT RELEVANT (REL*)>
<!ELEMENT REL (DOCID*)>
<!ATTLIST REL TOPID CDATA #REQUIRED>
<!ELEMENT DOCID (#PCDATA)>
<!ATTLIST DOCID SIMILARITY CDATA #IMPLIED>

<RELEVANT>
  <REL TOPID="">
    <DOCID SIMILARITY=""> </DOCID>
  </REL>
</RELEVANT>

Explanation:
TOPID: identifier of the topic
<DOCID>: identifier of a document related to the topic TOPID
SIMILARITY: similarity between the document DOCID and the topic TOPID

3.3.2.4.7. relevant_LT
The program comes with a ready-made table of relevance between topics and documents (the theoretical relevance table). The file storing this information has the following structure:
<!ELEMENT RELEVANT (REL*)>
<!ELEMENT REL (DOCID*)>
<!ATTLIST REL TOPID CDATA #REQUIRED>
<!ELEMENT DOCID (#PCDATA)>

<RELEVANT>
  <REL TOPID="">
    <DOCID> </DOCID>
  </REL>
</RELEVANT>

Explanation: as above.
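The naming rules given above for the per-system files (the prefixes idx_topic_, idx_doc_ and rel_, followed by the sysID and the .xml extension) can be captured in a few helper methods. This is a minimal sketch; the class ResultFileNames and its method names are hypothetical.

```java
/** Hypothetical helpers that build the per-system file names from a sysID. */
final class ResultFileNames {
    /** Topic index file of one IR system: idx_topic_ + sysID + .xml */
    static String topicIndexFile(String sysId) {
        return "idx_topic_" + sysId + ".xml";
    }

    /** Document index file of one IR system: idx_doc_ + sysID + .xml */
    static String docIndexFile(String sysId) {
        return "idx_doc_" + sysId + ".xml";
    }

    /** Actual-relevance file (relevant_TT) of one IR system: rel_ + sysID + .xml */
    static String relevanceFile(String sysId) {
        return "rel_" + sysId + ".xml";
    }
}
```

Keeping the rule in one place means the code that writes a file and the code that later reads it cannot drift apart.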


3.3.2.4.8. evaluation
For each topic executed by any IR system, the program computes: the number of relevant documents, the number of documents returned, the number of relevant documents returned, the recall and the precision. This information is stored in the file evaluation.xml, whose structure is as follows.

DTD:
<!ELEMENT EVALUATION (SYSTEM*)>
<!ELEMENT SYSTEM (EVAL*)>
<!ATTLIST SYSTEM SYSID CDATA #REQUIRED>
<!ELEMENT EVAL (RETREL, RET, REL, RECALL, PRECISION, R00, R01, R02, R03, R04, R05, R06, R07, R08, R09, R10)>
<!ATTLIST EVAL TOPID CDATA #REQUIRED>
<!ELEMENT RETREL (#PCDATA)>
<!ELEMENT RET (#PCDATA)>
<!ELEMENT REL (#PCDATA)>
<!ELEMENT RECALL (#PCDATA)>
<!ELEMENT PRECISION (#PCDATA)>
<!ELEMENT R00 (#PCDATA)>
<!ELEMENT R01 (#PCDATA)>
<!ELEMENT R02 (#PCDATA)>
<!ELEMENT R03 (#PCDATA)>
<!ELEMENT R04 (#PCDATA)>
<!ELEMENT R05 (#PCDATA)>
<!ELEMENT R06 (#PCDATA)>
<!ELEMENT R07 (#PCDATA)>
<!ELEMENT R08 (#PCDATA)>
<!ELEMENT R09 (#PCDATA)>
<!ELEMENT R10 (#PCDATA)>


<EVALUATION>
  <SYSTEM SYSID="">
    <EVAL TOPID="">
      <RETREL> </RETREL>
      <RET> </RET>
      <REL> </REL>
      <RECALL> </RECALL>
      <PRECISION> </PRECISION>
      <R00> </R00> <R01> </R01> <R02> </R02> <R03> </R03>
      <R04> </R04> <R05> </R05> <R06> </R06> <R07> </R07>
      <R08> </R08> <R09> </R09> <R10> </R10>
    </EVAL>
  </SYSTEM>
</EVALUATION>

Explanation:
<SYSTEM SYSID="">: identifier of the IR system
<EVAL TOPID="">: identifier of the topic executed by the IR system
<RETREL>: number of relevant documents returned (retrieved relevant)
<RET>: number of documents returned (retrieved)
<REL>: number of relevant documents (relevant)
<RECALL>: recall for topic TopID
<PRECISION>: precision for topic TopID
<Ri>: precision for topic TopID at the 11 standard recall points

3.3.2.5. Interface design
3.3.2.5.1. Relationships between the screens
Workflow 1: from the main screen choose:
Step 1: Format the documents
Step 2: Format the topics
Step 3: Run the IR system
Step 4: Format the results returned by the system
Step 5: Format the index files
Step 6: View the evaluation results
Step 7: View the system graph
Step 8: View details
If the program's data store already has the structure expected by the IR system, the data formatting steps (steps 1 and 2) can be skipped. The IR system may also be run outside the program; step 3 is then skipped and the user only tells the program which result files and index files to use for the evaluation.
[Figure: flow between the screens fraDDTaiLieu, fraTTTaiLieu, fraDDCauHoi, fraTTCauHoi, fraThucThiHT, fraDDKetQua, fraDDIndex, fraKqDanhGia, fraXemChiTiet, fraDoThi_HeThong and fraSoSanhHT]


Workflow 2: from the main screen choose:
- Run the IR system:
  Open the screen that tells the program where the required data stores are located => fraDKThucThi
  Open the screen that runs the IR system => fraThucThiHT
  Process the returned results
  Format the index files (optional)
  View the evaluation results
  View details
  View the system graph
[Figure: flow between the screens fraDKThucThi, fraThucThiHT, fraDDKetQua, fraDDIndex, fraKqDanhGia, fraXemChiTiet and fraDoThi_HeThong]

Workflow 3: from the main screen choose:
- Process the returned results:
  Open the result processing screen, which asks for the name of the system
  Format the index files (optional)
  View the evaluation results
  View details
  View the system graph
[Figure: flow from entering the system name through the screens fraDDKetQua, fraDDIndex, fraKqDanhGia, fraXemChiTiet and fraDoThi_HeThong]

Workflow 4: from the main screen choose:
- View the evaluation results:
  Open the evaluation results screen, which asks for the system to view
  View details
  View the system graph
[Figure: flow from choosing the system to view through the screens fraKqDanhGia, fraXemChiTiet and fraDoThi_HeThong]

Workflow 5: from the main screen choose:
- Compare several systems => open the screen fraSoSanhHT


3.3.2.6. Screen design
3.3.2.6.1. Main screen (TH_Main)
Identifier: fraMain
Links the functions of the program, which are:
- Convert the format of the data store (to match the data format of the IR system under test). The data store comprises the document set and the topic set.
- Run the external IR system.
- Convert the result files and index files of the IR system into result and index files for the program.
- Evaluate the IR system.
- Compare the IR systems that have been evaluated.

3.3.2.6.2. Document formatting screen (TH_DDTaiLieu)
Identifier: fraDDTaiLieu
[Figure: screen layout with controls numbered 1 to 8]

Explanation:
No. | Name | Type | Meaning
1 | txtViTri | Textbox | Lets the user enter the location where the new document set is saved after formatting
2 | bntChonViTri | Button | Helps the user browse for a suitable location for the newly formatted document set
3 | cboLoaiFile | ComboBox | Selects the type of file to produce: XML or text
4 | tblTaiLieu | Table | Lets the user map the program's document tags to the document tags of the IR system, and define new tags when a tag has no counterpart among the program's document tags
5 | btnThem | Button | Adds a row to tblTaiLieu
6 | btnXoa | Button | Deletes a row from tblTaiLieu
7 | btnTiepTuc | Button | For a text file, which has no notion of attributes, the program creates the document files in the format supplied by the user and continues with the formatting of the topic set. For an XML file without attributes, the same applies. For an XML file with attributes, the user next defines the tags that correspond to the attributes of the document tags in the IR system's document set
8 | btnHuyBo | Button | Cancels the document formatting operation

3.3.2.6.3. Document attribute creation screen (TH_TTTaiLieu)
Identifier: fraTTTaiLieu
[Figure: screen "Create attributes for document tags", containing the table "Define the attributes of document tags" (columns: IR system tag, attribute name, corresponding tag) and the buttons Add, Delete, Back, Continue and Close]

Explanation:
No. | Name | Type | Meaning
1 | tblThuocTinh | Table | Lets the user define the attributes of the new tags defined on screen fraDDTaiLieu; an attribute of a new tag may correspond to one of the program's document tags or be entirely new
2 | btnThem | Button | Adds a row to tblThuocTinh
3 | btnXoa | Button | Deletes a row from tblThuocTinh
4 | btnTroLai | Button | Returns to screen fraDDTaiLieu
5 | btnTiepTuc | Button | Creates the document files in the format supplied by the user and continues with the formatting of the topic set
6 | btnDong | Button | Closes the screen

3.3.2.6.4. Topic formatting screen (TH_DDCauHoi)
Identifier: fraDDCauHoi
[Figure: screen layout with controls numbered 1 to 8]

Explanation:
No. | Name | Type | Meaning
1 | txtViTri | Textbox | Lets the user enter the location where the topic set is saved after formatting
2 | bntChonViTri | Button | Helps the user browse for a suitable location for the newly formatted topic set
3 | cboLoaiFile | ComboBox | Selects the type of file to produce: XML or text
4 | tblCauHoi | Table | Lets the user map the program's topic tags to the topic tags of the IR system, and define new tags when a tag has no counterpart among the program's topic tags
5 | btnThem | Button | Adds a row to tblCauHoi
6 | btnXoa | Button | Deletes a row from tblCauHoi
7 | btnTiepTuc | Button | For a text file, which has no notion of attributes, the program creates the topic files in the format supplied by the user and continues with the execution of the IR system. For an XML file without attributes, the same applies. For an XML file with attributes, the user next defines the tags that correspond to the attributes of the topic tags in the IR system's topic set
8 | btnHuyBo | Button | Cancels the topic formatting operation


3.3.2.6.5. Topic attribute creation screen (TH_TTCauHoi)
Identifier: fraTTCauHoi
[Figure: screen "Create attributes for topic tags", containing the table "Define the attributes of topic tags" (columns: IR system tag, attribute name, corresponding tag) and the buttons Add, Delete, Back, Continue and Cancel]

Explanation:
No. | Name | Type | Meaning
1 | tblThuocTinh | Table | Lets the user define the attributes of the new tags defined on screen fraDDCauHoi; an attribute of a new tag may correspond to one of the program's topic tags or be entirely new
2 | btnThem | Button | Adds a row to tblThuocTinh
3 | btnXoa | Button | Deletes a row from tblThuocTinh
4 | btnTroLai | Button | Returns to screen fraDDCauHoi
5 | btnTiepTuc | Button | Creates the topic files in the format supplied by the user and continues with the execution of the IR system
6 | btnDong | Button | Closes the screen

3.3.2.6.6. IR system execution conditions screen
Identifier: fraDKThucThi
Note: every IR system under test must be run on the program's test data store. If the user skips the data store formatting step when running a system (because the program's data store has the same structure as the data store of the IR system), the user must tell the program where its data store should be placed so that the IR system can run on it.
[Figure: screen layout with controls numbered 1 to 6]

Explanation:
No. | Name | Type | Meaning
1 | txtTaiLieu | Textbox | Lets the user enter the location of the document files
2 | btnVTTaiLieu | Button | Helps the user open the folder that stores the document files
3 | txtCauHoi | Textbox | Lets the user enter the location of the topic files
4 | btnVTCauHoi | Button | Helps the user open the folder that stores the topic files
5 | btnGhiNhan | Button | Records the locations of the documents and topics, copies all of the program's documents and topics to the new location, and opens the IR system execution screen
6 | btnHuyBo | Button | Cancels the execution of the IR system

3.3.2.6.7. System execution screen (TH_ThucThiHT)
Identifier: fraThucThiHT
[Figure: screen layout with numbered controls]

Explanation:
No. | Name | Type | Meaning
1 | txtDuongDan | Textbox | Lets the user enter the location of the executable file of the external IR system
2 | btnMoFile | Button | Helps the user open the executable file
3 | txtTenHT | Textbox | Lets the user enter the system name. By default the program uses the name of the IR system's executable file as the system name
4 | btnThucHien | Button | Runs the IR system
5 | btnXLyKQ | Button | Opens the screen that formats and processes the results returned by the IR system
6 | btnHuyBo | Button | Cancels the execution of the IR system


3.3.2.6.8. Result formatting screen (TH_DDKetQua)
Identifier: fraDDKetQua
[Figure: screen layout with controls numbered 1 to 6]

Explanation:
No. | Name | Type | Meaning
1 | txtTenHT | Textbox | Lets the user enter the system name if it was not entered on the system execution screen (that is, if the IR system was run outside the program); if the IR system was run from the program, txtTenHT displays the system name and it need not be entered again
2 | txtViTri | Textbox | Lets the user enter the location of the result file of the external IR system
3 | bntChonViTri | Button | Helps the user choose the result file
4 | tblKetQua | Table | Records the corresponding tags. For example, if TopicID is a tag in the result file, no attribute needs to be defined, only the tag that corresponds to TopicID; if TopicID is an attribute, both the tag that contains the attribute and the attribute corresponding to TopicID must be defined
5 | btnThucHien | Button | Writes the result file for the program
6 | btnHuyBo | Button | Cancels the result file formatting operation

3.3.2.6.9. Index information formatting screen (TH_DDIndex)
Identifier: fraDDIndex
[Figure: tab for formatting the topic index file, controls numbered 1 to 6]
[Figure: tab for formatting the document index file, controls numbered 7 to 10]

Explanation:
No. | Name | Type | Meaning
1 | tabCauHoi | Tab control | Defines the topic index file
2 | txtViTri_idxTopic | Textbox | Lets the user enter the location of the topic index file of the external IR system
3 | bntChonViTri_idxTopic | Button | Helps the user choose the topic index file
4 | tblIdx_Topic | Table | Records the corresponding tags. For example, if TopicID is a tag in the result file, no attribute needs to be defined, only the tag that corresponds to TopicID; if TopicID is an attribute, both the tag that contains the attribute and the attribute corresponding to TopicID must be defined
5 | btnThucHien | Button | Writes the topic and document index files for the program
6 | btnHuyBo | Button | Cancels the index file formatting operation
7 | tabTaiLieu | Tab control | Defines the document index file
8 | txtViTri_idxDoc | Textbox | Lets the user enter the location of the document index file of the external IR system
9 | bntChonViTri_idxDoc | Button | Helps the user choose the document index file
10 | tblIdx_Doc | Table | Records the corresponding tags

3.3.2.6.10. System evaluation screen (TH_KqDanhGia)
Identifier: fraKq_DanhGia
[Figure: screen layout with controls numbered 1 to 17]

Explanation:
No. | Name | Type | Meaning
1 | cboHeThong (txtHeThong) | ComboBox (Textbox) | Lets the user choose the system whose evaluation results are to be viewed when this screen is opened from the main screen; when it is opened from the result processing screen, txtHeThong shows the name of the system that has just been run
2 | txtsysID | Textbox | After a system is chosen, the program automatically displays its system ID
3 | txtNgay | Textbox | The program automatically displays the date and time at which the IR system was tested
4 | txtRtb | Textbox | Shows the average recall of the system over the topic set
5 | txtPtb | Textbox | Shows the average precision of the IR system
6 | lblTongSoCauHoi | Label | Shows the total number of topics executed by the IR system during the test
7 | lstCauHoi | Listbox | Shows the topics executed by the IR system
8 | txtNoiDungCH | Textbox | Content of the topic currently selected in lstCauHoi
9 | lblSoTLTraVe | Label | Number of documents returned for a topic
10 | tblCauHoi | Table | Topic information: rank n, the document returned, whether it is theoretically relevant, and the recall and precision at position n
11 | tblRPChuanHoa | Table | Normalized RP table (precision at the 11 standard recall points for a topic)
12 | txtR | Textbox | Recall for a topic
13 | txtP | Textbox | Precision for a topic
14 | lblSoTLLienQuan | Label | Number of relevant documents for a topic
15 | lblSoTLRel_Ret | Label | Number of relevant documents returned for a topic
16 | btnVeDoThi | Button | Draws the graph of the system over the topic set
17 | btnDong | Button | Closes the screen

3.3.2.6.11. System graph screen
Identifier: fraDoThi_HeThong
This screen draws the graph of the system evaluated on the test topic set.

3.3.2.6.12. Detail screen (TH_XemChiTiet)
Identifier: fraXemChiTiet
This screen is opened when the user wants to see detailed information about the relevance of a topic to a specific document. On the evaluation results screen (fraKq_DanhGia), the user clicks the DocID of the document of interest (on fraKq_DanhGia, tblTaiLieuPhucHoi shows only the DocIDs of the returned documents) and the program displays this detail screen. The details comprise: the content of the topic, the content of the document related to the topic, the similarity between the topic and the document, and the index information of the document and of the topic.
[Figure: screen "Document details" showing the similarity, the topic with its index table (term, weight), the document with its DocID and index table (term, weight), and a Close button; controls numbered 1 to 8]

Explanation:
No. | Name | Type | Meaning
1 | lblSim | Label | Similarity between the topic and the document
2 | lblTopID | Label | Identifier (topicID) of the topic
3 | txtCauHoi | Textbox | Shows the content of the topic
4 | tblIndex_CauHoi | Table | Shows the index content of the topic
5 | lblDocID | Label | Identifier of the document
6 | txtTaiLieu | Textbox | Content of the document
7 | tblIndex_TaiLieu | Table | Index content of the document
8 | btnDong | Button | Closes the screen


3.3.2.6.13. System comparison screen (TH_SoSanhHT)
Identifier: fraSoSanhHT
[Figure: screen layout with numbered controls]

Explanation:
No. | Name | Type | Meaning
1 | lstHT | ListBox | List of IR systems
2 | btnChon | Button | Selects a system from the list of systems
3 | btnBo | Button | Removes a system from the list of selected systems
4 | lstHTChon | ListBox | List of selected systems
5 | (graph area) | | Displays the graphs of the selected systems
6 | btnDong | Button | Closes the screen


3.3.2.7. Class design
3.3.2.7.1. Processing classes
a) CFormat:
Meaning: defines the structure of the parameters needed to format documents and topics.
Description: structure of the formatting parameters:
No. | Name | Type | Meaning
1 | oldTag | String | Old tag to be converted
2 | newTag | String | New tag (the old tag is converted to the new tag)
3 | Content | String | Content of the old tag
4 | oldId | String | Position of the old tag in the old XML file
5 | newId | String | Position of the new tag in the new XML file
6 | haveAttr | Boolean | Whether the new tag has attributes. If it does, attrArray stores the list of attributes. An attribute has the same structure as a tag: the attribute name (stored in newTag) and the tag that supplies its value (stored in oldTag)
7 | attrArray | Array of CFormat | List of attributes of the new tag
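A minimal Java sketch of this structure is shown below. The field names follow the table; declaring oldId and newId as int is an assumption based on how the thesis's own example assigns numeric values to them, and the helper sortByNewId is hypothetical, illustrating the "sort by ascending newId" step used by the conversion algorithm.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch of the CFormat parameter structure (field names as in the table). */
final class CFormat {
    String oldTag = "";      // tag in the old file; "" means the new tag has no source
    String newTag = "";      // tag (or attribute name) in the new file
    String Content = "";     // content read from the old tag
    int oldId;               // position of the old tag in the old XML file (assumed int)
    int newId;               // position of the new tag in the new XML file (assumed int)
    boolean haveAttr;        // true if the new tag carries attributes
    CFormat[] attrArray = new CFormat[0]; // attributes, each described by a CFormat

    /** Hypothetical helper: orders a format array by ascending newId. */
    static void sortByNewId(CFormat[] format) {
        Arrays.sort(format, Comparator.comparingInt((CFormat f) -> f.newId));
    }
}
```

Reusing CFormat for attributes keeps a single mapping type: an attribute entry only fills newTag (the attribute name) and oldTag (the tag whose content becomes the attribute value).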

b) CDocument:
Meaning: defines the structure of a document, used to store the content of the tags (or sections) of the document.
Description:
No. | Name | Type | Meaning
1 | DocID | String | docID of the document
2 | Title | String | Title of the document
3 | Content | String | Content of the document
4 | Date | String | Creation date of the document
5 | Author | String | Author of the document
6 | News | String | Source of the document

c) CTopic:
Meaning: defines the structure of a topic, used to store the content of the tags (or sections) of the topic.
Description:
No. | Name | Type | Meaning
1 | TopID | String | Identifier of the topic
2 | Title | String | Content of the topic
3 | Description | String | Description of the topic
4 | Narrative | String | Relevance requirements of the topic

d) CHeThongIR:
Meaning: defines the structure of an IR system.
Description:
No. | Name | Type | Meaning
1 | strTenHT | String | Name of the IR system
2 | strID | String | ID of the IR system
3 | NgayKiemTra | String | Date on which the IR system was tested
4 | Rtrungbinh | String | Average recall
5 | Ptrungbinh | String | Average precision
6 | fP | Array of floats | Average precision at the 11 standard recall points
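The text does not state here how the system-level averages are derived from the per-topic values. One plausible reading, shown as a hedged sketch, is the arithmetic mean over all evaluated topics; the class SystemAverages and its methods are hypothetical, and the averaging rule itself is an assumption.

```java
import java.util.List;

/** Hypothetical helper: averages per-topic measures into system-level values. */
final class SystemAverages {
    /** Arithmetic mean of per-topic values (recall or precision); 0 for an empty list. */
    static double mean(List<Double> perTopic) {
        if (perTopic.isEmpty()) return 0.0;
        double sum = 0.0;
        for (double v : perTopic) sum += v;
        return sum / perTopic.size();
    }

    /** Point-wise mean of per-topic 11-point precision curves (R00..R10). */
    static double[] meanCurve(List<double[]> curves) {
        double[] avg = new double[11];
        if (curves.isEmpty()) return avg;
        for (double[] curve : curves) {
            for (int k = 0; k < 11; k++) avg[k] += curve[k];
        }
        for (int k = 0; k < 11; k++) avg[k] /= curves.size();
        return avg;
    }
}
```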

e) CKetQua:
Meaning: defines the format of the result file returned by the IR system (this result file is the actual relevance table).
Description:
No. | Name | Type | Meaning
1 | TopicID | String | Tag corresponding to TopicID
2 | DocID | String | Tag corresponding to DocID
3 | ThuocTinh_TopicID | String | Name of the attribute corresponding to TopicID, if the tag corresponding to TopicID has attributes
4 | ThuocTinh_DocID | String | Name of the attribute corresponding to DocID, if the tag corresponding to DocID has attributes
5 | TagSim | String | Tag corresponding to the similarity (the Similarity tag)
6 | ThuocTinh_Sim | String | Name of the attribute corresponding to Similarity, if the tag corresponding to Similarity has attributes


f) CRelevant:
Meaning: defines the relevance information of the document at rank n in the list of documents returned for a topic. The precision and recall at position n are stored in an array of CRelevant, which is then used to interpolate the precision at the 11 standard recall points (that is, to compute P(r) for r = 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0).
Description:
No. | Name | Type | Meaning
1 | DocID | String | docID of the document at rank n
2 | bRelevant | Boolean | Whether the document at position n is relevant
3 | fPrecision | Float | Precision at position n
4 | fRecall | Float | Recall at position n
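The following Java sketch shows how an array of CRelevant could be filled from a ranked result list and then interpolated to the 11 standard recall points. The interpolation rule is not spelled out in the text; the sketch assumes the common convention that P(r) is the highest precision observed at any position whose recall is at least r. The class ElevenPoint and its method names are hypothetical.

```java
import java.util.List;
import java.util.Set;

/** Fields as in the CRelevant table. */
final class CRelevant {
    String DocID;
    boolean bRelevant;
    float fPrecision;
    float fRecall;
}

/** Hypothetical helper for the 11-point interpolated precision of one topic. */
final class ElevenPoint {
    /** Precision and recall at every rank n of the returned list. */
    static CRelevant[] rank(List<String> returned, Set<String> relevant) {
        CRelevant[] out = new CRelevant[returned.size()];
        int hits = 0;
        for (int n = 0; n < out.length; n++) {
            CRelevant r = new CRelevant();
            r.DocID = returned.get(n);
            r.bRelevant = relevant.contains(r.DocID);
            if (r.bRelevant) hits++;
            r.fPrecision = (float) hits / (n + 1);
            r.fRecall = relevant.isEmpty() ? 0f : (float) hits / relevant.size();
            out[n] = r;
        }
        return out;
    }

    /** Assumed rule: P(r) = max precision over positions with recall >= r. */
    static float[] interpolate(CRelevant[] ranked) {
        float[] p = new float[11];
        for (int k = 0; k <= 10; k++) {
            float level = k / 10f;
            float best = 0f;
            for (CRelevant r : ranked) {
                // Small tolerance guards against float rounding at the recall levels.
                if (r.fRecall + 1e-6f >= level && r.fPrecision > best) best = r.fPrecision;
            }
            p[k] = best;
        }
        return p;
    }
}
```

Recall levels that the ranked list never reaches keep the value 0, which corresponds to a system that did not return enough relevant documents for that topic.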

g) CIndex:
Meaning: defines the format of the index files returned by the IR system.
Description:
No. | Name | Type | Meaning
1 | ID | String | ID of the indexed topic or document
2 | word | String | Meaningful term that occurs in the topic or document
3 | weigh | Integer | Weight of the term

h) XL_XML:
Class diagram:
[Class diagram: XL_XML, with the method TranslateXML]

Description:
Method: TranslateXML
o Meaning: converts the structure of one XML file into the structure of another XML file
o Input parameters:
No. | Name | Type | Meaning
1 | format | Array of CFormat | Array of CFormat describing the new structure to which the XML file is converted
2 | newFile | String | New XML file, including its path
3 | oldPathFile | String | Path of the old XML file
4 | oldFile | String | Name of the old XML file

Note: the parameter format is an array of CFormat in which:
newTag: the name of the tag in the new XML file that corresponds to oldTag
oldTag: the tag in the old XML file
newId: the position of the new tag in the new XML file
oldId: the position of the old tag
haveAttr: whether the new tag to be created has attributes; if it does, the attributes are stored in the array attrArray

Example: given the following XML file:
<DOCUMENT>                 (root node, position = 0)
  <DOC>                    (node directly below the root, position = 1)
    <DOCID>1</DOCID>
    <TITLE>Thanh niên Việt Nam</TITLE>
  </DOC>
</DOCUMENT>
to be converted to an XML file with the following structure:
<TAILIEU>
  <TL ID="1">
    <CHUDE>Thanh niên Việt Nam</CHUDE>
  </TL>
</TAILIEU>
the format array is defined as follows:
CFormat[] f = new CFormat[3];

// Root node: DOCUMENT becomes TAILIEU and has no attributes
f[0] = new CFormat();
f[0].oldId = f[0].newId = 0;
f[0].newTag = "TAILIEU";
f[0].oldTag = "DOCUMENT";
f[0].haveAttr = false;

// Node below the root: DOC becomes TL and receives the attribute ID,
// whose value is taken from the old DOCID tag
f[1] = new CFormat();
f[1].newId = f[1].oldId = 1;
f[1].newTag = "TL";
f[1].oldTag = "DOC";
f[1].haveAttr = true;
f[1].attrArray = new CFormat[1];
f[1].attrArray[0] = new CFormat();
f[1].attrArray[0].oldTag = "DOCID";
f[1].attrArray[0].newTag = "ID";

// Data tag: TITLE becomes CHUDE
f[2] = new CFormat();
f[2].newId = f[2].oldId = 2;
f[2].newTag = "CHUDE";
f[2].oldTag = "TITLE";

o Return value:
No. | Name | Type | Meaning
1 | kq | Integer | Indicates whether the conversion succeeded or failed

Algorithm description:

- Find the newTag of the root node in the format array (the root node is the entry with newId = 0) and, from this declaration (newTag), create the root node of the new XML file.
- Read the old XML file using oldPathFile and oldFile: call the method ReadXML of class LT_XML.
- Find the position in the format array of the node directly below the root (the entry with oldId = 1) => idxroot.
- Get the list of nodes directly below the root (tag oldTag) => Nodes.
- Let nodelen be the length of the list Nodes.
- Loop for i = 0 to nodelen - 1:
  - Check whether format[idxroot] has attributes (format[idxroot].haveAttr).
    If it has attributes, go through all attributes of the node below the root and assign their content. For each attribute, look at its oldTag: if oldTag = "", the attribute is newly created; if oldTag != "", read the content of oldTag and assign it to format[idxroot].Content.
    Otherwise the node has no attributes.
    (Note: the node below the root, format[idxroot], holds no content of its own.)
  - Create the content of the entries format[k] (k != idxroot): check whether each entry of the format array other than format[idxroot] has attributes. If format[k] has attributes, read the content of the tag that corresponds to each attribute (oldTag) and record it in format[k].attrArray[j].Content.
  - After the content has been assigned to the new tags in the format array, sort the format array in ascending order of newId and create the node for the new XML file.
- End of loop i.
- The nodes of the old XML file have now been mapped to the new XML file.

k) XL_Text:
Class diagram:
[Class diagram: XL_Text, with the method TranslateText]

Description:
Method: TranslateText
o Meaning: converts the structure of an XML file into the structure of a text file, in which the tags of the XML file correspond to the sections of the text file
o Input parameters:
No. | Name | Type | Meaning
1 | format | Array of CFormat | Array of CFormat describing the new structure to which the file is converted
2 | newFile | String | New file, including its path
3 | oldPathFile | String | Path of the old XML file
4 | oldFile | String | Name of the old XML file

o Return value:
No. | Name | Type | Meaning
1 | kq | Integer | Indicates whether the conversion succeeded or failed

Algorithm:
- Find the newTag of the root node in the format array (the root node is the entry with newId = 0).
- Read the old XML file using oldPathFile and oldFile: call the method ReadXML of class LT_XML.
- Because the new text file has no notion of a root tag, only the tags that hold data are needed.
- Get the list of nodes whose tag is format[0].oldTag => Nodes (format[0] is the node directly below the root in the old XML file, the root itself not being counted).
- Let strNoiDung hold the content of the text file.
- Let nodelen be the length of the list Nodes.
- Loop for k = 0 to nodelen - 1. For each entry format[i], check whether format[i].oldTag = "":
  o If it is empty: create a new section with empty content in the text file:
    strNoiDung = strNoiDung + format[i].newTag + newline
  o If it is not empty: take the content of the tag named format[i].oldTag and assign it to the content of the new section, format[i].Content:
    strNoiDung = strNoiDung + format[i].newTag + newline + format[i].Content + newline
  End of loop.
- Write the text file: call the method WriteText(strNoiDung, newFile) of class LT_Text.
- Return success or failure to the program.

l) XL_Doc
Class diagram:
[Class diagram: XL_Doc. Attributes: static String Doc_Path = "coll"; static int SO_DOC = 50. Methods: DinhDangTaiLieu, DinhDangTaiLieuText, LayDSTaiLieu, LayTaiLieu]

Description:
Attributes:
No. | Name | Type | Meaning
1 | Doc_Path | String | Path of the program's document set; by default all document files are stored in the folder coll of the program
2 | SO_DOC | Integer | Number of documents in one document file; by default each document file holds 50 documents


Method: DinhDangTaiLieu
o Meaning: converts the format of the document set, producing XML files
o Input parameters:
No. | Name | Type | Meaning
1 | format | Array of CFormat | Array of CFormat describing the new structure to which the XML files are converted
2 | Path | String | Path of the folder that will hold the converted document set
o Return value: none
o Algorithm description:
- Starting from the variable Doc_Path, go through all XML files in that folder.
- If Path = "", the user has not entered a location for the converted document files => save all converted files in the program's default folder.
- If Path != "", save the converted files in the folder Path.
- For each XML document file of the program, call the method TranslateXML of class XL_XML to convert it and create a new XML file for each old XML document file.

Method: DinhDangTaiLieuText
o Meaning: converts the format of the document set, producing text files
o Input parameters:
No. | Name | Type | Meaning
1 | format | Array of CFormat | Array of CFormat describing the new structure to which the XML files are converted
2 | Path | String | Path of the folder that will hold the converted document set
o Return value: none
o Algorithm description:
- Starting from the variable Doc_Path, go through all XML files in that folder.
- If Path = "", the user has not entered a location for the converted document files => save all converted files in the program's default folder.
- If Path != "", save the converted files in the folder Path.
- For each XML document file of the program, call the method TranslateText of class XL_Text to convert it and create a new text file for each old XML document file.

Method: LayDSTaiLieu
o Meaning: gets the list of documents of any document file, given the name of the file
o Input parameter: fn (String): name of the document file to read
o Return value: dsDoc (array of CDocument): the documents contained in the document file named fn
o Algorithm description:
- Read the XML file named fn.
- Get the list of nodes in the XML file whose tag name is DOC => lstNode.
- Let n be the length of lstNode.
- Declare dsDoc as an array of CDocument with n elements.
- Go through the list lstNode: loop for i = 0 to n - 1

Start of loop i:
  Get the list of child nodes of lstNode[i] => childList
  Loop for j = 0 to the end of childList
  Start of loop j: find the tags DOCID, TITLE, CONTENT, ... and assign the content of each tag to dsDoc[i]
  End of loop j
End of loop i

Method: LayTaiLieu
o Meaning: gets a document, given its DocID
o Input parameter: DocID (String): identifier of the document to get
o Return value: Kq (CDocument): the document whose identifier is DocID
o Algorithm description:
- Find the name of the document file that contains the document with identifier DocID. Each file contains SO_DOC documents, and each document file is named by the rule "vn_" + n zeros + sequence number. For example, the file vn_000001 contains SO_DOC documents with DocIDs from 1 to 50, and the file vn_000002 contains SO_DOC documents numbered from 51 to 100. The name of the file that contains DocID can therefore be computed as follows:
    int id = Integer.parseInt(DocID);   // DocID is passed as a string
    int tenfile = id / SO_DOC;
    int m = id % SO_DOC;
    if (m != 0) tenfile += 1;           // a partly filled file still counts as one file
    file name = "vn_" + n zeros + tenfile
- Call the method LayDSTaiLieu with the file name fn as input: the result is the list of documents dsdoc (CDocument).
- Find in dsdoc the document whose docID = DocID => Kq.

- Return the matching document Kq to the program.

m) XL_Topic
Class diagram:
[Class diagram: XL_Topic. Attribute: static String Topic_Path = "topic". Methods: DinhDangCauHoi, DinhDangCauHoiText, LayDSTopicID, LayDSCauHoi, LayTongSoCauHoi, NoiDungTopic]

Description:
Attribute: Topic_Path (String): the folder that holds the topic set; by default the topic set is stored in the folder topic of the program.

Method: DinhDangCauHoi
o Meaning: converts the format of the topic set, producing XML files
o Input parameters:
No. | Name | Type | Meaning
1 | format | Array of CFormat | Array of CFormat describing the new structure to which the XML files are converted
2 | Path | String | Path of the folder that will hold the converted topic set
o Return value: none
o Algorithm description:
- Starting from the variable Topic_Path, go through all XML files in that folder.
- If Path = "", the user has not entered a location for the converted topic files => save all converted files in the program's default folder.
- If Path != "", save the converted files in the folder Path.
- For each XML topic file of the program, call the method TranslateXML of class XL_XML to convert it and create a new XML file for each old XML topic file.

Method: DinhDangCauHoiText

o Meaning: converts the format of the topic set, producing text files
o Input parameters:
No. | Name | Type | Meaning
1 | format | Array of CFormat | Array of CFormat describing the new structure to which the XML files are converted
2 | Path | String | Path of the folder that will hold the converted topic set
o Return value: none
o Algorithm description:
- Starting from the variable Topic_Path, go through all XML files in that folder.
- If Path = "", the user has not entered a location for the converted topic files => save all converted files in the program's default folder.
- If Path != "", save the converted files in the folder Path.
- For each XML topic file of the program, call the method TranslateText of class XL_Text to convert it and create a new text file for each old XML topic file.

Method: LayDSTopicID
o Meaning: gets all topicIDs of the topic files
o Input parameters: none
o Return value: array of strings topicID: the topicIDs of all topics
o Algorithm description:
- Go through the folder whose path is Topic_Path.
- Read each XML topic file in the folder Topic_Path.
- Get the list of nodes in the XML file whose tag name is TOP => lstNode.
- Let n be the length of lstNode.
- Declare topicID as an array of strings with n elements.
- Go through the list lstNode: loop for i = 0 to n - 1
  Start of loop i:
    Get the list of child nodes of lstNode[i] => childList
    Loop for j = 0 to the end of childList
    Start of loop j: find the tag TOPID and assign its content to topicID[i]
    End of loop j
  End of loop i

Method: LayDSCauHoi
o Meaning: gets all topic information of the topic files
o Input parameters: none
o Return value: array of CTopic dsTopic: the information of all topics
o Algorithm description:
- Go through the folder whose path is Topic_Path.
- Read each XML topic file in the folder Topic_Path.

Trang 154

Lun vn : nh gi cc h thng tm kim thng tin

Ly danh sch cc node trong file xml c tn th l TOP => lstNode Gi n l chiu di ca lstNode Khai bo dsTopic l mng CTopic vi n phn t Duyt danh sch lstNode: Lp i=0 n n u lp i: Ly danh sch node con ca lstNode[i] => childList Lp j=0 cho n cui childList u lp j tm th TOPID, TITLE v ly ni dung ca cc th gn vo dsTopic[i] Cui lp j Cui lp i Phng thc LayTongSoCauHoi

o ngha :ly tng s cu hi thc hin c trong kho d liu o Tham s u vo : (khng c) o Kt qu tr v: Kq : kiu s nguyn o M t thut ton: Gi phng thc LayDSTopicID =>dsTopic Kq= chiu di ca mng dsTopic Phng thc NoiDungTopic o ngha:ly ni dung ca mt cu hi bt k da vo TopIDca n o Tham s u vo: TopID(kiu chui ): l ch s ca cu hi cn ly o Kt qu tr v: Kq (Kiu chui) : lu thng tin ca ni dung ca cu hi c ch s l TopID
Trang 155

Lun vn : nh gi cc h thng tm kim thng tin

o M t thut ton: Gi phng thc LayDSCauHoi => ds Tm trong ds cu hi c topID=TopID => Kq Tr v cho chng trnh ti liu thch hp Kq n) XL_KetQua S lp:
XL_KetQua static String result_Path="result" static String eval_File="evaluation.xml" static String system_File="system.xml" static String rel_File="relevant.xml" static String relevant_Path="result/relevant" float RTrungBinh=0; float PTrungBinh=0; int TongSoTopic=0 String systemID DinhDangKqIR CapNhat_evaluation TinhDoChinhXac_Chuan TaoSysID ThemSystem

Description:
Method DinhDangKqIR
o Meaning: converts the result file returned by an IR system to the program's format.
o Input parameters:
   No.  Name   Type      Meaning
   1    fn     String    Name of the result file returned by the IR system
   2    sysID  String    System ID of the IR system
   3    kq     CKetQua   Format of the result file
o Return value: Kq: integer; indicates whether the conversion succeeded or failed.
o Algorithm:
   Note: the system's result file is converted to a new file with the structure defined by the program, as follows (see the section "Data storage organization (relevant_TT)"):
        <RELEVANT>
           <REL TOPID="">
              <DOCID SIMILARITY=""></DOCID>
           </REL>
        </RELEVANT>
   The actual relevance table of each system (the results the system actually returned) is stored in a separate file named rel_ + sysID + .xml
   - Create a new XML file and create the root node RELEVANT.
   - Read the XML file fn and get the nodes named kq.TopicID.
   - Iterate over all of these nodes:
        - Check whether the TopicID node carries its value in an attribute (kq.ThuocTinh_TopicID == "" ?)
             If there is an attribute: assign the attribute value to the temporary variable temp
             If there is no attribute: take the value of the TopicID node
        - Create the node REL with the attribute TOPID whose value is the content of temp
        - Get the child nodes of the TopicID node
        - Iterate over all child nodes:
             Find the node that is kq.DocID:
                Check whether the DocID node carries its value in an attribute (kq.ThuocTinh_DocID == "" ?)
                   If there is an attribute: assign the attribute value to strChild
                   Otherwise: assign the value of the DocID node to strChild
                Create the node DOCID with content strChild
             Check whether there is a child node named kq.TagSim:
                Check whether the TagSim node carries its value in an attribute (kq.ThuocTinh_Sim == "" ?)
                   If there is an attribute: assign the attribute value to strChild
                   Otherwise: assign the value of the TagSim node to strChild
                Create the attribute SIMILARITY of the DOCID node with content strChild
   - Write the new XML file, named rel_ + sysID + .xml => call the method WriteXML of class LT_XML

Method CapNhat_evaluation

o Meaning: updates the file evaluation.xml by adding the evaluated topics that the IR system has run.
o Input parameter: sysID (string): system ID of the system being evaluated.
o Return value: (none)
o Algorithm:
   - Read the file evaluation.xml in preparation for adding the list of evaluated topics run by the IR system (sysID).
   - Create the node SYSTEM with the attribute SYSID = sysID.
   - Read the file relevant.xml, which stores the theoretical relevance table (the standard relevance judgments), to obtain the theoretically relevant documents.
   - Obtain the actual relevance table. This depends on the IR system under test. Because the actual relevance table returned by each IR system is stored in one file per system, named rel_ + sysID, the file holding the actual relevance table of IR system sysID is:
        retfn = relevant_Path + "/rel_" + sysID + ".xml"
   - Get the list of topicIDs (=> call the method LayDSTopicID of class XL_Topic).
   - Iterate over the list; for each topicID:
        o Get the theoretically relevant documents for topicID; the result is the string array relArr of relevant DocIDs => gives the number of relevant documents
        o Get the documents actually returned for topicID; the result is the string array retArr of returned DocIDs => gives the number of returned documents
        o Take the intersection of relArr and retArr => gives the number of relevant documents returned
        o Compute recall
        o Compute precision
        o Compute precision at the 11 standard recall points => call the method TinhDoChinhXac_Chuan => the result is an array of 11 elements holding the precision at the 11 standard recall points (strPrecision)
        o Update the average R and average P
        o Record this topicID in the file evaluation.xml

Method TinhDoChinhXac_Chuan

o Meaning: computes precision at the 11 standard recall points for one topic.
o Input parameters:
   No.  Name    Type           Meaning
   1    retArr  String array   DocIDs actually returned by the IR system for one topicID
   2    relArr  String array   DocIDs that are theoretically relevant for that topicID
o Return value: strPrecision (string array): the precision values at the 11 standard recall points.
o Algorithm:
   // The class CRelevant is used here to compute precision and recall at the
   // n-th position of the returned document list
   - Declare arrRelTT as an array of CRelevant and initialise it:
        o DocID: assign the returned DocIDs (retArr) to arrRelTT
        o bRelevant: for each DocID that is theoretically relevant (according to relArr), set bRelevant at that position to true
   - Count the relevant documents in arrRelTT, that is, the entries with bRelevant equal to true => sumRelLT
   - Compute R and P at every position:
        o Loop i = 0 to the end of arrRelTT
             - If (arrRelTT[i].bRelevant == true) soRel_Rel is increased by 1
             - arrRelTT[i].fPrecision = soRel_Rel / (i + 1);   (floating-point division)
             - arrRelTT[i].fRecall = soRel_Rel / sumRelLT;     (floating-point division)
          End of loop i
   - Interpolate over the R and P values at positions 1 to the end of arrRelTT:
        Let Recall = 0
        Let arrP be the array of real numbers holding the precision at the 11 standard points
        o Loop i = 0 to 10 (Recall takes the values 0.0, 0.1, ..., 1.0)
             - Find the index idx in arrRelTT such that arrRelTT[idx].fRecall = Recall, or (arrRelTT[idx].fRecall < Recall + 0.1 and arrRelTT[idx].fRecall > Recall)
             - If no such index is found (idx = -1): set arrP[i] = -1;
             - Otherwise (idx != -1): find the maximum precision from position idx to the end of arrRelTT and assign it to arrP[i] => this is the precision at the standard recall point
          End of loop i
   - After interpolation, the real-valued array arrP holds the 11 precision values => convert the real numbers to strings to make writing to file easier => strPrecision
   - Return the array strPrecision to the caller.

Method TaoSysID
o Meaning: generates the system ID for a new IR system.
o Input parameters: (none)
o Return value: systemID: string
o Algorithm:
   - Read the file system.xml
   - Get the number of SYSTEM nodes => sysID
   - sysID = sysID + 1
   - Convert sysID to a string => systemID
   - Return systemID to the caller.

Method ThemSystem
o Meaning: adds an evaluated IR system to the file system.xml.
o Input parameters:
   No.  Name   Type     Meaning
   1    sysID  String   System ID of the system to add
   2    name   String   Name of the system to add
o Return value: (none)
o Algorithm:
   - Read the file system.xml (the file that stores all systems that have been evaluated).
   - Create the node SYSTEM and its attribute SYSID with the value sysID.
   - Create the nodes NAME, DATE, AVGPRECISION, AVGRECALL with the values name, the system date and time, PTrungBinh and RTrungBinh respectively.
   - Write the XML file (call the method WriteXML of class LT_XML).

o) XL_HeThongIR
Class diagram:

   XL_HeThongIR
   ---------------------------------------------
   static String path_system = "result/system.xml"
   ---------------------------------------------
   ThucThiHTIR
   LayDSThongTinHeThong
   LayHeThong

Description:
Method ThucThiHTIR
o Meaning: launches the IR system.
o Input parameters: (none)
o Return value: (none)
o Algorithm: call the IR system and display it so that the user can run it.

Method LayDSThongTinHeThong
o Meaning: gets the list of IR systems that have been evaluated.
o Input parameters: (none)
o Return value: ht, an array of CHeThongIR holding the information of each IR system in the list: sysID, name, evaluation date, average R, average P.
o Algorithm:
   - Read the file system.xml
   - Take the contents of the tags SYSTEM (SYSID), NAME, DATE, AVGRECALL, AVGPRECISION and assign them to ht[i]
   - Return the list ht to the caller.

Method LayHeThong
o Meaning: gets a single system given its sysID.
o Input parameter: sysID (string): system ID of the system to retrieve.
o Return value: kq (CHeThongIR): the IR system.
o Algorithm:
   - Call the method LayDSThongTinHeThong => ht
   - Scan ht for the element ht[i] whose systemID equals sysID => kq
   - Return kq to the caller.
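The lookup performed by LayHeThong can be sketched with the standard Java DOM API. This is only an illustration under stated assumptions: the class name SystemLookup and the root element name SYSTEMS are hypothetical (the text names only the SYSTEM node, its SYSID attribute and its child tags), and the XML is read from a string instead of result/system.xml so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class SystemLookup {

    /** Returns the NAME of the SYSTEM element whose SYSID equals sysId, or null if absent. */
    static String findSystemName(String xml, String sysId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList systems = doc.getElementsByTagName("SYSTEM");
            for (int i = 0; i < systems.getLength(); i++) {
                Element system = (Element) systems.item(i);
                if (sysId.equals(system.getAttribute("SYSID"))) {
                    NodeList names = system.getElementsByTagName("NAME");
                    return names.getLength() > 0 ? names.item(0).getTextContent().trim() : null;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the system list", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical content; the root element name SYSTEMS is an assumption.
        String xml = "<SYSTEMS>"
                + "<SYSTEM SYSID=\"1\"><NAME>search4VN</NAME></SYSTEM>"
                + "<SYSTEM SYSID=\"2\"><NAME>Lucene</NAME></SYSTEM>"
                + "</SYSTEMS>";
        System.out.println(findSystemName(xml, "2"));
    }
}
```

The same loop would fill a CHeThongIR object with DATE, AVGRECALL and AVGPRECISION if the full record were needed.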


p) XL_Index
Class diagram:

   XL_Index
   ----------------------------------
   String index_Path = "result/index"
   ----------------------------------
   DinhDangIndex
   DinhDangIndex_topic
   DinhDangIndex_document
   LayNoiDungIndex
   LayNoiDungIdx_Doc
   LayNoiDungIdx_Topic

Description:
Method DinhDangIndex
o Meaning: because the document index file and the topic index file have similar structures, DinhDangIndex converts both kinds of file to the same structure.
o Input parameters:
   No.  Name   Type     Meaning
   1    fn     String   Name of the IR system's index file
   2    newfn  String   Name of the new XML file
   3    idx    CIndex   Format of the index file
o Return value: Kq: program status, success or failure.
o Algorithm (similar to the conversion of the result file):
   Note: the system's index file is converted to a new file with the structure defined by the program, as follows (see the section "Data storage organization (index)"):
        <MATRIX SIZE="">
           <INDEX ID="" SIZE="">
              <TERM WORD="" WEIGH=""/>
           </INDEX>
        </MATRIX>
   Each converted index is stored in a separate file whose name is given by newfn.
   - Read the XML file fn and get the nodes named idx.ID => Nodes
   - Create a new XML file and create the root node MATRIX, with the attribute SIZE equal to the number of nodes in Nodes
   - Iterate over all nodes in Nodes:
        - Check whether the ID node carries its value in an attribute (idx.ThuocTinh_ID == "" ?)
             If there is an attribute: assign the attribute value to the temporary variable temp
             If there is no attribute: take the value of the ID node
        - Create the node INDEX with the attribute ID whose value is the content of temp
        - Get the child nodes of the ID node => child
        - Create the attribute SIZE of the node INDEX, equal to the number of nodes in child
        - Iterate over all nodes in child:
             Find the node that is idx.Term:
                Check whether the Term node carries its value in an attribute (idx.ThuocTinh_Term == "" ?)
                   If there is an attribute: assign the attribute value to strChild
                   Otherwise: assign the value of the Term node to strChild
                Create the node TERM with the attribute WORD whose content is strChild
             Check whether there is a child node named idx.Weigh:
                Check whether the Weigh node carries its value in an attribute (idx.ThuocTinh_Weigh == "" ?)
                   If there is an attribute: assign the attribute value to strChild
                   Otherwise: assign the value of the Weigh node to strChild
                Create the attribute WEIGH of the node TERM with content strChild
   - Write the new XML file under the name newfn => call the method WriteXML of class LT_XML

Method DinhDangIndex_topic

o Meaning: converts the topic index file.
o Input parameters:
   No.  Name   Type     Meaning
   1    fn     String   Name of the IR system's index file
   2    sysId  String   System ID of the IR system
   3    idx    CIndex   Format of the index file
o Return value: Kq: program status, success or failure.
o Algorithm:
   - The topic index file is named according to the rule "idxTopic_" + sysId + ".xml" => newfn
   - Call the method DinhDangIndex with the input parameters fn, newfn, idx

Method DinhDangIndex_document
o Meaning: converts the document index file.
o Input parameters:
   No.  Name   Type     Meaning
   1    fn     String   Name of the IR system's index file
   2    sysId  String   System ID of the IR system
   3    idx    CIndex   Format of the index file
o Return value: Kq: program status, success or failure.
o Algorithm:
   - The document index file is named according to the rule "idxDocument_" + sysId + ".xml" => newfn
   - Call the method DinhDangIndex with the input parameters fn, newfn, idx

Method LayNoiDungIndex
o Meaning: because the document and topic index files have similar structures, LayNoiDungIndex is shared by the two methods that read the content of the document index and of the topic index.
o Input parameters:
   No.  Name  Type     Meaning
   1    fn    String   Name of the index file
   2    ID    String   ID of the document or topic
o Return value: strNoiDung: two-dimensional array holding the index content, with the two fields WORD and WEIGH.
o Algorithm:
   - Read the XML file fn and get the nodes named INDEX => lst
   - Iterate over the nodes in lst:
        - Get the attribute ID of the INDEX node => strId
        - Check whether strId is the ID being looked for
        - If so, get the child nodes of the INDEX node => childList
        - Iterate over the nodes in childList:
             Assign the value of the child's WORD attribute to strNoiDung[k][0]
             Assign the value of the child's WEIGH attribute to strNoiDung[k][1]
             k = k + 1;

Method LayNoiDungIdx_Doc
o Meaning: gets the content of the document index file.
o Input parameters:
   No.  Name   Type     Meaning
   1    sysID  String   System ID of the IR system
   2    DocId  String   ID of the document
o Return value: strNoiDung: two-dimensional array holding the index content, with the two fields WORD and WEIGH.
o Algorithm:
   - Let fn = "idxDocument_" + sysID + ".xml"
   - Call the method LayNoiDungIndex with the input parameters fn, DocId

Method LayNoiDungIdx_Topic
o Meaning: gets the content of the topic index file.
o Input parameters:
   No.  Name   Type     Meaning
   1    sysID  String   System ID of the IR system
   2    TopId  String   ID of the topic
o Return value: strNoiDung: two-dimensional array holding the index content, with the two fields WORD and WEIGH.
o Algorithm:
   - Let fn = "idxTopic_" + sysID + ".xml"
   - Call the method LayNoiDungIndex with the input parameters fn, TopId

q) XL_DoThi (draws the RP curve)

3.3.2.7.2. Storage classes
a) LT_XML:
Class diagram:

   LT_XML
   ---------
   ReadXML
   WriteXML

Description:
Method ReadXML
o Meaning: reads an arbitrary XML file.
o Input parameter: fileName (string): name of the XML file to read.
o Return value: Kq: integer, indicating whether the file was read successfully or not.
o Algorithm:
   - Read the XML file named fileName
   - Return the result of the read to the caller.

Method WriteXML
o Meaning: writes an arbitrary XML file.
o Input parameters:
   fn (string): name of the file to write
   Doc (Document): content of the nodes of the new XML file to create
o Return value: Kq: integer, indicating whether the file was written successfully or not.
o Algorithm:
   - Write the content of the nodes to the new XML file
   - Return the result to the caller.

b) LT_Text:
Class diagram:

   LT_Text
   ---------
   WriteText

Description:
Method WriteText
o Meaning: writes a string to a text file => used to create the text files when converting from XML format to text format.
o Input parameters:
   fn (string): name of the text file to create
   strNoiDung (string): content to write to the file
o Return value: Kq: integer, telling the program whether the file was created successfully or not.
o Algorithm:
   - Create a new file named fn
   - Write strNoiDung to the newly created file.
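A minimal sketch of what WriteText might look like in Java is given below. It is not the thesis code: the class name LtTextSketch, the status codes 0 and -1, and the choice of UTF-8 are assumptions. The text only says that the method returns an integer indicating success or failure.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class LtTextSketch {

    // Assumed status codes; the thesis only states that an integer is returned.
    static final int OK = 0;
    static final int FAILED = -1;

    /** Creates (or overwrites) the file fn and writes strNoiDung to it as UTF-8. */
    static int writeText(String fn, String strNoiDung) {
        try {
            Files.write(Paths.get(fn), strNoiDung.getBytes(StandardCharsets.UTF_8));
            return OK;
        } catch (IOException | RuntimeException e) {
            return FAILED;
        }
    }

    public static void main(String[] args) {
        String target = System.getProperty("java.io.tmpdir") + File.separator + "lt_text_demo.txt";
        System.out.println("status = " + writeText(target, "TOPID: 1"));
    }
}
```

Writing with an explicit encoding matters here because the collection is Vietnamese and the platform default encoding differs between Windows and Linux.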


Chapter 4 : EVALUATION RESULTS

4.1. Evaluation thresholds
To evaluate an information retrieval system, the program builds a standard relevance judgment table. This table is produced with the pooling method, using a pool length of the top 50 relevant documents. The program evaluates the two systems search4VN and Lucene at the evaluation thresholds 50, 100 and 1000 in turn, and then compares the two systems. To assess the reliability and the retrieval ability of a system, we compute recall and precision for each query, and the recall and precision of the system as a whole (the average recall and average precision). The systems are then evaluated on the basis of the standard RP curve.

4.2. Evaluation of the search4VN system
Evaluation of search4VN at the thresholds 50, 100 and 1000 (R_i = recall and P_i = precision of query i):

Query  Threshold c = 50        Threshold c = 100       Threshold c = 1000
       R_i         P_i         R_i         P_i         R_i         P_i
1      0.66        0.66        0.82        0.41        0.86        0.043
2      0.56        0.56        0.7         0.35        0.88        0.06128
3      0.46        0.46        0.64        0.32        0.82        0.041
4      0.58        0.58        0.76        0.38        0.88        0.044
5      0.56        0.56        0.82        0.41        0.9         0.045
6      0.62        0.62        0.76        0.38        0.88        0.184874
7      0.28        0.28        0.42        0.21        0.8         0.04
8      0.46        0.46        0.48        0.24        0.86        0.06636
9      0.56        0.56        0.78        0.39        0.84        0.042
10     0.36        0.36        0.46        0.23        0.8         0.04
11     0.18        0.18        0.36        0.18        0.7         0.035
12     0.48        0.48        0.64        0.32        0.88        0.044
13     0.36        0.36        0.48        0.24        0.82        0.041
14     0.52        0.52        0.7         0.35        0.82        0.041
15     0.42        0.42        0.6         0.3         0.84        0.042
16     0.34        0.34        0.62        0.31        0.84        0.042
17     0.44        0.44        0.68        0.34        0.88        0.09224
18     0.34146342  0.28        0.4878049   0.2         0.80488     0.196429
19     0.34        0.34        0.38        0.19        0.8         0.04
20     0.6         0.6         0.84        0.42        0.86        0.043
21     0.4         0.4         0.62        0.31        0.82        0.041
22     0.52        0.52        0.68        0.34        0.88        0.044
23     0.42        0.42        0.74        0.37        0.92        0.055355
24     0.34        0.34        0.4         0.2         0.82        0.041
25     0.56        0.56        0.6         0.3         0.86        0.043
26     0.5         0.5         0.6         0.3         0.84        0.042
27     0.54        0.54        0.68        0.34        0.84        0.042
28     0.18        0.18        0.22        0.11        0.72        0.036
29     0.56        0.56        0.7         0.35        0.86        0.043
30     0.54        0.54        0.74        0.37        1.0         0.05
31     0.36        0.36        0.58        0.29        0.76        0.038
32     0.5         0.5         0.64        0.32        0.82        0.041
33     0.38        0.38        0.74        0.37        0.78        0.039
34     0.54        0.54        0.7         0.35        0.82        0.041
35     0.5         0.5         0.64        0.32        0.84        0.042
36     0.2         0.2         0.28        0.14        0.88        0.044
37     0.62        0.62        0.74        0.37        0.86        0.043
38     0.48        0.48        0.62        0.31        0.86        0.064
39     0.34        0.34        0.46        0.23        0.8         0.04
40     0.32        0.32        0.48        0.24        0.76        0.038
41     0.46        0.46        0.72        0.36        0.82        0.051638
42     0.3         0.3         0.74        0.37        0.8         0.09195
43     0.68        0.68        0.8         0.4         0.82        0.041
44     0.36        0.36        0.46        0.23        0.74        0.037
45     0.46        0.46        0.76        0.38        0.82        0.041
46     0.5         0.5         0.6         0.3         0.84        0.042
47     0.28        0.28        0.42        0.21        0.74        0.037
48     0.48        0.48        0.8         0.4         0.86        0.043
49     0.22        0.22        0.24        0.12        0.76        0.038
50     0.48        0.48        0.66        0.33        0.78        0.039
51     0.48        0.48        0.66        0.33        0.84        0.042
52     0.6         0.6         0.66        0.33        0.86        0.043
53     0.52        0.52        0.74        0.37        0.88        0.044
54     0.64        0.64        0.7         0.35        0.88        0.079855
55     0.72        0.72        0.82        0.41        0.9         0.045
56     0.38        0.38        0.58        0.29        0.86        0.043
57     0.46        0.46        0.66        0.33        0.86        0.043
58     0.2         0.2         0.34        0.17        0.88        0.0538
59     0.5         0.5         0.74        0.37        0.88        0.044
60     0.22        0.22        0.38        0.19        0.78        0.039
61     0.6         0.6         0.72        0.36        0.88        0.044
62     0.2         0.2         0.36        0.18        0.58        0.029
63     0.5         0.5         0.72        0.36        0.9         0.045
64     0.7         0.7         0.78        0.39        0.86        0.043
Mean   0.45096034  0.44999996  0.6148095   0.30671877  0.8332014   0.04985124

The last row gives the system-level averages over the n = 64 queries: $R = \frac{1}{n}\sum_{i=1}^{n} R_i$ and $P = \frac{1}{n}\sum_{i=1}^{n} P_i$.

Remarks:

At evaluation threshold 50: the pool length (50) equals the evaluation threshold, so recall and precision are equal. Query 18 is the exception. The number of truly relevant documents obtained from the intersection of the two systems is smaller than the pool length, so pooling with a pool length of 50 yields fewer theoretically relevant documents than the pool length: only 41 documents are theoretically relevant, while 50 documents are returned at threshold 50. Precision is therefore lower than recall for this query.
At evaluation threshold 100: the evaluation threshold (the number of documents returned) is twice the pool length (the number of theoretically relevant documents), so recall is almost exactly twice the precision.
At evaluation threshold 1000: the threshold is large, that is, search4VN returns a large number of documents (1000), so precision is much lower than at the thresholds 50 and 100.
For the same query: the higher the evaluation threshold (that is, the more documents the external search system returns), the more relevant documents may be returned, so at a high threshold recall is higher than precision.
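The arithmetic behind these remarks can be checked with a short sketch. It is an illustration, not the thesis code: the class and method names are hypothetical, and the figure of 14 relevant documents returned for query 18 at c = 50 is inferred from the reported precision (0.28 x 50), not stated in the text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CutoffMetrics {

    /** Counts the judged-relevant DocIDs among the first `cutoff` returned results. */
    static int relevantReturned(List<String> returned, Set<String> relevant, int cutoff) {
        int limit = Math.min(cutoff, returned.size());
        int hits = 0;
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(returned.get(i))) {
                hits++;
            }
        }
        return hits;
    }

    /** Recall = relevant documents returned / relevant documents in the judgment table. */
    static double recall(int hits, int relevantTotal) {
        return relevantTotal == 0 ? 0.0 : (double) hits / relevantTotal;
    }

    /** Precision = relevant documents returned / documents returned. */
    static double precision(int hits, int returnedTotal) {
        return returnedTotal == 0 ? 0.0 : (double) hits / returnedTotal;
    }

    public static void main(String[] args) {
        // Tiny made-up ranking, only to exercise relevantReturned().
        List<String> returned = Arrays.asList("d3", "d7", "d1", "d9");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d5"));
        System.out.println("hits in top 3 = " + relevantReturned(returned, relevant, 3));

        // Query 18 at c = 50: 41 judged relevant, 50 returned.
        // The 14 hits are inferred from the reported precision 0.28 (0.28 * 50).
        System.out.println("R = " + recall(14, 41) + ", P = " + precision(14, 50));
    }
}
```

With 50 judged relevant documents the two denominators coincide at c = 50, which gives R = P, and at c = 100 the precision denominator doubles, which gives P = R / 2.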

Precision computed at the 11 standard recall points:

R      P (c=50)      P (c=100)     P (c=1000)
0.0    0.92311794    0.9234109     0.9234109
0.1    0.7882654     0.78855836    0.78855836
0.2    0.69403636    0.6889597     0.6889597
0.3    0.63090414    0.6179958     0.6106299
0.4    0.5820023     0.55077547    0.53635085
0.5    0.5450672     0.49265072    0.46107948
0.6    0.5310865     0.4223546     0.3832133
0.7    0.5289436     0.39856696    0.29289013
0.8    0.5289436     0.39096713    0.1897545
0.9    0.5289436     0.39096713    0.18582451
1.0    0.5289436     0.39096713    0.18582451

Remark: the higher the evaluation threshold, the lower the precision, because the number of documents returned grows while the number of relevant documents returned grows only marginally.

The RP curves of the search4VN system are as follows:
[Figures: RP curves of search4VN at c = 50, c = 100 and c = 1000]
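For readers who want to reproduce 11-point values of this kind, the sketch below computes interpolated precision for one topic. It deliberately uses the standard interpolation rule (the precision at recall level r is the highest precision observed at any rank whose recall is at least r, and 0 if that level is never reached), which is not necessarily identical to the variant described for TinhDoChinhXac_Chuan: that method searches a 0.1-wide recall window, marks unreached levels with -1, and measures recall against the relevant documents found in the returned list. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ElevenPointPrecision {

    /**
     * @param ranked   DocIDs in the order the IR system returned them
     * @param relevant DocIDs judged relevant for the topic
     * @return interpolated precision at recall 0.0, 0.1, ..., 1.0
     */
    static double[] interpolate(List<String> ranked, Set<String> relevant) {
        int n = ranked.size();
        double[] precision = new double[n];
        double[] recall = new double[n];
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
            precision[i] = (double) hits / (i + 1);
            recall[i] = relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
        }
        double[] result = new double[11];
        for (int level = 0; level <= 10; level++) {
            double target = level / 10.0;
            double best = 0.0;
            for (int i = 0; i < n; i++) {
                // The small tolerance guards against floating-point error at the boundary.
                if (recall[i] >= target - 1e-9 && precision[i] > best) {
                    best = precision[i];
                }
            }
            result[level] = best;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up ranking: two relevant documents, found at ranks 1 and 3.
        List<String> ranked = Arrays.asList("d1", "d2", "d3", "d4", "d5");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3"));
        System.out.println(Arrays.toString(interpolate(ranked, relevant)));
    }
}
```

Averaging the 11 values over all topics gives one curve per system and threshold, which is the form of the table above.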

4.3. Comparison of the search4VN and Lucene systems
Comparison of search4VN and Lucene at the thresholds 50 and 1000:

       c = 50                        c = 1000
R      P (search4VN)  P (Lucene)     P (search4VN)  P (Lucene)
0.0    0.92311794     0.9883535      0.9234109      0.9883535
0.1    0.7882654      0.9370161      0.78855836     0.9370161
0.2    0.69403636     0.8891043      0.6889597      0.88669646
0.3    0.63090414     0.8682885      0.6106299      0.86513025
0.4    0.5820023      0.8526954      0.53635085     0.8495781
0.5    0.5450672      0.8452069      0.46107948     0.84239674
0.6    0.5310865      0.8401279      0.3832133      0.83736676
0.7    0.5289436      0.83058465     0.29289013     0.828869
0.8    0.5289436      0.8242704      0.1897545      0.77039164
0.9    0.5289436      0.8242704      0.18582451     0.29168567
1.0    0.5289436      0.8242704      0.18582451     0.29168567

[Figures: RP curves of search4VN and Lucene at c = 50 and c = 1000]

Remark: the curve of search4VN lies below the curve of Lucene, so search4VN has lower retrieval performance than Lucene.
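The statement that one curve lies below the other can be verified point by point. The sketch below is only an illustration with hypothetical names; the two arrays are copied from the c = 50 columns of the comparison table in this section.

```java
final class CurveComparison {

    // Precision at recall 0.0, 0.1, ..., 1.0, copied from the c = 50 columns.
    static final double[] SEARCH4VN_50 = {
        0.92311794, 0.7882654, 0.69403636, 0.63090414, 0.5820023, 0.5450672,
        0.5310865, 0.5289436, 0.5289436, 0.5289436, 0.5289436
    };
    static final double[] LUCENE_50 = {
        0.9883535, 0.9370161, 0.8891043, 0.8682885, 0.8526954, 0.8452069,
        0.8401279, 0.83058465, 0.8242704, 0.8242704, 0.8242704
    };

    /** True if curve a is at or above curve b at every standard recall point. */
    static boolean dominates(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("curves must have the same length");
        }
        for (int i = 0; i < a.length; i++) {
            if (a[i] < b[i]) {
                return false;
            }
        }
        return true;
    }

    /** Mean of a[i] - b[i] over all recall points. */
    static double meanGap(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("curves must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] - b[i];
        }
        return a.length == 0 ? 0.0 : sum / a.length;
    }

    public static void main(String[] args) {
        System.out.println("Lucene at or above search4VN everywhere: "
                + dominates(LUCENE_50, SEARCH4VN_50));
        System.out.printf("Mean precision gap: %.4f%n", meanGap(LUCENE_50, SEARCH4VN_50));
    }
}
```

When neither curve dominates the other, a single summary such as the mean gap is less conclusive and the curves have to be compared by recall region.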


4.4. Remarks on the evaluation support program

4.4.1. Advantages
_ It can evaluate both English and Vietnamese systems.
_ It does not depend on the format of the test collection used for evaluation.
_ It can compare retrieval systems with one another.
_ It evaluates retrieval systems on the basis of RP curves, which are visual and easy to understand.
_ It copes with the differences between Vietnamese and other languages. Vietnamese is an isolating, non-inflecting language, whereas English is an inflecting language, and in Vietnamese word boundaries cannot be determined from whitespace as they can in inflecting languages. We solved this problem by normalising Vietnamese words so that English retrieval systems can recognise Vietnamese word boundaries and search Vietnamese text.

4.4.2. Disadvantages
_ The evaluation support system gives truly reliable results only when the standard relevance judgment table is accurate and objective. The quality of the evaluation therefore depends on that table.
_ Vietnamese is written with diacritics, so the encoding of Vietnamese causes many difficulties when English retrieval systems index a Vietnamese corpus. As a result, retrieval systems designed only for English could not simply be run on Vietnamese. In this thesis we studied English retrieval systems such as SMART, IOTA, TERRIER and LUCENE and met many difficulties in indexing the Vietnamese corpus, even though we did our best to modify the source code of all of these systems. In the end, only LUCENE was able to search Vietnamese text.
_ The standard relevance judgment table is extracted from the intersection of the results of the two systems LUCENE and Search4VN. Because few Vietnamese retrieval systems are available, the table initially contained some inaccuracies. We tried to overcome this by reviewing the table manually and keeping the documents that are truly the most relevant to each query. This is only a temporary solution for our current test collection. To develop the test collection further, the judgment table should be extended by running more Vietnamese retrieval systems; this requires no change to the model of the system.


Chapter 5 : CONCLUSION
Evaluating a model or a system is, in general, no less important than building it. Our project aims to automate the evaluation of information retrieval systems (IR systems). With this automation, the retrieval ability and performance of IR systems can be evaluated quickly, accurately and, most importantly, objectively. Thanks to such evaluation, the builders of an IR system receive fast and timely feedback, so that they can adjust the model or method they have just implemented and tested. It is this timely and appropriate tuning of the model parameters on the IR system itself that leads to an optimal IR system. Fast and timely evaluation also encourages the builders of IR systems and saves them time and effort, compared with waiting a long time for a manual evaluation as in the past, where many users had to work with the system over a long period before any feedback was received, and that feedback could be accurate but could also be subjective. The builders can then invest more time and effort in improving their model or method. The evaluation reveals the strengths and weaknesses of each IR system, from which the most suitable system can be chosen to serve information retrieval needs effectively. We hope that this project is a small but meaningful contribution to research in information retrieval.


Chapter 6 : FUTURE WORK

Research on the evaluation of information retrieval systems is very diverse, with many different evaluation methods and models. These models and methods continue to be studied and discussed around the world. On the basis of what has been studied and implemented, our project can be developed in the following directions:
_ The general evaluation model: a user-oriented evaluation model.
_ The method for building the test collection, in particular the method for building the standard relevance judgment table, in order to produce judgments that are objective and accurate.
_ The evaluation method: besides evaluation based on the 11 standard recall points, the project can be extended with other evaluation methods, such as Mean Average Precision (MAP), single-valued measures such as the Swets E-Measure, or the average search length.


APPENDIX

1. Sample topics:
<TOPIC>
   <TOP>
      <TOPID>1</TOPID>
      <TITLE>kinh tế tri thức</TITLE>
      <DES>nền kinh tế tri thức là gì, ý nghĩa của nền kinh tế tri thức, tình hình xây dựng nền kinh tế tri thức?</DES>
      <NARR>Các tài liệu liên quan phải có định nghĩa và ý nghĩa của kinh tế tri thức, các yếu tố hình thành nền kinh tế tri thức, nhu cầu xây dựng nền kinh tế tri thức tại Việt Nam, tình hình nền kinh tế tri thức tại Việt Nam</NARR>
   </TOP>
   <TOP>
      <TOPID>2</TOPID>
      <TITLE>vụ án tham nhũng lớn</TITLE>
      <DES>thông tin về các vụ án tham nhũng lớn</DES>
      <NARR>Các tài liệu liên quan phải chứa thông tin về các vụ án tham nhũng lớn, các tội danh liên quan như nhận hối lộ, biển thủ công quỹ, nguyên nhân và hậu quả của tham nhũng, ý kiến của nhân dân và báo chí, các biện pháp chống tham nhũng trong bộ máy công quyền</NARR>
   </TOP>
   <TOP>
      <TOPID>3</TOPID>
      <TITLE>an toàn giao thông tại Việt Nam</TITLE>
      <DES>vấn đề an toàn giao thông tại Việt Nam</DES>
      <NARR>Các tài liệu liên quan phải nói về tình hình an toàn giao thông tại Việt Nam gồm có các chính sách của chính phủ về an toàn giao thông, tình trạng vi phạm trật tự an toàn giao thông, ùn tắc giao thông, tai nạn giao thông</NARR>
   </TOP>
</TOPIC>

2. Sample document
<DOCUMENT>
   <DOC>
      <DOCID>1</DOCID>
      <TITLE>Thanh niên VN: Động lực cho những ý tưởng mới, tầm nhìn mới</TITLE>
      <AUTHOR>Tác giả: Đ.Bình</AUTHOR>
      <DATE>Ngày: 01/12/2000</DATE>
      <NEWS>Tên tờ báo: Tuổi trẻ Số báo: 155/2000 Thể loại: Trang: trang 1, 14</NEWS>
      <CONTENT>Thanh niên VN: Động lực cho những ý tưởng mới, tầm nhìn mới. (TT-Hà Nội) - Tại lễ khai mạc Diễn đàn thanh niên (TN) VN với chủ đề Sẵn sàng cho thế kỷ 21 sáng 30-11 tại Hà Nội (do Hội Liên hiệp TN VN phối hợp với các cơ quan LHQ tại VN tổ chức), ông Edouard Wattez, điều phối viên thường trú LHQ tại VN, nhấn mạnh: Với 60% dân số ở tuổi dưới 30, VN thật sự là một đất nước trẻ. Đây là một thời điểm khá đặc biệt trong lịch sử đất nước các bạn - thời điểm của hòa bình và đổi mới, thời điểm của VN bắt đầu mở cửa với thế giới và tiến hành hiện đại hóa, thời điểm của VN có vai trò to lớn trong các hội nghị toàn cầu, trong các tổ chức quốc tế và vai trò của VN ngày càng trở nên quan trọng hơn. TN VN có vai trò quan trọng trong quá trình mở cửa với thế giới.... Đ.Bình</CONTENT>
   </DOC>
</DOCUMENT>

3. Standard relevance judgment table
The standard relevance judgment table has two main components: the topics and the documents that are truly relevant to each topic. The DTD of the file that holds the table is organised as follows:
   <!ELEMENT RELEVANT (REL*)>
   <!ELEMENT REL (DOCID*)>
   <!ELEMENT DOCID (#PCDATA)>
   <!ATTLIST REL TOPID CDATA #REQUIRED>
Explanation:
   TOPID: the ID of the topic
   DOCID: the ID of a document that is relevant to the topic whose ID is TOPID
Example of part of the standard relevance judgment table:
<RELEVANT>
   <REL TOPID="1">
      <DOCID>10456</DOCID>
      <DOCID>3407</DOCID>
      <DOCID>2476</DOCID>
   </REL>
   <REL TOPID="2">
      <DOCID>6689</DOCID>
      <DOCID>1582</DOCID>
      <DOCID>12854</DOCID>
   </REL>
</RELEVANT>
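A judgment file of this shape can be loaded into memory with the standard Java DOM API, as in the sketch below. This is an illustration under assumptions, not the thesis code: the class name RelevanceTableReader is hypothetical, and the XML is passed as a string rather than read from relevant.xml so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class RelevanceTableReader {

    /** Maps each TOPID to the list of DOCIDs judged relevant for it, in file order. */
    static Map<String, List<String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> table = new LinkedHashMap<>();
            NodeList rels = doc.getElementsByTagName("REL");
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                // trim() tolerates stray whitespace inside the attribute value
                String topicId = rel.getAttribute("TOPID").trim();
                List<String> docIds = table.computeIfAbsent(topicId, k -> new ArrayList<>());
                NodeList docs = rel.getElementsByTagName("DOCID");
                for (int j = 0; j < docs.getLength(); j++) {
                    docIds.add(docs.item(j).getTextContent().trim());
                }
            }
            return table;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the relevance table", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<RELEVANT>"
                + "<REL TOPID=\"1\"><DOCID>10456</DOCID><DOCID>3407</DOCID><DOCID>2476</DOCID></REL>"
                + "<REL TOPID=\"2\"><DOCID>6689</DOCID><DOCID>1582</DOCID><DOCID>12854</DOCID></REL>"
                + "</RELEVANT>";
        System.out.println(parse(xml));
    }
}
```

A map from topic ID to document IDs is convenient because each evaluation step looks up the judged documents of one topic at a time.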

References

[1] Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, UK, 1999.
[2] Wessel Kraaij, Variations on Language Modeling for Information Retrieval, Thesis, Enschede, Print Partners Ipskamp, Enschede, 2004.
[3] Mei-Mei Wu & Diane H. Sonnenwald, Reflections on Information Retrieval Evaluation, TREC conference, 2004.
[4] F. C. Johnson, J. R. Griffiths, R. J. Hartley, A framework for the evaluation of Internet search engines, The Council of Museums, Archives and Libraries, UK, 2001.
[5] C. J. van Rijsbergen, Information Retrieval, 2nd edition, Butterworths, London, 1979. Chapter 7 is available at http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch7.html
[6] Gerard Salton, Michael J. McGill, Introduction to Modern Information Retrieval, International Student Edition, New York, 1983.
[7] Pia Borlund, The IIR evaluation model: a framework for evaluation of interactive information retrieval systems, Information Research, 2003.
[8] TREC conference: http://trec.nist.gov
[9] Ellen M. Voorhees, Overview of TREC 2003, National Institute of Standards and Technology, 2003.
[10] Đinh Điền, Giáo trình Xử lý Ngôn ngữ Tự nhiên (course textbook on natural language processing), Đại học Khoa học Tự nhiên Tp. Hồ Chí Minh, 2004.
[11] Nguyễn Văn Tu, Từ và vốn từ tiếng Việt hiện đại, NXB Đại học & THCN, Hà Nội, 1978.
[12] SMART ftp address: ftp://ftp.cs.cornell.edu/pub/smart/
[13] Jean-Pierre Chevallet, XIOTA: An open XML framework for IR Experimentation, CLEF conference, 2004.
[14] Terrier web site: http://ir.dcs.gla.ac.uk/terrier/
[15] Lucene web site: http://lucene.apache.org/java/docs/index.html
