
How to Build an Automated Text Summarizer:

An extraction-based summarizer
CS784 Spring 2013
Group Project (due by end of Spring term lectures)
In this assignment, you will build an extraction-based text summarizer that will input a
text, decide on the most significant sentences in the text according to a metric you will
specify, then list these significant sentences as a summary of the original text.
To build an automated text summarizer, for example, a typical word-frequency-based
summarizer, you will first need to understand the basic components of any text
summarizer. An overview of text summarization is given in the paper by Dr. Eduard
Hovy on our course website. You should begin by reviewing this paper, then review the
starter code given to you in "TextSummarizer.zip" on the website. The starter code is
written in Java and contains the following components. (Note: In your implementation,
you may use a programming language other than Java.)
1. The Main module:
a. Calls the Generator module to produce the summary of a text.
2. The Generator module contains the following methods:
a. setKeywords: Reads in the text, preprocesses it (i.e., converts to
lower case, removes punctuation, removes stop words (see below), and
removes suffixes, which is called "word stemming"), then assigns the
variable keywords to be the set of significant words in the
document.
b. stopwords.txt: It is necessary to have a list of English "stop
words". These are small function words, like "the", "and", "a", which
do not contribute meaning to the text summary. The file
stopwords.txt contains a list of common English stop words, to
which you may add others.
c. ***calcAllSentenceScores: Calculates a variable scores
giving the value of each sentence in the document. This sentence
scorer calculates the value of a sentence according to a single metric,
or a combined metric. The sentence scores will then be used to decide
which sentences to keep in the final summary. You will need to
devise a metric and implement this method. (We will discuss types of
metrics during the class sessions.)
d. **generateSignificantSentences: Ranks the sentences
in the document according to their scores and decides which ones to
keep (i.e., these sentences will have scores above a certain threshold).
You will need to implement this method.
e. *generateSummary: Prints out the most significant sentences in
the document. You will need to implement this method.
f. The methods above are labelled in order of easiest and shortest to
write ("*") to most difficult and time-consuming ("***").
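As a rough illustration of how the Generator steps above fit together, the following sketch preprocesses a text, scores each sentence by the average document-wide frequency of its terms, and keeps the sentences whose score reaches a chosen fraction of the best score. It is not the starter code: the class FrequencySummarizerSketch, its method names, the regex sentence splitter, the inline stop-word set, and the threshold rule are all assumptions made for this example. Stemming is left out; Stemmer.java would presumably be applied to each token inside preprocess.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: these names are illustrative and do not come from the starter code.
final class FrequencySummarizerSketch {

    // Lower-case the text, replace punctuation with spaces, and drop stop words.
    // Stemming is omitted (assumption: Stemmer.java would be applied per token here).
    static List<String> preprocess(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", " ");
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Naive splitter (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Score of a sentence = average document-wide frequency of its remaining terms.
    static double[] scoreSentences(List<String> sentences, Set<String> stopWords) {
        Map<String, Integer> frequency = new HashMap<>();
        List<List<String>> termsPerSentence = new ArrayList<>();
        for (String sentence : sentences) {
            List<String> terms = preprocess(sentence, stopWords);
            termsPerSentence.add(terms);
            for (String term : terms) {
                frequency.merge(term, 1, Integer::sum);
            }
        }
        double[] scores = new double[sentences.size()];
        for (int i = 0; i < scores.length; i++) {
            List<String> terms = termsPerSentence.get(i);
            double total = 0.0;
            for (String term : terms) {
                total += frequency.get(term);
            }
            scores[i] = terms.isEmpty() ? 0.0 : total / terms.size();
        }
        return scores;
    }

    // Keep, in original order, the sentences scoring at least `fraction` of the best score.
    static List<String> summarize(String text, Set<String> stopWords, double fraction) {
        List<String> sentences = splitSentences(text);
        double[] scores = scoreSentences(sentences, stopWords);
        double best = 0.0;
        for (double score : scores) {
            best = Math.max(best, score);
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= fraction * best) {
                kept.add(sentences.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Tiny inline stop-word list for the demo; the real list lives in stopwords.txt.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "and", "is", "of", "on"));
        String text = "The cat sat on the mat. The cat is a happy cat. Dogs bark.";
        for (String sentence : summarize(text, stopWords, 0.8)) {
            System.out.println(sentence);
        }
    }
}
```

Averaging over the number of terms keeps long sentences from winning on length alone; summing instead would be an equally defensible choice for a first attempt.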
3. Helper classes given to you:
a. Main.java: Calls the Generator to produce the text summary.
b. Word.java: Some useful utilities for processing individual words
in a document. (Note: You may not need all these methods for your
summarizer.)
c. TextExtractor.java: Contains methods to extract the
individual terms and sentences from a document.
d. TermPreprocessor.java: Contains utilities to convert a word
to lower case, remove punctuation, remove stop words, and produce
a word's stem (i.e., remove suffixes).
e. TermCollectionProcessor.java: Contains utilities for
keeping track of the number of occurrences of each term (word) in
the document, for sorting words according to their scores, etc. (May
or may not be needed depending on the metric you choose.)
f. TermCollection.java: Contains utilities for managing a
collection of terms. (May or may not be needed.)
g. StringTrimmer.java: Removes leading and trailing
characters from a string.
h. Stemmer.java: The Porter stemmer in Java. Returns a word's
"stem" (i.e., its root form, minus suffixes).
i. InputDocument.java: Contains utilities for setting up various
methods to process a document. (May or may not be needed.)
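The bookkeeping that TermCollectionProcessor.java is described as providing (counting term occurrences and sorting terms) can be pictured roughly as follows. This is a guess at the general idea, not the class's actual API: TermCountSketch, countTerms and topTerms are hypothetical names, and sorting by raw count with alphabetical tie-breaking is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not the API of TermCollectionProcessor.java.
final class TermCountSketch {

    // Count how many times each (already preprocessed) term occurs.
    static Map<String, Integer> countTerms(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    // Return up to k terms, most frequent first; ties are broken alphabetically.
    static List<String> topTerms(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("cat", "dog", "cat", "bird", "dog", "cat");
        System.out.println(topTerms(countTerms(terms), 2)); // prints [cat, dog]
    }
}
```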
4. Helper files given to you:
a. stopwords.txt: Common function words ("the", "a", "and",
etc.) to be removed before extracting a summary. You may add
others.
b. inputLarge.txt, inputNews.txt, inputTech.txt:
Sample input texts from different genres to use in testing. You are
expected to test your summarizer on additional texts. In developing a
text summarizer, it is usual to focus on texts from one specific genre
(e.g., news articles, blogs, email, product reviews, etc.).
c. outputLarge.txt, outputNews.txt, outputTech.txt:
Summaries of the above sample input texts. Note: These summaries
are for comparison only. There is no "right" summary of a text.
5. How to get started:
a. Download the zipfile "TextSummarizer.zip" from the course website:
http://www.cs.uwaterloo.ca/~cdimarco/cs784
b. Walk through the code following the sequence of classes and helper
methods shown above to make sure you understand the overall
structure of a summarizer.
c. Decide on a metric or combined metric for scoring sentences. For
example, you may use a word frequency-based metric (as the starter
code is set up to do), or a metric based on a sentence's position in the
text, or some other criterion. We will discuss types of metrics in our
class sessions, but the Hovy paper on the website gives many good
ideas.
d. Implement the Generator class methods for scoring sentences
(calcAllSentenceScores), for generating the most significant
sentences (generateSignificantSentences), and for
generating the final summary (generateSummary).
e. (Optional) You may find it useful to debug your text summarizer on
the input*.txt files provided.
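To make the idea of a "combined metric" from the steps above more concrete, here is one possible way to blend a word-frequency score with a position score. Everything in it is an assumption chosen for illustration: the class CombinedMetricSketch, the linear position score that favours early sentences, the normalization by the maximum score, and the single weight parameter. Other combinations may suit your chosen genre better.

```java
import java.util.Arrays;

// Hypothetical sketch: class and method names are illustrative, not part of the starter code.
final class CombinedMetricSketch {

    // Position score (assumption): 1.0 for the first sentence, falling linearly to 1/n for the last.
    static double positionScore(int index, int sentenceCount) {
        return (double) (sentenceCount - index) / sentenceCount;
    }

    // Rescale non-negative scores into [0, 1] by dividing by the largest one,
    // so that frequency and position scores are on a comparable scale.
    static double[] normalize(double[] scores) {
        double max = 0.0;
        for (double score : scores) {
            max = Math.max(max, score);
        }
        double[] result = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            result[i] = max == 0.0 ? 0.0 : scores[i] / max;
        }
        return result;
    }

    // Weighted blend: frequencyWeight in [0, 1] goes to the frequency score, the rest to position.
    static double[] combine(double[] frequencyScores, double frequencyWeight) {
        double[] normalized = normalize(frequencyScores);
        int n = normalized.length;
        double[] combined = new double[n];
        for (int i = 0; i < n; i++) {
            combined[i] = frequencyWeight * normalized[i]
                    + (1.0 - frequencyWeight) * positionScore(i, n);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Made-up frequency scores for three sentences, blended half and half with position.
        double[] combined = combine(new double[] {1.0, 4.0, 2.0}, 0.5);
        System.out.println(Arrays.toString(combined));
    }
}
```

Setting the weight to 1.0 reduces this to the frequency metric alone and 0.0 to position alone, which makes it easy to compare the three variants on the same input.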
6. What to hand in:
a. (50 marks) A working text summarizer using at least one type of
scoring method.
b. (35 marks) Tests of your summarizer on 3-4 texts of reasonable
length in a specific genre. Marks will be given according to
sophistication of your summarizer. Note: A summarizer that works
well on one type of text (e.g., news articles) will generally not work
as well on documents from other genres.
c. (15 marks) A short write-up on how you would evaluate how well
your summarizer works.
d. (Bonus 20 marks) Perform an evaluation of your summarizer. We
will discuss forms of evaluation in our class session but the Hovy
paper on our course website describes various methods of evaluation.
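One simple form an evaluation could take, offered here only as an assumption and not as the method the course or the Hovy paper prescribes, is to compare the sentences your summarizer selects against the sentences a human judge picked for the same text, and to report precision, recall and F1 over sentence indices. The class ExtractEvaluationSketch and its methods are hypothetical. Since there is no single "right" summary, a reference set reflects one judge's opinion, and scores against several judges would be more informative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: compares selected sentence indices with a human-chosen reference set.
final class ExtractEvaluationSketch {

    // Number of sentence indices that appear in both sets.
    static int overlap(Set<Integer> selected, Set<Integer> reference) {
        Set<Integer> common = new HashSet<>(selected);
        common.retainAll(reference);
        return common.size();
    }

    // Fraction of the selected sentences that the reference also contains.
    static double precision(Set<Integer> selected, Set<Integer> reference) {
        return selected.isEmpty() ? 0.0 : (double) overlap(selected, reference) / selected.size();
    }

    // Fraction of the reference sentences that the summarizer found.
    static double recall(Set<Integer> selected, Set<Integer> reference) {
        return reference.isEmpty() ? 0.0 : (double) overlap(selected, reference) / reference.size();
    }

    // Harmonic mean of precision and recall.
    static double f1(Set<Integer> selected, Set<Integer> reference) {
        double p = precision(selected, reference);
        double r = recall(selected, reference);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Made-up example: the summarizer kept sentences 0, 2 and 5; the judge chose 0, 1, 2 and 3.
        Set<Integer> selected = new HashSet<>(Arrays.asList(0, 2, 5));
        Set<Integer> reference = new HashSet<>(Arrays.asList(0, 1, 2, 3));
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n",
                precision(selected, reference), recall(selected, reference), f1(selected, reference));
    }
}
```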
