1. Language Processing and Python
It is easy to get our hands on millions of words of text. What can we do with it, assuming we can write some
simple programs? In this chapter we'll address the following questions:
1. What can we achieve by combining simple programming techniques with large quantities of text?
2. How can we automatically extract key words and phrases that sum up the style and content of a text?
3. What tools and techniques does the Python programming language provide for such work?
4. What are some of the interesting challenges of natural language processing?
This chapter is divided into sections that skip between two quite different styles. In the "computing with
language" sections we will take on some linguistically motivated programming tasks without necessarily
explaining how they work. In the "closer look at Python" sections we will systematically review key
programming concepts. We'll flag the two styles in the section titles, but later chapters will mix both styles
without being so up-front about it. We hope this style of introduction gives you an authentic taste of what will
come later, while covering a range of elementary concepts in linguistics and computer science. If you have basic
familiarity with both areas, you can skip to 5; we will repeat any important points in later chapters, and if you
miss anything you can easily consult the online reference material at http://nltk.org/. If the material is
completely new to you, this chapter will raise more questions than it answers, questions that are addressed in the
rest of this book.

1 Computing with Language: Texts and Words
We're all very familiar with text, since we read and write it every day. Here we will treat text as raw data for the
programs we write, programs that manipulate and analyze it in a variety of interesting ways. But before we can
do this, we have to get started with the Python interpreter.

1.1 Getting Started with Python
One of the friendly things about Python is that it allows you to type directly into the interactive interpreter,
the program that will be running your Python programs. You can access the Python interpreter using a simple
graphical interface called the Interactive DeveLopment Environment (IDLE). On a Mac you can find this under
Applications→MacPython, and on Windows under All Programs→Python. Under Unix you can run Python from
the shell by typing idle (if this is not installed, try typing python). The interpreter will print a blurb about your
Python version; simply check that you are running Python 3.2 or later (here it is for 3.4.2):
Python 3.4.2 (default, Oct 15 2014, 22:01:37)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Note
If you are unable to run the Python interpreter, you probably don't have Python
installed correctly. Please visit http://python.org/ for detailed instructions. NLTK
3.0 works for Python 2.6 and 2.7. If you are using one of these older versions,
note that the / operator rounds fractional results downwards (so 1/3 will give
you 0). In order to get the expected behavior of division you need to type: from
__future__ import division
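
The two division behaviors mentioned in the note can be compared directly in Python 3. The following is only a small illustration, not part of the NLTK setup: the / operator performs true division, while the // operator gives the downward-rounded result that Python 2 produced for integers.

```python
# Python 3: "/" is true division, "//" rounds the result downwards.
true_quotient = 1 / 3      # a float close to 0.333
floor_quotient = 1 // 3    # an int, like Python 2's "/" on two integers

print(true_quotient)
print(floor_quotient)
```

All the ratio calculations later in this chapter rely on true division.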

The >>> prompt indicates that the Python interpreter is now waiting for input. When copying examples from this
book, don't type the ">>>" yourself. Now, let's begin by using Python as a calculator:
>>> 1 + 5 * 2 - 3
8
>>>

Once the interpreter has finished calculating the answer and displaying it, the prompt reappears. This means the
Python interpreter is waiting for another instruction.

Note
Your Turn: Enter a few more expressions of your own. You can use asterisk (*)
for multiplication and slash (/) for division, and parentheses for bracketing
expressions.

The preceding examples demonstrate how you can work interactively with the Python interpreter, experimenting
with various expressions in the language to see what they do. Now let's try a nonsensical expression to see how
the interpreter handles it:
>>> 1 +
  File "<stdin>", line 1
    1 +
      ^
SyntaxError: invalid syntax
>>>

This produced a syntax error. In Python, it doesn't make sense to end an instruction with a plus sign. The Python
interpreter indicates the line where the problem occurred (line 1 of <stdin>, which stands for "standard input").
Now that we can use the Python interpreter, we're ready to start working with language data.

1.2 Getting Started with NLTK
Before going further you should install NLTK 3.0, downloadable for free from http://nltk.org/. Follow the
instructions there to download the version required for your platform.
Once you've installed NLTK, start up the Python interpreter as before, and install the data required for the book
by typing the following two commands at the Python prompt, then selecting the book collection as shown in 1.1.

>>> import nltk
>>> nltk.download()

Figure 1.1: Downloading the NLTK Book Collection: browse the available packages using nltk.download(). The
Collections tab on the downloader shows how the packages are grouped into sets, and you should select the line
labeled book to obtain all data required for the examples and exercises in this book. It consists of about 30
compressed files requiring about 100Mb disk space. The full collection of data (i.e., all in the downloader) is
nearly ten times this size (at the time of writing) and continues to expand.

Once the data is downloaded to your machine, you can load some of it using the Python interpreter. The first step
is to type a special command at the Python prompt which tells the interpreter to load some texts for us to explore:
from nltk.book import *. This says "from NLTK's book module, load all items." The book module contains all
the data you will need as you read this chapter. After printing a welcome message, it loads the text of several
books (this will take a few seconds). Here's the command again, together with the output that you will see. Take
care to get spelling and punctuation right, and remember that you don't type the >>>.
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>

Any time we want to find out about these texts, we just have to enter their names at the Python prompt:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
>>>

Now that we can use the Python interpreter, and have some data to work with, we're ready to get started.

1.3 Searching Text
There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us
every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby
Dick by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in
parentheses:
>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>>

The first time you use a concordance on a particular text, it takes a few extra seconds to build an index so that
subsequent searches are fast.
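
To make the idea of a concordance concrete, here is a minimal sketch in plain Python. It is not how NLTK implements concordance(); the function name simple_concordance, its width parameter, and the toy token list are all assumptions made for illustration, and the sketch builds no index.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [token] + right))
    return lines

# A toy token list standing in for a real text (hypothetical data).
toy_tokens = ['a', 'most', 'monstrous', 'size', 'and', 'a', 'monstrous', 'fable', '.']
for line in simple_concordance(toy_tokens, 'monstrous', width=2):
    print(line)
```

Because the sketch scans the whole list on every call, it would be slow on a full book, which is one reason a real implementation builds an index first.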

Note
Your Turn: Try searching for other words; to save re-typing, you might be able
to use up-arrow, Ctrl-up-arrow or Alt-p to access the previous command and
modify the word being searched. You can also try searches on some of the
other texts we have included. For example, search Sense and Sensibility for the
word affection, using text2.concordance("affection"). Search the book of Genesis to
find out how long some people lived, using text3.concordance("lived"). You could
look at text4, the Inaugural Address Corpus, to see examples of English going
back to 1789, and search for words like nation, terror, god to see how these
words have been used differently over time. We've also included text5, the NPS
Chat Corpus: search this for unconventional words like im, ur, lol. (Note that
this corpus is uncensored!)

Once you've spent a little while examining these texts, we hope you have a new sense of the richness and
diversity of language. In the next chapter you will learn how to access a broader range of text, including text in
languages other than English.
A concordance permits us to see words in context. For example, we saw that monstrous occurred in contexts such
as the ___ pictures and a ___ size. What other words appear in a similar range of contexts? We can find out by
appending the term similar to the name of the text in question, then inserting the relevant word in parentheses:
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet
>>>

Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for
her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very.
The term common_contexts allows us to examine just the contexts that are shared by two or more words, such as
monstrous and very. We have to enclose these words by square brackets as well as parentheses, and separate them
with a comma:
>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty am_glad be_glad a_lucky
>>>
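
The notion of a shared context can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not NLTK's implementation: the helper name contexts and the toy token list are hypothetical, and a context is taken to be simply the pair of tokens immediately before and after a word.

```python
def contexts(tokens, word):
    """Collect the (previous, next) token pairs that surround `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

# Hypothetical toy data in which both words occur in the same two contexts.
toy_tokens = ['a', 'monstrous', 'pretty', 'day', ',', 'a', 'very', 'pretty',
              'day', ',', 'am', 'very', 'glad', ',', 'am', 'monstrous', 'glad']

# Contexts shared by both words are the intersection of the two sets.
shared = contexts(toy_tokens, 'monstrous') & contexts(toy_tokens, 'very')
print(sorted(shared))
```

Each shared pair corresponds to one entry such as a_pretty in the output format shown above.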

Note
Your Turn: Pick another pair of words and compare their usage in two different
texts, using the similar() and common_contexts() functions.

It is one thing to automatically detect that a particular word occurs in a text, and to display some words that
appear in the same context. However, we can also determine the location of a word in the text: how many words
from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe
represents an instance of a word, and each row represents the entire text. In 1.2 we see some striking patterns of
word usage over the last 220 years (in an artificial text constructed by joining the texts of the Inaugural Address
Corpus end-to-end). You can produce this plot as shown below. You might like to try more words (e.g., liberty,
constitution), and different texts. Can you predict the dispersion of a word before you view it? As before, take
care to get the quotes, commas, brackets and parentheses exactly right.

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>

Figure 1.2: Lexical Dispersion Plot for Words in U.S. Presidential Inaugural Addresses: This can be used to
investigate changes in language use over time.
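
The positional information behind a dispersion plot is just a list of token offsets for each word. The sketch below computes those offsets without any plotting; the helper name offsets and the toy token list are assumptions for illustration, not part of NLTK.

```python
def offsets(tokens, word):
    """Return the positions, counted in tokens from the start, where `word` occurs."""
    return [i for i, token in enumerate(tokens) if token == word]

# Hypothetical toy data standing in for a long text.
toy_tokens = ['citizens', 'of', 'America', ',', 'freedom', 'for', 'citizens']
for target in ['citizens', 'freedom', 'America']:
    print(target, offsets(toy_tokens, target))
```

A plot would draw one stripe per offset, with one row for each target word.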

Note
Important: You need to have Python's NumPy and Matplotlib packages
installed in order to produce the graphical plots used in this book. Please see
http://nltk.org/ for installation instructions.

Note
You can also plot the frequency of word usage through time using
https://books.google.com/ngrams

Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we
type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing
that goes between them.)
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
>>>

Note
The generate() method is not available in NLTK 3.0 but will be reinstated in a
subsequent version.

1.4 Counting Vocabulary
The most obvious fact about texts that emerges from the preceding examples is that they differ in the vocabulary
they use. In this section we will see how to use the computer to count the words in a text in a variety of useful
ways. As before, you will jump right in and experiment with the Python interpreter, even though you may not
have studied Python systematically yet. Test your understanding by modifying the examples, and trying the
exercises at the end of the chapter.
Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols
that appear. We use the term len to get the length of something, which we'll apply here to the book of Genesis:
>>> len(text3)
44764
>>>

So Genesis has 44,764 words and punctuation symbols, or "tokens." A token is the technical name for a sequence
of characters, such as hairy, his, or :), that we want to treat as a group. When we count the number of
tokens in a text, say, the phrase to be or not to be, we are counting occurrences of these sequences. Thus, in our
example phrase there are two occurrences of to, two of be, and one each of or and not. But there are only four
distinct vocabulary items in this phrase. How many distinct words does the book of Genesis contain? To work
this out in Python, we have to pose the question slightly differently. The vocabulary of a text is just the set of
tokens that it uses, since in a set, all duplicates are collapsed together. In Python we can obtain the vocabulary
items of text3 with the command: set(text3). When you do this, many screens of words will fly past. Now try
the following:

>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>>

By wrapping sorted() around the Python expression set(text3), we obtain a sorted list of vocabulary items,
beginning with various punctuation symbols and continuing with words starting with A. All capitalized words
precede lowercase words. We discover the size of the vocabulary indirectly, by asking for the number of items in
the set, and again we can use len to obtain this number. Although it has 44,764 tokens, this book has only
2,789 distinct words, or "word types." A word type is the form or spelling of the word independently of its
specific occurrences in a text; that is, the word considered as a unique item of vocabulary. Our count of 2,789
items will include punctuation symbols, so we will generally call these unique items types instead of word types.
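
The distinction between tokens and types can be checked on the example phrase to be or not to be, using nothing but a plain Python list. The variable names below are chosen for illustration only.

```python
phrase = ['to', 'be', 'or', 'not', 'to', 'be']

num_tokens = len(phrase)     # every occurrence counts as a token
vocabulary = set(phrase)     # duplicates collapse together in a set
num_types = len(vocabulary)  # distinct items only

print(num_tokens, num_types)
print(sorted(vocabulary))
```

The six tokens reduce to four types, matching the count given in the text.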
Now, let's calculate a measure of the lexical richness of the text. The next example shows us that the number of
distinct words is just 6% of the total number of words, or equivalently that each word is used 16 times on average
(remember if you're using Python 2, to start with from __future__ import division).
>>> len(set(text3)) / len(text3)
0.06230453042623537
>>>

Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what
percentage of the text is taken up by a specific word:
>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312
>>>

Note
Your Turn: How many times does the word lol appear in text5? How much is
this as a percentage of the total number of words in this text?

You may want to repeat such calculations on several texts, but it is tedious to keep retyping the formula. Instead,
you can come up with your own name for a task, like "lexical_diversity" or "percentage", and associate it with a
block of code. Now you only have to type a short name instead of one or more complete lines of Python code,
and you can re-use it as often as you like. The block of code that does a task for us is called a function, and we
define a short name for our function with the keyword def. The next example shows how to define two new
functions, lexical_diversity() and percentage():
>>> def lexical_diversity(text):
...     return len(set(text)) / len(text)
...
>>> def percentage(count, total):
...     return 100 * count / total
...

Caution!
The Python interpreter changes the prompt from >>> to ... after encountering the colon at the end of the first line.
The ... prompt indicates that Python expects an indented code block to appear next. It is up to you to do the
indentation, by typing four spaces or hitting the tab key. To finish the indented block just enter a blank line.

In the definition of lexical_diversity(), we specify a parameter named text. This parameter is a
"placeholder" for the actual text whose lexical diversity we want to compute, and reoccurs in the block of code
that will run when the function is used. Similarly, percentage() is defined to take two parameters, named
count and total.
Once Python knows that lexical_diversity() and percentage() are the names for specific blocks of code, we
can go ahead and use these functions:
>>> lexical_diversity(text3)
0.06230453042623537
>>> lexical_diversity(text5)
0.13477005109975562
>>> percentage(4, 5)
80.0
>>> percentage(text4.count('a'), len(text4))
1.4643016433938312
>>>

To recap, we use or call a function such as lexical_diversity() by typing its name, followed by an open
parenthesis, the name of the text, and then a close parenthesis. These parentheses will show up often; their role is
to separate the name of a task, such as lexical_diversity(), from the data that the task is to be performed
on, such as text3. The data value that we place in the parentheses when we call a function is an argument to
the function.
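
The relationship between parameters and arguments can be tried out without loading any corpus. The sketch below repeats the two function definitions in script form (no prompts) and calls them on a small list; the name toy_text and its contents are assumptions for illustration.

```python
def lexical_diversity(text):
    """Ratio of distinct items to total items in a sequence of tokens."""
    return len(set(text)) / len(text)

def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total

# toy_text is the argument; inside the function it is known by the parameter name text.
toy_text = ['to', 'be', 'or', 'not', 'to', 'be']
print(lexical_diversity(toy_text))
print(percentage(toy_text.count('to'), len(toy_text)))
```

Any list of tokens can be passed in the same way, including text3 or text5 once nltk.book has been loaded.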
You have already encountered several functions in this chapter, such as len(), set(), and sorted(). By
convention, we will always add an empty pair of parentheses after a function name, as in len(), just to make clear
that what we are talking about is a function rather than some other kind of Python expression. Functions are an
important concept in programming, and we only mention them at the outset to give newcomers a sense of the
power and creativity of programming. Don't worry if you find it a bit confusing right now.
Later we'll see how to use functions when tabulating data, as in 1.1. Each row of the table will involve the same
computation but with different data, and we'll do this repetitive work using a function.
Table 1.1: Lexical Diversity of Various Genres in the Brown Corpus

Genre                Tokens    Types    Lexical diversity
skill and hobbies     82345    11935    0.145
humor                 21695     5017    0.231
fiction: science      14470     3233    0.223
press: reportage     100554    14394    0.143
fiction: romance      70022     8452    0.121
religion              39399     6373    0.162
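
The last column of Table 1.1 can be recomputed from the token and type counts in the other two columns. The sketch below hard-codes those counts rather than reading the Brown Corpus, so it illustrates the arithmetic only; the names rows and diversity_from_counts are assumptions for illustration.

```python
# (genre, tokens, types) copied from Table 1.1
rows = [
    ('skill and hobbies', 82345, 11935),
    ('humor', 21695, 5017),
    ('fiction: science', 14470, 3233),
    ('press: reportage', 100554, 14394),
    ('fiction: romance', 70022, 8452),
    ('religion', 39399, 6373),
]

def diversity_from_counts(types, tokens):
    """Lexical diversity expressed as the ratio of types to tokens."""
    return types / tokens

# The same computation is applied to every row, with different data each time.
for genre, tokens, types in rows:
    print(f'{genre:<18} {tokens:>7} {types:>6} {diversity_from_counts(types, tokens):.3f}')
```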

2 A Closer Look at Python: Texts as Lists of Words
You've seen some important elements of the Python programming language. Let's take a few moments to review
them systematically.

2.1 Lists
What is a text? At one level, it is a sequence of symbols on a page such as this one. At another level, it is a
sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so
on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation.
Here's how we represent text in Python, in this case the opening sentence of Moby Dick:

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>

After the prompt we've given a name we made up, sent1, followed by the equals sign, and then some quoted
words, separated with commas, and surrounded with brackets. This bracketed material is known as a list in
Python: it is how we store a text. We can inspect it by typing the name. We can ask for its length. We can
even apply our own lexical_diversity() function to it.
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1)
4
>>> lexical_diversity(sent1)
1.0
>>>

Some more lists have been defined for you, one for the opening sentence of each of our texts, sent2 … sent9. We
inspect two of them here; you can see the rest for yourself using the Python interpreter (if you get an error which
says that sent2 is not defined, you need to first type from nltk.book import *).
>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long',
'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the',
'heaven', 'and', 'the', 'earth', '.']
>>>

Note
Your Turn: Make up a few sentences of your own, by typing a name, equals
sign, and a list of words, like this: ex1 = ['Monty', 'Python', 'and', 'the', 'Holy',
'Grail']. Repeat some of the other Python operations we saw earlier in 1, e.g.,
sorted(ex1), len(set(ex1)), ex1.count('the').

A pleasant surprise is that we can use Python's addition operator on lists. Adding two lists creates a new list
with everything from the first list, followed by everything from the second list:
>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
>>>

Note
This special use of the addition operation is called concatenation; it combines
the lists together into a single list. We can concatenate sentences to build up a
text.

We don't have to literally type the lists either; we can use short names that refer to pre-defined lists.
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the',
'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
>>>

What if we want to add a single item to a list? This is known as appending. When we append() to a list, the list
itself is updated as a result of the operation.
>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
>>>
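
The difference between concatenation and appending is easy to overlook: concatenation builds a new list and leaves the original alone, whereas append() changes the existing list and returns None. A small check, using illustrative variable names:

```python
sent = ['Call', 'me', 'Ishmael', '.']

combined = sent + ['Some']         # concatenation: a new list is created
print(len(sent), len(combined))    # sent still has its original length

result = sent.append('Some')       # append: sent itself is updated in place
print(sent)
print(result)                      # append() returns None, not the list
```

A common slip is to write sent = sent.append('Some'), which replaces the list with None.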

2.2 Indexing Lists
As we have seen, a text in Python is a list of words, represented using a combination of brackets and quotes. Just
as with an ordinary page of text, we can count up the total number of words in text1 with len(text1), and count
the occurrences in a text of a particular word, say, 'heaven', using text1.count('heaven').
With some patience, we can pick out the 1st, 173rd, or even 14,278th word in a printed text. Analogously, we can
identify the elements of a Python list by their order of occurrence in the list. The number that represents this
position is the item's index. We instruct Python to show us the item that occurs at an index such as 173 in a text
by writing the name of the text followed by the index inside square brackets:
>>> text4[173]
'awaken'
>>>

We can do the converse; given a word, find the index of when it first occurs:
>>> text4.index('awaken')
173
>>>

Indexes are a common way to access the words of a text, or, more generally, the elements of any list. Python
permits us to access sublists as well, extracting manageable pieces of language from large texts, a technique
known as slicing.
>>> text5[16715:16735]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good',
'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without',
'buying', 'it']
>>> text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We',
'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive',
'officer', 'for', 'the', 'week']
>>>

Indexes have some subtleties, and we'll explore these with the help of an artificial sentence:
>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
...         'word6', 'word7', 'word8', 'word9', 'word10']
>>> sent[0]
'word1'
>>> sent[9]
'word10'
>>>

Notice that our indexes start from zero: sent element zero, written sent[0], is the first word, 'word1', whereas
sent element 9 is 'word10'. The reason is simple: the moment Python accesses the content of a list from the
computer's memory, it is already at the first element; we have to tell it how many elements forward to go. Thus,
zero steps forward leaves it at the first element.

Note
This practice of counting from zero is initially confusing, but typical of modern
programming languages. You'll quickly get the hang of it if you've mastered the
system of counting centuries where 19XY is a year in the 20th century, or if you
live in a country where the floors of a building are numbered from 1, and so
walking up n-1 flights of stairs takes you to level n.

Now, if we accidentally use an index that is too large, we get an error:
>>> sent[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>

This time it is not a syntax error, because the program fragment is syntactically correct. Instead, it is a runtime
error, and it produces a Traceback message that shows the context of the error, followed by the name of the error,
IndexError, and a brief explanation.
Let's take a closer look at slicing, using our artificial sentence again. Here we verify that the slice 5:8 includes
sent elements at indexes 5, 6, and 7:
>>> sent[5:8]
['word6', 'word7', 'word8']
>>> sent[5]
'word6'
>>> sent[6]
'word7'
>>> sent[7]
'word8'
>>>

By convention, m:n means elements m…n-1. As the next example shows, we can omit the first number if the slice
begins at the start of the list, and we can omit the second number if the slice goes to the end:
>>> sent[:3]
['word1', 'word2', 'word3']
>>> text2[141525:]
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor', 'and', 'Marianne',
',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the', 'least', 'considerable', ',',
'that', 'though', 'sisters', ',', 'and', 'living', 'almost', 'within', 'sight', 'of',
'each', 'other', ',', 'they', 'could', 'live', 'without', 'disagreement', 'between',
'themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
'THE', 'END']
>>>
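
Two consequences of the m:n convention are worth verifying: a slice m:n contains n-m elements, and splitting a list at any position k with [:k] and [k:] loses nothing. The sketch below rebuilds the artificial ten-word sentence so that it runs on its own; the variable names are illustrative.

```python
# Rebuild the artificial sentence: 'word1' ... 'word10'
sent = ['word' + str(i) for i in range(1, 11)]

middle = sent[5:8]            # indexes 5, 6 and 7; index 8 is excluded
print(middle, len(middle))    # the length is 8 - 5

first_three = sent[:3]        # omitted start means "from the beginning"
rest = sent[3:]               # omitted end means "to the end"
print(first_three + rest == sent)
```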

We can modify an element of a list by assigning to one of its index values. In the next example, we put sent[0]
on the left of the equals sign. We can also replace an entire slice with new material. A consequence of this
last change is that the list only has four elements, and accessing a later value generates an error.
>>> sent[0] = 'First'
>>> sent[9] = 'Last'
>>> len(sent)
10
>>> sent[1:9] = ['Second', 'Third']
>>> sent
['First', 'Second', 'Third', 'Last']
>>> sent[9]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>
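
In a script, as opposed to the interactive prompt, an out-of-range index stops the program unless the error is handled. The sketch below repeats the slice assignment and catches the resulting IndexError; the use of try and except goes beyond what this chapter has introduced and is shown only as an illustration.

```python
sent = ['word' + str(i) for i in range(1, 11)]
sent[0] = 'First'
sent[9] = 'Last'

# Replacing eight elements (indexes 1 to 8) with two shrinks the list from 10 to 4.
sent[1:9] = ['Second', 'Third']
print(sent, len(sent))

caught = False
try:
    sent[9]                      # index 9 no longer exists
except IndexError as err:
    caught = True
    print('IndexError:', err)
```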


Note
Your Turn: Take a few minutes to define a sentence of your own and modify
individual words and groups of words (slices) using the same methods used
earlier. Check your understanding by trying the exercises on lists at the end of
this chapter.

2.3 Variables
From the start of 1, you have had access to texts called text1, text2, and so on. It saved a lot of typing to be able
to refer to a 250,000-word book with a short name like this! In general, we can make up names for anything we
care to calculate. We did this ourselves in the previous sections, e.g., defining a variable sent1, as follows:
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>>

Such lines have the form: variable = expression. Python will evaluate the expression, and save its result to the
variable. This process is called assignment. It does not generate any output; you have to type the variable on a
line of its own to inspect its contents. The equals sign is slightly misleading, since information is moving from
the right side to the left. It might help to think of it as a left-arrow. The name of the variable can be anything you
like, e.g., my_sent, sentence, xyzzy. It must start with a letter, and can include numbers and underscores. Here are
some examples of variables and assignments:
>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
...            'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']
>>>

Remember that capitalized words appear before lowercase words in sorted lists.

Note
Notice in the previous example that we split the definition of my_sent over two
lines. Python expressions can be split across multiple lines, so long as this
happens within any kind of brackets. Python uses the "..." prompt to indicate
that more input is expected. It doesn't matter how much indentation is used in
these continuation lines, but some indentation usually makes them easier to
read.

It is good to choose meaningful variable names to remind you, and to help anyone else who reads your Python
code, what your code is meant to do. Python does not try to make sense of the names; it blindly follows your
instructions, and does not object if you do something confusing, such as one = 'two' or two = 3. The only
restriction is that a variable name cannot be any of Python's reserved words, such as def, if, not, and import. If
you use a reserved word, Python will produce a syntax error:
>>> not = 'Camelot'
  File "<stdin>", line 1
    not = 'Camelot'
        ^
SyntaxError: invalid syntax
>>>

We will often use variables to hold intermediate steps of a computation, especially when this makes the code
easier to follow. Thus len(set(text1)) could also be written:
>>> vocab = set(text1)
>>> vocab_size = len(vocab)
>>> vocab_size
19317
>>>

Caution!
Take care with your choice of names (or identifiers) for Python variables. First, you should start the name with a
letter, optionally followed by digits (0 to 9) or letters. Thus, abc23 is fine, but 23abc will cause a syntax error.
Names are case-sensitive, which means that myVar and myvar are distinct variables. Variable names cannot contain
whitespace, but you can separate words using an underscore, e.g., my_var. Be careful not to insert a hyphen
instead of an underscore: my-var is wrong, since Python interprets the "-" as a minus sign.
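
These naming rules can be checked programmatically with the standard library. The helper name is_usable_name below is an assumption for illustration; it combines str.isidentifier(), which tests the character rules, with keyword.iskeyword(), which tests for reserved words. Note that isidentifier() is somewhat more permissive than the rule stated above: it also accepts a leading underscore and non-ASCII letters.

```python
import keyword

def is_usable_name(name):
    """True if `name` is a syntactically valid identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

for candidate in ['abc23', '23abc', 'my_var', 'my-var', 'not']:
    print(candidate, is_usable_name(candidate))
```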

2.4 Strings
Some of the methods we used to access the elements of a list also work with individual words, or strings. For
example, we can assign a string to a variable, index a string, and slice a string:
>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>>

We can also perform multiplication and addition with strings:
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>>

We can join the words of a list to make a single string, or split a string into a list, as follows:
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
>>>
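
The string before .join() is the separator placed between the words, and split() with no argument breaks a string on any run of whitespace. A short check of both behaviors, using illustrative variable names:

```python
words = ['Monty', 'Python']

joined = ' '.join(words)            # a single space is the separator
print(joined)
print(joined.split() == words)      # splitting recovers the original list

hyphenated = '-'.join(words)        # any string can serve as the separator
print(hyphenated)

spaced = 'Monty   Python'.split()   # runs of whitespace count as one break
print(spaced)
```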

We will come back to the topic of strings in 3. For the time being, we have two important building blocks, lists
and strings, and are ready to get back to some language analysis.

3 Computing with Language: Simple Statistics
Let's return to our exploration of the ways we can bring our computational resources to bear on large quantities of
text. We began this discussion in 1, and saw how to search for words in context, how to compile the vocabulary
of a text, how to generate random text in the same style, and so on.
In this section we pick up the question of what makes a text distinct, and use automatic methods to find
characteristic words and expressions of a text. As in 1, you can try new features of the Python language by
copying them into the interpreter, and you'll learn about these features systematically in the following section.
Before continuing further, you might like to check your understanding of the last section by predicting the output
of the following code. You can use the interpreter to check whether you got it right. If you're not sure how to do
this task, it would be a good idea to review the previous section before continuing further.
>>> saying = ['After', 'all', 'is', 'said', 'and', 'done',
...           'more', 'is', 'said', 'than', 'done']
>>> tokens = set(saying)
>>> tokens = sorted(tokens)
>>> tokens[-2:]
what output do you expect here?
>>>

3.1 Frequency Distributions
How can we automatically identify the words of a text that are most informative about the topic and genre of the
text? Imagine how you might go about finding the 50 most frequent words of a book. One method would be to
keep a tally for each vocabulary item, like that shown in 3.1. The tally would need thousands of rows, and it
would be an exceedingly laborious process, so laborious that we would rather assign the task to a machine.

Figure 3.1: Counting Words Appearing in a Text (a frequency distribution)
The table in 3.1 is known as a frequency distribution, and it tells us the frequency of each vocabulary item in
the text. (In general, it could count any kind of observable event.) It is a "distribution" because it tells us how the
total number of word tokens in the text are distributed across the vocabulary items. Since we often need
frequency distributions in language processing, NLTK provides built-in support for them. Let's use a FreqDist to
find the 50 most frequent words of Moby Dick:
>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
>>>

When we first invoke FreqDist, we pass the name of the text as an argument. We can inspect the total number
of words ("outcomes") that have been counted up: 260,819 in the case of Moby Dick. The expression
most_common(50) gives us a list of the 50 most frequently occurring types in the text.
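
If NLTK is not at hand, the same tallying idea can be tried with collections.Counter from the Python standard library, which also provides most_common(). The sketch below uses a short, made-up token list as a stand-in for a real text; it is meant to show the idea of a frequency distribution, not to reproduce everything FreqDist offers.

```python
from collections import Counter

# Hypothetical toy data standing in for a tokenized text.
tokens = ['the', 'whale', ',', 'the', 'sea', ',', 'and', 'the', 'ship']

counts = Counter(tokens)
num_samples = len(counts)             # distinct items (types)
num_outcomes = sum(counts.values())   # all items counted (tokens)

print(num_samples, num_outcomes)
print(counts.most_common(2))          # the two most frequent items with their counts
print(counts['whale'])                # look up the count for a single item
```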

Note
Your Turn: Try the preceding frequency distribution example for yourself, for
text2. Be careful to use the correct parentheses and uppercase letters. If you
get an error message NameError: name 'FreqDist' is not defined, you need to start
your work with from nltk.book import *

Do any words produced in the last example help us grasp the topic or genre of this text? Only one word, whale, is
slightly informative! It occurs over 900 times. The rest of the words tell us nothing about the text; they're just
English "plumbing." What proportion of the text is taken up with such words? We can generate a cumulative
frequency plot for these words, using fdist1.plot(50, cumulative=True), to produce the graph in 3.2. These 50
words account for nearly half the book!

Figure 3.2: Cumulative Frequency Plot for 50 Most Frequent Words in Moby Dick: these account for nearly
half of the tokens.
If the frequent words don't help us, how about the words that occur once only, the so-called hapaxes? View them
by typing fdist1.hapaxes(). This list contains lexicographer, cetological, contraband, expostulations, and about
9,000 others. It seems that there are too many rare words, and without seeing the context we probably can't guess
what half of the hapaxes mean in any case! Since neither frequent nor infrequent words help, we need to try
something else.
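
A hapax is simply an item whose count is exactly one, so the idea can be sketched with collections.Counter and a list comprehension. The token list is made up for illustration, and this is not necessarily how hapaxes() is implemented inside NLTK.

```python
from collections import Counter

# Hypothetical toy data standing in for a tokenized text.
tokens = ['the', 'whale', ',', 'the', 'sea', ',', 'and', 'the', 'ship']
counts = Counter(tokens)

# Keep only the items that were seen exactly once.
hapaxes = sorted(word for word, count in counts.items() if count == 1)
print(hapaxes)
```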

3.2 Fine-grained Selection of Words
Next, let's look at the long words of a text; perhaps these will be more characteristic and informative. For this we
adapt some notation from set theory. We would like to find the words from the vocabulary of the text that are
more than 15 characters long. Let's call this property P, so that P(w) is true if and only if w is more than 15
characters long. Now we can express the words of interest using mathematical set notation as shown in (1a). This
means "the set of all w such that w is an element of V (the vocabulary) and w has property P".
(1)
    a. {w | w ∈ V & P(w)}
    b. [w for w in V if p(w)]

The corresponding Python expression is given in (1b). (Note that it produces a list, not a set, which means that
duplicates are possible.) Observe how similar the two notations are. Let's go one more step and write executable
Python code:
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
>>>

For each word w in the vocabulary V, we check whether len(w) is greater than 15; all other words will be ignored.
We will discuss this syntax more carefully later.
Note
Your Turn: Try out the previous statements in the Python interpreter, and
experiment with changing the text and changing the length condition. Does it
make a difference to your results if you change the variable names, e.g., using
[word for word in vocab if ...]?
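As a minimal sketch of the correspondence between (1a) and (1b), the following uses a small hand-made word list in
place of text1; the list and the length threshold of 5 are assumptions chosen only for illustration. It shows why
building the vocabulary with set() first keeps duplicates out of the result.

```python
# Toy stand-in for a tokenized text; these tokens are invented for the example.
tokens = ['the', 'whale', 'circumnavigation', 'the', 'whale',
          'circumnavigation', 'sea']

# Filtering the raw token list keeps every occurrence, duplicates included.
long_tokens = [w for w in tokens if len(w) > 5]

# Filtering the vocabulary (a set) yields each qualifying word once.
V = set(tokens)
long_words = sorted(w for w in V if len(w) > 5)

print(long_tokens)  # ['circumnavigation', 'circumnavigation']
print(long_words)   # ['circumnavigation']
```

Changing the variable name w to anything else (in all three places it appears) leaves the result unchanged.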

Let's return to our task of finding words that characterize a text. Notice that the long words in text4 reflect its
national focus (constitutionally, transcontinental), whereas those in text5 reflect its informal content:
boooooooooooglyyyyyy and yuuuuuuuuuuuummmmmmmmmmmm. Have we succeeded in automatically
extracting words that typify a text? Well, these very long words are often hapaxes (i.e., unique) and perhaps it
would be better to find frequently occurring long words. This seems promising since it eliminates frequent short
words (e.g., the) and infrequent long words (e.g., antiphilosophists). Here are all words from the chat corpus that
are longer than seven characters and that occur more than seven times:
>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']
>>>

Notice how we have used two conditions: len(w) > 7 ensures that the words are longer than seven characters, and
fdist5[w] > 7 ensures that these words occur more than seven times. At last we have managed to automatically
identify the frequently occurring content-bearing words of the text. It is a modest but important milestone: a tiny
piece of code, processing tens of thousands of words, produces some informative output.
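The two-condition filter can be tried without loading a corpus. The sketch below is only an illustration: it uses
collections.Counter from the standard library as a stand-in for FreqDist, and the token list and both thresholds
are invented for the example.

```python
from collections import Counter

# Invented tokens standing in for a real text such as text5.
tokens = ['something', 'is', 'something', 'else', 'together',
          'something', 'together', 'is', 'is', 'extraordinary']

counts = Counter(tokens)   # maps each word to its number of occurrences

min_length = 7   # keep words longer than this many characters
min_count = 1    # keep words occurring more often than this

frequent_long = sorted(w for w in set(tokens)
                       if len(w) > min_length and counts[w] > min_count)

print(frequent_long)  # ['something', 'together']
```

Here 'is' is frequent but short and 'extraordinary' is long but occurs only once, so both are filtered out.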

3.3 Collocations and Bigrams
A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas
the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have
similar senses; for example, maroon wine sounds definitely odd.
To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams.
This is easily accomplished with the function bigrams():
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>

Note
If you omitted list() above, and just typed bigrams(['more', ...]), you would have
seen output of the form <generator object bigrams at 0x10fb8b3a8>. This is Python's
way of saying that it is ready to compute a sequence of items, in this case,
bigrams. For now, you just need to know to tell Python to convert it into a list,
using list().
Here we see that the pair of words than done is a bigram, and we write it in Python as ('than', 'done'). Now,
collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that
involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on
the frequency of the individual words. The collocations() function does this for us. We will see how it works
later.
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build
>>>

The collocations that emerge are very specific to the genre of the texts. In order to find red wine as a collocation,
we would need to process a much larger body of text.
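To make the idea of "more often than we would expect" concrete, here is a rough sketch that scores bigrams by
pointwise mutual information (PMI). This is an assumption made for illustration and is not a description of how
collocations() is implemented; the bigrams() helper below is a hypothetical stand-in written with zip(), and the
token list is invented.

```python
import math
from collections import Counter

# Invented tokens; a real collocation search needs far more text.
tokens = ['red', 'wine', 'and', 'the', 'cheese', 'and', 'the', 'bread',
          'and', 'red', 'wine']

def bigrams(words):
    """Yield each pair of adjacent words (stand-in for the bigrams() used above)."""
    return zip(words, words[1:])

word_counts = Counter(tokens)
pair_counts = Counter(bigrams(tokens))
n_words = len(tokens)
n_pairs = n_words - 1

def pmi(pair):
    """log2 of the observed pair probability over the probability expected by chance."""
    w1, w2 = pair
    p_pair = pair_counts[pair] / n_pairs
    p_w1 = word_counts[w1] / n_words
    p_w2 = word_counts[w2] / n_words
    return math.log2(p_pair / (p_w1 * p_w2))

# Consider only pairs seen more than once, then rank them by PMI.
repeated = [pair for pair, count in pair_counts.items() if count > 1]
ranked = sorted(repeated, key=pmi, reverse=True)

for pair in ranked:
    print(pair, round(pmi(pair), 2))
```

Both ('red', 'wine') and ('and', 'the') occur twice in this toy text, but red wine is ranked first because its
words are individually rarer, so seeing them together is more surprising.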

3.4 Counting Other Things
Counting words is useful, but we can count other things too. For example, we can look at the distribution of word
lengths in a text, by creating a FreqDist out of a long list of numbers, where each number is the length of the
corresponding word in the text:
>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist(len(w) for w in text1)
>>> print(fdist)
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399,
8: 9966, 9: 6428, 10: 3528, ...})
>>>

We start by deriving a list of the lengths of words in text1, and the FreqDist then counts the number of times
each of these occurs. The result is a distribution containing a quarter of a million items, each of which is a
number corresponding to a word token in the text. But there are at most only 20 distinct items being counted, the
numbers 1 through 20, because no word is longer than 20 characters. (In fact only 19 lengths occur, as the
"19 samples" above shows: no word has exactly 19 characters.) I.e., there are words consisting of just
one character, two characters, ..., twenty characters, but none with twenty-one or more characters. One might
wonder how frequent the different lengths of word are (e.g., how many words of length four appear in the text,
are there more words of length five than length four, etc). We can do this as follows:
>>> fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3)
0.19255882431878046
>>>

From this we see that the most frequent word length is 3, and that words of length 3 account for roughly 50,000
(or 20%) of the words making up the book. Although we will not pursue it here, further analysis of word length
might help us understand differences between authors, genres, or languages.
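The same counting can be sketched with the standard library alone. The example below assumes
collections.Counter as a stand-in for FreqDist and uses a short invented token list; the comments name the
FreqDist calls from the session above that each line roughly corresponds to.

```python
from collections import Counter

# A few invented tokens standing in for text1.
tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago', ',',
          'never', 'mind', 'how', 'long']

# Count word lengths rather than the words themselves.
length_counts = Counter(len(w) for w in tokens)

total = sum(length_counts.values())                        # like fdist.N()
most_common_length, count = length_counts.most_common(1)[0]  # like fdist.max() and its count
relative = count / total                                   # like fdist.freq(most_common_length)

print(sorted(length_counts.items()))
print(most_common_length, count, round(relative, 2))       # 4 4 0.33
```

In this toy list four of the twelve tokens have length 4, so that length accounts for a third of the tokens.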
Table 3.1 summarizes the functions defined in frequency distributions.
Table 3.1:
Functions Defined for NLTK's Frequency Distributions

Example                       Description
fdist = FreqDist(samples)     create a frequency distribution containing the given samples
fdist[sample] += 1            increment the count for this sample
fdist['monstrous']            count of the number of times a given sample occurred
fdist.freq('monstrous')       frequency of a given sample
fdist.N()                     total number of samples
fdist.most_common(n)          the n most common samples and their frequencies
for sample in fdist:          iterate over the samples
fdist.max()                   sample with the greatest count
fdist.tabulate()              tabulate the frequency distribution
fdist.plot()                  graphical plot of the frequency distribution
fdist.plot(cumulative=True)   cumulative plot of the frequency distribution
fdist1 |= fdist2              update fdist1 with counts from fdist2
fdist1 < fdist2               test if samples in fdist1 occur less frequently than in fdist2

Our discussion of frequency distributions has introduced some important Python concepts, and we will look at
them systematically in Section 4.

4 Back to Python: Making Decisions and Taking Control
So far, our little programs have had some interesting qualities: the ability to work with language, and the potential
to save human effort through automation. A key feature of programming is the ability of machines to make
decisions on our behalf, executing instructions when certain conditions are met, or repeatedly looping through
text data until some condition is satisfied. This feature is known as control, and is the focus of this section.

4.1 Conditionals
Python supports a wide range of operators, such as < and >=, for testing the relationship between values. The full
set of these relational operators is shown in Table 4.1.
Table 4.1:
Numerical Comparison Operators

Operator   Relationship
<          less than
<=         less than or equal to
==         equal to (note this is two "=" signs, not one)
!=         not equal to
>          greater than
>=         greater than or equal to
We can use these to select different words from a sentence of news text. Here are some examples; only the
operator is changed from one line to the next. They all use sent7, the first sentence from text7 (Wall Street
Journal). As before, if you get an error saying that sent7 is undefined, you need to first type: from nltk.book
import *
>>> sent7
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
>>> [w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
>>> [w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board',
'as', 'a', 'nonexecutive', 'director', '29', '.']
>>>

There is a common pattern to all of these examples: [w for w in text if condition], where condition is a
Python "test" that yields either true or false. In the cases shown in the previous code example, the condition is
always a numerical comparison. However, we can also test various properties of words, using the functions listed
in Table 4.2.
Table 4.2:
Some Word Comparison Operators

Function          Meaning
s.startswith(t)   test if s starts with t
s.endswith(t)     test if s ends with t
t in s            test if t is a substring of s
s.islower()       test if s contains cased characters and all are lowercase
s.isupper()       test if s contains cased characters and all are uppercase
s.isalpha()       test if s is non-empty and all characters in s are alphabetic
s.isalnum()       test if s is non-empty and all characters in s are alphanumeric
s.isdigit()       test if s is non-empty and all characters in s are digits
s.istitle()       test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals)
Here are some examples of these operators being used to select words from our texts: words ending with
-ableness; words containing gnt; words having an initial capital; and words consisting entirely of digits.
>>> sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', ...]
>>> sorted(term for term in set(text4) if 'gnt' in term)
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted(item for item in set(text6) if item.istitle())
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', ...]
>>> sorted(item for item in set(sent7) if item.isdigit())
['29', '61']
>>>

We can also create more complex conditions. If c is a condition, then not c is also a condition. If we have two
conditions c1 and c2, then we can combine them to form a new condition using conjunction and disjunction: c1
and c2, c1 or c2.
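The word tests and their combinations can be tried on a plain Python list without loading any corpus. The sketch
below uses a shortened, hand-typed version of the Wall Street Journal sentence; the particular conditions are
arbitrary choices made for illustration.

```python
# A shortened, hand-typed token list (no NLTK import needed).
words = ['Pierre', 'Vinken', ',', '61', 'years', 'old',
         'nonexecutive', 'Nov.', '29', '.']

# Single tests from the table of word comparison operators.
capitalized = [w for w in words if w.istitle()]
digits = [w for w in words if w.isdigit()]

# Conjunction: both conditions must hold.
long_lower = [w for w in words if w.islower() and len(w) > 4]

# Negation: keep the items for which the test fails.
not_alpha = [w for w in words if not w.isalpha()]

# Disjunction: either condition may hold.
n_at_edge = [w for w in words if w.startswith('n') or w.endswith('n')]

print(capitalized)  # ['Pierre', 'Vinken', 'Nov.']
print(digits)       # ['61', '29']
print(long_lower)   # ['years', 'nonexecutive']
print(not_alpha)    # [',', '61', 'Nov.', '29', '.']
print(n_at_edge)    # ['Vinken', 'nonexecutive']
```

Note that 'Nov.' counts as titlecased but not as alphabetic, because the trailing period is neither cased nor a
letter.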
Note
Your Turn: Run the following examples and try to explain what is going on in
each one. Next, try to make up some conditions of your own.
>>> sorted(w for w in set(text7) if '-' in w and 'index' in w)
>>> sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)
>>> sorted(w for w in set(sent7) if not w.islower())
>>> sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)

4.2 Operating on Every Element
In Section 3, we saw some examples of counting items other than words. Let's take a closer look at the notation we used:
>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> [w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]
>>>

These expressions have the form [f(w) for ...] or [w.f() for ...], where f is a function that operates on a
word to compute its length, or to convert it to uppercase. For now, you don't need to understand the difference
between the notations f(w) and w.f(). Instead, simply learn this Python idiom which performs the same operation
on every element of a list. In the preceding examples, it goes through each word in text1, assigning each one in
turn to the variable w and performing the specified operation on the variable.
Note
The notation just described is called a "list comprehension." This is our first
example of a Python idiom, a fixed notation that we use habitually without
bothering to analyze each time. Mastering such idioms is an important part of
becoming a fluent Python programmer.
Let's return to the question of vocabulary size, and apply the same idiom here:
>>> len(text1)
260819
>>> len(set(text1))
19317
>>> len(set(word.lower() for word in text1))
17231
>>>

Now that we are not double-counting words like This and this, which differ only in capitalization, we've wiped
2,000 off the vocabulary count! We can go a step further and eliminate numbers and punctuation from the
vocabulary count by filtering out any non-alphabetic items:
>>> len(set(word.lower() for word in text1 if word.isalpha()))
16948
>>>

This example is slightly complicated: it lowercases all the purely alphabetic items. Perhaps it would have been
simpler just to count the lowercase-only items, but this gives the wrong answer (why?).
Don't worry if you don't feel confident with list comprehensions yet, since you'll see many more examples along
with explanations in the following chapters.
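If the idiom still feels opaque, it may help to see it next to an explicit loop. The sketch below, using an invented
token list, first shows that a list comprehension builds the same list as a loop that appends one result per
element, and then repeats the three vocabulary counts on the toy data.

```python
# Invented tokens with case variants, a number, and punctuation.
tokens = ['This', 'is', 'this', ',', 'and', 'THIS', 'is', '1851', '.']

# Explicit loop: apply len() to every element and collect the results.
lengths_loop = []
for w in tokens:
    lengths_loop.append(len(w))

# The list comprehension expresses the same computation in one line.
lengths = [len(w) for w in tokens]
print(lengths == lengths_loop)  # True

# Vocabulary size under increasingly aggressive normalization.
vocab_raw = set(tokens)                                     # distinct tokens as written
vocab_folded = set(w.lower() for w in tokens)               # case distinctions collapsed
vocab_words = set(w.lower() for w in tokens if w.isalpha()) # alphabetic items only

print(len(vocab_raw), len(vocab_folded), len(vocab_words))  # 8 6 3
```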

4.3 Nested Code Blocks
Most programming languages permit us to execute a block of code when a conditional expression, or if
statement, is satisfied. We already saw examples of conditional tests in code like [w for w in sent7 if len(w) <
4]. In the following program, we have created a variable called word containing the string value 'cat'. The if
statement checks whether the test len(word) < 5 is true. It is, so the body of the if statement is invoked and the
print statement is executed, displaying a message to the user. Remember to indent the print statement by typing
four spaces.
>>> word = 'cat'
>>> if len(word) < 5:
...     print('word length is less than 5')
...
word length is less than 5
>>>

When we use the Python interpreter we have to add an extra blank line in order for it to detect that the nested
block is complete.
Note
If you are using Python 2.6 or 2.7, you need to include the following line in
order for the above print function to be recognized:
>>> from __future__ import print_function

If we change the conditional test to len(word) >= 5, to check that the length of word is greater than or equal to 5,
then the test will no longer be true. This time, the body of the if statement will not be executed, and no message
is shown to the user:
>>> if len(word) >= 5:
...     print('word length is greater than or equal to 5')
...
>>>

An if statement is known as a control structure because it controls whether the code in the indented block will
be run. Another control structure is the for loop. Try the following, and remember to include the colon and the
four spaces:
>>> for word in ['Call', 'me', 'Ishmael', '.']:
...     print(word)
...
Call
me
Ishmael
.
>>>

This is called a loop because Python executes the code in circular fashion. It starts by performing the assignment
word = 'Call', effectively using the word variable to name the first item of the list. Then, it displays the value of
word to the user. Next, it goes back to the for statement, and performs the assignment word = 'me', before
displaying this new value to the user, and so on. It continues in this fashion until every item of the list has been
processed.

4.4 Looping with Conditions
Now we can combine the if and for statements. We will loop over every item of the list, and print the item only
if it ends with the letter l. We'll pick another name for the variable to demonstrate that Python doesn't try to make
sense of variable names.
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith('l'):
...         print(xyzzy)
...
Call
Ishmael
>>>

You will notice that if and for statements have a colon at the end of the line, before the indentation begins. In
fact, all Python control structures end with a colon. The colon indicates that the current statement relates to the
indented block that follows.
We can also specify an action to be taken if the condition of the if statement is not met. Here we see the elif
(else if) statement, and the else statement. Notice that these also have colons before the indented code.
>>> for token in sent1:
...     if token.islower():
...         print(token, 'is a lowercase word')
...     elif token.istitle():
...         print(token, 'is a titlecase word')
...     else:
...         print(token, 'is punctuation')
...
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
>>>

As you can see, even with this small amount of Python knowledge, you can start to build multiline Python
programs. It's important to develop such programs in pieces, testing that each piece does what you expect before
combining them into a program. This is why the Python interactive interpreter is so invaluable, and why you
should get comfortable using it.
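One way to develop a program in pieces is to wrap the decision in a small function and test it on its own before
looping. The sketch below is an illustration only: classify() is a hypothetical helper, not an NLTK function, and
it labels the final case 'other' rather than 'punctuation' because numbers and all-uppercase words would also
fall through to the else branch.

```python
def classify(token):
    """Label a token using the same tests as the if/elif/else loop above."""
    if token.islower():
        return 'lowercase'
    elif token.istitle():
        return 'titlecase'
    else:
        return 'other'

# Step 1: check the piece on individual values.
print(classify('me'))       # lowercase
print(classify('Ishmael'))  # titlecase
print(classify('61'))       # other

# Step 2: combine it with a loop (here written as a list comprehension).
sent = ['Call', 'me', 'Ishmael', '.']
labels = [(token, classify(token)) for token in sent]
print(labels)
```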
Finally, let's combine the idioms we've been exploring. First, we create a list of cie and cei words, then we loop
over each item and print it. Notice the extra information given in the print statement: end=' '. This tells Python to
print a space (not the default newline) after each word.
>>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
>>> for word in tricky:
...     print(word, end=' ')
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
>>>

5 Automatic Natural Language Understanding
We have been exploring language bottom-up, with the help of texts and the Python programming language.
However, we're also interested in exploiting our knowledge of language and computation by building useful
language technologies. We'll take the opportunity now to step back from the nitty-gritty of code in order to paint a
bigger picture of natural language processing.
At a purely practical level, we all need help to navigate the universe of information locked up in text on the Web.
Search engines have been crucial to the growth and popularity of the Web, but have some shortcomings. It takes
skill, knowledge, and some luck, to extract answers to such questions as: What tourist sites can I visit between
Philadelphia and Pittsburgh on a limited budget? What do experts say about digital SLR cameras? What
predictions about the steel market were made by credible commentators in the past week? Getting a computer to
answer them automatically involves a range of language processing tasks, including information extraction,
inference, and summarization, and would need to be carried out on a scale and with a level of robustness that is
still beyond our current capabilities.
On a more philosophical level, a long-standing challenge within artificial intelligence has been to build intelligent
machines, and a major part of intelligent behaviour is understanding language. For many years this goal has been
seen as too difficult. However, as NLP technologies become more mature, and robust methods for analyzing
unrestricted text become more widespread, the prospect of natural language understanding has re-emerged as a
plausible goal.
In this section we describe some language understanding technologies, to give you a sense of the interesting
challenges that are waiting for you.

5.1 Word Sense Disambiguation
In word sense disambiguation we want to work out which sense of a word was intended in a given context.
Consider the ambiguous words serve and dish:
(2)

a. serve: help with food or drink; hold an office; put ball into play
b. dish: plate; course of a meal; communications device

In a sentence containing the phrase: he served the dish, you can detect that both serve and dish are being used
with their food meanings. It's unlikely that the topic of discussion shifted from sports to crockery in the space of
three words. This would force you to invent bizarre images, like a tennis pro taking out his or her frustrations on a
china tea-set laid out beside the court. In other words, we automatically disambiguate words using context,
exploiting the simple fact that nearby words have closely related meanings. As another example of this contextual
effect, consider the word by, which has several meanings, e.g.: the book by Chesterton (agentive: Chesterton
was the author of the book); the cup by the stove (locative: the stove is where the cup is); and submit by Friday
(temporal: Friday is the time of the submitting). Observe in (3) that the meaning of the final word of each
sentence helps us interpret the meaning of by.
(3)

a. The lost children were found by the searchers (agentive)
b. The lost children were found by the mountain (locative)
c. The lost children were found by the afternoon (temporal)
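
The contextual effect in (3) can be mimicked with a deliberately naive sketch: guess the sense of by from the
words that follow it. Everything here is an assumption made for illustration; the cue-word sets and the function
guess_sense_of_by() are invented, and real word sense disambiguation systems rely on far richer evidence.

```python
# Hypothetical cue words for each sense of "by"; invented for this sketch.
SENSE_CUES = {
    'agentive': {'searchers', 'police', 'author', 'Chesterton'},
    'locative': {'mountain', 'stove', 'river', 'door'},
    'temporal': {'afternoon', 'Friday', 'midnight', 'tomorrow'},
}

def guess_sense_of_by(sentence):
    """Return the sense whose cue words overlap most with the words after 'by'."""
    words = sentence.rstrip('.').split()
    if 'by' not in words:
        return None
    following = set(words[words.index('by') + 1:])
    best_sense, best_overlap = None, 0
    for sense, cues in SENSE_CUES.items():
        overlap = len(following & cues)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense  # None when no cue word is present

for sentence in ['The lost children were found by the searchers',
                 'The lost children were found by the mountain',
                 'The lost children were found by the afternoon']:
    print(guess_sense_of_by(sentence), '<-', sentence)
```

The sketch returns None for any sentence whose context contains no listed cue word, which is exactly where a
hand-written word list stops being useful.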

5.2 Pronoun Resolution
A deeper kind of language understanding is to work out "who did what to whom", i.e., to detect the subjects
and objects of verbs. You learnt to do this in elementary school, but it's harder than you might think. In the
sentence the thieves stole the paintings it is easy to tell who performed the stealing action. Consider three
possible following sentences in (4), and try to determine what was sold, caught, and found (one case is
ambiguous).
(4)

a. The thieves stole the paintings. They were subsequently sold.
b. The thieves stole the paintings. They were subsequently caught.
c. The thieves stole the paintings. They were subsequently found.
Answering this question involves finding the antecedent of the pronoun they, either thieves or paintings.
Computational techniques for tackling this problem include anaphora resolution (identifying what a pronoun
or noun phrase refers to) and semantic role labeling (identifying how a noun phrase relates to the verb, as
agent, patient, instrument, and so on).

5.3 Generating Language Output
If we can automatically solve such problems of language understanding, we will be able to move on to tasks that
involve generating language output, such as question answering and machine translation. In the first case, a
machine should be able to answer a user's questions relating to a collection of texts:
(5)

a. Text: ... The thieves stole the paintings. They were subsequently sold. ...
b. Human: Who or what was sold?
c. Machine: The paintings.

The machine's answer demonstrates that it has correctly worked out that they refers to paintings and not to
thieves. In the second case, the machine should be able to translate the text into another language, accurately
conveying the meaning of the original text. In translating the example text into French, we are forced to choose
the gender of the pronoun in the second sentence: ils (masculine) if the thieves are found, and elles (feminine) if
the paintings are found. Correct translation actually depends on correct understanding of the pronoun.
(6)

a. The thieves stole the paintings. They were subsequently found.
b. Les voleurs ont volé les peintures. Ils ont été trouvés plus tard. (the thieves)
c. Les voleurs ont volé les peintures. Elles ont été trouvées plus tard. (the paintings)

In all of these examples, working out the sense of a word, the subject of a verb, and the antecedent of a pronoun
are steps in establishing the meaning of a sentence, things we would expect a language understanding system to
be able to do.

5.4 Machine Translation
For a long time now, machine translation (MT) has been the holy grail of language understanding, ultimately
seeking to provide high-quality, idiomatic translation between any pair of languages. Its roots go back to the early
days of the Cold War, when the promise of automatic translation led to substantial government sponsorship, and
with it, the genesis of NLP itself.
Today, practical translation systems exist for particular pairs of languages, and some are integrated into web
search engines. However, these systems have some serious shortcomings, which are starkly revealed by
translating a sentence back and forth between a pair of languages until equilibrium is reached, e.g.:
0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?
Observe that the system correctly translates Alice Springs from English to German (in the line starting 1>), but on
the way back to English, this ends up as Alice jump (line 2). The preposition before is initially translated into the
corresponding German preposition vor, but later into the conjunction bevor (line 5). After line 5 the sentences
become nonsensical (but notice the various phrasings indicated by the commas, and the change from jump to
leap). The translation system did not recognize when a word was part of a proper name, and it misinterpreted the
grammatical structure.
Note
Your Turn: Try this yourself using http://translationparty.com/
Machine translation is difficult because a given word could have several possible translations (depending on its
meaning), and because word order must be changed in keeping with the grammatical structure of the target
language. Today these difficulties are being faced by collecting massive quantities of parallel texts from news and
government websites that publish documents in two or more languages. Given a document in German and
English, and possibly a bilingual dictionary, we can automatically pair up the sentences, a process called text
alignment. Once we have a million or more sentence pairs, we can detect corresponding words and phrases, and
build a model that can be used for translating new text.
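To give a feel for how corresponding words might be detected in aligned sentence pairs, here is a toy sketch that
scores each German-English word pair with the Dice coefficient, a simple co-occurrence measure. This is an
assumption chosen for illustration, not the method used by any particular translation system, and the three
sentence pairs are invented.

```python
from collections import Counter
from itertools import product

# Three invented, already-aligned sentence pairs (German, English).
pairs = [
    ('das Haus', 'the house'),
    ('das Buch', 'the book'),
    ('ein Buch', 'a book'),
]

source_counts = Counter()   # number of pairs containing each German word
target_counts = Counter()   # number of pairs containing each English word
joint_counts = Counter()    # number of pairs containing both words

for source, target in pairs:
    source_words = set(source.split())
    target_words = set(target.split())
    source_counts.update(source_words)
    target_counts.update(target_words)
    joint_counts.update(product(source_words, target_words))

def dice(s, t):
    """Dice coefficient: 1.0 when s and t always appear in the same pairs."""
    return 2 * joint_counts[(s, t)] / (source_counts[s] + target_counts[t])

def best_match(s):
    """Return the English word that co-occurs most consistently with s."""
    return max(target_counts, key=lambda t: dice(s, t))

lexicon = {s: best_match(s) for s in source_counts}
print(lexicon)
```

With only three pairs the counts already separate das/the from Haus/house, because das also appears with Buch;
realistic models need vastly more data and must also handle word order and phrases.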

5.5 Spoken Dialog Systems
In the history of artificial intelligence, the chief measure of intelligence has been a linguistic one, namely the
Turing Test: can a dialogue system, responding to a user's text input, perform so naturally that we cannot
distinguish it from a human-generated response? In contrast, today's commercial dialogue systems are very
limited, but still perform useful functions in narrowly defined domains, as we see here:
S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater, but
it's playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.
You could not ask this system to provide driving instructions or details of nearby restaurants unless the required
information had already been stored and suitable question-answer pairs had been incorporated into the language
processing system.
Observe that this system seems to understand the user's goals: the user asks when a movie is showing and the
system correctly determines from this that the user wants to see the movie. This inference seems so obvious that
you probably didn't notice it was made, yet a natural language system needs to be endowed with this capability in
order to interact naturally. Without it, when asked Do you know when Saving Private Ryan is playing?, a system
might unhelpfully respond with a cold Yes. However, the developers of commercial dialogue systems use
contextual assumptions and business logic to ensure that the different ways in which a user might express
requests or provide information are handled in a way that makes sense for the particular application. So, if you
type When is ..., or I want to know when ..., or Can you tell me when ..., simple rules will always yield screening
times. This is enough for the system to provide a useful service.
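A minimal sketch of such "simple rules" is a single regular expression that maps several phrasings onto the same
request. The pattern, the SHOWTIMES table, and the respond() function below are all hypothetical, written only to
illustrate the idea; a deployed system would be considerably more elaborate.

```python
import re

# Hypothetical showtimes; a real system would look these up in a database.
SHOWTIMES = {
    'saving private ryan': ['3:00', '5:30', '8:00', '10:30'],
}

# Several surface forms are treated as the same request for screening times.
REQUEST_PATTERN = re.compile(
    r"^(?:do you know |can you tell me |i want to know )?"
    r"when (?:is )?(?P<title>.+?)(?: is)? playing\??$",
    re.IGNORECASE,
)

def respond(utterance):
    """Answer a screening-time request, whatever its exact phrasing."""
    match = REQUEST_PATTERN.match(utterance.strip())
    if not match:
        return "Sorry, I can only answer questions about screening times."
    title = match.group('title')
    times = SHOWTIMES.get(title.lower())
    if times is None:
        return f"{title} is not playing."
    return f"{title} is playing at {', '.join(times)}."

print(respond("When is Saving Private Ryan playing?"))
print(respond("Do you know when Saving Private Ryan is playing?"))
print(respond("Can you give me driving instructions?"))
```

The second request is answered with screening times rather than a literal Yes, because the rule ignores the
polite framing; the third falls outside the rules entirely.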

Figure 5.1: Simple Pipeline Architecture for a Spoken Dialogue System: Spoken input (top left) is analyzed,
words are recognized, sentences are parsed and interpreted in context, application-specific actions take place (top
right); a response is planned, realized as a syntactic structure, then to suitably inflected words, and finally to
spoken output; different types of linguistic knowledge inform each stage of the process.
Dialogue systems give us an opportunity to mention the commonly assumed pipeline for NLP. Figure 5.1 shows the
architecture of a simple dialogue system. Along the top of the diagram, moving from left to right, is a "pipeline"
of some language understanding components. These map from speech input via syntactic parsing to some kind of
meaning representation. Along the middle, moving from right to left, is the reverse pipeline of components for
converting concepts to speech. These components make up the dynamic aspects of the system. At the bottom of
the diagram are some representative bodies of static information: the repositories of language-related data that the
processing components draw on to do their work.
Note
Your Turn: For an example of a primitive dialogue system, try having a
conversation with an NLTK chatbot. To see the available chatbots, run
nltk.chat.chatbots(). (Remember to import nltk first.)

5.6 Textual Entailment
The challenge of language understanding has been brought into focus in recent years by a public "shared task"
called Recognizing Textual Entailment (RTE). The basic scenario is simple. Suppose you want to find evidence
to support the hypothesis: Sandra Goudie was defeated by Max Purnell, and that you have another short text that
seems to be relevant, for example, Sandra Goudie was first elected to Parliament in the 2002 elections, narrowly
winning the seat of Coromandel by defeating Labour candidate Max Purnell and pushing incumbent Green MP
Jeanette Fitzsimons into third place. Does the text provide enough evidence for you to accept the hypothesis? In
this particular case, the answer will be "No." You can draw this conclusion easily, but it is very hard to come up
with automated methods for making the right decision. The RTE Challenges provide data that allow competitors
to develop their systems, but not enough data for "brute force" machine learning techniques (a topic we will cover
in a later chapter). Consequently, some linguistic analysis is crucial. In the previous example, it is important
for the system to note that Sandra Goudie names the person being defeated in the hypothesis, not the person
doing the defeating in the text. As another illustration of the difficulty of the task, consider the following text-
hypothesis pair:
(7)

a. Text: David Golinkin is the editor or author of eighteen books, and over 150 responsa, articles,
sermons and books
b. Hypothesis: Golinkin has written eighteen books

In order to determine whether the hypothesis is supported by the text, the system needs the following background
knowledge: (i) if someone is an author of a book, then he/she has written that book; (ii) if someone is an editor of
a book, then he/she has not written (all of) that book; (iii) if someone is editor or author of eighteen books, then
one cannot conclude that he/she is author of eighteen books.
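To see why some linguistic analysis is needed, consider a naive word-overlap baseline applied to the Sandra
Goudie example. The sketch below is an illustration only: the stopword list and the content_words() helper are
invented, and word overlap is not presented here as a method used in the RTE Challenges.

```python
import re

def content_words(sentence):
    """Lowercase the sentence and keep words outside a small, ad hoc stopword list."""
    stopwords = {'was', 'by', 'the', 'of', 'to', 'in', 'and', 'first'}
    return {w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in stopwords}

text = ("Sandra Goudie was first elected to Parliament in the 2002 elections, "
        "narrowly winning the seat of Coromandel by defeating Labour candidate "
        "Max Purnell and pushing incumbent Green MP Jeanette Fitzsimons into "
        "third place.")
hypothesis = "Sandra Goudie was defeated by Max Purnell"

hyp_words = content_words(hypothesis)
overlap = hyp_words & content_words(text)
coverage = len(overlap) / len(hyp_words)

print(sorted(overlap))  # ['goudie', 'max', 'purnell', 'sandra']
print(coverage)         # 0.8
```

Four of the five content words of the hypothesis appear in the text, so an overlap threshold would wrongly accept
it; nothing in a bag of words records who defeated whom.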

5.7 Limitations of NLP
Despite the research-led advances in tasks like RTE, natural language systems that have been deployed for real-
world applications still cannot perform common-sense reasoning or draw on world knowledge in a general and
robust manner. We can wait for these difficult artificial intelligence problems to be solved, but in the meantime it
is necessary to live with some severe limitations on the reasoning and knowledge capabilities of natural language
systems. Accordingly, right from the beginning, an important goal of NLP research has been to make progress on
the difficult task of building technologies that "understand language," using superficial yet powerful techniques
instead of unrestricted knowledge and reasoning capabilities. Indeed, this is one of the goals of this book, and we
hope to equip you with the knowledge and skills to build useful NLP systems, and to contribute to the long-term
aspiration of building intelligent machines.

6 Summary
Texts are represented in Python using lists: ['Monty', 'Python']. We can use indexing, slicing, and the
len() function on lists.
A word "token" is a particular appearance of a given word in a text; a word "type" is the unique form of the
word as a particular sequence of letters. We count word tokens using len(text) and word types using
len(set(text)).
We obtain the vocabulary of a text t using sorted(set(t)).
We operate on each item of a text using [f(x) for x in text].
To derive the vocabulary, collapsing case distinctions and ignoring punctuation, we can write
set(w.lower() for w in text if w.isalpha()).
We process each word in a text using a for statement, such as for w in t: or for word in text:. This
must be followed by the colon character and an indented block of code, to be executed each time through
the loop.
We test a condition using an if statement: if len(word) < 5:. This must be followed by the colon
character and an indented block of code, to be executed only if the condition is true.
A frequency distribution is a collection of items along with their frequency counts (e.g., the words of a text
and their frequency of appearance).
A function is a block of code that has been assigned a name and can be reused. Functions are defined using
the def keyword, as in def mult(x, y); x and y are parameters of the function, and act as placeholders for
actual data values.
A function is called by specifying its name followed by zero or more arguments inside parentheses, like
this: texts(), mult(3, 4), len(text1).

7 Further Reading
This chapter has introduced new concepts in programming, natural language processing, and linguistics, all
mixed in together. Many of them are consolidated in the following chapters. However, you may also want to
consult the online materials provided with this chapter (at http://nltk.org/), including links to additional
background materials, and links to online NLP systems. You may also like to read up on some linguistics and
NLP-related concepts in Wikipedia (e.g., collocations, the Turing Test, the type-token distinction).
You should acquaint yourself with the Python documentation available at http://docs.python.org/, including the
many tutorials and comprehensive reference materials linked there. A Beginner's Guide to Python is available at
http://wiki.python.org/moin/BeginnersGuide. Miscellaneous questions about Python might be answered in the
FAQ at http://python.org/doc/faq/general/.
As you delve into NLTK, you might want to subscribe to the mailing list where new releases of the toolkit are
announced. There is also an NLTK-Users mailing list, where users help each other as they learn how to use
Python and NLTK for language analysis work. Details of these lists are available at http://nltk.org/.
For more information on the topics covered in Section 5, and on NLP more generally, you might like to consult one of the
following excellent books:
Indurkhya, Nitin and Fred Damerau (eds, 2010) Handbook of Natural Language Processing (Second
Edition). Chapman & Hall/CRC.
Jurafsky, Daniel and James Martin (2008) Speech and Language Processing (Second Edition). Prentice
Hall.
Mitkov, Ruslan (ed, 2003) The Oxford Handbook of Computational Linguistics. Oxford University Press.
(second edition expected in 2010).
The Association for Computational Linguistics is the international organization that represents the field of NLP.
The ACL website (http://www.aclweb.org/) hosts many useful resources, including: information about
international and regional conferences and workshops; the ACL Wiki with links to hundreds of useful resources;
and the ACL Anthology, which contains most of the NLP research literature from the past 50+ years, fully
indexed and freely downloadable.
Some excellent introductory Linguistics textbooks are: (Finegan, 2007), (O'Grady et al, 2004), (OSU, 2007). You
might like to consult LanguageLog, a popular linguistics blog with occasional posts that use the techniques
described in this book.

8 Exercises
1. Try using the Python interpreter as a calculator, and typing expressions like 12 / (4 + 1).
2. Given an alphabet of 26 letters, there are 26 to the power 10, or 26 ** 10, ten-letter strings we can form.
That works out to 141167095653376. How many hundred-letter strings are possible?
3. The Python multiplication operation can be applied to lists. What happens when you type ['Monty',
'Python'] * 20, or 3 * sent1?
4. Review Section 1 on computing with language. How many words are there in text2? How many distinct words
are there?
5. Compare the lexical diversity scores for humor and romance fiction in Table 1.1. Which genre is more
lexically diverse?
6. Produce a dispersion plot of the four main protagonists in Sense and Sensibility: Elinor, Marianne,
Edward, and Willoughby. What can you observe about the different roles played by the males and females
in this novel? Can you identify the couples?
7. Find the collocations in text5.
8. Consider the following Python expression: len(set(text4)). State the purpose of this expression.
Describe the two steps involved in performing this computation.
9. Review Section 2 on lists and strings.
   1. Define a string and assign it to a variable, e.g., my_string = 'My String' (but put something more
   interesting in the string). Print the contents of this variable in two ways, first by simply typing the
   variable name and pressing enter, then by using the print statement.
   2. Try adding the string to itself using my_string + my_string, or multiplying it by a number, e.g.,
   my_string * 3. Notice that the strings are joined together without any spaces. How could you fix
   this?
10.Defineavariablemy_senttobealistofwords,usingthesyntaxmy_sent=["My","sent"](butwith
yourownwords,orafavoritesaying).
1.Use''.join(my_sent)toconvertthisintoastring.
2.Usesplit()tosplitthestringbackintothelistformyouhadtostartwith.
11. Define several variables containing lists of words, e.g., phrase1, phrase2, and so on. Join them together
in various combinations (using the plus operator) to form whole sentences. What is the relationship
between len(phrase1 + phrase2) and len(phrase1) + len(phrase2)?
12. Consider the following two expressions, which have the same value. Which one will typically be more
relevant in NLP? Why?
    1. "Monty Python"[6:12]
    2. ["Monty", "Python"][1]
13. We have seen how to represent a sentence as a list of words, where each word is a sequence of
characters. What does sent1[2][2] do? Why? Experiment with other index values.
14. The first sentence of text3 is provided to you in the variable sent3. The index of 'the' in sent3 is 1,
because sent3[1] gives us 'the'. What are the indexes of the two other occurrences of this word in sent3?
15. Review the discussion of conditionals in Section 4. Find all words in the Chat Corpus (text5) starting with the
letter b. Show them in alphabetical order.
16. Type the expression list(range(10)) at the interpreter prompt. Now try list(range(10, 20)),
list(range(10, 20, 2)), and list(range(20, 10, -2)). We will see a variety of uses for this built-in
function in later chapters.
17. Use text9.index() to find the index of the word 'sunset'. You'll need to insert this word as an argument
between the parentheses. By a process of trial and error, find the slice for the complete sentence that
contains this word.
18. Using list addition, and the set and sorted operations, compute the vocabulary of the sentences sent1 ...
sent8.
19. What is the difference between the following two lines? Which one will give a larger value? Will this be
the case for other texts?

>>> sorted(set(w.lower() for w in text1))
>>> sorted(w.lower() for w in set(text1))

20. What is the difference between the following two tests: w.isupper() and not w.islower()?
21. Write the slice expression that extracts the last two words of text2.
22. Find all the four-letter words in the Chat Corpus (text5). With the help of a frequency distribution
(FreqDist), show these words in decreasing order of frequency.
23. Review the discussion of looping with conditions in Section 4. Use a combination of for and if statements to
loop over the words of the movie script for Monty Python and the Holy Grail (text6) and print all the
uppercase words, one per line.
24. Write expressions for finding all words in text6 that meet the conditions listed below. The result should
be in the form of a list of words: ['word1', 'word2', ...].
    1. Ending in ize
    2. Containing the letter z
    3. Containing the sequence of letters pt
    4. Having all lowercase letters except for an initial capital (i.e., titlecase)
25. Define sent to be the list of words ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore'].
Now write code to perform the following tasks:
    1. Print all words beginning with sh
    2. Print all words longer than four characters
26. What does the following Python code do? sum(len(w) for w in text1) Can you use it to work out the
average word length of a text?
27. Define a function called vocab_size(text) that has a single parameter for the text, and which returns the
vocabulary size of the text.
28. Define a function percent(word, text) that calculates how often a given word occurs in a text, and
expresses the result as a percentage.
29. We have been using sets to store vocabularies. Try the following Python expression: set(sent3) <
set(text1). Experiment with this using different arguments to set(). What does it do? Can you think of a
practical application for this?
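Before tackling the exercises against the full NLTK texts, it may help to rehearse the basic idioms on a list small enough to check by hand. The sketch below is only a warm-up under stated assumptions: toy_text is a hypothetical stand-in for a text such as text1, and lexical_diversity is a helper written here for illustration, not a call into NLTK. It is not a model answer to any particular exercise.

```python
# Hypothetical toy data: a plain list of word tokens standing in for an NLTK text.
toy_text = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'and', 'the', 'Cat', 'slept']

def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (illustrative helper, not an NLTK function)."""
    return len(set(tokens)) / len(tokens)

token_count = len(toy_text)       # every token counts, repeats included
type_count = len(set(toy_text))   # distinct tokens; 'cat' and 'Cat' are different strings

# A generator expression with a condition, sorted for stable output.
long_words = sorted(w for w in set(toy_text) if len(w) > 3)

print(token_count, type_count, lexical_diversity(toy_text))
print(long_words)
```

Because the list is tiny, each result can be verified by counting on paper, which makes it easier to trust the same expressions when they are applied to texts with hundreds of thousands of tokens.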
About this document...
UPDATED FOR NLTK 3.0. This is a chapter from Natural Language Processing with Python, by Steven Bird,
Ewan Klein and Edward Loper, Copyright 2014 the authors. It is distributed with the Natural Language Toolkit
[http://nltk.org/], Version 3.0, under the terms of the Creative Commons Attribution-Noncommercial-No
Derivative Works 3.0 United States License [http://creativecommons.org/licenses/by-nc-nd/3.0/us/].
This document was built on Wed 1 Jul 2015 12:30:05 AEST
