Basic Text Processing
Regular Expressions
Dan Jurafsky
Regular expressions
A formal language for specifying text strings
How can we search for any of these?
woodchuck
woodchucks
Woodchuck
Woodchucks

Regular Expressions: Disjunctions
Letters inside square brackets []

Pattern        Matches
[wW]oodchuck   Woodchuck, woodchuck
[1234567890]   Any digit

Ranges [A-Z]

Pattern   Matches                Example
[A-Z]     An upper case letter   Drenched Blossoms
[a-z]     A lower case letter    my beans were impatient
[0-9]     A single digit         Chapter 1: Down the Rabbit Hole
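
A minimal sketch of the bracket patterns above, using Python's built-in `re` module. The choice of Python and the variable names are assumptions made for illustration; the slides do not prescribe a language. `re.search` returns the first match, which is the character each example line is meant to highlight.

```python
import re

# Disjunction inside brackets: either case of the first letter matches.
woodchucks = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")

# Ranges: re.search returns the first matching character in each example.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(woodchucks)                             # ['Woodchuck', 'woodchuck']
print(first_upper, first_lower, first_digit)  # D m 1
```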

Regular Expressions: Negation in Disjunction
Negations [^Ss]
Caret means negation only when first in []

Pattern   Matches                    Example
[^A-Z]    Not an upper case letter   Oyfn pripetchik
[^Ss]     Neither 'S' nor 's'        I have no exquisite reason
[^e^]     Neither e nor ^            Look here
a^b       The pattern a caret b      Look up a^b now
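
The negation patterns can be checked the same way. This is a sketch under one stated assumption about the regex flavor: Python's `re` treats an unescaped `^` outside brackets as a start-of-string anchor, so the literal "a caret b" pattern is written `a\^b` here rather than `a^b`. The variable names are illustrative.

```python
import re

# First character that is NOT in the bracketed set.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
not_s = re.search(r"[^Ss]", "I have no exquisite reason").group()
not_e_or_caret = re.search(r"[^e^]", "Look here").group()

# In Python the caret must be escaped to match literally outside brackets.
literal_caret = re.search(r"a\^b", "Look up a^b now").group()

print(not_upper, not_s, not_e_or_caret, literal_caret)  # y I L a^b
```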

Regular Expressions: More Disjunction
Woodchucks is another name for groundhog!
The pipe | for disjunction

Pattern                     Matches
groundhog|woodchuck
yours|mine                  yours, mine
a|b|c                       = [abc]
[gG]roundhog|[Ww]oodchuck
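
A short sketch of the pipe operator with Python's `re` module. The sample sentence is made up for illustration.

```python
import re

text = "The groundhog is also called a Woodchuck."

# Plain alternation is case-sensitive, so "Woodchuck" is missed.
plain = re.findall(r"groundhog|woodchuck", text)

# Combining brackets with the pipe covers both capitalizations.
both_cases = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)

# a|b|c matches the same single characters as [abc].
same_as_class = re.findall(r"a|b|c", "abacus") == re.findall(r"[abc]", "abacus")

print(plain)          # ['groundhog']
print(both_cases)     # ['groundhog', 'Woodchuck']
print(same_as_class)  # True
```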

Regular Expressions: ? * + .

Pattern   Meaning                      Matches
colou?r   Optional previous char       color colour
oo*h!     0 or more of previous char   oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char   oh! ooh! oooh! ooooh!
baa+                                   baa baaa baaaa baaaaa
beg.n                                  begin begun begun beg3n

Stephen C Kleene: Kleene *, Kleene +
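
The operators above can be tried against a small word list. This is a sketch: the list `words` and the helper `full_matches` are hypothetical names, and `re.fullmatch` is used so that a word counts only if the whole word fits the pattern.

```python
import re

words = ["color", "colour", "oh!", "ooh!", "oooh!",
         "ba", "baa", "baaaa", "begin", "begun", "beg3n"]


def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [w for w in candidates if re.fullmatch(pattern, w)]


optional_u = full_matches(r"colou?r", words)   # ? : zero or one 'u'
star = full_matches(r"oo*h!", words)           # * : zero or more of the previous 'o'
plus = full_matches(r"o+h!", words)            # + : one or more 'o'
sheep = full_matches(r"baa+", words)           # at least two a's
wildcard = full_matches(r"beg.n", words)       # . : any single character

print(optional_u)  # ['color', 'colour']
print(star)        # ['oh!', 'ooh!', 'oooh!']
print(sheep)       # ['baa', 'baaaa']
print(wildcard)    # ['begin', 'begun', 'beg3n']
```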

Regular Expressions: Anchors ^ $

Pattern      Matches
^[A-Z]       Palo Alto
^[^A-Za-z]   1 "Hello"
\.$          The end.
.$           The end? The end!
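
A sketch of the anchors in Python's `re` module. The list name `endings` is illustrative. The contrast between `\.$` and `.$` is the point: the first needs a literal period at the end, the second accepts any final character.

```python
import re

starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()
starts_nonletter = re.search(r"^[^A-Za-z]", '1 "Hello"').group()

endings = ["The end.", "The end?", "The end!"]
ends_with_period = [s for s in endings if re.search(r"\.$", s)]
ends_with_any = [s for s in endings if re.search(r".$", s)]

print(starts_upper, starts_nonletter)  # P 1
print(ends_with_period)                # ['The end.']
print(ends_with_any)                   # ['The end.', 'The end?', 'The end!']
```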

Example
Find me all instances of the word "the" in a text.
the
  Misses capitalized examples
[tT]he
  Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
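
The three patterns can be compared on a made-up sentence. This is a sketch; the sentence and variable names are assumptions. The last line uses `\b` (word boundary), which is not on the slide and is shown only as one possible way to also catch "the" at the very start or end of the string, where the bracketed version has no neighboring character to consume.

```python
import re

text = "The other day the theology student said: the"

plain = re.findall(r"the", text)                 # misses "The", hits "other", "theology"
either_case = re.findall(r"[tT]he", text)        # adds "The", keeps the false hits
delimited = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)
boundary = re.findall(r"\b[tT]he\b", text)       # assumption: not from the slides

print(len(plain), len(either_case))  # 4 5
print(delimited)                     # [' the ']
print(boundary)                      # ['The', 'the', 'the']
```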

Errors
The process we just went through was based on fixing two kinds of errors
Matching strings that we should not have matched (there, then, other)
  False positives (Type I)
Not matching things that we should have matched (The)
  False negatives (Type II)

Errors cont.
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two antagonistic efforts:
  Increasing accuracy or precision (minimizing false positives)
  Increasing coverage or recall (minimizing false negatives).
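
The two quantities can be computed from error counts with the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below uses hypothetical counts (3 correct matches, 2 false positives, 1 false negative); the function name is an assumption.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision falls with false positives, recall falls with false negatives."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Hypothetical counts for a pattern that over-matches slightly.
precision, recall = precision_recall(true_positives=3, false_positives=2, false_negatives=1)
print(precision, recall)
```

Tightening a pattern usually removes false positives at the cost of new false negatives, which is why the two efforts pull against each other.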

Summary
Regular expressions play a surprisingly large role
  Sophisticated sequences of regular expressions are often the first model for any text processing task
For many hard tasks, we use machine learning classifiers
  But regular expressions are used as features in the classifiers
  Can be very useful in capturing generalizations

Basic Text Processing
Word tokenization

Text Normalization
Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text

How many words?
I do uh main- mainly business data processing
  Fragments, filled pauses
Seuss's cat in the hat is different from other cats!
  Lemma: same stem, part of speech, rough word sense
    cat and cats = same lemma
  Wordform: the full inflected surface form
    cat and cats = different wordforms

How many words?
they lay back on the San Francisco grass and looked at the stars and their
Type: an element of the vocabulary.
Token: an instance of that type in running text.
How many?
  15 tokens (or 14)
  13 types (or 12) (or 11?)
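
The counts can be reproduced with a whitespace split. This is a sketch: treating "San Francisco" as one token is modeled by joining it with an underscore, which is an assumption made for illustration. The "(or 11?)" case, where they and their share a lemma, would need a lemmatizer and is not shown.

```python
sentence = "they lay back on the San Francisco grass and looked at the stars and their"

tokens = sentence.split()   # every running word is a token
types = set(tokens)         # distinct word forms are types

# Alternative count: treat the place name as a single token.
merged_tokens = sentence.replace("San Francisco", "San_Francisco").split()
merged_types = set(merged_tokens)

print(len(tokens), len(types))                # 15 13
print(len(merged_tokens), len(merged_types))  # 14 12
```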

How many words?
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary

                                  Tokens = N    Types = |V|
Switchboard phone conversations   2.4 million   20 thousand
Shakespeare                       884,000       31 thousand
Google N-grams                    1 trillion    13 million

Church and Gale (1990): |V| > O(N^(1/2))

Simple Tokenization in UNIX
(Inspired by Ken Church's UNIX for Poets.)
Given a text file, output the word tokens and their frequencies
tr -sc 'A-Za-z' '\n' < shakes.txt |   # change all non-alpha to newlines
  sort |                              # sort in alphabetical order
  uniq -c                             # merge and count each type

1945 A
  72 AARON
  19 ABBESS
   5 ABBOT
 ...
  25 Aaron
   6 Abate
   1 Abates
   5 Abbess
   6 Abbey
   3 Abbot
 ...
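
For readers without a UNIX shell, the same three steps can be sketched in Python. The function name `word_counts` and the sample string are assumptions, and the sample stands in for `shakes.txt`. Python sorts by code point, which puts upper case before lower case as in the output above; the order `sort` produces depends on the locale.

```python
import re
from collections import Counter


def word_counts(text):
    # Like tr -sc 'A-Za-z' '\n': every run of non-letters separates tokens.
    tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
    # Like sort | uniq -c: count each type and list the types in sorted order.
    return sorted(Counter(tokens).items())


counts = word_counts("The cat's hat. The cat sat!")
for word, count in counts:
    print(count, word)
```

Note that the apostrophe in "cat's" is treated as a separator, so a stray token "s" appears in the counts.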

The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...

The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
A
A
A
A
A
A
A
A
...

More counting
Merging upper and lower case
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
Sorting the counts
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
8954 d

What happened here?
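
One likely explanation for the token "d" is that the pipeline treats the apostrophe as a non-letter, so a contracted form such as "lov'd" is split into "lov" and "d". The sketch below reproduces the effect on a made-up sentence; the function name and the sample are assumptions, not part of the original pipeline.

```python
import re
from collections import Counter


def folded_counts(text):
    # Lower-case first (tr 'A-Z' 'a-z'), then split on non-letters.
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    # most_common() orders by descending count, like sort -n -r.
    return Counter(tokens).most_common()


ranked = folded_counts("He lov'd and she lov'd and they liv'd")
print(ranked[:3])  # [('d', 3), ('lov', 2), ('and', 2)]
```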

Issues in Tokenization
Finland's capital → Finland Finlands Finland's ?
what're, I'm, isn't → What are, I am, is not
Hewlett-Packard → Hewlett Packard ?
state-of-the-art → state of the art ?
Lowercase → lower-case lowercase lower case ?
San Francisco → one token or two?
m.p.h., PhD. → ??

Tokenization: language issues
French
  L'ensemble → one token or two?
    L ? L' ? Le ?
    Want l'ensemble to match with un ensemble
German noun compounds are not segmented
  Lebensversicherungsgesellschaftsangestellter
  'life insurance company employee'
  German information retrieval needs compound splitter

Tokenization: language issues
Chinese and Japanese have no spaces between words:

  Sharapova now lives in US southeastern Florida
Further complicated in Japanese, with multiple alphabets intermingled
  Dates/amounts in multiple formats
500$500K(6,000)
Katakana Hiragana Kanji Romaji
End-user can express query entirely in hiragana!

Word Tokenization in Chinese
Also called Word Segmentation
Chinese words are composed of characters
  Characters are generally 1 syllable and 1 morpheme.
  Average word is 2.4 characters long.
Standard baseline segmentation algorithm:
  Maximum Matching (also called Greedy)

Maximum Matching Word Segmentation Algorithm
Given a wordlist of Chinese, and a string.
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that matches the string starting at pointer
3) Move the pointer over the word in string
4) Go to 2

Max-match segmentation illustration
Thecatinthehat → the cat in the hat
Thetabledownthere → the table down there (intended)
                    theta bled own there (max-match output)
Doesn't generally work in English!
But works astonishingly well in Chinese

Modern probabilistic segmentation algorithms even better
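
The maximum matching procedure and the two English examples can be sketched as follows. The word list is a tiny hypothetical dictionary chosen to reproduce the examples, and emitting a single character when no dictionary word matches is an assumption, since that case is not specified above.

```python
def max_match(text, wordlist):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    words = []
    pointer = 0
    while pointer < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), pointer, -1):
            if text[pointer:end] in wordlist:
                break
        else:
            end = pointer + 1  # assumption: fall back to a single character
        words.append(text[pointer:end])
        pointer = end          # move the pointer over the word, then repeat
    return words


wordlist = {"the", "theta", "table", "bled", "down", "own", "there", "cat", "in", "hat"}

hat = max_match("thecatinthehat", wordlist)
table = max_match("thetabledownthere", wordlist)
print(" ".join(hat))    # the cat in the hat
print(" ".join(table))  # theta bled own there
```

The second result shows the failure mode: "theta" is longer than "the", so the greedy choice at the first position derails the rest of the segmentation.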

Basic Text Processing
Word Normalization and Stemming

Normalization
Need to "normalize" terms
  Information Retrieval: indexed text & query terms must have same form.
    We want to match U.S.A. and USA
We implicitly define equivalence classes of terms
  e.g., deleting periods in a term
Alternative: asymmetric expansion:
  Enter: window    Search: window, windows
  Enter: windows   Search: Windows, windows, window
  Enter: Windows   Search: Windows
Potentially more powerful, but less efficient
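
The two strategies can be contrasted in a small sketch. The names `normalize`, `EXPANSIONS`, and `expand_query` are hypothetical. The expansion table is copied from the three Enter/Search lines above, and returning unknown terms unchanged is an assumption.

```python
def normalize(term):
    """Equivalence classing: map every variant to one canonical form.
    Here the only rule is deleting periods."""
    return term.replace(".", "")


# Asymmetric expansion: what to search for depends on what the user typed.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}


def expand_query(term):
    # Assumption: terms without an entry are searched as typed.
    return EXPANSIONS.get(term, [term])


print(normalize("U.S.A.") == normalize("USA"))  # True
print(expand_query("windows"))                  # ['Windows', 'windows', 'window']
```

Normalization costs one lookup per term, while expansion issues several searches per query term, which is one way to read "less efficient".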

Case folding
Applications like IR: reduce all letters to lower case
  Since users tend to use lower case
  Possible exception: upper case in mid-sentence?
    e.g., General Motors
    Fed vs. fed
    SAIL vs. sail
For sentiment analysis, MT, information extraction
  Case is helpful (US versus us is important)

Lemmatization
Reduce inflections or variant forms to base form
  am, are, is → be
  car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization: have to find correct dictionary headword form
Machine translation
  Spanish quiero ('I want'), quieres ('you want') same lemma as querer 'want'

Morphology
Morphemes:
  The small meaningful units that make up words
  Stems: The core meaning-bearing units
  Affixes: Bits and pieces that adhere to stems
    Often with grammatical functions

Stemming
Reduce terms to their stems in information retrieval
Stemming is crude chopping of affixes
  language dependent
  e.g., automate(s), automatic, automation all reduced to automat.

for example compressed and compression are both accepted as equivalent to compress.
→
for exampl compress and compress ar both accept as equival to compress

Porter's algorithm
The most common English stemmer
Step 1a
  sses → ss    caresses → caress
  ies  → i     ponies   → poni
  ss   → ss    caress   → caress
  s    → ø     cats     → cat
Step 1b
  (*v*)ing → ø   walking   → walk
                 sing      → sing
  (*v*)ed  → ø   plastered → plaster

Step 2 (for long stems)
  ational → ate   relational → relate
  izer    → ize   digitizer  → digitize
  ator    → ate   operator   → operate

Step 3 (for longer stems)
  al   → ø   revival    → reviv
  able → ø   adjustable → adjust
  ate  → ø   activate   → activ
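
Steps 1a and 1b as listed can be sketched directly. This is not the full Porter stemmer: it covers only the rules shown, treats just a, e, i, o, u as vowels (an assumption; the real algorithm also counts some uses of y), and leaves out the stem-length conditions that govern Steps 2 and 3. The function names are illustrative.

```python
import re


def step_1a(word):
    """Plural rules, tried in order; the first matching suffix wins."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss
    if word.endswith("ies"):
        return word[:-2]   # ies -> i
    if word.endswith("ss"):
        return word        # ss -> ss
    if word.endswith("s"):
        return word[:-1]   # s -> (nothing)
    return word


def step_1b(word):
    """Strip -ing or -ed only if the remaining stem contains a vowel: (*v*)."""
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and re.search(r"[aeiou]", stem):
            return stem
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
for w in ["walking", "sing", "plastered"]:
    print(w, "->", step_1b(w))
```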

Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
(*v*)ing → ø   walking → walk
               sing → sing

tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr
1312 King
 548 being
 541 nothing
 388 king
 375 bring
 358 thing
 307 ring
 152 something
 145 coming
 130 morning

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr
 548 being
 541 nothing
 152 something
 145 coming
 130 morning
 122 having
 120 living
 117 loving
 116 Being
 102 going
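
The effect of the vowel condition can be seen on a handful of the words from the counts above. In this sketch `re.search` plays the role of `grep`, and the token list is a small hand-picked sample rather than the full corpus.

```python
import re

tokens = ["King", "king", "being", "nothing", "bring", "thing", "ring", "something"]

# grep 'ing$': anything that ends in -ing, including words where it is not a suffix.
any_ing = [t for t in tokens if re.search(r"ing$", t)]

# grep '[aeiou].*ing$': a vowel must appear somewhere before the final -ing.
vowel_ing = [t for t in tokens if re.search(r"[aeiou].*ing$", t)]

print(any_ing == tokens)  # True
print(vowel_ing)          # ['being', 'nothing', 'something']
```

Words like "King" and "thing" drop out because their only vowel is the "i" of "ing" itself.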

Dealing with complex morphology is sometimes necessary
Some languages require complex morpheme segmentation
Turkish
  Uygarlastiramadiklarimizdanmissinizcasina
  '(behaving) as if you are among those whom we could not civilize'
  Uygar 'civilized' + las 'become'
  + tir 'cause' + ama 'not able'
  + dik 'past' + lar 'plural'
  + imiz 'p1pl' + dan 'abl'
  + mis 'past' + siniz '2pl' + casina 'as if'

Basic Text Processing
Sentence Segmentation and Decision Trees

Sentence Segmentation
!, ? are relatively unambiguous
Period "." is quite ambiguous
  Sentence boundary
  Abbreviations like Inc. or Dr.
  Numbers like .02 or 4.3
Build a binary classifier
  Looks at a "."
  Decides EndOfSentence/NotEndOfSentence
  Classifiers: hand-written rules, regular expressions, or machine-learning
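
A hand-written-rules classifier of this kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the two-entry abbreviation list, the heuristic that a boundary is followed by a capitalized word, and the function names. It tokenizes on whitespace, so a period inside a number such as 4.3 is never considered.

```python
ABBREVIATIONS = {"inc.", "dr."}  # hypothetical, deliberately tiny list


def is_boundary(token, next_token):
    """Return True for EndOfSentence, False for NotEndOfSentence.
    The token is expected to end in '.', '!' or '?'."""
    if token.endswith(("!", "?")):
        return True                    # relatively unambiguous
    if token.lower() in ABBREVIATIONS:
        return False                   # the period belongs to an abbreviation
    # Assumed heuristic: a real boundary is followed by a capitalized word.
    return next_token is None or next_token[0].isupper()


def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if token.endswith((".", "!", "?")) and is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:                        # keep any trailing words
        sentences.append(" ".join(current))
    return sentences


sentences = split_sentences("Dr. Smith paid 4.3 million to Acme Inc. last year. It was a lot!")
for s in sentences:
    print(s)
```

The rules fail when an abbreviation really does end a sentence, which is one motivation for learning the decision from data instead.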

Determining if a word is end-of-sentence: a Decision Tree

More sophisticated decision tree features
Case of word with ".": Upper, Lower, Cap, Number
Case of word after ".": Upper, Lower, Cap, Number
Numeric features
  Length of word with "."
  Probability(word with "." occurs at end-of-s)
  Probability(word after "." occurs at beginning-of-s)
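
The features listed above could be collected per candidate period roughly as follows. This is a sketch: the function names, the dictionary keys, and the two probability tables are hypothetical, and in practice the probabilities would be estimated from a training corpus rather than passed in by hand.

```python
def word_case(word):
    """Map a word to one of the four case classes: Upper, Lower, Cap, Number."""
    core = word.strip(".")
    if any(ch.isdigit() for ch in core):
        return "Number"
    if core.isupper():
        return "Upper"
    if core[:1].isupper():
        return "Cap"
    return "Lower"


def period_features(word, next_word, p_end, p_begin):
    """Build the feature set for one word ending in '.' and the word after it.
    p_end and p_begin are hypothetical probability lookup tables."""
    return {
        "case_of_word": word_case(word),
        "case_of_next": word_case(next_word),
        "length": len(word),
        "p_word_ends_sentence": p_end.get(word, 0.0),
        "p_next_begins_sentence": p_begin.get(next_word, 0.0),
    }


features = period_features("Dr.", "Smith", p_end={"Dr.": 0.01}, p_begin={"Smith": 0.2})
print(features)
```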

Implementing Decision Trees
A decision tree is just an if-then-else statement
The interesting research is choosing the features
Setting up the structure is often too hard to do by hand
  Hand-building only possible for very simple features, domains
    For numeric features, it's too hard to pick each threshold
  Instead, structure usually learned by machine learning from a training corpus
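
To make the "if-then-else" point concrete, here is a hand-built tree over three features. It is a sketch only: the order of the questions and the thresholds 0.5 and 3 are made up, which is exactly the part that is hard to get right by hand and is normally learned from a training corpus.

```python
def classify_period(case_of_next, length, p_word_ends_sentence):
    """Each nested 'if' is one internal node of the tree; each return is a leaf."""
    if case_of_next == "Lower":
        return "NotEndOfSentence"
    else:
        if p_word_ends_sentence >= 0.5:        # made-up threshold
            return "EndOfSentence"
        else:
            if length <= 3:                    # made-up threshold: short words
                return "NotEndOfSentence"      # such as "Dr." are often abbreviations
            else:
                return "EndOfSentence"


print(classify_period("Cap", 3, 0.01))    # NotEndOfSentence
print(classify_period("Cap", 5, 0.30))    # EndOfSentence
print(classify_period("Lower", 5, 0.90))  # NotEndOfSentence
```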

Decision Trees and other classifiers
We can think of the questions in a decision tree as features that could be exploited by any kind of classifier
  Logistic regression
  SVM
  Neural Nets
  etc.
