Conflation Algorithms
October 2009
Acknowledgements
John Repici (2002), http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
Porter, M.F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]
Jurafsky & Martin, appendix B, pp. 833-836.
Conflation
COMPUTE, COMPUTER, COMPUTING, COMPUTES, COMPUTABILITY, COMPUTATION → COMPUT
Lemmatisation
Attempt to map to same lemma
POS dependent
Morphological Analysis
Includes morpho-syntactic information
Soundex Algorithm 1
The Soundex Algorithm uses the following steps to encode a word:
1. The first character of the word is retained as the first character of the Soundex code.
2. The following letters are discarded: a, e, i, o, u, h, w, and y.
3. Remaining consonants are given a code number.
4. If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23").
Code Numbers
Letters                   Code
b, p, f, v                1
c, s, k, g, j, q, x, z    2
d, t                      3
l                         4
m, n                      5
r                         6
Soundex Algorithm 2
The resulting code is modified so that it becomes exactly four characters long:
If it is less than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200").
If it is more than 4 characters, the code is truncated (e.g. "B2435" becomes "B243").
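The Soundex steps and code numbers above can be sketched in Python as follows. This is a minimal illustration that follows the slides literally; the function name `soundex` and the `CODES` table are names chosen here, and other published Soundex variants treat h and w differently, which this sketch does not attempt.

```python
# Code numbers from the slide: each consonant group maps to one digit.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Encode a word with the simplified Soundex steps from the slides."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    # Step 1: keep the first character.
    first = word[0].upper()
    # Steps 2 and 3: drop a, e, i, o, u, h, w, y and code the remaining consonants.
    digits = [CODES[ch] for ch in word[1:] if ch in CODES]
    # Step 4: consecutive identical code numbers are coded once.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Pad with zeroes or truncate to exactly four characters.
    return (first + "".join(collapsed) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"), soundex("Lee"))
```

Names that sound alike, such as Robert and Rupert, receive the same code, which is what makes Soundex usable as a conflation key.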
Improvements
Preprocessing before applying the basic algorithm, e.g. identification of
DG with G
GH with H
GN with N (not 'ng')
KN with N
PH with F
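One possible reading of this preprocessing is a plain substitution pass run before encoding. The sketch below assumes the letter groups are replaced anywhere in the word and in the order listed; the slide does not specify either point, and the names `REPLACEMENTS` and `preprocess` are chosen here for illustration.

```python
# Letter-group identifications from the slide, applied in the listed order
# (assumption: the slide does not say whether order or position matters).
REPLACEMENTS = [("dg", "g"), ("gh", "h"), ("gn", "n"), ("kn", "n"), ("ph", "f")]


def preprocess(word: str) -> str:
    """Rewrite letter groups before the basic Soundex algorithm is applied."""
    word = word.lower()
    for old, new in REPLACEMENTS:
        word = word.replace(old, new)
    return word


for w in ["phone", "knight", "gnome", "sing"]:
    print(w, "->", preprocess(w))
```

Because only the sequence "gn" is rewritten, a word such as "sing" is left unchanged, matching the "(not 'ng')" note.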
IR Applications
Information Retrieval: Query → Relevant Documents
Issues
Is a dictionary available?
Stems Affixes
Motivation: linguistic credibility or engineering performance?
When to remove an affix versus when to leave it alone
Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2"
relate/relativity vs. radioactive/radioactivity
October 2009 HLT: Conflation Algorithms 16
Measure
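All the above patterns can be replaced by the following regular expression
[C](VC)^m[V]
where C is a sequence of consonants, V is a sequence of vowels, square brackets mark an optional part, and (VC)^m means VC repeated m times.
m is called the measure of any word or word part.
m=0: tr, ee, tree, y, by
m=1: trouble, oats, trees, ivy
m=2: troubles, private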
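The measure can be computed by labelling each letter as consonant or vowel and counting vowel-to-consonant transitions. The sketch below assumes Porter's (1980) letter classes, in which y counts as a vowel only when it follows a consonant; that rule is not stated on this slide, and the function names are chosen here for illustration.

```python
def is_consonant(word: str, i: int) -> bool:
    """Assumed from Porter (1980): a, e, i, o, u are vowels; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word: str) -> int:
    """Return m, the number of VC sequences, by counting vowel-to-consonant transitions."""
    word = word.lower()
    shape = ["c" if is_consonant(word, i) else "v" for i in range(len(word))]
    return sum(1 for a, b in zip(shape, shape[1:]) if a == "v" and b == "c")


for w in ["tr", "ee", "tree", "y", "by", "trouble", "oats",
          "trees", "ivy", "troubles", "private"]:
    print(w, measure(w))
```

Run on the slide's examples, this reproduces the m=0, m=1 and m=2 groupings listed above.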
Rules
Rules for removing a suffix are given in the form
(condition) S1 → S2
i.e. if a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2.
Example rule: (m > 1) EMENT → (null)
Example: enlargement → enlarg
Conditions
*S - stem ends with s
*Z - stem ends with z
*T - stem ends with t
*v* - stem contains a vowel
*d - stem ends with a double consonant
*o - stem ends cvc, where second c is not w, x or y, e.g. -wil, -hop
In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
Sets of rules are applied in 7 steps.
Within each step, the rule matching the longest suffix applies.
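The stem conditions can be written as small predicates and combined with ordinary Boolean operators. This is a sketch only: the function names are chosen here, and the treatment of y (a vowel only after a consonant) is assumed from Porter (1980) rather than taken from this slide.

```python
def is_consonant(stem: str, i: int) -> bool:
    """Assumed from Porter (1980): y is a vowel only when it follows a consonant."""
    ch = stem[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(stem, i - 1)
    return True


def ends_with(stem: str, letter: str) -> bool:
    """*S, *Z, *T: stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem: str) -> bool:
    """*v*: stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem: str) -> bool:
    """*d: stem ends with a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem: str) -> bool:
    """*o: stem ends consonant-vowel-consonant, and the last consonant is not w, x or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


# A Boolean combination such as (*S or *T):
stem = "adopt"
print(ends_with(stem, "s") or ends_with(stem, "t"))
```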
Organisation
Step 1: Plurals and Third Person Singular Verbs (-s)
Step 2: Verbal Past Tense and Progressive (-ed, -ing)
Step 3: Y to I, Noun Inflections (fly/flies)
Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
Example:
generat → generate
troubl → trouble
capsiz → capsize
hopp → hop
hiss → hiss
Step 3: Y to I
(*v*) Y → I
happy → happi
cry → cry
Step 4: Derivational Morphology I: Multiple Suffixes (excerpt)
Condition   Rewrite            Example
(m > 0)     ATIONAL → ATE      relational → relate
(m > 0)     TIONAL → TION      conditional → condition
(m > 0)     ENCI → ENCE        valenci → valence
(m > 0)     ABLI → ABLE        comfortabli → comfortable
(m > 0)     OUSLI → OUS        analogousli → analogous
(m > 0)     IZATION → IZE      digitization → digitize
(m > 0)     ATION → ATE        generation → generate
(m > 0)     ATOR → ATE         operator → operate
(m > 0)     ALISM → AL         formalism → formal
(m > 0)     IVENESS → IVE      pensiveness → pensive
(m > 0)     FULNESS → FUL      hopefulness → hopeful
(m > 0)     OUSNESS → OUS      callousness → callous
(m > 0)     ALITI → AL         formaliti → formal
(m > 0)     BILITI → BLE       possibiliti → possible
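A table of rules like this one can be applied with a single loop that looks for the longest matching suffix and then tests the (m > 0) condition on the remaining stem. The sketch below is illustrative: `STEP4_RULES`, `apply_step` and the helper functions are names chosen here, the y rule is assumed from Porter (1980), and the behaviour when the longest match fails its condition (no shorter rule is tried) is also assumed from Porter's description.

```python
def is_consonant(word: str, i: int) -> bool:
    """Assumed from Porter (1980): y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count VC sequences in the stem."""
    shape = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for a, b in zip(shape, shape[1:]) if a == "v" and b == "c")


# (S1, S2) pairs from the table; every rule in this step has condition (m > 0).
STEP4_RULES = [("ational", "ate"), ("tional", "tion"), ("enci", "ence"),
               ("abli", "able"), ("ousli", "ous"), ("ization", "ize"),
               ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
               ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
               ("aliti", "al"), ("biliti", "ble")]


def apply_step(word: str, rules, min_measure: int = 0) -> str:
    """Apply the rule with the longest matching suffix, if its condition holds."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > min_measure:
                return stem + replacement
            return word  # longest match decides; shorter suffixes are not tried
    return word


for w in ["relational", "conditional", "valenci", "hopefulness", "rational"]:
    print(w, "->", apply_step(w, STEP4_RULES))
```

The last word shows the condition at work: "rational" matches ATIONAL, but the stem "r" has measure 0, so the word is left alone.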
Step 6: Derivational Morphology III: Single Suffixes
Condition                 Rewrite            Example
(m > 1)                   AL → (null)        revival → reviv
(m > 1)                   ANCE → (null)      allowance → allow
(m > 1)                   ENCE → (null)      inference → infer
(m > 1)                   ER → (null)        airliner → airlin
(m > 1)                   IC → (null)        gyroscopic → gyroscop
(m > 1)                   ABLE → (null)      adjustable → adjust
(m > 1)                   ANT → (null)       irritant → irrit
(m > 1)                   EMENT → (null)     replacement → replac
(m > 1)                   MENT → (null)      adjustment → adjust
(m > 1)                   ENT → (null)       dependent → depend
(m > 1 and (*S or *T))    ION → (null)       adoption → adopt
(m > 1)                   OU → (null)        homologou → homolog
(m > 1)                   ISM → (null)       formalism → formal
(m > 1)                   ATE → (null)       activate → activ
Porter Example
INPUT in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management
Porter Output
Original Word   Stemmed Word   Original Word   Stemmed Word
first           first          platforms       platform
focus           focu           software        softwar
area            area           services        servic
integrated      integr         supporting      support
projects        project        distributed     distribut
help            help           information     inform
develop         develop        decision        decis
principally     princip        systems         system
common          common         risk            risk
open            open           crisis          crisi
platforms       platform       management      manag
Stemming Errors
Under-stemming
the error of taking off too small a suffix
example: croulons → croulon, since croulons is a form of the verb crouler
Over-stemming
the error of taking off too much
example: crotons → crot, since crotons is the plural of croton
Mis-stemming
taking off what looks like an ending, but is really part of the stem
example: reply → rep
Summary
Conflation serves different purposes.
Generally, the motivation is to achieve an engineering goal rather than linguistic fidelity.
This can cause errors in the bag of words model.
Soundex and Porter are very well established and easily available.