Artificial intelligence & natural language processing
Mark Sanderson
Porto, 2000
Aims
• To provide an outline of the attempts made
at using NLP techniques in IR
Objectives
• At the end of this lecture you will be able to
– Outline a range of attempts to get NLP to work
with IR systems
– Idly speculate on why they failed
– Describe the successful use of NLP in a limited
domain
Why?
• Seems an obvious area of investigation
– Why isn’t it working?
Use of NLP
• Syntactic
– Parsing to identify phrases
– Full syntactic structure comparison
• Semantic
– Building an understanding of a document’s
content
• Discourse
– Exploiting document structure?
Syntactic
• Parsing to identify phrases
– The issues.
– Explain how it’s done (a bit).
– Is it worth it?
• Other possibilities
– Grammatical tagging
– Full syntactic structure comparison
• Explain how it’s done (a little bit).
• Show results.
Simple phrase identification
• High-frequency terms could be good candidates.
– Why?
• Terms co-occurring more often than chance (see the sketch after this slide).
– Within a small number of words of each other.
– Surrounding simple terms.
– Not surrounding punctuation.
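A minimal sketch of finding such statistical phrases, not part of the original lecture: keep word pairs that co-occur within a small window more often than chance. The window size, minimum count and pointwise-mutual-information threshold are illustrative assumptions.

# Minimal sketch: keep word pairs that co-occur within a small window
# more often than chance, measured with pointwise mutual information.
# Window size, minimum count and threshold are illustrative assumptions.
import math
from collections import Counter

def candidate_phrases(docs, window=3, min_count=2, pmi_threshold=2.0):
    term_counts = Counter()
    pair_counts = Counter()
    total = 0
    for doc in docs:
        tokens = doc.lower().split()          # punctuation handling omitted
        total += len(tokens)
        term_counts.update(tokens)
        for i, t in enumerate(tokens):
            for u in tokens[i + 1 : i + 1 + window]:
                pair_counts[(t, u)] += 1      # co-occurrence within `window`
    phrases = []
    for (t, u), n in pair_counts.items():
        if n < min_count:
            continue
        # observed co-occurrence probability vs. what independence predicts
        pmi = math.log((n / total) / ((term_counts[t] / total) * (term_counts[u] / total)))
        if pmi >= pmi_threshold:
            phrases.append(((t, u), pmi))
    return sorted(phrases, key=lambda p: -p[1])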
Problems
• Close words that aren’t phrases.
– “the use of computers in science & technology”
• Distant words that are phrases.
– “preparation & evaluation of abstracts and extracts”
Parsing for phrases
• Using parsers to identify noun phrases.

• Make a phrase out of a head and the head of its modifiers.
– Example parse of “automatic analysis of scientific text”:
[NP [ADJ automatic] [NOUN analysis] [PP [PREP of] [ADJ scientific] [NOUN text]]]
Errors
• Not a perfect rule by any means.
– Need restrictions to eliminate bogus phrases (see the sketch after this slide).
– Example parse of “automatic analysis of these four scientific texts”:
[NP [ADJ automatic] [NOUN analysis] [PP [PREP of] [DET these] [QUANT four] [ADJ scientific] [NOUN texts]]]
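As promised above, a sketch of the head-plus-modifiers rule with a restriction that drops determiners and quantifiers. The tree encoding, tag names and head-finding rule are assumptions made for illustration, not Fagan’s or anyone else’s published method.

# Illustrative sketch: given a simple parse tree, make phrases from each
# head and the heads of its modifiers, skipping DET/QUANT/PREP so that
# bogus phrases such as "these four" are never produced.

def head_of(node):
    """Head of a phrase node: its last direct NOUN child, otherwise
    the head of its last sub-phrase."""
    label, children = node
    if isinstance(children, str):               # leaf: (tag, word)
        return children
    nouns = [c for c in children if c[0] == "NOUN"]
    if nouns:
        return nouns[-1][1]
    for child in reversed(children):
        if child[0] in ("NP", "PP"):
            return head_of(child)
    return None

def phrases(node, out=None):
    """Pair each phrase's head with the head of each content-bearing modifier."""
    if out is None:
        out = []
    label, children = node
    if isinstance(children, str):
        return out
    h = head_of(node)
    for child in children:
        if child[0] in ("DET", "QUANT", "PREP"):   # the restriction
            continue
        ch = head_of(child)
        if ch and ch != h:
            out.append((ch, h))
        phrases(child, out)                        # recurse into sub-phrases
    return out

# "automatic analysis of these four scientific texts"
tree = ("NP", [("ADJ", "automatic"), ("NOUN", "analysis"),
               ("PP", [("PREP", "of"), ("DET", "these"), ("QUANT", "four"),
                       ("ADJ", "scientific"), ("NOUN", "texts")])])
print(phrases(tree))
# [('automatic', 'analysis'), ('texts', 'analysis'), ('scientific', 'texts')]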


Do they work?
• Fagan compared statistical phrases with syntactic ones; the statistical approach won, but only just.
– J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & nonsyntactic methods, TR 87-868, Department of Computer Science, Cornell University

• More research has been conducted.


– T. Strzalkowski (1995) Natural language information retrieval,
in Information Processing & Management, Vol. 31, No. 3, pp
397-417
Check out TREC
• Overview of the Seventh Text REtrieval
Conference (TREC-7), E. Voorhees, D. Harman
(National Institute of Standards and Technology)
– http://trec.nist.gov/
– Ad hoc track
• Fairly even between statistical phrases, syntactic phrases and
no phrases.
Grammatical tagging?
• Tag document text with grammatical codes?
– R. Garside (1987). The CLAWS word tagging system, in
The computational analysis of English: a corpus-based
approach, R. Garside, G. Leech, G. Sampson Eds.,
Longman: 30-41.

• Doesn’t appear to work


– R. Sacks-Davis, P. Wallis, R. Wilkinson (1990). Using
syntactic analysis in a document retrieval system that uses
signature files, in Proceedings of 13th ACM SIGIR
Conference: 179-191.
Syntactic structure comparison
• Has been tried…
– A. F. Smeaton & P. Sheridan (1991) Using morpho-syntactic
language analysis in phrase matching, in Proceedings of RIAO ‘91,
Pages 414-429

• Method
– Parse sentences into tree structures
– When you get a phrase match
• Look at linking syntactic operator.
• Look at the residual tree structure that didn’t match (see the sketch after this slide).
• Does not appear to work
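A rough sketch of the idea, not Smeaton & Sheridan’s actual method: represent each phrase as (head, relation, modifier) triples from its parse, give full credit when the linking syntactic operator matches, partial credit when only the terms do, and penalise residual structure that has no counterpart.

# Rough sketch (illustration only): score a query phrase against a document
# phrase via their parse triples, checking the linking operator and
# penalising residual (unmatched) document structure.

def triple_match_score(query_triples, doc_triples):
    score = 0.0
    matched_doc = set()
    for qh, qrel, qm in query_triples:
        best, best_t = 0.0, None
        for t in doc_triples:
            dh, drel, dm = t
            if {qh, qm} == {dh, dm}:          # same pair of terms
                # full credit if the linking operator matches, partial otherwise
                s = 1.0 if qrel == drel else 0.5
                if s > best:
                    best, best_t = s, t
        score += best
        if best_t:
            matched_doc.add(best_t)
    residual = len(doc_triples) - len(matched_doc)
    return score - 0.1 * residual             # penalty for leftover structure

q = [("analysis", "adj-mod", "automatic")]
d = [("analysis", "adj-mod", "automatic"), ("analysis", "pp-of", "text")]
print(triple_match_score(q, d))   # 0.9: one full match, small residual penalty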
Semantic
• Disambiguation
– Given a word appearing in a certain context,
disambiguators will tell you which sense it is being used in.

• IR system
– Index document collections by senses rather than
words
– Ask the users which senses their query words are used in
– Retrieve on senses (a minimal sketch follows this slide)
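A minimal sketch of such a sense-based index, assuming a toy sense inventory and a simplified Lesk-style overlap for disambiguation; both are assumptions for illustration, not any particular published system.

# Minimal sketch of sense-based indexing: disambiguate each word against a
# toy sense inventory by gloss overlap, then index "word#sense" tokens
# instead of plain words. Inventory and documents are illustrative.
from collections import defaultdict

SENSES = {
    "bank": {"bank#1": {"money", "account", "loan"},
             "bank#2": {"river", "water", "shore"}},
}

def disambiguate(word, context):
    """Pick the sense whose gloss words overlap most with the context."""
    if word not in SENSES:
        return word                      # monosemous: keep the word itself
    ctx = set(context)
    return max(SENSES[word], key=lambda s: len(SENSES[word][s] & ctx))

def index(docs):
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for w in tokens:
            inverted[disambiguate(w, tokens)].add(doc_id)
    return inverted

docs = {1: "the bank raised the loan rate", 2: "we walked along the river bank"}
idx = index(docs)
print(idx["bank#1"])   # {1} -> only the financial-sense document matches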
Disambiguation
• Does it work?
– No (well maybe)
• M. Sanderson, Word sense disambiguation and
information retrieval, in Proceedings of the 17th
ACM SIGIR Conference, Pages 142-151, 1994
• M. Sanderson & C.J. van Rijsbergen, The impact on
retrieval effectiveness of skewed frequency
distributions, in ACM Transactions on Information
Systems (TOIS) Vol. 17 No. 4, 1999, Pages 440-465.
Partial conclusions
• NLP has yet to prove itself in IR
– Agree
– D.D. Lewis & K. Sparck-Jones (1996) Natural language
processing for information retrieval, in Communications
of the ACM (CACM) 1996 Vol. 39, No. 1, 92-101
– Sort of don’t agree
– A. Smeaton (1992) Progress in the application of natural
language processing to information retrieval tasks, in The
Computer Journal, Vol. 35, No. 3.
Mark’s idle speculation
• What people always think is going on
[Figure: diagram of “Keywords” and “NLP”]
Mark’s idle speculation
• What’s usually actually going on
[Figure: diagram of “Keywords” and “NLP”]
Areas where NLP does work
• Systems with the following ingredients.
– Collection documents cover a small domain.
– Language use is limited in some manner.
– User queries cover a tight subject area.
– Documents/queries very short
• Image captions
– LSI, pseudo-relevance feedback
– People willing to spend money getting NLP to
work
RIME & IOTA
• From Grenoble
– Y. Chiaramella & J. Nie (1990) A retrieval model based on
an extended modal logic and its application to the RIME
experimental approach, in Proceedings of the 13th SIGIR
conference, Pages 25-43

• Medical record retrieval system
– Some database’y parts
– Free text descriptions of cases
Indexing
• “an opacity affecting probably the lung and
the trachea”
– Legend: SGN = observed sign, LOC = localisation
– Resulting structure:
{[p], SGN}
  {[and], SGN}
    {[bears-on], SGN}
      {[opacity], SGN}  {[lung], LOC}
    {[bears-on], SGN}
      {[opacity], SGN}  {[trachea], LOC}
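A small sketch of how such indexing structures might be represented, purely as an assumption for illustration; it is not the actual RIME/IOTA data structure.

# Illustrative representation: each node is (term, role, children), where
# the role is SGN (observed sign) or LOC (localisation).

def node(term, role, *children):
    return (term, role, list(children))

# "an opacity affecting probably the lung and the trachea"
case = node("p", "SGN",
            node("and", "SGN",
                 node("bears-on", "SGN",
                      node("opacity", "SGN"), node("lung", "LOC")),
                 node("bears-on", "SGN",
                      node("opacity", "SGN"), node("trachea", "LOC"))))

def terms(n):
    """All (term, role) pairs in a structure, for quick filtering."""
    term, role, children = n
    yield (term, role)
    for c in children:
        yield from terms(c)

print(list(terms(case)))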


Retrieval
• How do we match a user’s query to these
structures?
– Using transformations, a bit like logic.

t - the uncertainty attached to a transformation

  {[bears-on], SGN}
    {[opacity], SGN}  {[lung], LOC}

⇒ {[opacity], SGN}, t
Tree transformation
{[has-for-value], SGN}
  {[bears-on], SGN}
    {[opacity], SGN}  {[lung], LOC}
  {[has-for-value], SGN}
    {[contour], SGN}  {[blurred], LOC}

⇒ (with uncertainty t)

{[has-for-value], SGN}, t
  {[opacity], SGN}
  {[has-for-value], SGN}
    {[contour], SGN}  {[blurred], LOC}
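A rough sketch of applying such a transformation with an uncertainty factor. The rule format, the relaxation rule and the factor 0.8 are assumptions for illustration, not the paper’s actual logic.

# Rough sketch: rewrite a structure with relaxation rules, each carrying an
# uncertainty factor t, and accumulate the product of the factors used.
# A node is (term, role, children), as in the earlier sketch.

def node(term, role, *children):
    return (term, role, list(children))

def relax_bears_on(n):
    """Assumed rule: a 'bears-on' node collapses to its SGN child, t = 0.8."""
    term, role, children = n
    if term == "bears-on":
        sign = next(c for c in children if c[1] == "SGN")
        return sign, 0.8
    return n, 1.0

def apply_rule(n, rule):
    """Apply `rule` top-down; return (new_structure, total_uncertainty)."""
    n, t = rule(n)
    term, role, children = n
    new_children = []
    for c in children:
        c2, t2 = apply_rule(c, rule)
        new_children.append(c2)
        t *= t2
    return (term, role, new_children), t

doc = node("has-for-value", "SGN",
           node("bears-on", "SGN",
                node("opacity", "SGN"), node("lung", "LOC")),
           node("has-for-value", "SGN",
                node("contour", "SGN"), node("blurred", "LOC")))

relaxed, t = apply_rule(doc, relax_bears_on)
print(t)        # 0.8: one relaxation was needed
print(relaxed)  # the 'bears-on' subtree has collapsed to {[opacity], SGN}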


Term transforms
• Basic medical terms stored in a hierarchy.
– Transformations possible again with
uncertainty added.

Level 1    Level 2    Level 3
tumour     cancer     sarcoma
           hygroma
           kyste      polykystosis
                      pseudokyst
           polyp      polyposis
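A small sketch of expanding a query term down such a hierarchy, with uncertainty multiplied in for each level crossed. The hierarchy fragment and the per-level factor of 0.8 are illustrative assumptions.

# Small sketch: expand a term to narrower terms in a hierarchy, reducing
# confidence by a fixed factor per level. Hierarchy and factor are assumed.

NARROWER = {
    "tumour": ["cancer", "hygroma", "kyste", "polyp"],
    "cancer": ["sarcoma"],
    "kyste": ["polykystosis", "pseudokyst"],
    "polyp": ["polyposis"],
}

def expand(term, factor=0.8, t=1.0):
    """Yield (term, uncertainty) for the term and everything below it."""
    yield term, t
    for narrower in NARROWER.get(term, []):
        yield from expand(narrower, factor, t * factor)

for term, t in expand("tumour"):
    print(f"{term:14s} {t:.2f}")
# tumour 1.00, cancer 0.80, sarcoma 0.64, hygroma 0.80, ...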
Isn’t this a bit slow?
• Yes

• Optimisation
– Scan for potential documents.
– Process them intensively.

• Evaluation?
– Not in that paper.
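A sketch of that two-stage optimisation in the abstract: a cheap keyword scan picks candidate documents, and only those receive the expensive structural matching. The scoring functions below are placeholders, not the system’s own.

# Sketch of the two-stage optimisation: cheap scan, then intensive processing
# of the surviving candidates only. Both scorers are placeholder functions.

def cheap_score(query_terms, doc_text):
    """Fast filter: simple keyword overlap."""
    words = set(doc_text.lower().split())
    return len(set(query_terms) & words)

def expensive_score(query_terms, doc_text):
    """Stand-in for intensive structural/semantic matching."""
    return cheap_score(query_terms, doc_text)   # placeholder only

def retrieve(query_terms, docs, top_k=10):
    # stage 1: scan everything cheaply for potential documents
    candidates = [(d_id, text) for d_id, text in docs.items()
                  if cheap_score(query_terms, text) > 0]
    # stage 2: process only the candidates intensively
    scored = [(expensive_score(query_terms, text), d_id)
              for d_id, text in candidates]
    return sorted(scored, reverse=True)[:top_k]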
Not unique
• SCISOR
– P.S. Jacobs & L.F. Rau (1990) SCISOR: Extracting
Information from On-line News, in Communications of
the ACM (CACM), Vol. 33, No. 11, 88-97
Why do they work?
• Because of the restrictions
– Small subject domain.
– Limited vocabulary.
– Restricted type of question.
• Compare with a large-scale IR system.
– Keywords are good enough.
– Long time to set up.
– Hard to adapt to new domain.
Anything else for NLP?
• Text Generation
– IR system explaining itself?
Conclusions
• By now, you will be able to
– Outline a range of attempts to get NLP to work
with IR systems
– Idly speculate on why they failed
– Describe the successful use of NLP in a limited
domain
