INFORMATION EXTRACTION
John Francis Olivo
BSCS 3-4

Input text -> IE System -> Relevant info from text
Information Extraction
• Skims the input text

• Creates database entries

• Objects and events

• Relations between objects and events

• Midway between information retrieval (IR) and full-text parsers


Information Extraction Systems

Informally:
Gets the gist of the text and organizes it in a way that is useful to people

Formally:
Gathers semantic information from documents, especially web pages, that allows further inferences to be made
IE systems are considered "dumber" versions of full natural language understanding, because IE focuses narrowly on finding particular relations and producing a structured representation of them
AHA!

Process:
- Skim text
- Locate relation instances
- Get information
- Store in DB
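The four steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the "supports" relation, the SUPPORTS pattern, the table layout, and the two input sentences are all made up for the example, and a real IE system would use far richer patterns.

```python
import re
import sqlite3

# Hypothetical pattern for a single relation: "<Person> supports <Party>".
SUPPORTS = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) supports ([A-Z][a-z]+)")

def extract_to_db(text, conn):
    """Skim the text, locate relation instances, and store them in the DB."""
    conn.execute("CREATE TABLE IF NOT EXISTS supports (person TEXT, party TEXT)")
    rows = SUPPORTS.findall(text)          # locate instances, get the fields
    conn.executemany("INSERT INTO supports VALUES (?, ?)", rows)
    conn.commit()
    return rows

# Made-up input sentences, used for illustration only.
text = "Analysts say Tony Windsor supports Labor. Rob Oakeshott supports Labor too."
conn = sqlite3.connect(":memory:")
extract_to_db(text, conn)
stored = conn.execute(
    "SELECT person, party FROM supports ORDER BY person"
).fetchall()
print(stored)
```

Everything outside the pattern ("Analysts say", "too") is skipped, which is the sense in which the system only skims the text.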
Named Entity Recognition (NER)

As presented by Christopher Manning, NER is a very important subtask in information extraction

NER does the task of LOCATING and CLASSIFYING entities or names found in the text.
Sample text
The decision by the independent MP Andrew Wilkie to
withdraw his support for the minority Labor government
sounded dramatic but it should not further threaten its
stability. When, after the 2010 election, Wilkie, Rob
Oakeshott, Tony Windsor and the Greens agreed to
support Labor, they gave just two guarantees: confidence
and supply.
Sample text task: FIND ENTITIES
The decision by the independent MP [Andrew Wilkie] to
withdraw his support for the minority [Labor] government
sounded dramatic but it should not further threaten its
stability. When, after the [2010] election, [Wilkie], [Rob
Oakeshott], [Tony Windsor] and the [Greens] agreed to
support [Labor], they gave just two guarantees: confidence
and supply.
Sample text task: CLASSIFY
KEY: Person, Date, Location, Organization
The decision by the independent MP [Andrew Wilkie: Person] to
withdraw his support for the minority [Labor: Organization] government
sounded dramatic but it should not further threaten its
stability. When, after the [2010: Date] election, [Wilkie: Person],
[Rob Oakeshott: Person], [Tony Windsor: Person] and the [Greens: Organization] agreed to
support [Labor: Organization], they gave just two guarantees: confidence
and supply.
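As a rough sketch of the two NER tasks on this sample text, the code below locates entities and classifies them by simple dictionary lookup. The GAZETTEER table is a hypothetical, hand-made list that covers this passage only; real NER systems generalize to unseen names, which a lookup table cannot do.

```python
import re

# Hypothetical gazetteer, hand-made for this one sample text.
GAZETTEER = {
    "Andrew Wilkie": "Person",
    "Rob Oakeshott": "Person",
    "Tony Windsor": "Person",
    "Wilkie": "Person",
    "Labor": "Organization",
    "Greens": "Organization",
    "2010": "Date",
}

def find_and_classify(text):
    """Locate known names in the text and return (name, class) pairs."""
    # Longest names first, so "Andrew Wilkie" wins over plain "Wilkie".
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
    return [(m.group(0), GAZETTEER[m.group(0)]) for m in pattern.finditer(text)]

TEXT = (
    "The decision by the independent MP Andrew Wilkie to withdraw his support "
    "for the minority Labor government sounded dramatic but it should not "
    "further threaten its stability. When, after the 2010 election, Wilkie, "
    "Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they "
    "gave just two guarantees: confidence and supply."
)

entities = find_and_classify(TEXT)
for name, label in entities:
    print(f"{name:15} {label}")
```

No Location appears in the output because the passage does not mention one.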
Some uses:
Named entities can be indexed, linked off, etc.
Sentiment can be attributed to companies or products
A lot of IE relations are associations between named entities
Standard approaches to IE (and NER)
1. Hand-written regular expressions

2. Using classifiers
Generative: Naïve Bayes
Discriminative: Maxent models

3. Sequence models
HMMs
CMMs/MEMMs
CRFs
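To give a feel for approach 3, here is a toy HMM tagger decoded with the Viterbi algorithm. It is a sketch under heavy assumptions: there are only two states (O for "outside" and PER for "person"), the only observation is whether a token is capitalized, and every probability in START, TRANS, and emit is invented by hand rather than learned from data.

```python
import math

STATES = ["O", "PER"]

# Made-up probabilities, for illustration only.
START = {"O": 0.8, "PER": 0.2}
TRANS = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.5, "PER": 0.5}}

def emit(state, token):
    """P(observation | state), where the observation is capitalization."""
    capitalized = token[0].isupper()
    if state == "PER":
        return 0.9 if capitalized else 0.1
    return 0.2 if capitalized else 0.8

def viterbi(tokens):
    """Return the most probable state sequence for the tokens."""
    if not tokens:
        return []
    # best[t][s] = (log-probability of best path ending in s, previous state)
    best = [{s: (math.log(START[s]) + math.log(emit(s, tokens[0])), None)
             for s in STATES}]
    for tok in tokens[1:]:
        column = {}
        for s in STATES:
            score, prev = max(
                (best[-1][p][0] + math.log(TRANS[p][s]), p) for p in STATES
            )
            column[s] = (score + math.log(emit(s, tok)), prev)
        best.append(column)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

tokens = ["support", "from", "Tony", "Windsor", "and", "others"]
tags = viterbi(tokens)
print(list(zip(tokens, tags)))
```

Unlike a per-token classifier, the sequence model scores the whole tag sequence at once, so the label chosen for "Windsor" depends on the label chosen for "Tony".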
Types of IE Systems

(a) Attribute-based IE system

(b) Relational-based IE system


Attribute-based IE systems

The whole text is considered as an object

Get attributes of the object

Low-level information extraction


e.g. mail program extractions:
- time and date
- phone number
- events
- etc
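A minimal sketch of attribute-based extraction, in the spirit of the mail program example: the whole message is one object, and each regular expression pulls out one of its attributes. The PATTERNS table and the message are hypothetical, and each pattern recognizes a single format only (ISO-style dates, 24-hour times, seven-digit phone numbers).

```python
import re

# Hypothetical patterns: one accepted format per attribute.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "time": re.compile(r"\b\d{1,2}:\d{2}\b"),
    "phone": re.compile(r"\b\d{3}-\d{4}\b"),
}

def extract_attributes(message):
    """Return all matches for each attribute of the message."""
    return {name: pat.findall(message) for name, pat in PATTERNS.items()}

# Made-up mail message, used for illustration only.
message = "Team meeting on 2024-03-15 at 14:30. Call 555-0123 if you are late."
attributes = extract_attributes(message)
print(attributes)
```

No relations between the attributes are extracted, which is what makes this low-level compared with the relational-based systems.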
Relational-based IE systems
Extracts objects and relations between these objects

Built through CASCADED FINITE STATE TRANSDUCERS

Series of finite state automata


Steps in relational-based IE system:

1. Tokenization: same as lexical analysis; segments the character stream into tokens (words, numbers, punctuation)

2. Complex words handling: multiword expressions such as "set up" and "joint venture", and proper names

3. Basic phrases/groups handling: noun groups and verb groups

4. Complex phrases/groups handling: basic groups are combined into complex phrases

5. Domain events: the sequence of phrases from levels 3 and 4 is scanned for patterns of interest

6. Structure merging: semantic structures that describe the same entity or event are merged
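The first three stages of such a cascade can be sketched as follows, with each stage consuming the output of the one before it. This is an illustration under assumptions: the stages are written as plain Python functions rather than compiled finite-state transducers, and the MULTIWORDS, DETERMINERS, NOUNS, and VERBS lexicons are tiny hypothetical word lists.

```python
import re

# Hypothetical lexicons; real systems use much larger resources.
MULTIWORDS = {("set", "up"), ("joint", "venture")}
DETERMINERS = {"a", "an", "the"}
NOUNS = {"company", "joint venture"}
VERBS = {"set up"}

def tokenize(text):
    """Stage 1: split the text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def merge_complex_words(tokens):
    """Stage 2: join adjacent tokens that form a known multiword."""
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in MULTIWORDS:
            out.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def basic_groups(words):
    """Stage 3: label noun groups (NG) and verb groups (VG)."""
    groups, i = [], 0
    while i < len(words):
        word = words[i]
        if word in DETERMINERS and i + 1 < len(words) and words[i + 1] in NOUNS:
            groups.append(("NG", word + " " + words[i + 1]))
            i += 2
        elif word in NOUNS:
            groups.append(("NG", word))
            i += 1
        elif word in VERBS:
            groups.append(("VG", word))
            i += 1
        else:
            groups.append(("OTHER", word))
            i += 1
    return groups

def cascade(text):
    """Run stages 1 to 3 in order, each feeding the next."""
    return basic_groups(merge_complex_words(tokenize(text)))

groups = cascade("The company set up a joint venture.")
print(groups)
```

Stages 4 to 6 would continue in the same way, taking the list of groups as input and producing progressively larger structures.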
AHA!

Information extraction works well for a restricted domain in which it is possible to determine what subjects will be discussed and how they will be mentioned
References:

Russell, Stuart J., and Norvig, Peter. Artificial Intelligence: A Modern Approach, 2nd ed. Pearson Education Inc., 2003

Manning, Christopher. Information Extraction and Named Entity Recognition. Stanford University

Hobbs, Jerry R., and Riloff, Ellen. Handbook of Natural Language Processing. Information Sciences Institute
