
Abstract

The goal of the Natural Language Processing (NLP) group is to design and build software
that will analyze, understand, and generate languages that humans use naturally, so that
eventually you will be able to address your computer as though you were addressing
another person.

This goal is not easy to reach. "Understanding" language means, among other things,
knowing what concepts a word or phrase stands for and knowing how to link those
concepts together in a meaningful way. It's ironic that natural language, the symbol
system that is easiest for humans to learn and use, is hardest for a computer to master.

The challenges we face stem from the highly ambiguous nature of natural language. As
an English speaker you effortlessly understand a sentence like "Flying planes can be
dangerous". Yet this sentence presents difficulties to a software program that lacks both
your knowledge of the world and your experience with linguistic structures. Is the more
plausible interpretation that the pilot is at risk, or that the danger is to people on the
ground? Should "can" be analyzed as a verb or as a noun? Which of the many possible
meanings of "plane" is relevant? Depending on context, "plane" could refer to, among
other things, an airplane, a geometric object, or a woodworking tool. How much and what
sort of context needs to be brought to bear on these questions in order to adequately
disambiguate the sentence?

We address these problems using a mix of knowledge-engineered and statistical/machine-learning techniques to disambiguate and respond to natural language input. Our work has
implications for applications like text critiquing, information retrieval, question
answering, summarization, gaming, and translation. The grammar checkers in Office for
English, French, German, and Spanish are outgrowths of our research; Encarta uses our
technology to retrieve answers to user questions; Intellishrink uses natural language
technology to compress cellphone messages; Microsoft Product Support uses our
machine translation software to translate the Microsoft Knowledge Base into other
languages. As our work evolves, we expect it to benefit any area in which human users gain from communicating with their computers in a natural way.

Natural language processing
Natural language processing (NLP) is a subfield of artificial intelligence and
linguistics. It studies the problems of automated generation and understanding of natural
human languages. Natural language generation systems convert information from
computer databases into normal-sounding human language, and natural language
understanding systems convert samples of human language into more formal
representations that are easier for computer programs to manipulate.

Contents

1 Tasks and limitations
2 Concrete problems
3 Subproblems
4 Statistical NLP
5 The major tasks in NLP
6 Evaluation of natural language processing
7 Organizations and conferences

Tasks and limitations

In theory, natural language processing is a very attractive method of human-computer interaction. Early systems such as SHRDLU, working in restricted "blocks worlds"
with restricted vocabularies, worked extremely well, leading researchers to excessive
optimism which was soon lost when the systems were extended to more realistic
situations with real-world ambiguity and complexity.

Natural language understanding is sometimes referred to as an AI-complete problem, because natural language recognition seems to require extensive knowledge about the
outside world and the ability to manipulate it. The definition of "understanding" is one of
the major problems in natural language processing.

Concrete problems

Some examples of the problems faced by natural language understanding systems:

• The sentences We gave the monkeys the bananas because they were hungry and
We gave the monkeys the bananas because they were over-ripe have the same
surface grammatical structure. However, in one of them the word they refers to
the monkeys, in the other it refers to the bananas: the sentence cannot be
understood properly without knowledge of the properties and behaviour of
monkeys and bananas.

• A string of words may be interpreted in myriad ways. For example, the string
Time flies like an arrow may be interpreted in a variety of ways:
o time moves quickly, just as an arrow does;
o measure the speed of flying insects as you would measure that of an arrow, i.e. (You should) time flies as you would an arrow;
o measure the speed of flying insects the way an arrow would, i.e. Time flies in the same way that an arrow would (time them);
o measure the speed of flying insects that are like arrows, i.e. Time those flies that are like arrows;
o a type of flying insect, "time-flies," enjoys arrows (compare Fruit flies like a banana).

English is particularly challenging in this regard because it has little inflectional morphology to distinguish between parts of speech.

• English and several other languages don't make explicit which word an adjective modifies. For example, the string "pretty little girls' school" admits several readings:
o Does the school look little?
o Do the girls look little?
o Do the girls look pretty?
o Does the school look pretty?

Subproblems
Speech segmentation
In most spoken languages, the sounds representing successive letters blend into
each other, so the conversion of the analog signal to discrete characters can be a
very difficult process. Also, in natural speech there are hardly any pauses between successive words; locating those boundaries usually must take into account grammatical and semantic constraints, as well as the context.
Text segmentation
Some written languages like Chinese, Japanese and Thai do not signal word boundaries, so any significant text parsing usually requires the identification of word boundaries, which is often a non-trivial task; a minimal sketch follows this list.
Word sense disambiguation
Many words have more than one meaning; we have to select the meaning which
makes the most sense in context.
Syntactic ambiguity
The grammar for natural languages is ambiguous, i.e. there are often multiple
possible parse trees for a given sentence. Choosing the most appropriate one
usually requires semantic and contextual information. Specific problem
components of syntactic ambiguity include sentence boundary disambiguation.
Imperfect or irregular input
Foreign or regional accents and vocal impediments in speech; typing or
grammatical errors, OCR errors in texts.
Speech acts and plans
Sentences often don't mean what they literally say; for instance, a good response to "Can you pass the salt?" is to pass the salt; in most contexts "Yes" is not a good answer, although "No" is better and "I'm afraid that I can't see it" is better yet. Or
again, if a class was not offered last year, "The class was not offered last year" is a
better answer to the question "How many students failed the class last year?" than
"None" is.

Statistical NLP

Statistical natural language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties discussed above, especially those which arise
because longer sentences are highly ambiguous when processed with realistic grammars,
yielding thousands or millions of possible analyses. Methods for disambiguation often
involve the use of corpora and Markov models. The technology for statistical NLP comes
mainly from machine learning and data mining, both of which are fields of artificial
intelligence that involve learning from data.
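
To make the Markov-model idea concrete, below is a minimal sketch of the Viterbi algorithm running over a tiny hidden Markov model for part-of-speech disambiguation of the sentence from the abstract. All of the probabilities are hand-set assumptions for illustration; a real statistical system would estimate them from an annotated corpus.

    # Viterbi decoding over a toy HMM: find the most probable tag sequence.
    def viterbi(words, tags, start_p, trans_p, emit_p):
        # best[t] = (probability, path) of the best tag sequence ending in tag t.
        best = {t: (start_p[t] * emit_p[t].get(words[0], 1e-8), [t]) for t in tags}
        for w in words[1:]:
            new_best = {}
            for t in tags:
                # Extend the best previous path into tag t.
                new_best[t] = max((p * trans_p[prev][t] * emit_p[t].get(w, 1e-8), path + [t])
                                  for prev, (p, path) in best.items())
            best = new_best
        return max(best.values())[1]

    tags = ["NOUN", "VERB", "ADJ"]
    start_p = {"NOUN": 0.5, "VERB": 0.2, "ADJ": 0.3}            # invented numbers
    trans_p = {"NOUN": {"NOUN": 0.2, "VERB": 0.7, "ADJ": 0.1},  # tag-to-tag transitions
               "VERB": {"NOUN": 0.3, "VERB": 0.2, "ADJ": 0.5},
               "ADJ":  {"NOUN": 0.8, "VERB": 0.1, "ADJ": 0.1}}
    emit_p = {"NOUN": {"planes": 0.3, "can": 0.1, "flying": 0.05},  # tag-to-word emissions
              "VERB": {"can": 0.4, "be": 0.3, "flying": 0.2},
              "ADJ":  {"flying": 0.3, "dangerous": 0.4}}
    print(viterbi("flying planes can be dangerous".split(), tags, start_p, trans_p, emit_p))
    # -> ['ADJ', 'NOUN', 'VERB', 'VERB', 'ADJ']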

The major tasks in NLP

• Speech recognition
• Natural language generation
• Machine translation
• Question answering
• Information retrieval
• Information extraction
• Text Simplification
• Text-proofing
• Translation technology
• Automatic summarization
• Foreign Language Reading Aid
• Foreign Language Writing Aid

Speech recognition
Speech recognition (in many contexts also known as 'automatic speech
recognition', computer speech recognition or erroneously as Voice Recognition) is the
process of converting a speech signal to a sequence of words, by means of an algorithm
implemented as a computer program. Speech recognition applications that have emerged in recent years include voice dialing (e.g., Call home), call routing (e.g., I would like
to make a collect call), simple data entry (e.g., entering a credit card number), and
preparation of structured documents (e.g., a radiology report).
Voice recognition or speaker recognition is a related process that attempts to identify the
person speaking, as opposed to what is being said.

Natural language generation
Natural Language Generation (NLG) is the natural language processing task of
generating natural language from a machine representation system such as a knowledge
base or a logical form.

Some people view NLG as the opposite of natural language understanding. The
difference can be put this way: whereas in natural language understanding the system
needs to disambiguate the input sentence to produce the machine representation language,
in NLG the system needs to make decisions about how to put a concept into words.
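
A minimal sketch of the simplest form of this, template-based realization, is shown below: a structured record (the field names and the date are invented for illustration) is turned into an English sentence. Real NLG systems layer content selection, aggregation and grammatical realization on top of this idea.

    # Template-based surface realization from a machine representation (a sketch).
    def realize_merger(fact):
        """Render a structured merger record as an English sentence."""
        template = "{acquirer} announced its acquisition of {target} on {when}."
        return template.format(acquirer=fact["company1"],
                               target=fact["company2"],
                               when=fact["date"])

    fact = {"company1": "Foo Inc.", "company2": "Bar Corp.", "date": "1 March 2007"}
    print(realize_merger(fact))
    # -> Foo Inc. announced its acquisition of Bar Corp. on 1 March 2007.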

Machine translation
Machine translation, sometimes referred to by the acronym MT, is a sub-field of
computational linguistics that investigates the use of computer software to translate text
or speech from one natural language to another. At its basic level, MT performs simple
substitution of atomic words in one natural language for words in another. Using corpus
techniques, more complex translations may be attempted, allowing for better handling of
differences in linguistic typology, phrase recognition, and translation of idioms, as well as
the isolation of anomalies.
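
The basic substitution level is easy to sketch, and the sketch also shows its limits. The toy English-to-Spanish lexicon below is an assumption for illustration; note that the output gets no reordering, no agreement, and no idiom handling.

    # Word-for-word substitution "translation" (a sketch of the basic level only).
    toy_lexicon = {"the": "el", "cat": "gato", "eats": "come", "fish": "pescado"}

    def word_for_word(sentence, lexicon):
        """Replace each word with its dictionary entry; keep unknown words as-is."""
        return " ".join(lexicon.get(w, w) for w in sentence.lower().split())

    print(word_for_word("The cat eats fish", toy_lexicon))
    # -> el gato come pescado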

Current machine translation software often allows for customisation by domain or profession (such as weather reports), improving output by limiting the scope of
allowable substitutions. This technique is particularly effective in domains where formal
or formulaic language is used. It follows then that machine translation of government and
legal documents more readily produces usable output than conversation or less
standardised text.

Improved output quality can also be achieved by human intervention: for example, some
systems are able to translate more accurately if the user has unambiguously identified
which words in the text are names. With the assistance of these techniques, MT has
proven useful as a tool to assist human translators, and in some cases can even produce
output that can be used "as is". However, current systems are unable to produce output of
the same quality as a human translator, particularly where the text to be translated uses
casual language.

Question answering
Question answering (QA) is a type of information retrieval. Given a collection of
documents (such as the World Wide Web or a local collection) the system should be able
to retrieve answers to questions posed in natural language. QA is regarded as requiring
more complex natural language processing (NLP) techniques than other types of
information retrieval such as document retrieval, and it is sometimes regarded as the next
step beyond search engines.

QA research attempts to deal with a wide range of question types including: fact, list,
definition, How, Why, hypothetical, semantically-constrained, and cross-lingual
questions. Search collections vary from small local document collections, to internal
organization documents, to compiled newswire reports, to the world wide web.

• Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier
task because NLP systems can exploit domain-specific knowledge frequently
formalized in ontologies.
• Open-domain question answering deals with questions about nearly everything,
and can only rely on general ontologies and world knowledge. On the other hand,
these systems usually have much more data available from which to extract the
answer.

(Alternatively, closed-domain might refer to a situation where only a limited type of question is accepted, such as questions asking for descriptive rather than procedural
information.)

Information retrieval
Information retrieval (IR) is the science of searching for information in
documents, searching for documents themselves, searching for metadata which describe
documents, or searching within databases, whether relational stand-alone databases or
hypertext networked databases such as the Internet or World Wide Web or intranets, for
text, sound, images or data. There is a common confusion, however, between data
retrieval, document retrieval, information retrieval, and text retrieval, and each of these
has its own bodies of literature, theory, praxis and technologies. Like most nascent fields, IR is interdisciplinary, drawing on computer science, mathematics, library science, information science, cognitive psychology, linguistics, statistics, and physics.

Automated IR systems are used to reduce information overload. Many universities and
public libraries use IR systems to provide access to books, journals, and other documents.
An IR system revolves around queries and objects. Queries are formal statements of information needs that the user puts to the IR system. An object is an entity that keeps or stores information in a database; user queries are matched against the objects stored in the database. A document is therefore a data object. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented by document surrogates.
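
A minimal sketch of this query-object matching, using TF-IDF weights and cosine similarity, follows. The three "documents" are invented stand-ins for the surrogates a real IR system would store.

    # Rank documents against a query with TF-IDF and cosine similarity (a sketch).
    import math
    from collections import Counter

    docs = {"d1": "the cat sat on the mat",
            "d2": "the dog chased the cat",
            "d3": "information retrieval systems rank documents"}

    # Document frequency of each term across the collection.
    df = Counter(w for text in docs.values() for w in set(text.split()))

    def tfidf(text):
        tf = Counter(text.split())
        return {w: c * math.log(len(docs) / df[w]) for w, c in tf.items() if w in df}

    def cosine(u, v):
        dot = sum(x * v.get(w, 0.0) for w, x in u.items())
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    query = tfidf("cat on the mat")
    for d, text in sorted(docs.items(), key=lambda kv: -cosine(query, tfidf(kv[1]))):
        print(d, round(cosine(query, tfidf(text)), 3))
    # -> d1 ranks first: it shares the most discriminative terms with the query.
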
In 1992 the US Department of Defense, along with the National Institute of Standards
and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of
the TIPSTER text program. The aim was to support the information retrieval community by supplying the infrastructure needed for large-scale evaluation of text retrieval methodologies.

Web search engines such as Google, Live.com, or Yahoo search are the most visible IR
applications.

Information extraction
Information extraction (IE) is a type of information retrieval whose goal is to
automatically extract structured or semistructured information from unstructured
machine-readable documents. It is a sub-discipline of language engineering, a branch of
computer science.

It aims to apply methods and technologies from practical computer science, such as compiler construction and artificial intelligence, to the problem of processing unstructured textual data automatically, with the objective of extracting structured knowledge in some
domain. A typical example is the extraction of information on corporate merger events,
whereby instances of the relation MERGE(company1,company2,date) are extracted from
online news ("Yesterday, New-York based Foo Inc. announced their acquisition of Bar
Corp.").

The significance of Information Extraction is determined by the growing amount of information available in unstructured (i.e. without metadata) form, for instance on the
Internet. This knowledge can be made more accessible by means of transformation into
relational form.

Natural language texts may require some form of text simplification to produce more easily machine-readable text from which sentences can be extracted.

Typical subtasks of IE are:

• Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
• Coreference: identification of chains of noun phrases that refer to the same object. Anaphora, for example, is a type of coreference.
• Terminology extraction: finding the relevant terms for a given corpus

Text Simplification
In natural language processing, text simplification is an important task because much English text consists of complex compound sentences that are not easily processed by information tasks.

Text-proofing (Proofreading)
Proofreading traditionally means reading a proof copy of a text in order to detect
and correct any errors. Modern proofreading often requires reading copy at earlier stages
as well.

Translation technology
Translation is the interpretation of the meaning of a text in one language (the
"source text") and the production, in another language, of an equivalent text (the "target
text," or "translation") that communicates the same message.

Translation must take into account a number of constraints, including context, the rules of
grammar of the two languages, their writing conventions, their idioms and the like.
Consequently, as has been recognized at least since the time of the translator Martin
Luther, one translates best into the language that one knows best.

Traditionally translation has been a human activity, though attempts have been made to
computerize or otherwise automate the translation of natural-language texts (machine
translation) or to use computers as an aid to translation (computer-assisted translation).

Perhaps the most common misconception about translation is that there exists a simple
"word-for-word" relation between any two languages, and that translation is therefore a
straightforward and mechanical process. On the contrary, translation is always fraught
with uncertainties and with the potential for inadvertent "spilling over" of idioms and
usages from one language into the other.

Automatic summarization
Automatic summarization is the creation of a shortened version of a text by a
computer program. The product of this procedure still contains the most important points
of the original text.

The phenomenon of information overload has meant that access to coherent and
correctly-developed summaries is vital. As access to data has increased, so has interest in automatic summarization. Search engines such as Google are a familiar example of summarization technology in use.

Technologies that can make a coherent summary, of any kind of text, need to take into
account several variables such as length, writing-style and syntax to make a useful
summary.
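
A minimal sketch of the extractive approach follows: score each sentence by the frequency of its content words in the whole text and keep the top-scoring sentences. The stopword list and the sentence splitter are simplistic assumptions; production summarizers weigh the length, style and syntax factors mentioned above.

    # Frequency-based extractive summarization (a sketch).
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "is", "and", "in", "it", "that", "by"}

    def content_words(text):
        return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

    def summarize(text, n_sentences=1):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(content_words(text))
        def score(s):
            tokens = content_words(s)
            return sum(freq[w] for w in tokens) / (len(tokens) or 1)
        keep = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
        # Re-emit the chosen sentences in their original order.
        return " ".join(s for s in sentences if s in keep)

    text = ("Automatic summarization shortens a text by computer. "
            "The summary keeps the most important points of the text. "
            "Weather was pleasant yesterday.")
    print(summarize(text))
    # -> Automatic summarization shortens a text by computer.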

Foreign Language Writing Aid
A foreign language writing aid is a computer program that assists a non-native
language user in writing decently in their target language. Assistive operations can be
classified into two categories: on-the-fly prompts and post-writing checks. Assisted
aspects of writing include: lexical syntax (especially the "syntactic and semantic roles of
a word's frame"), lexical semantics (context/collocation-influenced word choice and user-
intention-driven synonym choice), idiomatic expression transfer, etc. Online dictionaries
can also be considered as a type of foreign language writing aid.

Evaluation of natural language processing

• History of evaluation in NLP
• Intrinsic evaluation
• Extrinsic evaluation
• Automatic evaluation
• Manual evaluation

Organizations and conferences

• Association for Computational Linguistics
• Association for Machine Translation in the Americas
• AFNLP - Asian Federation of Natural Language Processing Associations
