
Computational Linguistics and Learning from Big Data


Gabriel Doyle
UCSD Linguistics
From not enough data to too much

Finding people:
90s, 700 datapoints, 7 years

People finding you:


00s, 30000 datapoints, 3 years

People just talking:


10s, 10000 datapoints, 5 days
Big data
Benefits:
Cheap to collect
Unsolicited
Huge size
Covers rare events
Problems:
Little control
Noisy data
Difficult to analyze
Need for intelligent analysis
Big data is too big to analyze dumbly
no one can read millions of tweets
Analysis needed to establish
relevance
are they talking about what we're interested in?
meaning
what are they saying about it?
use
what does it mean to us?
Structured & Unstructured Data
Surveys, focus groups, questionnaires, etc.
yield structured data
we know what we're asking
we force the respondents to fit that structure
Imposing structure is costly
can only get answers to the questions we ask
respondents can't tell us what they might think
need to design & implement the structure
Structured & Unstructured Data
The internet / social media / devices provide
unstructured data
People tell us what they want to say, not what
we want to know
Modern computational linguistic analyses can
bridge the gap between our interests
fewer constraints on data coming in
low cost to speaker, medium cost to analyst
The dangers of simplistic analysis
Don't want ads for cutlery on a story about a
stabbing
Eastland Mall in Pittsburgh's closed BUT
Eastland Mall in Bloomington isn't
"I'm not happy the food was expensive" vs.
"I'm happy the food was not expensive"
Computational approaches
What are people talking about?
Word-sense disambiguation
Named-entity recognition
What are people saying about it?
Automated parsing
Sentiment analysis
Putting it together
Information extraction
Topic modeling
Word-sense disambiguation
Language is ambiguous
what does "mean" mean?
Distinguish between multiple meanings of a
word
going to the park vs. will park my car
connotations: chintzy cheap vs. frugal cheap
can be done with supervision (e.g., WordNet)
or unsupervised
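
As a rough illustration of the supervised, dictionary-based route, here is a minimal sketch of a simplified Lesk-style disambiguator: it picks the sense whose dictionary gloss shares the most words with the sentence. The sense inventory, glosses, and stopword list are hypothetical stand-ins written for this example, not taken from WordNet.

```python
# Toy sense inventory: the glosses are illustrative, not taken from WordNet.
SENSES = {
    "park": {
        "park_noun": "public area of land with grass and trees for walking and play",
        "park_verb": "leave a car or other vehicle in a space for a time",
    }
}

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "or", "with", "my", "i", "will"}


def tokenize(text):
    """Lowercase, strip basic punctuation, and drop stopwords."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def disambiguate(word, sentence):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = tokenize(sentence) - {word}
    scores = {
        sense: len(context & tokenize(gloss))
        for sense, gloss in SENSES[word].items()
    }
    return max(scores, key=scores.get)


print(disambiguate("park", "I will park my car in the space"))
print(disambiguate("park", "We are going to the park for a walk on the grass"))
```

Word overlap is a crude signal; it only works here because the toy glosses happen to share vocabulary with the test sentences.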
Named-entity recognition
Identifying names of people & things
finding out what people are talking about
Identifies & connects information about an
object
central to information extraction
Can be tied to other modalities
identifying people in photos from captions
Berg et al 2004
Cross-modal named-entities
Named-entity recognition
Named-entity resources
ANNIE, Stanford NER
excellent performance on edited newsprint
[90%+]
poor performance on tweets & social media
[40-70%]
Derczynski & Bontcheva 2014
increased noise-tolerance, post-editing
improves performance to 84% on tweets
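
For intuition about what a named-entity recognizer produces, the sketch below uses the simplest possible baseline: greedy longest-match lookup in a hand-written gazetteer. This is far simpler than the systems named above, and the gazetteer entries and type labels are hypothetical.

```python
# Hypothetical gazetteer: lowercase token tuples mapped to entity types.
GAZETTEER = {
    ("eastland", "mall"): "LOCATION",
    ("pittsburgh",): "LOCATION",
    ("bloomington",): "LOCATION",
    ("scott", "mitchell"): "PERSON",
}
MAX_LEN = max(len(name) for name in GAZETTEER)


def find_entities(text):
    """Greedy longest-match lookup of token spans in the gazetteer."""
    tokens = [t.strip(".,!?") for t in text.split()]
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return entities


print(find_entities("Eastland Mall in Pittsburgh is closed."))
```

A lookup like this fails on any name missing from the list or spelled differently, which is one reason noisy text such as tweets is hard.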
Automated parsing
Extracting the structure of a sentence
Automated parsing
Core step for getting specific semantic
information
Structure of a sentence has a huge effect on
meaning
"I'm not happy the food was expensive"
"I'm happy the food was not expensive"
Existing parsers are really good, as long as the
text isn't too bad
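
A small check shows why structure matters: the two example sentences contain exactly the same words, so an analysis that ignores word order cannot tell them apart. The helper name `bag_of_words` is made up for this sketch.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase tokens, ignoring word order."""
    return Counter(sentence.lower().split())


a = "I'm not happy the food was expensive"
b = "I'm happy the food was not expensive"

# Same words with the same counts, so the two bags are equal.
print(bag_of_words(a) == bag_of_words(b))

# The position of "not" differs, which is what changes the meaning.
print(a.split().index("not"), b.split().index("not"))
```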
Sentiment analysis
Basic idea: what emotion is being expressed
here?
who has the emotion?
what's the emotion directed at?
what reason is offered?
Learning: train with known data and then
extend to unknown
e.g., given a set of reviews, what features do the
good/bad have?
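
One common way to "train with known data and then extend to unknown" is a naive Bayes classifier over word counts. The sketch below is a minimal version with add-one smoothing; the four training reviews are invented for illustration, and nothing here is claimed to be the method used in the talk.

```python
import math
from collections import Counter

# Hypothetical labeled reviews standing in for "known data".
TRAIN = [
    ("the food was great and cheap", "good"),
    ("great service and friendly staff", "good"),
    ("the food was cold and expensive", "bad"),
    ("rude staff and slow service", "bad"),
]


def train(examples):
    """Count word frequencies per label and how many reviews carry each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Pick the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs above zero.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(classify("great food", model))
print(classify("slow and expensive", model))
```

The learned "features" are just per-label word frequencies: words like "great" pull toward good reviews, and words like "slow" or "expensive" pull toward bad ones.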
Sentiment analysis + parsing
Socher et al 2013: sentiment percolates up a
parse tree

This movie doesn't care about [anything good]
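
To make "percolates up a parse tree" concrete, here is a toy bottom-up scorer over hand-built binary trees. Socher et al. learn the composition function with a neural model; the fixed rule used here (a negator flips the score of its sister node, otherwise scores add) is a hand-written stand-in, and the lexicon and trees are hypothetical.

```python
# Hypothetical word polarities; words not listed count as neutral (0).
LEXICON = {"good": 1, "happy": 1, "expensive": -1, "bad": -1}
NEGATORS = {"not", "doesn't", "never"}


def sentiment(node):
    """Score a binary parse tree bottom-up.

    A leaf is a word (str); an internal node is a (left, right) tuple.
    """
    if isinstance(node, str):
        return LEXICON.get(node, 0)
    left, right = node
    if isinstance(left, str) and left in NEGATORS:
        return -sentiment(right)  # negation flips everything in its scope
    return sentiment(left) + sentiment(right)


# Hand-built trees for the two restaurant sentences used earlier in the talk.
not_happy = ("I'm", (("not", "happy"), ("the", ("food", ("was", "expensive")))))
happy = ("I'm", ("happy", ("the", ("food", ("was", ("not", "expensive"))))))

print(sentiment(not_happy), sentiment(happy))
```

The two sentences contain the same words but get opposite scores, because the tree determines what "not" applies to.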


Topic models
Want to bundle documents/words into groups
covering similar topics (Blei, Ng, & Jordan 03)
Intuition: Words appearing in the same
document are more likely to be related
Documents built by choosing topics
then choosing words from topics
Topic model infers the topics per document &
words per topic
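
The generative story can be sketched directly: for each word slot, choose a topic from the document's topic mixture, then choose a word from that topic. The topics and probabilities below are hypothetical, and the sketch covers only the generative direction; inferring topics from observed documents is the harder step and is not shown.

```python
import random

# Hypothetical topics: each maps words to probabilities that sum to 1.
TOPICS = {
    "computers": {"computer": 0.5, "internet": 0.3, "laptop": 0.2},
    "shopping": {"store": 0.4, "buy": 0.4, "price": 0.2},
}


def generate_document(topic_mix, length, rng):
    """Generate words by repeatedly choosing a topic, then a word from that topic."""
    topic_names = list(topic_mix)
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=[topic_mix[t] for t in topic_names])[0]
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so runs are repeatable
doc = generate_document({"computers": 0.8, "shopping": 0.2}, length=10, rng=rng)
print(doc)
```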
Buying a computer
"When it came time to upgrade our computer, when I had to figure out the meanings of solid-state drives and quad-cores, I headed to the Internet to do my research, finding the right stores and the right sites to answer my questions"

Computers: 45%
computer: 23%
internet: 14%
laptop: 12%
Shopping: 13%
store: 20%
buy: 19%
price: 11%
Research: 19%
Topic models
Good for general semantic classification
grouping news stories, blog posts, etc.
categorizing documents into known classes
Many extensions, not just text
timeseries data, author recognition
connecting text to images (Costa Pereira et al 13)
financial data (Doyle & Elkan 09)
Pompeiian households (Mimno 09)
Information extraction
Produces a structured representation of
information (knowledge base)
human-readable or machine-readable
information as relations between entities
throw(quarterback,pass)
within- or across-document learning
IE example: learning football
Hovy et al 2011: Unsupervised Discovery of
Domain-Specific Knowledge from Text
"The last time the Detroit Lions won a game in the Metrodome, Scott Mitchell threw a touchdown pass to Herman Moore"
throw(ScottMitchell,touchdown,HermanMoore)

"Big, young, talented and inexperienced, Scott Mitchell, the former backup quarterback for the Miami Dolphins, was in prime position to profit"
is.a(ScottMitchell,quarterback)

"Lions wide receiver Herman Moore reflects on the Detroit-Chicago rivalry"
is.a(HermanMoore,widereceiver)

Combined:
throw(QB,touchdown,WR)

IE example: learning football
Parse input using automated parser
Use parse + named entities to build semantic
structure
Use multiple levels of semantic representation
to identify general rules
Learn on 33,000 New York Times articles
95% sensible propositions extracted
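
The generalization step can be illustrated with plain tuples: once is.a facts are known for the entities in a throw fact, substituting each entity with its class yields a general rule. This is a sketch of the idea only, not Hovy et al.'s system, which learns from parsed text; the function names are made up for the example.

```python
# Facts written as tuples: (relation, argument, ...), following the slide's notation.
FACTS = [
    ("throw", "ScottMitchell", "touchdown", "HermanMoore"),
    ("is.a", "ScottMitchell", "quarterback"),
    ("is.a", "HermanMoore", "widereceiver"),
]


def generalize(fact, facts):
    """Replace each named entity in a fact with its is.a class, when one is known."""
    classes = {f[1]: f[2] for f in facts if f[0] == "is.a"}
    relation, *args = fact
    return (relation, *[classes.get(arg, arg) for arg in args])


def render(fact):
    """Format a tuple in the relation(arg,arg,...) style."""
    relation, *args = fact
    return f"{relation}({','.join(args)})"


rule = generalize(FACTS[0], FACTS)
print(render(rule))
```

Arguments with no known class, such as "touchdown" here, are left unchanged.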
Overview
Big data demands intelligent analysis
methods are out there already
plus new ones all the time
Think through the problem you want to solve
what data sources do you have?
what information would you ask for if you could?
what structure do you want to impose?
which method(s) yield that structure?
Computational methods summary
Automated parsing
basic step in structuring natural language data
"won't fail, will buy" vs. "will fail, won't buy"
key to extracting specific information
Word-sense disambiguation
basic step for assessing what's being discussed
toilet tank vs. military tank
makes sure you're looking at relevant data
Computational methods summary
Sentiment analysis
general emotional assessment
automatic ratings, user triage
noisy due to irony, sarcasm, etc.
Named-entity recognition
figuring out the lexicon
what do people talk about?
building knowledge of things
Computational methods summary
Topic models
document-level semantic classification
overall gist of an article
good for multimedia linkages
Information Extraction
specific semantic structures
Who's doing what to whom?
establishing rules & knowledge
Overall summary
Computational methods exist to structure
large-scale unstructured data
Identify what structure you want to get out
find the class of methods that develop such
structure
combine multiple methods if necessary
Test extensively!
lots of noise in unstructured data
Starting-Point References
NER: Derczynski & Bontcheva 2014, Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognizing Person Entities in Tweets
NER/MM: Berg, Berg, Edwards, & Forsyth 2004, Who's in the Picture?
Sentiment: Socher, Bauer, Manning, & Ng 2013, Parsing with Compositional Vector Grammars
IE: Hovy, Zhang, Hovy, & Peñas 2011, Unsupervised Discovery of Domain-Specific Knowledge from Text
