The Unreasonable Effectiveness of Data
Alon Halevy, Peter Norvig, and Fernando Pereira, Google
Eugene Wigner's article "The Unreasonable Effectiveness of Mathematics in the Natural Sciences"1 examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc². Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.2 Perhaps when it comes to natural language processing and related fields, we're doomed to complex theories that will never have the elegance of physics equations. But if that's so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

One of us, as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words.3 Since then, our field has seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long.4 In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.

Learning from Text at Web Scale

The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks; they are in fact much harder than tasks such as document classification that extract just a few bits of information from each document. The reason is that translation is a natural task routinely done every day for a real human need (think of the operations of the European Union or of news agencies). The same is true of speech transcription (think of closed-caption broadcasts). In other words, a large training set of the input-output behavior that we seek to automate is available to us in the wild. In contrast, traditional natural language processing problems such as document classification, part-of-speech tagging, named-entity recognition, or parsing are not routine tasks, so they have no large corpus available in the wild. Instead, a corpus for these tasks requires skilled human annotation. Such annotation is not only slow and expensive to acquire but also difficult for experts to agree on, being bedeviled by many of the difficulties we discuss later in relation to the Semantic Web. The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn't available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results5 or from the accumulated evidence of Web-based text patterns and formatted tables,6 in both cases without needing any manually annotated data.
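To make that closing claim concrete, here is a minimal Python sketch of learning "is-a" relationships from raw text with Hearst-style lexical patterns. It is only an illustration of the general idea, not the systems cited in references 5 and 6; the pattern list and example sentences are invented for the sketch. The point it tries to show is that with enough unannotated text, repeated matches accumulate into evidence, so no hand-labeled corpus is needed.

    import re
    from collections import Counter

    # Illustrative Hearst-style patterns: each captures a broader class
    # and one example instance of it ("countries such as China").
    PATTERNS = [
        re.compile(r"(\w+(?: \w+)?) such as (\w+)", re.IGNORECASE),
        re.compile(r"(\w+(?: \w+)?) including (\w+)", re.IGNORECASE),
        re.compile(r"(\w+(?: \w+)?),? especially (\w+)", re.IGNORECASE),
    ]

    def extract_isa_pairs(sentences):
        """Count (instance, class) pairs suggested by the patterns.

        Over Web-scale text, pairs that recur many times tend to be real
        relationships, while one-off matches are mostly noise; the raw
        counts are the accumulated evidence.
        """
        counts = Counter()
        for sentence in sentences:
            for pattern in PATTERNS:
                for cls, instance in pattern.findall(sentence):
                    counts[(instance.lower(), cls.lower())] += 1
        return counts

    if __name__ == "__main__":
        # Toy stand-in for a large crawled corpus.
        corpus = [
            "Large countries such as China have many dialects.",
            "Scripting languages including Python are popular for text processing.",
            "Search engines, especially Google, index billions of pages.",
            "Large countries such as China invest heavily in infrastructure.",
        ]
        for (instance, cls), n in extract_isa_pairs(corpus).most_common():
            print(f"{instance!r} is-a {cls!r}  (evidence: {n})")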
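Similarly, the trillion-word corpus described earlier is, at heart, a giant table of n-gram frequencies. The sketch below, again a toy illustration rather than how the released data was actually produced, shows what "frequency counts for all sequences up to five words long" amounts to when computed over a small sample of raw, unfiltered text.

    from collections import Counter

    def ngram_counts(lines, max_n=5):
        """Count all word sequences of length 1..max_n in raw text.

        The tokenizer is deliberately crude (lowercase, split on whitespace):
        Web-derived corpora come from unfiltered pages, so misspellings and
        sentence fragments end up in the counts rather than being cleaned away.
        """
        counts = Counter()
        for line in lines:
            tokens = line.lower().split()
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    counts[tuple(tokens[i:i + n])] += 1
        return counts

    if __name__ == "__main__":
        # Toy stand-in for crawled Web text, complete with a spelling error.
        sample = [
            "the unreasonable effectiveness of data",
            "the unreasonable effectivness of data",
            "more data beats clever algorithms",
        ]
        for gram, freq in ngram_counts(sample).most_common(5):
            print(" ".join(gram), freq)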