
Zipf's Law

Christopher McBryde
November 14, 2016

Agenda

Background
Zipf's and Zipf-Mandelbrot Laws
Non-English languages
Word categories
Models and explanations
Conclusion

Background

Who was George Kingsley Zipf?

Born 1902 in Freeport, IL
Received bachelor's, master's, and doctorate (comparative philology) from Harvard University
Professor of German and University Lecturer at Harvard
Strong proponent of interdisciplinary studies
Died 1950 of cancer

Interest in word frequency

Dissertation: Relative Frequency as a Determinant of Phonetic Change
Inspired by Lotka's law: the number of authors making n contributions in a given period is roughly 1/n² of the number making a single contribution
First alluded to in Selected Studies of the Principle of Relative Frequency in Language (1932)

Formulation of Zipf's Law

Final rank-frequency form in The Psycho-Biology of Language (1935)
"One can consider the words of a vocabulary as ranked in the order of their frequency… We can indicate on the abscissa of a double logarithmic chart the number of the word in the series and on the ordinate its frequency"
Demonstrated on word frequency data from James Joyce's Ulysses

Zipf's and Zipf-Mandelbrot Laws

Zipf's Law

Rank the words of a corpus by frequency:
The second-ranked word appears half as often as the most common word, the third-ranked word one third as often, and so on
Zipfian laws exist for other applications:
City population sizes
Income rankings
TV ratings

Zipf's Law

f(r) ∝ 1/r (original form)

f(r) ∝ 1/r^α (general form)
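The general form above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the constant and exponent are free parameters):

```python
# Zipf's law: the frequency of the r-th most common word falls off as a
# power of its rank, f(r) = c / r**alpha (alpha = 1 in the original form).

def zipf_frequency(rank, alpha=1.0, c=1.0):
    """Predicted frequency for a word of the given rank."""
    return c / rank ** alpha

# With alpha = 1, rank 2 gets half the top frequency, rank 3 one third,
# rank 4 one quarter, and so on.
ratios = [zipf_frequency(r) / zipf_frequency(1) for r in (2, 3, 4)]
```

Larger values of α steepen the decay; the fitted exponents reported later in the deck differ by language and by category.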

Zipf-Mandelbrot Law

Extends Zipf's law with the shift parameter q:

f(r) ∝ 1 / (r + q)^α

When the limit is taken as the number of ranks N goes to infinity, the normalizing sum becomes the Hurwitz zeta function ζ(α, q + 1)
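A sketch of the Zipf-Mandelbrot distribution over a finite vocabulary (the values α = 1.1 and q = 2.7 are illustrative assumptions, not fits from the slides):

```python
# Zipf-Mandelbrot: f(r) proportional to 1 / (r + q)**alpha. The shift q
# flattens the curve at low ranks; alpha controls the tail.

def zm_weights(n_ranks, alpha=1.1, q=2.7):
    return [1.0 / (r + q) ** alpha for r in range(1, n_ranks + 1)]

def zm_distribution(n_ranks, alpha=1.1, q=2.7):
    weights = zm_weights(n_ranks, alpha, q)
    # For alpha > 1 this normalizer converges, as n_ranks grows, to the
    # Hurwitz zeta function zeta(alpha, q + 1).
    z = sum(weights)
    return [w / z for w in weights]

probs = zm_distribution(1000)
```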

Non-English Languages

Does Zipf's law hold for other languages?

"How do we use language? Shared patterns in the frequency of word use across 17 world languages", Calude and Pagel
17 world languages across 6 language families, with one isolate and one creole:
9 Indo-European (English, Greek, etc.)
1 Sino-Tibetan (Chinese)
2 Uralic (Finnish and Estonian)
1 Niger-Congo (Swahili)
1 Altaic (Turkish)
1 Austronesian (Maori)
1 isolate (Basque)
1 creole (Tok Pisin)

Methodology

Used corpora from various academic and national sources
To compare apples to apples, word frequency was compared for 200 words with common meanings
Called a Swadesh list (Swadesh 1952)
Avoids technical and environment-specific terms
The word in each language carrying each of the 200 meanings was compared against its corpus to determine its relative frequency
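One simple way to estimate a power-law exponent from such relative-frequency data is least squares on log-log axes. This is only a simplified stand-in: Calude and Pagel fit the full Zipf-Mandelbrot form, and the function name here is invented for illustration.

```python
import numpy as np

def fit_zipf_exponent(frequencies):
    """Estimate alpha from log f = log c - alpha * log r by least squares."""
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Exactly Zipfian input (f = 1000 / r) recovers an exponent of 1.
alpha_hat = fit_zipf_exponent([1000.0 / r for r in range(1, 201)])
```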

Results

Frequency rank is fixed across all languages: rank ordering is determined by aggregate rank, not rank within each language
Blue dots are word frequencies, the grey line is a locally smoothed regression, and the red line is the fit to Zipf-Mandelbrot

Closer look

Closer look
Language     α coefficient   β coefficient   R-squared   Adj. R-squared
Spanish      1.84            3.81            0.88        1.00
Russian      1.88            5.02            0.88        1.00
Greek        1.17            -0.45           0.83        0.97
Portuguese   3.68            26.26           0.87        1.00
Chinese      1.46            3.13            0.73        0.97
Swahili      0.79            -0.66           0.59        0.89

Conclusion

All of these languages conform fairly well to Zipf-Mandelbrot, but not in the same way (different coefficients)
Indo-European languages fit more closely than Chinese or Swahili
Frequency is systematically related to meaning

Word Categories

Can meaning predict word frequency?

Analysis of number words by Dehaene and Mehler (1992)
Figures show frequency versus cardinality, not rank, for English (top), Russian (middle), and Italian (bottom)
Frequency is closer to an inverse square law (α = 2)
Number word frequency is predictable from meaning

Do other factors contribute to word frequency?

Compare words with roughly the same meaning
Plot word frequency in the American National Corpus within different categories:
Taboo words
Sex (gerund)
Feces
Months
Planets
Elements

Results

[Figure: rank-frequency plots for taboo (feces), taboo (sex), months, planets, and elements]

Results
Category        α coefficient   β coefficient   R-squared   Adj. R-squared
Taboo (sex)     1.84            0.65            0.95        0.99
Taboo (feces)   3.75            8.48            0.92        1.00
Months          1.54            9.43            0.97        0.97
Planets         11.65           13.49           0.95        1.00
Elements        3.53            17.17           0.91        0.94

Additional factors are at play beyond meaning alone
Near-Zipfian distribution even when categories are constrained by the natural world, such as the phases of the moon

Does Zipf's law apply to other levels of language?

Analysis of syntactic categories
Distribution of part-of-speech tags within each category:
Determiners (DT)
Prepositions (IN)
Modals (MD)
Singular or mass nouns (NN)
Past participle verbs (VBN)
3rd person singular present tense verbs (VBZ)

Results

Results
Category                    α coefficient   β coefficient   R-squared   Adj. R-squared
Determiners                 2.15            0.21            0.91        0.93
Prepositions                2.37            4.25            0.95        0.97
Modals                      119.60          458.71          0.86        0.88
Sing./mass nouns            1.15            77.06           0.86        0.96
Past participles            0.82            -0.27           0.84        0.96
3rd person singular verbs   1.04            -0.81           0.81        0.93

Zipf's Law holds for novel words

25 subjects were recruited through Mechanical Turk and given a prompt
The relative frequency of the novel words was analyzed in the completed stories

Prompt: "An alien spaceship crashes in the Nevada desert. Eight creatures emerge: a Wug, a Plit, a Blicket, a Flark, a Warit, a Jupe, a Ralex, and a Timon. In at least 2000 words, describe what happens next."

Results
The experiment was designed to bias the subjects as little as possible
Still, the results seem to follow a power-law distribution

Models and
Explanations

What are some of the possible explanations for Zipf's law?

Random typing
Simple stochastic models
Semantic accounts
Communicative accounts
Universality

Random typing
Theory: Zipf's law is a statistical artifact
Test: divide the corpus by a character other than the space, such as "e"
Result: a near-Zipfian distribution
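The random-typing model can be reproduced in miniature (alphabet size, space probability, and text length here are arbitrary choices for illustration):

```python
import random
from collections import Counter

# "Monkey typing": emit a space with some probability, otherwise a uniformly
# random letter. Splitting on spaces yields "words" whose rank-frequency
# curve is already heavily skewed, near-Zipfian in shape.

def monkey_text(n_chars, alphabet="abcd", p_space=0.2, seed=0):
    rng = random.Random(seed)
    return "".join(
        " " if rng.random() < p_space else rng.choice(alphabet)
        for _ in range(n_chars)
    )

word_counts = Counter(monkey_text(200_000).split())
freqs = sorted(word_counts.values(), reverse=True)
# The top "word" is far more frequent than the median word: a steep,
# heavily skewed curve, despite there being no language model at all.
```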

Random Typing

Random monkey processes do not accurately reflect real language
Example: e-divided word lengths decay exponentially, but in real language the decay is not even monotonic (D. Manin 2009)

Simple stochastic models

Preferential re-use will lead to a very skewed frequency distribution, since frequent words will tend to get re-used even more
Can only prove sufficiency: if usage is stochastic in this way, then Zipf's law results
It may explain novel words, but doesn't fully explain the connection between meaning and frequency
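Preferential re-use can be sketched as a Simon-style stochastic process (the innovation probability p_new and the token count are illustrative assumptions):

```python
import random
from collections import Counter

# Simon-style preferential re-use: with probability p_new introduce a brand
# new word; otherwise repeat a previously used token chosen uniformly, which
# re-uses each word in proportion to its current frequency.

def simon_process(n_tokens, p_new=0.1, seed=1):
    rng = random.Random(seed)
    tokens = [0]      # the first token introduces word 0
    next_word = 1
    for _ in range(n_tokens - 1):
        if rng.random() < p_new:
            tokens.append(next_word)
            next_word += 1
        else:
            tokens.append(rng.choice(tokens))
    return tokens

counts = Counter(simon_process(50_000))
freqs = sorted(counts.values(), reverse=True)
# Heavy tail: a few very frequent words, many words used only once or twice.
```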

Semantic accounts

Theory: Zipf's law results from the labeling of a semantic hierarchy and a pressure to avoid synonymy (D. Manin 2008)
Evidence in distributions where meaning is restricted
However, the behavioral experiment yields a near-Zipfian distribution even with meanings unknown
Zipf's law can't be fully explained by semantics

Communicative accounts

Theory: Zipf's law results from an optimization of effort for the speaker and the listener (Ferrer i Cancho & Solé, 2003)
Speaker's effort is proportional to the diversity of signals that need to be conveyed
Listener's effort is proportional to the expected entropy over referents given a word
The model returns a Zipfian distribution at the trade-off parameter value λ = 0.41


Drawbacks

Violates Spearman's principle of statistical modeling
Zipf's law still holds for highly constrained words
What does optimization mean for storytelling?

Universality

Theory: Zipf's law results from a universal pressure rather than from psychology or statistics
Derived from mathematical principles:
Algorithmic information theory (Corominas-Murtra and Solé 2010)
Kolmogorov complexity and Levin's probability distribution (Y. I. Manin 2014)
Entropy-maximizing processes (S. A. Frank 2009)
These theories have not been used to make novel predictions or to explain variations in parameters between categories or languages

Conclusions

There is a profusion of theories trying to explain Zipf's law
However, little distinguishes them in terms of scientific methodology
Several ways to derive the law, but no novel predictions
Other human processes

Primary References

Piantadosi, S. T. (2014). Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112-1130.
Calude, A. S., & Pagel, M. (2011). How do we use language? Shared patterns in the frequency of word use across 17 world languages. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 366(1567), 1101-1107.

Secondary References

Zipf, G. K. (1932). Selected Studies of the Principle of Relative Frequency in Language.
Zipf, G. K. (1935). The Psycho-Biology of Language.
Swadesh, M. (1952). Lexico-statistic dating of prehistoric ethnic contacts: with special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society, 96(4), 452-463.
Dehaene, S., & Mehler, J. (1992). Cross-linguistic regularities in the frequency of number words. Cognition, 43(1), 1-29.
Manin, D. Y. (2009). Mandelbrot's model for Zipf's law: Can Mandelbrot's model explain Zipf's law for language? Journal of Quantitative Linguistics, 16(3), 274-285.
Manin, D. Y. (2008). Zipf's law and avoidance of excessive synonymy. Cognitive Science, 32(7), 1075-1098.
Ferrer i Cancho, R., & Solé, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences, 100(3), 788-791.
Corominas-Murtra, B., & Solé, R. V. (2010). Universality of Zipf's law. Physical Review E, 82(1), 011102.
Manin, Y. I. (2014). Zipf's law and L. Levin probability distributions. Functional Analysis and Its Applications, 48(2), 116-127.
Frank, S. A. (2009). The common patterns of nature. Journal of Evolutionary Biology, 22(8), 1563-1585.
