
Zipf's Law

Christopher McBryde
November 14, 2016

Agenda

Background
Zipf's and Zipf-Mandelbrot Laws
Non-English languages
Word categories
Models and explanations
Conclusion

Background

Who was George Kingsley Zipf?

Born 1902 in Freeport, IL
Received bachelor's, master's, and doctorate (comparative philology) from Harvard University
Professor of German and University Lecturer at Harvard
Strong proponent of interdisciplinary studies
Died 1950 of cancer

Interest in word frequency

Dissertation: Relative Frequency as a Determinant of Phonetic Change
Inspired by Lotka's law: the number of authors making n contributions in a given period is roughly 1/n² of the number making a single contribution
First alluded to in Selected Studies of the Principle of Relative Frequency in Language (1932)

Formulation of Zipf's Law

Final rank-frequency form in The Psycho-Biology of Language (1935)
"One can consider the words of a vocabulary as ranked in the order of their frequency… We can indicate on the abscissa of a double logarithmic chart the number of the word in the series and on the ordinate its frequency"
Demonstrated on word frequency data from James Joyce's Ulysses

Zipf's and Zipf-Mandelbrot Laws

Zipf's Law

Rank the words of a corpus by frequency:
The second-ranked word appears half as often as the most common word, the third-ranked word one third as often, and so on
Zipfian laws exist for other applications:
City population sizes
Income rankings
TV ratings

Zipf's Law

f(r) ∝ 1/r (original form)

f(r) ∝ 1/r^α (general form)
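The general form above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the constant and exponent are free parameters):

```python
# Zipf's law: the frequency of the r-th most common word falls off as a
# power of its rank, f(r) = c / r**alpha (alpha = 1 in the original form).

def zipf_frequency(rank, alpha=1.0, c=1.0):
    """Predicted frequency for a word of the given rank."""
    return c / rank ** alpha

# With alpha = 1, rank 2 gets half the top frequency, rank 3 one third,
# rank 4 one quarter, and so on.
ratios = [zipf_frequency(r) / zipf_frequency(1) for r in (2, 3, 4)]
```

Larger values of α steepen the decay; the fitted exponents reported later in the deck differ by language and by category.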

Zipf-Mandelbrot Law

Extends Zipf's law with the shift parameter q:

f(r) ∝ 1 / (r + q)^α

When the limit is taken as the number of ranks N goes to infinity, the normalizing sum becomes the Hurwitz zeta function ζ(α, q + 1)
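A sketch of the Zipf-Mandelbrot distribution over a finite vocabulary (the values α = 1.1 and q = 2.7 are illustrative assumptions, not fits from the slides):

```python
# Zipf-Mandelbrot: f(r) proportional to 1 / (r + q)**alpha. The shift q
# flattens the curve at low ranks; alpha controls the tail.

def zm_weights(n_ranks, alpha=1.1, q=2.7):
    return [1.0 / (r + q) ** alpha for r in range(1, n_ranks + 1)]

def zm_distribution(n_ranks, alpha=1.1, q=2.7):
    weights = zm_weights(n_ranks, alpha, q)
    # For alpha > 1 this normalizer converges, as n_ranks grows, to the
    # Hurwitz zeta function zeta(alpha, q + 1).
    z = sum(weights)
    return [w / z for w in weights]

probs = zm_distribution(1000)
```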

Non-English Languages

Does Zipf's law hold for other languages?

"How do we use language? Shared patterns in the frequency of word use across 17 world languages", Calude and Pagel
17 world languages across 6 language families, with one isolate and one creole:
9 Indo-European (English, Greek, etc.)
1 Sino-Tibetan (Chinese)
2 Uralic (Finnish and Estonian)
1 Niger-Congo (Swahili)
1 Altaic (Turkish)
1 Austronesian (Maori)
1 isolate (Basque)
1 creole (Tok Pisin)

Methodology

Used corpora from various academic and national sources
To compare apples to apples, word frequency was compared for 200 words with common meanings
Called a Swadesh list (Swadesh 1952)
Avoids technical and environment-specific terms
The word in each language carrying each of the 200 meanings was compared against its corpus to determine its relative frequency
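One simple way to estimate a power-law exponent from such relative-frequency data is least squares on log-log axes. This is only a simplified stand-in: Calude and Pagel fit the full Zipf-Mandelbrot form, and the function name here is invented for illustration.

```python
import numpy as np

def fit_zipf_exponent(frequencies):
    """Estimate alpha from log f = log c - alpha * log r by least squares."""
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Exactly Zipfian input (f = 1000 / r) recovers an exponent of 1.
alpha_hat = fit_zipf_exponent([1000.0 / r for r in range(1, 201)])
```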

Results

Frequency rank is fixed across all languages: rank ordering is determined by aggregate rank, not rank within each language
Blue dots are word frequencies, the grey line is a locally smoothed regression, and the red line is the fit to Zipf-Mandelbrot

Closer look

Closer look
Language     α coefficient   β coefficient   R-squared   Adj. R-squared
Spanish      1.84            3.81            0.88        1.00
Russian      1.88            5.02            0.88        1.00
Greek        1.17            -0.45           0.83        0.97
Portuguese   3.68            26.26           0.87        1.00
Chinese      1.46            3.13            0.73        0.97
Swahili      0.79            -0.66           0.59        0.89

Conclusion

All of these languages conform fairly well to Zipf-Mandelbrot, but not in the same way (different coefficients)
Indo-European languages fit more closely than Chinese or Swahili
Frequency is systematically related to meaning

Word Categories

Can meaning predict word frequency?

Analysis of number words by Dehaene and Mehler (1992)
Figures show frequency versus cardinality, not rank, for English (top), Russian (middle), and Italian (bottom)
Frequency is closer to an inverse square law (α = 2)
Number word frequency is predictable from meaning

Do other factors contribute to word frequency?

Compare words with roughly the same meaning
Plot word frequency in the American National Corpus within different categories:
Taboo words
Sex (gerund)
Feces
Months
Planets
Elements

Results

[Figure: rank-frequency plots for taboo (feces), taboo (sex), months, planets, and elements]

Results
Category        α coefficient   β coefficient   R-squared   Adj. R-squared
Taboo (sex)     1.84            0.65            0.95        0.99
Taboo (feces)   3.75            8.48            0.92        1.00
Months          1.54            9.43            0.97        0.97
Planets         11.65           13.49           0.95        1.00
Elements        3.53            17.17           0.91        0.94

Additional factors are at play beyond meaning alone
Near-Zipfian distribution even when categories are constrained by the natural world, such as the phases of the moon

Does Zipf's law apply to other levels of language?

Analysis of syntactic categories
Distribution of part-of-speech tags within each category:
Determiners (DT)
Prepositions (IN)
Modals (MD)
Singular or mass nouns (NN)
Past participle verbs (VBN)
3rd person singular present tense verbs (VBZ)

Results

Results
Category                    α coefficient   β coefficient   R-squared   Adj. R-squared
Determiners                 2.15            0.21            0.91        0.93
Prepositions                2.37            4.25            0.95        0.97
Modals                      119.60          458.71          0.86        0.88
Sing./mass nouns            1.15            77.06           0.86        0.96
Past participles            0.82            -0.27           0.84        0.96
3rd person singular verbs   1.04            -0.81           0.81        0.93

Zipf's Law holds for novel words

25 subjects were recruited through Mechanical Turk and given a prompt
The relative frequency of the novel words was analyzed in the completed stories

Prompt: "An alien spaceship crashes in the Nevada desert. Eight creatures emerge: a Wug, a Plit, a Blicket, a Flark, a Warit, a Jupe, a Ralex, and a Timon. In at least 2000 words, describe what happens next."

Results
The experiment was designed to bias the subjects as little as possible
Still, the results seem to follow a power-law distribution

Models and
Explanations

What are some of the possible explanations for Zipf's law?

Random typing
Simple stochastic models
Semantic accounts
Communicative accounts
Universality

Random typing
Theory: Zipf's law is a statistical artifact
Test: divide the corpus by a character other than the space, such as "e"
Result: a near-Zipfian distribution
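The random-typing model can be reproduced in miniature (alphabet size, space probability, and text length here are arbitrary choices for illustration):

```python
import random
from collections import Counter

# "Monkey typing": emit a space with some probability, otherwise a uniformly
# random letter. Splitting on spaces yields "words" whose rank-frequency
# curve is already heavily skewed, near-Zipfian in shape.

def monkey_text(n_chars, alphabet="abcd", p_space=0.2, seed=0):
    rng = random.Random(seed)
    return "".join(
        " " if rng.random() < p_space else rng.choice(alphabet)
        for _ in range(n_chars)
    )

word_counts = Counter(monkey_text(200_000).split())
freqs = sorted(word_counts.values(), reverse=True)
# The top "word" is far more frequent than the median word: a steep,
# heavily skewed curve, despite there being no language model at all.
```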

Random Typing

Random monkey processes do not accurately reflect real language
Example: e-divided word lengths decay exponentially, but in real language the decay is not even monotonic (D. Manin 2009)

Simple stochastic models

Preferential re-use will lead to a very skewed frequency distribution, since frequent words will tend to get re-used even more
Can only prove sufficiency: if usage is stochastic in this way, then Zipf's law results
It may explain novel words, but doesn't fully explain the connection between meaning and frequency
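Preferential re-use can be sketched as a Simon-style stochastic process (the innovation probability p_new and the token count are illustrative assumptions):

```python
import random
from collections import Counter

# Simon-style preferential re-use: with probability p_new introduce a brand
# new word; otherwise repeat a previously used token chosen uniformly, which
# re-uses each word in proportion to its current frequency.

def simon_process(n_tokens, p_new=0.1, seed=1):
    rng = random.Random(seed)
    tokens = [0]      # the first token introduces word 0
    next_word = 1
    for _ in range(n_tokens - 1):
        if rng.random() < p_new:
            tokens.append(next_word)
            next_word += 1
        else:
            tokens.append(rng.choice(tokens))
    return tokens

counts = Counter(simon_process(50_000))
freqs = sorted(counts.values(), reverse=True)
# Heavy tail: a few very frequent words, many words used only once or twice.
```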

Semantic accounts

Theory: Zipf's law results from the labeling of a semantic hierarchy and a pressure to avoid synonymy (D. Manin 2008)
Evidence in distributions where meaning is restricted
However, the behavioral experiment yields a near-Zipfian distribution even with meanings unknown
Zipf's law can't be fully explained by semantics

Communicative accounts

Theory: Zipf's law results from an optimization of effort for the speaker and the listener (Ferrer i Cancho & Solé, 2003)
Speaker's effort is proportional to the diversity of signals that need to be conveyed
Listener's effort is proportional to the expected entropy over referents given a word
The model returns a Zipfian distribution at the trade-off parameter value λ = 0.41


Drawbacks

Violates Spearman's principle of statistical modeling
Zipf's law still holds for highly constrained words
What does optimization mean for storytelling?

Universality

Theory: Zipf's law results from a universal pressure rather than from psychology or statistics
Derived from mathematical principles:
Algorithmic information theory (Corominas-Murtra and Solé 2010)
Kolmogorov complexity and Levin's probability distribution (Y. I. Manin 2014)
Entropy-maximizing processes (S. A. Frank 2009)
These theories have not been used to make novel predictions or to explain variations in parameters between categories or languages

Conclusions

There is a profusion of theories trying to explain Zipf's law
However, little distinguishes them in terms of scientific methodology
Several ways to derive the law, but no novel predictions
Other human processes

Primary References

Piantadosi, S. T. (2014). Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112-1130.
Calude, A. S., & Pagel, M. (2011). How do we use language? Shared patterns in the frequency of word use across 17 world languages. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 366(1567), 1101-1107.

Secondary References

Zipf, G. K. (1932). Selected Studies of the Principle of Relative Frequency in Language.
Zipf, G. K. (1935). The Psycho-Biology of Language.
Swadesh, M. (1952). Lexico-statistic dating of prehistoric ethnic contacts: with special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society, 96(4), 452-463.
Dehaene, S., & Mehler, J. (1992). Cross-linguistic regularities in the frequency of number words. Cognition, 43(1), 1-29.
Manin, D. Y. (2009). Mandelbrot's model for Zipf's law: Can Mandelbrot's model explain Zipf's law for language? Journal of Quantitative Linguistics, 16(3), 274-285.
Manin, D. Y. (2008). Zipf's law and avoidance of excessive synonymy. Cognitive Science, 32(7), 1075-1098.
Ferrer i Cancho, R., & Solé, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences, 100(3), 788-791.
Corominas-Murtra, B., & Solé, R. V. (2010). Universality of Zipf's law. Physical Review E, 82(1), 011102.
Manin, Y. I. (2014). Zipf's law and L. Levin probability distributions. Functional Analysis and Its Applications, 48(2), 116-127.
Frank, S. A. (2009). The common patterns of nature. Journal of Evolutionary Biology, 22(8), 1563-1585.
