Christopher McBryde
November 14, 2016
Agenda
Background
Zipf's and Zipf-Mandelbrot Laws
Non-English languages
Word categories
Models and explanations
Conclusion
Background
Zipf's and Zipf-Mandelbrot Laws
Zipf's Law (original form)
Zipf's Law (general form)
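The equations for these two forms are not preserved in the slide text. As a sketch, the forms commonly given in the literature are shown below; the symbols f (frequency), r (frequency rank), and \alpha (exponent) are assumptions here, not necessarily the notation used on the original slides.

```latex
% Original form: frequency is inversely proportional to rank
f(r) \propto \frac{1}{r}

% General form: the exponent \alpha is a free parameter,
% typically close to 1 for natural-language word frequencies
f(r) \propto \frac{1}{r^{\alpha}}
```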
Zipf-Mandelbrot Law
Extends Zipf's law with the parameter q
In the limit as N goes to infinity, the normalizing sum becomes the Hurwitz zeta function
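The formula itself is missing from the slide text. A common way to write the Zipf-Mandelbrot law over N ranked items is sketched below. The symbols r, \alpha, and H are assumptions; only q and N appear in the slide. Setting q = 0 recovers the general form of Zipf's law.

```latex
% Zipf-Mandelbrot law with shift parameter q and exponent \alpha
f(r; N, q, \alpha) = \frac{1}{H_{N,q,\alpha}} \cdot \frac{1}{(r + q)^{\alpha}},
\qquad
H_{N,q,\alpha} = \sum_{i=1}^{N} \frac{1}{(i + q)^{\alpha}}

% Limit of the normalizing sum (converges for \alpha > 1), using the
% convention \zeta(s, a) = \sum_{n=0}^{\infty} (n + a)^{-s}
\lim_{N \to \infty} H_{N,q,\alpha}
  = \sum_{i=1}^{\infty} \frac{1}{(i + q)^{\alpha}}
  = \zeta(\alpha, q + 1)
```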
Non-English Languages
Methodology
Results
Closer look
Language    Coefficient  Coefficient  R-squared  Adj. R-squared
Spanish     1.84         3.81         0.88       1.00
Russian     1.88         5.02         0.88       1.00
Greek       1.17         -0.45        0.83       0.97
Portuguese  3.68         26.26        0.87       1.00
Chinese     1.46         3.13         0.73       0.97
Swahili     0.79         -0.66        0.59       0.89
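The slides do not preserve how the coefficients and R-squared values were estimated, or the symbols naming the two coefficient columns. As a rough illustration only, the sketch below fits the single exponent of the general Zipf form by ordinary least squares in log-log space and reports R-squared. The function name `fit_power_law` and the synthetic counts are hypothetical, and this one-parameter fit is an assumption, not the procedure behind the table.

```python
import math

def fit_power_law(frequencies):
    """Fit log f = log c - alpha * log r by ordinary least squares.

    `frequencies` must be positive; they are sorted so that rank 1 is the
    most frequent item. Returns (alpha, r_squared).
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of variance in log-frequency explained by the line
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return -slope, 1.0 - ss_res / ss_tot

# Synthetic (made-up) counts that follow f(r) = 1000 / r exactly
counts = [1000.0 / r for r in range(1, 51)]
alpha, r_squared = fit_power_law(counts)
print(round(alpha, 3), round(r_squared, 3))
```

On real corpora, estimating frequency and rank from the same sample can inflate the apparent fit, so a log-log regression like this is best treated as a first look rather than a definitive test.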
Conclusion
Word Categories
Results
Taboo (feces)
Taboo (sex)
Months
Planets
Elements
Results
Category       Coefficient  Coefficient  R-squared  Adj. R-squared
Taboo (sex)    1.84         0.65         0.95       0.99
Taboo (feces)  3.75         8.48         0.92       1.00
Months         9.43         1.54         0.97       0.97
Planets        11.65        13.49        0.95       1.00
Elements       3.53         17.17        0.91       0.94
More factors are at play than just meaning
Near-Zipfian distribution even when categories are constrained by the natural world, such as the phases of the moon
Results
Category                   Coefficient  Coefficient  R-squared  Adj. R-squared
Determiners                2.15         0.21         0.91       0.93
Prepositions               2.37         4.25         0.95       0.97
Modals                     119.60       458.71       0.86       0.88
Sing/mass nouns            1.15         77.06        0.86       0.96
Past participle            0.82         -0.27        0.84       0.96
3rd person singular verbs  1.04         -0.81        0.81       0.93
25 subjects were recruited through Mechanical Turk and given a prompt:
"An alien spaceship crashes in the Nevada desert. Eight creatures emerge: a Wug, a Plit, a Blicket, a Flark, a Warit, a Jupe, a Ralex, and a Timon. In at least 2000 words, describe what happens next."
The relative frequency of these novel words was analyzed in the completed stories
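The counting step can be sketched as follows. This is a minimal illustration, not the analysis code behind the slides: the function name `novel_word_frequencies`, the crude plural handling, and the two sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

# The eight novel creature names from the prompt, lowercased
NOVEL_WORDS = {"wug", "plit", "blicket", "flark", "warit", "jupe", "ralex", "timon"}

def novel_word_frequencies(stories):
    """Return (word, relative frequency) pairs, most frequent first."""
    counts = Counter()
    for story in stories:
        for token in re.findall(r"[a-z]+", story.lower()):
            # Crude plural handling (an assumption): "wugs" counts as "wug"
            if token.endswith("s") and token[:-1] in NOVEL_WORDS:
                token = token[:-1]
            if token in NOVEL_WORDS:
                counts[token] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [(word, n / total) for word, n in counts.most_common()]

# Made-up stand-ins for the subjects' stories
stories = ["The Wug ran. The Wug met a Plit.", "Wugs like the Blicket."]
freqs = novel_word_frequencies(stories)
print(freqs)
```

The resulting list is already in rank order, so it can be plotted directly as frequency against rank.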
Results
The experiment was designed to bias the subjects as little as possible
Still, the results seem to follow a power-law distribution
Models and Explanations
Random typing
Simple stochastic models
Semantic accounts
Communicative accounts
Universality
Random typing
Theory: Zipf's law is a statistical artifact
Test: Divide the corpus by a character other than the space, such as "e"
Results: A near-Zipfian distribution
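The test above can be sketched in a few lines. The helper name `rank_frequency` and the toy strings are hypothetical; a real run would use a full corpus in place of the sample text.

```python
from collections import Counter

def rank_frequency(text, delimiter):
    """Split `text` on `delimiter` and return (token, count) pairs by rank."""
    tokens = [t for t in text.split(delimiter) if t]  # drop empty tokens
    return Counter(tokens).most_common()

# Toy strings only; substitute a real corpus to reproduce the test
sample = "the cat sees the tree near the sea"
by_space = rank_frequency(sample, " ")  # ordinary words
by_e = rank_frequency(sample, "e")      # "words" delimited by the letter e
print(by_space[:3])
print(by_e[:3])
```

Plotting count against rank on log-log axes for both lists is one way to compare how close each segmentation comes to a Zipfian curve.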
Random Typing
Random "monkey" processes do not accurately reflect real language
Example: the frequency of "e"-divided words decays exponentially with length, but in real language it's not even monotonic (D. Manin 2009)
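A small simulation illustrates the exponential decay in a random typing process. This is a sketch under simple assumptions (a uniform five-symbol alphabet including the space, and a fixed seed), not a reproduction of the cited analysis; `monkey_text` and `length_distribution` are hypothetical names.

```python
import random
from collections import Counter

def monkey_text(n_chars, alphabet="abcd ", seed=0):
    """Generate text by picking characters uniformly at random."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return "".join(rng.choice(alphabet) for _ in range(n_chars))

def length_distribution(text):
    """Count how many space-delimited words have each length."""
    return Counter(len(word) for word in text.split())

lengths = length_distribution(monkey_text(200_000))
for length in range(1, 6):
    print(length, lengths[length])
```

With each keystroke having a 1-in-5 chance of being a space, the expected count falls by a factor of about 4/5 for each additional letter. Real word-length distributions instead rise to a peak before falling.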
Semantic accounts
Communicative accounts
conveyed
Listener's effort is proportional to the expected entropy over referents given a word
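One way to write this quantity is sketched below. The notation is an assumption, since the slide gives no formula: p(w) is the probability of word w, p(r | w) is the probability of referent r given that word, and the proportionality constant is left unspecified.

```latex
% Expected entropy over referents R given a word w
\text{listener's effort} \;\propto\;
\mathbb{E}_{w}\big[ H(R \mid w) \big]
  = -\sum_{w} p(w) \sum_{r} p(r \mid w) \log p(r \mid w)
```

Under this reading, a word that maps to a single referent contributes zero effort, while a highly ambiguous word contributes more.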
Universality
Manin 2013)
Entropy-maximizing processes (S. A. Frank 2009)
Conclusions
Primary References
Secondary References