
RAPID COMMUNICATIONS

PHYSICAL REVIEW E 79, 035102(R) (2009)

Level statistics of words: Finding keywords in literary texts and symbolic sequences

P. Carpena,¹ P. Bernaola-Galván,¹ M. Hackenberg,² A. V. Coronado,¹ and J. L. Oliver³

¹Departamento de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
²Bioinformatics Group, CIC bioGUNE, Technology Park of Bizkaia, 48160 Derio, Bizkaia, Spain
³Departamento de Genética, Universidad de Granada, 18071 Granada, Spain
(Received 21 May 2008; published 10 March 2009)
Using a generalization of the level-statistics analysis of quantum disordered systems, we present an approach able to automatically extract keywords from literary texts. Our approach takes into account not only the frequencies of the words present in the text but also their spatial distribution along the text, and it is based on the fact that relevant words are significantly clustered (i.e., they attract each other), while irrelevant words are distributed randomly in the text. Since a reference corpus is not needed, our approach is especially suitable for single documents for which no a priori information is available. In addition, we show that our method also works on generic symbolic sequences (continuous texts without spaces), thus suggesting its general applicability.

DOI: 10.1103/PhysRevE.79.035102    PACS number(s): 89.20.−a, 89.65.−s, 89.75.Fb, 05.50.+q

Statistical keyword extraction is a critical step in information science, with multiple applications in text-mining and information-retrieval systems [1]. Since Luhn [2] proposed the analysis of the frequency of occurrence of words in a text as a method for keyword extraction, many refinements have been developed. With a few exceptions [3], the basic principle for keyword extraction is the comparison to a corpus of documents taken as a reference. For a collection of documents, modern term-weighting schemes use the frequency of a term in a document and the proportion of documents containing that term [4]. Following a different approach, the probabilistic model of information retrieval related the significance of a term to its frequency fluctuations between documents [5–7]. The frequency-analysis approach to detect keywords seems to work properly in this context.

However, a more general approach should try to detect keywords in a single text without knowing a priori the subject of the text, i.e., without using a reference corpus. The applications of such an algorithm are clear: internet searches, data mining, automatic classification of documents, etc. In this case, the information provided by the frequency of a word is not very useful, since there are no other texts to compare with. In addition, such frequency analysis is of little use in a single document for two main reasons: (i) two words with very different relevance in the text can have a similar frequency (see Fig. 1), and (ii) a randomization of the text preserves the frequency values but destroys the information, which must also be stored in the ordering of the words and not only in the words themselves. Thus, to detect keywords, we propose the use of the spatial distribution of the words along the text and not only their frequencies, in order to take into account the structure of the text as well as its composition.

Inspired by the level statistics of quantum disordered systems following random matrix theory [8], Ortuño et al. [9] have shown that the spatial distribution of a relevant word in a text is very different from that corresponding to a nonrelevant word. In this approach, each occurrence of a particular word is considered as an "energy level" eᵢ within an "energy spectrum" formed by all the occurrences of the analyzed word within the text. The value of any energy level eᵢ is given simply by the position of the analyzed word in the text. For example, in the sentence "A great scientist must be a good teacher and a good researcher," the spectrum corresponding to the word "a" is formed by three energy levels (1, 6, 10). Figure 1 shows an example from a real book.

FIG. 1. Spectra of the words "Quixote" (288 occurrences) and "but" (248 occurrences) obtained in the first 50 000 words of the book Don Quixote, by Miguel de Cervantes. Both words have a similar frequency in the whole text (around 2150), and also in the part shown in the figure.

Following the physics analogy, the nearest-neighbor spacing distribution P(d) was used in [9] to characterize the spatial distribution of a particular word, and to show the relationship between word clustering and word semantic meaning. P(d) is obtained as the normalized histogram of the set of distances (or spacings) (d₁, d₂, …, dₙ) between consecutive occurrences of a word, with dᵢ = eᵢ₊₁ − eᵢ. As seen in Fig. 1, a nonrelevant word (such as "but") is placed at random along the text, while a relevant word (such as "Quixote") appears in the text forming clusters, and this difference is reflected in their corresponding P(d) distributions. In the case of a relevant word the energy levels attract each other, while for a nonrelevant word the energy levels are uncorrelated and therefore distributed at random, so the higher the relevance of a word, the larger the clustering (the attraction) and the larger the deviation of P(d) from the random expectation. The connection between word attraction (clustering) and relevance comes from the fact that a relevant word is usually the main subject of local contexts, and therefore it appears more often in some areas and less frequently in others, giving rise to clusters.
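For concreteness, the construction of the word "spectrum" and of its spacings can be sketched in a few lines of Python; the whitespace/alphanumeric tokenization used here is an illustrative simplification, not part of the method itself.

```python
# Sketch: build the "energy spectrum" of a word (its positions in the text)
# and the nearest-neighbor spacings d_i = e_{i+1} - e_i described above.
# The tokenization rule is an assumption made for illustration only.
import re

def tokenize(text):
    """Lowercase alphanumeric tokens; a simplifying assumption, not the paper's rule."""
    return re.findall(r"[a-z0-9]+", text.lower())

def spectrum(tokens, word):
    """1-based positions of `word` in the token stream: the energy levels e_i."""
    return [i + 1 for i, t in enumerate(tokens) if t == word]

def spacings(levels):
    """Distances between consecutive occurrences, d_i = e_{i+1} - e_i."""
    return [b - a for a, b in zip(levels, levels[1:])]

# The sentence quoted above reproduces the spectrum (1, 6, 10) for "a":
tokens = tokenize("A great scientist must be a good teacher and a good researcher")
print(spectrum(tokens, "a"))            # [1, 6, 10]
print(spacings(spectrum(tokens, "a")))  # [5, 4]
```

A normalized histogram of such spacings, accumulated over the occurrences of a word, is the distribution P(d) discussed in the text.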




We present here an approach to the problem of keyword detection which does not need a corpus and which is based on the principle that real keywords are clustered in the text (as in [9]); in addition, we introduce the idea that the clustering must be statistically significant, i.e., not due to statistical fluctuations. This is of fundamental importance when analyzing any text, but it is critical when analyzing short documents (articles, etc.), where fluctuations are important since all the words have small frequencies. As the statistical fluctuations depend on the frequency of the corresponding word (see below), our approach combines the information provided by the clustering of a word (its spatial structure along the text) with the information provided by its frequency. Furthermore, we extend our approach to general symbolic sequences, where word boundaries are not known. In particular, we model a generic symbolic sequence (i.e., a chain of "letters" from a certain "alphabet") by an ordinary text without blank spaces between words. As we know a priori the correct hidden keywords, this allows us to test our method. We show that the clustering (attraction) experienced by keywords is still observable in such a text, and therefore that real keywords can be detected even when the "words" of the text are not known.

To quantify the clustering (and thus the relevance) of a word using a single parameter instead of the whole distribution P(d), in [9] the parameter σ was used, defined as σ ≡ s/⟨d⟩, with ⟨d⟩ the average distance and s = √(⟨d²⟩ − ⟨d⟩²) the standard deviation of P(d). For a particular word, σ is the standard deviation of its normalized set of distances {d₁/⟨d⟩, d₂/⟨d⟩, …, dₙ/⟨d⟩}, i.e., of the distances given in units of the mean distance, which allows the direct comparison of the σ values obtained for words with different frequencies. The use of σ to characterize P(d) is common in the analysis of energy levels of quantum disordered systems [10]. For these systems, when the energy levels are uncorrelated and behave randomly, the corresponding P(d) is the Poisson distribution [8], P(d) = e^(−d), for which σ = 1. Thus, in [9] the value expected for a nonrelevant word without clustering and distributed randomly in a text was σ = 1, and the larger σ, the larger the clustering (and the relevance) of the corresponding word. This approach proved to be fruitful, and later works used it to test keyword detection [11].

However, the cluster-free (random) distribution P(d) is Poissonian only for a continuous distance distribution, which is valid for the energy levels but not for the words, where the distances are integers. The discrete counterpart of the Poisson distribution is the geometric one,

    Pgeo(d) = p(1 − p)^(d−1),    (1)

where p = n/N is the probability of the word within the text, n being the counts of the corresponding word and N the total number of words in the text. Pgeo(d) is expected for a word placed at random in a text. Examples of P(d) for words distributed according to more complex models than the random one can be found, for example, in [12,13], but we have observed that the geometric distribution is a very good model for the behavior of irrelevant words, and therefore we use it as our null hypothesis. For the geometric case, σgeo = √(1 − p), since s = √(1 − p)/p and ⟨d⟩ = 1/p, and the continuum case (σ = 1) is recovered when p → 0. Thus, in the discrete case, words with different p randomly placed in a text would give different clustering levels σ [see Fig. 2(a), inset]. To eliminate this effect, we define the clustering measure σnor as

    σnor = σ/σgeo = σ/√(1 − p).    (2)

To show that this correction is effective, in Fig. 2(a) we plot the behavior of the average value of σnor for words with different p in a simulation of random texts [14]: all the curves collapse onto a single one independently of p, showing that the normalization works, and the expected value σnor = 1 is recovered in all cases (for large n).

FIG. 2. (Color online) (a) ⟨σnor⟩ as a function of the word count n for words with different p in a random text. The horizontal line is the value σnor = 1. Inset: the same, but for σ instead of σnor; the horizontal lines are the expected values √(1 − p). (b) The mean ⟨σnor⟩ and the standard deviation sd(σnor) of the probability distribution P(σnor) as a function of n, obtained by simulation of random texts. The solid lines correspond to fittings according to Eq. (3). Inset: P(σnor) for three different values of n.

However, σnor does not take into account the influence of the word count n, although this influence can be of critical importance. The mean value ⟨σnor⟩ presents clear finite-size effects (it is biased) [Fig. 2(a)]. The strong n dependence also appears in the standard deviation of the distribution of σnor values [sd(σnor)] and in the whole distribution P(σnor) itself [Fig. 2(b)]. As expected, for small n the distribution P(σnor) is wide, and the probability of obtaining large σnor values by chance is not negligible.
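A minimal Python sketch of σ and of the normalization of Eq. (2), together with a random-text check in the spirit of the simulations of [14], is given below; the function names and the parameters of the check are illustrative choices, not part of the published procedure.

```python
# Sketch of the clustering measures: sigma = s/<d> and the normalized
# sigma_nor = sigma/sqrt(1 - p) of Eq. (2), with p = n/N.
import random
from statistics import mean, pstdev

def sigma(levels):
    """Standard deviation of the spacings in units of their mean (sigma = s/<d>)."""
    d = [b - a for a, b in zip(levels, levels[1:])]
    return pstdev(d) / mean(d)

def sigma_nor(levels, N):
    """Normalized clustering level of Eq. (2); N is the total number of words."""
    p = len(levels) / N
    return sigma(levels) / (1.0 - p) ** 0.5

# Null-model check in the spirit of Ref. [14]: a "word" occurring at random
# with probability p should give sigma_nor close to 1 when n is large.
N, p = 2_000_000, 0.001
levels = [i + 1 for i in range(N) if random.random() < p]
print(len(levels), round(sigma_nor(levels, N), 3))   # typically ~2000 occurrences, sigma_nor ~ 1.0
```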


As n increases, P(σnor) becomes narrower and consists essentially of a Gaussian peak centered at σnor = 1: now the probability of obtaining large σnor values by chance is very small. As a consequence, this strong n dependence can be crucial: since the statistical fluctuations [as measured by sd(σnor)] are much larger for small n, it is possible to obtain a larger σnor for a rare word placed at random in the text than for a more frequent real keyword. The rare random word would then be misidentified as a keyword.

To solve this problem we propose a new relevance measure which takes into account not only the clustering of a word, measured by σnor, but also its statistical significance given the word count n. To achieve this, we first obtained by extensive simulation of random texts [14] the n dependence (bias) of the mean value ⟨σnor⟩ and of the standard deviation sd(σnor) of the distribution P(σnor), which are shown in Fig. 2(b). Both functions are very well fitted in the whole n range by

    ⟨σnor⟩ = (2n − 1)/(2n + 2),    sd(σnor) = 1/[√n (1 + 2.8 n^(−0.865))].    (3)

Note how, for large n, ⟨σnor⟩ → 1 and sd(σnor) → 1/√n, in agreement with the central limit theorem.

As P(σnor) tends to be Gaussian, we can design an appropriate relevance measure C: for a word with n counts and a given σnor value, we define the measure C as

    C(σnor, n) ≡ [σnor − ⟨σnor⟩(n)] / sd(σnor)(n),    (4)

i.e., C measures the deviation of σnor from the value expected in a random text [⟨σnor⟩(n)] in units of the expected standard deviation [sd(σnor)(n)]. Thus C is a Z-score measure which depends on the frequency n of the word considered, and it combines the clustering of a word and its frequency. To calculate C we use the numerical fittings of Eq. (3). C = 0 indicates that the word appears at random, C > 0 that the word is clustered, and C < 0 that the word repels itself. In addition, two words with the same C value can have different clustering (different σnor values) but the same statistical significance.

We have systematically used C to analyze a large collection of texts [15] (novels, poetry, scientific books). C can be used in two ways: (i) to rank the words according to their C values, and (ii) to rank the words according to their σnor values, but only for words with a C value larger than a threshold C0, which fixes the statistical significance considered. Both approaches work extremely well for many texts in different languages [16].

The Origin of Species by Means of Natural Selection is a good example to understand the effect of C: using σnor, for the very relevant word "species" (n = 1922) we obtain σnor = 1.905, and in the σnor ranking "species" appears in 505th place! Nevertheless, when using the C measure we find for this word C = 39.97, and in the C ranking it is in 5th place (after "sterility," "hybrids," "varieties," and "instincts").
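A short Python sketch of the C measure of Eq. (4), using the fitted expressions of Eq. (3), is shown below; it assumes the sigma_nor helper from the previous sketch, and the commented ranking loop is a naive illustration rather than an efficient implementation.

```python
# Sketch of the relevance measure C of Eq. (4), based on the finite-size
# fittings of Eq. (3) for random texts. Assumes sigma_nor() defined above.

def mean_sigma_nor(n):
    """Fitted expectation <sigma_nor>(n) for a randomly placed word, Eq. (3)."""
    return (2.0 * n - 1.0) / (2.0 * n + 2.0)

def sd_sigma_nor(n):
    """Fitted standard deviation sd(sigma_nor)(n) for a randomly placed word, Eq. (3)."""
    return 1.0 / (n ** 0.5 * (1.0 + 2.8 * n ** -0.865))

def C(levels, N):
    """Z-score of the observed sigma_nor against the random-text expectation, Eq. (4)."""
    n = len(levels)
    return (sigma_nor(levels, N) - mean_sigma_nor(n)) / sd_sigma_nor(n)

# Illustrative (and deliberately naive) ranking of the words of a text by C,
# reusing tokenize() and spectrum() from the first sketch:
# tokens = tokenize(open("book.txt").read())
# scores = {w: C(spectrum(tokens, w), len(tokens))
#           for w in set(tokens) if tokens.count(w) >= 10}
# print(sorted(scores, key=scores.get, reverse=True)[:20])
```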
Next, we would like to extract keywords from symbolic sequences, considered as continuous chains of "letters" without "spaces" separating them. Previous attempts in this direction [17] were based on word frequencies and not on the spatial structure of the text. Our underlying idea is that, even in a symbolic sequence, the spatial distribution of relevant "words" should differ from that of irrelevant ones, so the clustering approach can provide useful results.

To model generic symbolic sequences, we use standard literary texts in which all the spaces, punctuation marks, etc., have been removed, thus producing a continuous chain of letters drawn from the alphabet {a, b, ..., z, 0, 1, ..., 9} [18]. Since we know a priori the real keywords hidden in such texts, this provides a good benchmark for our method. Our approach works as follows: as the true "words" are unknown, we calculate the C measure [19] for all possible ℓ-letter words, where ℓ is a small integer (ℓ = 2–35). For each ℓ, we rank the ℓ words by their C values. As the number of different ℓ words is immense (x^ℓ, with x the number of letters of the alphabet), keyword detection is a daunting task. Note that, for a given ℓ, any word contains many other words of smaller ℓ and is also part of words with larger ℓ. In this way, the putative words can be viewed as a directed acyclic graph (DAG) [20]. DAGs are hierarchical treelike structures in which each child node can have several parent nodes. Parent nodes are general words (shorter words, small ℓ), while child nodes are more specific words (longer words, large ℓ). For a given ℓ, each ℓ word has two (ℓ − 1) parents (for example, the word "energy" has "energ" and "nergy") and 2x (ℓ + 1) children (like "eenergy," "energya," etc.). As expected, we observed that words with semantic meaning and their parents are strongly clustered, while irrelevant and common words are randomly distributed.

TABLE I. The first 20 "words" extracted from the book Relativity: The Special and General Theory, by A. Einstein, with spaces and punctuation marks removed.

Word                          Counts    σnor       C
energy                            23    4.29   19.10
theuniverse                       20    3.84   15.76
erical                            26    3.25   13.74
project                           35    2.73   11.85
alongthe                          17    2.92   10.28
econtinuum                        23    2.70   10.04
thegravitationalfield             27    2.60   10.01
sphere                            16    2.8     9.79
electron                          13    2.92    9.54
geometry                          31    2.45    9.54
theprincipleofrelativity          33    2.41    9.48
specific                          11    2.91    9.11
theembankment                     40    2.25    9.09
square                            28    2.41    8.92
thetheoryofrelativity             32    2.31    8.78
velocityv                         17    2.60    8.63
referencebody                     56    2.01    8.50
materialpoint                     12    2.69    8.29
thelorentztransformation          33    2.22    8.26
fourdimensional                   26    2.33    8.25


For keyword extraction we use two principles: (i) for any ℓ, we apply a threshold on C to remove irrelevant words, taken as a percentile of the C distribution (usually a p value ≤ 0.05); and (ii) we explore the "lineages" of all words (from short, "general" ℓ words to longer, "specialized" ones) to extract just the words with semantic meaning and none of their parents, which might also be highly clustered. The lineage of a word can easily be established by iteratively detecting the child word with the highest C value. The result of such an algorithm [21] for a famous book without spaces and punctuation marks can be seen in Table I, and results for other books are given in [16]. It is remarkable that this algorithm, based on the spatial attraction of the relevant words, when applied to a text without spaces is able to detect not only real hidden keywords but also combinations of words or whole sentences full of meaning for the text considered, thus supporting the validity of our approach.
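The lineage-following idea, described in detail in [21], can be sketched roughly as follows; this is a simplified illustration rather than the exact published algorithm, and the choices of C0, ℓmax, and the minimum count are placeholders.

```python
# Simplified sketch of keyword search in a text without spaces: score
# l-letter strings with C, and follow each lineage to its longest
# significant descendant (cf. note [21]). Thresholds are illustrative.
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def letter_sequence(text):
    """Strip everything but lowercase letters and digits (cf. note [18])."""
    return "".join(re.findall(r"[a-z0-9]", text.lower()))

def positions(seq, word):
    """Non-overlapping occurrences of `word`, 1-based (cf. note [19])."""
    return [m.start() + 1 for m in re.finditer(re.escape(word), seq)]

def score(seq, word, min_count=10):
    """C value of `word`, or None if it is too rare; uses C() from the sketch above."""
    pos = positions(seq, word)
    return C(pos, len(seq)) if len(pos) >= min_count else None

def lineage(seq, word, C0, l_max=35):
    """Extend `word` one letter at a time toward the child with the largest C,
    and return the longest descendant whose C still exceeds C0."""
    best = word
    while len(word) < l_max:
        children = [word + c for c in ALPHABET] + [c + word for c in ALPHABET]
        scored = [(s, w) for w in children if (s := score(seq, w)) is not None]
        if not scored:
            break
        s, word = max(scored)
        if s > C0:
            best = word
    return best
```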
In conclusion, our algorithm has proven to work properly on generic symbolic sequences, and it is thus potentially useful for analyzing other specific symbolic sequences of interest, such as spoken language, where only sentences (but not individual words) can be preceded and followed by silence, or DNA sequences, where commaless codes are the rule.

We thank the Spanish Junta de Andalucía (Grant Nos. P07-FQM3163 and P06-FQM1858) and the Spanish Government (Grant No. BIO2008-01353) for financial support.

[1] WordNet: An Electronic Lexical Database, edited by C. Fellbaum (MIT Press, Cambridge, MA, 1998); E. Frank et al., Proceedings of the 16th International Joint Conference on Artificial Intelligence, 1999, p. 668; L. van der Plas et al., in Proceedings of the 4th International Conference on Language Resources and Evaluation, edited by M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, and R. Silva (European Language Resource Association, 2004), p. 2205; K. Cohen and L. Hunter, PLOS Comput. Biol. 4, e20 (2008).
[2] H. P. Luhn, IBM J. Res. Dev. 2, 157 (1958).
[3] Y. Matsuo and M. Ishizuka, Int. J. Artif. Intell. 13, 157 (2004); G. Palshikar, in Proceedings of the Second International Conference on Pattern Recognition and Machine Intelligence (PReMI07), Vol. 4815 of Lecture Notes in Computer Science, edited by A. Ghosh, R. K. De, and S. K. Pal (Springer-Verlag, Berlin, 2007), p. 503.
[4] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, New York, 1983); K. Sparck Jones, J. Document. 28, 11 (1972); S. E. Robertson et al., in Overview of the Third Text REtrieval Conference (TREC-3), NIST SP 500-225, edited by D. K. Harman (National Institute of Standards and Technology, Gaithersburg, MD, 1995), pp. 109–126.
[5] A. Bookstein and D. R. Swanson, J. Am. Soc. Inf. Sci. 25, 312 (1974); 26, 45 (1975).
[6] S. P. Harter, J. Am. Soc. Inf. Sci. 26, 197 (1975).
[7] A. Berger and J. Lafferty, Proc. ACM SIGIR99, 1999, p. 222; J. Ponte and W. Croft, Proc. ACM SIGIR98, 1998, p. 275.
[8] T. A. Brody et al., Rev. Mod. Phys. 53, 385 (1981); M. L. Mehta, Random Matrices (Academic Press, New York, 1991).
[9] M. Ortuño et al., Europhys. Lett. 57, 759 (2002).
[10] P. Carpena, P. Bernaola-Galván, and P. C. Ivanov, Phys. Rev. Lett. 93, 176804 (2004).
[11] H. D. Zhou and G. W. Slater, Physica A 329, 309 (2003); M. J. Berryman, A. Allison, and D. Abbott, Fluct. Noise Lett. 3, L1 (2003).
[12] V. T. Stefanov, J. Appl. Probab. 40, 881 (2003).
[13] S. Robin and J. J. Daudin, Ann. Inst. Stat. Math. 53, 895 (2001).
[14] We simulate random texts as random binary sequences 010010100001…. The symbol "1" appears with probability p and models a given word in a text, and the symbol "0" accounts for the rest of the words, with probability 1 − p, so in a realistic case p is very small.
[15] All the texts analyzed have been downloaded from the Project Gutenberg web page: www.gutenberg.org
[16] http://bioinfo2.ugr.es/TextKeywords
[17] H. J. Bussemaker, H. Li, and E. D. Siggia, Proc. Natl. Acad. Sci. U.S.A. 97, 10096 (2000).
[18] We use only lowercase letters to reduce the alphabet.
[19] Here the nearest-neighbor distances are measured in letters, not in words, since we advance letter by letter when exploring the text. Thus we ignore the "reading frame," which would be relevant only in texts with fixed-length words placed consecutively, such as the words of length 3 (codons) in the coding regions of DNA. In addition, we consider only nonoverlapping occurrences of a word. Although overlapping is not very likely in a text without spaces, it can happen in other symbolic sequences, such as DNA, and the expected P(d) is different for overlapping and nonoverlapping randomly placed words [see S. Robin, F. Rodolphe, and S. Schbath, DNA, Words and Models (Cambridge University Press, Cambridge, England, 2005)].
[20] T. H. Cormen et al., Introduction to Algorithms (The MIT Press, Cambridge, MA, 1990), pp. 485–488.
[21] The algorithm proceeds as follows: we consider an initial ℓ0 and start with a given ℓ0 word for which C > C0, where C0 corresponds to a p value of 0.05. We then find the child (ℓ0 + 1) word with the highest C value, and proceed in the same way in the following generations: we find the successive child (ℓ0 + i) word with the highest C value, i = 1, 2, .... This process stops at a previously chosen maximal word length ℓmax. Finally, we choose as the representative of the lineage the longest word for which C > C0 and define it as the extracted semantic unit. We repeat this algorithm for all the ℓ0 words with C > C0, and we also repeat the whole process for different initial values of ℓ0. Finally, we remove the remaining redundancies (repeated words or semantic units) that arise from exploring lineages starting from different initial ℓ0.

