
The frequency dictionary for Russian

http://www.artint.ru/projects/frqlist/frqlist-en.php


Serge Sharoff
The second version of the frequency list
From this page you can access the frequency list for modern
Russian. Until now, Chastotnyj slovarj russkogo jazyka
(Zasorina, 1977) has provided the most widely used frequency list
for Russian. However, the corpus used by Zasorina is
small by modern standards (about 1 million words). It is also
outdated: it mostly covers usage from the 1920s to the 1960s and
includes a high proportion of ideological sources, such as texts
by Lenin and Khrushchev and Soviet newspapers. As a result, its
word frequencies are severely biased; for example, Soviet and
comrade are among the first hundred Russian words, on a par with
function words. Finally, the list of (Zasorina, 1977) is not
available electronically.
The list accessible from this page includes about 32000 words
with a frequency greater than 1 ipm (one instance per million
words). A shorter selection of the 5000 most frequent words is
also available. The lists use UTF-8 encoding for Cyrillic and are
compressed as zip archives:
lemma.al.zip - lemmas sorted in alphabetical order
lemma.num.zip - lemmas sorted by frequency
words.num.zip - word forms sorted by frequency
Lists of the 5000 most frequent words:
5000lemma.al.zip - lemmas sorted in alphabetical order
5000lemma.num.zip - lemmas sorted by frequency
Some data about uses of words in modern Russian
The average word length is 5.28 characters.
The average sentence length is 10.38 words.
1000 most frequent lemmas cover 64.0708% of word
forms in texts.
2000 most frequent lemmas cover 71.9521% of word
forms in texts.
3000 most frequent lemmas cover 76.6824% of word
forms in texts.
5000 most frequent lemmas cover 82.0604% of word
forms in texts.
The exact information on the mapping of frequency to
coverage is available from here.
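The mapping from rank to coverage can be sketched as follows. This is only a minimal illustration, assuming that each entry is a (lemma, ipm) pair and that coverage is the cumulative ipm divided by one million; the function name and the toy data are hypothetical and are not taken from the distributed files.

```python
from itertools import accumulate


def coverage_by_rank(entries):
    """Return cumulative coverage (in percent) after each rank.

    entries: iterable of (lemma, ipm) pairs, where ipm is the number of
    instances per million words.
    """
    # Sort by descending frequency so that rank 1 is the most frequent lemma.
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    cumulative_ipm = accumulate(ipm for _, ipm in ranked)
    # Cumulative ipm out of one million words, expressed as a percentage.
    return [total * 100 / 1_000_000 for total in cumulative_ipm]


# Toy data, invented for illustration only.
toy = [("b", 20000.0), ("a", 30000.0), ("c", 10000.0)]
coverage = coverage_by_rank(toy)
```

Under these assumptions, the value at position N-1 of the result is the share of running text covered by the N most frequent lemmas, which is the kind of figure quoted above for 1000, 2000, 3000 and 5000 lemmas.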

The list is compiled on the basis of a corpus of modern
Russian. It contains a selection of modern fiction, political
texts, newspapers, and popular science (about 40 million
words, MW; fiction accounts for about half of the corpus). All
texts were originally written in Russian between 1970 and
2002, the majority of them between 1980 and 1995; the
newspaper corpus is from 1997-1999.
It is widely known that large texts present a problem for
frequency lists, since a large text that contains many instances
of a rare word can boost its frequency. If the corpus is based
on fiction, large texts are quite common. As an example, the
corpus contains a huge sequel to Tolkien's "The Lord of the
Rings" written by a Russian author (Nick Perumov). Although
the sequel is about 250 kW (thousand words) long, less
than one percent of the whole corpus, the word hobbit is used
so often in that book that it would be among the first
thousand most frequent Russian words if no precautions
against large texts were taken. For this reason, the frequency
list is calculated under the condition that no single text from
the corpus contributes more than 10 kW and no author
contributes more than 100 kW to the count. Thus, the subset
of the whole corpus used for the frequency count is about 16 MW.
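The capping condition can be illustrated with a short sketch. The page does not say how a text is cut down to its limit, so the code below assumes, purely for illustration, that the first tokens of each text are kept and that texts are processed in the order given; the function name and the input layout (author, list of lemmas) are hypothetical.

```python
from collections import Counter, defaultdict

MAX_PER_TEXT = 10_000      # 10 kW limit per text, as stated above
MAX_PER_AUTHOR = 100_000   # 100 kW limit per author, as stated above


def capped_counts(texts, max_per_text=MAX_PER_TEXT,
                  max_per_author=MAX_PER_AUTHOR):
    """Count lemmas while limiting per-text and per-author contributions.

    texts: iterable of (author, tokens) pairs, tokens being a list of lemmas.
    Assumption: a text over the limit is truncated to its first tokens.
    """
    counts = Counter()
    used_by_author = defaultdict(int)
    for author, tokens in texts:
        # Budget left for this author, never negative.
        remaining = max(0, max_per_author - used_by_author[author])
        allowed = min(max_per_text, remaining)
        sample = tokens[:allowed]
        counts.update(sample)
        used_by_author[author] += len(sample)
    return counts


# Toy corpus with small limits so the effect is visible.
toy_texts = [
    ("A", ["hobbit"] * 5 + ["and"] * 5),
    ("A", ["hobbit"] * 10),
    ("B", ["and"] * 3),
]
toy_counts = capped_counts(toy_texts, max_per_text=4, max_per_author=6)
```

With the toy limits, author A contributes only six tokens in total, however long the two texts are. Random sampling within a text would be an equally plausible way to apply the limit.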
Words are not uniformly distributed in texts. Some of them
(like prepositions) occur in many texts with predictable rate,
some (like pronouns or mental verbs) are significantly more
frequent for certain writers or genres, while some are
"contagious": if a word (e.g. a proper name, a title of nobility
or a technical term) occurs once in a text, it tends to be
repeated, thus boosting its frequency in a document. The
variation can be measured in a variety of ways (Church, K.
and Gale, W., 1995). A list with variation data is available
here. The structure of each entry is:
lemma, mean frequency (ipm), number of texts in which the
lemma occurs, standard deviation of frequency counted over all
texts, coefficient of variation, variance.
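The fields of such an entry can be computed along the following lines. This is a sketch under stated assumptions: the mean is taken over per-text frequencies, each normalised to ipm, texts without the lemma count as 0, and population (not sample) statistics are used. The page does not specify any of these choices, and the function names are hypothetical.

```python
from statistics import mean, pstdev, pvariance


def to_ipm(count, text_length):
    """Convert a raw count in one text to instances per million words."""
    return count * 1_000_000 / text_length


def variation_stats(per_text_ipm):
    """Summarise one lemma's per-text frequencies (in ipm).

    per_text_ipm: one value per text in the corpus, 0.0 where the lemma
    does not occur.
    """
    mu = mean(per_text_ipm)
    sd = pstdev(per_text_ipm)  # population standard deviation (assumption)
    return {
        "mean_ipm": mu,
        "texts": sum(1 for f in per_text_ipm if f > 0),
        "sd": sd,
        # Coefficient of variation: standard deviation relative to the mean.
        "cv": sd / mu if mu else 0.0,
        "variance": pvariance(per_text_ipm),
    }


# Two lemmas with the same mean frequency but different spread (toy values).
evenly_spread = variation_stats([100.0, 300.0])
contagious = variation_stats([0.0, 0.0, 0.0, 800.0])
```

Both toy lemmas have a mean of 200 ipm, but the second occurs in a single text and has a much larger coefficient of variation, which is the "contagious" pattern described above.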
The corpus, tools for working with it, as well as an aligned
parallel English-Russian corpus are discussed in the following
publication:
Sharoff, Serge, (2002). Meaning as use: exploitation of
aligned corpora for the contrastive study of lexical semantics.
Proc. of Language Resources and Evaluation Conference
(LREC02). May, 2002, Las Palmas, Spain. PDF file.
Three frequency lists for word classes are also available:
size adjectives
verbs of motion
words designating emotions
The compilation of the corpora, the development of the respective
tools, and the frequency lists were made possible by the
Fellowship awarded to the author from the
. 2001
Copyright. 2001 RRIAI
