
A Frequency Dictionary of Contemporary Russian: Core vocabulary for learners

Serge Sharoff and Elena Umanskaya
University of Leeds
Proposed delivery date: July 2011

Purpose

Like other volumes in the Routledge Frequency Dictionary series, A Frequency Dictionary of Contemporary Russian will provide a list of core vocabulary (5,000 words) for learners of Russian as a foreign language. Russia's turbulent history over the past hundred years has resulted in several substantial changes in the Russian lexicon. At the moment there are no resources for language learners reflecting the modern state of Russian. Existing frequency dictionaries are outdated, while many materials for teaching Russian as a foreign language do not take the frequency of words into account. The proposed dictionary will define the Russian teaching curriculum for the English-speaking world on the basis of scientific evidence coming from representative corpora.

Corpus

The dictionary we propose will be based on the Russian Internet Corpus, I-RU (Sharoff, 2006), which consists of more than 150 million words taken from more than 75,000 webpages. The corpus was collected in 2005 by making queries to Google and collecting the top 10 pages retrieved for each query. There are often concerns about the quality of texts available on the Web. However, a closer investigation of this corpus (Sharoff, 2006; Sharoff, 2007) shows that the Internet does not consist merely of porn and spam. In terms of genres, I-RU contains a fair range of newswires and other types of journalistic texts, research papers and administrative texts, as well as fiction. An automated estimate of the composition of I-RU puts the amount of reporting (mostly news) at 15%, FAQs, tutorials, textbooks and similar instruction-related materials

Personal interaction words

word              I-RU     RNC
(I)              10146    9714
(you, fam)        2530    2390
(you, polite)     2918    2194
(thank you)        151      91
(please)           104      72
(let's)            106      84

Topic-specific words

word              I-RU     RNC
(première)           9      90
(theatre)           86     303
(arbitrage)          7      55
(Federation)        83     255
(virus)             20     107
(virus strain)       1      36

Table 1: Comparing the frequencies in I-RU against RNC (data per million words)

at 7%, fiction at 25%, legal texts at 4%, with the rest belonging to argumentative texts, including traditional columns and opinions, academic texts, as well as such webgenres as blogs and forums. Fiction is under-represented on the Web for many other languages, but for Russian the situation is different: modern fiction is available in considerable quantity, and the unclear copyright status of fiction produced during the Soviet era means that it is available as well. Thus, the Russian Internet offers a good match to what the Russian population reads at the moment.

Forums and blogs available in I-RU provide an account of the language of personal interaction, which is important to language learners. An example comparing the frequency of some words in I-RU against the Russian National Corpus (RNC) is given in Table 1. Studies of other corpora derived from the Web, e.g., Ferraresi et al. (2008), also show that, in comparison to traditional corpora, Web corpora contain more words related to personal interaction, such as first and second person pronouns and verbs in the present tense. This comes from the fact that it is quite difficult to collect a sufficient amount of spoken language data, so traditional corpora cannot fully represent the language of spontaneous personal interaction and have had to rely on written sources, while Web corpora can produce a useful approximation.

As for domains, I-RU is based on a much larger number of sources than traditional manually collected corpora. It is inevitable that some words become overrepresented in traditional corpora, since the number of sources for each text type is usually limited by what was available to the researchers responsible for their collection. Adam Kilgarriff refers to this as the "whelk problem", i.e., if a text source is about whelks, the frequency of this word becomes disproportionately high (Kilgarriff, 1997). The RNC contains a number of memoirs of former actors and theatre directors, the business section of the Russian legal code (partly responsible for the frequency of the formal reference to Russia as the Russian Federation in Table 1), and a large number of medical texts. The large number of different sources in I-RU results in better coverage of core vocabulary, as the individual topics of each document are levelled out. Overall, I-RU gives the most reliable frequency list for language learners.

Creating the dictionary

Russian is a language with a considerable amount of morphological variation: adjectives, nouns and verbs vary according to grammatical case, number and gender, and verbs also vary by person and tense. As a result, mapping forms to their lemmas (dictionary headwords) is not trivial. In addition, the level of syncretism is relatively high: a form can often have several grammatical interpretations depending on the context; for example, the same word form can be a pronoun (my) or a form of a verb (to wash). We have a reliable part-of-speech tagger and lemmatiser (Sharoff et al., 2008), which has been used to process I-RU. The accuracy of tagging is about 95%, and the accuracy of lemmatisation is more than 98%. Nevertheless, we are going to check the resulting frequency list manually. In this process we will also add English glosses, examples and basic morphosyntactic information about each word.

The lemmas will be ranked by their frequency (normalised as instances per million words) as well as by Juilland's D coefficient (Juilland et al., 1970), which represents the dispersion of frequency across the range of documents:

$$D(x) = 1 - \frac{\sigma(x)}{\mu(x)}$$

where $\sigma(x)$ is the standard deviation of the frequency of word x across the range of documents in a corpus, while $\mu(x)$ is the overall frequency of this word. The value ranges from 1 ($\sigma = 0$), i.e., a word is equally frequent in all documents, to 0, when a word is extremely frequent in a small number of documents.

A technical issue concerns the use of the letter ё (yo), which is normally written as е in standard Russian texts, except those aimed at children and beginning language learners. Given that the letter is not marked in the vast majority of Russian texts and is very rare in our corpus, its possibility will be indicated in the entry, but the headword will not use it.

Another technical issue is the representation of stress, which is important for pronouncing Russian words correctly. Typographically it is represented by an accent symbol over the stressed vowel, but the conventions for placing it depend on the typesetting package. In the spreadsheet submitted to Routledge the stress will be represented by the single quote character.
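Returning to the dispersion measure: the calculation of normalised frequency and of Juilland's D can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the per-document frequencies are already normalised, that the documents (or corpus parts) are of comparable size, and that the mean per-document frequency stands in for the overall frequency. The commonly cited form of Juilland's D also divides the ratio of standard deviation to mean by the square root of the number of parts minus one, which keeps the value between 0 and 1; the hypothetical `scale_by_parts` flag switches this on or off. All function and parameter names are illustrative.

```python
import math
from statistics import mean, pstdev


def per_million(count, corpus_size):
    """Normalise a raw count to instances per million words."""
    return count * 1_000_000 / corpus_size


def juilland_d(freqs, scale_by_parts=True):
    """Dispersion of one word across documents.

    freqs: the word's normalised frequency in each document
           (assumed to be of comparable size).
    scale_by_parts: if True, divide the coefficient of variation by
           sqrt(n - 1), as in the commonly cited form of Juilland's D,
           so that the result stays between 0 and 1.
    """
    n = len(freqs)
    mu = mean(freqs)
    if n < 2 or mu == 0:
        raise ValueError("need at least two documents and a non-zero mean")
    # Coefficient of variation: population standard deviation over mean.
    variation = pstdev(freqs) / mu
    if scale_by_parts:
        variation /= math.sqrt(n - 1)
    return 1 - variation


if __name__ == "__main__":
    print(per_million(300, 150_000_000))  # raw count to instances per million
    print(juilland_d([5, 5, 5, 5]))       # evenly spread word
    print(juilland_d([20, 0, 0, 0]))      # word concentrated in one document
```

An evenly spread word yields a value of 1, while a word found in a single document yields a value close to 0 when the scaling is applied. The sample entries in the Appendix appear to report D on a scale of 0 to 100, which would correspond to multiplying this value by 100.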

In addition to ranking the top 5,000 individual words, we are going to represent the frequency of the 300 most common multiword constructions consisting of two or three words. Formulaic language is treated as very important for language learners. Besides, many Russian constructions make sense only when taken as a whole, e.g., the construction meaning "each other" (lit. "friend to friend"). For this task we will start from an initial list of the most common two- and three-word expressions ranked by the log-likelihood score (Dunning, 1993) and select a pedagogically relevant list from it.

The examples with their translations will be collected from a parallel corpus, which contains fiction and technical texts. We aim to select representative examples in which the headword is used with its most significant collocates. Where the parallel corpus has few examples, the main corpus will be consulted and the selected examples will be translated into English.

Processing of words and multiword constructions will be done in spreadsheets, which will be delivered to Routledge by July 2011.
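The log-likelihood ranking of two-word candidates mentioned above can be illustrated with a minimal Python sketch of Dunning's statistic computed from a 2x2 contingency table. It is an illustration under stated assumptions, not the procedure actually used for the dictionary: the function name, its parameters and the counts in the usage example are hypothetical, and the extension to three-word expressions is not covered.

```python
import math


def log_likelihood(c12, c1, c2, n):
    """Dunning's log-likelihood score (G2) for a two-word sequence.

    c12: number of times word1 is immediately followed by word2
    c1:  number of two-word sequences starting with word1
    c2:  number of two-word sequences ending with word2
    n:   total number of two-word sequences in the corpus
    """
    observed = [
        c12,                # word1 followed by word2
        c1 - c12,           # word1 followed by another word
        c2 - c12,           # another word followed by word2
        n - c1 - c2 + c12,  # neither word in its position
    ]
    # Expected counts if the two words occurred independently.
    expected = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


if __name__ == "__main__":
    # Hypothetical counts: (c12, c1, c2, n) for two candidate expressions.
    candidates = {
        "candidate A": (100, 100, 100, 1000),
        "candidate B": (10, 100, 100, 1000),
    }
    ranked = sorted(
        candidates, key=lambda k: log_likelihood(*candidates[k]), reverse=True
    )
    print(ranked)
```

A pair that co-occurs exactly as often as chance predicts scores zero, so sorting candidates by descending score puts the most strongly associated expressions first; the pedagogical selection described above would then be made by hand from that ranked list.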

Contents

As in several other dictionaries in the series, we are planning to have the following sections:

Introduction
This chapter will explain the rationale of this dictionary, its structure and suggestions for its use. It will also provide background information on the Russian language, the history of the development of Russian frequency lists, an explanation of the corpus used, and the procedures employed for its part-of-speech tagging, lemmatisation and disambiguation. This chapter will also give an overview of statistics about the frequency list, e.g., how many words are needed to understand 75% of the words in a Russian text. It will also contain the Russian alphabet, the conventions used in the entries and references to Russian grammar.

Frequency listing
This chapter will list the 5,000 most frequent lemmas with the following information included (see sample entries in the Appendix):
- rank order of frequency
- headword (lemma) with stress given for polysyllabic words
- part of speech indication (with gender information for nouns)
- information on inflections for irregular forms
- illustrative example from the corpus with translation into English
- normalised frequency (per million words)
- Juilland's D dispersion index

Alphabetical listing
This chapter will list the 5,000 words in the standard Russian alphabetical order (without using ё) with the following information included:
- lemma with part of speech
- English translation
- rank in the frequency listing

Part-of-speech listing
This chapter will list the words in frequency order separately for the main parts of speech (nouns, adjectives, verbs, adverbs) with the following information included:
- lemma
- rank in the listing for this part of speech
- rank in the overall frequency listing

Multiword constructions
This chapter will list the 300 most frequent multiword constructions (consisting of two or three words).

The experience of other dictionaries in the series suggests that the dictionary will be about 350 pages long, with the following split:

Section                   Pages
Introduction                 10
Frequency listing           210
Alphabetical listing         50
Part-of-speech listing       50
Multiword listing            30
Total                       350

Again from the experience of other dictionaries in the series, the following thematic vocabulary lists in call-out boxes are planned:

1. Fruit and vegetables
2. Drinks and beverages
3. Food
4. Clothing
5. Colours
6. Weather phenomena
7. City facilities
8. Travel
9. Directions and locations
10. Expressing motion
11. Expressing size
12. House and room
13. School life and subjects
14. Professions
15. Sports and leisure
16. Human body
17. Numbers
18. Time expressions
19. Popular festivals
20. Plants and natural features
21. Animals
22. Kinship and family relations
23. Moods and emotions
24. Language learning
25. Opposites
26. Verbs of communication

Market

Russian is one of the major world languages and one of the six official languages of the UN, so it has always been popular with students. In addition to being the language of Russia, it is also used as a lingua franca in the post-Soviet space. After an initial decline in its popularity following the collapse of the Soviet Union, Russian has recently returned to the list of the three languages in greatest demand according to the UK Government (alongside Arabic and Chinese). The number of students applying to British universities to study Russian has increased by more than 50% in the last few years.

Competition

As mentioned earlier, the existing lists are outdated and/or not suitable for learners. Frequency dictionaries for Russian appeared fairly early on (Shteinfeld, 1963; Zasorina, 1977), but they were based on relatively small collections of texts, so their word lists are not reliable. In addition, the Soviet-era sources of these texts make them seriously outdated now, e.g., the words for "Soviet" and "comrade" are in the first hundred in Zasorina's list, on a par with function words.

The most recent proper frequency list (Lönngren, 1993) is based on the Uppsala corpus, which is still small by modern standards. It consists of one million words, with approximately equal amounts of fiction and journalistic texts published between 1960 and 1987. The word list included in Nicholas Brown's Learner's Dictionary (Brown, 1996) is an adaptation of Zasorina's frequency list produced by demoting "the Communist vocabulary of Lenin, Khrushchev and Soviet newspapers", but it is not a proper frequency list itself: it relies on human judgements, which do not correlate reliably with actual frequencies (Alderson, 2007), and on Zasorina's list, which is based on a very small corpus and so is not reliable in itself. There is a more modern Russian National Corpus (Sharoff, 2005) with a plan to publish a frequency dictionary on its basis, but even when the plan comes into being, it will result in an academic publication aimed at the Russian market[1], with little potential for use in foreign language teaching. Besides, even though the RNC is considerably bigger (80 million words) than the corpora behind previous attempts to compile Russian frequency lists, it is still based on a large number of literary texts from the 1930s-1970s, which limits its claim to reflect modern Russian. I-RU is both more modern and twice the size of the RNC. Table 1 indicates some of the problems with the RNC frequency list.

Copyright

There are no constraints on the copyright of a frequency list. The examples taken from the corpus are citations covered by the fair use principle. The quotes selected from the corpus will be no longer than 10 words. This is the same practice as in other frequency dictionaries in the series.

Authors

Serge Sharoff has been involved in computational research on Russian since the 1990s, including morphosyntactic analysis and disambiguation of Russian. He has compiled several Russian corpora that are now widely used by Russian linguists, and he designed the blueprint of the Russian National Corpus. He has since turned to collecting corpora from the Web and has produced resources for a range of languages.
[1] The plan (as far as we know it) is to publish a list of about 30,000 words with information on their frequency distribution by years and genres, but without information on inflectional classes, which are quite irregular in Russian and needed by language learners. The publication will be with a Russian publisher without any information provided in English (neither in the introduction nor in the entries).

Elena Umanskaya is a freelance teacher of Russian as a foreign language. She is currently involved in a project on ranking Russian vocabulary in terms of the Common European Framework of Reference (CEFR) to define the minimal lexical levels needed for language learners. She also has extensive experience of annotating language corpora and using them for language teaching.

References
Alderson, J. C. (2007). Judging the Frequency of English Words. Applied Linguistics, 28(3):383-409.

Brown, N. J. (1996). Russian learners' dictionary: 10,000 words in frequency order. Routledge, London.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

Ferraresi, A., Zanchetta, E., Bernardini, S., and Baroni, M. (2008). Introducing and evaluating ukWaC, a very large web-derived corpus of English. In The 4th Web as Corpus Workshop: Can we beat Google? (at LREC 2008), Marrakech.

Juilland, A., Brodin, D., and Davidovitch, C. (1970). Frequency dictionary of French words. Mouton.

Kilgarriff, A. (1997). Putting frequencies in the dictionary. International Journal of Lexicography, 10(2):135-155.

Lönngren, L. (1993). Chastotnyi slovar sovremennogo russkogo yazyka (The Frequency Dictionary of Modern Russian). Acta Univ. Ups, Uppsala.

Sharoff, S. (2005). Methods and tools for development of the Russian Reference Corpus. In Archer, D., Wilson, A., and Rayson, P., editors, Corpus Linguistics Around the World, pages 167-180. Rodopi, Amsterdam.

Sharoff, S. (2006). Open-source corpora: using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4):435-462.

Sharoff, S. (2007). Classifying web corpora into domain and genre using automatic feature identification. In Proc. of Web as Corpus Workshop, Louvain-la-Neuve.

Sharoff, S., Kopotev, M., Erjavec, T., Feldman, A., and Divjak, D. (2008). Designing and evaluating a Russian tagset. In Proceedings of the Sixth Language Resources and Evaluation Conference, LREC 2008, Marrakech.

Shteinfeld, E. (1963). Chastotnyj slovarj sovremennogo russkogo literaturnogo jazyka (Frequency dictionary of modern Russian literary language). Tallin.

Zasorina, L., editor (1977). Chastotnyj slovarj russkogo jazyka (Frequency Dictionary of Russian). Russkij Jazyk, Moscow.

Frequency listing
108  Nm; father. (His father had been a dentist too.) 295; D 90
109  Nf, pl nom; leg. (His leg hurt him pretty sharply.) 294; D 95
111  N, singular only; information. (Any information must not be disclosed to third parties.) 287; D 94
112  A; high. (The methodology provides a high degree of control.) 286; D 96
113  Nm; month. (He could stay another month.) 283; D 98
114  Nf; minute. (I'll be there in twenty minutes.) 280; D 95
115  Nm; process. (The company fosters a collaborative software development process.) 278; D 98
116  Nf; situation. (She began to feel the bitterness of the situation.) 278; D 94
117  A; full. (The pail was full and very heavy.) 277; D 97
118  Nm; author. (Only the author can edit their comments.) 277; D 83
119  Nf; form. (To use a word correctly, you can look up its grammatical forms.) 277; D 95
120  Nm, pl nom; voice. (The voice went on and on.) 277; D 97
121  Nf; thought. (The digression of my thoughts must have done me good.) 274; D 97
122  Nm, singular only; light. (Light suddenly flooded the room.) 273; D 96
124  Nn; quality. (Adjust the contrast and brightness until the image quality improves.) 271; D 93
125  Nf; school. (Why aren't you in school?) 271; D 95
126  Nn; action. (This undoes the previous action.) 271; D 96
127  Nf; area, region. (I appreciate our cooperation in this field.) 269; D 95
128  Nm; man. (He was a darkly-tanned, burly man.) 269; D 90
129  Nn, singular only; attention. (Operators need to pay very close attention to determining the anticipated loads.) 268; D 89
130  Nf; road. (The road went steeply down-hill towards the river.) 267; D 96
131  Nf; connection; communication. (Students were relatively slow to use wireless communications.) 267; D 97
132  Nm; glance. (He gave her a quick, grateful look.) 263; D 98
133  Nn; condition. (Educational institutions can take advantage of special licensing conditions.) 263; D 94
134  R; again. (This allows the user to log in again.) 260; D 91
135  R; soon. (This certificate will expire soon.) 260; D 91
136  Nf; love. (True love never grows old.) 259; D 94
137  ...

Alphabetical listing
rank   PoS  translation
5590   N    methodology
264    N    metre
1541   N    underground
4596   N    fur
815    N    mechanism
4465   N    mechanic
4566   N    mechanics
4069   A    mechanical
1593   N    sword
1186   N    dream
1597   N    bag
1815   N    moment
4859   N    migration
5488   N    Ministry of Foreign Affairs
3524   N    microphone
2378   N    policeman
1125   N    police
1002   N    billion (1,000,000,000)
251    N    million
4766   N    mercy
2837   N    favour
1271   A    nice; dear
2530   N    mile
2165   R    past; missed
1838   N    mine
2473   A    minimal
1277   N    minimum
947    N    ministry
663    N    minister
4591   A    past
2644   N    minus
114    N    minute
34     N    world; peace
4288   R    peacefully
2086   A    peaceful
4070   N    world view
531    A    global
5709   N    bowl
2084   N    mission
5171   A    mystical
3200   N    meeting
4577   N    Metropolitan (Archbishop)
1988   N    myth
5905   N    target (shooting)

Multiword constructions listing


1   including. You can insert a table anywhere on a page, including in another table. 189
2   at least. Pressure must be at least fifty atmospheres. 106
3   nevertheless. However, nothing dispirits him. 79
4   unlike. Unlike a local dictionary, an online dictionary is available only when there is an active Internet connection. 62
5   depending on. The name on the button changes depending on the selected mode. 58
6   finally. They finally reached the opposite river bank. 55
7   on the other hand. Events, on the other hand, can be copied in any folder location. 53
8   nearly. The little room ceiling almost rests on his forehead. 49
9   this time. This time we may succeed. 49
10  each other. Many teammates are personally known to each other. 45

Call-out boxes

Weather phenomena
translation          freq   rank
sun                   131    679
wind                  112    816
snow                   76   1242
rain                   68   1364
weather                46   1966
sunny                  45   2042
temperature            44   2066
cloud                  39   2316
fog                    33   2667
frost                  31   2759
forecast               26   3220
degree                 25   3335
black cloud            19   4125
lightning              18   4251
climate                17   4543
storm                  15   4944
elements (weather)     13   5576
thunder                13   5639
thunderstorm           10   6365
hailstorm               8   7571
foggy                   8   7617
breeze                  7   8154
rainbow                 7   8512
