Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Universities of Leeds and Sussex
Linguistic evidence within and across languages, word frequency lists and language learning
Or: Word lists are useful, but are they (could they be) scientific?
KELLY
A word in one language on one side, the other language on the other
For language learning
Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish (Leeds does Arabic, Chinese, Russian)
April 2010
9 languages, 36 pairs
Kilgarriff: KELLY 3
Leeds
Method
Learning-to-read books (NS children)
NNS language learner textbooks
Dictionaries
Language testing
Should be corpus-based
Most aren't
How
Complications
What is a word
Words and lemmas
Grammatical classes
Numbers, names...
Multiwords
Homonymy
Not for Chinese: segmentation
English: co-operate, widely-held, farmer's, can't
Norwegian, Swedish: compounding, separable verbs
Arabic, Italian: clitics, al, ...
...
Lemmatisation
e.g. invading
Chinese: none
English: simple
Middling: Swedish, Norwegian, Italian, Greek
Tough: Russian, Polish, Arabic
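A minimal sketch of why lemmatisation matters for a frequency list: word forms such as "invading" are counted under one lemma. The lemma table here is a hypothetical toy, not the project's tooling; a real list would rely on a lemmatiser for each language.

```python
from collections import Counter

# Hypothetical toy lemma table (illustration only); a real project would
# use a proper lemmatiser for the language in question.
LEMMAS = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_frequencies(tokens, lemma_table):
    """Count lemmas; a token missing from the table counts as its own lemma."""
    return Counter(lemma_table.get(tok.lower(), tok.lower()) for tok in tokens)

tokens = ["Invading", "armies", "invaded", "and", "invade"]
freqs = lemma_frequencies(tokens, LEMMAS)
print(freqs.most_common())
```

Without the table, the three forms would appear as three separate low-frequency entries rather than one higher-frequency lemma.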
Grammatical classes
Recommendation: different
Required
Marginal cases
Numbers: twelve, seventeenth, fifties
Closed sets: days of week, months
Countries: capitals, nationalities, currencies, adjectives, languages
easter, christmas, islam, republican
Multiwords
According to
Base list:
Homonymy
We can't do (with decent accuracy)
We can't give frequencies for senses
Sometimes disconcerting
See also below
Corpora
Make it big and diverse
From the web (WaCky corpora)
Can do for any language
Web language: less formal, not mainly 'reporting' or fiction (cf. news, BNC)
Good for language learners
Comparing corpora
Corpora: new
We are all beginners
Best way to get a sense of a corpus
4 legal: trademarks, pursuant, accordance, herein
Web-low
Exclude British English, transcription/tokenisation anomalies
herself, stood, seemed, she, looked, yesterday, sat, considerable, had, council, felt, perhaps, walked, round, her, towards, claimed, knew, obviously, remained, himself, he, him
Observations
Pronouns and past tense verbs: fiction
Masc vs fem
Yesterday: probably daily newspapers
Constancy of ratios: he/him/himself, she/her/herself
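One way to produce "high" and "low" word lists like those above is to rank words by the ratio of their per-million frequencies in two corpora. The sketch below is an illustration under stated assumptions: the counts are made up, and the smoothing constant (added so that words absent from one corpus do not cause division by zero) is a hypothetical choice, not a value from the talk.

```python
from collections import Counter

def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {word: n * 1_000_000 / total for word, n in counts.items()}

def keyword_ratios(focus_counts, reference_counts, smoothing=100.0):
    """Rank words by (focus freq + k) / (reference freq + k), highest first.

    `smoothing` (k) is an assumed constant, chosen here for illustration.
    """
    focus = per_million(focus_counts)
    ref = per_million(reference_counts)
    words = set(focus) | set(ref)
    scores = {
        w: (focus.get(w, 0.0) + smoothing) / (ref.get(w, 0.0) + smoothing)
        for w in words
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up toy counts standing in for a web corpus and a reference corpus.
web = Counter({"herein": 30, "she": 10, "the": 960})
reference = Counter({"herein": 2, "she": 98, "the": 900})

ranked = keyword_ratios(web, reference)
print(ranked[0][0], ranked[-1][0])
```

Words at the top of the ranking are overrepresented in the first corpus, and words at the bottom are overrepresented in the second.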
Corpus Factory
Many languages
General corpus, 100m+ words
Fast
High quality
Comparable across languages
Wikipedia (Wiki) corpora
Many domains
Free
265 languages covered, more to come
Extract text from Wiki: Wikipedia2Text
Tokenise the text
Morphology of the language is important
Can use existing word tokeniser tools
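The tokenise-then-count step can be sketched as follows. This is a simple regular-expression tokeniser written for illustration, assuming space-delimited text; it is not one of the existing tools mentioned above and would not work for languages that need segmentation, such as Chinese.

```python
import re
from collections import Counter

# Assumed token pattern: runs of word characters, optionally joined by
# internal apostrophes or hyphens (e.g. "can't", "co-operate").
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*")

def tokenise(text):
    """Split text into lower-cased word tokens."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenise(text)).most_common()

sample = "The farmer's co-operative can't co-operate. The farmer can."
tokens = tokenise(sample)
print(frequency_list(sample)[:3])
```

Whether "farmer's" or "can't" should be one token or two is exactly the kind of decision that differs by language and affects the resulting list.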
Evaluation
For each of the languages, two corpora available: Web and Wiki
Dutch: also a carefully designed lexicographic corpus
Hypothesis: Wiki corpora are informational
Informational --> typical written
Interactional --> typical spoken
Evaluation
1st and 2nd person pronouns: strong indicators of interactional language
English: I, me, my, mine, you, your, yours, we, us, our
Ratio: web:wiki
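The web:wiki ratio can be sketched as below, using the English pronoun set listed above. The two token lists are invented toy sentences, and the function names are hypothetical; real figures would come from full corpora.

```python
# English 1st and 2nd person pronouns, as listed on the slide.
PRONOUNS = {"i", "me", "my", "mine", "you", "your", "yours", "we", "us", "our"}

def pronouns_per_million(tokens, pronouns=PRONOUNS):
    """Frequency of the pronoun set per million tokens."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in pronouns)
    return hits * 1_000_000 / len(tokens)

def web_wiki_ratio(web_tokens, wiki_tokens):
    """Ratio of pronoun frequency in the web corpus to that in the wiki corpus."""
    wiki_rate = pronouns_per_million(wiki_tokens)
    if wiki_rate == 0:
        raise ValueError("no pronouns found in the wiki corpus")
    return pronouns_per_million(web_tokens) / wiki_rate

# Invented toy data: 8 tokens each, 3 pronouns vs 1 pronoun.
web_tokens = "I think you will like our new site".split()
wiki_tokens = "The city is known for what we export".split()

ratio = web_wiki_ratio(web_tokens, wiki_tokens)
print(ratio)
```

A ratio well above 1 would be consistent with the hypothesis that the web corpus is more interactional than the wiki corpus.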
Results
Thai (word forms not recovered)
Word     Web     Wiki    Ratio
1        2935    366     8.00
2        133     19      7.00
3        770     97      7.87
4        1722    320     5.36
5        2390    855     2.79
6        21      6       3.20
7        434     66      6.54
8        2108    2070    1.01
9        179     148     1.20
10       431     677     0.63
Total    11123   4624    2.40
Table: 1st and 2nd person pronouns in Web and Wiki corpora, per million words
Keyword themes, with English glosses (Dutch word forms not recovered): Belgian, Religion, Fiction, Web, Newspapers
NlWaC
English glosses listed: (city), Belgian, Flemish, looked/watched, previous, watched/looked, percent, million, billion, (Belgian) franc
Stages
review - how?
points system
deduct points for words in overrepresented areas
add in words from other corpora
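A points system of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the starting points, the penalty and bonus sizes, and the function name are hypothetical, and the talk does not specify how points were actually assigned.

```python
def review_scores(base_points, overrepresented, other_corpora_words,
                  penalty=2, bonus=1):
    """Adjust a word list's points (hypothetical scheme).

    - deduct `penalty` from words in overrepresented areas
    - add `bonus` for words found in other corpora, adding the word
      to the list if it was missing
    """
    scores = dict(base_points)
    for word in overrepresented:
        if word in scores:
            scores[word] -= penalty
    for word in other_corpora_words:
        scores[word] = scores.get(word, 0) + bonus
    return scores

# Toy input: made-up starting points for three words.
base = {"trademarks": 3, "house": 5, "herein": 2}
scores = review_scores(
    base,
    overrepresented={"trademarks", "herein"},   # e.g. legal boilerplate on the web
    other_corpora_words={"house", "banana"},    # words attested in other lists
)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Words can then be kept or dropped according to a threshold on the adjusted points.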
Translation database
All Swedish words used as translations more than six times
All 1:1:1:1... 'simple cases'
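The first query can be sketched as a count over translation pairs. The record layout (source language, source word, target language, target word), the language codes, and the toy entries are all assumptions for illustration; the actual database schema is not described here.

```python
from collections import Counter

def frequent_translation_targets(translation_pairs, target_lang="sv", min_uses=7):
    """Words in `target_lang` used as a translation at least `min_uses` times.

    `translation_pairs` is an iterable of hypothetical records:
    (source_lang, source_word, target_lang, target_word).
    "More than six times" corresponds to min_uses=7.
    """
    uses = Counter(
        target_word
        for _src_lang, _src_word, tgt_lang, target_word in translation_pairs
        if tgt_lang == target_lang
    )
    return {word for word, n in uses.items() if n >= min_uses}

# Toy data: "hus" is given as the translation from all 8 other languages,
# "katt" from only one.
other_langs = ["ar", "zh", "en", "el", "it", "no", "pl", "ru"]
pairs = [(lang, f"house_{lang}", "sv", "hus") for lang in other_langs]
pairs.append(("en", "cat", "sv", "katt"))

print(frequent_translation_targets(pairs))
```

With eight source languages, a word used more than six times is one that seven or all eight lists translate into.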
Translations
Multiwords
English: look for
Probably the translation of a high-frequency word in several of the 8 other languages
So:
Based on a WaC corpus
Input from other same-language corpora
And from translations from 8 languages
Numbers
Current status
Big problems
Multiwords (as anticipated)
Homonymy (as anticipated)
orange, banana, alphabet, elbow, Hello
Worse than anticipated
Lists from spoken corpora, learner corpora, needed
Relation between
article of faith
http://forbetterenglish.com