
Corpus Stylistics
Presented by: Quissa Marie M. Gonzales-BSED
Presented to: Dr. Arjan Espiritu
Contents:
*Background of the subject
*Methodology and Style
*Applications
What is Corpus Stylistics?
*Corpus stylistics is the statistical study of style: it applies
corpora, and the tools and methods of corpus linguistics, to
the study of literary (and sometimes non-literary) style
Example:
-study of the relative frequency of elements in a text

*Augustus de Morgan, 1851: suggested that disputes about the
authenticity of some of the writings of St Paul could be settled
by measuring the length of the words used
in the various Epistles
*T.C. Mendenhall, 1887: analysis of several authors’
frequency distributions of word-length
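A word-length frequency distribution of the kind described above can be sketched in a few lines. This is only an illustration: the function name `word_length_distribution` and the sample sentence are made up for this example, and the tokeniser simply treats any run of letters as a word.

```python
import re
from collections import Counter

def word_length_distribution(text):
    """Return {word length: relative frequency} for the words in text."""
    # Assumption: a "word" is any unbroken run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

# Invented sample sentence, not taken from any real corpus.
sample = "the cat sat on the mat and it was a good cat"
dist = word_length_distribution(sample)
for length, share in dist.items():
    print(length, round(share, 3))
```

Comparing two such distributions, one per author, is the comparative step; a real study would need far longer texts than a single sentence.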
*Corpus: a body or collection of linguistic data
for use in research.
*Since the early 1960s: interest in computer
corpora or machine readable corpora.
*Statements about the relative frequency of
various linguistic items in a corpus have
become very accurate.
*Some uses of the statistical analysis of style through
corpora:
*Education, e.g. EFL textbook writing
*Establishment of authorship, e.g. of unascribed
manuscripts
*Interpretive stylistics, e.g. study of the
writer’s ideology and point of view
*Methodology
Simple measures may characterise different styles
*average sentence length
*average word length
*type:token ratio (vocabulary richness)
* number of types = number of different words
* number of tokens = total number of words
*vocabulary growth (homogeneity of text)
* number of new types in the 1st, 2nd, …, nth 1000 words
* in a rich, varied text, the number will keep climbing steadily
*Especially when used comparatively
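The simple measures listed above can be computed roughly as follows. This is a minimal sketch under stated assumptions: the tokeniser and sentence splitter are crude regular expressions, the helper names (`tokenize`, `split_sentences`, `simple_measures`, `vocabulary_growth`) are invented for illustration, and the sample text is a toy. The chunk size defaults to 1000 words, as on the slide, but is set to 4 here so the toy text yields more than one chunk.

```python
import re

def tokenize(text):
    """Lower-cased word tokens (crude: runs of letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(text):
    """Crude sentence split on ., ! and ? (an assumption, not a real parser)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def simple_measures(text):
    tokens = tokenize(text)
    sentences = split_sentences(text)
    types = set(tokens)  # types = different words; tokens = all words
    return {
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "type_token_ratio": len(types) / len(tokens),
    }

def vocabulary_growth(tokens, chunk_size=1000):
    """Number of previously unseen types in each successive chunk."""
    seen = set()
    growth = []
    for start in range(0, len(tokens), chunk_size):
        new_types = set(tokens[start:start + chunk_size]) - seen
        growth.append(len(new_types))
        seen |= new_types
    return growth

# Toy text, invented for this example.
text = "The dog barked. The dog ran away. A cat watched the dog."
m = simple_measures(text)
g = vocabulary_growth(tokenize(text), chunk_size=4)
print(m)
print(g)
```

Since the type:token ratio falls as a text gets longer, it is best compared across samples of similar length, which is one reason these measures are most useful comparatively.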
*More complex analyses can give a more
interesting picture
* specific syntactic structures
* degree of modification in NPs
* types of verbs (e.g. verbs of persuasion, speech verbs, action
verbs, descriptive verbs)
* distribution of pronouns (1st/2nd/3rd person)
* etc … (anything you can think of!)
*Quite sophisticated mathematical techniques
can give an overall picture
* e.g. factor analysis: identifies from a (big) range of variables
which ones best identify/characterise differences
Multidimensional analysis
*Collect a large number of measures of many different kinds
*some simple word counts
*syntactic features
*classes and subclasses of N, V, Adj, Adv
*Factor analysis
*choose a range of features to measure, see which
ones are correlated
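The "see which ones are correlated" step can be illustrated with a plain Pearson correlation between two feature rates measured over several texts. This is a sketch of that single step only, not a full factor analysis. The two word lists, the helper names `rate` and `pearson`, and the four toy texts are all invented for this example.

```python
import math
import re

# Hypothetical feature definitions, chosen only for illustration.
CONJUNCTS = {"therefore", "nevertheless", "however", "moreover"}
FIRST_PERSON = {"i", "me", "my", "we", "us", "our"}

def rate(tokens, word_set):
    """Occurrences of word_set members per 100 tokens."""
    return 100 * sum(t in word_set for t in tokens) / len(tokens)

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

# Toy texts, invented for this example (two informal, two formal).
texts = [
    "we went home and i made my tea",
    "the results are clear therefore the claim holds however caution is needed",
    "i think we lost our way",
    "the method failed nevertheless the data remain useful moreover they are public",
]
rows = [re.findall(r"[a-z]+", t) for t in texts]
conj = [rate(r, CONJUNCTS) for r in rows]
first = [rate(r, FIRST_PERSON) for r in rows]
r = pearson(conj, first)
print(round(r, 2))  # negative: the two features tend not to co-occur here
```

A factor analysis goes further: it takes the full matrix of such correlations across many features and groups the features that vary together into a small number of underlying factors.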
*Example: work based on corpora trying to
quantify and characterise genre and register
differences
*Work pioneered by Douglas Biber*
*Biber used statistical measures to identify
stylistic factors that co-occurred, and could
therefore be definitional of text types and
genres
* E.g. conjuncts like therefore and nevertheless, together with use of
the passive, indicate a more formal style

*D. Biber, S. Conrad & R. Reppen, Corpus Linguistics: Investigating
Language Structure and Use, Ch 5: the study of discourse characteristics
*Corpora useful not only for counting frequencies
of features, but also:
*1. Concordance
*Lists occurrences of word in context
*Identify syntactic use of word
*Identify range of meanings
*Identify relative frequency of different uses/meanings
*2. Collocation
*What words occur together?
*Compare distribution of close synonyms
3. Vocabulary in context
*“Concordance”, also known as KWIC list (key
word in context)
*Allows us to see the (immediate) environment
in which a word appears
*Listings can be customised to show what you
want more clearly, e.g.
*sorted according to next or previous word
*showing more or less context
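A KWIC listing with the two customisations mentioned above (sorting by next or previous word, and showing more or less context) might look like the sketch below. The function names `kwic` and `format_kwic`, the `width` and `sort_by` parameters, and the sample text are assumptions made for this illustration. Context is counted in words and, for simplicity, runs across sentence boundaries.

```python
import re

def kwic(text, keyword, width=3, sort_by=None):
    """Return (left context, keyword, right context) rows for each hit.

    width   -- number of context words shown on each side
    sort_by -- None (text order), "next" or "previous"
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, tok, right))
    if sort_by == "next":
        rows.sort(key=lambda r: r[2][0].lower() if r[2] else "")
    elif sort_by == "previous":
        rows.sort(key=lambda r: r[0][-1].lower() if r[0] else "")
    return rows

def format_kwic(rows):
    """Right-align the left context so the keywords line up in a column."""
    left_width = max((len(" ".join(l)) for l, _, _ in rows), default=0)
    return [f"{' '.join(l):>{left_width}}  [{k}]  {' '.join(r)}"
            for l, k, r in rows]

# Toy text, invented for this example.
sample = "We spend time wisely. They waste time daily. Time flies."
rows = kwic(sample, "time", width=2, sort_by="next")
for line in format_kwic(rows):
    print(line)
```

Sorting by the next word groups together hits that share a following word, which makes recurring patterns easier to spot than in plain text order.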
4. Collocation
*Term coined by J R Firth (1957) to characterise (part of)
his theory of meaning
*“You shall judge a word by the company it keeps”
*“The occurrence of two or more words within a short
space of each other in a text” (Sinclair 1991)
*“The relationship a lexical item has with items that
appear with greater than random probability in its
(textual) context” (Hoey 1991)
Collocation, text type and style – example:
*Distinguish between general and more usual
collocations vs. technical and more personal
ones
*e.g. in a general corpus time collocates with
save, spend, waste, fritter away, …
*but in a corpus of sports reports time
collocates with half, full, extra, injury, first,
second, third, …
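The contrast between the collocates of time in two corpora can be sketched by counting the words that fall within a small window around it. This is a rough illustration: the function name `collocates`, the `window` parameter, and both miniature "corpora" are invented for the example. It reports raw co-occurrence counts only; deciding whether a pairing occurs "with greater than random probability" would need an association measure on top of these counts.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words occurring within `window` words of the node word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Two tiny stand-in "corpora", invented for this example.
general = "we spend time reading and we waste time arguing so save time"
sports = "they scored in extra time after a goal in injury time before full time"

general_colls = collocates(general, "time", window=1)
sports_colls = collocates(sports, "time", window=1)
print(general_colls.most_common())
print(sports_colls.most_common())
```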
5. Stylometry
*An attempt to capture the essence of the style of a
particular author by reference to a variety of
quantitative criteria, usually lexical, called
discriminators.
*Study of frequently occurring features
(word/sentence length; choice and frequency of
words; vocabulary richness)
*The ideal situation for authorship studies is
*when there are large amounts of undisputed text,
or
*few contenders for the authorship of the disputed
text(s).
Author attribution
Establishing the author of an unascribed manuscript:
*Build corpora
*A - works definitely by author A
*B - works definitely by author B
*C - works of disputed authorship, but probably written
by A or B
*Then select discriminants and associated measures
*When the technique has been shown to discriminate
effectively between A and B, then try it on C

(M. Oakes: ‘Computational Stylometry’, in Handbook of
Corpus Linguistics)
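The A/B/C procedure above can be sketched with one very simple choice of discriminants: the relative frequencies of a few function words, compared by summed absolute difference. This is an illustration under stated assumptions, not the method of any particular study. The word list `FUNCTION_WORDS`, the helper names, and the three miniature texts are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical discriminators: a handful of common function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "upon"]

def profile(text):
    """Relative frequency of each discriminator word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in FUNCTION_WORDS]

def distance(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

def attribute(corpus_a, corpus_b, disputed):
    """Return the closer candidate ("A" or "B") and both distances."""
    pc = profile(disputed)
    da = distance(profile(corpus_a), pc)
    db = distance(profile(corpus_b), pc)
    return ("A" if da < db else "B"), da, db

# Toy stand-ins for corpora A, B and C, invented for this example.
corpus_a = "the king of the land and the lords of the north rode to the sea"
corpus_b = "it was a day that it rained but a man went in"
disputed = "the queen of the isle and the men of the south"

guess, da, db = attribute(corpus_a, corpus_b, disputed)
print(guess, round(da, 3), round(db, 3))
```

The sketch skips the validation step: before applying it to C, the same comparison should be run on held-out texts of known authorship to check that the chosen discriminants really separate A from B.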
Language Learning

* Frequency - in particular, word frequency - had a role in
language learning in the days before electronic corpora
existed.
* The 'corpus revolution' made available frequency information
about language use in a totally unprecedented way
* Frequency dictionaries and frequency-based grammatical
information are becoming more and more available and new
sources of frequency information from the Web are being
tapped
* Various kinds of knowledge found in present-day language
textbooks (grammatical, collocational, semantic) are increasingly
frequency-based.
* In general, corpora represent real usage of language
* In addition, “more frequent” can equal “more important” in
many aspects of language learning
