Terminology Lecture 1

Corpora, corpus design and corpus selection


What is a corpus?
a)
Sinclair (1994) defines corpus as a collection of pieces of language that are selected
and ordered according to explicit linguistic criteria in order to be used as a sample of the
language. It is interesting that in an earlier publication, he had defined corpus as a
collection of naturally-occurring language text, chosen to characterize a state or variety of a
language (1991).
What is the difference between text and pieces of language? Both are used by
Sinclair to describe the components of a corpus; this is because the term text can be
misleading; it could be interpreted as meaning complete texts whereas the pieces of language
selected for a corpus are not always complete texts. The pieces of language selected for
inclusion in a corpus are selected according to explicit linguistic criteria; this means that the
selection is not arbitrary, and texts must fulfill certain conditions in order to be included. The
selected texts are chosen to be used as a sample of the language; they are therefore to be
perceived as being representative of the language or some subset of the language, depending
on the selection criteria which have been used.
b)
Atkins, Clear and Ostler (1992) define corpus as a subset of an ETL (= Electronic
Text Library) built according to explicit design criteria for a specific purpose, e.g. the Cobuild
Corpus, the Longman/Lancaster corpus, the Oxford Pilot Corpus. In this definition, there is
an assumption that the material which is to be selected for inclusion in a corpus is already
available in electronic form. Not all corpus compilers find themselves in such a fortunate
position, particularly those involved in compiling spoken corpora. It is not necessarily true
that the compilers of corpora always know in advance what they are going to do with their
corpus apart from having a fairly general purpose in mind such as linguistic analysis which
might prove to be an umbrella term for a whole range of more specific purposes.
Consequently, the notion of a specific purpose is perhaps not essential to the definition of
corpus. It might have been more useful to ...
Electronic text library: a collection of electronic texts in standardized format with certain
conventions relating to content, etc., but without rigorous selectional constraints.
c)
Francis (1992) defines corpus as a collection of texts assumed to be representative of
a given language, dialect, or other subset of language, to be used for linguistic analysis.
Francis's definition would now be considered to be too vague because it is not sufficient to
state that texts are assumed to be representative. If representativeness is considered to be an
important criterion, then the means of achieving it should be explicit rather than assumed.
Like Atkins et al., Francis specifies the purpose for which a corpus will be used (i.e. linguistic
analysis). Corpus linguistics has evolved considerably in recent years and is now used for
purposes other than linguistic analysis (e.g. as a testbed for natural language processing
systems) which means that his definition would require some revision.
d)
McEnery and Wilson (1996) define corpus as follows:
(1) (loosely) any body of text;
(2) (most commonly) a body of machine-readable text;
(3) (more strictly) a finite collection of machine-readable text, sampled to be maximally
representative of a language or variety.
These definitions are interesting in that they confirm that 'corpus' is not yet fully defined by
the linguistic community. We would suggest that (1) and (2) are too general to be useful but
that (3) is closest to what would be considered to be an adequate definition. It incorporates the
notions of collection, sampling and representativeness, all of which are important to the
description of a corpus.
On the basis of the definitions provided above, there appears to be a consensus that a
corpus is an artifact; it is selected, chosen or assembled according to explicit criteria. It is
stored in electronic form. It consists of pieces of naturally occurring language. In this context,
we understand naturally occurring to mean that the pieces of language have not been
tampered with or edited. The corpus may, however, be annotated during or after the
compilation process; grammatical tags or markups (e.g. indicating text origin, authorship)
may be added to facilitate information retrieval. A corpus may be used as a "sample of the
language" (Sinclair) or because it is "representative of a given language" (Francis). A corpus
may be a collection of transcribed spoken and/or written pieces of language, contrary to what
the use of the word text might suggest.
Types of corpora
a) general reference corpora and monitor corpora
According to Sinclair (1991), a general reference corpus is not a collection of material
from different specialist areas (technical, dialectal, juvenile, etc.). It is a collection of
material which is broadly homogeneous, but which is gathered from a variety of sources so that
the individuality of a source is obscured unless the researcher isolates a particular text.
The function of a general reference corpus is to provide comprehensive information
about a language. It aims to be large enough to represent all the relevant varieties of the
language, and the characteristic vocabulary, so that it can be used as a basis for reliable
grammars, dictionaries, thesauruses and other language reference materials. (Sinclair 1994)
Within the hierarchy of corpus types, a general reference corpus appears to be the
superordinate in the hierarchy, even though it is not representative of all varieties of a
language. It is broadly homogeneous and designed to be representative of all "relevant
varieties" of the language and the "characteristic vocabulary" of a language. English appears
to lead the field in terms of the size of reference corpora available. The Bank of English, over
200 million words, and the British National Corpus, over 100 million words, are described as
general reference corpora.
A monitor corpus is one where texts are "scanned on a continuing basis, 'filtered' to
extract data for a database but not permanently archived" (Atkins et al. 1992).
According to Sinclair (1987), it is a dynamic rather than a static phenomenon,
consisting of very large amounts of electronically-held text... A certain proportion of the data
will be stored at any one time, but the bulk will necessarily be discarded after processing. The
object will be to 'monitor' such data, from various points of view, in order to record facts
about the changing nature of the language.
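
The cycle of scanning, filtering, recording and discarding can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how any actual monitor corpus is built: the function name monitor, the (period, text) input format, the naive tokenizer and the toy data are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def monitor(stream):
    """Scan (period, text) pairs and keep only word counts per period.

    The texts themselves are not stored: once the words of a text have
    been counted, the text is dropped, which mirrors the idea that the
    material is not permanently archived.
    """
    counts = defaultdict(Counter)  # hypothetical database: period -> word counts
    for period, text in stream:
        words = re.findall(r"[a-z']+", text.lower())  # naive tokenizer (assumption)
        counts[period].update(words)
    return counts

# Invented toy data, for illustration only
stream = [
    ("1990", "The fax arrived and the fax was read"),
    ("1995", "The email arrived and the email was read"),
]
database = monitor(stream)
print(database["1990"]["fax"], database["1995"]["fax"])
```

Comparing the counts for one word across periods is a very simple way of recording a fact about the changing nature of the language.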
b) subcorpora, components of corpora, specialized corpora and special corpora
Atkins et al. define subcorpus as "a subset of a corpus, either a static component of a
complex corpus or a dynamic selection from a corpus during online analysis" (1992:1). If we
have understood Atkins et al. correctly, a subcorpus may be a subset of any type of corpus,
whether it is a sample corpus, a full text corpus, a monitor corpus or a general reference
corpus. The definition does not specify whether a subcorpus must contain the same number of
genres as the corpus from which it is drawn, thereby making it a small-scale version of the
original corpus, or whether the subset can consist of, for example, just one genre, in which
case it is not a small-scale version of the original corpus. Sinclair, who states that corpora can
be divided into subcorpora, and that corpora and subcorpora can be divided into components,
defines a subcorpus as having all the properties of a corpus while being part of a larger
corpus (1994). Thus, a subcorpus must have all the properties of a larger corpus. We
understand this to mean that it is representative of the larger corpus.
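
A "dynamic selection from a corpus during online analysis" might look like the following Python sketch. The toy corpus, the genre labels and the helper select_subcorpus are invented for illustration; they are assumptions, not part of any of the definitions quoted above.

```python
# Toy corpus: each text carries a genre label (labels invented for illustration)
corpus = [
    {"id": 1, "genre": "fiction", "text": "She opened the door."},
    {"id": 2, "genre": "newspaper", "text": "Prices rose again in May."},
    {"id": 3, "genre": "fiction", "text": "The rain had stopped."},
]

def select_subcorpus(corpus, predicate):
    """Return the texts for which predicate(text) is true."""
    return [t for t in corpus if predicate(t)]

# Select one genre only at analysis time
fiction = select_subcorpus(corpus, lambda t: t["genre"] == "fiction")
print([t["id"] for t in fiction])
```

A one-genre selection such as this is a subset in the sense of Atkins et al., but it is not a small-scale version of the whole corpus.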
A component illustrates a particular type of language and is selected according to a
set of linguistic criteria that serve to characterize its linguistic homogeneity (Sinclair 1994).
It differs from a subcorpus in that it is not intended to be representative of the corpus from
which it is drawn and is therefore not necessarily an adequate sample of a language.
Sinclair uses the term specialized corpora to describe a series of smaller corpora
which were designed with various purposes in mind (1987). Special corpora are those which do
not contribute to a description of the ordinary language, either because they contain a high
proportion of unusual features, or their origins are not reliable as records of people
behaving normally (1994).
Examples of special corpora given by Sinclair (1994) are corpora of the language of
children, the language of geriatrics, the language of non-native speakers and the language of
very specialized areas of communication.
c) sample corpora and full text corpora
Early corpora are now described as sample corpora because these corpora consist of
a large number (500) of fairly short extracts (2,000 words), giving a total of around one
million words (Sinclair 1991). These were originally described simply as corpora but, with
developments in computing and the concomitant changes in corpus size and composition, it
became possible to include complete and unabridged texts. Consequently, a distinction is now
made between a corpus which comprises extracts (e.g. a sample or samples corpus) and a
corpus which contains unabridged texts (a full text corpus).
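
The size of an early sample corpus follows directly from the figures given above. The short Python sketch below checks the arithmetic and shows one possible way of cutting an extract; the helper take_extract and the choice of taking the first words of a text are assumptions, since the source does not say how extracts were cut.

```python
def take_extract(text, n_words=2000):
    """Return the first n_words words of text as one extract."""
    return " ".join(text.split()[:n_words])

n_extracts = 500        # number of extracts (figure from the text)
extract_length = 2000   # words per extract (figure from the text)
total_words = n_extracts * extract_length
print(total_words)  # 1000000

# Toy demonstration of cutting an extract from a longer text
sample = take_extract("one two three four five", n_words=3)
print(sample)
```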
d) parallel and comparable corpora
Teubert (1996:245) uses the term comparable corpora to describe corpora in two or
more languages with the same or similar composition. McEnery and Wilson (1996) describe
comparable corpora as collections of individual monolingual corpora which use the same
or similar sampling procedures and categories for each language but contain completely
different texts in several languages. Peters, Picchi and Biagini (1996) also use the term
comparable corpora to describe sets of texts from pairs or multiples of languages which can
be contrasted and compared because of their common features.
A parallel corpus, on the other hand, is a bi- or multilingual corpus that contains one
set of texts in two or more languages (Teubert 1996). According to Teubert, a parallel corpus
may contain 1) original texts written in language A and their translations into B and C, 2) an
equal amount of texts originally written in languages A and B and their respective
translations, or 3) only translations of texts into languages A, B and C where the texts were
originally written in language Z.
Classification of texts
There are two categories of criteria for the classification of texts into corpora:
1) external criteria (non-linguistic criteria), which concern the participants, the
communicative function, the occasion and the social setting
2) internal criteria (linguistic criteria), which concern the recurrence of language patterns
within the piece of language

A corpus selected entirely on external criteria would be liable to miss significant
variation among texts, since its categories are not motivated by textual, but by
contextual factors. A corpus selected entirely on internal criteria would yield no
information about the relation between language and its context of situation.

1) External criteria: genre, mode, origin and aims of the text (audience, intended outcome)
The genre category allows for distinctions to be made between different types of
written publications such as books (which may subdivide further into fiction and non-fiction),
newspapers, magazines, ephemera, correspondence, typed material (which includes all types
of reports and documentation) and manuscript material (which consists of handwritten texts).
Each of these categories may be further subdivided if necessary. There is no single universal
system for classifying genre and no set of universally agreed specifications for each particular
genre. Consequently, each corpus project tends to have its own method of classifying genre.
Mode is used to describe in what form a text was originally produced i.e. whether it is
a transcription of the spoken word or whether it was originally in written form. Sinclair and
Ball (1995) recommend the addition of a third category, electronic, to cater for texts
transmitted in electronic media, because the language used may be different from that used in
the older established modes. Electronic texts would include e-mail, discussions in
newsgroups, etc.
Origin allows compilers to indicate who has been involved in the production of a text.
These may include the author, editor, publisher, rights holder, translator and adapter.
Compilers may choose to include further information about the originator(s) such as their age,
sex, language background and nationality.
The aims of the text include considerations about the target audience and the intended
outcome of the text.
Audience may include details about audience size and constituency, the latter
subdividing into general public, informed lay people, professional people, specialists, students
and trainees. It may be considered useful to specify the relationship between the author and
reader, whether distant, neutral or personal.
The intended outcome is the purpose for which a text is written and includes the
following categories: information, discussion, recommendation, recreation (which includes
fiction and non-fiction) and instruction (which includes academic works, textbooks and
practical books).
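
In practice, external criteria of this kind are often recorded as a small metadata record attached to each text. The Python sketch below shows one possible shape for such a record. The class name TextRecord, its field names and the example values are all hypothetical; as noted above, each corpus project tends to have its own scheme.

```python
from dataclasses import dataclass, field

@dataclass
class TextRecord:
    """External (non-linguistic) description of one text in a corpus."""
    title: str
    genre: str                      # e.g. "book", "newspaper", "correspondence"
    mode: str                       # "spoken", "written" or "electronic"
    origin: dict = field(default_factory=dict)  # e.g. {"author": ..., "publisher": ...}
    audience: str = "general public"
    relationship: str = "neutral"   # "distant", "neutral" or "personal"
    outcome: str = "information"

    def __post_init__(self):
        # Only the three modes discussed in the text are accepted here
        allowed = {"spoken", "written", "electronic"}
        if self.mode not in allowed:
            raise ValueError(f"mode must be one of {sorted(allowed)}")

record = TextRecord(
    title="E-mail message (invented example)",
    genre="correspondence",
    mode="electronic",
    origin={"author": "unknown"},
    audience="specialists",
    outcome="discussion",
)
print(record.mode, record.outcome)
```

Checking the mode against a closed list is a design choice made for the example; the other fields are left as free text because no universally agreed set of values exists for them.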
2) Internal criteria: topic and style
Topic, as Sinclair and Ball state, "is one of the central controversial areas of text typology"
(1995). It is also considered to be a very important criterion in the classification of texts in
corpora. Topic may be identified by looking at what a particular text is about (e.g. on the basis
of its title or, in the case of a book, its table of contents) and classifying the text accordingly.
However, to classify texts in this way is to ignore the fact that texts may deal with more topics
than the one specified in the title or indeed in the table of contents. Phillips (1983) suggested
that the topic of a text could be identified by examining the lexical structure of a text and
identifying keywords used frequently in the text. This type of approach is already being used
in some abstracting and information retrieval techniques. However, Sinclair and Ball (1995)
suggest that the corpus community should agree to use a list of topics which should be varied
and extended to suit the researcher's priorities: the life of the mind, culture, the physical
world, living things, society, manufacture, communications.
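
The idea of identifying topic from frequently used keywords can be illustrated with a very small Python sketch. Raw word frequency with a stopword filter is used here as a simple stand-in; it is not Phillips's actual method, and the function keywords, the tiny stopword list and the sample text are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "are", "to", "for", "by"}

def keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    content = [w for w in words if w not in STOPWORDS]
    return [w for w, _ in Counter(content).most_common(top_n)]

text = (
    "The corpus is a sample of the language. A corpus is selected by "
    "explicit criteria, and the criteria are linguistic. The corpus "
    "is stored in electronic form."
)
print(keywords(text, top_n=2))
```

Unlike a title or table of contents, a list of this kind is derived from the whole text, so it can bring out topics that the title does not mention.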

Style is a notorious term, because it is used in so many different ways by researchers
from several disciplines, and has popular meanings as well. It is used here to mean the way
texts are differentiated other than by topic. Hitherto, the corpus community has used
categories such as formal, informal or colloquial to classify text style but, as Sinclair and Ball
point out, there are no institutionalised schemata (1995) for these categories. One person's
formal may be another's informal and what may be considered formal in speech might be
considered to be informal in written text.
