notions of collection, sampling and representativeness, all of which are important to the
description of a corpus.
On the basis of the definitions provided above, there appears to be a consensus that a
corpus is an artifact; it is selected, chosen or assembled according to explicit criteria. It is
stored in electronic form. It consists of pieces of naturally occurring language. In this context,
we understand naturally occurring to mean that the pieces of language have not been
tampered with or edited. The corpus may, however, be annotated during or after the
compilation process; grammatical tags or markups (e.g. indicating text origin, authorship)
may be added to facilitate information retrieval. A corpus may be used as a "sample of the
language" (Sinclair) or because it is "representative of a given language" (Francis). A corpus
may be a collection of transcribed spoken and/or written pieces of language, contrary to what
the use of the word text might suggest.
Types of corpora
a)
defines a subcorpus as having all the properties of a corpus while happening to be part of a
larger corpus (1994). Thus, a subcorpus must have all the properties of a larger corpus. We
understand this to mean that it is representative of the larger corpus.
A component illustrates a particular type of language and is selected according to a
set of linguistic criteria that serve to characterize its linguistic homogeneity (Sinclair 1994).
It differs from a subcorpus in that it is not intended to be representative of the corpus from
which it is drawn and is therefore not necessarily an adequate sample of a language.
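The idea of a component as a homogeneous selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Text` record, its `variety` label and the three sample texts are hypothetical and do not come from Sinclair. The point is that a component is obtained by filtering on a criterion, with no attempt to keep the proportions of the parent corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    # Hypothetical minimal record: an identifier, a language-variety label
    # assigned by the compiler, and the text itself.
    text_id: str
    variety: str
    content: str


def select_component(corpus, variety):
    """Return the texts in `corpus` that share the given language variety."""
    return [t for t in corpus if t.variety == variety]


# Invented sample texts, for illustration only.
corpus = [
    Text("t1", "legal", "The tenant shall pay the rent."),
    Text("t2", "conversation", "Well, I dunno really."),
    Text("t3", "legal", "The lessor may terminate the lease."),
]

legal_component = select_component(corpus, "legal")
print([t.text_id for t in legal_component])
```

A subcorpus, by contrast, would have to be drawn so that it remains representative of the whole corpus, which a simple filter like this does not guarantee.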
Sinclair uses the term specialized corpora to describe a series of smaller corpora
which were designed with various purposes in mind (1987), and those which do not contribute
to a description of the ordinary language, either because they contain a high proportion of
unusual features or because their origins are not reliable as records of people behaving
normally (1994).
Examples of special corpora given by Sinclair (1994) are corpora of the language of
children, the language of geriatrics, the language of non-native speakers and the language of
very specialized areas of communication.
c)
1) External criteria: genre, mode, origin and aims of the text (audience, intended outcome)
The genre category allows for distinctions to be made between different types of
written publications, such as books (which may subdivide further into fiction and non-fiction),
newspapers, magazines, ephemera, correspondence, typed material (which includes all types
of reports and documentation) and manuscript material (which consists of handwritten texts).
Each of these categories may be further subdivided if necessary. There is no single universal
system for classifying genre and no set of universally agreed specifications for each particular
genre. Consequently, each corpus project tends to have its own method of classifying genre.
Mode describes the form in which a text was originally produced, i.e. whether it is
a transcription of the spoken word or was originally in written form. Sinclair and
Ball (1995) recommend the addition of a third category, electronic, to cater for texts
transmitted in electronic media, because the language used may be different from that used in
the older established modes. Electronic texts would include e-mail, discussions in
newsgroups, etc.
Origin allows compilers to indicate who has been involved in the production of a text.
These may include the author, editor, publisher, rights holder, translator and adapter.
Compilers may choose to include further information about the originator(s) such as their age,
sex, language background and nationality.
The aims of the text include considerations about the target audience and the intended
outcome of the text.
Audience may include details about audience size and constituency, the latter
subdividing into general public, informed lay people, professional people, specialists, students
and trainees. It may be considered useful to specify the relationship between the author and
reader, whether distant, neutral or personal.
The intended outcome is the purpose for which a text is written and includes the
following categories: information; discussion; recommendation; recreation, which includes
fiction and non-fiction; and instruction, which includes academic works, textbooks and
practical books.
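The external criteria described above amount to a metadata record attached to each text. The following Python sketch shows one possible way to hold such a record. The class and field names (`ExternalCriteria`, `Mode`, `Outcome`) are assumptions made for illustration, not a scheme prescribed by Sinclair and Ball. Genre is kept as free text because, as noted above, there is no universally agreed genre classification.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    SPOKEN = "spoken"
    WRITTEN = "written"
    ELECTRONIC = "electronic"  # third category recommended by Sinclair and Ball (1995)


class Outcome(Enum):
    # Intended-outcome categories listed in the text above.
    INFORMATION = "information"
    DISCUSSION = "discussion"
    RECOMMENDATION = "recommendation"
    RECREATION = "recreation"
    INSTRUCTION = "instruction"


@dataclass
class ExternalCriteria:
    genre: str        # free text: each corpus project tends to classify genre its own way
    mode: Mode
    outcome: Outcome
    audience: str     # e.g. "general public", "specialists", "students"
    # Originators of the text, e.g. {"author": "...", "publisher": "..."}.
    origin: dict = field(default_factory=dict)


# Hypothetical record for a newsgroup posting.
record = ExternalCriteria(
    genre="newsgroup discussion",
    mode=Mode.ELECTRONIC,
    outcome=Outcome.DISCUSSION,
    audience="informed lay people",
    origin={"author": "unknown"},
)
print(record.mode.value, record.outcome.value)
```

Using enumerations for mode and intended outcome keeps those closed sets consistent across a corpus, while the open-ended fields remain plain strings.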
2) Internal criteria: topic and style
Topic, as Sinclair and Ball state, "is one of the central controversial areas of text typology"
(1995). It is also considered to be a very important criterion in the classification of texts in
corpora. Topic may be identified by looking at what a particular text is about (e.g. on the basis
of its title or, in the case of a book, its table of contents) and classifying the text accordingly.
However, to classify texts in this way is to ignore the fact that texts may deal with more topics
than the one specified in the title or indeed in the table of contents. Phillips (1983) suggested
that the topic of a text could be identified by examining the lexical structure of a text and
identifying keywords used frequently in the text. This type of approach is already being used
in some abstracting and information retrieval techniques. However, Sinclair and Ball (1995)
suggest that the corpus community should agree to use a list of topics, which should be varied
and extended to suit the researcher's priorities: the life of the mind, culture, the physical
world, living things, society, manufacture, communications.
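The keyword-based view of topic attributed to Phillips (1983) above can be illustrated with a very rough Python sketch. It is not Phillips's actual procedure, which examined the lexical structure of a text. It is a simplified stand-in that only counts frequent content words. The stopword list, the function name `frequent_keywords` and the sample text are all assumptions made for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a much fuller one.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "are", "on", "it"}


def frequent_keywords(text, n=3):
    """Return the n most frequent non-stopword tokens in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]


# Invented sample text, for illustration only.
sample = (
    "The corpus is a sample of language. A corpus is stored in electronic "
    "form, and a corpus may be annotated."
)
print(frequent_keywords(sample, 1))  # ['corpus']
```

A frequency count of this kind can surface topics that the title or table of contents does not mention, which is the limitation of title-based classification noted above.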