Abstract
Texts vary not only by topic, but by style; indeed, often the
variation between texts about the same thing can be just
as noticeable as the variation between texts about different
things. Some facets of this variation are quite easy to detect, and quite predictable when applied to categorization
of texts by genre, functional style, or - tentatively - quality.
Making use of such variation in a retrieval context is
quite straightforward in principle; our work consists of
an implementation of a visualization tool for document
databases.
The issues addressed include 1) choice of stylistic items
to investigate, 2) composition of dimensions of variation,
and 3) judicious naming of dimensions for presentation. We
use principal components analysis to combine our quite
large number of stylistic items into the two most significant dimensions of variation and plot the document space under
consideration into a plane. This space can be used as a first
or last filter in an information retrieval task.
The composition of the most significant dimensions is
naturally corpus dependent, as is the naming of them: our
work is tested on Internet and TREC data.
1 Stylistic Variation
Texts vary not only by topic, but by style; indeed, often the
variation between texts about the same thing can be just
as noticeable as the variation between texts about different things. Human readers process a multitude of stylistic
markers, where each one of them taken separately will be
almost meaningless, to categorize texts in functional styles
or genres, or to assess their position along some continuum
of stylistic variation. Some markers of this type are quite
easy to identify and compute. We are most interested in
examining the stylistic variation based on the specific genres or functional styles (Vachek, 1975) that can be found
in electronically published documents as opposed to very
subjective or situation-specific measures such as individual
style or even writing quality.
Methods such as ours have previously been used, with some success, for authorship determination in cases where documents have unknown or disputed authors, and, with a lesser degree of success, for readability measurement of educational and mass-market reading materials. Conceivably, similar methods could be used for quality determination: determining which of two texts about the same subject
in the same genre is the better text in some or any sense.
Variable name  Statistic                                    Typical Range
WORDS          Text length in words                         31-9228
TT             Type token ratio                             0.13-0.89
CPW            Average word length in characters            4.59-9.95
WPS            Average sentence length in words             2.45-63.1
P1             Proportion first person pronouns of words    0-105
P2             Proportion second person pronouns of words   0-20
P3             Proportion third person pronouns of words    0-60
IT             Proportion it of words                       0-44
NT             Proportion contractions: I'll, you're, etc.  0-33
4 Stylistic Items
5 Text materials
We want to weigh together information from a large number of stylistic items (style markers, or parameters), each of which, taken by itself, will typically be inconsequential. Combining parameters by weighting them together is a common
problem in many branches of science, and there is a battery
of algorithms to do so automatically. It is debatable whether
simple linear score combinations of textual measurements
capture the rather complex underlying interdependencies
we aim to measure, but to investigate the power of the variables tested, we elected to take a cautious approach, and
to make a minimal number of assumptions about the data.1
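As an illustration of one such algorithm, the following is a minimal sketch of principal components analysis in plain Python, not the statistical package used in the experiments. It standardizes each item, forms the correlation matrix, and extracts the two leading components by power iteration with deflation. All function names and the sample measurements are hypothetical; the sketch also assumes that no item is constant across the documents, and the signs of the loadings are arbitrary.

```python
import math

def standardize(rows):
    """Rescale every column to zero mean and unit sample variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - mu) ** 2 for x in c) / (n - 1))
           for c, mu in zip(cols, means)]
    return [[(x - mu) / sd for x, mu, sd in zip(row, means, sds)]
            for row in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, m = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(m)]
            for i in range(m)]

def leading_eigenpair(a, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a."""
    m = len(a)
    v = [1.0 / (i + 1) for i in range(m)]  # deliberately asymmetric start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v
    value = sum(v[i] * a[i][j] * v[j] for i in range(m) for j in range(m))
    return value, v

def principal_components(rows, k=2):
    """Return [(proportion of variance, loadings)] and per-document scores."""
    z = standardize(rows)
    a = correlation_matrix(z)
    m = len(a)
    total = sum(a[i][i] for i in range(m))
    components = []
    for _ in range(k):
        value, v = leading_eigenpair(a)
        components.append((value / total, v))
        # Deflate: remove the component just found from the matrix.
        a = [[a[i][j] - value * v[i] * v[j] for j in range(m)]
             for i in range(m)]
    scores = [[sum(x * w for x, w in zip(row, v)) for _, v in components]
              for row in z]
    return components, scores

# Hypothetical measurements (WORDS, TT, CPW) for five documents.
docs = [
    [120, 0.62, 4.8],
    [950, 0.41, 5.6],
    [300, 0.55, 5.1],
    [4100, 0.22, 6.3],
    [60, 0.80, 4.6],
]
components, scores = principal_components(docs, k=2)
for proportion, loadings in components:
    print(round(proportion, 3), [round(w, 3) for w in loadings])
```

Each row of `scores` gives one document's coordinates on the two components, which is what would be plotted in a plane.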
In general, the items under consideration reflect variation
of various kinds: lexical - where texts about the same subject can treat it with technical or lay vocabulary; syntactic - where complex syntax may reflect more complex ideas or
reasoning about a given subject (Menshikov, 1974; Losee,
1996); textual - where texts can be in-depth treatments of a topic
or overviews over several topics. An item, naturally, may,
and most often will, relate to variation of several kinds simultaneously: "therefore" signifies a certain lexical choice
as compared to "thus" but also a certain textual progression
as compared to "and"; "tortious interference" not only has
a different flavor than "bad influence" but may suggest a different genre.
New York University participates in the Text Retrieval Conference (TREC) information retrieval evaluation project2
jointly with General Electric, Rutgers University, and Lockheed Martin. We have experimented with web retrieval using some TREC tasks; while the TREC tasks are designed
to be used on the TREC database, which consists mainly
of journalistic material, they are well known in the information retrieval community. We ran a set of typical TREC
queries on the Altavista search engine3 and retrieved the
top 60 returned pages. These vary considerably in style.
We will use the query "What is the economic impact of recycling tires?" as an example in the following discussion.
This is a very small text corpus for this sort of experiment,
and the results should be understood to illustrate the techniques used, rather than provide any information about texts
on the Internet.
Each text in the test material is processed to obtain,
among others, the statistics for the items listed in table 1. The items are suggested by classic readability studies (Chall, 1948; Klare, 1963), our previous experiments
(Karlgren and Cutting, 1994; Karlgren, 1996), or by previous work on the computational study of textual variation
(Biber, 1988, 1989).
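The items in table 1 are simple enough to compute directly. The sketch below is one possible way to do so in Python and is not the processing used in the experiments: the tokenization rules, the pronoun lists, and the per-thousand-words scale for the proportions are all assumptions made for illustration, and the function name is hypothetical.

```python
import re

# Assumed pronoun lists; the source does not spell these out.
FIRST = {"i", "me", "my", "mine", "myself",
         "we", "us", "our", "ours", "ourselves"}
SECOND = {"you", "your", "yours", "yourself", "yourselves"}
THIRD = {"he", "him", "his", "himself", "she", "her", "hers", "herself",
         "they", "them", "their", "theirs", "themselves"}

def stylistic_items(text):
    """Compute the nine stylistic items for one text."""
    # Assumption: a sentence ends at ., ! or ?; a word is a run of
    # letters, optionally followed by an apostrophe and more letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.lower() for w in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)]
    n = len(words)
    if n == 0:
        raise ValueError("text contains no words")
    # Strip contraction suffixes so that "you're" counts as "you".
    stems = [w.split("'")[0] for w in words]

    def per_thousand(count):
        # Assumed scale: occurrences per thousand words.
        return 1000.0 * count / n

    return {
        "WORDS": n,
        "TT": len(set(words)) / n,
        "CPW": sum(len(w) for w in words) / n,
        "WPS": n / len(sentences),
        "P1": per_thousand(sum(s in FIRST for s in stems)),
        "P2": per_thousand(sum(s in SECOND for s in stems)),
        "P3": per_thousand(sum(s in THIRD for s in stems)),
        "IT": per_thousand(stems.count("it")),
        "NT": per_thousand(sum("'" in w for w in words)),
    }

sample = "I think you're right. It is late, and they know it!"
items = stylistic_items(sample)
for name, value in items.items():
    print(name, round(value, 2))
```

A different tokenizer or pronoun list would shift the absolute values, so measurements are only comparable between texts processed in the same way.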
6
1 The standard techniques used in these experiments (principal components analysis, factorial analysis, and discriminant analysis) make assumptions about the distributions of the variables under consideration.
Specifically, if nothing is specified, the algorithms assume a variable is
normally distributed. These assumptions are unfounded for linguistic data
such as the stylistic items in our experiments, and could give misleading
results if the variables diverge significantly from the normal distribution.
There exist no standard methods for examining multivariate distributions
without making assumptions about the variable distributions; in this case
we have tested each of the items individually using the Mann-Whitney U test
in other experiments, and found them useful and reliable (Karlgren, 1996);
we still have no method for treating their variation.
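For readers unfamiliar with the test mentioned in the note above, the following minimal Python sketch computes the Mann-Whitney U statistic for a single item measured in two groups of texts. It only illustrates the rank-based idea, which makes no normality assumption; it does not compute a significance level, and the function name and the sample values are hypothetical.

```python
def mann_whitney_u(xs, ys):
    """Return (U_x, U_y) for two samples of one stylistic item.

    U_x counts the pairs (x, y) in which x exceeds y, with ties
    counted as one half; U_x + U_y always equals len(xs) * len(ys).
    """
    u_x = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    u_y = len(xs) * len(ys) - u_x
    return u_x, u_y

# Hypothetical average sentence lengths for texts of two genres.
genre_a = [12.1, 15.4, 18.0, 21.3]
genre_b = [22.5, 25.0, 30.2]
print(mann_whitney_u(genre_a, genre_b))
```

A U value close to zero for one group means that almost every text in that group scores lower on the item than almost every text in the other group, which is what makes the item useful for telling the groups apart.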
So, how do we combine the variation of these items, to distinguish functional styles or genres from each other? We
may simply pick a couple of parameters from the table,
and plot them against each other. A useful strategy might be
to pick a couple of parameters with a seemingly high spread,
and see what the graph looks like. We find some examples
which seem to disperse the material quite well, as in figure 1
and some which let the texts stick together into a corner of
2 http://potomac.ncsl.nist.gov/TREC/.
3 http://www.altavista.digital.com
Variable    PRIN1      PRIN2      PRIN3      PRIN4      PRIN5
WORDS       0.392610   0.188773  -0.320255   0.144107  -0.212966
TT         -0.335365   0.035366   0.447046  -0.009935   0.539090
CPW        -0.123467   0.531823   0.364009   0.697326  -0.248076
WPS        -0.074589   0.627062   0.189015  -0.687406  -0.299164
P1          0.402501   0.205379   0.096396  -0.003460   0.420170
P2          0.268218  -0.386261   0.427624   0.003216  -0.507121
P3          0.447032   0.128708  -0.004284  -0.020343   0.169437
IT          0.437037   0.178483   0.025705   0.053218   0.224331
NT          0.296300  -0.217421   0.580106  -0.130678   0.015447
Proportion  0.500556   0.142114   0.117911   0.087368   0.076375
error messages, technical texts, journalistic texts, commercial texts, legal texts, announcements, forms, and various
other textual and non-textual material. Since the material
is small, we divide the material into four quite broad categories: proper text (white triangles, 23), database listings
and lists of links (black triangles, 15), governmental announcements (black circles, 11), and error messages (black
squares, 2).5
The graphs displayed in the above sections show us
that the genres we defined emerge quite nicely in figure 1,
whereas the pattern is much less clear in the other two figures.
Conclusions
To get explanatory power, a genre analysis of the exemplified kind must be designed to make use of informative
dimensions of textual variation. The algorithmically best
choices may be too dependent on the variables chosen to
give a useful and explanatorily powerful display of the textual material at hand.
References
Douglas Biber. 1988. Variation across speech and writing. Cambridge University Press.
Douglas Biber. 1989. A typology of English texts. Linguistics, 27:3-43.
Jussi Karlgren and Douglass Cutting. 1994. Recognizing
Text Genres with Simple Metrics Using Discriminant
niques to do the same. We would then end up with the same problem as
for factorial analysis or principal components analysis: we would have descriptively interesting categories which would be difficult to explain to the
reader. We will here go with sloppily manually defined categories.
5 Some documents were duplicates, and thus the total is 51 rather than
60.
Figure 2: Plot of first person pronoun content versus average sentence length