
Visualizing Stylistic Variation

Jussi Karlgren and Troy Straszheim


Courant Institute of the Mathematical Sciences
New York University
karlgren,troys@cs.nyu.edu

Abstract
Texts vary not only by topic, but by style; indeed, often the
variation between texts about the same thing can be just
as noticeable as the variation between texts about different
things. Some facets of this variation are quite easy to detect, and quite predictable when applied to categorization
of texts by genre, functional style, or - tentatively - quality.
Making use of such variation in a retrieval context is
quite straightforward in principle; our work consists of
an implementation of a visualization tool for document
databases.
The issues addressed include 1) choice of stylistic items
to investigate, 2) composition of dimensions of variation,
and 3) judicious naming of dimensions for presentation. We
use principal components analysis to combine our quite
large number of stylistic items into the two most significant
dimensions of variation and plot the document space under
consideration into a plane. This space can be used as a first
or last filter in an information retrieval task.
The composition of the most significant dimensions is
naturally corpus dependent, as is the naming of them: our
work is tested on Internet and TREC data.

1 Stylistic Variation
Texts vary not only by topic, but by style; indeed, often the
variation between texts about the same thing can be just
as noticeable as the variation between texts about different things. Human readers process a multitude of stylistic
markers, where each one of them taken separately will be
almost meaningless, to categorize texts in functional styles
or genres, or to assess their position along some continuum
of stylistic variation. Some markers of this type are quite
easy to identify and compute. We are most interested in
examining the stylistic variation based on the specific genres or functional styles (Vachek, 1975) that can be found
in electronically published documents as opposed to very
subjective or situation-specific measures such as individual
style or even writing quality.

Methods such as ours have previously been used, with some
success, for authorship determination in cases where documents
have unknown or disputed authors, and, with a lesser degree of
success, for readability measurement of educational and
mass-market reading materials. Conceivably, similar methods
could be used for quality determination: determining which of
two texts about the same subject in the same genre is the
better text in some or any sense.

2 Text in Uniform Guise

Digital information technology has been vectored towards
the production of information, and the publishing threshold
for information has been lowered dramatically over the past
few hundred years. By contrast, comparatively little work
has been put into tools for the consumer. Indeed, many
of the markers, such as paper quality, typesetting, and even
spelling, that readers have previously been able to use to
distinguish the New York Times from home-produced handouts
have been neutralized through the advent of inexpensive
proofreading tools and the World Wide Web. On the
Internet the publishing threshold is very low, and the usefulness
of the abundance is offset by the less than perspicuous
variation in quality, provenance, and author intentions.

3 Aim of these experiments

This paper will describe some experiments made as
groundwork for building a tool which will display a set of texts
as points on a plane, scattered according to stylistic criteria.
We will not go into the experiments in every detail, but
we will attempt to describe how we motivate the more important
design choices we make. Our hypotheses are that
there are important stylistic cues in electronically published
texts; that these cues can be used for categorizing or sorting
documents in an interactive information retrieval scenario;
and that the stylistic variation can most handily be explained in
terms of genres.

Proceedings of The Thirtieth Annual Hawaii International Conference on System Sciences, ISBN 0-8186-7862-3/97 $17.00 1997 IEEE
1060-3425/97 $10.00 (c) 1997 IEEE

Variable name  Statistic                                     Typical range
WORDS          Text length in words                          31-9228
TT             Type token ratio                              0.13-0.89
CPW            Average word length in characters             4.59-9.95
WPS            Average sentence length in words              2.45-63.1
P1             Proportion first person pronouns of words     0-105
P2             Proportion second person pronouns of words    0-20
P3             Proportion third person pronouns of words     0-60
IT             Proportion "it" of words                      0-44
NT             Proportion contractions: I'll, you're, etc.   0-33

Table 1: Stylistic items under consideration

4 Stylistic Items

We want to weigh together information from a large number of
stylistic items, or style markers: parameters where
typically each taken by itself will be inconsequential. Combining
parameters by weighting them together is a common
problem in many branches of science, and there is a battery
of algorithms to do so automatically. It is debatable whether
simple linear score combinations of textual measurements
capture the rather complex underlying interdependencies
we aim to measure, but to investigate the power of the variables
tested, we elected to take a cautious approach, and
to make a minimal number of assumptions about the data.¹
In general, the items under consideration reflect variation
of various kinds: lexical - where texts about the same subject
can treat it with technical or lay vocabulary; syntactic - where
complex syntax may reflect more complex ideas or
reasoning about a given subject (Menshikov, 1974; Losee,
1996); textual - where texts can be in-depth treatments of a topic
or overviews over several topics. An item, naturally, may,
and most often will, relate to variation of several kinds
simultaneously: "therefore" signifies a certain lexical choice
as compared to "thus" but also a certain textual progression
as compared to "and"; "tortious interference" not only has
a different flavor than "bad influence" but may suggest a
different genre.

5 Text materials

New York University participates in the Text Retrieval Conference
(TREC) information retrieval evaluation project²
jointly with General Electric, Rutgers University, and Lockheed
Martin. We have experimented with web retrieval using some TREC
tasks; while the TREC tasks are designed
to be used on the TREC database, which consists mainly
of journalistic material, they are well known in the information
retrieval community. We ran a set of typical TREC
queries on the Altavista search engine³ and retrieved the
top 60 returned pages. These vary considerably in style.
We will use the query "What is the economic impact of recycling
tires?" as an example in the following discussion.
This is a very small text corpus for this sort of experiment,
and the results should be understood to illustrate the techniques
used, rather than provide any information about texts
on the Internet.
Each text in the test material is processed to obtain,
among other things, the statistics for the items listed in table 1.
The items are suggested by classic readability studies
(Chall, 1948; Klare, 1963), our previous experiments
(Karlgren and Cutting, 1994; Karlgren, 1996), or by previous work on the computational study of textual variation
(Biber, 1988, 1989).
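
As an illustration, the items in table 1 could be computed for a single text roughly as follows. This is a minimal sketch, not the implementation used in the experiments: the function name `stylistic_items`, the regular expressions for words and sentence boundaries, and the pronoun lists are all our assumptions, since the paper does not specify them. The scale of the proportions in table 1 is also not stated, so the sketch returns plain fractions of the word count.

```python
import re

# Hypothetical pronoun lists; which forms were actually counted is not stated.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
THIRD_PERSON = {"he", "him", "his", "himself", "she", "her", "hers",
                "herself", "they", "them", "their", "theirs", "themselves"}

# Assumed tokenization: a word is a run of letters, optionally followed by
# an apostrophe and more letters (so "don't" is one word).
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
# Assumed sentence boundary: any run of '.', '!' or '?'.
SENTENCE_RE = re.compile(r"[.!?]+")


def stylistic_items(text):
    """Return the nine items of table 1 for one text, as a dict."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    n = len(words)
    if n == 0:
        raise ValueError("text contains no words")
    # Keep only the chunks between boundaries that contain at least one word.
    sentences = [s for s in SENTENCE_RE.split(text) if WORD_RE.search(s)]

    def proportion(predicate):
        """Fraction of the words for which predicate(word) is true."""
        return sum(1 for w in words if predicate(w)) / n

    return {
        "WORDS": n,
        "TT": len(set(words)) / n,             # distinct words / all words
        "CPW": sum(len(w) for w in words) / n,  # apostrophes count as characters
        "WPS": n / len(sentences),
        "P1": proportion(lambda w: w in FIRST_PERSON),
        "P2": proportion(lambda w: w in SECOND_PERSON),
        "P3": proportion(lambda w: w in THIRD_PERSON),
        "IT": proportion(lambda w: w == "it"),
        # Crude approximation: any word with an apostrophe counts as a
        # contraction, so possessives such as "John's" are counted too.
        "NT": proportion(lambda w: "'" in w),
    }
```

Calling `stylistic_items` on each retrieved page yields one row per text; stacking the rows gives the texts-by-items matrix that the later analysis operates on.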

¹ The standard techniques used in these experiments (principal
components analysis, factor analysis, and discriminant analysis)
make assumptions about the distributions of the variables under consideration.
Specifically, if nothing is specified, the algorithms assume a variable is
normally distributed. These assumptions are unfounded for linguistic data
such as the stylistic items in our experiments, and could give misleading
results if the variables diverge significantly from the normal distribution.
There exist no standard methods for examining multivariate distributions
without making assumptions about the variable distributions; in this case
we have tested each of the items individually using the Mann-Whitney U test
in other experiments, and found them useful and reliable (Karlgren, 1996);
we still have no method for treating their variation.

6 Using the Stylistic Items

So, how do we combine the variation of these items to distinguish
functional styles or genres from each other? We
may simply pick a couple of parameters from the table
and plot them against each other. A useful strategy might be
to pick a couple of parameters with a seemingly high spread,
and see what the graph looks like. We find some examples
which seem to disperse the material quite well, as in figure 1,
and some which let the texts stick together in a corner of
the graph to a much higher extent, as in figure 2.

² http://potomac.ncsl.nist.gov/TREC/.
³ http://www.altavista.digital.com

Variable     PRIN1      PRIN2      PRIN3      PRIN4      PRIN5
WORDS        0.392610   0.188773  -0.320255   0.144107  -0.212966
TT          -0.335365   0.035366   0.447046  -0.009935   0.539090
CPW         -0.123467   0.531823   0.364009   0.697326  -0.248076
WPS         -0.074589   0.627062   0.189015  -0.687406  -0.299164
P1           0.402501   0.205379   0.096396  -0.003460   0.420170
P2           0.268218  -0.386261   0.427624   0.003216  -0.507121
P3           0.447032   0.128708  -0.004284  -0.020343   0.169437
IT           0.437037   0.178483   0.025705   0.053218   0.224331
NT           0.296300  -0.217421   0.580106  -0.130678   0.015447
Proportion   0.500556   0.142114   0.117911   0.087368   0.076375

Table 2: First principal components

Now, we know that each of these factors is of little consequence taken alone: even when they may have quite high
descriptive power, using them for diagnostics is a risky
proposition. Random variation, and more distressingly,
nonrandom intentional variation may obscure or obfuscate
the variation we are interested in. Thus using a combination of factors may be a better idea. As mentioned above,
there are standard methods for extracting linear combinations of several variables that covary over a set of objects of
study; using principal components analysis we find the relative
variable weightings displayed in table 2. The principal components
are linear combinations of the various variables under study;
the weights indicate the relative importance of the variables.
The variables are normalized first,
so that their scale of variation will be similar. The proportion
row indicates how much of the total variation this
component covers: in our case, the first component covers
half of the total variation, and the second 14 per cent.
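
The computation behind a table of this kind can be sketched as follows. This is an illustration under stated assumptions, not the software used for table 2, which the paper does not name: it assumes NumPy is available, and the function name `principal_components` and its parameters are ours. Each variable is first normalized to zero mean and unit variance, the eigenvectors of the resulting correlation matrix give the weights, and each eigenvalue divided by the sum of all eigenvalues gives the proportion of total variation covered by that component.

```python
import numpy as np


def principal_components(X, n_components=2):
    """Principal components analysis on normalized variables.

    X is an (n_texts, n_items) array with one row per text.
    Returns (weights, proportions, scores):
      weights     (n_items, n_components) one column of weights per component
      proportions (n_components,) share of total variation per component
      scores      (n_texts, n_components) coordinates of each text
    """
    X = np.asarray(X, dtype=float)
    # Normalize each variable so that the scales of variation are similar.
    # A constant column has zero standard deviation and would yield NaN here.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance of the normalized variables = correlation matrix of X.
    R = np.cov(Z, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    weights = eigenvectors[:, order]
    proportions = eigenvalues[order] / eigenvalues.sum()
    # Project every text onto the selected components.
    scores = Z @ weights
    return weights, proportions, scores
```

With the texts-by-items matrix as `X`, the two columns of `scores` are the coordinates one would plot on a plane. The sign of each column of weights is arbitrary, so a component and its mirror image are equivalent.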
Plotting the texts with the first two principal components against each
other we get the graph in figure 3. The problem with this
otherwise interesting plot is that it may not be immediately
useful for information retrieval. The dimensions are not
readily translatable to plain English descriptors. This is
where genres come in handy.

7 Genres and Stylistic Items


There are no objectively defined genres for our type of material; what genres we want to make use of will depend on
the domain of discourse, the data we have recourse to, and
what stylistic items we have chosen. Above all, they will
depend on reader preferences or our perception of the readers' information needs.
For the purposes of this experiment we make a rough
hand-categorization of the texts.⁴ We find database listings,
error messages, technical texts, journalistic texts, commercial
texts, legal texts, announcements, forms, and various
other textual and non-textual material. Since the material
is small, we divide the material into four quite broad categories:
proper text (white triangles, 23), database listings
and lists of links (black triangles, 15), governmental
announcements (black circles, 11), and error messages (black
squares, 2).⁵
The graphs displayed in the above sections show us
that the genres we defined emerge quite nicely in figure 1,
whereas the pattern is much less clear in the other two figures.

⁴ Conceivably we could use automatic methods such as clustering
techniques to do the same. We would then end up with the same problem as
for factor analysis or principal components analysis: we would have
descriptively interesting categories which would be difficult to explain to the
reader. We will here go with sloppily, manually defined categories.
⁵ Some documents were duplicates, and thus the total is 51 rather than
60.

8 Conclusions

To get explanatory power, a genre analysis of the exemplified
kind must be designed to make use of informative
dimensions of textual variation. The algorithmically best
choices may be too dependent on the variables chosen to
give a useful and explanatorily powerful display of the textual
material at hand.

References
Douglas Biber. 1988. Variation across speech and writing. Cambridge University Press.
Douglas Biber. 1989. A typology of English texts. Linguistics, 27:3-43.
Jussi Karlgren and Douglass Cutting. 1994. Recognizing
Text Genres with Simple Metrics Using Discriminant
Analysis. Proceedings of 15th International Conference on Computational Linguistics (COLING), Kyoto.
(In the Computation and Language E-Print Archive:
cmp-lg/9410008).
Jussi Karlgren. 1996. Stylistic Variation in an Information Retrieval Experiment. In Proceedings of The Second International Conference on New Methods in Language Processing - NeMLaP 2, Bilkent, September
1996. Ankara: Bilkent University.
George R. Klare. 1963. The Measurement of Readability.
Iowa Univ. Press.
Robert M. Losee. forthcoming. Text Windows and
Phrases Differing by Discipline, Location in Document, and Syntactic Structure. Information Processing and Management. (In the Computation and Language E-Print Archive: cmp-lg/9602003).
I. I. Menshikov. 1974. K voprosu o zhanrovo-stilevoy
obuslovlennosti sintaksicheskoy struktury frazy.
(On genre-dependent stylistic variation of the syntactic structure in the clause) In Voprosy statisticheskoy
stilistiki. Golovin et al. (eds.) 1974. Kiev: Naukova
dumka; Akademia Nauk Ukrainskoy SSR.
Josef Vachek. 1975. Some remarks on functional dialects of standard languages. In Style and Text - Studies presented to Nils Erik Enkvist. Hakan Ringbom
(ed.) Stockholm: Skriptor and Turku: Abo Akademi.

Figure 1: Plot of average word length versus type-token ratio

Figure 2: Plot of first person pronoun content versus average sentence length

Figure 3: Plot of first two principal components
