
Automated Japanese Essay Scoring System: Jess

Tsunenori ISHIOKA
Research Division, National Center for Univ. Entrance Exam.
Tokyo, Japan 153-8501

Masayuki KAMEDA
Software Research Center, Ricoh Co., Ltd.
Tokyo, Japan 112-0002

Abstract

We have developed an automated Japanese essay scoring system named jess. The system evaluates an essay on three features: (1) Rhetoric: ease of reading, diversity of vocabulary, percentage of "big words" (long, difficult words), and percentage of passive sentences; (2) Organization: characteristics associated with the orderly presentation of ideas, such as rhetorical features and linguistic cues; (3) Contents: vocabulary related to the topic, such as relevant information and precise or specialized vocabulary. The final score is calculated by deducting points from a perfect score, based on a learning process that uses editorials and columns from the Mainichi Daily News newspaper. A diagnosis for the essay is also given. Our system does not need any essays graded by human experts.

1 Introduction

When giving an essay test, the examiner expects a written essay to reflect the writing ability of the examinee. A variety of factors, however, can affect scores in a complicated manner. Most of the factors are present in giving tests, and the human rater, in particular, is a major error factor in the scoring of essays. In fact, there are many other factors that influence the scoring of essay tests, as listed below, and much research has been devoted to them.

- Handwriting skill (handwriting quality, spelling)
- Serial effects of rating (the order in which essay answers are rated)
- Topic selection (how should essays written on different topics be rated?)
- Other error factors (writer's gender, ethnic group, etc.)

In recent years, with the aim of removing these error factors and establishing fairness, considerable research has been performed on computer-based automated essay scoring systems [1, 3, 11, 12]. The most famous of these is probably e-rater [1], developed by the Educational Testing Service (ETS) in the United States and currently managed and extended by ETS Technologies, a subsidiary organization. E-rater is presently being used to score essays in the Graduate Management Admission Test (GMAT), an entrance examination for business graduate schools. E-rater evaluates essays from the following three points of view.

Structure: syntactic variety, i.e., use of diverse structures in arrangement of phrases, clauses, and sentences.

Organization: logical presentation of ideas using rhetorical expressions, logical connectors between clauses and sentences, etc.

Contents: use of vocabulary related to the topic.

The e-rater system features a database of hundreds of essays scored by expert readers. Performing linear regression against those expert scores and computer-based scores makes it possible to determine regression coefficients for multiplying the matrices used in scoring. In Japan, however, there is no collection of such authorized scores, and after careful consideration, it was decided that the same kind of approach would not be practical for implementing a Japanese version of e-rater. It is possible, though, to obtain complete articles from the Mainichi Daily News newspaper up to 2002 from Nichigai Associates, Inc. and complete articles from the Nihon Keizai newspaper up to 2001 from Nikkei Books and Software, Inc. for purposes of linguistic studies. In short, it is relatively easy to collect editorials and columns (e.g., "Yoroku") on some form of electronic media for use as essay models. Furthermore, with regard to morphological analysis, the basis of Japanese natural language processing, a number of free Japanese morphological analyzers are available. These include JUMAN, developed by the Language Media Laboratory of Kyoto University, and ChaSen (http://chasen.aist-nara.ac.jp/; used by the authors

Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA'04)
1529-4188/04 $ 20.00 IEEE
in this study) from the Matsumoto Laboratory of the Nara Institute of Science and Technology. Likewise, for syntactic analysis, there are free resources such as KNP from Kyoto University.

With resources such as these, we can prepare tools for computer processing of the articles and columns that we collect as essay models. In addition, for the scoring of essays, where it is essential to evaluate whether content is suitable, i.e., whether a written essay responds appropriately to the essay prompt, it is becoming possible for us to use semantic search technologies not based on pattern matching as used by search engines on the Web. The methods for implementing such technologies are explained in detail in Ishioka [5] and elsewhere. It is the authors' belief that this approach of learning from published essays and columns as models makes it possible to develop a system essentially the same as e-rater, that is, an automated scoring system for essays written in Japanese, but using technically superior methods.

We have named this automated Japanese essay scoring system "jess." This system evaluates essays based on the three essay features of (1) rhetoric, (2) organization, and (3) contents, which are basically the same as structure, organization, and contents used by e-rater. Jess also allows the user to designate weights (allotted points) for each of these essay features. If the user does not explicitly specify point allotment, default weights are 5, 2, and 3 for rhetoric, organization, and contents, respectively, for a total of 10 points. This default point allotment of 5, 2, and 3, in which "rhetoric" is weighted higher than "organization" and "contents," is based on the work of Watanabe et al. [13]. Users can change the point allotment.

The following sections describe the scoring criteria of jess in detail. Sections 2, 3, and 4 examine rhetoric, organization, and contents, respectively. Section 5 presents an application example and associated operation times.

2 Rhetoric

As metrics to portray rhetoric, jess uses (1) ease of reading, (2) diversity of vocabulary, (3) percentage of big words (long, difficult words), and (4) percentage of passive sentences, in accordance with Maekawa [8] and Nagao [9]. These metrics are broken down further into various statistical quantities in the following sections. The distributions of these statistical quantities were obtained from the editorials and columns stored on the Mainichi Daily News CD-ROMs. Though most of these distributions are asymmetrical (skewed), they are each treated as a distribution of an ideal essay. In the event that a score (obtained statistical quantity) turns out to be an outlier value with respect to such an ideal distribution, that score is judged to be inappropriate for that metric. The points originally allotted to the metric are then reduced and a comment to that effect is output. An outlier is an item of data lying more than 1.5 times the interquartile range outside the quartiles.

In scoring, the relative weights of the broken-down metrics are equivalent with the exception of "diversity of vocabulary," which is given a weight twice that of the others, as the authors consider it an index contributing not only to "rhetoric" but to "contents" as well.

2.1 Ease of reading

The following items are considered as indexes of ease of reading. These indexes differ from the usual measures of reading complexity [4].

1. Median and maximum sentence length:
   It is generally assumed that shorter sentences make for easier reading [7]. Many books on writing in the Japanese language, moreover, state that a sentence should be no longer than 40 or 50 characters. Median and maximum sentence length can therefore be treated as an index. The median is used rather than the average because sentence-length distributions are skewed in most cases. The relative weight used in the evaluation of median and maximum sentence length is equivalent to that of the indexes described below.

2. Median and maximum clause length:
   In addition to periods (.), commas (,) can also contribute to ease of reading. Here, text between commas is called a "clause." The number of characters in a clause is also an evaluation index.

3. Median and maximum number of phrases in clauses:
   A human being cannot understand many things at one time. The limit of human short-term memory is said to be seven things in general, and that is thought to limit the length of clauses. Indeed, on surveying the number of phrases in clauses from editorials in the Mainichi Daily News, the authors found it to have a median value of four, which is highly compatible with the short-term memory maximum of seven things.

4. Kanji/kana ratio:
   To simplify text and make it easier to read, a writer will generally reduce kanji (Chinese characters) intentionally. In fact, an appropriate range
   for the kanji/kana ratio in essays is thought to exist, and this range is taken to be an evaluation index.

5. Number of attributive declined or conjugated words (embedded sentences):
   The declined or conjugated words of attributive modifiers indicate the existence of embedded sentences, and their quantity is thought to affect ease of understanding.

6. Maximum number of consecutive infinitive-form or conjunctive-particle clauses:
   Consecutive infinitive-form or conjunctive-particle clauses, if many, are also thought to affect ease of understanding. Note that it is not the average but the maximum number of consecutive infinitive-form or conjunctive-particle clauses that holds significant meaning as an indicator of the depth of dependency affecting ease of understanding.

2.2 Diversity of vocabulary

Yule [14] used a variety of statistical quantities in his analysis of writing. The most famous of these is an index of vocabulary concentration called the K characteristic value. The value of K is non-negative; it increases as vocabulary becomes more concentrated and decreases as vocabulary becomes more diversified. The median value of K for editorials and columns in the Mainichi Daily News was found to be 87.3 and 101.3, respectively.

2.3 Percentage of big words

It is thought that the use of big words, to whatever extent, cannot help but impress the reader. On investigating big words in Japanese, however, care must be taken since simply measuring the length of a word may lead to erroneous conclusions. While a "big word" in English is usually synonymous with "long word," a word expressed in kanji becomes longer when expressed in kana characters. That is to say, a "small word" in Japanese may become a big word simply due to notation. It is therefore necessary to count the number of characters in a word after converting it to kana characters (i.e., to its "reading") to judge whether that word is big or small.

2.4 Percentage of passive sentences

It is generally felt that text should be written in active voice as much as possible, and that text with many passive sentences is an example of poor writing [7]. It is for this reason that percentage of passive sentences is also used as an index of rhetoric.

3 Organization

Comprehending the flow of a discussion is essential to understanding the connection between various assertions. To help the reader to catch this flow, the frequent use of conjunctive expressions is useful. We therefore attempt to determine the logical structure of a document by detecting the occurrence of conjunctive expressions. A conjunctive relationship can be broadly divided into "forward connection" and "reverse connection." "Forward connection" has a rather broad meaning indicating a general conjunctive structure that leaves discussion flow unchanged. In contrast, "reverse connection" corresponds to a conjunctive relationship that changes the flow of discussion. These logical structures can be classified as follows according to Noya [10]. The "forward connection" structure comes in the following types.

Addition: A conjunctive relationship that adds emphasis. A good example is "in addition," while other examples include "moreover" and "rather." Abbreviation of such words is not infrequent.

Explanation: A conjunctive relationship typified by words and phrases such as "namely," "in short," "in other words," and "in summary."

Demonstration: A structure indicating a reason-consequence relation. Expressions indicating a reason include "because" and "the reason is," and those indicating a consequence include "as a result," "accordingly," "therefore," and "that is why." Conjunctive particles in Japanese like "node" (since) and "kara" (because) also indicate a reason-consequence relation.

Illustration: A conjunctive relationship most typified by the phrase "for example," having a structure that either explains or demonstrates by example.

The "reverse connection" structure comes in the following types.

Transition: A conjunctive relationship indicating a change in emphasis from A to B, expressed by such structures as "A ..., but B..." and "A...; however, B...".

Restriction: A conjunctive relationship indicating a continued emphasis on A. Also referred to as a "proviso" structure, typically expressed by "though in fact" and "but then."

Concession: A type of transition that takes on a conversational structure in the case of concession or compromise. Typical expressions indicating this relationship are "certainly" and "of course."
Contrast: A conjunctive relationship typically expressed by "at the same time," "on the other hand," and "in contrast."

We extracted all (=125) phrases indicating conjunctive relationships from editorials of the Mainichi Daily News, and classified them into the four categories above for forward connection and the four for reverse connection, for a total of eight mutually exclusive categories. In jess, the system attaches labels to conjunctive relationships and tallies them to judge the strength of the discourse in the essay being scored. As in the case of rhetoric, jess learns what an appropriate number of conjunctive relationships should be from editorials of the Mainichi Daily News, and deducts from the initially allotted points in the event of an outlier value in the model distribution.

In the scoring, we also determined whether the pattern in which these conjunctive relationships appeared in the essay was singular compared to that in the model editorials. This was accomplished by considering a trigram model [6] for the appearance patterns of forward and reverse connections. The probability of occurrence of certain {a : forward-connection} and {b : reverse-connection} patterns can be obtained by taking the product of appropriate conditional probabilities. For example, the probability of occurrence p of the pattern {a, b, a, a} turns out to be $0.44 \times 0.52 \times 0.55 \times 0.28 = 0.035$. Furthermore, given that the probability of {a} appearing without prior information is 0.47 and that of {b} appearing without prior information is 0.53, the probability q that a forward connection occurs three times and a reverse connection once under the condition of no prior information would be $0.47^3 \times 0.53 = 0.055$. As this example shows, an occurrence probability that is greater under no prior information (q > p) indicates that the forward-connection and reverse-connection appearance pattern is singular, in which case the points initially allocated to conjunctive relationships in a discussion would be reduced.

4 Contents

A technique called latent semantic indexing (LSI) [2] can be used to check whether the contents of a written essay respond appropriately to the essay prompt. The usefulness of this technique has been stressed at the Text REtrieval Conference (TREC) and elsewhere. Latent semantic indexing begins by performing singular value decomposition on a $t \times d$ term-document matrix $X$ ($t$: number of words; $d$: number of documents) indicating the frequency of words appearing in a sufficiently large number of documents. The process extracts diagonal elements from the singular value matrix up to the $k$th element to form a new matrix $S$. Likewise, it extracts the left and right singular vector matrices up to the $k$th column to form new matrices $T$ and $D$. Matrix $\hat{X}$ can be expressed as follows.

$$\hat{X} = T S D'$$

Here, $\hat{X}$ is an approximation of $X$, with $T$ being a $t \times k$ matrix, $S$ a $k \times k$ square diagonal matrix, and $D$ a $d \times k$ matrix. The $'$ symbol denotes transposition. According to Deerwester [2], a $k$ of from 50 to 100 is sufficient for linguistic data based on empirical results.

Essay $e$ to be scored can be expressed by a $t$-dimension word vector $x_e$ based on morphological analysis, and using this, the $1 \times k$ document vector $d_e$ corresponding to a row in document space $D$ can be derived as follows.

$$d_e = x_e' T S^{-1}$$

Similarly, the $k$-dimension vector $d_q$ corresponding to essay prompt $q$ can be obtained. Similarity between these documents is denoted by $r(d_e, d_q)$, which can be given by the cosine of the angle formed between the two document vectors. Note that the normalization of sizes of two document vectors is not necessary. Theoretically speaking, $r$ can take on negative values, but setting its lower limit to zero appears to be appropriate here.

5 Application Example

An e-rater demonstration can be viewed at http://www.etctechnologies.com/html/eraterdemo.html. In this demonstration, seven response patterns (seven essays) are evaluated. We translated essays A to G on that Web site into Japanese and then scored them using jess, as shown in Table 1.

Table 1: Comparison of scoring results

Essay  e-rater  jess       No. of characters  Time (s)
A      4        6.9 (4.1)    687              1.00
B      3        5.1 (3.0)    431              1.01
C      6        8.3 (5.0)  1,884              1.35
D      2        3.1 (1.9)    297              0.94
E      3        7.9 (4.7)    726              0.99
F      5        8.4 (5.0)  1,478              1.14
G      3        6.0 (3.6)    504              0.95

The second and third columns show e-rater and jess scores, respectively, and the fourth column shows the number of characters in each essay. A perfect score
in jess is 10 with 5 points allocated to rhetoric, 2 to organization, and 3 to contents as standard. For purposes of comparison, the jess score converted to e-rater's 6-point system is shown in parentheses. It can be seen here that essays given good scores by e-rater are also given good scores by jess and that the two sets of scores show good agreement. However, e-rater (and probably human raters) tends to give more points to longer essays despite similar writing formats. It is here that a difference between e-rater and jess, which uses the point-deduction system for scoring, appears. Examining the scores for essay C, for example, we see that e-rater gave a perfect score of 6 while jess gave only a score of 5 after converting to e-rater's 6-point system. In other words, the length of the essay could not compensate for various weak points in the essay under jess's point-deduction system. The fifth column in Table 1 shows jess processing time (CPU time). Further research using 590 essays shows that jess performs at the same level as human experts. The computer used was a PlatHome Standard System 801S using an 800-MHz Intel Pentium III running RedHat 7.2. The jess program is written in C shell script, jgawk, jsed, and C, and comes to just under 10,000 lines. Jess can be executed on the Web at http://zaza.rd.dnc.ac.jp/jess/.

6 Conclusion

An automated Japanese essay scoring system called jess has been created for use in scoring essays in college-entrance exams. This system has been shown to be valid for essays in the range of 800 to 1600 characters. Jess, however, uses editorials and columns taken from the Mainichi Daily News newspaper as learning models, and such models are not sufficient for learning terms used in scientific and technical fields such as computers. It was consequently found that jess could return a low evaluation of contents even for an essay that responds well to the essay prompt. When analyzing contents, a mechanism is needed for automatically selecting a term-document cooccurrence matrix in accordance with the essay targeted for evaluation.

Acknowledgement

The authors would like to extend their deep appreciation to Professor Eiji Muraki, currently of Tohoku University, Graduate School of Educational Informatics, Research Division, who, while resident at Educational Testing Service (ETS), was kind enough to arrange a visit for us during our survey of the e-rater system.

References

[1] J. Burstein, K. Kukich, S. Wolff, C. Lu, M. Chodorow, L. Braden-Harder & M. D. Harris, Automated Scoring Using A Hybrid Feature Identification Technique. In Proceedings of the Annual Meeting of the Association of Computational Linguistics, Montreal, Canada, 1998.

[2] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer & R. Harshman, Indexing by latent semantic analysis. Journal of the American Society for Information Science, Vol. 41, No. 7, pp. 391-407, 1990.

[3] P. W. Foltz, D. Laham & T. K. Landauer, Automated Essay Scoring: Applications to Educational Technology. In Proceedings of EdMedia '99, 1999.

[4] R. Gunning, The Technique of Clear Writing, New York, McGraw Hill, 1968.

[5] T. Ishioka & M. Kameda, Document retrieval based on Words cooccurrences: the algorithm and its applications (in Japanese), Japanese Journal of Applied Statistics, Vol. 28, No. 2, pp. 107-121, 1999.

[6] F. Jelinek, Up from trigrams! The struggle for improved Language models, In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH-91), pp. 1037-1040, 1991.

[7] D. E. Knuth, T. Larrabee & P. M. Roberts, Mathematical Writing, Stanford University Computer Science Department, Report Number: STAN-CS-88-1193, January 1988.

[8] M. Maekawa, Scientific Analysis of Writing (in Japanese), Iwanami Shoten, 1995.

[9] M. Nagao (ed.), Natural Language Processing (in Japanese), The Iwanami Software Science Series 15, Iwanami Shoten, 1996.

[10] S. Noya, Logical Training (in Japanese), Sangyo Tosho, 1997.

[11] E. B. Page, J. P. Poggio & T. Z. Keith, Computer analysis of student essays: Finding trait differences in the student profile. AERA/NCME Symposium on Grading Essays by Computer, 1997.

[12] L. M. Rudner & L. Liang, Automated essay scoring using Bayes' theorem, National Council on Measurement in Education, New Orleans, LA, 2002.

[13] H. Watanabe, Y. Taira & T. Inoue, An Analysis of Essay Examination Data (in Japanese), Research bulletin, Faculty of Education, University of Tokyo, Vol. 28, pp. 143-164, 1988.

[14] G. U. Yule, The Statistical Study of Literary Vocabulary, Cambridge University Press, Cambridge, 1944.
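As a check on the arithmetic in the trigram example of Section 3, the comparison of the two probabilities can be sketched as follows. This is only an illustration: the four conditional probabilities and the two no-prior probabilities are the values quoted in the text, while the function names `pattern_probability` and `is_singular` are hypothetical and not taken from the jess program.

```python
from math import prod


def pattern_probability(factors):
    """Multiply per-step probabilities into one sequence probability."""
    return prod(factors)


def is_singular(p_model, q_no_prior):
    """Flag a pattern whose no-prior probability exceeds its trigram probability."""
    return q_no_prior > p_model


# Conditional probabilities quoted in Section 3 for the pattern {a, b, a, a}.
p = pattern_probability([0.44, 0.52, 0.55, 0.28])

# Probabilities without prior information: P(a) = 0.47, P(b) = 0.53.
no_prior = {"a": 0.47, "b": 0.53}
q = pattern_probability(no_prior[label] for label in ["a", "b", "a", "a"])

print(f"p = {p:.3f}, q = {q:.3f}, singular = {is_singular(p, q)}")
# prints: p = 0.035, q = 0.055, singular = True
```

Because q exceeds p here, the pattern {a, b, a, a} would be treated as singular and points for conjunctive relationships would be reduced, as the text describes.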

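The K characteristic value of Section 2.2 is not written out in the text. A minimal sketch, assuming the commonly cited form of Yule's K, $K = 10^4 (\sum_m m^2 V_m - N) / N^2$, where $N$ is the number of word tokens and $V_m$ the number of distinct words occurring exactly $m$ times, is given below. The function name `yule_k` is hypothetical, and the input is assumed to be a list of words already segmented by a morphological analyzer; whether jess uses exactly this form is not stated in the paper.

```python
from collections import Counter


def yule_k(tokens):
    """Yule's K for a list of word tokens (assumed already segmented)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("yule_k needs at least one token")
    word_freq = Counter(tokens)              # word -> number of occurrences
    spectrum = Counter(word_freq.values())   # m -> number of words seen m times
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


# Fully diverse vocabulary gives K = 0; repetition raises K.
diverse = yule_k(["a", "b", "c", "d"])
mixed = yule_k(["a", "a", "b", "b"])
concentrated = yule_k(["a", "a", "a", "a"])
```

The three toy inputs illustrate the behaviour stated in the text: K is non-negative and grows as the vocabulary becomes more concentrated.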

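The contents check of Section 4 can be sketched in a few lines once the truncated matrices are available. The sketch below assumes $T$ and the diagonal of $S$ have already been obtained by singular value decomposition elsewhere; the tiny 3-word, 2-dimension values and the function names `fold_in` and `similarity` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def fold_in(x, T, s):
    """Compute d = x' T S^{-1} for a word vector x (length t).

    T is a t-by-k matrix given as a list of rows; s holds the k singular values.
    """
    return [
        sum(x[i] * T[i][j] for i in range(len(x))) / s[j]
        for j in range(len(s))
    ]


def similarity(d_e, d_q):
    """Cosine between two document vectors, with the lower limit set to zero."""
    dot = sum(a * b for a, b in zip(d_e, d_q))
    norm = sqrt(sum(a * a for a in d_e)) * sqrt(sum(b * b for b in d_q))
    if norm == 0.0:
        return 0.0
    return max(0.0, dot / norm)


# Hypothetical truncated matrices (t = 3 words, k = 2 dimensions).
T = [[1.0, 0.0], [0.0, 0.6], [0.0, 0.8]]
s = [2.0, 1.0]

d_e = fold_in([2, 0, 0], T, s)       # essay word counts
d_q = fold_in([4, 0, 0], T, s)       # prompt word counts, same direction
d_other = fold_in([0, 1, 0], T, s)   # unrelated vocabulary

r_same = similarity(d_e, d_q)
r_other = similarity(d_e, d_other)
```

Since the cosine depends only on direction, the essay and prompt vectors above differ in length yet are maximally similar, which is why the text notes that normalizing the sizes of the two document vectors is unnecessary.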