
Native- and nonnative-speaking EFL teachers' evaluation of Chinese students' English writing

Ling Shi  University of British Columbia

This study examined differences between native and nonnative EFL (English as
a Foreign Language) teachers' ratings of the English writing of Chinese university
students. I explored whether two groups of teachers, expatriates who typically
speak English as their first language and ethnic Chinese with proficiency in
English, gave similar scores to the same writing task and used the same criteria
in their judgements. Forty-six teachers, 23 Chinese and 23 English-background,
rated 10 expository essays using a 10-point scale, then wrote and ranked three
reasons for their ratings. I coded their reported reasons as positive or negative
criteria under five major categories: general, content, organization, language and
length. MANOVA showed no significant differences between the two groups in
their scores for the 10 essays. Chi-square tests, however, showed that the English-background teachers attended more positively in their criteria to the content and
language, whereas the Chinese teachers attended more negatively to the organization and length of the essays. The Chinese teachers were also more concerned
with content and organization in their first criteria, whereas English-background
teachers focused more on language in their third criteria. The results raise questions about the validity of holistic ratings as well as the underlying differences
between native and nonnative EFL teachers in their instructional goals for second
language (L2) writing.

I Introduction
The fairness and construct validity of the ratings of different groups
of raters using the same scale in assessing ESL/EFL (English as a
Second/Foreign Language) writing is a major concern in second
language (L2) writing instruction and testing (Hamp-Lyons, 1991a;
Connor-Linton, 1995a; Silva, 1997). Exploring whether L2 writers'
texts meet the expectations of native English readers, a growing number of studies have investigated whether native English speakers
(NES) and nonnative English speakers (NNS) share the same judgement of the students' writing (James, 1977; Hughes and Lascaratou,
1982; Machi, 1988; Santos, 1988; Kobayashi, 1992; Hinkel, 1994;
Address for correspondence: Ling Shi, Department of Language and Literacy Education, 2034
Lower Mall Road, UBC, Vancouver, BC, Canada V6T 1Z2; email: Ling.Shi@ubc.ca
Language Testing 2001 18 (3) 303-325

0265-5322(01)LT206OA © 2001 Arnold



Connor-Linton, 1995b; Kobayashi and Rinnert, 1996; Zhang, 1999;
Hamp-Lyons and Zhang, 2001). Most studies on the evaluation of L2
writing have focused on either heterogeneous ESL students in British
(Hamp-Lyons, 1989) and North American universities (Santos, 1988;
Cumming, 1990; Brown, 1991; Vaughan, 1991; Hinkel, 1994) or a
homogeneous EFL group, such as Arab university students (Khalil,
1985), Greek high school students (Hughes and Lascaratou, 1982),
Israeli high school students (Shohamy et al., 1992), Chinese university students (Zhang, 1999; Hamp-Lyons and Zhang, 2001) or
Japanese adult students (Machi, 1988; Kobayashi, 1992; Connor-Linton, 1995b; Kobayashi and Rinnert, 1996). Following in this tradition, the present study investigates differences between NES and
NNS EFL teachers in their judgments of the English writing of Chinese
university English majors. It aims not only to verify previous findings
but also to explore the issue of cooperation between NNS and NES
EFL teachers in countries like China where writing programs are
commonly taught jointly by both groups of teachers.
Previous research suggests that we know little about whether the
NES and NNS raters/teachers give similar holistic ratings and qualitative evaluations to the same ESL/EFL essays. In a study comparing
the evaluative criteria of 26 American ESL and 29 Japanese EFL
instructors in their ratings of 10 compositions written by adult L1
Japanese EFL students, Connor-Linton (1995b) asked half of the
raters from each group to rate the compositions holistically and the
other half to rate the compositions with an analytical evaluation
profile and then to state three reasons for their ratings. The results
suggested similarities in ratings between the two groups, although
NES teachers tended to focus on intersentential discourse features
compared to NNS teachers' focus on accuracy in their qualitative
judgments (Connor-Linton, 1995b).
Compared with Connor-Linton (1995b), other researchers have
used analytical scoring to investigate the extent to which NES and
NNS teachers/raters attend to the same features and value the same
qualities in student writing. In these studies, NES teachers were found to
favour American English rhetorical patterns (Kobayashi and Rinnert,
1996) and clarity of meaning and organization (Kobayashi, 1992).
They also differed systematically from NNS raters in their judgments
in terms of students' paragraph structuring and political/social stance
(Hamp-Lyons and Zhang, 2001), in terms of purpose and audience,
specificity, clarity and adequate support (Hinkel, 1994), and in terms
of overall organization, supporting evidence, use of conjunctions,
register, objectivity and persuasiveness (Zhang, 1999). Furthermore,
differences were found in the raters' judgments of error gravity. In
general, NES teachers were found to be more tolerant of students'



errors than NNS teachers (James, 1977; Hughes and Lascaratou, 1982;
Santos, 1988). Kobayashi (1992), however, observed that NNS instructors, compared with NESs, were more likely to accept grammatically correct but awkward sentences. These findings were all based on pre-determined categorical evaluations which might have restricted or mandated
teachers'/raters' judgments. Some studies, for example, used decontextualized
or edited student writing to direct the raters' attention (Santos, 1988;
Hinkel, 1994; Kobayashi and Rinnert, 1996). Research is therefore
needed, using authentic writing samples and no predetermined evaluation criteria, to verify accurately whether NES and NNS teachers
score L2 essays for different reasons. In addition, as these studies
indicating differences in teachers'/raters' qualitative evaluations did
not compare the analytic judgments with holistic scoring, I thought
it would be worth trying to verify the findings using both qualitative
and quantitative judgments.
Set in the context of Mainland China, the present study parallels
Connor-Linton's (1995b) study to compare holistic scores and self-reported reasons from teachers with different ethnolinguistic backgrounds. I planned the study with a two-fold purpose, viz. to verify
whether NES and NNS teacher raters differ in (1) their holistic ratings
and (2) their analytical reasons or qualitative judgments in evaluating
EFL students' writing.
II Method
1 Written samples
Ten written samples were selected randomly in the fall of 1998 from
writings of third-year students in the English department of a university in Mainland China. From a pool of 86 in-class written assignments gathered in no particular order, every eighth essay was selected
for the study. Students were asked to write, within a 50-minute session (a common length of a lesson period in Chinese universities), a
250-word expository essay on the topic of TV and newspapers. The
decision on the 250-word length was made based on the suggestion
of the three class teachers who administered the task, in an attempt
to match the present task with other existing writing tasks in the program. Most students, however, produced much longer essays. The 10
essays selected averaged 292 words. (For Essay 1 as an example of
essays in the middle range of scores, see Appendix 1.) Apart from
length, the writing prompt was also a result of negotiation with the
class teachers to incorporate the task into their teaching routines. Here
is the prompt that was used in this study:
Nowadays, with the popularity of television, people gain daily news more conveniently. Some people even begin to play down the advantage of newspapers,



arguing that it is time that they were replaced by television. To what extent
do you agree or disagree with this statement? Give support for your argument.

2 Teacher raters
The 10 written samples were first sent, together with a letter of invitation to participate in the study and a background questionnaire, to
70 English expatriates teaching in various tertiary institutions in all
parts of China. At that time, all of them had been teaching in China for a
minimum of six months and a maximum of about a year. The name
list was provided by Amity Foundation, a Christian organization that
helps Chinese universities to recruit native English teachers. Twenty-four of the NES EFL teachers responded to the letter and sent back
the completed questionnaires and their quantitative and qualitative
evaluations of the essays. I then invited 24 EFL NNS teachers, who
were either working or taking an in-service teacher training program
in the university where data were collected, to evaluate the same set
of essays. In each of the groups, there was one rater who missed
ratings for several of the essays, which made for an equal number of
23 raters in each group for the study.
The information summarized from the questionnaire suggested that
the participating raters were mostly experienced teachers with some
professional training. At the time of data collection, the 46 raters, as
they reported in the questionnaires, were teaching at 23 tertiary institutes in 12 cities of China. Of the 23 NES teachers, 14 were US
citizens, 5 British, 2 Canadian, and 2 Norwegians. (The two Norwegians are considered NESs in this study because they were expatriates
who had learnt and used English since elementary school. With
emphasis on multilingualism in the school system and English being
the second most important language, many educated Norwegians are
bilinguals.) Summarizing the rater profiles, Table 1 shows that more
than half (frequency of 31, 67%) of the raters were female and 15
(33%) were male. In terms of educational background, about half
(frequency of 23, 50%) had a master's degree and 17 (37%) had an
undergraduate degree. Most raters (frequency of 34, 74%) said they
had teacher training, whereas 11 (8 NNSs and 3 NESs) reported that
they had no such training. Of the 46 raters, 18 (39.1%) had taught
English for 1 to 5 years, 11 (23.9%) for 6 to 10 years, and 17 (37%)
for over 10 years. In general, the NES group appeared to be more educated than the NNSs (4 NES teachers had a PhD whereas 1 NNS had
only a high school diploma), although the latter seemed to have more
teaching experience than the former (12 NNSs had taught for over
10 years compared with only 5 NESs who had similar experience).



Table 1  Rater profiles

Variable             Category         Frequencies                 Percentage   Missing
                                      English  Chinese  Total                  cases
Gender               Male                6        9       15        32.6
                     Female             17       14       31        67.4
                     Total              23       23       46       100.0
Degrees              High School         -        1        1         2.2
                     Graduate            7       10       17        37.0
                     Master             12       11       23        50.0
                     PhD                 4        -        4         8.7
                     Total              23       22       45        97.9          1
Teacher training     Yes                20       14       34        73.9
                     No                  3        8       11        23.9
                     Total              23       22       45        97.8          1
Years of teaching    1-5 years          10        8       18        39.1
English              6-10 years          8        3       11        23.9
                     Over 10 years       5       12       17        37.0
                     Total              23       23       46       100.0

3 Procedures
I asked the participating teachers to rate the 10 compositions holistically using a 10-point scale. The teachers were told that the purpose
of the research was to compare how teachers evaluated English writing of Chinese university students. Following previous research
(Cumming, 1990; Connor-Linton, 1995b; Kobayashi and Rinnert,
1996), no criteria or analytical categories for the rating scale were
provided with the purpose of finding out how these teachers defined
the criteria themselves. In addition, to find out whether these teachers
made different qualitative judgments, the teachers were asked to state
three reasons or comments, in the order of importance, for their ratings of each essay. (For evaluation instructions, see Appendix 2.)
With no pre-determined criteria, the teachers were expected to make
and weight qualitative judgments based on their own beliefs about how
salient certain writing elements were in L2 writing evaluation.
Finally, the raters completed a questionnaire about their teaching
experiences and educational backgrounds. Although similar in its
basic design to Connor-Linton's (1995b) study, asking two groups of
teachers to rate 10 essays and state three reasons (for justifications
of these methodological choices, see Connor-Linton, 1995b: 102), the
present study, apart from subject identities and context, differed from
the previous study in that all the teachers rated the essays holistically
on a 10-point scale, rather than either on a 4-point scale or with the
analytical evaluation profile that Connor-Linton (1995b) used. In addition, the
teachers in the present study were asked to rank order their reasons
rather than listing them randomly.
A coding scheme was developed for analysing the teachers' comments. Previous researchers have used different methods to categorize
verbal protocols or self-reports of raters (Hamp-Lyons, 1989; 1991b;
Cumming, 1990; Vaughan, 1991; Connor-Linton, 1995b; Zhang,
1999). The present study used key-word analysis based on an initial
observation of the comments, which typically contained short phrases
consisting of an adjective and a content word, such as 'good idea' or
'poor argument', indicating the rater's positive or negative attitude to
a chosen feature. First of all, one researcher, a doctoral student
studying in the area of L2 writing, went through the whole data set and
identified key words, or words that had been repeatedly used in these
teachers' comments, such as 'thesis', 'argument', 'ideas', 'content',
'logic', 'organization', 'grammar', 'vocabulary', 'paragraph', 'clarity'
and 'support', as well as words indicating whether the comments were
positive or negative, such as 'good', 'poor', 'clear', 'unclear', 'balanced', 'unbalanced', etc. Based on the key words, five major categories
were generated in terms of general comments on the essay as a whole,
and specific comments addressing the content, organization, language
or length of the essays. Then the comments on content, organization
and language were further divided into 10 subcategories:
- comments on content, further distinguished as to whether they
were (1) general comments or specific comments on the quality
of (2) ideas or (3) arguments;
- comments on organization, as to whether they were (4) general
comments or specific comments focusing on (5) paragraph organization or (6) transitions, including coherence or cohesion;
- comments on language, as to whether they were (7) general comments or specific comments on (8) intelligibility, (9) accuracy or
(10) fluency of language use.
Before coding the data, the doctoral student and I each independently
coded comments from 10 teachers (5 NES and 5 NNS) and reached
an agreement of 95%. As a result, a total of 1299 reasons (a few
teachers gave fewer than three reasons for some essays) were identified as positive or negative comments under the 12 categories. (For
definitions and examples of each category, see Appendix 3.)
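The inter-coder check described here amounts to simple percent agreement between two coders' labels. A minimal sketch of that computation, using invented labels rather than the study's actual data:

```python
def percent_agreement(coder_a, coder_b):
    """Proportion of comments to which two coders assigned the same label."""
    assert len(coder_a) == len(coder_b)
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Hypothetical labels for five comments: category plus +/- polarity
a = ["content+", "language-", "organization-", "content+", "length-"]
b = ["content+", "language-", "organization-", "content-", "length-"]
print(percent_agreement(a, b))  # -> 0.8
```

Percent agreement is the simplest index of inter-coder reliability; it does not correct for chance agreement the way kappa-type statistics do.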
Three types of statistical analysis were used to identify group differences in the raters' evaluation of the 10 essays. First, reliability
tests were run to check the intra-group reliability of NES and NNS
teachers. Then MANOVA assessed differences between the two



groups in their holistic scoring of the 10 essays. Finally, chi-square
values on the frequencies of the comments were computed to explore
the differences between the two groups, first in terms of whether the
12 categorical comments were positive or negative, and then in terms
of the ordering of the three comments for each essay.
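As a rough illustration of the chi-square step, the Pearson statistic for a contingency table of comment frequencies can be computed as follows. The counts below are invented for illustration, not the study's data:

```python
def chi_square(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: NES vs. NNS comments across three categories
table = [[40, 25, 15],   # NES
         [20, 35, 25]]   # NNS
print(round(chi_square(table), 2))  # -> 10.83, with d.f. = (2-1)*(3-1) = 2
```

The statistic is then compared against the chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom to obtain the p-value.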
III Findings and discussion
1 Reliability
Consistency of the ratings of each group, NNS and NES teachers,
was estimated by Cronbach's coefficient alpha, which reflects the
level of agreement of each group as a whole during the rating process
applied to the package of 10 essays. The ratings of the NES group
showed a greater reliability (coefficient = .88) than the NNS group
(coefficient = .71), indicating that the NES group was more consistent than the NNS group in their assessment of the 10 essays during
the rating process.
The reliability achieved by the NNS teachers in the present study
would be more or less predictable in the given context. Similar levels
of reliability for both the NES and NNS raters (.75) were also
reported by Connor-Linton (1995b). However, the much higher
reliability of the NES teachers in the present study was unusual, since
such high reliability is usually achieved only through training. Nevertheless, Shohamy et al. (1992) also observed high reliability coefficients in the range of .80 to .90 from untrained NES raters with the
aid of indicators for a rating scale.
2 Comparison of scores
Table 2 summarizes and compares the means and standard deviations
of the ratings of the two groups of teachers for the 10 students essays.
MANOVA indicated no significant differences between the two
groups on the ratings (F = 1.940, p > 0.05). Similarities between the
two groups were also shown in the positive rank orderings of the
average scores of the 10 essays (Kendall correlation coefficient = .73,
2-tailed, p < 0.01). The mean scores of the two groups indicate
that the 10 essays ranged from a low of 5 to a high of 8 on the 10-point scale, with slightly lower scores from the NES teachers (Group
M of 5.98 vs. 6.19). Both groups of teachers agreed that Essay 9 was
the best and that Essay 5 was the second best. The rest of the essays
showed a difference of one to two ranks between the NES and NNS
teachers. It is also interesting to note a smaller group mean of standard
deviations for the NNS group compared with that for the NES group



Table 2  Comparisons of NES and NNS teachers' holistic scores on the 10 essays

           NES (n = 23)      NNS (n = 23)      Ranking of essays
Essay      M       sd        M       sd        NES     NNS
1          6.69    2.27      6.83    1.40       6       7
2          6.29    1.20      7.14    1.14       7       5
3          6.96    1.52      7.35    1.17       5       3
4          5.14    1.77      6.07    1.54      10       9
5          7.50    1.20      7.65    1.15       2       2
6          5.47    1.86      5.35    0.92       8      10
7          7.21    1.99      7.31    1.78       3       4
8          7.06    1.51      6.91    1.01       4       6
9          8.14    1.25      7.87    1.04       1       1
10         5.64    2.20      6.59    1.47       9       8
Group M    5.98    1.68      6.19    1.26

(group mean SD of 1.26 vs. 1.68). This indicates that the scores
given by the NNS teachers were more homogeneous or less spread
out than those given by the NES teachers, although individual members within the NNS group were less consistent, as suggested by the
lower reliability of the group reported earlier. In contrast, the NES
group, as the present data suggest, while being able to maintain a
high reliability, also used a wider range of average scores than the
NNSs (from 5.35 to 7.87 for NNSs and from 5.14 to 8.14 for NESs).
This tendency towards greater variation among NES raters has also
been documented by previous researchers (e.g., Hamp-Lyons, 1989;
Vaughan, 1991; Zhang, 1999). Together with previous studies, the
present study suggests that NES raters were perhaps either more
willing to take risks in awarding scores at the endpoints of the scale
or better able to make finer distinctions among levels of ability
in rating ESL/EFL essays than their NNS colleagues.
3 Comparison of qualitative judgements: raters' positive and negative comments
Figure 1 compares the raters' self-reported positive and negative comments, grouped into 12 categories. The overall statistically significant
chi-square values of the frequencies of the 12 types of positive and
negative comments indicate that NES and NNS teachers were systematically influenced by different qualitative judgments in their ratings (chi-square = 54.03, 51.01; d.f. = 11; p < 0.001). In line with
previous research (Brown, 1991; Connor-Linton, 1995b), the present
study implies that these two groups of teacher raters arrived at their



[Figure 1: bar chart comparing the frequencies (0-140) of NES and NNS teachers' positive and negative comments across the 12 evaluation categories: general; content (general, ideas, arguments); organization (general, paragraphs, transitions); language (general, intelligibility, accuracy, fluency); and length.]

Figure 1  NES and NNS teachers' positive and negative comments

Notes: *(+/-) indicates a chi-square value showing a significant difference at the .05 level
(** at the .01 level) between the NES and NNS teachers in their positive/negative comments.

holistic ratings based on somewhat different criteria or qualitative judgments.
A preliminary perspective on the trends from Figure 1 is that content, particularly the argument of the essay, appears to be identified
more often as a positive feature by both NES and NNS raters, whereas
language, particularly its intelligibility, shows the reverse pattern, with
both groups identifying it more as a negative feature in these students'
writing. This suggests a biased attitude among these EFL teachers,
the NESs even more so as they gave significantly more positive comments on general content quality than the NNSs (chi-square = 14.44,
d.f. = 1, p < 0.01). Apart from an unbalanced perception of the writing performance of these L2 students, this tendency implies a distinction, implicitly made by these teacher raters, between the writing skill
of content presentation and language proficiency in L2 writing.
According to Cumming (1990), both experienced and inexperienced
ESL teachers in his study distinguished students' writing expertise
and L2 proficiency while evaluating students' essays. The present
finding suggests that these participating teacher raters seemed to make
this distinction implicitly in their ratings.
Despite the general tendency of both NES and NNS teachers in
the present study to be positive on content but negative on language
qualities of students' writing, NES teachers were generally more positive than the NNS teachers in their qualitative evaluations (frequency
of 365 vs. 280 positive comments). While the NES teachers tended
to give lower marks than NNS teachers (Group M of 5.98 vs. 6.19),
they nevertheless gave more positive qualitative reasons or comments
than did the NNS teachers. Kobayashi (1992) also reported that NES



raters were more positive about certain qualities of students' writing.
One possible reason for NES raters to be stricter raters but give more
positive comments was perhaps best explained by an NES rater in
Zhang's (1999: 250) study, who reported that she used different standards in assessing EFL students' essays while playing the double role
of a strict native speaker and a lenient EFL teacher. The mismatch
between the NESs' positive comments and lower scores suggests not
only a gap between teaching criteria and testing standards, but also
a concern about whether NES teachers should be advised or trained to
play consistently the role of a stricter rater, or whether NESs other
than NES teachers should be used for testing purposes.
The chi-square values indicate divergent tendencies between the
two groups of teachers in the frequencies of their positive and negative comments in terms of the 12 evaluation categories. As Figure 1
shows, apart from general content quality, the NESs gave significantly
more positive comments on the intelligibility of language (chi-square = 15.81,
d.f. = 1, p < 0.01). In addition, they also gave more comments,
both positive and negative, on language accuracy than did the NNS
teachers (chi-square = 13.50, 11.77; d.f. = 1; p < 0.01). Previous
researchers (such as Hughes and Lascaratou, 1982; Kobayashi, 1992;
Connor-Linton, 1995b) have also reported that NNS raters/teachers
appeared to be less attentive than the NES raters/teachers to the language quality of students' essays. In line with previous findings, the
present study confirms Kobayashi's (1992) observation of a lack of
confidence among NNS raters. Put differently, being nonnative speakers,
the Chinese teachers might shy away from making qualitative judgments or comments about the English language of their students. It
is interesting to note that these differences in the qualitative comments
were not reflected in the holistic ratings, which showed no significant
group differences on the 10 essays. As Connor-Linton (1995a: 763)
put it, holistic rating scales risk forcing 'potentially multidimensional
rater responses into a single dimension of variation'. The discrepancies between the NES and NNS teachers found in the present study
confirm the disadvantage of holistic rating as being unable to distinguish
accurately various characteristics of students' writing.
Compared with the higher frequencies of positive comments
from the NES teachers, significantly higher frequencies of negative comments were produced by the NNS teachers in terms of general comments on the whole essay (chi-square = 8.07, d.f. = 1, p < 0.05),
comments on general organization (chi-square = 8.76, d.f. = 1, p <
0.05) and the length of the essays (chi-square = 12.52, d.f. = 1, p <
0.01). These differences suggest an underlying disagreement between
the NES and NNS teachers on how students' writing should be evaluated. The NNS teachers in the present study appeared more critical



of the general quality and the structure of these students' essays than
were the NES raters. As one Chinese rater in Hamp-Lyons and
Zhang's (2001) study explained, this tendency could result from an
awareness of a heavy influence of L1 in their students' writing. In
contrast to the present finding, previous studies based on pre-determined criteria or edited student essays have reported negative cultural biases of many NES raters towards the rhetorical writing styles
of NNS students (Land and Whitley, 1989; Basham and Kwachka,
1991; Hinkel, 1994; Tedick and Mathison, 1995; Kobayashi and Rinnert, 1996; Silva, 1997; Zhang, 1999; Hamp-Lyons and Zhang, 2001).
Following the assumption that NES raters might be positively affected
by their exposure to the particular culture (Hamp-Lyons, 1989), the
present study suggests that the participating NES raters may have
been influenced by their experiences in China. Unlike the NESs in
most previous studies, the NES instructors in the present study had
all been teaching in Mainland China for at least six months, a period
which might be long enough for them to become familiar with and, consequently, tolerant of the Chinese rhetorical style. In sum, the present
finding implies that the issue raised by previous researchers of the
biased attitudes of NESs towards non-English rhetorical styles might
not be consonant with the evaluations of EFL students' writing at the
proficiency level examined in the present study.
4 Comparison of qualitative judgments: self-reported importance of raters' comments
A slightly different picture emerges in Figures 2 to 4 documenting
the differences in the qualitative comments between the two groups
of teachers in terms of the ordering of their three comments for each

[Figure 2: bar chart of the frequencies of NES and NNS teachers' first-ranked comments across the 12 evaluation categories.]

Figure 2  First comments

Notes: * indicates a chi-square value significant at the 0.05 level. Overall chi-square
= 51.14; d.f. = 11; p < 0.001.



[Figure 3: bar chart of the frequencies of NES and NNS teachers' second-ranked comments across the 12 evaluation categories.]

Figure 3  Second comments

Notes: * indicates a chi-square value significant at the 0.05 level. Overall chi-square
= 19.99; d.f. = 11; p < 0.001.

[Figure 4: bar chart of the frequencies of NES and NNS teachers' third-ranked comments across the 12 evaluation categories.]

Figure 4  Third comments

Notes: * indicates a chi-square value significant at the 0.05 level. Overall chi-square
= 58.42; d.f. = 11; p < 0.001.

of the 10 essays. Since the teachers were asked to record their three
reasons or comments in order of importance, I expected a comparison
of the ranking of the comments to reveal how these teachers weighted
various writing elements in their evaluations. An initial inspection of
the data suggests that in their rank ordering of the 12 categorical
comments, the two groups of teachers varied from a primary focus on
the quality of content and organization to a lesser focus on language.
As Figure 2 illustrates, in the comments they ranked as first in
importance, both groups of teachers focused primarily on the content,
particularly the ideas and arguments of the essays, although the NNS
teachers were found to attend more than the NES teachers to ideas
(chi-square = 5.15, 4.19; d.f. = 1; p < 0.05). There was also a fair
amount of attention among both groups of teachers to the general
organization of the essays in their most important comments, although



the NNS teachers attended more to such comments than did the NES
teachers (chi-square = 4.19, d.f. = 1, p < 0.05). Figure 3 shows a
change of tendencies in the comments ranked as second in importance. While both groups remained focused on the arguments of the
essays, the teachers also turned their attention to language intelligibility. With the comments ranked third in importance, as Figure 4
shows, the majority of comments from these teacher raters remained
focused on language intelligibility, with markedly more such attention from the NES teachers (chi-square = 17.02, d.f. = 1, p < 0.01).
This shifting of focus suggests that these teachers consciously placed
different weightings on various components. Recall that these comments were primarily positive on content but negative on language,
so the shift of focus from content to language suggests that these
participating teachers also tended to focus on the positive elements
of the essays before moving to the negative ones. As these raters were
teachers who might have emphasized positive feedback because they
believed in encouraging students by finding their strengths, this finding needs to be verified in future research to determine whether the
movement from positive to negative evaluation is typical of EFL teachers or raters in general.
Despite the general tendencies of the two groups of teachers in the
ordering of the three comments, chi-square tests showed a significant
overall difference between the two groups in the frequencies of the
12 categorical evaluations mentioned as the first, second and third
most important comments (chi-square = 51.14, 19.99, 58.42; d.f. =
11, p 0.001). This further suggests that certain elements were more
or less salient to the two groups of teachers. Aside from the differences mentioned above (that the NNS teachers were more concerned with ideas and general organization in their most important comments and the NES teachers more with language intelligibility in their third-ranked comments), the NES teachers were also found to focus significantly more than the NNS raters on the intelligibility and accuracy of language use in their most important comments (chi-square = 4.00, 16.33; d.f. = 1, p < 0.05), again on accuracy in their second-ranked comments (chi-square = 8.17, d.f. = 1, p < 0.05) and then on ideas in their third-ranked comments (chi-square = 18.18, d.f. = 1, p < 0.01). These findings
suggest that the NESs not only were more aware of the language ability of these EFL students than their NNS colleagues, but also saw the quality of intelligibility as more salient than accuracy in evaluation. In contrast, NNS raters were found to focus significantly more than the NES raters on general organization in their second-ranked comments (chi-square = 4.33, d.f. = 1, p < 0.05), and then on the length of the essays in their third comments (chi-square = 18.67, d.f. = 1, p < 0.01). As I reported earlier, these comments from the NNS teachers were

316 Evaluation of Chinese students English writing


mostly negative. They seemed to emphasize the weakness of organization more than the problem of length in students' writing. If the rankings of these reasons or comments reflected the general characteristics of the writing samples that these teacher raters chose to attend to, the present finding suggests that NES and NNS teachers differ in their beliefs about, and weightings of, these writing elements in L2 writing evaluation. This indicates that the process of L2 writing evaluation is as important as, and perhaps more important than, the results of scoring in identifying the differences between NES and NNS teacher raters.
IV Conclusions
The present study investigated whether the evaluation standards for written compositions differ between NES and NNS English teachers in an EFL setting such as China. I examined not only how NES and NNS teachers holistically evaluated students' writing, but also how the two groups of teachers consciously attended to and weighted various writing features. Results indicate that NES and NNS teachers gave similar scores to EFL students' writing, as no significant differences were found in the scoring of the 10 essays. However, the two groups of teachers differed markedly in the frequencies of certain types of comments or criteria they mentioned to justify their ratings. The analysis of the qualitative comments suggests that NES teachers attended more positively to content and language, whereas NNS teachers attended more negatively to organization and length, although both groups appeared more positive towards the content but negative towards the intelligibility of the language of the essays. Furthermore, while both groups seemed to move their attention from content to language in rank ordering their comments for each essay in terms of importance, the NNSs appeared more concerned with ideas and general organization in their first comments and the NES teachers focused more on intelligibility in their third comments.
The above findings raise at least two important issues for language testing. The first issue concerns the extent to which holistic ratings reflect analytical or qualitative judgments. As the present findings suggest, although the writing of EFL students might be rated equally by holistic scoring, it might be evaluated from different perspectives by NES and NNS teachers. Following Hamp-Lyons (1995: 760), who noted that the writing of second language English users is 'particularly likely to show varied performance on different traits', the present study suggests that holistic scoring was not effective in distinguishing subtle differences in students' writing performances, nor was it an effective way to detect differences between NES and NNS

Ling Shi 317


teacher raters. Analytic ratings or qualitative evaluations, as the present study implies, might be preferable to holistic scoring for assessing accurately the quality of L2 writing for purposes such as research, high-stakes testing or diagnostic assessment, where the quality of information from evaluation is important.
Related to the first issue is the second issue of the relationship between reliability and validity in language testing. Both the NES and NNS teachers in the present study achieved relatively high reliability in their ratings of the 10 essays, with the former obtaining an even higher reliability. However, the fact that these raters differed in their qualitative comments, reflecting differing understandings of what constitutes good writing, raises questions about the construct validity of the inferences that might be made solely on the basis of the holistic ratings. For example, Essay 1 was awarded 7 points on the scale by both a NNS and a NES, but for different reasons. In recording their most important reasons for giving this mark, the NNS commented positively on its 'well presented ideas' whereas the NES commented negatively on the 'many simple grammar mistakes'. This raises questions about whether these teachers even agreed on the terms or the construct definition they used. For example, the concept 'argument' seems to mean different things to different raters. The same argument in Essay 1, for instance, was rated as low as 1 on the scale by a NES teacher because the writer 'tried to argue on both sides', whereas it was awarded as high as 8.5 by a NNS teacher because of its 'good argument'. With these disagreements among teachers, there is no doubt that students would be confused if they were presented with these comments. There is a great need for research to pursue construct validity in testing by comparing differences not only among teachers/raters but also between teachers/raters and students. The present study suggests the insufficiency of reliability alone and the importance of increasing validity in L2 writing assessment. Training is highly recommended for raters to reach consensus on the criteria before evaluation.
The process of rating becomes more complicated if one takes into consideration other related issues, such as tolerance brought about by familiarity (e.g., Santos, 1988) or washback effects of tests on instruction (e.g., Wall and Alderson, 1993). These issues lead to pedagogical implications of the present study. For example, there is a real concern about how EFL students can acquire English academic rhetorical patterns if their NES teachers, as the present study suggests, become tolerant of their rhetorical problems after being exposed to their culture. Furthermore, if we assume a correlation between how teachers evaluate students' essays and teachers' instructional goals, the present findings call for further inquiry to verify whether there is a divergence of instructional emphasis underlying these teachers' classroom teaching of EFL writing. As the NES teachers in the present study emphasized the development of ideas and the intelligibility of language use, while the NNS teachers emphasized the general organization and length of essays, they may similarly have divergent instructional focuses in their writing classes. Researchers have indicated that teaching emphases can differ when writing is taught in places where teachers are predominantly either NES or NNS (Mohan and Lo, 1985). The present study suggests that such differences could also exist in EFL programs that are taught either separately or jointly by NES and NNS teachers.
Further research is also needed to investigate whether EFL students in China are attending to different instructions from NES and NNS teachers, as this is important information for curriculum designers and policy makers. In an exam-driven school system, Chinese students are continually assessed on their writing, the results of which influence decisions that affect their future. Chinese students need to pass exams rated by Chinese EFL teachers in order to graduate from universities and, at the same time, to pass tests like the TWE (Test of Written English) rated by NESs if they plan to study abroad. Many Chinese universities, aware of these students' needs, currently have their English writing programs jointly taught by NES and local Chinese teachers. This reflects the general concern about the qualifications of NNSs in teaching writing and a call for favouring NESs to correct and guide students' writing (Takashima, 1987; Kobayashi, 1992), despite recognition that knowledge of the target language as a second or foreign language should constitute the basis for non-native teachers' 'confidence, not their insecurity' (Seidlhofer, 1999: 238) or 'threatened . . . authority' (Tang, 1997: 578). If students receive different directives from the two groups of teachers, as the present study suggests may happen, there is a need for closer cooperation between NES and NNS teachers.
NES teachers need to understand the background training of their EFL Chinese students just as NNS teachers need to incorporate the standards of their NES counterparts into their teaching and testing evaluation so as to socialize their students into English academic discourse. On the one hand, NESs should be aware of the different variety of English writing being cultivated by NNS teachers. They should adjust their teaching accordingly in a setting where most students are studying to pass exams rated by local NNS teachers. Based on similar findings, Kobayashi and Rinnert (1996) suggested that NES teachers in an EFL context should be aware of the needs of local students to meet the standards of NNS teachers. On the other hand, NNS teachers should be aware of the diverse instructional emphases of their NES colleagues. Both NES and NNS teachers should also help their students to understand how readers' expectations can vary based on their cultural backgrounds, and how this would affect the way they should write. Seminars for teachers and training workshops for raters should be organized to facilitate communication between NES and NNS teachers so that each group gains access to the shared background and beliefs of the other. Such cooperation and sharing of expertise could lead to more effective writing instruction for EFL students in China.
Finally, the present study has at least two limitations. One limitation is that it used various statistical procedures to explore group differences in the scoring of a set of 10 essays rather than individual essay scores from each rater. Qualitative case studies need to be conducted to explore how individual raters vary their judgments from essay to essay. Another limitation of the study is that, although a contrast between NES and NNS teachers is the main point of the research, the distinction between the two groups can be problematic given the lack of precise population definitions or sampling techniques (for various discussions on the topic, see Braine, 1999). I used the terms NES and NNS only to follow previous studies in distinguishing educators in English language teaching according to whether English is their first or second language. It is important to caution that neither the present study nor any other published research has utilized populations of English instructors or raters that are truly representative of larger populations. There remains a great gap in research before generalizations can be made about the values or practices of NESs or NNSs.
Acknowledgements
This project was funded first by the Committee of Research and Conference Grants of the University of Hong Kong, and then by the Humanities and Social Sciences Research Grant of the University of British Columbia. An earlier version of the paper was presented at the 22nd Annual Language Testing Research Colloquium, 2000. I thank the participating students and teachers, Quiofang Wen and Wenyu Wong for assistance in data collection and analyses, and Maria Trache for statistical advice. I also thank Alister Cumming, Monique Bournot-Trites, Bonny Norton, Lee Gunderson, Lyle F. Bachman, and three anonymous reviewers for their valuable comments on earlier drafts of the paper.
V References
Basham, C. and Kwachka, P.E. 1991: Reading the world differently: a cross-cultural approach to writing assessment. In Hamp-Lyons, L., editor, Assessing second language writing in academic contexts. Norwood, NJ: Ablex, 37–49.

Braine, G. 1999: Non-native educators in English language teaching. Mahwah, NJ: Lawrence Erlbaum.
Brown, J.D. 1991: Do English and ESL faculties rate writing samples differently? TESOL Quarterly 25, 587–603.
Connor-Linton, J. 1995a: Looking behind the curtain: what do L2 composition ratings really mean? TESOL Quarterly 29, 762–65.
Connor-Linton, J. 1995b: Crosscultural comparison of writing standards: American ESL and Japanese EFL. World Englishes 14, 99–115.
Cumming, A. 1990: Expertise in evaluating second language compositions. Language Testing 7, 31–51.
Hamp-Lyons, L. 1989: Raters respond to rhetoric in writing. In Dechert, H.W. and Raupach, M., editors, Interlingual processes. Tübingen: Gunter Narr, 229–44.
Hamp-Lyons, L. 1991a: Issues and directions in assessing second language writing in academic contexts. In Hamp-Lyons, L., editor, Assessing second language writing in academic contexts. Norwood, NJ: Ablex, 323–29.
Hamp-Lyons, L. 1991b: Reconstructing 'academic writing proficiency'. In Hamp-Lyons, L., editor, Assessing second language writing in academic contexts. Norwood, NJ: Ablex, 127–53.
Hamp-Lyons, L. 1995: Rating nonnative writing: the trouble with holistic scoring. TESOL Quarterly 29, 759–62.
Hamp-Lyons, L. and Zhang, B.W. 2001: World Englishes: issues in and from academic writing assessment. In Flowerdew, L. and Peacock, M., editors, Research perspectives on English for academic purposes. Cambridge: Cambridge University Press, 101–16.
Hinkel, E. 1994: Native and nonnative speakers' pragmatic interpretations of English texts. TESOL Quarterly 28, 353–76.
Hughes, A. and Lascaratou, C. 1982: Competing criteria for error gravity. ELT Journal 36, 175–82.
James, C. 1977: Judgments of error gravities. ELT Journal 31, 116–24.
Khalil, A. 1985: Communicative error evaluation: native speakers' evaluation and interpretation of written errors of Arab EFL learners. TESOL Quarterly 19, 335–51.
Kobayashi, H. and Rinnert, C. 1996: Factors affecting composition evaluation in an EFL context: cultural rhetorical pattern and readers' background. Language Learning 46, 397–437.
Kobayashi, T. 1992: Native and nonnative reactions to ESL compositions. TESOL Quarterly 26, 81–112.
Land, R. and Whitley, C. 1989: Evaluating second language essays in regular composition classes: toward a pluralistic U.S. rhetoric. In Johnson, D. and Roen, D., editors, Richness in writing. New York: Longman, 289–93.
Machi, E. 1988: An exploratory study on essay-grading behavior of native speakers and Japanese teachers of English. Paper presented at the 27th Annual Japan Association of College English Teachers Convention, Tokyo.
Mohan, B.A. and Lo, W.A.Y. 1985: Academic writing and Chinese students: transfer and developmental factors. TESOL Quarterly 19, 515–34.
Santos, T. 1988: Professors' reactions to the academic writing of nonnative-speaking students. TESOL Quarterly 22, 69–90.
Seidlhofer, B. 1999: Double standards: teacher education in the expanding circle. World Englishes 18, 233–45.
Shohamy, E., Gordon, C.M. and Kraemer, R. 1992: The effect of raters' background and training on the reliability of direct writing tests. Modern Language Journal 76, 27–33.
Silva, T. 1997: On the ethical treatment of ESL writers. TESOL Quarterly 31, 359–63.
Takashima, H. 1987: To what extent are non-native speakers qualified to correct free compositions: a case study. British Journal of Language Teaching 25, 43–48.
Tang, C. 1997: On the power and status of nonnative ESL teachers. TESOL Quarterly 31, 577–80.
Tedick, D.J. and Mathison, M.A. 1995: Holistic scoring in ESL writing assessment: what does an analysis of rhetorical features reveal? In Belcher, D. and Braine, G., editors, Academic writing in a second language: essays on research and pedagogy. Norwood, NJ: Ablex, 205–30.
Vaughan, C. 1991: Holistic assessment: what goes on in the raters' mind? In Hamp-Lyons, L., editor, Assessing second language writing in academic contexts. Norwood, NJ: Ablex, 111–25.
Wall, D. and Alderson, J.C. 1993: Examining washback: the Sri Lankan impact study. Language Testing 10, 41–69.
Zhang, W.X. 1999: The rhetorical patterns found in Chinese EFL student writers' examination essays in English and the influence of these patterns on rater response. PhD thesis, Hong Kong Polytechnic University.

Appendix 1 Writing sample*


Essay 1
Since the invention of TV, newspaper business has deciding. Fewer and fewer people are reading newspaper and someone even declares that newspaper is dying out soon.
Judging from the means of providing information, TV has its unprevailable advantage over the newspaper. It provides visions and sounds. It was a great event that the first day people could see with their eyes what had happened in the other part of the world. Thus the information on TV became powerful for pictures are different from printed words – seeing is believing. In this sense, TV can present the news, the events more vividly and carry more information from several angles. It should be a better source.

However, the truth is that newspaper is more reliable simply because of the commercialization of TV. TV is powerful and people abuse its power. Most of what we get from TV is not information, but a kind of entertainment. The news reported on TV are usually in terse, popular language with impressive images, but without much profounding critics. We find more good critics on newspapers. The TV people don't give critical opinions, they are busy making questionaires about masses' taste and fussy with soapy shows. And if they have an attitude, (I feel happy as well as worried about it.) they can use, manipulate their powerful weapon – TV without letting us know. Pictures clipped, words omitted, repetitions on certain facts, all these can give us a totally wrong idea. Yet, the worst thing is we believe the news, for seeing is believing. One Hong Kong banker was interviewed by foreign journalists about his opinion on Hong Kong's future. He said 'There is difficulty ahead, but I have the full confidence.' On screen, you just hear 'There is difficulty ahead' and see a shot of his lowered head. What is the truth? I remember one film named 'making city' starred by Dustin Huffman. The protagonist doesn't tend to harm anyone, but is made the image as a terrorist by the media and shoots himself at last. When Dustin cries 'We killed him' in the end of the film, I feel there's something much worse than entertainment that TV can provide – the false.
Of course, TV itself is not bad. It's like money and depends on how we use it. The problem with the masses is that we are easily taken in by what we see and by indulging too much in its entertainment, we don't think and become slave of it. In this sense, the ancient means of data leaves us more space for thinking.
Note: *Errors and mistakes are retained verbatim from the original student's essay.
Appendix 2 Evaluation of students' writing
This project aims to find out how teachers of English rate university students' essays. Please read and rate the 10 essays provided using a 10-point scale (point 10 being the highest on the scale) and then state, in order of importance, three reasons or characteristics in each essay that you think have most influenced your rating of that essay (reason 1 being the most important).

Essays    Ratings    Reason 1    Reason 2    Reason 3
1
2
3
4
5
6
7
8
9
10

Appendix 3 Coding scheme
(Major category / Sub-category: Definition, followed by examples of positive and negative comments)

General / General: General comments on overall quality of writing.
  Positive: Well written.
  Negative: It fails to complete the task.

Content / General: General comments on content.
  Positive: Good contents.
  Negative: Content shallow.

Content / Ideas: General or specific comments on ideas and thesis.
  Positive: Good ideas. Thesis clearly stated.
  Negative: Poor idea. No thesis statement.

Content / Arguments: General or specific comments on aspects of arguments such as balance, use of comparison, counter-arguments, support, use of details or examples, clarity, unity, maturity, originality, relevance, logic, depth, objectivity, conciseness, development and expression.
  Positive: Good argument. Arguments balanced. Arguments well supported. Logical argument. Argument well developed.
  Negative: Poor argument. Lack of arguments on the newspaper issue. Arguments not very well supported. No logic in reasoning. Limited development of argument.

Organization / General: General comments on organization.
  Positive: Excellent organization.
  Negative: Weak organization.

Organization / Paragraphs: Comments on the macro level concerning paragraphs, introductions, and conclusions.
  Positive: Paragraphs are well arranged. Good introduction. Unique conclusion.
  Negative: Paragraphs are poorly organized. Poor introduction. No conclusion.

Organization / Transitions: Comments on the micro level concerning transitions, coherence, and cohesion.
  Positive: Good transitions. Coherent. Cohesive.
  Negative: Bad use of transition words. Lacks coherence. Lack of connectives.

Language / General: General comments on language.
  Positive: Language good.
  Negative: Poor English.

Language / Intelligibility: Comments on whether the language is clear or easy to understand.
  Positive: Easy to read and follow.
  Negative: Meaning unclear.

Language / Accuracy: General comments on accuracy or specific comments on word use, grammar and mechanics.
  Positive: Accurate language. Good use of words. Good grammar. Few errors of spelling and punctuation.
  Negative: Too many errors. Bad diction. Poor grammar. Too many errors in mechanics.

Language / Fluency: Comments on fluency, conciseness, maturity, naturalness, appropriateness, and vividness of language.
  Positive: Fluent language. Concise language. Complex sentences. Idiomatic English. Vivid language.
  Negative: Language not smooth. Wordy. Use of language immature. Chinglish. Voice not appropriate.

Length / Length: Comments on whether the writer has fulfilled the word limit.
  Positive: About 250 words.
  Negative: Too short.