Further evidence for a functionalist approach to translation quality evaluation

Sonia Colina
The University of Arizona

Target 21:2 (2009), 235–264. doi 10.1075/target.21.2.02col
issn 0924-1884 / e-issn 1569-9986 © John Benjamins Publishing Company

Colina (2008) proposes a componential-functionalist approach to translation quality evaluation and reports on the results of a pilot test of a tool designed according to that approach. The results show good inter-rater reliability and justify further testing. The current article presents an experiment designed to test the approach and tool. Data was collected during two rounds of testing. A total of 30 raters, consisting of Spanish, Chinese and Russian translators and teachers, were asked to rate 4–5 translated texts (depending on the language). Results show that the tool exhibits good inter-rater reliability for all language groups and texts except Russian, and suggest that the low reliability of the Russian raters' scores is unrelated to the tool itself. The findings are in line with those of Colina (2008).

Keywords: quality, assessment, evaluation, rating, componential, functionalism, errors

0. Introduction

Recent US federal mandates (e.g. White House Executive Order #13166),1 requiring health care providers who are recipients of federal funds to provide language translation and interpretation for patients with limited English proficiency (LEP), have brought the long-standing issue of translation quality to a wider audience of health care professionals (e.g. managers, decision makers, industry stakeholders, private foundations), who generally feel unprepared to address the topic. A striking example of how challenging quality evaluation can be for health care organizations is illustrated by the experience of Hablamos Juntos, an initiative funded by the Robert Wood Johnson Foundation to develop practical solutions to language barriers to health care. Several healthcare providers (including hospitals) working with the program identified what they believed were the best translations available. Eighty-seven documents, rated as highly satisfactory and recommended for replication, were collected from the providers. Examination of these health education texts by doctorate-level Spanish language specialists resulted in quality being identified as a problem. Many of these texts were cumbersome to read, to the point that readers required the English originals to decipher the intended meanings of some translations. It became clear that these texts were potentially hampering health care quality and outcomes by not providing needed access to intended health care information for patients with limited English proficiency. Furthermore, health care administrators overseeing the translation processes that produced these texts had not identified quality as a problem and needed assistance assessing the quality of non-English written materials. It was this context that prompted the launch of the Translation Quality Assessment (TQA) project, funded as one of various HJ initiatives to improve communication between health providers and patients with limited English proficiency. The TQA project aims to design and test a research-based prototype tool that could be used by health care organizations to assess the quality of translated materials, one able to identify a wide range of quality. Colina (2008) describes the initial version of the tool and the first phase of testing. The results of a pilot experiment, also reported in Colina (2008), reveal good inter-rater reliability and provide justification for further testing. The current article presents a second experiment designed to test the approach and tool.

1. Translation quality revisited

Translation quality evaluation is probably one of the most controversial, intensely debated topics in translation scholarship and practice. Yet, progress in this area does not seem to correlate with the intensity of the debate.
One may wonder whether the situation is perhaps partly related to the diverse nature of the definitions of translation. In a field such as translation studies, filled with unstated, often culturally-dependent assumptions about the role of translation and translators, equivalence and literalness, translation norms and translation standards, it is not surprising that quality and evaluation have remained elusive to definition or standards. Current reviews of the literature offer support for this hypothesis (Colina 2008, House 2001, Lauscher 2000), as they reveal a multiplicity of views and priorities in the area of translation quality. In one recent overview, Colina (2008) classifies the various approaches into two major groups according to whether their orientation is experiential or theoretical; parts of that overview are reproduced here for ease of reference (see further Colina 2008).

1.1 Experiential approaches

Many methods of translation quality assessment fall within this category. They tend to be ad hoc, anecdotal marking scales developed for the use of a particular professional organization or industry, e.g., the ATA certification exam, the SAE J2450 Translation Quality Metric for the automotive industry, the LISA QA tool for localization.2 While the scales are often adequate for the particular purposes of the organization that created them, they suffer from limited transferability, precisely due to the absence of theoretical and/or research foundations that would permit their transfer to other environments. For the same reason, it is difficult to assess the replicability and inter-rater reliability of these approaches.

1.2 Theoretical approaches

Recent theoretical, research-based approaches tend to focus on the user of a translation and/or the text. They have also been classified as equivalence-based or functionalist (Lauscher 2000).
These approaches arise out of a theoretical framework or stated assumptions about the nature of translation; however, they tend to cover only partial aspects of quality, and they are often difficult to apply in professional or teaching contexts.

1.2.1 Reader-response approaches

Reader-response approaches evaluate the quality of a translation by assessing whether readers of the translation respond to it as readers of the source would respond to the original (Nida 1964, Carroll 1966, Nida and Taber 1969). The reader-response approach must be credited with recognizing the role of the audience in translation, more specifically, of translation effects on the reader as a measure of translation quality. This is particularly noteworthy in an era when the dominant notion of text was that of a static object on a page.

Yet, the reader-response method is also problematic because, in addition to the difficulties inherent in the process of measuring reader response, the response of a reader may not be equally important for all texts, especially for those that are not reader-oriented (e.g., legal texts). The implication is that reader response will not be equally informative for all types of translation. In addition, this method addresses only one aspect of a translated text (i.e., equivalence of effect on the reader), ignoring others, such as the purpose of the translation, which may justify or even require a slightly different response from the readers of the translation. One also wonders if it is in fact possible to determine whether two responses are equivalent, as even monolingual texts can trigger non-equivalent reactions from slightly different groups of readers. Since, in most cases, the readership of a translated text is different than that envisioned by the writer of the original,3 one can imagine the difficulties entailed by equating quality with equivalence of response.
Finally, as with many other theoretical approaches, reader-response testing is time-consuming and difficult to apply to actual translations. At a minimum, careful selection of readers is necessary to make sure that they belong to the intended audience for the translation.

1.2.2 Textual and pragmatic approaches

Textual and pragmatic approaches have made a significant contribution to the field of translation evaluation by shifting the focus from counting errors at the word or sentence level to evaluating texts and translation goals, giving the reader and communication a much more prominent role. Yet, despite these advances, none of these approaches can be said to have been widely adopted by either professionals or scholars.

Some models have been criticized because they focus too much on the source text (Reiss 1971) or on the target text (Skopos) (Reiss and Vermeer 1984, Nord 1997); Reiss argues that the text type and function of the source text is the most important factor in translation and that quality should be assessed with respect to it. For Skopos Theory, it is the text type and function of the translation that is of paramount importance in determining the quality of the translation.

House's (1997, 2001) functional pragmatic model relies on an analysis of the linguistic-situational features of the source and target texts, a comparison of the two texts, and the resulting assessment of their match. The basic measure of quality is that the textual profile and function of the translation match those of the original, the goal being functional equivalence between the original and the translation. One objection that has been raised against House's functional model is its dependence on the notion of equivalence, often a vague and controversial term in translation studies (Hönig 1997).
This is a problem because translations are sometimes commissioned for a somewhat different function than that of the original; in addition, a different audience and time may require a slightly different function than that of the source text (see Hönig 1997 for more on the problematic notion of equivalence). These scenarios are not contemplated by equivalence-based theories of translation. Furthermore, one can argue that what qualifies as equivalent is as variegated as the notion of quality itself. Other equivalence-based models of evaluation are Gerzymisch-Arbogast (2001), Neubert (1985), and Van den Broeck (1985). In sum, the reliance on an a priori notion of equivalence is problematic and limiting in descriptive as well as explanatory value.

An additional objection against textual and pragmatic approaches is that they are not precise about how evaluation is to proceed after the analysis of the source or the target text is complete, or after the function of the translation has been established as the guiding criterion for making translation decisions. This obviously affects the ease with which the models can be applied to texts in professional settings. Hönig, for instance, after presenting some strong arguments for a functionalist approach to evaluation, does not offer any concrete instantiation of the model, other than in the form of some general advice for translator trainers. He comes to the conclusion that "the speculative element will remain at least as long as there are no hard and fast empirical data which serve to prove what a typical reader's responses are like" (1997: 32).4 The same criticism regarding the difficulty involved in applying textual and theoretical models to professional contexts is raised by Lauscher (2000).
She explores possible ways to bridge the gap between theoretical and practical quality assessment, concluding that translation criticism could move closer to practical needs "by developing a comprehensive translation tool" (2000: 164).

Other textual approaches to quality evaluation are the argumentation-centered approach of Williams (2001, 2004), in which evaluation is based on argumentation and rhetorical structure, and corpus-based approaches (Bowker 2001). The argumentation-centered approach is also equivalence-based, as a translation must reproduce the argument structure of the ST to meet minimum criteria of adequacy (Williams 2001: 336). Bowker's corpus-based model uses a comparatively large and carefully selected collection of naturally occurring texts, stored in machine-readable form, as a benchmark against which to compare and evaluate specialized student translations. Although Bowker (2001) presents a novel, valuable proposal for the evaluation of students' translations, it does not provide specific indications as to how translations should be graded (2001: 346). In sum, argumentation and corpus-based approaches, although presenting crucial aspects of translation evaluation, are also complex and difficult to apply in professional environments (and, one could argue, in the classroom as well).

1.3 The functional-componential approach (Colina 2008)

Colina (2008) argues that current translation quality assessment methods have not achieved a middle ground between theory and applicability; while anecdotal approaches lack a theoretical framework, the theoretical models often do not contain testable hypotheses (i.e., they are non-verifiable) and/or are not developed with a view towards application in professional and/or teaching environments.
In addition, she contends that theoretical models usually focus on partial aspects of translation (e.g., reader response, textual aspects, pragmatic aspects, relationship to the source, etc.): perhaps due to practical limitations and the sheer complexity of the task, some of these approaches overlook the fact that quality in translation is a multifaceted reality, and that a general, comprehensive approach to evaluation may need to address multiple components of quality simultaneously.

As a response to the inadequacies identified above, Colina (2008) proposes an approach to translation quality evaluation based on a theoretical approach (functionalist and textual models of translation) that can be applied in professional and educational contexts. In order to show the applicability of the model in practical settings, as well as to develop testable hypotheses and research questions, Colina and her collaborators designed a componential, functionalist, textual tool (henceforth the TQA tool) and pilot-tested it for inter-rater reliability (cf. Colina 2008 for more on the first version of this tool). The tool evaluates components of quality separately, consequently reflecting a componential approach to quality; it is also considered functionalist and textual, given that evaluation is carried out relative to the function and the characteristics of the audience specified for the translated text.

As mentioned above, it seems reasonable to hypothesize that disagreements over the definition of translation quality are rooted in the multiplicity of views of translation itself and in different priorities regarding quality components: it is often the case that a requester's view of quality will not coincide with that of the evaluator; yet, without explicit criteria on which to base the evaluation, the evaluator can only rely on his/her own views.
In an attempt to introduce flexibility with regard to different conditions influencing quality, the proposed TQA tool allows for a user-defined notion of quality in which it is the user or requester who decides which aspects of quality are more important for his/her communicative purposes. This can be done either by adjusting customer-defined weights for each component or simply by assigning higher priorities to some components. Custom weighting of components is also important because the effect of a particular component on the whole text may vary depending on textual type and function. An additional feature of the TQA tool is that it does not rely on a point deduction system; rather, it tries to match the text under evaluation with one of several descriptors provided for each category/component of evaluation. In order to capture the descriptive, customer-defined notion of quality, the original tool was modified in the second experiment to include a cover sheet (see Appendix 1).

The experiment in Colina (2008) sets out to test the functional approach to evaluation by testing the tool's inter-rater reliability. 37 raters and 3 consultants were asked to use the tool to rate three translated texts. The texts selected for evaluation consisted of reader-oriented health education materials. Raters were bilinguals, professional translators, and language teachers. Some basic training was provided. Data was collected by means of the tool and a post-rating survey. Some differences in ratings could be ascribed to rater qualifications: teachers' and translators' ratings were more alike than those of bilinguals; bilinguals were found to rate higher and faster than the other groups. Teachers also tended to assign higher ratings than translators. It was shown that different types of raters were able to use the tool without significant training.
Pilot testing results indicate good inter-rater reliability for the tool and the need for further testing. The current paper focuses on a second experiment designed to further test the approach and tool proposed in Colina (2008).

2. Second phase of TQA testing: Methods and results

2.1 Methods

One of the most important limitations of the experiment in Colina (2008) concerns the numbers and groups of participants. Given the project objective of ensuring applicability across languages frequently used in the USA, subject recruitment was done in three languages: Spanish, Russian, and Chinese. As a result, resources and time for recruitment had to be shared amongst the languages, with smaller numbers of subjects per language group. The testing described in the current experiment includes more subjects and additional texts. More specifically, the study reported in this paper aims:

I. To test the TQA tool again for inter-rater reliability (i.e. to what degree trained raters use the TQA tool consistently) by answering the following questions:
Question 1. For each text, how consistently do all raters rate the text?
Question 2. How consistently do raters in the first session (Benchmark) rate the texts?
Question 3. How consistently do raters in the second session (Reliability) rate the texts?
Question 4. How consistently do raters rate each component of the tool? Are there some test components where there is higher rater reliability?

II. To compare the rating skills/behavior of translators and teachers: Is there a difference in scoring between translators and teachers? (Question 5, Section 2.2).

Data was collected during two rounds of testing: the first, referred to as the Benchmark Testing, included 9 raters; the second session, the Reliability Testing, included 21 raters. Benchmark and Reliability sessions consisted of a short training session, followed by a rating session.
Raters were asked to rate 4–5 translated texts (depending on the language) and had one afternoon and one night to complete the task. After their evaluation worksheets had been submitted, raters were required to submit a survey on their experience using the tool. They were paid for their participation.

2.1.1 Raters

Raters were drawn from the pool used for the pre-pilot and pilot testing sessions reported in Colina (2008) (see Colina [2008] for selection criteria and additional details). A call was sent via email to all those raters selected for the pre-pilot and pilot testing (including those who were initially selected but did not take part). All raters available participated in this second phase of testing.

As in Colina (2008), it was hypothesized that similar rating results would be obtained within the members of the same group. Therefore, raters were recruited according to membership in one of two groups: professional translators, and language teachers (language professionals who are not professional translators). Membership was assigned according to the same criteria as in Colina (2008). All selected raters exhibited linguistic proficiency equivalent to that of a native (or near-native) speaker in the source and in one of the target languages.

Professional translators were defined as language professionals whose income comes primarily from providing translation services. Significant professional experience (5 years minimum; most had 12–20 years of experience), membership in professional organizations, and education in translation and/or a relevant field were also needed for inclusion in this group. Recruitment for these types of individuals was primarily through the American Translators Association (ATA). Although only two applicants were ATA certified, almost all were ATA affiliates (members).

Language teachers were individuals whose main occupation was teaching language courses at a university or other educational institution.
They may have had some translation experience, but did not rely on translation as their source of income. A web search of teaching institutions with known foreign language programs was used for this recruitment. We reached out to schools throughout the country at both the community college and university levels. The definition of teacher did not preclude graduate student instructors.

Potential raters were assigned to the above groups on the basis of the information provided in their resume or curriculum vitae and a language background questionnaire included in a rater application.

The bilingual group in Colina (2008) was eliminated from the second experiment, as subjects were only available for one of the languages (Spanish). Translation competence models and research suggest that bilingualism is only one component of translation competence (Bell 1991, Cao 1996, Hatim and Mason 1997, PACTE 2008). Nonetheless, since evaluating translation products is not the same as translating, it is reasonable to hypothesize that other language professionals, such as teachers, may have the competence necessary to evaluate translations; this may be particularly true in cases, such as the current project, in which the object of evaluation is not translator competence but translation products. This hypothesis would be borne out if the ratings provided by translators and teachers are similar.

As mentioned above, data was collected during two rounds of testing: the first one, the Benchmark Testing, included 9 raters (3 Russian, 3 Chinese, 3 Spanish); these raters were asked to evaluate 4–5 texts (per language) that had been previously selected as clearly of good or bad quality by expert consultants in each language.
The second session, the Reliability Testing, included 21 raters, distributed as follows:

Spanish: 5 teachers, 3 translators (8)
Chinese: 3 teachers, 4 translators (7)
Russian: 3 teachers, 3 translators (6)

Differences across groups reflect general features of that language group in the US. Among the translators, the Russians had degrees in Languages, History and Translating, Engineering, and Nursing from Russian and US universities and experience ranging from 12 to 22 years; the Chinese translators' experience ranged from 6 to 30 years, and their education included Chinese Language and Literature, Philosophy (MA), English (PhD), Neuroscience (PhD) and Medicine (MD), with degrees obtained in China and the US. Their Spanish counterparts' experience varied from 5 to 20 years, and their degrees included areas such as Education, Spanish and English Literature, Latin American Studies (MA), and Creative Writing (MA). The Spanish and Russian teachers were perhaps the most uniform groups, including college instructors (PhD students) with MAs in Spanish or Slavic Linguistics, Literature, and Communication, and one college professor of Russian. With one exception, they were all native speakers of Spanish or Russian with formal education in the country of origin. Chinese teachers were college instructors (PhD students) with MAs in Chinese, one college professor (PhD in Spanish), and an elementary school teacher and tutor (BA in Chinese). They were all native speakers of Chinese.

2.1.2 Texts

As mentioned above, experienced translators serving as language consultants selected the texts to be used in the rating sessions. Three consultants were instructed to identify health education texts translated from English into their language. Texts were to be publicly available on the Internet: half were to be very good and the other half were to be considered very poor on reading the text.
Those texts were used for the Benchmark session of testing, during which they were rated by the consultants and two additional expert translators. The texts where there was the most agreement in rating were selected for the Reliability Testing. The Reliability texts comprised five Spanish texts (three good and two bad), four Russian texts, and four Chinese texts (two per language of good quality and two of bad quality), making up a total of thirteen texts.

2.1.3 Tool

The tool tested in Colina (2008) was modified to include a cover sheet consisting of two parts. Part I is to be completed by the person requesting the evaluation (i.e. the Requester) and read by the rater before he/she starts his/her work. It contains the Translation Brief, relative to which the evaluation must always take place, and the Quality Criteria, clarifying requester priorities among components. The TQA Evaluation Tool included in Appendix 1 contains a sample Part I, as specified by Hablamos Juntos (the Requester), for the evaluation of a set of health education materials. The Quality Criteria section reflects the weights assigned to the four components in the Scoring Worksheet at the end of the tool. Part II of the Cover Sheet is to be filled in by the raters after the rating is complete. An Assessment Summary and Recommendation section was included to allow raters the opportunity to offer an action recommendation on the basis of their ratings, i.e. what should the requester do now with this translation? Edit it? Minor or small edits? Redo it entirely? An additional modification to the tool consisted of eliminating or adding descriptors so that each category would have an equal number of descriptors (four for each component) and revising the scores assigned so that the maximum number of points possible would be 100. Some minor stylistic changes were made in the language of the descriptors.
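The descriptor-based, weighted mechanism just described can be sketched in a few lines of code. This is an illustrative sketch only: the component weights and descriptor point values below are hypothetical placeholders, not the actual values of the TQA Scoring Worksheet; only the overall mechanism (no point deductions; the rater matches the text to one of four descriptors per component, and weighted component scores sum to a 100-point maximum) follows the description above.

```python
# Hypothetical requester-defined weights (maximum points per component),
# summing to the 100-point maximum mentioned in the text. The four
# components follow the tool: Target Language (TL), Functional and Textual
# Adequacy (FTA), Non-Specialized Content (MEAN), Specialized Content and
# Terminology (TERM).
WEIGHTS = {"TL": 30, "FTA": 25, "MEAN": 25, "TERM": 20}

# Four descriptor levels per component, expressed here as fractions of the
# component's maximum; the rater picks the one descriptor that matches the
# text (0 = best descriptor). The fractions are invented for illustration.
DESCRIPTOR_LEVELS = [1.0, 0.75, 0.5, 0.25]

def score_translation(chosen_levels):
    """chosen_levels maps component name -> chosen descriptor index (0 = best)."""
    total = 0.0
    for component, max_points in WEIGHTS.items():
        total += max_points * DESCRIPTOR_LEVELS[chosen_levels[component]]
    return total

# Example: a text matching the best descriptor for TL and TERM and the
# second-best descriptor for FTA and MEAN.
overall = score_translation({"TL": 0, "FTA": 1, "MEAN": 1, "TERM": 0})
print(overall)  # 30 + 18.75 + 18.75 + 20 = 87.5
```

A requester who cares more about terminology than style would simply raise the TERM weight and lower the others, which is the kind of customer-defined weighting the cover sheet is meant to record.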
2.1.4 Rater training

The Benchmark and Reliability sessions included training and rating sessions. The training provided was substantially the same as that offered in the pilot testing and described in Colina (2008): it focused on the features and use of the tool, and it consisted of PDF materials (delivered via email), a PowerPoint presentation based on the contents of the PDF materials, and a question-and-answer session delivered online via an Internet and phone conferencing system. Some revisions to the training reflect changes to the tool (including instructions on the new Cover Sheet), a few additional textual examples in Chinese, and a scored, completed sample worksheet for the Spanish group. Samples were not included for the other languages due to time and personnel constraints. The training served as a refresher for those raters who had already participated in the previous pilot training and rating (Colina 2008).5

2.2 Results

The results of the data collection were submitted to statistical analysis to determine to what degree trained raters use the TQA tool consistently. Table 1 and Figures 1a and 1b show the overall score of each text rated and the standard deviation between the overall score and the individual rater scores. 200-series texts are Spanish texts, 400s are Chinese, and 300s are Russian. The standard deviations range from 8.1 to 19.2 for Spanish, from 5.7 to 21.2 for Chinese, and from 16.1 to 29.0 for Russian.

Question 1. For each text, how consistently do all raters rate the text?

The standard deviations in Table 1 and Figures 1a and 1b offer a good measure of how consistently individual texts are rated. A large standard deviation suggests that there was less rater agreement (or that the raters differed more in their assessment). Figure 1b shows the average standard deviations per language.
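The per-text consistency measure reported in Table 1 can be reproduced with a short script. The rater scores below are invented for illustration (the article reports only the aggregated values, not the individual ratings), and the choice of the sample standard deviation formula is an assumption, since the study does not state which variant it used.

```python
import statistics

# Hypothetical overall scores from eleven raters for one text; the actual
# per-rater scores behind Table 1 are not published, so these are invented.
rater_scores = [95, 90, 88, 92, 85, 96, 91, 93, 89, 94, 97]

average = statistics.mean(rater_scores)
# Sample standard deviation of the rater scores around their mean; a larger
# value means less rater agreement on this text (cf. Question 1).
spread = statistics.stdev(rater_scores)

print(f"average score = {average:.1f}, SD = {spread:.1f}")
```

Applied per text and then averaged per language, this is the computation summarized in Table 1 and Figures 1a and 1b.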
According to Figure 1b, the Russian raters were the ones with the highest average standard deviation and the least consistent in their ratings. This is in agreement with the reliability coefficients shown below (Table 5), as the Russian raters have the lowest inter-rater reliability. Table 2 shows average scores, standard deviations, and average standard deviations for each component of the tool, per text and per language. Figure 2 represents average standard deviations per component and per language. There does not appear to be an obvious connection between standard deviations and components.

Table 1. Average score of each text and standard deviation

Text     # of raters   Average score   Standard deviation
Spanish
210      11            91.8            8.1
214      11            89.5            11.3
215      11            86.8            15.0
228      11            48.6            19.2
235      11            56.4            18.5
Avg.                                   14.42
Chinese
410      10            88.0            10.3
413      10            63.0            21.0
415      10            96.0            5.7
418      10            76.0            21.2
Avg.                                   14.55
Russian
312      9             59.4            16.1
314      9             82.8            15.6
315      9             75.6            22.1
316      9             67.8            29.0
Avg.                                   20.7

Figure 1a. Average score and standard deviation per text
Figure 1b. Average standard deviations per language

Although generally the components Target Language (TL) and Functional and Textual Adequacy (FTA) have higher standard deviations (i.e., ratings are less consistent), this is not always the case, as seen in the Chinese data (FTA). One would in fact expect the FTA category to exhibit the highest standard deviations, given its more holistic nature; yet, the data do not bear out this hypothesis, as the TL component also shows standard deviations that are higher than Non-Specialized Content (MEAN) and Specialized Content and Terminology (TERM).

Question 2.
How consistently do raters in the first session (Benchmark) rate the texts?

The inter-rater reliability for the Spanish and for the Chinese raters is remarkable; however, the inter-rater reliability for the Russian raters is too low (Table 3). This, in conjunction with the Reliability Testing results, leads us to believe in the presence of other, unknown factors, unrelated to the tool, that are responsible for the low reliability of the Russian raters.

Table 2. Average scores and standard deviations for four components, per text and per language

                    TL            FTA           MEAN          TERM
Text     Raters     Mean   SD     Mean   SD     Mean   SD     Mean   SD
Spanish
210      11         27.7   2.6    23.6   2.3    22.7   2.6    17.7   3.4
214      11         27.3   4.7    20.9   7.0    23.2   2.5    18.2   3.4
215      11         28.6   2.3    22.3   4.7    18.2   6.8    17.7   3.4
228      11         15.0   7.7    11.4   6.0    10.9   6.3    11.4   4.5
235      11         15.9   8.3    12.3   6.5    13.6   6.4    14.5   4.7
Avg. SD                    5.12          5.3           4.92          3.88
Chinese
410      10         27.0   4.8    22.0   4.8    21.0   4.6    18.0   2.6
413      10         18.0   9.5    16.5   5.8    14.0   5.2    14.5   3.7
415      10         28.5   2.4    25.0   0.0    23.5   2.4    19.0   2.1
418      10         22.5   6.8    21.0   4.6    16.0   7.7    16.5   4.1
Avg. SD                    5.875         3.8           4.975         3.125
Russian
312      9          18.3   7.1    15.0   6.1    13.3   6.6    12.8   4.4
314      9          25.6   6.3    21.7   5.0    19.4   3.9    16.1   4.2
315      9          23.3   9.4    18.3   7.9    17.8   4.4    16.1   4.2
316      9          20.0   10.3   16.7   7.9    17.2   7.1    13.9   6.5
Avg. SD                    8.275         6.725         5.5           4.825
Avg. SD (all lgs.)         6.3           5.3           5.1           3.9

Question 3. How consistently do raters in the second session (Reliability) rate the texts? How do the reliability coefficients compare for the Benchmark and the Reliability Testing?

The results of the reliability raters mirror those of the benchmark raters: the Spanish raters achieve a very good inter-rater reliability coefficient and the Chinese raters an acceptable inter-rater reliability coefficient, but the inter-rater reliability for the Russian raters is very low (Table 4).
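The article does not name the statistic behind its reliability coefficients. One common choice for this kind of design is Cronbach's alpha, computed with raters treated as "items" and texts as "subjects"; the sketch below illustrates that computation on invented scores. Both the choice of alpha and the numbers are assumptions for illustration, not the study's documented method or data.

```python
# Hedged sketch: Cronbach's alpha as one plausible inter-rater reliability
# coefficient, treating raters as "items" and texts as "subjects". The
# article does not state which coefficient it used, and the scores below
# are invented, so this only illustrates the kind of computation involved.

def cronbach_alpha(scores_by_rater):
    """scores_by_rater: one list per rater, each with one score per text."""
    k = len(scores_by_rater)          # number of raters ("items")
    n = len(scores_by_rater[0])       # number of texts ("subjects")

    def variance(xs):                 # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Variance of each rater's scores across texts.
    item_vars = [variance(r) for r in scores_by_rater]
    # Variance of the per-text totals (scores summed over raters).
    totals = [sum(r[t] for r in scores_by_rater) for t in range(n)]
    return (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))

# Three hypothetical raters scoring four texts: strong agreement on the
# good/bad split yields an alpha close to 1.
raters = [
    [90, 85, 50, 55],
    [92, 80, 45, 60],
    [88, 84, 52, 58],
]
print(round(cronbach_alpha(raters), 3))
```

Raters who disagree about which texts are good and which are bad would drive the coefficient down towards 0, which is the pattern the Russian groups' low coefficients suggest.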
Table 5 (see also Tables 3 and 4) shows that there was a slight drop in inter-rater reliability for the Chinese raters (from the benchmark rating to the reliability rating), but the Spanish raters achieved remarkable inter-rater reliability at both rating sessions. The slight drop among the Russian raters from the first to the second session is negligible; in any case, the inter-rater reliability is too low.

[Figure 2. Average standard deviations per tool component and per language]

Table 3. Reliability coefficients for benchmark ratings

          Reliability coefficient
Spanish   .953
Chinese   .973
Russian   .128

Question 4. How consistently do raters rate each component of the tool? Are there some test components where there is higher rater reliability?

The coefficients for the Spanish raters show very good reliability, with excellent coefficients for the first three components; the numbers for the Chinese raters are also very good, but the coefficients for the Russian raters are once again low, although some consistency is identified for the FTA and MEAN components (Table 6).

Table 6. Reliability coefficients for the four components of the tool (all raters per language group)

          TL      FTA     MEAN    TERM
Spanish   .952    .929    .926    .848
Chinese   .844    .844    .864    .783
Russian   .367    .479    .492    .292

In sum, very good reliability was obtained for the Spanish and Chinese raters, for the two testing sessions (Benchmark and Reliability Testing) as well as for all components of the tool. Reliability scores for the Russian raters are low. These results are in agreement with the standard deviation data presented in Tables 1–2, Figures 1a and 1b, and Figure 2. All of this leads us to believe that whatever the cause for the Russian coefficients, it was not related to the tool itself.

Question 5. Is there a difference in scoring between translators and teachers?

Table 7a and Table 7b show the scoring in terms of average scores and standard deviations for the translators and the teachers for all texts. Figures 3 and 4 show the mean scores and times for Spanish raters, comparing teachers and translators.

Table 4. Reliability coefficients for Reliability Testing

          Reliability coefficient
Spanish   .934
Chinese   .780
Russian   .118

Table 5. Inter-rater reliability: Benchmark and Reliability Testing

          Benchmark reliability coefficient   Reliability coefficient (Reliability Testing)
Spanish   .953                                .934
Chinese   .973                                .780
Russian   .128                                .118

Table 7a. Average scores and standard deviations for consultants and translators

              Score            Time
Text      Mean     SD      Mean     SD
210       93.3     7.5     75.8     59.4
214       93.3    12.1     94.2    101.4
215       85.0    17.9     36.3     18.3
228       46.7    20.7     37.5     22.3
235       46.7    18.6     49.5     38.9
410       91.4     7.5     46.0     22.1
413       62.9    21.0     40.7     13.7
415       96.4     4.8     26.1     15.4
418       69.3    22.1     52.4     22.2
312       52.5    15.1     26.7      2.6
314       88.3    10.3     22.5      4.2
315       74.2    26.3     28.7      7.8
316       63.3    32.7     25.8      6.6

Table 7b. Average scores and standard deviations for teachers

              Score            Time
Text      Mean     SD      Mean     SD
210       90.0     9.4     63.6     39.7
214       85.0     9.4     67.0     41.8
215       89.0    12.4     36.0     30.5
228       51.0    19.5     38.0     31.7
235       68.0    10.4     57.6     40.2
410       80.0    13.2     61.0     27.7
413       63.3    25.7     71.0     24.6
415       95.0     8.7     41.0     11.5
418       91.7     5.8     44.0      6.6
312       73.3     5.8     55.0     56.7
314       71.7    20.8     47.7     62.7
315       78.3    14.4     37.7     45.5
316       76.7    22.5     46.7     63.5

The corresponding data for Chinese appear in Figures 5 and 6, and for Russian in Figures 7 and 8. Spanish teachers tend to rate somewhat higher (3 out of 5 texts) and spend more time rating than translators (all texts). As with the Spanish raters, it is interesting to note that Chinese teachers rate either higher than or similarly to translators (Figure 5): only one text obtained lower ratings from teachers than from translators.
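The article does not name the exact inter-rater reliability statistic behind the coefficients in Tables 3-6. One standard choice for this kind of consistency question is Cronbach's alpha, treating raters as "items" and texts as cases; the sketch below is written under that assumption, with illustrative data (function name and scores are not from the study):

```python
from statistics import pvariance

def cronbach_alpha(scores_by_rater):
    """Cronbach's alpha with raters as items.

    scores_by_rater: one equal-length list of text scores per rater.
    """
    k = len(scores_by_rater)
    # Total score each text receives across all raters
    totals = [sum(text_scores) for text_scores in zip(*scores_by_rater)]
    # Sum of the individual raters' score variances
    rater_variance = sum(pvariance(r) for r in scores_by_rater)
    return k / (k - 1) * (1 - rater_variance / pvariance(totals))

# Two raters who score the texts identically produce alpha = 1.0;
# near-agreement still yields a high coefficient.
cronbach_alpha([[90, 50, 70], [85, 55, 75]])  # ≈ 0.973
```

On this reading, a coefficient near 1 (Spanish, .953) means the raters rank and space the texts almost identically, while a coefficient near 0 (Russian, .128) means their scores barely covary.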
Timing results also mirror those found for the Spanish subjects: teachers take longer to rate than translators (Figure 6). Despite the low inter-rater reliability among Russian raters, the same trend was found when comparing Russian translators and teachers with the Chinese and the Spanish: Russian teachers rate similarly to or slightly higher than translators, and they clearly spend more time on the rating task than the translators (Figure 7 and Figure 8). This also mirrors the findings of the pre-pilot and pilot testing (Colina 2008).

[Figure 3. Mean scores for Spanish raters]
[Figure 4. Time for Spanish raters]
[Figure 5. Mean scores for Chinese raters]
[Figure 6. Time for Chinese raters]
[Figure 7. Mean scores for Russian raters]
[Figure 8. Time for Russian raters]

In order to investigate the irregular behavior of the Russian raters and to try to obtain an explanation for the low inter-rater reliability, the correlation between the total score and the recommendation (the field rec) issued by each rater was considered. This is explored in Table 8. One would expect a relatively high (negative) correlation because of the inverse relationship between a high score and a low recommendation. As illustrated in the three sub-tables below, all Spanish raters, with the exception of SP02PB, show a strong correlation between the recommendation and the total score, ranging from 0.854 (SP01VS) to 0.981 (SP02MC). The results are similar for the Chinese raters, whereby all raters correlate very highly between the recommendation and the total score, ranging from 0.867 (CH01BJ) to a perfect 1.00 (CH02JG). The results are different for the Russian raters, however. It appears that three raters (RS01EM, RS02MK, and RS01NM) do not correlate highly between their recommendations and their total scores. A closer look especially at these raters is warranted, as is a closer look at RS02LB, who was excluded from the correlation analysis due to a lack of variability (the rater uniformly recommended a 2 for all texts, regardless of the total score he or she assigned). The other Russian raters exhibited strong correlations. This result suggests some unusual behavior in the Russian raters, independent of the tool design and tool features, as the scores and overall recommendation do not correlate highly, as expected.

Table 8 (3 sub-tables). Correlation between recommendation and total score

8.1 Spanish raters:
SP04AR 0.923   SP01JC 0.958   SP01VS 0.854   SP02JA 0.938   SP02LA 0.966   SP02PB 0.421
SP02AB 0.942   SP01PC 0.975   SP01CC 0.913   SP02MC 0.981   SP01PS 0.938

8.2 Chinese raters:
CH01RL 0.935   CH04YY 0.980   CH01AX 0.996   CH02AC 0.894   CH02JG 1.000
CH01KG 0.955   CH02AH 0.980   CH01BJ 0.867   CH01CK 0.943   CH01FL 0.926

8.3 Russian raters:
RS01EG 0.998   RS01EM 0.115   RS04GN 0.933   RS02NB 1.000   RS02LB n/a
RS02MK 0.500   RS01SM 0.982   RS01NM 0.500   RS01RW 0.993

3. Conclusions

As in Colina (2008), testing showed that the TQA tool exhibits good inter-rater reliability for all language groups and texts, with the exception of Russian.
It was also shown that the low reliability of the Russian raters' scores is probably due to factors unrelated to the tool itself. At this point, it is not possible to determine what these factors may have been; yet further research with Russian teachers and translators may provide insights into the reasons for the low inter-rater reliability obtained for this group in the current study. In addition, the findings are in line with those of Colina (2008) with regard to the rating behavior of translators and teachers: although translators and teachers exhibit similar behavior, teachers tend to spend more time rating, and their scores are slightly higher than those of translators. While, in principle, it may appear that translators would be more efficient raters, one would have to consider the context of evaluation to select an ideal rater for a particular evaluation task. Because they spent more time rating (and, one assumes, reflecting on their rating), teachers may be more apt evaluators in a formative context, where feedback is expected from the rater. Teachers may also be better at reflecting on the nature of the developmental process and therefore better able to offer more adequate evaluation of a process and/or a translator (versus evaluation of a product). However, when rating involves a product and no feedback is expected (e.g. industry, translator licensing exams, etc.), a more efficient translator rater may be more suitable to the task. In sum, the current findings suggest that professional translators and language teachers could be similarly qualified to assess translation quality by means of the TQA tool. Which of the two types of professionals is more adequate for a specific rating task will probably depend on the purpose and goal of evaluation. Further research comparing the skills of these two groups in different evaluation contexts is necessary to confirm this view.
In summary, the results of empirical tests of the functional-componential tool continue to offer evidence for the proposed approach and to warrant additional testing and research. Future research needs to focus on testing on a larger scale, with more subjects and various text types.

Notes

* The research described here was funded by the Robert Wood Johnson Foundation. It was part of Phase II of the Translation Quality Assessment project of the Hablamos Juntos National Program. I would like to express my gratitude to the Foundation, to the Hablamos Juntos National Program, and to the Program Director, Yolanda Partida, for their support of translation in the USA. I owe much gratitude to Yolanda Partida and Felicia Batts for comments, suggestions and revisions in the write-up of the draft documents on which this paper draws. More details and information on the Translation Quality Assessment project, including Technical Reports, Manuals and Toolkit Series, are available on the Hablamos Juntos website (www.hablamosjuntos.org). I would also like to thank Volker Hegelheimer for his assistance with the statistics.

1. The legal basis for most language access legislation in the United States of America lies in Title VI of the 1964 Civil Rights Act. At least 43 states have one or more laws addressing language access in health care settings.

2. www.sae.org; www.lisa.org/products/qamodel.

3. One exception is that of multilingual text generation, in which an original is written to be translated into multiple languages.

4. Note the reference to reader response within a functionalist framework.

5. Due to rater availability, 4 raters (1 Spanish, 2 Chinese, 1 Russian) were selected that had not participated in the training and rating sessions of the previous experiment. Given the low number, researchers did not investigate the effect of previous experience (experienced vs. inexperienced raters).

References

Bell, Roger T. 1991. Translation and Translating.
London: Longman.
Bowker, Lynne. 2001. Towards a Methodology for a Corpus-Based Approach to Translation Evaluation. Meta 46:2. 345–364.
Cao, Deborah. 1996. A Model of Translation Proficiency. Target 8:2. 325–340.
Carroll, John B. 1966. An Experiment in Evaluating the Quality of Translations. Mechanical Translation 9:3–4. 55–66.
Colina, Sonia. 2003. Teaching Translation: From Research to the Classroom. New York: McGraw Hill.
Colina, Sonia. 2008. Translation Quality Evaluation: Empirical Evidence for a Functionalist Approach. The Translator 14:1. 97–134.
Gerzymisch-Arbogast, Heidrun. 2001. Equivalence Parameters and Evaluation. Meta 46:2. 227–242.
Hatim, Basil and Ian Mason. 1997. The Translator as Communicator. London and New York: Routledge.
Hönig, Hans. 1997. Positions, Power and Practice: Functionalist Approaches and Translation Quality Assessment. Current Issues in Language and Society 4:1. 6–34.
House, Juliane. 1997. Translation Quality Assessment: A Model Revisited. Tübingen: Narr.
House, Juliane. 2001. Translation Quality Assessment: Linguistic Description versus Social Evaluation. Meta 46:2. 243–257.
Lauscher, S. 2000. Translation Quality Assessment: Where Can Theory and Practice Meet?. The Translator 6:2. 149–168.
Neubert, Albrecht. 1985. Text und Translation. Leipzig: Enzyklopädie.
Nida, Eugene. 1964. Toward a Science of Translating. Leiden: Brill.
Nida, Eugene and Charles Taber. 1969. The Theory and Practice of Translation. Leiden: Brill.
Nord, Christiane. 1997. Translating as a Purposeful Activity: Functionalist Approaches Explained. Manchester: St. Jerome.
PACTE. 2008. First Results of a Translation Competence Experiment: Knowledge of Translation and Efficacy of the Translation Process. John Kearns, ed. Translator and Interpreter Training: Issues, Methods and Debates. London and New York: Continuum. 104–126.
Reiss, Katharina. 1971. Möglichkeiten und Grenzen der Übersetzungskritik.
München: Hueber.
Reiss, Katharina and Hans Vermeer. 1984. Grundlegung einer allgemeinen Translationstheorie. Tübingen: Niemeyer.
Van den Broeck, Raymond. 1985. Second Thoughts on Translation Criticism: A Model of its Analytic Function. Theo Hermans, ed. The Manipulation of Literature: Studies in Literary Translation. London and Sydney: Croom Helm. 54–62.
Williams, Malcolm. 2001. The Application of Argumentation Theory to Translation Quality Assessment. Meta 46:2. 326–344.
Williams, Malcolm. 2004. Translation Quality Assessment: An Argumentation-Centered Approach. Ottawa: University of Ottawa Press.

Résumé

Colina (2008) proposes a componential-functionalist approach to the evaluation of translation quality and reports on the results of a pilot test of a tool designed according to that approach. The results show a high degree of inter-rater reliability and justify further testing. This article presents an experiment designed to test the approach as well as the tool. Data were collected during two rounds of testing. A group of 30 raters, consisting of Spanish, Chinese and Russian translators and teachers, evaluated 4 or 5 translated texts. The results show that the tool provides good inter-rater reliability for all language groups and texts, with the exception of Russian; they also suggest that the low reliability of the Russian raters' scores is unrelated to the tool itself. These findings confirm those of Colina (2008).

Keywords: quality, testing, evaluation, rating, componential, functionalism, errors

Appendix 1: Tool

Benchmark Rating Session
Time Rating Starts:                Time Rating Ends:
Translation Quality Assessment Cover Sheet For Health Education Materials
PART I: To be completed by Requester Requester is the Health Care Decision Maker (HCDM) requesting a quality assessment of an existing translated text.
Requester: Title/Department: Delivery Date:
TRANSLATION BRIEF
Source Language Target Language Spanish, Russian, Chinese
Text Type: Text Title: Target Audience: Purpose of Document:
PRIORITY OF QUALITY CRITERIA (rank EACH from 1 to 4, 1 being top priority)
____ Target Language
____ Functional and Textual Adequacy
____ Non-Specialized Content (Meaning)
____ Specialized Content and Terminology
PART II: To be completed by TQA Rater
Rater (Name): Date Completed: Contact Information Date Received: Total Score: Total Rating Time:
ASSESSMENT SUMMARY AND RECOMMENDATION
(To be completed after evaluating the translated text.)
- Publish and/or use as is
- Minor edits needed before publishing*
- Major revision needed before publishing*
- Redo translation
- Translation will not be an effective communication strategy for this text. Explore other options (e.g. create new target language materials)

Notes/Recommended Edits
RATING INSTRUCTIONS:
1. Carefully read the instructions for the review of the translated text. Your decisions and evaluation should be based on these instructions only.
2. Check the description that best fits the text given in each one of the categories.
3. It is recommended that you read the target text without looking at the English and score the Target Language and Functional categories.
4. Examples or comments are not required, but they can be useful to help support your decisions or to provide rationale for your descriptor selection.
1. TARGET LANGUAGE

1.a The translation reveals serious language proficiency issues: ungrammatical use of the target language, spelling mistakes. The translation is written in some sort of third language (neither the source nor the target). The structure of the source language dominates to the extent that the text cannot be considered a sample of target language text. The amount of transfer from the source cannot be justified by the purpose of the translation. The text is extremely difficult to read, bordering on being incomprehensible.
1.b The text contains some unnecessary transfer of elements/structure from the source text. The structure of the source language shows up in the translation and affects its readability. The text is hard to comprehend.
1.c Although the target text is generally readable, there are problems and awkward expressions resulting, in most cases, from unnecessary transfer from the source text.
1.d The translated text reads similarly to texts originally written in the target language that respond to the same purpose, audience and text type as those specified for the translation in the brief. Problems/awkward expressions are minimal if existent at all.
Examples/Comments
2. FUNCTIONAL AND TEXTUAL ADEQUACY

2.a Disregard for the goals, purpose, function and audience of the text. The text was translated without considering textual units, textual purpose, genre, or the needs of the audience (cultural, linguistic, etc.). Cannot be repaired with revisions.
2.b The translated text gives some consideration to the intended purpose and audience for the translation, but misses some important aspect/s of it (e.g. level of formality, some aspect of its function, needs of the audience, cultural considerations, etc.). Repair requires effort.
2.c The translated text approximates the goals, purpose (function) and needs of the intended audience, but it is not as efficient as it could be, given the restrictions and instructions for the translation. Can be repaired with suggested edits.
2.d The translated text accurately accomplishes the goals, purpose (function: informative, expressive, persuasive) set for the translation and intended audience (including level of formality). It also attends to cultural needs and characteristics of the audience. Minor or no edits needed.
Examples/Comments
3. NON-SPECIALIZED CONTENT (MEANING)

3.a The translation reflects or contains important unwarranted deviations from the original. It contains inaccurate renditions and/or important omissions and additions that cannot be justified by the instructions. Very defective comprehension of the original text.
3.b There have been some changes in meaning, omissions and/or additions that cannot be justified by the translation instructions. The translation shows some misunderstanding of the original and/or the translation instructions.
3.c Minor alterations in meaning, additions or omissions.
3.d The translation accurately reflects the content contained in the original, insofar as it is required by the instructions without unwarranted alterations, omissions or additions. Slight nuances and shades of meaning have been rendered adequately.
Examples/Comments
4. SPECIALIZED CONTENT AND TERMINOLOGY

4.a Reveals unawareness/ignorance of special terminology and/or insufficient knowledge of specialized content.
4.c A few terminological errors, but the specialized content is not seriously affected.
4.d Accurate and appropriate rendition of the terminology. It reflects a good command of terms and content specific to the subject.
Examples/Comments
TOTAL SCORE
SCORING WORKSHEET

Component: Target Language
  1.a = 5    1.b = 15   1.c = 25   1.d = 30
Component: Functional and Textual Adequacy
  2.a = 5    2.b = 10   2.c = 20   2.d = 25
Component: Non-Specialized Content
  3.a = 5    3.b = 10   3.c = 20   3.d = 25
Component: Specialized Content and Terminology
  4.a = 5    4.b = 10   4.c = 15   4.d = 20

Tally Sheet

Component                               Category Rating   Score Value
Target Language
Functional and Textual Adequacy
Non-Specialized Content
Specialized Content and Terminology
Total Score
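Mechanically, a rater's total score is the sum of the point values of the four checked categories, with a maximum of 30 + 25 + 25 + 20 = 100. A minimal sketch of the tally, with the value table copied from the Scoring Worksheet above (the selections shown are a hypothetical rater's choices):

```python
# Point values per category, as in the Scoring Worksheet
VALUES = {
    "1.a": 5, "1.b": 15, "1.c": 25, "1.d": 30,   # Target Language
    "2.a": 5, "2.b": 10, "2.c": 20, "2.d": 25,   # Functional and Textual Adequacy
    "3.a": 5, "3.b": 10, "3.c": 20, "3.d": 25,   # Non-Specialized Content
    "4.a": 5, "4.b": 10, "4.c": 15, "4.d": 20,   # Specialized Content and Terminology
}

def total_score(selections):
    """Sum the values of one checked category per component."""
    return sum(VALUES[category] for category in selections)

# A rater checking the top descriptor in every component:
total_score(["1.d", "2.d", "3.d", "4.d"])  # 100, the maximum
```

Note that the components are weighted unequally: Target Language carries up to 30 points, while Specialized Content and Terminology carries at most 20, which reflects the priorities set in the translation brief.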
Appendix 2: Text sample

Blood Vessel Disease
Blood vessel disease is also called peripheral vascular disease or artery disease. It is the narrowing of the blood vessels in the abdomen, legs and arms. When the blood vessels narrow, less oxygen-rich blood gets to your body parts. This can cause tissue and cell death or gangrene. Blood vessel disease is the leading cause of amputations. Blood vessel disease is caused by a build-up of fatty deposits called plaque. Some of the blood vessels or blood clots can cause other problems.
Signs of Blood Vessel Disease in the Abdomen, Legs and Arms
- Muscle pain, aches or cramps
- Cool, pale skin, cold hands and feet
- Reddish-blue color of the skin and nails of the hands and feet
- A sore that takes a long time to heal or, when scabbed over, looks black
- Loss of hair on legs, feet or toes
- Faint or no pulse in the legs or feet
Risk Factors
You are at higher risk for blood vessel disease if you:
- Smoke
- Have diabetes
- Are over the age of 45
- Have high cholesterol
- Have high blood pressure
- Have a family member with heart or blood vessel disease
- Are overweight
- Are inactive
Your Care
Blood vessel disease may be prevented or slowed down with healthy choices.
- Have your blood pressure checked.
- See your doctor each year.
- Do not smoke or use tobacco.
- Exercise each day.
- Eat a diet low in fat and high in fiber.
- Manage your stress.
Your care may also include medicine and surgery.
Talk to your doctor about your treatment options.
6/2005. Developed through a partnership of The Ohio State University Medical Center, Mount Carmel Health and OhioHealth, Columbus, Ohio. Available for use as a public service without copyright restrictions at www.healthinfotranslations.com.

Author's address

Sonia Colina
Department of Spanish and Portuguese
The University of Arizona
Modern Languages 545
Tucson, AZ 85721-0067
United States of America
scolina@email.arizona.edu