Further evidence for a functionalist approach to translation quality evaluation

Sonia Colina
The University of Arizona

Target 21:2 (2009), 235–264. doi 10.1075/target.21.2.02col
issn 0924-1884 / e-issn 1569-9986 © John Benjamins Publishing Company

Colina (2008) proposes a componential-functionalist approach to translation quality evaluation and reports on the results of a pilot test of a tool designed according to that approach. The results show good inter-rater reliability and justify further testing. The current article presents an experiment designed to test the approach and tool. Data was collected during two rounds of testing. A total of 30 raters, consisting of Spanish, Chinese and Russian translators and teachers, were asked to rate 4–5 translated texts (depending on the language). Results show that the tool exhibits good inter-rater reliability for all language groups and texts except Russian, and suggest that the low reliability of the Russian raters' scores is unrelated to the tool itself. The findings are in line with those of Colina (2008).

Keywords: quality, assessment, evaluation, rating, componential, functionalism, errors

0. Introduction

Recent US federal mandates (e.g. White House Executive Order #13166),1 requiring health care providers who are recipients of federal funds to provide language translation and interpretation for patients with limited English proficiency (LEP), have brought the long-standing issue of translation quality to a wider audience of health care professionals (e.g. managers, decision makers, industry stakeholders, private foundations), who generally feel unprepared to address the topic. A striking example of how challenging quality evaluation can be for health care organizations is illustrated by the experience of Hablamos Juntos, an initiative funded by the Robert Wood Johnson Foundation to develop practical solutions to language barriers to health care. Several healthcare providers (including hospitals) working with the program identified what they believed were the best translations available. Eighty-seven documents, rated as highly satisfactory and recommended for replication, were collected from the providers. Examination of these health education texts by doctorate-level Spanish language specialists resulted in quality being identified as a problem. Many of these texts were cumbersome to read, to the point that readers required the English originals to decipher the intended meanings of some translations. It became clear that these texts were potentially hampering health care quality and outcomes by not providing needed access to intended health care information for patients with limited English proficiency. Furthermore, health care administrators overseeing the translation processes that produced these texts had not identified quality as a problem and needed assistance assessing the quality of non-English written materials. It was this context that prompted the launch of the Translation Quality Assessment (TQA) project, funded as one of various HJ initiatives to improve communication between health providers and patients with limited English proficiency. The TQA project aims to design and test a research-based prototype tool that could be used by health care organizations to assess the quality of translated materials, one able to identify a wide range of quality. Colina (2008) describes the initial version of the tool and the first phase of testing. The results of a pilot experiment, also reported in Colina (2008), reveal good inter-rater reliability and provide justification for further testing. The current article presents a second experiment designed to test the approach and tool.

1. Translation quality revisited

Translation quality evaluation is probably one of the most controversial, intensely debated topics in translation scholarship and practice. Yet, progress in this area does not seem to correlate with the intensity of the debate.
One may wonder whether the situation is perhaps partly related to the diverse nature of the definitions of translation. In a field such as translation studies, filled with unstated, often culturally-dependent assumptions about the role of translation and translators, equivalence and literalness, translation norms and translation standards, it is not surprising that quality and evaluation have remained elusive to definition or standards. Current reviews of the literature offer support for this hypothesis (Colina 2008, House 2001, Lauscher 2000), as they reveal a multiplicity of views and priorities in the area of translation quality. In one recent overview, Colina (2008) classifies the various approaches into two major groups according to whether their orientation is experiential or theoretical; parts of that overview are reproduced here for ease of reference (see further Colina 2008).

1.1 Experiential approaches

Many methods of translation quality assessment fall within this category. They tend to be ad hoc, anecdotal marking scales developed for the use of a particular professional organization or industry, e.g., the ATA certification exam, the SAE J2450 Translation Quality Metric for the automotive industry, the LISA QA tool for localization.2 While the scales are often adequate for the particular purposes of the organization that created them, they suffer from limited transferability, precisely due to the absence of theoretical and/or research foundations that would permit their transfer to other environments. For the same reason, it is difficult to assess the replicability and inter-rater reliability of these approaches.

1.2 Theoretical approaches

Recent theoretical, research-based approaches tend to focus on the user of a translation and/or the text. They have also been classified as equivalence-based or functionalist (Lauscher 2000).
These approaches arise out of a theoretical framework or stated assumptions about the nature of translation; however, they tend to cover only partial aspects of quality, and they are often difficult to apply in professional or teaching contexts.

1.2.1 Reader-response approaches

Reader-response approaches evaluate the quality of a translation by assessing whether readers of the translation respond to it as readers of the source would respond to the original (Nida 1964, Carroll 1966, Nida and Taber 1969). The reader-response approach must be credited with recognizing the role of the audience in translation, more specifically, of translation effects on the reader as a measure of translation quality. This is particularly noteworthy in an era when the dominant notion of text was that of a static object on a page.

Yet, the reader-response method is also problematic because, in addition to the difficulties inherent in the process of measuring reader response, the response of a reader may not be equally important for all texts, especially for those that are not reader-oriented (e.g., legal texts). The implication is that reader response will not be equally informative for all types of translation. In addition, this method addresses only one aspect of a translated text (i.e., equivalence of effect on the reader), ignoring others, such as the purpose of the translation, which may justify or even require a slightly different response from the readers of the translation. One also wonders if it is in fact possible to determine whether two responses are equivalent, as even monolingual texts can trigger non-equivalent reactions from slightly different groups of readers. Since, in most cases, the readership of a translated text is different than that envisioned by the writer of the original,3 one can imagine the difficulties entailed by equating quality with equivalence of response.
Finally, as with many other theoretical approaches, reader-response testing is time-consuming and difficult to apply to actual translations. At a minimum, careful selection of readers is necessary to make sure that they belong to the intended audience for the translation.

1.2.2 Textual and pragmatic approaches

Textual and pragmatic approaches have made a significant contribution to the field of translation evaluation by shifting the focus from counting errors at the word or sentence level to evaluating texts and translation goals, giving the reader and communication a much more prominent role. Yet, despite these advances, none of these approaches can be said to have been widely adopted by either professionals or scholars.

Some models have been criticized because they focus too much on the source text (Reiss 1971) or on the target text (Skopos) (Reiss and Vermeer 1984, Nord 1997); Reiss argues that the text type and function of the source text is the most important factor in translation and that quality should be assessed with respect to it. For Skopos Theory, it is the text type and function of the translation that is of paramount importance in determining the quality of the translation.

House's (1997, 2001) functional pragmatic model relies on an analysis of the linguistic-situational features of the source and target texts, a comparison of the two texts, and the resulting assessment of their match. The basic measure of quality is that the textual profile and function of the translation match those of the original, the goal being functional equivalence between the original and the translation. One objection that has been raised against House's functional model is its dependence on the notion of equivalence, often a vague and controversial term in translation studies (Hönig 1997).
This is a problem because translations are sometimes commissioned for a somewhat different function than that of the original; in addition, a different audience and time may require a slightly different function than that of the source text (see Hönig 1997 for more on the problematic notion of equivalence). These scenarios are not contemplated by equivalence-based theories of translation. Furthermore, one can argue that what qualifies as equivalent is as variegated as the notion of quality itself. Other equivalence-based models of evaluation are Gerzymisch-Arbogast (2001), Neubert (1985), and Van den Broeck (1985). In sum, the reliance on an a priori notion of equivalence is problematic and limiting in descriptive as well as explanatory value.

An additional objection against textual and pragmatic approaches is that they are not precise about how evaluation is to proceed after the analysis of the source or the target text is complete, or after the function of the translation has been established as the guiding criterion for making translation decisions. This obviously affects the ease with which the models can be applied to texts in professional settings. Hönig, for instance, after presenting some strong arguments for a functionalist approach to evaluation, does not offer any concrete instantiation of the model, other than in the form of some general advice for translator trainers. He comes to the conclusion that "the speculative element will remain at least as long as there are no hard and fast empirical data which serve to prove what a typical reader's responses are like" (1997: 32).4 The same criticism regarding the difficulty involved in applying textual and theoretical models to professional contexts is raised by Lauscher (2000).
She explores possible ways to bridge the gap between theoretical and practical quality assessment, concluding that translation criticism could move closer to practical needs "by developing a comprehensive translation tool" (2000: 164).

Other textual approaches to quality evaluation are the argumentation-centered approach of Williams (2001, 2004), in which evaluation is based on argumentation and rhetorical structure, and corpus-based approaches (Bowker 2001). The argumentation-centered approach is also equivalence-based, as a translation must reproduce the argument structure of the ST to meet minimum criteria of adequacy (Williams 2001: 336). Bowker's corpus-based model uses a comparatively large and carefully selected collection of naturally occurring texts, stored in machine-readable form, as a benchmark against which to compare and evaluate specialized student translations. Although Bowker (2001) presents a novel, valuable proposal for the evaluation of students' translations, it does not provide specific indications as to how translations should be graded (2001: 346). In sum, argumentation and corpus-based approaches, although presenting crucial aspects of translation evaluation, are also complex and difficult to apply in professional environments (and, one could argue, in the classroom as well).

1.3 The functional-componential approach (Colina 2008)

Colina (2008) argues that current translation quality assessment methods have not achieved a middle ground between theory and applicability; while anecdotal approaches lack a theoretical framework, the theoretical models often do not contain testable hypotheses (i.e., they are non-verifiable) and/or are not developed with a view towards application in professional and/or teaching environments.
In addition, she contends that theoretical models usually focus on partial aspects of translation (e.g., reader response, textual aspects, pragmatic aspects, relationship to the source, etc.): perhaps due to practical limitations and the sheer complexity of the task, some of these approaches overlook the fact that quality in translation is a multifaceted reality, and that a general, comprehensive approach to evaluation may need to address multiple components of quality simultaneously.

As a response to the inadequacies identified above, Colina (2008) proposes an approach to translation quality evaluation based on a theoretical approach (functionalist and textual models of translation) that can be applied in professional and educational contexts. In order to show the applicability of the model in practical settings, as well as to develop testable hypotheses and research questions, Colina and her collaborators designed a componential, functionalist, textual tool (henceforth the TQA tool) and pilot-tested it for inter-rater reliability (cf. Colina 2008 for more on the first version of this tool). The tool evaluates components of quality separately, consequently reflecting a componential approach to quality; it is also considered functionalist and textual, given that evaluation is carried out relative to the function and the characteristics of the audience specified for the translated text.

As mentioned above, it seems reasonable to hypothesize that disagreements over the definition of translation quality are rooted in the multiplicity of views of translation itself and in different priorities regarding quality components: it is often the case that a requester's view of quality will not coincide with that of the evaluator; yet, without explicit criteria on which to base the evaluation, the evaluator can only rely on his/her own views.
In an attempt to introduce flexibility with regard to different conditions influencing quality, the proposed TQA tool allows for a user-defined notion of quality in which it is the user or requester who decides which aspects of quality are more important for his/her communicative purposes. This can be done either by adjusting customer-defined weights for each component or simply by assigning higher priorities to some components. Custom weighting of components is also important because the effect of a particular component on the whole text may vary depending on textual type and function. An additional feature of the TQA tool is that it does not rely on a point deduction system; rather, it tries to match the text under evaluation with one of several descriptors provided for each category/component of evaluation. In order to capture the descriptive, customer-defined notion of quality, the original tool was modified in the second experiment to include a cover sheet (see Appendix 1).

The experiment in Colina (2008) sets out to test the functional approach to evaluation by testing the tool's inter-rater reliability. 37 raters and 3 consultants were asked to use the tool to rate three translated texts. The texts selected for evaluation consisted of reader-oriented health education materials. Raters were bilinguals, professional translators, and language teachers. Some basic training was provided. Data was collected by means of the tool and a post-rating survey. Some differences in ratings could be ascribed to rater qualifications: teachers' and translators' ratings were more alike than those of bilinguals; bilinguals were found to rate higher and faster than the other groups. Teachers also tended to assign higher ratings than translators. It was shown that different types of raters were able to use the tool without significant training.
Pilot testing results indicate good inter-rater reliability for the tool and the need for further testing. The current paper focuses on a second experiment designed to further test the approach and tool proposed in Colina (2008).

2. Second phase of TQA testing: Methods and results

2.1 Methods

One of the most important limitations of the experiment in Colina (2008) concerns the numbers and groups of participants. Given the project objective of ensuring applicability across languages frequently used in the USA, subject recruitment was done in three languages: Spanish, Russian, and Chinese. As a result, resources and time for recruitment had to be shared amongst the languages, with smaller numbers of subjects per language group. The testing described in the current experiment includes more subjects and additional texts. More specifically, the study reported in this paper aims:

I. To test the TQA tool again for inter-rater reliability (i.e. to what degree trained raters use the TQA tool consistently) by answering the following questions:
Question 1. For each text, how consistently do all raters rate the text?
Question 2. How consistently do raters in the first session (Benchmark) rate the texts?
Question 3. How consistently do raters in the second session (Reliability) rate the texts?
Question 4. How consistently do raters rate each component of the tool? Are there some test components where there is higher rater reliability?

II. To compare the rating skills/behavior of translators and teachers: Is there a difference in scoring between translators and teachers? (Question 5, Section 2.2).

Data was collected during two rounds of testing: the first, referred to as the Benchmark Testing, included 9 raters; the second session, the Reliability Testing, included 21 raters. Benchmark and Reliability sessions consisted of a short training session, followed by a rating session.
Raters were asked to rate 4–5 translated texts (depending on the language) and had one afternoon and one night to complete the task. After their evaluation worksheets had been submitted, raters were required to submit a survey on their experience using the tool. They were paid for their participation.

2.1.1 Raters

Raters were drawn from the pool used for the pre-pilot and pilot testing sessions reported in Colina (2008) (see Colina [2008] for selection criteria and additional details). A call was sent via email to all those raters selected for the pre-pilot and pilot testing (including those who were initially selected but did not take part). All raters available participated in this second phase of testing.

As in Colina (2008), it was hypothesized that similar rating results would be obtained within the members of the same group. Therefore, raters were recruited according to membership in one of two groups: professional translators, and language teachers (language professionals who are not professional translators). Membership was assigned according to the same criteria as in Colina (2008). All selected raters exhibited linguistic proficiency equivalent to that of a native (or near-native) speaker in the source and in one of the target languages.

Professional translators were defined as language professionals whose income comes primarily from providing translation services. Significant professional experience (5 years minimum; most had 12–20 years of experience), membership in professional organizations, and education in translation and/or a relevant field were also needed for inclusion in this group. Recruitment for these types of individuals was primarily through the American Translators Association (ATA). Although only two applicants were ATA certified, almost all were ATA affiliates (members).

Language teachers were individuals whose main occupation was teaching language courses at a university or other educational institution.
They may have had some translation experience, but did not rely on translation as their source of income. A web search of teaching institutions with known foreign language programs was used for this recruitment. We reached out to schools throughout the country at both the community college and university levels. The definition of teacher did not preclude graduate student instructors.

Potential raters were assigned to the above groups on the basis of the information provided in their resume or curriculum vitae and a language background questionnaire included in a rater application.

The bilingual group in Colina (2008) was eliminated from the second experiment, as subjects were only available for one of the languages (Spanish). Translation competence models and research suggest that bilingualism is only one component of translation competence (Bell 1991, Cao 1996, Hatim and Mason 1997, PACTE 2008). Nonetheless, since evaluating translation products is not the same as translating, it is reasonable to hypothesize that other language professionals, such as teachers, may have the competence necessary to evaluate translations; this may be particularly true in cases, such as the current project, in which the object of evaluation is not translator competence but translation products. This hypothesis would be borne out if the ratings provided by translators and teachers are similar.

As mentioned above, data was collected during two rounds of testing: the first one, the Benchmark Testing, included 9 raters (3 Russian, 3 Chinese, 3 Spanish); these raters were asked to evaluate 4–5 texts (per language) that had been previously selected as clearly of good or bad quality by expert consultants in each language.
The second session, the Reliability Testing, included 21 raters, distributed as follows:

Spanish: 5 teachers, 3 translators (8)
Chinese: 3 teachers, 4 translators (7)
Russian: 3 teachers, 3 translators (6)

Differences across groups reflect general features of that language group in the US. Among the translators, the Russians had degrees in Languages, History and Translating, Engineering, and Nursing from Russian and US universities and experience ranging from 12 to 22 years; the Chinese translators' experience ranged from 6 to 30 years, and their education included Chinese Language and Literature, Philosophy (MA), English (PhD), Neuroscience (PhD) and Medicine (MD), with degrees obtained in China and the US. Their Spanish counterparts' experience varied from 5 to 20 years, and their degrees included areas such as Education, Spanish and English Literature, Latin American Studies (MA), and Creative Writing (MA). The Spanish and Russian teachers were perhaps the most uniform groups, including college instructors (PhD students) with MAs in Spanish or Slavic Linguistics, Literature, and Communication, and one college professor of Russian. With one exception, they were all native speakers of Spanish or Russian with formal education in the country of origin. Chinese teachers were college instructors (PhD students) with MAs in Chinese, one college professor (PhD in Spanish), and an elementary school teacher and tutor (BA in Chinese). They were all native speakers of Chinese.

2.1.2 Texts

As mentioned above, experienced translators serving as language consultants selected the texts to be used in the rating sessions. Three consultants were instructed to identify health education texts translated from English into their language. Texts were to be publicly available on the Internet: half were to be very good and the other half were to be considered very poor on reading the text.
Those texts were used for the Benchmark session of testing, during which they were rated by the consultants and two additional expert translators. The texts where there was the most agreement in rating were selected for the Reliability Testing. The Reliability texts comprised five Spanish texts (three good and two bad), four Russian texts, and four Chinese texts (two per language of good quality and two of bad quality), making up a total of thirteen texts.

2.1.3 Tool

The tool tested in Colina (2008) was modified to include a cover sheet consisting of two parts. Part I is to be completed by the person requesting the evaluation (i.e. the Requester) and read by the rater before he/she starts his/her work. It contains the Translation Brief, relative to which the evaluation must always take place, and the Quality Criteria, clarifying requester priorities among components. The TQA Evaluation Tool included in Appendix 1 contains a sample Part I, as specified by Hablamos Juntos (the Requester), for the evaluation of a set of health education materials. The Quality Criteria section reflects the weights assigned to the four components in the Scoring Worksheet at the end of the tool. Part II of the Cover Sheet is to be filled in by the raters after the rating is complete. An Assessment Summary and Recommendation section was included to allow raters the opportunity to offer an action recommendation on the basis of their ratings, i.e. what should the requester do now with this translation? Edit it? Minor or small edits? Redo it entirely? An additional modification to the tool consisted of eliminating or adding descriptors so that each category would have an equal number of descriptors (four for each component) and revising the scores assigned so that the maximum number of points possible would be 100. Some minor stylistic changes were made in the language of the descriptors.
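The descriptor-based, weighted mechanism just described can be sketched in a few lines of code. This is an illustrative sketch only: the component weights and descriptor point values below are hypothetical placeholders, not the actual values of the TQA Scoring Worksheet; only the overall mechanism (no point deductions; the rater matches the text to one of four descriptors per component, and weighted component scores sum to a 100-point maximum) follows the description above.

```python
# Hypothetical requester-defined weights (maximum points per component),
# summing to the 100-point maximum mentioned in the text. The four
# components follow the tool: Target Language (TL), Functional and Textual
# Adequacy (FTA), Non-Specialized Content (MEAN), Specialized Content and
# Terminology (TERM).
WEIGHTS = {"TL": 30, "FTA": 25, "MEAN": 25, "TERM": 20}

# Four descriptor levels per component, expressed here as fractions of the
# component's maximum; the rater picks the one descriptor that matches the
# text (0 = best descriptor). The fractions are invented for illustration.
DESCRIPTOR_LEVELS = [1.0, 0.75, 0.5, 0.25]

def score_translation(chosen_levels):
    """chosen_levels maps component name -> chosen descriptor index (0 = best)."""
    total = 0.0
    for component, max_points in WEIGHTS.items():
        total += max_points * DESCRIPTOR_LEVELS[chosen_levels[component]]
    return total

# Example: a text matching the best descriptor for TL and TERM and the
# second-best descriptor for FTA and MEAN.
overall = score_translation({"TL": 0, "FTA": 1, "MEAN": 1, "TERM": 0})
print(overall)  # 30 + 18.75 + 18.75 + 20 = 87.5
```

A requester who cares more about terminology than style would simply raise the TERM weight and lower the others, which is the kind of customer-defined weighting the cover sheet is meant to record.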
2.1.4 Rater training

The Benchmark and Reliability sessions included training and rating sessions. The training provided was substantially the same as that offered in the pilot testing and described in Colina (2008): it focused on the features and use of the tool, and it consisted of PDF materials (delivered via email), a PowerPoint presentation based on the contents of the PDF materials, and a question-and-answer session delivered online via an Internet and phone conferencing system. Some revisions to the training reflect changes to the tool (including instructions on the new Cover Sheet), a few additional textual examples in Chinese, and a scored, completed sample worksheet for the Spanish group. Samples were not included for the other languages due to time and personnel constraints. The training served as a refresher for those raters who had already participated in the previous pilot training and rating (Colina 2008).5

2.2 Results

The results of the data collection were submitted to statistical analysis to determine to what degree trained raters use the TQA tool consistently. Table 1 and Figures 1a and 1b show the overall score of each text rated and the standard deviation between the overall score and the individual rater scores. 200-series texts are Spanish texts, 400s are Chinese, and 300s are Russian. The standard deviations range from 8.1 to 19.2 for Spanish, from 5.7 to 21.2 for Chinese, and from 16.1 to 29.0 for Russian.

Question 1. For each text, how consistently do all raters rate the text?

The standard deviations in Table 1 and Figures 1a and 1b offer a good measure of how consistently individual texts are rated. A large standard deviation suggests that there was less rater agreement (or that the raters differed more in their assessment). Figure 1b shows the average standard deviations per language.
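The per-text consistency measure reported in Table 1 can be reproduced with a short script. The rater scores below are invented for illustration (the article reports only the aggregated values, not the individual ratings), and the choice of the sample standard deviation formula is an assumption, since the study does not state which variant it used.

```python
import statistics

# Hypothetical overall scores from eleven raters for one text; the actual
# per-rater scores behind Table 1 are not published, so these are invented.
rater_scores = [95, 90, 88, 92, 85, 96, 91, 93, 89, 94, 97]

average = statistics.mean(rater_scores)
# Sample standard deviation of the rater scores around their mean; a larger
# value means less rater agreement on this text (cf. Question 1).
spread = statistics.stdev(rater_scores)

print(f"average score = {average:.1f}, SD = {spread:.1f}")
```

Applied per text and then averaged per language, this is the computation summarized in Table 1 and Figures 1a and 1b.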
According to Figure 1b, the Russian raters were the ones with the highest average standard deviation and the least consistent in their ratings. This is in agreement with the reliability coefficients shown below (Table 5), as the Russian raters have the lowest inter-rater reliability. Table 2 shows average scores, standard deviations, and average standard deviations for each component of the tool, per text and per language. Figure 2 represents average standard deviations per component and per language. There does not appear to be an obvious connection between standard deviations and components.

Table 1. Average score of each text and standard deviation

Text     # of raters   Average score   Standard deviation
Spanish
210      11            91.8            8.1
214      11            89.5            11.3
215      11            86.8            15.0
228      11            48.6            19.2
235      11            56.4            18.5
Avg.                                   14.42
Chinese
410      10            88.0            10.3
413      10            63.0            21.0
415      10            96.0            5.7
418      10            76.0            21.2
Avg.                                   14.55
Russian
312      9             59.4            16.1
314      9             82.8            15.6
315      9             75.6            22.1
316      9             67.8            29.0
Avg.                                   20.7

Figure 1a. Average score and standard deviation per text
Figure 1b. Average standard deviations per language

Although generally the components Target Language (TL) and Functional and Textual Adequacy (FTA) have higher standard deviations (i.e., ratings are less consistent), this is not always the case, as seen in the Chinese data (FTA). One would in fact expect the FTA category to exhibit the highest standard deviations, given its more holistic nature; yet, the data do not bear out this hypothesis, as the TL component also shows standard deviations that are higher than Non-Specialized Content (MEAN) and Specialized Content and Terminology (TERM).

Question 2.
How consistently do raters in the first session (Benchmark) rate the texts?

The inter-rater reliability for the Spanish and for the Chinese raters is remarkable; however, the inter-rater reliability for the Russian raters is too low (Table 3). This, in conjunction with the Reliability Testing results, leads us to believe in the presence of other, unknown factors, unrelated to the tool, that are responsible for the low reliability of the Russian raters.

Table 2. Average scores and standard deviations for four components, per text and per language

                    TL            FTA           MEAN          TERM
Text     Raters     Mean   SD     Mean   SD     Mean   SD     Mean   SD
Spanish
210      11         27.7   2.6    23.6   2.3    22.7   2.6    17.7   3.4
214      11         27.3   4.7    20.9   7.0    23.2   2.5    18.2   3.4
215      11         28.6   2.3    22.3   4.7    18.2   6.8    17.7   3.4
228      11         15.0   7.7    11.4   6.0    10.9   6.3    11.4   4.5
235      11         15.9   8.3    12.3   6.5    13.6   6.4    14.5   4.7
Avg. SD                    5.12          5.3           4.92          3.88
Chinese
410      10         27.0   4.8    22.0   4.8    21.0   4.6    18.0   2.6
413      10         18.0   9.5    16.5   5.8    14.0   5.2    14.5   3.7
415      10         28.5   2.4    25.0   0.0    23.5   2.4    19.0   2.1
418      10         22.5   6.8    21.0   4.6    16.0   7.7    16.5   4.1
Avg. SD                    5.875         3.8           4.975         3.125
Russian
312      9          18.3   7.1    15.0   6.1    13.3   6.6    12.8   4.4
314      9          25.6   6.3    21.7   5.0    19.4   3.9    16.1   4.2
315      9          23.3   9.4    18.3   7.9    17.8   4.4    16.1   4.2
316      9          20.0   10.3   16.7   7.9    17.2   7.1    13.9   6.5
Avg. SD                    8.275         6.725         5.5           4.825
Avg. SD (all lgs.)         6.3           5.3           5.1           3.9

Question 3. How consistently do raters in the second session (Reliability) rate the texts? How do the reliability coefficients compare for the Benchmark and the Reliability Testing?

The results of the reliability raters mirror those of the benchmark raters: the Spanish raters achieve a very good inter-rater reliability coefficient and the Chinese raters an acceptable inter-rater reliability coefficient, but the inter-rater reliability for the Russian raters is very low (Table 4).
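The article does not name the statistic behind its reliability coefficients. One common choice for this kind of design is Cronbach's alpha, computed with raters treated as "items" and texts as "subjects"; the sketch below illustrates that computation on invented scores. Both the choice of alpha and the numbers are assumptions for illustration, not the study's documented method or data.

```python
# Hedged sketch: Cronbach's alpha as one plausible inter-rater reliability
# coefficient, treating raters as "items" and texts as "subjects". The
# article does not state which coefficient it used, and the scores below
# are invented, so this only illustrates the kind of computation involved.

def cronbach_alpha(scores_by_rater):
    """scores_by_rater: one list per rater, each with one score per text."""
    k = len(scores_by_rater)          # number of raters ("items")
    n = len(scores_by_rater[0])       # number of texts ("subjects")

    def variance(xs):                 # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Variance of each rater's scores across texts.
    item_vars = [variance(r) for r in scores_by_rater]
    # Variance of the per-text totals (scores summed over raters).
    totals = [sum(r[t] for r in scores_by_rater) for t in range(n)]
    return (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))

# Three hypothetical raters scoring four texts: strong agreement on the
# good/bad split yields an alpha close to 1.
raters = [
    [90, 85, 50, 55],
    [92, 80, 45, 60],
    [88, 84, 52, 58],
]
print(round(cronbach_alpha(raters), 3))
```

Raters who disagree about which texts are good and which are bad would drive the coefficient down towards 0, which is the pattern the Russian groups' low coefficients suggest.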
Table 5 (see also Tables 3 and 4) shows that there was a slight drop in inter-rater reliability for the Chinese raters (from the benchmark rating to the reliability rating), but the Spanish raters achieved remarkable inter-rater reliability at both rating sessions. The slight drop among the Russian raters from the first to the second session is negligible; in any case, the inter-rater reliability is too low.

[Figure 2. Average standard deviations per tool component and per language]

Table 3. Reliability coefficients for benchmark ratings

          Reliability coefficient
Spanish   .953
Chinese   .973
Russian   .128

Question 4. How consistently do raters rate each component of the tool? Are there some test components where there is higher rater reliability?

The coefficients for the Spanish raters show very good reliability, with excellent coefficients for the first three components; the numbers for the Chinese raters are also very good, but the coefficients for the Russian raters are once again low, although some consistency is identified for the FTA and MEAN components (Table 6).

Table 6. Reliability coefficients for the four components of the tool (all raters per language group)

          TL      FTA     MEAN    TERM
Spanish   .952    .929    .926    .848
Chinese   .844    .844    .864    .783
Russian   .367    .479    .492    .292

In sum, very good reliability was obtained for the Spanish and Chinese raters, for the two testing sessions (Benchmark and Reliability Testing) as well as for all components of the tool. Reliability scores for the Russian raters are low. These results are in agreement with the standard deviation data presented in Tables 1–2, Figures 1a and 1b, and Figure 2. All of this leads us to believe that whatever the cause for the Russian coefficients, it was not related to the tool itself.

Question 5. Is there a difference in scoring between translators and teachers?

Table 7a and Table 7b show the scoring in terms of average scores and standard deviations for the translators and the teachers for all texts. Figures 3 and 4 show the mean scores and times for Spanish raters, comparing teachers and translators.

Table 4. Reliability coefficients for Reliability Testing

          Reliability coefficient
Spanish   .934
Chinese   .780
Russian   .118

Table 5. Inter-rater reliability: Benchmark and Reliability Testing

          Benchmark reliability coefficient   Reliability coefficient (Reliability Testing)
Spanish   .953                                .934
Chinese   .973                                .780
Russian   .128                                .118

Table 7a. Average scores and standard deviations for consultants and translators

              Score            Time
Text      Mean     SD      Mean     SD
210       93.3     7.5     75.8     59.4
214       93.3    12.1     94.2    101.4
215       85.0    17.9     36.3     18.3
228       46.7    20.7     37.5     22.3
235       46.7    18.6     49.5     38.9
410       91.4     7.5     46.0     22.1
413       62.9    21.0     40.7     13.7
415       96.4     4.8     26.1     15.4
418       69.3    22.1     52.4     22.2
312       52.5    15.1     26.7      2.6
314       88.3    10.3     22.5      4.2
315       74.2    26.3     28.7      7.8
316       63.3    32.7     25.8      6.6

Table 7b. Average scores and standard deviations for teachers

              Score            Time
Text      Mean     SD      Mean     SD
210       90.0     9.4     63.6     39.7
214       85.0     9.4     67.0     41.8
215       89.0    12.4     36.0     30.5
228       51.0    19.5     38.0     31.7
235       68.0    10.4     57.6     40.2
410       80.0    13.2     61.0     27.7
413       63.3    25.7     71.0     24.6
415       95.0     8.7     41.0     11.5
418       91.7     5.8     44.0      6.6
312       73.3     5.8     55.0     56.7
314       71.7    20.8     47.7     62.7
315       78.3    14.4     37.7     45.5
316       76.7    22.5     46.7     63.5

The corresponding data for Chinese appear in Figures 5 and 6, and for Russian in Figures 7 and 8. Spanish teachers tend to rate somewhat higher (3 out of 5 texts) and spend more time rating than translators (all texts). As with the Spanish raters, it is interesting to note that Chinese teachers rate either higher than or similarly to translators (Figure 5): only one text obtained lower ratings from teachers than from translators.
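The article does not name the exact inter-rater reliability statistic behind the coefficients in Tables 3-6. One standard choice for this kind of consistency question is Cronbach's alpha, treating raters as "items" and texts as cases; the sketch below is written under that assumption, with illustrative data (function name and scores are not from the study):

```python
from statistics import pvariance

def cronbach_alpha(scores_by_rater):
    """Cronbach's alpha with raters as items.

    scores_by_rater: one equal-length list of text scores per rater.
    """
    k = len(scores_by_rater)
    # Total score each text receives across all raters
    totals = [sum(text_scores) for text_scores in zip(*scores_by_rater)]
    # Sum of the individual raters' score variances
    rater_variance = sum(pvariance(r) for r in scores_by_rater)
    return k / (k - 1) * (1 - rater_variance / pvariance(totals))

# Two raters who score the texts identically produce alpha = 1.0;
# near-agreement still yields a high coefficient.
cronbach_alpha([[90, 50, 70], [85, 55, 75]])  # ≈ 0.973
```

On this reading, a coefficient near 1 (Spanish, .953) means the raters rank and space the texts almost identically, while a coefficient near 0 (Russian, .128) means their scores barely covary.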
Timing results also mirror those found for the Spanish subjects: teachers take longer to rate than translators (Figure 6). Despite the low inter-rater reliability among Russian raters, the same trend was found when comparing Russian translators and teachers with the Chinese and the Spanish: Russian teachers rate similarly to or slightly higher than translators, and they clearly spend more time on the rating task than the translators (Figure 7 and Figure 8). This also mirrors the findings of the pre-pilot and pilot testing (Colina 2008).

[Figure 3. Mean scores for Spanish raters]
[Figure 4. Time for Spanish raters]
[Figure 5. Mean scores for Chinese raters]
[Figure 6. Time for Chinese raters]
[Figure 7. Mean scores for Russian raters]
[Figure 8. Time for Russian raters]

In order to investigate the irregular behavior of the Russian raters and to try to obtain an explanation for the low inter-rater reliability, the correlation between the total score and the recommendation (the field rec) issued by each rater was considered. This is explored in Table 8. One would expect a relatively high (negative) correlation because of the inverse relationship between a high score and a low recommendation. As illustrated in the three sub-tables below, all Spanish raters, with the exception of SP02PB, show a strong correlation between the recommendation and the total score, ranging from 0.854 (SP01VS) to 0.981 (SP02MC). The results are similar for the Chinese raters, whereby all raters correlate very highly between the recommendation and the total score, ranging from 0.867 (CH01BJ) to a perfect 1.00 (CH02JG). The results are different for the Russian raters, however. It appears that three raters (RS01EM, RS02MK, and RS01NM) do not correlate highly between their recommendations and their total scores. A closer look especially at these raters is warranted, as is a closer look at RS02LB, who was excluded from the correlation analysis due to a lack of variability (the rater uniformly recommended a 2 for all texts, regardless of the total score he or she assigned). The other Russian raters exhibited strong correlations. This result suggests some unusual behavior in the Russian raters, independent of the tool design and tool features, as the scores and overall recommendation do not correlate highly, as expected.

Table 8 (3 sub-tables). Correlation between recommendation and total score

8.1 Spanish raters:
SP04AR 0.923   SP01JC 0.958   SP01VS 0.854   SP02JA 0.938   SP02LA 0.966   SP02PB 0.421
SP02AB 0.942   SP01PC 0.975   SP01CC 0.913   SP02MC 0.981   SP01PS 0.938

8.2 Chinese raters:
CH01RL 0.935   CH04YY 0.980   CH01AX 0.996   CH02AC 0.894   CH02JG 1.000
CH01KG 0.955   CH02AH 0.980   CH01BJ 0.867   CH01CK 0.943   CH01FL 0.926

8.3 Russian raters:
RS01EG 0.998   RS01EM 0.115   RS04GN 0.933   RS02NB 1.000   RS02LB n/a
RS02MK 0.500   RS01SM 0.982   RS01NM 0.500   RS01RW 0.993

3. Conclusions

As in Colina (2008), testing showed that the TQA tool exhibits good inter-rater reliability for all language groups and texts, with the exception of Russian.
It was also shown that the low reliability of the Russian raters' scores is probably due to factors unrelated to the tool itself. At this point, it is not possible to determine what these factors may have been; yet further research with Russian teachers and translators may provide insights into the reasons for the low inter-rater reliability obtained for this group in the current study. In addition, the findings are in line with those of Colina (2008) with regard to the rating behavior of translators and teachers: although translators and teachers exhibit similar behavior, teachers tend to spend more time rating, and their scores are slightly higher than those of translators. While, in principle, it may appear that translators would be more efficient raters, one would have to consider the context of evaluation to select an ideal rater for a particular evaluation task. Because they spent more time rating (and, one assumes, reflecting on their rating), teachers may be more apt evaluators in a formative context, where feedback is expected from the rater. Teachers may also be better at reflecting on the nature of the developmental process and therefore better able to offer more adequate evaluation of a process and/or a translator (versus evaluation of a product). However, when rating involves a product and no feedback is expected (e.g. industry, translator licensing exams, etc.), a more efficient translator rater may be more suitable to the task. In sum, the current findings suggest that professional translators and language teachers could be similarly qualified to assess translation quality by means of the TQA tool. Which of the two types of professionals is more adequate for a specific rating task will probably depend on the purpose and goal of evaluation. Further research comparing the skills of these two groups in different evaluation contexts is necessary to confirm this view.
In summary, the results of empirical tests of the functional-componential tool continue to offer evidence for the proposed approach and to warrant additional testing and research. Future research needs to focus on testing on a larger scale, with more subjects and various text types.

Notes

* The research described here was funded by the Robert Wood Johnson Foundation. It was part of Phase II of the Translation Quality Assessment project of the Hablamos Juntos National Program. I would like to express my gratitude to the Foundation, to the Hablamos Juntos National Program, and to the Program Director, Yolanda Partida, for their support of translation in the USA. I owe much gratitude to Yolanda Partida and Felicia Batts for comments, suggestions and revisions in the write-up of the draft documents on which this paper draws. More details and information on the Translation Quality Assessment project, including Technical Reports, Manuals and Toolkit Series, are available on the Hablamos Juntos website (www.hablamosjuntos.org). I would also like to thank Volker Hegelheimer for his assistance with the statistics.

1. The legal basis for most language access legislation in the United States of America lies in Title VI of the 1964 Civil Rights Act. At least 43 states have one or more laws addressing language access in health care settings.

2. www.sae.org; www.lisa.org/products/qamodel.

3. One exception is that of multilingual text generation, in which an original is written to be translated into multiple languages.

4. Note the reference to reader response within a functionalist framework.

5. Due to rater availability, 4 raters (1 Spanish, 2 Chinese, 1 Russian) were selected that had not participated in the training and rating sessions of the previous experiment. Given the low number, researchers did not investigate the effect of previous experience (experienced vs. inexperienced raters).

References

Bell, Roger T. 1991. Translation and Translating.
London: Longman.
Bowker, Lynne. 2001. Towards a Methodology for a Corpus-Based Approach to Translation Evaluation. Meta 46:2. 345–364.
Cao, Deborah. 1996. A Model of Translation Proficiency. Target 8:2. 325–340.
Carroll, John B. 1966. An Experiment in Evaluating the Quality of Translations. Mechanical Translation 9:3–4. 55–66.
Colina, Sonia. 2003. Teaching Translation: From Research to the Classroom. New York: McGraw Hill.
Colina, Sonia. 2008. Translation Quality Evaluation: Empirical Evidence for a Functionalist Approach. The Translator 14:1. 97–134.
Gerzymisch-Arbogast, Heidrun. 2001. Equivalence Parameters and Evaluation. Meta 46:2. 227–242.
Hatim, Basil and Ian Mason. 1997. The Translator as Communicator. London and New York: Routledge.
Hönig, Hans. 1997. Positions, Power and Practice: Functionalist Approaches and Translation Quality Assessment. Current Issues in Language and Society 4:1. 6–34.
House, Juliane. 1997. Translation Quality Assessment: A Model Revisited. Tübingen: Narr.
House, Juliane. 2001. Translation Quality Assessment: Linguistic Description versus Social Evaluation. Meta 46:2. 243–257.
Lauscher, S. 2000. Translation Quality Assessment: Where Can Theory and Practice Meet?. The Translator 6:2. 149–168.
Neubert, Albrecht. 1985. Text und Translation. Leipzig: Enzyklopädie.
Nida, Eugene. 1964. Toward a Science of Translating. Leiden: Brill.
Nida, Eugene and Charles Taber. 1969. The Theory and Practice of Translation. Leiden: Brill.
Nord, Christiane. 1997. Translating as a Purposeful Activity: Functionalist Approaches Explained. Manchester: St. Jerome.
PACTE. 2008. First Results of a Translation Competence Experiment: Knowledge of Translation and Efficacy of the Translation Process. John Kearns, ed. Translator and Interpreter Training: Issues, Methods and Debates. London and New York: Continuum. 104–126.
Reiss, Katharina. 1971. Möglichkeiten und Grenzen der Übersetzungskritik.
München: Hueber.
Reiss, Katharina and Hans Vermeer. 1984. Grundlegung einer allgemeinen Translationstheorie. Tübingen: Niemeyer.
Van den Broeck, Raymond. 1985. Second Thoughts on Translation Criticism: A Model of its Analytic Function. Theo Hermans, ed. The Manipulation of Literature: Studies in Literary Translation. London and Sydney: Croom Helm. 54–62.
Williams, Malcolm. 2001. The Application of Argumentation Theory to Translation Quality Assessment. Meta 46:2. 326–344.
Williams, Malcolm. 2004. Translation Quality Assessment: An Argumentation-Centered Approach. Ottawa: University of Ottawa Press.

Résumé

Colina (2008) proposes a componential-functionalist approach to the evaluation of translation quality and reports on the results of a pilot test of a tool designed according to that approach. The results show a high degree of inter-rater reliability and justify further testing. This article presents an experiment designed to test the approach as well as the tool. Data were collected during two rounds of testing. A group of 30 raters, consisting of Spanish, Chinese and Russian translators and teachers, evaluated 4 or 5 translated texts. The results show that the tool provides good inter-rater reliability for all language groups and texts, with the exception of Russian; they also suggest that the low reliability of the Russian raters' scores is unrelated to the tool itself. These findings confirm those of Colina (2008).

Keywords: quality, testing, evaluation, rating, componential, functionalism, errors

Appendix 1: Tool

Benchmark Rating Session
Time Rating Starts:                Time Rating Ends:
Translation Quality Assessment Cover Sheet For Health Education Materials
PART I: To be completed by Requester Requester is the Health Care Decision Maker (HCDM) requesting a quality assessment of an existing translated text.
Requester: Title/Department: Delivery Date:
TRANSLATION BRIEF
Source Language Target Language Spanish, Russian, Chinese
Text Type: Text Title: Target Audience: Purpose of Document:
PRIORITY OF QUALITY CRITERIA (rank EACH from 1 to 4, 1 being top priority)
____ Target Language
____ Functional and Textual Adequacy
____ Non-Specialized Content (Meaning)
____ Specialized Content and Terminology
PART II: To be completed by TQA Rater
Rater (Name): Date Completed: Contact Information Date Received: Total Score: Total Rating Time:
ASSESSMENT SUMMARY AND RECOMMENDATION
(To be completed after evaluating the translated text.)
- Publish and/or use as is
- Minor edits needed before publishing*
- Major revision needed before publishing*
- Redo translation
- Translation will not be an effective communication strategy for this text. Explore other options (e.g. create new target language materials)

Notes/Recommended Edits
RATING INSTRUCTIONS:
1. Carefully read the instructions for the review of the translated text. Your decisions and evaluation should be based on these instructions only.
2. Check the description that best fits the text given in each one of the categories.
3. It is recommended that you read the target text without looking at the English and score the Target Language and Functional categories.
4. Examples or comments are not required, but they can be useful to help support your decisions or to provide rationale for your descriptor selection.
1. TARGET LANGUAGE

1.a The translation reveals serious language proficiency issues: ungrammatical use of the target language, spelling mistakes. The translation is written in some sort of third language (neither the source nor the target). The structure of the source language dominates to the extent that the text cannot be considered a sample of target language text. The amount of transfer from the source cannot be justified by the purpose of the translation. The text is extremely difficult to read, bordering on being incomprehensible.
1.b The text contains some unnecessary transfer of elements/structure from the source text. The structure of the source language shows up in the translation and affects its readability. The text is hard to comprehend.
1.c Although the target text is generally readable, there are problems and awkward expressions resulting, in most cases, from unnecessary transfer from the source text.
1.d The translated text reads similarly to texts originally written in the target language that respond to the same purpose, audience and text type as those specified for the translation in the brief. Problems/awkward expressions are minimal if existent at all.
Examples/Comments
2. FUNCTIONAL AND TEXTUAL ADEQUACY

2.a Disregard for the goals, purpose, function and audience of the text. The text was translated without considering textual units, textual purpose, genre, or the needs of the audience (cultural, linguistic, etc.). Cannot be repaired with revisions.
2.b The translated text gives some consideration to the intended purpose and audience for the translation, but misses some important aspect/s of it (e.g. level of formality, some aspect of its function, needs of the audience, cultural considerations, etc.). Repair requires effort.
2.c The translated text approximates the goals, purpose (function) and needs of the intended audience, but it is not as efficient as it could be, given the restrictions and instructions for the translation. Can be repaired with suggested edits.
2.d The translated text accurately accomplishes the goals, purpose (function: informative, expressive, persuasive) set for the translation and intended audience (including level of formality). It also attends to cultural needs and characteristics of the audience. Minor or no edits needed.
Examples/Comments
3. NON-SPECIALIZED CONTENT (MEANING)

3.a The translation reflects or contains important unwarranted deviations from the original. It contains inaccurate renditions and/or important omissions and additions that cannot be justified by the instructions. Very defective comprehension of the original text.
3.b There have been some changes in meaning, omissions and/or additions that cannot be justified by the translation instructions. The translation shows some misunderstanding of the original and/or the translation instructions.
3.c Minor alterations in meaning, additions or omissions.
3.d The translation accurately reflects the content contained in the original, insofar as it is required by the instructions without unwarranted alterations, omissions or additions. Slight nuances and shades of meaning have been rendered adequately.
Examples/Comments
4. SPECIALIZED CONTENT AND TERMINOLOGY

4.a Reveals unawareness/ignorance of special terminology and/or insufficient knowledge of specialized content.
4.c A few terminological errors, but the specialized content is not seriously affected.
4.d Accurate and appropriate rendition of the terminology. It reflects a good command of terms and content specific to the subject.
Examples/Comments
TOTAL SCORE
SCORING WORKSHEET

Component: Target Language
  1.a = 5    1.b = 15   1.c = 25   1.d = 30
Component: Functional and Textual Adequacy
  2.a = 5    2.b = 10   2.c = 20   2.d = 25
Component: Non-Specialized Content
  3.a = 5    3.b = 10   3.c = 20   3.d = 25
Component: Specialized Content and Terminology
  4.a = 5    4.b = 10   4.c = 15   4.d = 20

Tally Sheet

Component                               Category Rating   Score Value
Target Language
Functional and Textual Adequacy
Non-Specialized Content
Specialized Content and Terminology
Total Score
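Mechanically, a rater's total score is the sum of the point values of the four checked categories, with a maximum of 30 + 25 + 25 + 20 = 100. A minimal sketch of the tally, with the value table copied from the Scoring Worksheet above (the selections shown are a hypothetical rater's choices):

```python
# Point values per category, as in the Scoring Worksheet
VALUES = {
    "1.a": 5, "1.b": 15, "1.c": 25, "1.d": 30,   # Target Language
    "2.a": 5, "2.b": 10, "2.c": 20, "2.d": 25,   # Functional and Textual Adequacy
    "3.a": 5, "3.b": 10, "3.c": 20, "3.d": 25,   # Non-Specialized Content
    "4.a": 5, "4.b": 10, "4.c": 15, "4.d": 20,   # Specialized Content and Terminology
}

def total_score(selections):
    """Sum the values of one checked category per component."""
    return sum(VALUES[category] for category in selections)

# A rater checking the top descriptor in every component:
total_score(["1.d", "2.d", "3.d", "4.d"])  # 100, the maximum
```

Note that the components are weighted unequally: Target Language carries up to 30 points, while Specialized Content and Terminology carries at most 20, which reflects the priorities set in the translation brief.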
Appendix 2: Text sample

Blood Vessel Disease
Blood vessel disease is also called peripheral vascular disease or artery disease. It is the narrowing of the blood vessels in the abdomen, legs and arms. When the blood vessels narrow, less oxygen-rich blood gets to your body parts. This can cause tissue and cell death or gangrene. Blood vessel disease is the leading cause of amputations. Blood vessel disease is caused by a build-up of fatty deposits called plaque. Some of the blood vessels or blood clots can cause other problems.
Signs of Blood Vessel Disease in the Abdomen, Legs and Arms
- Muscle pain, aches or cramps
- Cool, pale skin, cold hands and feet
- Reddish-blue color of the skin and nails of the hands and feet
- A sore that takes a long time to heal or, when scabbed over, looks black
- Loss of hair on legs, feet or toes
- Faint or no pulse in the legs or feet
Risk Factors
You are at higher risk for blood vessel disease if you:
- Smoke
- Have diabetes
- Are over the age of 45
- Have high cholesterol
- Have high blood pressure
- Have a family member with heart or blood vessel disease
- Are overweight
- Are inactive
Your Care
Blood vessel disease may be prevented or slowed down with healthy choices.
- Have your blood pressure checked.
- See your doctor each year.
- Do not smoke or use tobacco.
- Exercise each day.
- Eat a diet low in fat and high in fiber.
- Manage your stress.
Your care may also include medicine and surgery.
Talk to your doctor about your treatment options.
6/2005. Developed through a partnership of The Ohio State University Medical Center, Mount Carmel Health and OhioHealth, Columbus, Ohio. Available for use as a public service without copyright restrictions at www.healthinfotranslations.com.

Author's address

Sonia Colina
Department of Spanish and Portuguese
The University of Arizona
Modern Languages 545
Tucson, AZ 85721-0067
United States of America
scolina@email.arizona.edu