By Erwin La Cruz

Do measures that pit machine translation against human interpreters give an accurate picture, asks Erwin La Cruz

The fight for translation supremacy between machines and humans rages on. The reports from the front line are delivered in code: LEPOR 0.68, NIST 0.56, BLEU 0.46, METEOR 0.56, WER 0.39. These acronyms name translation quality evaluation metrics. The metrics, and the algorithms that calculate them, are an integral part of machine translation (MT) technology, making it possible to evaluate and refine the output of MT systems. But could this same technology be applied to assess interpreting students?

Before we can answer that question, we need to understand what translation quality metrics are. Training and evaluating MT systems require the use of big data. Neural-network translation models are trained using huge collections of sentences and their translations. The output of these systems is so large that manually assessing its quality is not practical in terms of cost and time. Researchers have solved this problem by developing algorithms that can automatically calculate translation quality metrics. These metrics assume that quality can be evaluated by focusing on two aspects: accuracy and fluency. Accuracy refers to lexical equivalence between the original and its translation, while fluency is determined by how grammatical a translated sentence is.

In 2018, Microsoft announced that its automatic translation system had achieved parity with human Chinese translators, reporting that its MT output matched human translations with a BLEU score of 0.69 (where 0.7 is considered very good).¹ The Bilingual Evaluation Understudy (BLEU) score is one of the most common metrics used to evaluate MT quality. It is easy to compute and implement in different languages as it does not need language-specific parsers or synonym sets. BLEU scores also correlate closely with the ranking of translation quality by human assessors.

As with other translation metrics, the basic unit of analysis for BLEU is the sentence. Taking the sentence hay un gato en la alfombra and its corresponding reference ‘there is a cat on the mat’ in an English-Spanish parallel corpus, how would the BLEU algorithm assess the translation candidate ‘mat on is cat the’? First it takes into account how many words in the candidate appear in the reference – in this case all five, out of the seven words in the reference, which suggests an accurate translation. However, comparing individual words tells us little about how readable a sentence is.

Although accurate in terms of lexical equivalence, a translation like ‘mat on is cat the’ is not fluent or grammatical. To account for fluency – or grammatical adequacy – the BLEU algorithm calculates a higher score when longer word sequences match in the candidate and the reference sentences. The candidate ‘the cat is on the mat’ gets a higher score because it matches the longer sequence ‘on the mat’.
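The word-sequence matching described above can be sketched in a few lines of Python. This is a minimal illustration rather than the full BLEU algorithm: it only counts what share of a candidate's word sequences (n-grams) also occur in the reference. The function names `ngrams` and `clipped_precision` are illustrative, and splitting on whitespace is an assumed simplification of tokenisation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive words in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of the candidate's n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference, so repeating a matching word earns no extra credit.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


reference = "there is a cat on the mat"
scrambled = "mat on is cat the"
ordered = "the cat is on the mat"

# Compare the two candidates on single words, word pairs and word triples.
for n in (1, 2, 3):
    print(n, clipped_precision(scrambled, reference, n),
          clipped_precision(ordered, reference, n))
```

On single words the scrambled candidate scores a perfect 1.0, but none of its word pairs or triples appear in the reference. The ordered candidate matches two of its five word pairs (‘on the’, ‘the mat’) and one of its four triples (‘on the mat’), which is what lifts its overall score.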

The BLEU score can be calculated for individual sentences, but it is really meant to be used as a metric for the translation of a whole corpus. In the standard formulation, the final score for a given corpus is calculated from the word-sequence matches pooled across all of its sentences, rather than by averaging the sentence scores. The scale goes from 0 to 1, with 1 meaning a perfect match between the candidate translation and reference sentences. In practical terms, a BLEU score of 0.70 is very good, while one below 0.20 means the translation is of no practical use.

When high-scoring versions fail

I set out to investigate whether the BLEU score could be used as an effective assessment tool for interpreting students. My study involved 44 students of community interpreting, covering 18 languages, including majority languages such as Arabic, Japanese and Spanish, and languages of limited diffusion such as Nepali, Tongan and Kayan. Each student interpreted two dialogues on medical consultations, of about 500 words each, and their performance was assessed by experienced professional interpreters of the students’ languages. Omissions, additions and distortions in their renditions were annotated on a copy of the original dialogue script.

Syntactic changes were not recorded, so a candidate translation such as ‘this pain is affecting my life’ for the reference ‘my life has been affected by this pain’ was considered equivalent and no annotation was required. Equally, semantically equivalent terms were not marked in the script, for example ‘I have a lot of pain’ for ‘I am very sore’. Furthermore, short paraphrases, such as ‘gastroenteritis’ for ‘inflammation of the stomach and bowels’, were treated as valid translations and not annotated on the script. The assessors were also asked to give each student a pass or fail mark, considering how accurate and fluent their renditions were in both dialogues.

BLEU scores were calculated from the annotated scripts using the Natural Language Toolkit.² This is a package for the Python programming language, which includes several resources for language analysis, corpus linguistics and machine translation. The scores from the students were higher than 0.7 (‘very good’), which indicates that the translations were quite accurate and fluent. However, the oral assessment (a pass or fail mark) revealed that the students needed a much higher score (0.86) to get a pass. So it seems that the assessors were more demanding of the quality of the output. Incidentally, the analysis did not indicate that language was a relevant factor for scores.

Once I had determined the passing threshold for the students, I was able to compare human interpreting with MT. The original assessment scripts in Arabic, French, Japanese, Mandarin and Spanish were translated using Google Translate and Microsoft Translator. The same assessors who marked the students’ assessments marked these translations; they were not told that the scripts were translated by machines. As before, omissions, additions and distortions were annotated on a copy of the scripts. Once again, the BLEU scores were very high, with a range from 0.84 to 0.95.

I used the model that had been trained with data from the interpreting students to calculate the probability of passing for MT output. The results indicated that Spanish and Arabic translations had a high probability of passing, but Mandarin and Japanese had a low probability. In reality, only the Spanish translations got a pass from the assessors. Even the Microsoft translation with a BLEU score of 0.98 and pass-rate probability of 89% failed to pass the assessment.
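The idea of turning a BLEU score into a probability of passing can be sketched as follows. The article does not specify the form of the model, so this assumes a logistic curve, a common choice for pass/fail outcomes. The `midpoint` and `steepness` values are hypothetical, chosen only for illustration: the midpoint borrows the 0.86 score that students needed to pass and treats it as the 50% point, which is itself an assumption, and the steepness is invented rather than estimated from any data.

```python
import math


def pass_probability(bleu, midpoint=0.86, steepness=17.0):
    """Map a BLEU score to a probability of passing with a logistic curve.

    midpoint and steepness are hypothetical illustration values,
    not parameters estimated in the study.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (bleu - midpoint)))


# Probabilities rise smoothly as the BLEU score moves past the midpoint.
for score in (0.70, 0.86, 0.98):
    print(score, round(pass_probability(score), 2))
```

A curve of this kind only captures the relationship between score and outcome seen in the data it was fitted to. It cannot know that a high-scoring translation reverses the meaning of its reference, which is why a high predicted probability can still end in a fail.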

This is because a translation that is ‘accurate’ at word level may not be accurate at a pragmatic level. Take the sentence ‘I will be unable to work for a while, and that means less money’. The translation ‘If I am unable to work for a while, I will also be able to make money and change plans’ will get a high BLEU score, even though it says the opposite of its reference. The low pass rate reflects a well-known challenge in MT: the farther apart the source and target languages are, the more difficult automatic translation becomes.

Is BLEU a useful tool?

The performance of interpreting students and MT systems was high in comparison to the values reported in most MT studies. This was expected, given that syntactic differences were not marked and semantic equivalences were allowed. However, a high BLEU score does not mean the translation is good enough for a professional interpreter.

BLEU scores do indicate a high match between candidate and reference, but they are not fine-tuned enough to pick up small lexical or syntactical mismatches that can have a huge impact on how a translation will be understood by a human recipient. For example, when the reference ‘in your case, one option is acromioclavicular resection arthroplasty’ (a procedure to repair the joint between the scapula and the collarbone) is translated as ‘in your case one option is surgery to rebuild the shoulder’ it gets a BLEU score of 0.46, but when it is incorrectly translated as ‘in your case one option is surgery vasectomy’ it gets a much higher score of 0.60.
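The comparison above can be reproduced with a short, self-contained sketch of sentence-level BLEU. It is written in plain Python as a stand-in for the toolkit used in the study, and rests on stated assumptions: text is lower-cased, punctuation is removed, word sequences of one to four words are weighted equally, and no smoothing is applied. The function names are illustrative.

```python
import math
import string
from collections import Counter


def tokenize(text):
    """Lower-case, strip ASCII punctuation and split on whitespace (assumed preprocessing)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def ngram_counts(tokens, n):
    """Count every run of n consecutive words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU for one candidate against one reference."""
    cand, ref = tokenize(candidate), tokenize(reference)
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0:
            return 0.0  # without smoothing, one level with no matches zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are scaled down.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, times the brevity penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "in your case, one option is acromioclavicular resection arthroplasty"
paraphrase = "in your case one option is surgery to rebuild the shoulder"
wrong = "in your case one option is surgery vasectomy"

print(round(sentence_bleu(paraphrase, reference), 2))  # 0.46
print(round(sentence_bleu(wrong, reference), 2))       # 0.6
```

Both candidates share the same six opening words with the reference. The correct paraphrase then adds five words that do not match, which dilutes its precision, while the incorrect version adds only two. The algorithm rewards the shorter, wrong sentence because it has no way of weighing what the unmatched words mean.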

Despite its limitations, the BLEU assessment method can be quite informative when the original text contains explicit textual information. For example, the Japanese 先生は嘘をついてますね。外国人にお金を使いたくないんですよ (‘The doctor is lying. She doesn’t want to spend money on foreigners’) was mistranslated by Google Translate and Microsoft Translator as ‘The teacher is lying. I don’t want to spend money on foreigners’. In this case, you need contextual information (a medical consultation) to translate the word sensei correctly as ‘doctor’ and to determine that it is she who does not want to spend money on the speaker. In contrast, the Spanish la doctora está mintiendo. Ella no quiere gastar dinero en los extranjeros explicitly states the profession of the subject of the first sentence (doctora), and the subject of the second sentence (ella). Both MT systems translated this Spanish segment correctly. In such circumstances, MT achieved very high BLEU scores and also a pass mark from the assessors.

With this in mind, BLEU scores could be used as a quick and cheap way to discriminate between good and bad translations. They can also be used to teach students about the difference between formal and functional equivalence. However, as effective interpreting relies on interpreters being able to presume, complement and supply contextual information, the BLEU score is not the best metric to report news from the battlefront between human and machine.


¹ Awadalla, HH et al (2018) ‘Achieving Human Parity on Automatic Chinese to English News Translation’. Cornell University.

² Bird, S, Loper, E and Klein, E (2009) Natural Language Processing with Python, O’Reilly Media Inc: California.