Machine translation (MT) systems, whether rule-based or statistical, can easily be tripped up by self-referential utterances, that is, utterances that refer to themselves in some way. This class of utterances deserves careful analysis because it requires a higher level of awareness of the text than the rules or statistical processing in MT allow for.
First, an example, taken from the book Harry Potter and the Order of the Phoenix by J. K. Rowling:
“The bit about not telling Harry more than he needs to know,” said Mrs. Weasley, placing a heavy emphasis on the last three words. (p. 88)
The self-reference here is the “last three words”, which refers to the “needs to know” in the quote that forms the first part of this sentence. The problem in this sentence is that the “needs to know” may or may not be three words when translated into another language.
In Spanish, the three words become two, and the published translation reflects this:
- A lo de que no teníamos que contarle a Harry más de lo que necesita saber – dijo la señora Weasley poniendo mucho énfasis en las dos últimas palabras. (Harry Potter y la Orden del Fénix, p. 98)
The “dos últimas palabras” means “last two words”, referring to “necesita saber”, which is the proper translation of “needs to know”. So three has to become two. Unfortunately, current MT systems produce this:
- El pedacito sobre no decir Harry más que él necesita saber, "dijo a señora Weasley, poner un énfasis pesado en las tres palabras pasadas.
Setting aside other issues of quality in this machine-generated translation, the “dos palabras” that should have appeared is left as “tres palabras” because the MT system cannot handle this kind of self-reference in an utterance. There is a level of meaning beyond what appears in the individual words or even the meaning units to consider in this simple sample from a Harry Potter novel.
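To make the missing step concrete, here is a minimal Python sketch of the check a human translator performs implicitly: find the phrase the count refers to, translate it, and recount. Everything in it is hypothetical and for illustration only: the function names, the tiny number lexicons, and the one-entry phrase table stand in for components a real MT system would need, and the pattern handles only the literal wording "last N words".

```python
import re
from typing import Dict, List, Optional

# Toy lexicons, assumed for illustration; a real system would need far more.
EN_NUMBER_WORDS: Dict[str, int] = {"two": 2, "three": 3, "four": 4}
ES_NUMBER_WORDS: Dict[int, str] = {2: "dos", 3: "tres", 4: "cuatro"}


def last_n_words(text: str, n: int) -> List[str]:
    """Return the last n word tokens of text, ignoring punctuation."""
    return re.findall(r"\w+", text)[-n:]


def referenced_count(narration: str) -> Optional[int]:
    """Find a phrase like 'last three words' and return its count, if known."""
    match = re.search(r"last (\w+) words", narration)
    if match is None:
        return None
    return EN_NUMBER_WORDS.get(match.group(1))


def adjusted_spanish_count(
    source_quote: str, narration: str, phrase_table: Dict[str, str]
) -> Optional[str]:
    """Recount the referenced phrase after translation.

    phrase_table is a hypothetical source-to-target phrase lookup.
    """
    count = referenced_count(narration)
    if count is None:
        return None
    phrase = " ".join(last_n_words(source_quote, count))
    translated = phrase_table.get(phrase)
    if translated is None:
        return None
    return ES_NUMBER_WORDS.get(len(translated.split()))


quote = "The bit about not telling Harry more than he needs to know,"
narration = "placing a heavy emphasis on the last three words"
table = {"needs to know": "necesita saber"}
print(adjusted_spanish_count(quote, narration, table))
```

The sketch works only because the phrase table and the pattern were chosen to fit this one sentence. The hard part, which it does not attempt, is recognizing in general that a stretch of text is talking about other text.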
Harry gives us another example of this problem. Usage of pronouns varies considerably from language to language. The word “you” in English covers one or many people. English once distinguished singular “thou” from plural “you”, but that distinction dropped out of standard usage centuries ago. Many other languages preserve it, so the often-used euphemism “You-Know-Who” for the infamous antagonist Lord Voldemort in Harry’s world has to be adjusted in languages that use a different word for one versus several people.
For instance, when Hagrid is telling Harry and his friends Ron and Hermione about his trip to find giants, he says:
“[a]nd partly ‘cause Dumbledore had warned us You-Know-Who was bound ter be after the giants an’ all.” (p. 426)
In Spanish, or for that matter French, Italian, German, or dozens of other common languages, the plural form of “you” has to be used, as in:
- Dumbledore nos había advertido que Quien-Uds.-saben también debía de andar buscando a los gigantes. (p. 443)
Note that Uds. is short for Ustedes, the plural form of “you” (the singular is “Usted”). The human translation preserves the self-reference in the pronoun “you” by using the plural because there are three people in question. When the original is given to an MT system, we get:
y la causa Dumbledore había advertido en parte que Usted-Saber-Que estaba encuadernada ter estemos después de los gigantes ' a los todos.
In other words, Usted, the singular, is used. This is of course incorrect. The MT system simply cannot process this self-reference and make the proper distinction between the singular and plural use of the pronoun “you” in English and Spanish. Also, as above, the Spanish here is rather poor, and doesn’t capture the odd dialect of Hagrid’s speech at all, though this is also lost to some extent in the human translation.
Current MT systems fail when attempting to translate language that involves this kind of usage. Unfortunately, such self-referential language is quite common in speech and often appears in writing. It can be even more difficult when the reference reaches back across a paragraph, or across several paragraphs. In other words, while a human being can readily hold the content of previous words in mind as the new ones are processed and therefore come up with the correct meaning, current MT systems simply have no way to do this.
This represents a fundamental barrier to high-quality MT at present. The MT systems will have to be able to handle this kind of self-referential language in order to produce satisfactory output for humans. That seems to be a ways off, the hype about MT notwithstanding.