Machine translation (MT) remains a Holy Grail of computer science, and possibly the killer app of the 21st century. That said, MT systems as of 2007 are woefully inadequate for any real-world use beyond gisting and basic comprehension of simple texts.
Their failure stems from a variety of issues inherent in natural language. In an attempt to clarify these issues, I have been analyzing and classifying syntactic patterns that current systems, whether statistical, frame-based, or traditional, simply do not handle well.
The first class is a sentence with homonyms. As an example:
Rose rose for her rose.
Of course this sentence is a bit silly, but it nicely demonstrates the problems that homonyms create for MT systems. Putting the above sentence through Google's MT system gives the following:
|Spanish||Rose se levantó para ella color de rosa.|
|French||Rose s'est levée pour elle rose.|
|German||Rose stieg für sie rose.|
|Italian||Rosa è aumentato per lei di rosa.|
|Portuguese||Rosa levantou-se para ela cor-de-rosa.|
All of these translations are wrong, each in similar ways. The name is identified correctly, which is a relatively trivial matter since names can be stored in a lookup table or identified based on being capitalized nouns, a rule-based approach that works well in English.
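The two name-spotting rules mentioned above (a lookup table and a capitalization check) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `KNOWN_NAMES` table and the `tag_names` function are hypothetical, and a real system would use a far larger name list and a proper tokenizer.

```python
import re

# Hypothetical lookup table of known names; a real system would use a large gazetteer.
KNOWN_NAMES = {"Rose", "Ban"}

def tag_names(sentence):
    """Return (token, is_name) pairs using a lookup table plus a capitalization rule."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    tagged = []
    for i, tok in enumerate(tokens):
        in_table = tok in KNOWN_NAMES
        # Capitalization is only informative away from the sentence start,
        # because every English sentence begins with a capital letter.
        capitalized_mid_sentence = i > 0 and tok[0].isupper()
        tagged.append((tok, in_table or capitalized_mid_sentence))
    return tagged

for token, is_name in tag_names("Rose rose for her rose."):
    print(token, is_name)
```

Only the first token is tagged as a name here, and only because it appears in the table. The capitalization rule alone cannot decide a sentence-initial word, which is one reason the lookup table is needed as well.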
However, the verb is mistranslated, and the rest of the sentence therefore falls apart. In other words, a grammatically simple sentence with only five words is actually quite complex, and making sense of it requires some semantic insight, that is to say, the ability to see past the words to find the meaning. Current MT systems simply cannot do this, though there is no fundamental issue here that would prevent future systems from overcoming the problem of homonyms.
This class of sentences may seem trivial, given the choice of example. However, homonyms are quite common in English, and are far more common in languages like Japanese or Arabic, which have relatively few phonemes.
Further, many sentences actually combine names from several languages, and unless a name is properly identified as such, the results can be amusing. The U.N. Secretary-General from Korea, Ban (a relatively common name), has been the subject of several ambiguous newspaper headlines.
MT systems, statistical or otherwise, will have to overcome the problem of parsing homonyms for context in order to produce a meaningful result. Until such time, the output of MT systems will be flawed in amusing, and at times no doubt important, ways.
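To make "parsing homonyms for context" concrete, here is a toy Python sketch that assigns a sense to each occurrence of "rose" by looking at the preceding word. It is not how any existing MT system works; the `SENSES` inventory, the `DETERMINERS` set, and the three hand-written rules are assumptions chosen only to fit the example sentence, and they would fail on many real sentences.

```python
# Hypothetical sense inventory for the single homonym in the example sentence.
SENSES = {
    "name": "proper name, left untranslated",
    "verb": "past tense of 'rise'",
    "noun": "the flower",
}

# Words that typically introduce a noun (an assumed, incomplete list).
DETERMINERS = {"a", "the", "her", "his", "my", "your", "our", "their"}

def disambiguate(tokens):
    """Label each occurrence of 'rose' using the word immediately before it."""
    result = []
    for i, tok in enumerate(tokens):
        if tok.lower() != "rose":
            result.append((tok, None))
            continue
        prev = tokens[i - 1].lower() if i > 0 else None
        if prev in DETERMINERS:
            sense = "noun"   # "her rose": a determiner signals a noun
        elif tok[0].isupper():
            sense = "name"   # capitalized and not after a determiner
        else:
            sense = "verb"   # fallback for the lowercase form
        result.append((tok, sense))
    return result

for token, sense in disambiguate(["Rose", "rose", "for", "her", "rose"]):
    print(token, SENSES.get(sense, ""))
```

Even this crude use of one word of context separates the three readings in the example. The hard part, which the sketch sidesteps, is doing the same for arbitrary words and sentences without hand-written rules.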