Statistical machine translation has become the latest fashionable way to talk about MT. But it rests on some underlying assumptions that make its success in the near future less than probable.
In statistical machine translation, a huge corpus of texts is assembled in two languages, and through complex processing parallel texts are aligned to create matches between the two languages that will serve as the basis for a statistical approach to producing new translations. In other words, you have a gigantic collection of parallel texts in, for instance, English and Arabic, and when you need to translate a new document, the statistical MT system searches that collection to find exact matches, close matches, probable matches, and, perhaps, little things that can't be matched at all. It then creates a translation for you, all in a matter of seconds.
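The matching idea described above can be illustrated with a toy sketch. This is not how a real statistical MT system works internally (real systems build probabilistic alignment and language models over millions of sentence pairs); it only shows the basic notion of exact and fuzzy matching against a tiny, invented English-French corpus.

```python
import difflib

# Toy parallel "corpus": a few hypothetical English-French pairs,
# invented purely for illustration.
corpus = [
    ("the cat sat on the mat", "le chat s'est assis sur le tapis"),
    ("the dog sat on the mat", "le chien s'est assis sur le tapis"),
    ("good morning", "bonjour"),
]

def best_match(sentence, corpus):
    """Return (score, source, target) for the corpus pair whose source
    side is most similar to the new sentence; score is in [0, 1]."""
    scored = [
        (difflib.SequenceMatcher(None, sentence, src).ratio(), src, tgt)
        for src, tgt in corpus
    ]
    return max(scored)

# An exact match scores 1.0 and hands back a ready-made translation.
score, src, tgt = best_match("the cat sat on the mat", corpus)

# A novel sentence only gets a fuzzy match: the score falls below 1.0,
# and the returned translation would need human post-editing.
score2, src2, tgt2 = best_match("the bird sat on the mat", corpus)
```

Even this toy makes the article's point visible: the quality of what comes out depends entirely on the quality of the pairs that went in.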
So far, so good. This sounds great in theory. But there is an assumption here that no one seems to be talking about. How do we know the parallel texts are actually parallel texts? What system of quality assurance and control has been used to guarantee that the parallel texts are actually faithful and natural reflections of each other? In other words, who is preparing the parallel texts and checking them, and where are all these parallel texts coming from?
It is well known in the translation industry that many translations suck. A large percentage of the people translating at present have no business doing so; their knowledge of their languages and their skills as writers are simply insufficient to produce a quality translation. This does not bode well for parallel texts, since presumably they are being drawn from existing translations.
So we find ourselves in a bit of a vicious circle. On the one hand, the statistical approach sounds great in theory: use an existing, massive body of parallel texts (in other words, translations) to search for matches of varying degrees of accuracy or fuzziness, and then produce a translation of a new text that likely will require at most a little post-editing. In practice the problem is the parallel texts: we don't have a reliable supply of quality parallel texts, as far as anyone is aware, and the only way to create one is to have human translators produce it, with all the work, time, and funds implied in that process.
It will be interesting to see whether the statistical approach yields better results than the current rules-based approach does. Inevitably the two will fuse, since the statistical approach will fail when confronting something not in its existing parallel text collection, and must have some fall-back procedure (in other words, rules) to use. But even then we are still likely to have to be patient: the dream of a universal grammar for any single language, to say nothing of human language in general, remains unfulfilled, and without rules of that sophistication, MT will likely remain a limited tool for use in particular circumstances, slowly creeping forward as more rules and better parallel texts are integrated into it. This may take ten, twenty, or more years, or may take far less, should a major breakthrough occur. But for now, we can watch and wait. The machines have to prove themselves to the market, and the market in general is not convinced.