The MT world is abuzz with ideas of statistical analysis and processing of texts so that the MT system creates itself: no linguists or translators are involved, just a lot of crunching of freely available data from the Web. But can this really work?
Like all good ideas, this one is old. Discussions of statistical analysis for MT go back decades, and the implementation is appearing now because of a combination of the explosive growth of the Web, the power of search engines to index and store vast portions of online content, and the processing speed of computers needed to crunch through all this data.
So the idea has had a lot of appeal for a long time, in particular because it takes people out of the loop. Or so the proponents of this approach claim. However, people are still completely in the loop, just not in a well-recognized way. After all, the texts that are being processed statistically to find similarities and from there build the MT system were all produced by people.
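To make the underlying idea concrete, here is a deliberately simplified sketch of the kind of statistical processing involved: given sentence pairs assumed to be translations of each other, the system counts which words co-occur across the two languages and turns those counts into rough translation scores. Real systems use far more sophisticated alignment models; the corpus, function names, and scoring here are hypothetical toy examples, not any actual system's method.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus: three English/French sentence
# pairs assumed (by a human, ultimately) to be translations.
parallel = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]

# Count how often each (source word, target word) pair appears
# together in an aligned sentence pair, and how many co-occurrence
# events each source word participates in overall.
cooc = defaultdict(int)
src_totals = defaultdict(int)
for src, tgt in parallel:
    for s in src.split():
        for t in tgt.split():
            cooc[(s, t)] += 1
            src_totals[s] += 1

def translation_score(s, t):
    """Fraction of s's co-occurrence events that involve t."""
    return cooc[(s, t)] / src_totals[s]

# "house" appears alongside "maison" in both of its sentences,
# so "maison" outscores incidental neighbors like "la" or "une".
print(translation_score("house", "maison"))   # 0.5
print(translation_score("house", "la"))       # 0.25
```

Even this toy version shows where the human dependence hides: the scores are only as good as the assumption that each pair really is a translation, which is exactly the assumption questioned below.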
And that is the first problem. How do you know if you have parallel texts unless you know both languages? I am very suspicious of the claims that these systems can reliably and accurately find parallel texts on their own, without human guidance or checking. Most so-called translations of news and related media material online are not actual translations but rewrites, so any system that mines such material for parallel texts will be learning from pairs that do not actually correspond.
Of course, the proponents of this approach claim that the vast quantity of information processed will smooth over any errors that creep in from one text. But how do we know this? If the system is human-free, who is doing the checking? Further, much of the data available online is not freely accessible, but instead requires a subscription or membership to an organization. All of that is presumably off limits to the MT systems, and so a vast repository of potentially useful texts will go unused.
Next, the accuracy issue. Current statistical systems are reported to be upwards of 50% accurate. In other words, if you are given a text in English produced by one of these systems from an Arabic or Chinese original, half the text is accurate. But which half? The system can't tell that, and neither can you, unless you know the source language, in which case you probably wouldn't need the translation. Even if the systems produce texts that are 80% or 90% accurate, that still leaves the texts 20% or 10% inaccurate, with no identification of which part is which. Do you want to use an aircraft maintenance manual or a medical instrument user's guide that is 90% accurate? Do you want to make investment decisions, or create military strategy with anything less than 100% accuracy?
The proponents' standard response to this is that the systems will become 100% accurate. But that is just an assertion, since no system is currently even 50% accurate with anything but the most general, ordinary material. When the MT systems' developers are claiming accuracy levels approaching 100%, then it will be time for close scrutiny of their work. They may ultimately become that accurate. But there is a long way to go between 50% accurate and 100% accurate, and as is the case with so many tasks in artificial intelligence, the first 50% is easy, the next 25% is hard, and the remainder is very, very hard to achieve. Time will tell.