Natural Language: Lost in Translation

By Dick Weisinger

The quality of computer translation has improved phenomenally over the last twenty years. Tools like Google Translate and Bing Translate let users instantly translate an email or web page between English and another language. The translations are often surprisingly good and, when only a high-level overview is needed, computer translations are typically more than sufficient.

But as with most things in AI, the more training data used, the better the performance. Translations between English and Spanish are the most frequently requested, and largely because of that, Google Translate performs at better than 90 percent accuracy for that language pair. Translations between English and less common languages are less accurate, and translations between two non-English languages are significantly less accurate still.

When translating from English to another language using computer translation, a good rule to follow is to keep your source English sentences short, simple, and free from unnecessary jargon and slang. By following this rule, the accuracy of the translation can be quite high.
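The rule above can be partly automated before text is sent to a translation tool. The following is a minimal sketch, not a definitive implementation: the 20-word limit, the small `JARGON` word list, and the function names are all assumptions chosen for illustration, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical values chosen only for illustration; tune them for real use.
MAX_WORDS = 20
JARGON = {"synergy", "leverage", "bandwidth", "gonna", "kinda"}


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_sentence(sentence):
    """Return a list of warnings for a single sentence."""
    words = re.findall(r"[A-Za-z']+", sentence)
    warnings = []
    if len(words) > MAX_WORDS:
        warnings.append(f"long sentence ({len(words)} words)")
    flagged = sorted({w.lower() for w in words} & JARGON)
    if flagged:
        warnings.append("jargon or slang: " + ", ".join(flagged))
    return warnings


def review(text):
    """Map each sentence that has problems to its list of warnings."""
    report = {}
    for sentence in split_sentences(text):
        warnings = check_sentence(sentence)
        if warnings:
            report[sentence] = warnings
    return report


if __name__ == "__main__":
    draft = "We're gonna leverage this. Please send the form."
    for sentence, warnings in review(draft).items():
        print(sentence, "->", "; ".join(warnings))
```

A check like this only catches surface problems such as length and listed words. It cannot judge whether a sentence is actually clear, so it is best treated as a first pass before a human review.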

Some have pointed to this gap in translation quality as a source of inequity. People living in the US whose first language is something other than English or Spanish, such as Vietnamese, Hmong, or Thai, and who rely on tools like Google Translate, may often have to make do with poorly translated or mistranslated information when interacting with the government.

In situations where accuracy and style matter most, computer translation still has a long way to go. High accuracy is especially critical in fields like medicine, law, and branding and advertising.

Even authors on the Google AI blog admit that computer translation still needs work: “one must remember that, especially for low-resource languages, automatic translation quality is far from perfect. These models still fall prey to typical machine translation errors, including poor performance on particular genres of subject matter (‘domains’), conflating different dialects of a language, producing overly literal translations, and poor performance on informal and spoken language.”
