Translation quality in the crosshairs

July 3, 2007

Posted by globalizer in Localization, Translation, Unicode.

During the last few days there has been a bit of a brouhaha on the otherwise rather quiet CLDR mailing list over so-called “Google data” introduced during the 1.5 update process.

I had not been aware of the introduction of this data, but from the mailing list descriptions it sounds as if Google submitted a load of locale data translations for a slew of different locales, and that this data therefore automatically gets the “Google default vote” in the vetting process. This “default vote” means that it is actually not that easy to reject the new translations, even with a lot of other “vetters” agreeing. As Mark Davis explained:

Default data is given a default vote from the organization (in this case, Google), which means that it is ignored *if* there is another vote from that source. But otherwise it gets weighted as if it were a vote from the organization.

So unless Google specifically votes against its own data during the vetting process, the “default vote” counts as a regular Google vote in favor of the new version.
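To make the rule concrete, here is a minimal Python sketch of how I read that description. It is not the actual CLDR vetting code: the `Vote` record, the `effective_votes` function, and the sample organization name `VetterOrg` are all invented for illustration, and the real process weights votes in ways not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    org: str          # organization the vote is attributed to
    value: str        # the translation being voted for
    is_default: bool  # True if the vote comes from bulk-submitted "default data"


def effective_votes(votes):
    """Drop a default vote only when the same organization also voted explicitly.

    Assumption: this mirrors the quoted rule and nothing more. A default vote
    that survives is treated like any other vote from that organization.
    """
    orgs_with_explicit = {v.org for v in votes if not v.is_default}
    return [
        v for v in votes
        if not v.is_default or v.org not in orgs_with_explicit
    ]


# Hypothetical example: Google's default entry versus one outside vetter.
votes = [
    Vote("Google", "Isingesi", is_default=True),
    Vote("VetterOrg", "isiNgesi", is_default=False),
]
print([v.value for v in effective_votes(votes)])
```

In this sketch the default entry stays in play because nobody at Google voted explicitly. Only an explicit Google vote would remove it, no matter how many other vetters disagree.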

This would not necessarily be a major problem if the new translations were more or less correct. But this is where the fun starts, because apparently quite a few of the translations submitted by Google are not quite up to snuff. To quote a couple of comments:

today’s glance at what the data became left me quite speechless. Mockage of culture, no less.


How do we reject Google data outright? I’m getting rubbish like this in Xhosa.
Portugeuse -> portokugusseee
English -> Isingesi -> should be isiNgesi


The Belarusian locale data had some glaring errors before (cf. my emails to this list), but now it is a *raving mess*, supposedly after the “google tool” working on it. If that’s the way CLDR treats culture-related data, I’m not even going to bother with correcting that.

1. In language names, there’s now quite a quantity of entries written down in a distorted (in-group and non-normative) variant of Belarusian cultivated on Internet.

2. Some entries are plainly somebody’s uneducated invention (e.g., entry for afrikaans), unexplainable even by distortion.

3. To top that, the majority of google entries in languages are grammatically corrupt (wrong declension).

So, it would appear that Google contributed a bunch of translations that do not pass muster with expert native speakers. Mark Davis seems to admit that he is not too confident of the data in this jitterbug:

in cases like Xhosa, an organization may not have vetters working on CLDR that would be able to respond to issues and override the default where necessary. So my suggestion is to weight the vote to a be a Guest vote where the locale is not one of the main coverage locales for the organization.
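The suggestion in that comment amounts to a simple lookup, sketched below in Python. Again, this is only my illustration: the function name, the two level labels, and the contents of the locale set are assumptions, not anything taken from the CLDR tools or from Google's actual list.

```python
# Hypothetical "main coverage locales" for an organization. The real list is
# not reproduced here; these codes are placeholders for illustration only.
MAIN_COVERAGE = frozenset({"de", "fr", "ja", "tl", "uk"})


def default_vote_level(locale, main_coverage=MAIN_COVERAGE):
    """Return the level at which an organization's default vote would count.

    Assumption: inside the main coverage locales the default vote keeps its
    full organization weight; elsewhere it is downgraded to a guest vote.
    """
    return "organization" if locale in main_coverage else "guest"


print(default_vote_level("xh"))  # Xhosa is not in the placeholder set
print(default_vote_level("uk"))  # Ukrainian is in the placeholder set
```

Under such a scheme the default data would still be visible in locales like Xhosa, but it would be much easier for local vetters to outvote it.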

He then lists the locales that Google would consider its “main coverage locales”, and they are the “usual suspects” (European and main AP languages) with a couple of slightly less mainstream locales such as Tagalog and Ukrainian.
He is not that forthcoming about how Google came up with the translations, but he does insist that they are not the result of machine translation or screen scraping, and that they were “translated by people, who took the English and translated it”.

He also adds these caveats:

Now, in some cases this may have been older data (eg before Zaire changed names) or may have been translated out of context, etc. And there is always the possibility of human error.

All of the above could of course happen on any kind of translation project, but the translations would rarely make it into a publicly released application. So this makes me suspect that the translations are a result of the volunteer translation efforts that Google has made use of in the past. The comments made about the Belarusian translations certainly would fit such an interpretation.

Now, it is certainly up to Google to decide whether they want to use those translations on their own web pages (with all the potential issues that entails, as discussed here), but I find it very disappointing that they would dump them into the CLDR. CLDR has become a very important project, with an important open source project like ICU relying completely on the CLDR data. And when you consider the growing list of major software products that in turn use ICU, loading rubbish data into the repository is really disruptive.

This, in my opinion, is clearly a case where “no translation” is better than “any translation”.

Update July 4:

Apparently I was not the only one with questions about the actual origins of this “Google data”. Here’s an update from the mailing list:

Nobody has said exactly where this data has come from. And I’m afraid this doesn’t clarify it either. My first assumption is that it was from the Google community translation tools, that is the only place I would expect Xhosa data to come from and we translated that. We want to correct bad data, but nobody seems to want to say where this is coming from.

Here is an illustration of the problem we are seeing:

1) blank because we are uncertain of what we want
2) Google dump
3) Correct
4) Now we have a conflict



1. Musings on software globalization - July 4, 2007

The importance of the CLDR

Just a quick pointer to Naoto Sato’s blog post with his locale demo. I attended the session at the Internationalization and Unicode Conference where he used the demo, and just found the webstartable version that he posted.
The Locale Sensitive Se…
