Android string arrays and localization January 18, 2010

Posted by globalizer in Android, Localization, Programming languages, Programming practices.

I really wish the Android developer guide would not (at least implicitly) recommend using localizable strings directly as array items.

In my opinion, any examples showing translatable strings as array items should in fact be accompanied by flashing red warning signs. Here’s why:

If the default version of a string array is updated with the addition of an array item, but the localized version has not been updated yet (very common scenario), then the menu item in question would simply be missing from the localized version (no fallback to the default version, since the string array represents the whole array). This can potentially result in serious functional issues in the localized versions of the software.

Instead, the examples should show array items as references to resources.
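To make the failure mode concrete, here is a minimal Java model of the two approaches. It is not the Android resource API: the class and method names (`ArrayFallbackSketch`, `resolveWholeArray`, `resolveItemReferences`) and the resource keys are hypothetical, and the lookup logic is a simplification that assumes a localized array replaces the default array as a unit, while individually referenced strings fall back one by one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model (not the Android API) of whole-array vs. per-item lookup.
class ArrayFallbackSketch {

    // Whole-array lookup: a localized array, if present, replaces the default entirely.
    static List<String> resolveWholeArray(List<String> defaultArray, List<String> localizedArray) {
        return localizedArray != null ? localizedArray : defaultArray;
    }

    // Per-item lookup: the array holds only keys, and each key falls back on its own.
    static List<String> resolveItemReferences(List<String> keys,
                                              Map<String, String> defaultStrings,
                                              Map<String, String> localizedStrings) {
        List<String> result = new ArrayList<>();
        for (String key : keys) {
            String localized = localizedStrings.get(key);
            result.add(localized != null ? localized : defaultStrings.get(key));
        }
        return result;
    }

    public static void main(String[] args) {
        // The default array gained "Export"; the Danish array has not been updated yet.
        List<String> defaultArray = Arrays.asList("Save", "Close", "Export");
        List<String> danishArray = Arrays.asList("Gem", "Luk");
        System.out.println("Whole array: " + resolveWholeArray(defaultArray, danishArray));

        // Same situation, but the array only lists keys of individual strings.
        List<String> keys = Arrays.asList("menu_save", "menu_close", "menu_export");
        Map<String, String> defaults = new HashMap<>();
        defaults.put("menu_save", "Save");
        defaults.put("menu_close", "Close");
        defaults.put("menu_export", "Export");
        Map<String, String> danish = new HashMap<>();
        danish.put("menu_save", "Gem");
        danish.put("menu_close", "Luk");
        System.out.println("Item references: " + resolveItemReferences(keys, defaults, danish));
    }
}
```

In the first case the third menu item is simply gone from the Danish version; in the second it shows up in English until the translation catches up. In an Android resource file the second approach corresponds to array items written as references such as `@string/menu_export` (a made-up name) rather than as literal text.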


Install anywhere yes – but prepare for a slightly bumpy translation ride December 6, 2008

Posted by globalizer in Language, Locales, Localization, Programming practices.

First, the good stuff: InstallAnywhere is actually a nifty product which allows you to create a good install program in almost no time. Kudos to the team behind it for that.

Now for the not-so-good stuff: translation

Here I am not referring to the translations provided by InstallAnywhere out of the box. They provide very good language coverage (I count 31 languages/language variants in the 2008 VP1 Enterprise Edition, including English), and the quality of the translations themselves also seems fine in this version (some of the early InstallShield translations into certain languages were rather unfortunate) [1].

So, no complaints there. The trouble starts when you need to modify or customize anything related to the translations.

First problem

All new or updated text strings that need to be translated are inserted in the custom_en file, which already contains all the out-of-the-box translation strings. There is no option to choose a separate translation file for custom strings. This means that anybody using a modern translation tool with translation memory features will have to re-translate the entire InstallAnywhere GUI even if they only modify a single string (because such tools use the English source file as the starting point for all translation). Cut-and-paste from the existing translation files can make that job faster, or you may be able to create a translation memory from a set of source and target files as described below, but whichever method you choose, a significant workload is involved.

InstallAnywhere does update all the language versions of the custom_xx files along with the English version, with the difference that only the comment line for each string is updated in the translated versions (the custom_en file contains a comment line with a copy of each translation string). After an update, the English and Danish versions look like this respectively:

# ChooseInstallSetAction.368876699cf1.bundlesTitle=Choose Install Setشقشلاهؤ
ChooseInstallSetAction.368876699cf1.bundlesTitle=Choose Install Setشقشلاهؤ

# ChooseInstallSetAction.368876699cf1.bundlesTitle=Choose Install Setشقشلاهؤ
ChooseInstallSetAction.368876699cf1.bundlesTitle=Vælg installationssæt

This seems to indicate that the designers of the product have not understood how modern translation tools work. Indeed, the detailed help indicates that the creators of the application assume the translated versions will all be created or edited via the IA designer. This feature would have been extremely useful 20 years ago, since it retains existing translations and allows translators to just go through the translated file and modify the strings where they see an update. Today, however, it is a terrible hindrance (except for translators still working without modern tools).

With today’s tools, translators who need to bring a translated file up to the same level as an updated English file simply take the new English source file and run it through their translation memory tool. That tool automatically translates any unchanged strings and presents the translator with just those strings that are either new or changed. For this to work, the translation memory has to contain the unchanged strings, of course, and that is where the InstallAnywhere model breaks down. With some tools it is possible to create “fake” translation memories on the basis of existing source and target files, but it is a rather time-consuming process, and by no means error-free.

The easy fix would be to at least make it an option to store any customized strings in a separate translation file. Since InstallAnywhere allows users to change existing strings, this of course introduces the question of what to do with the existing strings in the custom_en and translated versions of that file.

I believe the best solution would be to delete such strings from the custom_en file and the translated versions (in other words, those files would only contain the strings that were unchanged from the out of the box version). The changed strings would instead be inserted in the “new” translation file.
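A rough sketch of that split, assuming you have both the out-of-the-box custom_en file and the customized one available as Java `Properties`. This is not an InstallAnywhere feature; `CustomStringSplitter` and `changedOrNew` are hypothetical names, and apart from the key shown earlier the keys and values are invented for illustration.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper: isolates customized strings so that only they need translation.
class CustomStringSplitter {

    // Returns the entries of 'current' that are new or differ from 'baseline'.
    static Map<String, String> changedOrNew(Properties baseline, Properties current) {
        Map<String, String> delta = new TreeMap<>();
        for (String key : current.stringPropertyNames()) {
            String now = current.getProperty(key);
            if (!now.equals(baseline.getProperty(key))) {
                delta.put(key, now);
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        // Out-of-the-box strings (values invented for the example).
        Properties baseline = new Properties();
        baseline.setProperty("ChooseInstallSetAction.368876699cf1.bundlesTitle", "Choose Install Set");
        baseline.setProperty("InstallAction.title", "Installing");

        // The same file after customization: one string changed, one string added.
        Properties current = new Properties();
        current.putAll(baseline);
        current.setProperty("ChooseInstallSetAction.368876699cf1.bundlesTitle", "Choose Components");
        current.setProperty("LicensePanel.title", "License Server");

        // Only these entries would go into the separate translation file.
        changedOrNew(baseline, current).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The unchanged `InstallAction.title` stays out of the new file, so a translation memory tool would only ever be handed the two strings that actually need work.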

Second problem – probably a minor one

There does not really seem to be an option for adding a language that is not in the list of languages provided out of the box. At least I don’t see it in the designer, and I haven’t found any information in the knowledgebase, fora, etc. With the number of languages supported, this may not be a major issue, but it would be nice to have the option.

[1] Some of those early versions had eerily bad Danish and Norwegian translations which looked like an amalgam of Danish, Norwegian and Swedish. This old thread from an InstallShield forum may shed some light on how that happened (note also how the InstallShield support guy keeps suggesting that a Dutch version be used, when the user is looking for Danish…). But, as noted above, the current translations seem fine. I looked at the Danish version (the only one I am really competent to judge), and I have no complaints whatsoever, so I believe that the early problems have been overcome completely.

One laptop per child November 14, 2007

Posted by globalizer in global access, Language, Localization.

Just a quick link to the olpc wiki to highlight that the core languages for the project are quite a bit different from the sets of languages that mainstream software localization usually targets, simply because they aim at providing coverage in developing countries.

I am sure that a few of those languages will pose some interesting challenges.

Unfortunately the wiki does not provide a very clear picture of the status of the various languages – exactly how much of the user interface is currently translated, for instance.

Give one, get one November 13, 2007

Posted by globalizer in global access, Localization.

A laptop, that is. You have just 2 weeks to get your hands on the very cool XO laptop produced for the One laptop per child project. And at the same time donate a laptop to a child in a developing country.

I saw a laptop prototype at the 30th Internationalization and Unicode conference where Nicholas Negroponte was the keynote speaker, and it was definitely a very different experience from your run-of-the-mill laptop, but also definitely one I would like to explore more. For one thing, if the connectivity works, it would be the perfect travel laptop – TSA could do their damnedest to it during security checks, and it would still come through unscathed.

And Negroponte was a hugely entertaining and engaging speaker, btw.

So now I am just waiting for the mailman…

Translation quality in the crosshairs July 3, 2007

Posted by globalizer in Localization, Translation, Unicode.

During the last few days there has been a bit of a brouhaha on the otherwise rather quiet CLDR mailing list over so-called “Google data” introduced during the 1.5 update process.

I had not been aware of the introduction of this data, but from the mailing list descriptions it sounds as if Google submitted a load of locale data translations for a slew of different locales, and that this data therefore automatically gets the “Google default vote” in the vetting process. This “default vote” means that it is actually not that easy to reject the new translations, even with a lot of other “vetters” agreeing. As Mark Davis explained:

Default data is given a default vote from the organization (in this case, Google), which means that it is ignored *if* there is another vote from that source. But otherwise it gets weighted as if it were a vote from the organization.

So unless Google specifically votes against its own data during the vetting process, the “default vote” counts as a regular Google vote in favor of the new version.

This would not necessarily be a major problem, if the new translations were more or less correct. But this is where the fun starts. Because apparently quite a few of the translations submitted by Google are not quite up to snuff. To quote a couple of comments:

today’s glance at what the data became left me quite speechless. Mockage of culture, no less.


How do we reject Google data outright? I’m getting rubbish like this in Xhosa.
Portugeuse -> portokugusseee
English -> Isingesi -> should be isiNgesi


The Belarusian locale data had some glaring errors before (cf. my emails to this list), but now it is a *raving mess*, supposedly after the “google tool” working on it. If that’s the way CLDR treats culture-related data, I’m not even going to bother with correcting that.

1. In language names, there’s now quite a quantity of entries written down in a distorted (in-group and non-normative) variant of Belarusian cultivated on Internet.

2. Some entries are plainly somebody’s uneducated invention (e.g., entry for afrikaans), unexplainable even by distortion.

3. To top that, the majority of google entries in languages are grammatically corrupt (wrong declension).

So, it would appear that Google contributed a bunch of translations that do not pass muster with expert native speakers. Mark Davis seems to admit that he is not too confident of the data in this jitterbug:

in cases like Xhosa, an organization may not have vetters working on CLDR that would be able to respond to issues and override the default where necessary. So my suggestion is to weight the vote to a be a Guest vote where the locale is not one of the main coverage locales for the organization.

He then lists the locales that Google would consider its “main coverage locales”, and they are the “usual suspects” (European and main AP languages) with a couple of slightly less mainstream locales such as Tagalog and Ukrainian.
He is not that forthcoming about how Google came up with the translations, but he does insist that they are not the result of machine translation or screen scraping, and that they were “translated by people, who took the English and translated it”.

He also adds these caveats:

Now, in some cases this may have been older data (eg before Zaire changed names) or may have been translated out of context, etc. And there is always the possibility of human error.

All of the above could of course happen on any kind of translation project, but the translations would rarely make it into a publicly released application. So this makes me suspect that the translations are a result of the volunteer translation efforts that Google has made use of in the past. The comments made about the Belarusian translations certainly would fit such an interpretation.

Now, it is certainly up to Google to decide whether they want to use those translations on their own web pages (with all the potential issues that entails, as discussed here), but I find it very disappointing that they would dump them into the CLDR. CLDR has become a very important project, with an important open source project like ICU relying completely on the CLDR data. And when you consider the growing list of major software products that in turn use ICU, loading rubbish data into the repository is really disruptive.

This, in my opinion, is clearly a case where “no translation” is better than “any translation”.

Update July 4:

Apparently I was not the only one with questions about the actual origins of this “Google data”. Here’s an update from the mailing list:

Nobody has said exactly where this data has come from. And I’m afraid this doesn’t clarify it either. My first assumption is that it was from the Google community translation tools, that is the only place I would expect Xhosa data to come from and we translated that. We want to correct bad data, but nobody seems to want to say where this is coming from.

Here is an illustration of the problem we are seeing:

1) blank because we are uncertain of what we want
2) Google dump
3) Correct
4) Now we have a conflict

PluralFormat to the rescue? July 2, 2007

Posted by globalizer in Java, Language, Localization, Translation.

OK, I have been missing in action for a while, I know. A new job (still in IBM, mind you) and a fairly long vacation are my only excuses. Back to business:

Here and here I complained about the localization issues involved in using ChoiceFormat. One of those issues would seem to be addressed by the new PluralRules and PluralFormat API proposal described on the ICU design mailing list recently. PluralRules would allow you to define plural cases for each language, and the numbers those plural cases apply to, while PluralFormat would then allow you to provide message text for each such case. This format would thus be able to handle languages like Russian and Polish, which use more complex plural rules than the ones that can be provided via the simple intervals of ChoiceFormat.
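To illustrate why simple intervals are not enough, here is a small sketch using only the standard `java.text.ChoiceFormat`, next to a hand-written category function for Russian. The function is my own illustration of the commonly cited CLDR rule for non-negative integers, not the proposed ICU API; `PluralSketch` and `russianCategory` are hypothetical names.

```java
import java.text.ChoiceFormat;

// Contrasts interval-based selection with a modulo-based plural rule.
class PluralSketch {

    // Plural category for a non-negative integer in Russian (integer cases only).
    static String russianCategory(int n) {
        int mod10 = n % 10;
        int mod100 = n % 100;
        if (mod10 == 1 && mod100 != 11) {
            return "one";   // 1, 21, 31, ... but not 11
        }
        if (mod10 >= 2 && mod10 <= 4 && (mod100 < 12 || mod100 > 14)) {
            return "few";   // 2-4, 22-24, ... but not 12-14
        }
        return "many";      // 0, 5-20, 25-30, ...
    }

    public static void main(String[] args) {
        // ChoiceFormat can only map contiguous intervals to message variants.
        ChoiceFormat intervals = new ChoiceFormat("0#files|1#file|2#files");

        int[] samples = {1, 2, 5, 11, 21, 22, 25};
        for (int n : samples) {
            System.out.println(n + ": interval choice = " + intervals.format(n)
                    + ", Russian category = " + russianCategory(n));
        }
    }
}
```

For English the three intervals are enough. For Russian, 21 needs the same form as 1 and 22 the same form as 2, so an interval pattern would have to enumerate ranges indefinitely, which is exactly the gap a rule-based API is meant to close.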

It is of course a step forward that the API will now allow you to actually define something that will work for (all?) languages. As far as I can see, however, we will actually take a step backward with respect to the other problem: the format will be even more difficult for translators to handle.

According to the API proposal,

It provides predefined plural rules for many locales. Thus, the programmer need not worry about the plural cases of a language. On the flip side, the localizer does not have to specify the plural cases; he can simply use the predefined keywords. The whole plural formatting of messages can be done using localized patterns from resource bundles.

If this is really true, then the programmer will write a resource bundle that implements the US English keywords (in most cases, anyway), and it will be up to the localizer to know the PluralRules keywords that are defined for her language, and to implement them correctly in the localized resource bundle.

This comment on the mailing list to the proposal would seem to be an understatement:

Separating the rules from Plural Format helps some here, but translators will still have to be able to write the PluralFormat syntax, which is about as complicated as the ChoiceFormat syntax.

I think my ChoiceFormat advice will extend to the new API for the time being: don’t use it.

So you think that charset= value is going to help you? March 8, 2007

Posted by globalizer in encoding, Language, Localization, web applications.

I just ran across a real-life example of the pitfalls involved in relying on the charset value in HTML pages to tell you what the encoding is.

It is not unreasonable, of course, to assume that the encoding of the text actually corresponds to the value specified in that tag; it is just not very realistic.

In their presentations at the most recent Unicode conference, both Mark Davis from Google and Addison Phillips from Yahoo highlighted the fact that so many pages are either untagged or mistagged.

A recent question on Sun’s Java i18n forum about an “incorrect” conversion from gb2312 into Unicode made me suspicious, and wouldn’t you know it, the culprit was an incorrect charset value. This Chinese page is tagged as being encoded in gb2312, but it seems to really be in GBK. What makes this slightly tricky is that in a text using everyday Chinese the differences between those two encodings would be minimal – very few characters would be in GBK that would not be in gb2312. So a scan to detect the actual encoding might not even have made any difference.
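The effect is easy to reproduce in Java. The sketch below assumes a JDK that ships the `GBK` and `GB2312` charsets (they live in the extended charset set), and it uses U+9555 as my own example of a character that GBK can encode and GB2312 cannot; the forum question may well have involved a different character. `CharsetMismatchSketch` and `survives` are hypothetical names.

```java
import java.nio.charset.Charset;

// Shows what happens when GBK bytes are decoded with the declared charset gb2312.
class CharsetMismatchSketch {

    static final Charset GBK = Charset.forName("GBK");
    static final Charset GB2312 = Charset.forName("GB2312");

    // True if text encoded with 'actual' still reads correctly when decoded as 'declared'.
    static boolean survives(String text, Charset actual, Charset declared) {
        byte[] bytes = text.getBytes(actual);
        return new String(bytes, declared).equals(text);
    }

    public static void main(String[] args) {
        String everyday = "\u4e2d\u6587"; // U+4E2D U+6587, present in both encodings
        String rare = "\u9555";           // U+9555, present in GBK only

        System.out.println("everyday text survives: " + survives(everyday, GBK, GB2312));
        System.out.println("rare character survives: " + survives(rare, GBK, GB2312));
        System.out.println("GB2312 can encode U+9555: " + GB2312.newEncoder().canEncode('\u9555'));
    }
}
```

Everyday text comes through unharmed because GBK keeps the GB2312 byte values unchanged, which is why a mistagged page can look fine for a long time before one character breaks.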

Flex developers sure are agile though… March 2, 2007

Posted by globalizer in Locales, Localization, Programming languages.

I’m impressed to see a comment from a Flex developer the day after I posted about the localization support. And even more impressed to see that the next version will actually provide exactly the kind of support I was looking for.

If that kind of attentiveness is an indication of their support in general, then I give them a thumbs up…

Flex – not so flexible, it seems March 1, 2007

Posted by globalizer in Locales, Localization, Programming languages.

I’m taking responsibility for the globalization and localization of some new applications, and just trying to figure out what they do, the programming platform, file types, etc. It turns out that one of them is apparently using Adobe Flex to provide a rich web client interface (yes, I know – sounds like a marketing blurb).

I have to admit that until a few days ago I had not even heard of Flex, but I guess I will have to dig into it. My first discoveries are not promising, but also not unexpected: when was the last time you saw a new technology introduced which had good, built-in support for localization right from the get-go?

If I am reading the documentation and the fora correctly, then you have to compile a separate SWF file per locale, and there is no built-in way of selecting the appropriate SWF file based on a locale parameter. Um, that’s like creating a separate executable per language, right? You compile your translation files into your source code, so if you want to change your source, you have to recompile 26 language versions? Isn’t that so, what should I say, ’80s?

I may be unfair here, simply because I have not delved deeply enough into the documentation, and I sure would love to be proven wrong.

Property resource bundle encoding February 8, 2007

Posted by globalizer in Java, Localization, Translation, Unicode.

Just yesterday I had a brief discussion with a development team about the encoding of Java properties files. The development team had encoded the English source files in ISO-8859-1, using various characters outside the range of the lower 128 ASCII set (sometimes referred to as “US ASCII”, or just ASCII, since they are the characters from the original ASCII character encoding) in the files.

When I told the team that for translation purposes they needed to either remove the characters beyond the lower 128 set, or convert the files to UTF-8 if they absolutely needed those characters, they complained and pointed to the Java documentation here:

When saving properties to a stream or loading them from a stream, the ISO 8859-1 character encoding is used. For characters that cannot be directly represented in this encoding, Unicode escapes are used; however, only a single ‘u’ character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.

I can see how this could easily lead to a belief that property resource bundles have to be encoded in 8859-1, and also a belief that such files should be able to contain any character from that set.

There are actually two distinct issues involved here:

  1. The encoding of and specific characters used in translation files, and how those items impact translatability
  2. The Java specification on property resource bundle encoding, and what exactly that means with respect to the encoding of such files

With respect to translatability, the problem with 8859-1 is that it can represent only the Latin-1 character set (Western European languages, more or less), so if you need to translate into other languages like Chinese, Thai, Hebrew, etc., then the non-ASCII characters will not be represented correctly in the code pages used by those other languages. And since our translation tools work by using the English source file as the starting point, we need the source file to either contain only the ASCII set that is common across code pages, or to be encoded in Unicode.

The easiest solution is usually to simply use only the ASCII set in the source files (eliminate “fancy quotes”, for instance). If you do that, your source files can be used across legacy code pages, and they can also masquerade as UTF-8 files (since the ASCII set is unchanged in UTF-8).

If for some reason you really really like those fancy quotes (or maybe need to use a copyright sign), then you need to convert your source file to Unicode, and then only update the files using a Unicode-enabled editor – an editor that is not Notepad, by the way, or any other editor that automatically inserts a BOM in UTF-8 files.
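Why the BOM matters is easy to show with a small sketch. It assumes the file is read with a plain UTF-8 `Reader`, which as far as I know does not strip a leading BOM in Java; `BomSketch`, `withBom` and `loadUtf8` are hypothetical names, and the key and value are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Demonstrates how a UTF-8 BOM ends up inside the first key of a properties file.
class BomSketch {

    // Simulates an editor that writes the bytes EF BB BF in front of the content.
    static byte[] withBom(String content) {
        byte[] body = content.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    // Loads properties from UTF-8 bytes without any special BOM handling.
    static Properties loadUtf8(byte[] fileBytes) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = loadUtf8(withBom("title=Hello\nfooter=Bye\n"));
        // The BOM (U+FEFF) has become part of the first key, so the lookup fails.
        System.out.println("title  -> " + props.getProperty("title"));
        System.out.println("footer -> " + props.getProperty("footer"));
    }
}
```

Only the first key in the file is affected, which makes this kind of bug easy to overlook in testing.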

The second issue is the encoding Java uses to read and save properties. The way the documentation is worded can easily cause confusion (and it is of course totally crazy that this was ever designed to use 8859-1 in the first place). The only thing it means, however, is that files encoded in 8859-1 can be read as-is, without any conversion. Files containing characters from other character sets have to be converted to Unicode escape sequences (this can easily be done using native2ascii). After all, it would be impossible to provide Java applications in anything but Western European languages if properties files actually could use only the 8859-1 character set.

Update: With Java 6 you can use PropertyResourceBundle constructed with a Reader, which can use UTF-8 directly.
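Here is a minimal sketch of the three cases, using in-memory bytes instead of real files. `PropertiesEncodingSketch`, `viaStream` and `viaReader` are hypothetical names, and the key `greeting` with the Danish value is just an example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.PropertyResourceBundle;

// Compares stream-based loading (ISO 8859-1 plus escapes) with Reader-based loading.
class PropertiesEncodingSketch {

    // Properties.load(InputStream) decodes the bytes as ISO 8859-1 and resolves escapes.
    static String viaStream(byte[] fileBytes) {
        try {
            Properties props = new Properties();
            props.load(new ByteArrayInputStream(fileBytes));
            return props.getProperty("greeting");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // PropertyResourceBundle(Reader), available since Java 6, lets the caller pick UTF-8.
    static String viaReader(byte[] fileBytes) {
        try {
            PropertyResourceBundle bundle = new PropertyResourceBundle(
                    new InputStreamReader(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8));
            return bundle.getString("greeting");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A file as native2ascii would produce it: pure ASCII with a Unicode escape.
        byte[] escaped = "greeting=V\\u00e6lg\n".getBytes(StandardCharsets.US_ASCII);
        // The same content saved directly as UTF-8.
        byte[] utf8 = "greeting=V\u00e6lg\n".getBytes(StandardCharsets.UTF_8);

        System.out.println("escaped file, stream load: " + viaStream(escaped));
        System.out.println("UTF-8 file, reader load:   " + viaReader(utf8));
        System.out.println("UTF-8 file, stream load:   " + viaStream(utf8)); // garbled
    }
}
```

The first two calls return the same string; the third shows the mojibake you get when a UTF-8 file is pushed through the 8859-1 code path.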