
What’s wrong with ChoiceFormat? (Lost in translation – part IV)

January 12, 2007

Posted by globalizer in Java, Localization, Translation.

Time to address one of those truly “difficult to translate” constructions that are inherent in some programming APIs: strings using ChoiceFormat.

There are two main problems with ChoiceFormat related to translation/localization:

  1. The limitations of ChoiceFormat mean that it does not actually work correctly for all languages, unless you create a lot of custom code. And to do that, developers would have to actually understand the syntactical requirements of all their potential target languages – not an easy requirement to meet, even for professional linguists.
  2. Localizers often don’t understand how to handle strings using ChoiceFormat constructions, even the simpler ones. Since software translators are usually people with a language background, rather than a technical one, this is not surprising (and you really need the language expertise in order to get reasonable translations).

Both of these issues are discussed in this ICU bug (originally written to cover the problems specific to Russian number formats and the use of ChoiceFormat). The issue in that defect is the fact that the ChoiceFormat syntax only allows you to specify ranges, not the type of Boolean expression you would need to fully support Russian syntax.
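
To make the range limitation concrete, here is a minimal sketch (my own illustration, not code from the bug report). The class name RangeOnlyDemo and the method describe are made up, and the Russian forms архив/архива/архивов are transliterated as arkhiv/arkhiva/arkhivov to keep the source ASCII-only. The pattern is right for 0 through 20, but every number from 5 upward lands in the last range, so 21 and 22 come out wrong:

```java
import java.text.ChoiceFormat;

public class RangeOnlyDemo {
    // Ascending ranges: 0 -> plural, 1 -> singular, 2..4 -> "few" form, 5 and up -> plural.
    // ChoiceFormat picks the last limit that is <= the number being formatted.
    static final ChoiceFormat ARCHIVES =
            new ChoiceFormat("0#arkhivov|1#arkhiv|2#arkhiva|5#arkhivov");

    static String describe(int n) {
        return n + " " + ARCHIVES.format(n);
    }

    public static void main(String[] args) {
        System.out.println(describe(1));   // 1 arkhiv
        System.out.println(describe(3));   // 3 arkhiva
        System.out.println(describe(5));   // 5 arkhivov
        System.out.println(describe(21));  // 21 arkhivov (Russian needs "21 arkhiv")
        System.out.println(describe(22));  // 22 arkhivov (Russian needs "22 arkhiva")
    }
}
```

A rule such as "ends in 1, but not in 11" has no upper bound, so no finite list of ascending ranges can express it.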

One suggested solution would put the complex logic required for each target language into the translatable files themselves. For Russian, it would look something like this:

form.applyPattern(" {0, number, integer}{0,choice, ({0} < 2) ({0} % 10 == 1 && {0} > 20) ? архив извлекал | ({0} >= 2 && {0} <= 4) ({0} > 20 && {0}%10 >=2 && {0}%10<=4) ? архива извлекали | ? архивов извлекали}.");

I couldn’t agree more with the opinion expressed by Deborah Goldsmith (in arguing against this “fix”):

…one issue that came up is that our localizers can barely cope with format strings that reverse the order of arguments. Expecting them to write complicated logic in a format string is probably asking too much. [1]

I think that the example below would cause most translators trouble, for instance (and this is a fairly simple example):

There {1, choice, 0#are no files|1#is one file|1<are {1, number, integer} files} in {0}.

With the string above in a resource bundle, and this code:

import java.text.MessageFormat;
import java.util.ResourceBundle;

ResourceBundle bundle = ResourceBundle.getBundle("ChoiceBundle");
String pattern = bundle.getString("file_message");
String dir = bundle.getString("dir");
MessageFormat mymsgfmt = new MessageFormat(pattern);
Object[] messageArgs = {dir, null};
// Format the message once for each file count from 0 through 7
for (int fileNum = 0; fileNum < 8; fileNum++) {
    messageArgs[1] = Integer.valueOf(fileNum);
    System.out.println(mymsgfmt.format(messageArgs));
}

we would get output like the following:

There are no files in my_dir.
There is one file in my_dir.
There are 2 files in my_dir.
There are 3 files in my_dir.
There are 4 files in my_dir.
There are 5 files in my_dir.
There are 6 files in my_dir.
There are 7 files in my_dir.

With good comments in the source file, translators with a good deal of experience in software translation (and testing) would be able to handle this example, provided their language uses syntax and number agreement very close to English. On a typical software translation project, however, at least half of the translators will not fall into that category, so you will end up with:

  • a number of questions about how to translate the string (if you are lucky)
  • a number of defects found during localization testing (if you are lucky/have 100% coverage in your test cases)
  • users seeing incomprehensible and/or misleading messages when they use your product

And let’s examine an example of a language that deviates somewhat from the English syntax – an invented example in this case, but one that illustrates the issues:

Let’s say that this language uses a different verb form when combined with numbers between 2 and 5, and that it also requires a different word order when the amount is ‘none’ or ‘no’. This would require the translator to:

  1. Add an additional range to the pattern
  2. Change the syntax of the pattern quite a bit

You might end up with something like this (conveniently kept in English):

There {1, choice, 0#are in {0} files none|1#is one file in {0}|1<verb_for_2_to_5 {1, number, integer} files in {0}|5<are {1, number, integer} files in {0}}.

which would produce output like the following:

There are in my_dir files none.
There is one file in my_dir.
There verb_for_2_to_5 2 files in my_dir.
There verb_for_2_to_5 3 files in my_dir.
There verb_for_2_to_5 4 files in my_dir.
There verb_for_2_to_5 5 files in my_dir.
There are 6 files in my_dir.
There are 7 files in my_dir.
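
For anyone who wants to try this without setting up a resource bundle, here is a small self-contained sketch with the pattern inlined. The class name InventedLanguageDemo and the method message are placeholders of mine, and Locale.ENGLISH is fixed only so the number formatting does not depend on the machine. It also shows the boundary detail the translator has to get right: "1<" and "5<" mean strictly greater than 1 and 5, which is why 5 still takes the 2-to-5 verb form.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class InventedLanguageDemo {
    // Same pattern as in the text above, split across lines for readability.
    static final String PATTERN =
            "There {1, choice, 0#are in {0} files none"
            + "|1#is one file in {0}"
            + "|1<verb_for_2_to_5 {1, number, integer} files in {0}"
            + "|5<are {1, number, integer} files in {0}}.";

    static String message(int fileCount, String dir) {
        // The text chosen by the choice element still contains {0} and {1};
        // MessageFormat formats it again with the same arguments.
        MessageFormat format = new MessageFormat(PATTERN, Locale.ENGLISH);
        return format.format(new Object[] {dir, fileCount});
    }

    public static void main(String[] args) {
        for (int fileNum = 0; fileNum < 8; fileNum++) {
            System.out.println(message(fileNum, "my_dir"));
        }
    }
}
```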

I would never want to require translators to perform that kind of programming, so I for one would hate to have to explain the suggested Russian format above on a localization project. Translators can tell you what the rules are for plurals in their target language, but they cannot convert those rules to Java code.

On the other hand, the other suggested solution – having developers write custom code for each supported language – has other, obvious problems (also pointed out in the discussion in the defect): it requires developers to actually know what the linguistic problems are for each language, and how they can be alleviated, plus it would require code updates for each new “problem” language that you want to support.
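
To give an idea of what that per-language code looks like, here is a rough sketch for Russian only. It is an illustration under my own assumptions, not code from the ICU discussion: the names RussianPluralSketch, Category and categoryFor are hypothetical, the rule covers non-negative integers only, and the word forms are transliterated (arkhiv/arkhiva/arkhivov for архив/архива/архивов).

```java
public class RussianPluralSketch {
    // Hypothetical category names for the three Russian forms.
    enum Category { ONE, FEW, MANY }

    // Assumes n >= 0. Numbers ending in 1 (but not 11) take the singular,
    // numbers ending in 2-4 (but not 12-14) take the "few" form,
    // and everything else takes the plural form.
    static Category categoryFor(int n) {
        int mod10 = n % 10;
        int mod100 = n % 100;
        if (mod10 == 1 && mod100 != 11) {
            return Category.ONE;
        }
        if (mod10 >= 2 && mod10 <= 4 && (mod100 < 12 || mod100 > 14)) {
            return Category.FEW;
        }
        return Category.MANY;
    }

    static String archives(int n) {
        switch (categoryFor(n)) {
            case ONE:
                return n + " arkhiv";
            case FEW:
                return n + " arkhiva";
            default:
                return n + " arkhivov";
        }
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 5, 11, 12, 21, 22, 25, 101, 111}) {
            System.out.println(archives(n));
        }
    }
}
```

The logic is short once you know the rule, but somebody has to know it, and an equivalent method (with a different set of categories) would be needed for every other language that does not fit the simple ranges.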

So the dilemma is this: You either create resource bundles with strings that essentially leave it up to the translator to code the specific language behavior, or you have the developer write code that covers the syntax of every language in the world.

No wonder that the two-year-old ICU bug is still open…

What is a poor programmer to do then, if ChoiceFormat is so riddled with problems? (I hear you ask). Stay tuned, I’ll get back to that.

Note 1: In all fairness, the author of the proposal did mention that the logic to create the correct language versions should be contained in the localization software; the only problem with that is that no translation software has anything even remotely resembling the kind of natural language processing capabilities that this would require.

Comments»

1. How to use ChoiceFormat then? (Lost in translation - part V) « Musings on software globalization - January 30, 2007

[…] January 30, 2007 Posted by globalizer in Java, Translation, Localization. trackback Back in this post I said I would get back to what you can actually do to work around the limitations of […]

2. PluralFormat to the rescue? « Musings on software globalization - July 2, 2007

[…] Here and here I complained about the localization issues involved in using ChoiceFormat. One of those issues would seem to be addressed by the new PluralRules and PluralFormat API proposal described on the ICU design mailing list recently. PluralRules would allow you to define plural cases for each language, and the numbers those plural cases apply to, while PluralFormat would then allow you to provide message text for each such case. This format would thus be able to handle languages like Russian and Polish, which use more complex plural rules than the ones that can be provided via the simple intervals of ChoiceFormat. […]

3. Steven R. Loomis - October 28, 2010

I think we will close the ‘two year old bug’, as ‘fixed’ by plurals, now past its fifth birthday!

4. globalizer - October 28, 2010

Heh – I didn’t realize it was still open 🙂

5. Ron - May 2, 2013

Wow, this post is nice, my younger sister is analyzing these kinds of things, thus I am going to tell her.

