
Kudos to Google: filling huge i18n gap October 25, 2010

Posted by globalizer in Android, Internationalization, Java, Locales.

I’ve been a little harsh on Google in some previous posts, so I’m happy to have some good – no, make that great – Google news from the recent IUC34 conference. Albeit a little late compared to the tweeting others have done 🙂

Even though the internationalization community has made great progress towards more, and more uniform, locale data with the wide acceptance of the CLDR in recent years, we have been left with two big gaping holes: phone numbers and postal addresses. Until now it has been practically impossible to implement correctly internationalized phone number and address formatting, parsing, and validation, since the data and APIs have been unavailable.

Depending on the level of globalization awareness of the companies involved, this has resulted in implementations falling into one of three broad categories:

  1. Address and phone number fields are hard-coded to work for only one locale and reject everything else as invalid. This usually takes the form of making every single field required and validating every single field, even on web sites where the address and/or phone number of the user is not actually important (such as purchases of non-restricted software delivered electronically, or web sites that require registration merely to enable targeted ads). This of course also results in such companies collecting an amazing amount of absolute garbage. For instance, if you make “Zip code” a required field and validate it against a list of US zip codes, you end up with an amazing percentage of your users living in the 90210 area – simply because that is the one US zip code people living outside the US have had drilled into them via exposure to the TV show.
  2. Support for a limited number of countries/regions (limited by the number of regions you have the bandwidth to gather data for and implement support for – with each company reinventing the wheel every time, for every country)
  3. No validation (provide the user with one single address field, and assume that if the user wants you to be able to reach him or her at that address, he/she will fill it in with good data)

As described in the IUC34 session (by Shaopeng Jia and Lara Rennie), collecting reliable and complete data to fill these holes was a major task (it’s no coincidence that nobody has done it before…):

  • There is no single source of data
  • Supposedly reliable data (ITU and individual country postal/telephone unions) turns out to be unreliable (data not updated, or new schemes not implemented on time)
  • Formats differ widely between countries/regions
  • Some countries even lack clear structure
  • Some countries (e.g., UK) use many different formats
  • Some countries use different formats depending on language/script being used
    • Chinese/Japanese/Korean addresses – start with biggest unit (country) if using ideographic script, but with smallest unit (street) if using Latin script

I have looked at these issues a few times in the past, and each time the team decided that we didn’t really need this information (translation: there was no way in hell we were going to be able to get the manpower to gather the information and implement a way to process it). Since Google does in fact have a business model that makes it very important to be able to parse these elements and format them correctly for display (targeted ads and Android, to name a couple of cases), it makes sense that they bit the bullet.

They deserve a lot of kudos, however, for also going ahead and open-sourcing both the data and the APIs that came out of that major undertaking.

Check it out:

According to the IUC34 presentation, the phone number APIs will let you format and validate numbers for 184 regions, and parse numbers from all regions. The address APIs provide detailed validation for 38 regions, and layout plus basic validation for all regions.
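
To give a flavor of the phone number APIs, here is a minimal sketch using the open-sourced Java library (the class and method names are libphonenumber's; the number is just an illustrative US number):

import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

public class PhoneNumberDemo {
    public static void main(String[] args) throws NumberParseException {
        PhoneNumberUtil util = PhoneNumberUtil.getInstance();
        // Parse a number typed the way a US user would type it
        PhoneNumber number = util.parse("(650) 253-0000", "US");
        // Validate against the metadata for the US region
        System.out.println(util.isValidNumber(number));
        // Format for international display: +1 650-253-0000
        System.out.println(util.format(number, PhoneNumberFormat.INTERNATIONAL));
    }
}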

OK, so it WAS broken, and SHOULD have been fixed… November 7, 2008

Posted by globalizer in Java, Locales.

Here I complained about a seemingly wrongheaded change in the Java locale look-up algorithm. It turns out that the only change was to fix incorrect API docs. The original documentation described what everybody seems to agree is the desired behavior, but the APIs apparently were never implemented in that way. So, the docs were changed to describe the actual fallback behavior.

As described in the forum post linked above, in JDK 6 you can now get rid of the undesired behavior with ResourceBundle.Control.getNoFallbackControl().
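
A minimal sketch of what that looks like (the bundle base name “Messages” is hypothetical):

import java.util.Locale;
import java.util.ResourceBundle;
import java.util.ResourceBundle.Control;

public class NoFallbackDemo {
    public static void main(String[] args) {
        // "Messages" is a hypothetical bundle base name; assumes
        // Messages.properties, Messages_da.properties, etc. on the classpath
        ResourceBundle bundle = ResourceBundle.getBundle(
                "Messages",
                new Locale("da", "DK"),
                Control.getNoFallbackControl(Control.FORMAT_DEFAULT));
        // With this control, a missing da_DK (and da) bundle goes straight
        // to the base bundle instead of detouring through the JVM's
        // default locale
        System.out.println(bundle.getString("greeting"));
    }
}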

It is still a mystery to me why the API was not changed to get the behavior that everybody agrees is more sensible, though. I certainly understand that API changes should not be undertaken lightly, since you risk breaking existing implementations, but in this case I have a hard time seeing how anybody could be relying on the existing behavior (and any such cases would be easy to fix).

In any case, my previous post was wrong – I didn’t have all the old JDKs available to test, and relied on the documentation instead.

Don’t know whether to laugh or to cry… May 5, 2008

Posted by globalizer in Java, Locales.

In this post I complained about the change to the Danish date format that Sun implemented in Java 1.5. When the bug report that tried to get the change reverted was rejected in 2007, I figured the battle was lost:

After a long investigation within the supplied URLs,
asking danish native speakers I realized it is common
to use current date format.
Closing the CR, will not fix.
Posted Date : 2007-10-16 09:37:17.0

However…

I now see that in Java 1.6 (1.6.0_05) Sun has in fact changed the short date format for the Danish locale back to the way it was in Java 1.4. Verified by testing, and also by comparing the Danish locale formats specified in LocaleElements_da.java and FormatData_da.java respectively.
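
The test itself is trivial – something like this prints the short pattern for da_DK on whatever JDK you run it on:

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DanishShortDateCheck {
    public static void main(String[] args) {
        DateFormat df = DateFormat.getDateInstance(
                DateFormat.SHORT, new Locale("da", "DK"));
        // Prints the short date pattern for da_DK on the running JDK
        // (dd-MM-yy on 1.4 and again on 1.6.0_05; yy-MM-dd on 1.5)
        System.out.println(((SimpleDateFormat) df).toPattern());
        System.out.println(df.format(new Date()));
    }
}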

While I am very happy this was done, I am fairly flabbergasted that somebody from Sun would respond the way they did on Oct. 16, 2007, and then not go back and update the bug to note that it has in fact been fixed (and which update the fix was included in).

A cautionary tale about changing locale defaults November 16, 2007

Posted by globalizer in Java, Locales.

Here I talked about some of the pitfalls involved in using short date formats in user interfaces, and I mentioned that it was occasioned by an issue involving a change in the Danish date format in Java between JDK 1.4 and 1.5.

The change has its origin in this bug report filed back in 2003 against Java 1.4. The original poster muddied the waters from the beginning by stating that “The Danish Standard is, basically, the same as the international standard (yyyy-mm-dd)”. That is certainly a debatable statement – since the official Danish body providing guidance in this area (Dansk Sprognævn) at most states that you can choose between the international standard and the traditional Danish date format (day-month-year).

Evil short date formats November 15, 2007

Posted by globalizer in Java, Locales.

I was recently drawn into a discussion about a Danish date format being used in one of our applications. The specific issue was related to a change in the Danish date format in the JDK between 1.4 and 1.5, but I’ll get back to that in a subsequent post. For now, I just want to take a quick look at the abomination that “short” date formats are.

I am talking about formats like dd-MM-yy, MM-dd-yy and yy-MM-dd that yield dates like this:

06-09-07

Without context, it is really hard to tell whether that date refers to September 7, 2006, September 6, 2007 or June 9, 2007. Somebody born and raised in the US would probably say it means June 9, 2007, while most Europeans would interpret it as September 6, 2007.

In the application we are talking about, it actually refers to September 7, 2006, however – probably most people’s last guess in this case, and a confusion that could have been easily avoided if the developer had used the “medium” date format in Java instead of the short format:

2006-09-07

What we have here is actually the international ISO 8601 date format, which orders the elements from largest to smallest. It is a format that is recommended for use on the web, and one that I commonly recommend for use in logs, for instance.
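
If you don't trust the platform's locale formats, an explicit pattern is a one-liner (a minimal sketch):

import java.text.SimpleDateFormat;
import java.util.Date;

public class IsoDateFormat {
    public static void main(String[] args) {
        // Explicit ISO 8601 pattern: four-digit year, largest unit first
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(iso.format(new Date())); // e.g. 2007-11-15
    }
}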

So, for any platform where the short date format uses only 2 digits for the year, the recommendation would be:

Do not ever use the short date format!

That would at least identify the year in unmistakable terms, even though it would not necessarily do anything about the potential day/month confusion.

Which leads me to my second recommendation:

Unless you produce locale-sensitive UIs, and you can be sure the user expects the locale-dependent format, use the ISO 8601 date format.

I will get back to the specific problem with the date format change for Danish in the JDK. It illustrates how much havoc you can cause by introducing changes in this area, and also how difficult it is to establish locale standards – simply because there is disagreement within the target user population about what the “standard” is.

Can we agree that all those ISO8859_1 hacks are just that – hacks September 26, 2007

Posted by globalizer in Java, Programming languages, Unicode, web applications.

The amount of misunderstanding and misinformation surrounding the use of the ISO-8859-1 character encoding in Java-based web applications (actually not just Java-based, but I will be looking at Java code here) is just incredible, and it keeps proliferating in forum postings and code snippets. Various people (I among them) keep swatting at these hacks, but invariably someone else comes along suggesting how to make them work again.

The original hacks were probably legitimate since they were designed to work around the very unfortunate 8859-1 defaults used for many years in web servers, etc. (and even in parts of Java such as property files). These days, however, there is absolutely no reason to use them, and they will in fact almost inevitably cause problems at some point, when the incorrect code or configuration setting you are relying on is fixed.

Anyway, maybe I should explain which hack I am ranting about, before I go any further:

I am talking about code like this:

this.arabicDesc = new String(paymentTransactionTypeDetailVO.getArabicDesc().getBytes("ISO8859_1"), "UTF8");

This code actually does very little, except pivot through a couple of encodings (this is relevant, but only in explaining how it can be used to “fix” strings that are incorrectly encoded to begin with). The code takes a string, creates a byte array from that string, and then creates a new string from the byte array.

Such code keeps getting posted on the Sun Java I18N forum (and all kinds of other fora), with statements like this (a composite paraphrase):

This code has been working perfectly to handle data in Russian (or Greek, or Arabic, etc.), but now that I have to do x, y or z with the data, the data appears garbled – please help.

So what is the problem, I hear you ask? The problem is that the only reason code like that “works” is that somewhere else, either in the code or in the environment (web server, Tomcat, etc.), ISO-8859-1 encoding is being used or defaulted to.

Let me explain. Assume that we have a JSP-based web application, where we want to input some Arabic text. The text will be stored in a database encoded in UTF-8, and then retrieved and displayed in the JSPs again. The forum question that I link to above, and where I got the code snippet, states that the code is used when saving data to the database, and that the code below is used when retrieving data for display (and that UTF-8 is otherwise used everywhere):
paymentTransactionTypeDetailVO.setArabicDesc(new String(arabicDesc.getBytes("UTF8"), "ISO8859_1"));
Now the problem with this is that it is just not possible to take correctly encoded Arabic data, run it through code like that, and end up with useful data.

Let’s look at a practical example. Let’s say I want to save and re-display this Arabic text:

أتتعامل

I should say up front that I have no idea what it says (copied and pasted from the Arabic translation of What is Unicode, so hopefully it should not be an insult or otherwise offensive).

Since the getBytes method uses a Java string as the starting point and creates a byte array from that string, in the encoding specified, we would have the following situation (assuming the Arabic text had been correctly stored in a Java string to begin with):

This getBytes method
this.arabicDesc = new String(paymentTransactionTypeDetailVO.getArabicDesc().getBytes("ISO8859_1"), "UTF8");
would take the Unicode code points 0623 062A 062A 0639 0627 0645 0644 and try to convert them to the corresponding byte values in the ISO-8859-1 encoding. That fails, however, since those Unicode code points have no equivalents in ISO-8859-1, so Java substitutes a question mark for each character. Once that failure has happened, the string contains question marks from then on (and Java of course has no problem creating the new string from the byte array based on the specified UTF-8 encoding). So all you end up with is a string of question marks – if, that is, the original string was correctly encoded in the first place.
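
You can watch the failure happen in isolation with a few lines of code (a minimal sketch of the conversion just described):

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class QuestionMarkDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The Arabic string, correctly stored as Unicode
        String arabic = "\u0623\u062A\u062A\u0639\u0627\u0645\u0644";
        // None of these code points exist in ISO-8859-1, so getBytes
        // substitutes '?' (0x3F) for every character
        byte[] bytes = arabic.getBytes("ISO8859_1");
        System.out.println(Arrays.toString(bytes)); // [63, 63, 63, 63, 63, 63, 63]
        // Decoding the byte array as UTF-8 succeeds, but the damage is done
        System.out.println(new String(bytes, "UTF8")); // ???????
    }
}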

The reason so many people claim that such code works is that they use incorrectly encoded strings as the starting point. Let’s look at the example above again and assume that the Arabic text was actually stored incorrectly in the original Java string, something that could happen if ISO-8859-1 is used as the default web server encoding, or if you explicitly use code like this (which is what I did to test this scenario):

request.setCharacterEncoding("ISO8859_1");

If we use the same Arabic text above and assume that the UTF-8 code points making up that text are incorrectly assumed to be in ISO-8859-1, then the original Java string would instead be created based on these code points (actually UTF-8, but thought by Java to be 8859-1 code points, because I told it so):

D8 A3 D8 AA D8 AA D8 B9 D8 A7 D9 85 D9 84

resulting in this string: Ø£ØªØªØ¹Ø§Ù?Ù?

If you then use that string as the input to the getBytes method, you get a byte array in 8859-1 that is identical to the original UTF-8 input, and you can then create a correct string by specifying UTF-8 as the encoding for the byte array.
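
This sketch shows why the hack appears to work on broken input: ISO-8859-1 maps all 256 byte values one-to-one to U+0000..U+00FF, so the round trip through it is lossless:

import java.io.UnsupportedEncodingException;

public class MojibakeRoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String arabic = "\u0623\u062A\u062A\u0639\u0627\u0645\u0644";
        // Simulate the broken input path: UTF-8 bytes mis-decoded as 8859-1
        String mangled = new String(arabic.getBytes("UTF8"), "ISO8859_1");
        // The "fix": re-encode as 8859-1 (recovering the original UTF-8
        // bytes) and decode them as UTF-8
        String fixed = new String(mangled.getBytes("ISO8859_1"), "UTF8");
        System.out.println(arabic.equals(fixed)); // true - but only because
                                                  // the input was broken
    }
}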

To sum up: this kind of code can “fix” code or strings that are incorrectly encoded to begin with, but using it is like building a house of cards. Fix the environment settings or the incorrect code – don’t perpetuate the hacks.

PluralFormat to the rescue? July 2, 2007

Posted by globalizer in Java, Language, Localization, Translation.

OK, I have been missing in action for a while, I know. A new job (still in IBM, mind you) and a fairly long vacation are my only excuses. Back to business:

Here and here I complained about the localization issues involved in using ChoiceFormat. One of those issues would seem to be addressed by the new PluralRules and PluralFormat API proposal described on the ICU design mailing list recently. PluralRules would allow you to define plural cases for each language, and the numbers those plural cases apply to, while PluralFormat would then allow you to provide message text for each such case. This format would thus be able to handle languages like Russian and Polish, which use more complex plural rules than the ones that can be provided via the simple intervals of ChoiceFormat.
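
Based on the syntax the proposal describes (and as it later shipped in ICU4J), a Russian pattern would look something like this – the keywords are predefined, the message texts are the localizer's, and ‘#’ stands in for the number:

import com.ibm.icu.text.PluralFormat;
import com.ibm.icu.util.ULocale;

public class PluralFormatDemo {
    public static void main(String[] args) {
        // Russian needs "one", "few" and "many" cases; the localizer
        // writes one message per predefined keyword
        PluralFormat pf = new PluralFormat(new ULocale("ru"),
                "one{# файл} few{# файла} many{# файлов} other{# файла}");
        for (int n : new int[] {1, 2, 5, 11, 21}) {
            System.out.println(pf.format(n)); // 1 файл, 2 файла, 5 файлов...
        }
    }
}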

It is of course a step forward that the API will now allow you to actually define something that will work for (all?) languages. As far as I can see we will actually take a step backward with respect to the other problem, however: the format will be even more difficult for translators to handle.

According to the API proposal,

It provides predefined plural rules for many locales. Thus, the programmer need not worry about the plural cases of a language. On the flip side, the localizer does not have to specify the plural cases; he can simply use the predefined keywords. The whole plural formatting of messages can be done using localized patterns from resource bundles.

If this is really true, then the programmer will write a resource bundle that implements the US English keywords (in most cases, anyway), and it will be up to the localizer to know the PluralRules keywords that are defined for her language, and to implement them correctly in the localized resource bundle.

This comment on the mailing list to the proposal would seem to be an understatement:

Separating the rules from Plural Format helps some here, but translators will still have to be able to write the PluralFormat syntax, which is about as complicated as the ChoiceFormat syntax.

I think my ChoiceFormat advice will extend to the new API for the time being: don’t use it.

Property resource bundle encoding February 8, 2007

Posted by globalizer in Java, Localization, Translation, Unicode.

Just yesterday I had a brief discussion with a development team about the encoding of Java properties files. The team had encoded the English source files in ISO-8859-1, using various characters outside the lower 128 ASCII range (sometimes referred to as “US-ASCII”, or just ASCII, since those are the characters from the original ASCII character encoding).

When I told the team that for translation purposes they needed to either remove the characters beyond the lower 128 set, or convert the files to UTF-8 if they absolutely needed those characters, they complained and pointed to the Java documentation here:

When saving properties to a stream or loading them from a stream, the ISO 8859-1 character encoding is used. For characters that cannot be directly represented in this encoding, Unicode escapes are used; however, only a single ‘u’ character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.

I can see how this could easily lead to a belief that property resource bundles have to be encoded in 8859-1, and also a belief that such files should be able to contain any character from that set.

There are actually two distinct issues involved here:

  1. The encoding of and specific characters used in translation files, and how those items impact translatability
  2. The Java specification on property resource bundle encoding, and what exactly that means with respect to the encoding of such files

With respect to translatability, the problem with 8859-1 is that it can represent only the Latin-1 character set (Western European languages, more or less), so if you need to translate into other languages like Chinese, Thai, Hebrew, etc., then the non-ASCII characters will not be represented correctly in the code pages used by those other languages. And since our translation tools work by using the English source file as the starting point, we need the source file to either contain only the ASCII set that is common across code pages, or to be encoded in Unicode.

The easiest solution is usually to simply use only the ASCII set in the source files (eliminate “fancy quotes”, for instance). If you do that, your source files can be used across legacy code pages, and they can also masquerade as UTF-8 files (since the ASCII set is unchanged in UTF-8).

If for some reason you really really like those fancy quotes (or maybe need to use a copyright sign), then you need to convert your source file to Unicode, and then only update the files using a Unicode-enabled editor – an editor that is not Notepad, by the way, or any other editor that automatically inserts a BOM in UTF-8 files.

The second issue – the encoding used by Java to read and save properties files – is worded in a way that can easily cause confusion (and it is of course totally crazy that this was ever designed to use 8859-1 in the first place). The only thing the specification means, however, is that files encoded in 8859-1 can be read as-is, without any conversion. Files containing characters from other character sets have to have those characters converted to Unicode escape sequences (easily done using native2ascii). And it would of course be impossible to provide Java applications in anything but Western European languages if properties files actually could use only the 8859-1 character set.

Update: With Java 6 you can use PropertyResourceBundle constructed with a Reader, which can use UTF-8 directly.
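
A minimal sketch of that Java 6 route (the file name is hypothetical, and the file must be saved as UTF-8 without a BOM):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.PropertyResourceBundle;

public class Utf8PropertiesDemo {
    public static void main(String[] args) throws Exception {
        // messages.properties is a hypothetical file saved as UTF-8 -
        // no native2ascii conversion, no Unicode escapes needed
        Reader reader = new InputStreamReader(
                new FileInputStream("messages.properties"), "UTF-8");
        try {
            PropertyResourceBundle bundle = new PropertyResourceBundle(reader);
            System.out.println(bundle.getString("greeting"));
        } finally {
            reader.close();
        }
    }
}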

How to use ChoiceFormat then? (Lost in translation – part V) January 30, 2007

Posted by globalizer in Java, Localization, Translation.

Back in this post I said I would get back to what you can actually do to work around the limitations of ChoiceFormat. Well, this is it, and basically my advice is: don’t use it.

And what do I propose to use instead? One specific strategy has worked well for me in the past, and it has the advantage of being simple, both for developers and localizers:

Get away from the use of full sentences with subject, verb, object, etc., combined with the use of variables. These kinds of constructions, where you have to anticipate and accommodate the syntactical quirks of every single language known to man, are bound to get you in trouble. There will always be a language out there that requires you to customize your code to handle one more gender, one more verb inflection, etc. So, instead of trying to present sentences like these to your users:

Your account {0} contains {1} dollars.
The directory {0} contains {1} files.

do something along these lines instead:

Account: {0}
Balance: {1}
Directory: {0}
Number of files: {1}

This type of construction is by no means perfect, but for most languages I know of (and that I have dealt with on IBM projects) it would be possible to construct something usable – since you have removed the tight coupling between nouns, verbs, etc.
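
To be concrete, the label approach boils down to patterns like these (a sketch; in a real application the pattern strings would of course come out of a resource bundle):

import java.text.MessageFormat;
import java.util.Locale;

public class LabelStyleDemo {
    public static void main(String[] args) {
        // Each pattern is a self-contained label - no sentence grammar
        // for the translator to fight with
        MessageFormat account = new MessageFormat("Account: {0}", Locale.US);
        MessageFormat balance = new MessageFormat("Balance: {1,number,currency}", Locale.US);
        Object[] data = { "12345-678", Double.valueOf(1234.56) };
        System.out.println(account.format(data)); // Account: 12345-678
        System.out.println(balance.format(data)); // Balance: $1,234.56
    }
}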

Rather disappointing, I hear you say. And you may be right. But for now, and until we have much better algorithms for natural language processing built into our APIs and/or our translation tools, it is what makes sense.

Fonts, fonts, fonts… And: why don’t people use Google/Yahoo/search engine of their choice? January 25, 2007

Posted by globalizer in Java, Locales, Localization, Unicode.

<rant>

Why, oh why can’t people figure out how to put the tools that are right in front of them to the most basic use?

Fifteen years ago it may have been difficult to find the answer to a question like this one (not simply because Tamil was barely supported anywhere in 1992, of course, but because the search options were so poor back then). But today, when a basic Google search for something like {java font fallback support} or {java internationalization font} would provide you with a wealth of useful resources like this and this, why ask a totally open-ended question on a forum and expect others to provide the answer?

Unfortunately, I think I know the answer – laziness. I see it more and more, at least in the Java i18n forum – people ask incredibly simple questions that could be answered by a 2-minute Google search (or basic familiarity with the Java APIs), and they even combine them with rude requests like “give me code” or “provide a solution”.

It also has the effect of vastly reducing the usefulness of fora like this – people who are reasonably knowledgeable don’t bother subscribing, so you rarely get any really interesting issues raised any more.

Sigh…

</rant>

This specific question did provide me with the impetus to actually try out Tamil in a Swing application, and confirmed my understanding of the way fallback fonts work from Java 1.5 onwards. Since I have worked only with servlets/JSPs for the past several years, that was at least marginally useful 🙂
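
For the record, my test was along these lines (a sketch; the string is just the word “Tamil” in Tamil script):

import java.awt.Font;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.SwingUtilities;

public class TamilFallbackTest {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                JFrame frame = new JFrame("Tamil font fallback");
                // From Java 1.5 on, logical fonts like Dialog fall back to
                // any installed font that covers these code points
                JLabel label = new JLabel("\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD");
                label.setFont(new Font("Dialog", Font.PLAIN, 24));
                frame.getContentPane().add(label);
                frame.pack();
                frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
                frame.setVisible(true);
            }
        });
    }
}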