
More i18n potholes to be filled by Google October 26, 2010

Posted by globalizer in Internationalization, JavaScript, Locales, Programming languages.
3 comments

Yesterday I mentioned the big step forward that Google’s open-sourcing of phone and address libraries represents. For Java these libraries fill a major hole in an otherwise fairly smooth road, while for JavaScript they could be seen as the only smooth part of an otherwise almost completely unpaved i18n road.

As Ćirić and Shin, the authors of Google’s proposed i18n enhancements for JavaScript, (charitably) put it,

current EcmaScript I18N APIs do not meet the needs of I18N sufficiently

This has obviously been a problem for a very long time, but until Ajax gave JavaScript a second chance and the web browser became the dominant application delivery system, nobody thought it awful enough to fix properly. I remember discussions about this issue back in the 1990s, and at that time nobody in IBM was using JavaScript enough to squawk about the lack of support.

Well, things change. And with Google now being a serious player in the browser market, they seem to have found it important enough to propose a set of i18n APIs that would provide JavaScript with support similar to that found in languages like Java, covering:

  • Locale support
  • Collation
  • Timezone handling
  • Number, date and time formatting
  • Number, date and time parsing
  • Message formatting

The proposal calls for using native data sources (ICU, glibc, native Windows calls), mainly because of the size of some of the data tables needed for collation, for instance. While not optimal, understandable.

The proposed message formatting is another variation of the plural and gender formatting capabilities that are all the rage these days. People who have read my previous posts on this topic will know that I am no fan of this type of formatting, and my most recent experiences with email templates using plural formatting have not changed my view. Exposing stuff like this in translatable files is just utter folly, IMHO:

var pattern = '{WHO} invited {NUM_PEOPLE, plural, offset:1 other {{PERSON} and # other people}} to {GENDER, select, female {her circle} other {his circle}}'

I did hear support for this viewpoint at IUC34, and the suggestion that these strings should not be exposed in the translation files – instead, those files should contain the full set of “expanded” string variations (male/female pronouns, singular/plural cases).

But if that is the goal, I see very little point in using the message formatters in the first place. I guess it forces the developer to think about the variations, and it would keep the strings co-located in the translation files, but that’s about all.
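Just to make the alternative concrete, here is a rough sketch of what the “expanded” approach could look like in Java – the bundle name, the keys and the simplistic plural selection are all just illustrations, not anybody’s actual API:

import java.text.MessageFormat;
import java.util.Locale;
import java.util.ResourceBundle;

public class InviteMessage {

    // Hypothetical keys in an ordinary .properties translation file:
    //   invite.female.one={0} invited {1} to her circle
    //   invite.female.other={0} invited {1} and {2} other people to her circle
    //   invite.male.one={0} invited {1} to his circle
    //   invite.male.other={0} invited {1} and {2} other people to his circle
    // Each variant is a complete sentence a translator can edit directly.
    static String invite(ResourceBundle bundle, String gender,
                         String who, String person, int otherPeople) {
        // Simplistic plural selection; real code would use the target
        // locale's plural rules instead of an English-style test.
        String plural = (otherPeople == 0) ? "one" : "other";
        String key = "invite." + gender + "." + plural;
        return MessageFormat.format(bundle.getString(key), who, person, otherPeople);
    }

    public static void main(String[] args) {
        ResourceBundle bundle = ResourceBundle.getBundle("Invites", new Locale("da"));
        System.out.println(invite(bundle, "female", "Alice", "Bob", 3));
    }
}

The translator only ever sees complete sentences; the price, as noted above, is that the developer has to enumerate and select the variations in code.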

That’s nitpicking, however, considering the huge step forward this would represent, with an experimental implementation targeted for Q4.

Kudos to Google: filling huge i18n gap October 25, 2010

Posted by globalizer in Android, Internationalization, Java, Locales.
2 comments

I’ve been a little harsh on Google in some previous posts, so I’m happy to have some good – no, make that great – Google news from the recent IUC34 conference. Albeit a little late compared to the tweeting others have done 🙂

Even though the internationalization community has made great progress towards more, and more uniform, locale data with the wide acceptance of the CLDR in recent years, we have been left with 2 big gaping holes: phone numbers and postal addresses. Up until now it has been practically impossible to implement correctly internationalized phone number and address formatting, parsing and validation, since the data and APIs have been unavailable.

Depending on the level of globalization awareness of the companies involved, this has resulted in implementations falling into one of these 3 broad categories:

  1. Address and phone number fields are hard-coded to work for only one locale and reject everything else as invalid. This usually takes the form of making every single field required and validating every single field, even on web sites where the address and/or phone number of the user is not actually important (such as purchases of non-restricted software delivered electronically, or web sites with required registration to enable targeted ads). This of course also results in such companies collecting an amazing amount of absolute garbage. For instance, if you make “Zip code” a required field and validate it against a list of US zip codes, you end up with a surprising percentage of your users living in the 90210 area – simply because that is the one US zip code that people outside the US have had drilled into them via exposure to the TV show.
  2. Support for a limited number of countries/regions (limited by the number of regions you have the bandwidth to gather data and implement support for – with each company reinventing the wheel every time, for every country)
  3. No validation (provide the user with one, single address field, and assume that if the user wants you to be able to reach him/her at that address, he/she will fill it in with good data)

As described in the IUC34 session (by Shaopeng Jia and Lara Rennie), collecting reliable and complete data to fill these holes was a major task (it’s no coincidence that nobody has done it before…):

  • There is no single source of data
  • Supposedly reliable data (ITU and individual country postal/telephone unions) turns out to be unreliable (data not updated, or new schemes not implemented on time)
  • Formats differ widely between countries/regions
  • Some countries even lack clear structure
  • Some countries (e.g., UK) use many different formats
  • Some countries use different formats depending on language/script being used
    • Chinese/Japanese/Korean addresses – start with biggest unit (country) if using ideographic script, but with smallest unit (street) if using Latin script

I have looked at these issues a few times in the past, and each time the team decided that we didn’t really need this information (translation: there was no way in hell we were going to be able to get the manpower to gather the information and implement a way to process it). Since Google does in fact have a business model that makes it very important to be able to parse these elements and format them correctly for display (targeted ads and Android, to name a couple of cases), it makes sense that they bit the bullet.

They deserve a lot of kudos, however, for also going ahead and open-sourcing both the data and the APIs that came out of that major undertaking.

Check it out:

According to the IUC34 presentation, the phone number APIs will allow you to format and validate numbers for 184 regions, and to parse numbers from all regions. The address APIs provide detailed validation for 38 regions, and layout plus basic validation for all regions.
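To give an idea of what this buys you, here is a minimal sketch of the phone number side in Java, assuming the libphonenumber API as Google has published it (the number is just an example):

import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

public class PhoneCheck {
    public static void main(String[] args) throws NumberParseException {
        PhoneNumberUtil util = PhoneNumberUtil.getInstance();

        // Parse a raw, user-entered string; the second argument is the region
        // to assume when the number carries no international prefix.
        PhoneNumber number = util.parse("650-253-0000", "US");

        // Validate against the rules for the number's region...
        System.out.println("valid: " + util.isValidNumber(number));

        // ...and format it for display.
        System.out.println(util.format(number, PhoneNumberFormat.INTERNATIONAL));
        System.out.println(util.format(number, PhoneNumberFormat.NATIONAL));
    }
}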

Install anywhere yes – but prepare for a slightly bumpy translation ride December 6, 2008

Posted by globalizer in Language, Locales, Localization, Programming practices.
add a comment

First, the good stuff: InstallAnywhere is actually a nifty product which allows you to create a good install program in almost no time. Kudos to the team behind it for that.

Now for the not-so-good stuff: translation

Here I am not referring to the translations provided by InstallAnywhere out of the box. They provide very good language coverage (I count 31 languages/language variants in the 2008 VP1 Enterprise Edition, including English), and the quality of the translations themselves also seems fine in this version (some of the early InstallShield translations into certain languages were rather unfortunate) [1].

So, no complaints there. The trouble starts when you need to modify or customize anything related to the translations.

First problem

All new/updated text strings that will need to be translated are inserted into the custom_en file, which already contains all the out-of-the-box translation strings. There is no option to choose a separate translation file for custom strings. This means that anybody using a modern translation tool with translation memory features will have to re-translate the entire InstallAnywhere GUI even if they only modify a single string (because such tools use the English source file as the starting point for all translation). Cut-and-paste from the existing translation files can make that job faster, or you may be able to create a translation memory based on a set of source and target files, as described below, but no matter which method you choose, there will be a significant workload involved.

InstallAnywhere does update all the language versions of the custom_xx files along with the English version, with the difference that only the comment line for each string is updated in the translated versions (as the examples below show, each file contains a comment line with a copy of the English string). After an update, the English and Danish versions look like this, respectively:

custom_en:
# ChooseInstallSetAction.368876699cf1.bundlesTitle=Choose Install Setشقشلاهؤ
ChooseInstallSetAction.368876699cf1.bundlesTitle=Choose Install Setشقشلاهؤ

custom_da:
# ChooseInstallSetAction.368876699cf1.bundlesTitle=Choose Install Setشقشلاهؤ
ChooseInstallSetAction.368876699cf1.bundlesTitle=Vælg installationssæt

This seems to indicate that the designers of the product have not understood how modern translation tools work. Indeed, the detailed help indicates that the creators of the application assume the translated versions will all be created/edited via the IA designer. This feature would have been extremely useful 20 years ago, since it retains existing translations and allows translators to just go through the translated file and modify the strings where they see an update. Today, however, it is a terrible hindrance (except for translators still working without modern tools).

With today’s tools, translators who need to bring a translated file up to the same level as an updated English file simply take the new English source file and run it through their translation memory tool. That tool automatically translates any unchanged strings and presents the translator with just those strings that are either new or changed. For this to work the translation memory has to contain the unchanged strings, of course, and that is where the InstallAnywhere model breaks down. With some tools it is possible to create “fake” translation memories on the basis of existing source and target files, but it is a rather time-consuming process, and by no means error-free.
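For what it is worth, the “fake” translation memory trick boils down to pairing source and target strings by key. A rough sketch in Java (the file names are hypothetical, and the tab-separated output stands in for whatever import format your TM tool actually wants):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.Properties;

public class FakeTranslationMemory {
    public static void main(String[] args) throws Exception {
        Properties english = new Properties();
        Properties danish = new Properties();
        // Hypothetical file names standing in for the custom_en/custom_da files.
        english.load(new InputStreamReader(new FileInputStream("custom_en.properties"), "UTF-8"));
        danish.load(new InputStreamReader(new FileInputStream("custom_da.properties"), "UTF-8"));

        // Pair source and target by key; skip entries that were never translated.
        for (String key : english.stringPropertyNames()) {
            String source = english.getProperty(key);
            String target = danish.getProperty(key);
            if (target != null && !target.equals(source)) {
                System.out.println(source + "\t" + target);
            }
        }
    }
}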

The easy fix would be to at least make it an option to store any customized strings in a separate translation file. Since InstallAnywhere allows users to change existing strings, this of course introduces the question of what to do with the existing strings in the custom_en and translated versions of that file.

I believe the best solution would be to delete such strings from the custom_en file and the translated versions (in other words, those files would only contain the strings that were unchanged from the out of the box version). The changed strings would instead be inserted in the “new” translation file.

Second problem – probably a minor one

There does not really seem to be an option for adding a language that is not in the list of languages provided out of the box. At least I don’t see it in the designer, and I haven’t found any information in the knowledgebase, fora, etc. With the number of languages supported, this may not be a major issue, but it would be nice to have the option.

[1]
Some of those early versions had eerily bad Danish and Norwegian translations which looked like an amalgam of Danish, Norwegian and Swedish. This old thread from an InstallShield forum may shed some light on how that happened (note also how the InstallShield support guy keeps suggesting that a Dutch version be used, when the user is looking for Danish…). But, as noted above, the current translations seem fine. I looked at the Danish version (the only one I am really competent to judge), and I have no complaints whatsoever, so I believe that the early problems have been overcome completely.

OK, so it WAS broken, and SHOULD have been fixed… November 7, 2008

Posted by globalizer in Java, Locales.
5 comments

Here I complained about a seemingly wrongheaded change in the Java locale look-up algorithm. It turns out that the only change was to fix incorrect API docs. The original documentation described what everybody seems to agree is the desired behavior, but the APIs apparently were never implemented in that way. So, the docs were changed to describe the actual fallback behavior.

As described in the forum post linked above, in JDK 6 you can now get rid of the undesired behavior with ResourceBundle.Control.getNoFallbackControl().
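For reference, this is roughly what that looks like with the standard JDK 6 API (the bundle name and key are just examples):

import java.util.Locale;
import java.util.MissingResourceException;
import java.util.ResourceBundle;

public class NoFallbackLookup {
    public static void main(String[] args) {
        try {
            // Without the control, a missing da_DK bundle would be resolved
            // against the default locale (e.g. en_US) before the base bundle.
            // getNoFallbackControl() switches that default-locale step off.
            ResourceBundle bundle = ResourceBundle.getBundle(
                    "Messages",
                    new Locale("da", "DK"),
                    ResourceBundle.Control.getNoFallbackControl(
                            ResourceBundle.Control.FORMAT_DEFAULT));
            System.out.println(bundle.getString("greeting"));
        } catch (MissingResourceException e) {
            // With no fallback you get the base Messages bundle if it exists,
            // or this exception if it does not.
            e.printStackTrace();
        }
    }
}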

It is still a mystery to me why the API was not changed to get the behavior that everybody agrees is more sensible, though. I certainly understand that API changes should not be undertaken lightly, since you risk breaking existing implementations, but in this case I have a hard time seeing how anybody would rely on the existing behavior (and any implementations that did would be easy to fix).

In any case, my previous post was wrong – I didn’t have all the old JDKs available to test, and relied on the documentation instead.

Don’t know whether to laugh or to cry… May 5, 2008

Posted by globalizer in Java, Locales.
add a comment

In this post I complained about the change to the Danish date format that Sun implemented in Java 1.5. With the 2007 rejection of a bug that tried to get this change reverted, I figured the battle was lost:

After a long investigation within the supplied URLs,
asking danish native speakers I realized it is common
to use current date format.
Closing the CR, will not fix.
Posted Date : 2007-10-16 09:37:17.0

However…

I now see that in Java 1.6 (1.6.0_05) Sun has in fact changed the short Danish locale format back to the way it was in Java 1.4. Verified by testing, and also by comparing the Danish locale formats specified in LocaleElements_da.java and FormatData_da.java respectively.
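The test itself is trivial; something like this will print the short Danish pattern and a sample date on whichever JDK you run it against:

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DanishShortDate {
    public static void main(String[] args) {
        Locale daDK = new Locale("da", "DK");
        DateFormat shortFormat = DateFormat.getDateInstance(DateFormat.SHORT, daDK);
        // Print the pattern the JDK ships for da_DK, and a formatted sample date.
        System.out.println(((SimpleDateFormat) shortFormat).toPattern());
        System.out.println(shortFormat.format(new Date()));
    }
}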

While I am very happy this was done, I am fairly flabbergasted that somebody from Sun would respond the way they did on Oct. 16, 2007, and then not go back and update the bug report with the information that the bug has in fact been fixed (and which update the fix was included in).

To localize or not to localize February 25, 2008

Posted by globalizer in Locales, web applications.
add a comment

That is the question.

And here I am talking not about translations, but about the kind of formatting that your software performs on dates and numbers, for instance. You really need to make a conscious decision in this area – do I want my application to be locale-dependent or not – and then make sure you follow through in every single function and on every screen.

Otherwise you may easily end up with the kind of totally unusable application that my electrical coop uses for online bill payments.

I will save them embarrassment and not actually post the specific name of the coop, although I would feel no qualms about outing them; I sent an email with the details of the problem, and the fix, more than 2 years ago, and since then

  • I have received no response whatsoever,
  • and have seen no change in the behavior of the application

“OK, will you get to the point?” I hear you ask, and right you are:

My electric coop gives me the opportunity (which I appreciate, btw.) to view and pay my electric bill online. And since I use my system for a lot of different language testing my browser will quite often have a language that is not English at the top of my language preference list. Almost all web applications use that language preference to decide which locale formatting to use, and my coop application is no exception.

Thus, with my browser language set to Danish, my bill displays like this:

Coop bill

So far, all is well. The web application has seen that I prefer Danish, so it displays the amount I owe using the Danish format with a comma as the decimal separator. When I click “Pay” on the page, I get the following message displayed, however:

Coop error

Oops, I see that this error actually contains the web address – oh well.

And here’s where the disconnect is revealed: the guy implementing the entry field validation has hardcoded it to the US English format, no matter what number format the application itself has chosen to use when it displays my bill.

There is of course a whole host of issues here:

  • The message assumes the customer typed in a value, while it was in fact filled in by the application
  • The message provides very little help to the hapless user who has no idea why the application refuses to understand the number it put in the entry field itself
  • A little bit of proofreading/testing of the message would not have been amiss; it might have uncovered the missing spaces where the sentences are concatenated

Just to complete the picture, I also got a nice little NumberFormatException displayed in the browser at the end:

NumberFormatException
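The underlying disconnect is easy to reproduce in a few lines of Java – a sketch only, since I obviously don’t know what the coop’s code actually looks like:

import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class AmountRoundTrip {
    public static void main(String[] args) throws ParseException {
        Locale danish = new Locale("da", "DK");
        double amount = 1234.56;

        // Display side: format the amount according to the user's locale.
        String displayed = NumberFormat.getNumberInstance(danish).format(amount);
        System.out.println(displayed); // 1.234,56

        // Locale-aware parsing/validation of that same string works fine...
        System.out.println(NumberFormat.getNumberInstance(danish).parse(displayed));

        // ...but validation hardcoded to the US English format chokes on the
        // very string the application itself put into the field.
        try {
            Double.parseDouble(displayed);
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}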

All in all, not a good localization story at all. And proof that you are much better off being completely locale-independent than halfway there.

Indulging in some Sugar over the holidays December 26, 2007

Posted by globalizer in global access, International requirements, Keyboard layouts, Locales, Unicode.
1 comment so far

No, not that kind – the kind of Sugar that comes with the OLPC laptop:

Sugar

The Sugar user interface does take some getting used to, at least for anybody who has ever used a computer before.

Everything is centered around activities, and you need to get used to using your Journal to access your stuff, rather than the file system:

My Journal on the XO

I am slowly learning that this has its advantages. For instance, if you go back to a terminal window instead of opening a new one, you can recall your keyboard commands with the up and down arrows, even though the system has been powered down in between.

However, if you use the built-in browser (the Browse activity)

Browse activity on XO

to download a file, for instance, you have no clue where the file is placed, so if you want to do something with it afterwards, you need to go to your journal to get it.

This may very well be a good design when the target user group is supposed to be children who have never used (or even seen) a computer before – I just don’t know. I do know that the people exploring the Sugar interface right now via the G1G1 program are not the right people to test the usefulness of the design. Our brains are already hardwired to think along the lines of conventional computer interfaces as they have looked for the past 20 years or so.

So it will be really interesting to see what happens when the machines hit more places like this and we get some real, hard data about how kids use them.

Anyway, I have spent a little time exploring the internationalization support (naturally). The build that is included on the machines in this first shipment to G1G1 donors includes very little in the way of actual localizations, but I will get back to that (and the localization support) in a later post. The underlying Linux distribution (Fedora 7) and the Pango library for text layout have presumably been enlisted to do most of the heavy lifting in the i18n area; how many and what kind of improvements/changes the OLPC project has added is unclear to me right now.

Documentation seems to be scattered in various places in the wiki, with information often being incomplete or irrelevant (e.g., this page about locales seems to contain merely an attempt at a generic description or definition of the concept, and then lists the locales included in the Mandriva Linux distribution – hardly very useful if you are trying to understand the OLPC project).

Locales

There does seem to be a full list of installed locales, and I can set the LANG variable to Kurdish in Turkey, or Bengali in Bangladesh, say, and get reasonable results in the Terminal (ignore the fact that the system for some reason came configured to be one day ahead of everybody else):

Locales on the XO

There are few places in the Sugar interface, other than the Terminal activity, where locale settings are actually used/displayed (smart choice!), so it is difficult to test the extent of the locale support.

A cautionary tale about changing locale defaults November 16, 2007

Posted by globalizer in Java, Locales.
3 comments

Here I talked about some of the pitfalls involved in using short date formats in user interfaces, and I mentioned that it was occasioned by an issue involving a change in the Danish date format in Java between JDK 1.4 and 1.5.

The change has its origin in this bug report filed back in 2003 against Java 1.4. The original poster muddied the waters from the beginning by stating that “The Danish Standard is, basically, the same as the international standard (yyyy-mm-dd)”. That is certainly a debatable statement, since the official Danish body providing guidance in this area (Dansk Sprognævn) at most states that you can choose between the international standard and the traditional Danish date format (day-month-year).

Evil short date formats November 15, 2007

Posted by globalizer in Java, Locales.
1 comment so far

I was recently drawn into a discussion about a Danish date format being used in one of our applications. The specific issue was related to a change in the Danish date format in the JDK between 1.4 and 1.5, but I’ll get back to that in a subsequent post. For now, I just want to take a quick look at the abomination that “short” date formats are.

I am talking about formats like dd-MM-yy, MM-dd-yy and yy-MM-dd that yield dates like this:

06-09-07

Without context, it is really hard to tell whether that date refers to September 7, 2006, September 6, 2007 or June 9, 2007. Somebody born and raised in the US would probably say it means June 9, 2007, while most Europeans would interpret it as September 6, 2007.

In the application we are talking about, it actually refers to September 7, 2006, however – probably most people’s last guess in this case, and a confusion that could have been easily avoided if the developer had used the “medium” date format in Java instead of the short format:

2006-09-07

What we have here is actually the international ISO 8601 date format, with elements ordered from the largest to the smallest. It is a format that is recommended for use on the web, and the one I commonly recommend for use in logs, for instance.
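In Java terms the difference is just a pattern choice. A quick sketch, showing the locale-dependent short formats next to a fixed ISO 8601 pattern:

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateFormatComparison {
    public static void main(String[] args) {
        Date date = new Date();

        // Locale-dependent short formats: two-digit years, and a day/month
        // order that changes from locale to locale.
        System.out.println(DateFormat.getDateInstance(DateFormat.SHORT, Locale.US).format(date));
        System.out.println(DateFormat.getDateInstance(DateFormat.SHORT, new Locale("da", "DK")).format(date));

        // Fixed ISO 8601 pattern: four-digit year, largest element first.
        System.out.println(new SimpleDateFormat("yyyy-MM-dd").format(date));
    }
}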

So, for any platform where the short date format uses only 2 digits for the year, the recommendation would be:

Do not ever use the short date format!

That would at least identify the year in unmistakable terms, even though it would not necessarily do anything about the potential day/month confusion.

Which leads me to my second recommendation:

Unless you produce locale-sensitive UIs, and you can be sure the user expects the locale-dependent format, use the ISO 8601 date format

I will get back to the specific problem with the date format change for Danish in the JDK. It illustrates how much havoc you can cause by introducing changes in this area, and also how difficult it is to establish locale standards – simply because there is disagreement within the target user population about what the “standard” is.

Flex developers sure are agile though… March 2, 2007

Posted by globalizer in Locales, Localization, Programming languages.
1 comment so far

I’m impressed to see a comment from a Flex developer the day after I posted about the localization support. And even more impressed to see that the next version will actually provide exactly the kind of support I was looking for.

If that kind of attentiveness is an indication of their support in general, then I give them a thumbs up…