More i18n pot holes to be filled by Google October 26, 2010

Posted by globalizer in Internationalization, JavaScript, Locales, Programming languages.

Yesterday I mentioned the big step forward that Google’s open-sourcing of phone and address libraries represents. For Java these libraries fill a major hole in an otherwise fairly smooth road, while for JavaScript they could be seen as the only smooth part of an otherwise almost completely unpaved i18n road.

As Ćirić and Shin, the authors of Google’s proposed i18n enhancements for JavaScript, (charitably) put it:

current EcmaScript I18N APIs do not meet the needs of I18N sufficiently

This has obviously been a problem for a very long time, but until Ajax gave JavaScript a second chance and the web browser became the dominant application delivery system, nobody thought it awful enough to fix properly. I remember discussions about this issue back in the 1990s, and at that time nobody in IBM was using JavaScript enough to squawk about the lack of support.

Well, things change. And with Google now being a serious player in the browser market, they seem to have found it important enough to propose a set of i18n APIs that would provide JavaScript with support similar to that found in languages like Java (a sketch of the resulting API surface follows the list), covering:

  • Locale support
  • Collation
  • Timezone handling
  • Number, date and time formatting
  • Number, date and time parsing
  • Message formatting
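
For the curious: this proposal is what eventually evolved into the standardized ECMAScript Internationalization API (ECMA-402), exposed as the Intl object. A minimal sketch of that surface as it exists in modern engines – note that parsing and message formatting never made it into the standard:

// Collation: locale-sensitive comparison (German sorts "ä" next to "a")
var collator = new Intl.Collator('de');
['ä', 'z', 'a'].sort(collator.compare); // ['a', 'ä', 'z']

// Number formatting (currency, Danish conventions)
var nf = new Intl.NumberFormat('da-DK', { style: 'currency', currency: 'DKK' });
nf.format(1234.5); // e.g. "1.234,50 kr."

// Date/time formatting; time zone handling rides along as an option
var dtf = new Intl.DateTimeFormat('ja-JP', {
  year: 'numeric', month: 'long', day: 'numeric',
  hour: 'numeric', minute: 'numeric', timeZone: 'Asia/Tokyo'
});
dtf.format(new Date());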

The proposal calls for using native data sources (ICU, glibc, native Windows calls), mainly because of the size of some of the data tables needed – for collation, for instance. Not optimal, but understandable.

The proposed message formatting is another variation of the plural and gender formatting capabilities that are all the rage these days. People who have read my previous posts on this topic will know that I am no fan of this type of formatting, and my most recent experiences with email templates using plural formatting have not changed my view. Exposing stuff like this in translatable files is just utter folly, IMHO:

var pattern = '{WHO} invited {NUM_PEOPLE, plural, offset:1 other {{PERSON} and # other people}} to {GENDER, select, female {her circle} other {his circle}}';
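
For reference, this pattern syntax matches what the Closure library ships as goog.i18n.MessageFormat. Assuming that implementation, and with made-up parameter values, consuming the pattern would look roughly like this:

goog.require('goog.i18n.MessageFormat');

// With offset:1, the # placeholder expands to NUM_PEOPLE - 1
var fmt = new goog.i18n.MessageFormat(pattern);
fmt.format({WHO: 'Mary', NUM_PEOPLE: 5, PERSON: 'Alice', GENDER: 'female'});
// "Mary invited Alice and 4 other people to her circle"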

I did hear support for this viewpoint at IUC34, along with the suggestion that these strings should not be exposed in the translation files – instead, those files should contain the full set of “expanded” string variations (male/female pronouns, singular/plural cases).

But if that is the goal, I see very little point in using the message formatters in the first place. I guess it forces the developer to think about the variations, and it would keep the strings co-located in the translation files, but that’s about all.
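
A minimal sketch of what that “expanded” approach might look like – the message keys and the inviteMessage helper below are hypothetical, purely to illustrate the idea:

// Hypothetical translation file: every variation spelled out for translators
var messages = {
  invited_one_female: '{WHO} invited {PERSON} to her circle',
  invited_one_male: '{WHO} invited {PERSON} to his circle',
  invited_many_female: '{WHO} invited {PERSON} and {NUM_OTHERS} other people to her circle',
  invited_many_male: '{WHO} invited {PERSON} and {NUM_OTHERS} other people to his circle'
};

// Hypothetical helper: the plural/gender branching lives in application code,
// not in the translatable string itself
function inviteMessage(who, person, numOthers, gender) {
  var key = 'invited_' + (numOthers > 0 ? 'many' : 'one') + '_' +
      (gender === 'female' ? 'female' : 'male');
  return messages[key]
      .replace('{WHO}', who)
      .replace('{PERSON}', person)
      .replace('{NUM_OTHERS}', String(numOthers));
}

inviteMessage('Mary', 'Alice', 4, 'female'); // "Mary invited Alice and 4 other people to her circle"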

That’s nitpicking, however, considering the huge step forward this would represent, with an experimental implementation targeted for Q4.

Kudos to Google: filling huge i18n gap October 25, 2010

Posted by globalizer in Android, Internationalization, Java, Locales.

I’ve been a little harsh on Google in some previous posts, so I’m happy to have some good – no, make that great – Google news from the recent IUC34 conference. Albeit a little late compared to the tweeting others have done 🙂

Even though the internationalization community has made great progress towards more, and more uniform, locale data with the wide acceptance of the CLDR in recent years, we have been left with two big, gaping holes: phone numbers and postal addresses. Up until now it has been practically impossible to implement correctly internationalized phone number and address formatting, parsing and validation, since the data and APIs have simply not been available.

Depending on the level of globalization awareness of the companies involved, this has resulted in implementations falling into one of three broad categories:

  1. Address and phone number fields are hard-coded to work for only one locale and reject everything else as invalid. This usually takes the form of making every single field required, and validating every single field, even on web sites where the address and/or phone number of the user is not actually important (such as purchases of non-restricted software delivered electronically, or web sites with required registration to enable targeted ads). This of course also results in such companies collecting an amazing amount of absolute garbage. For instance, if you make “Zip code” a required field and validate it against a list of US zip codes, then you end up with a startling percentage of your users living in the 90210 area – simply because that is the one US zip code people living outside the US have had drilled into them via exposure to the TV show.
  2. Support for a limited number of countries/regions (limited by the number of regions you have the bandwidth to gather data and implement support for – with each company reinventing the wheel every time, for every country)
  3. No validation (provide the user with one single address field, and assume that if the user wants you to be able to reach him or her at that address, he/she will fill it in with good data)

As described in the IUC34 session (by Shaopeng Jia and Lara Rennie), collecting reliable and complete data to fill these holes was a major task (it’s no coincidence that nobody has done it before…):

  • There is no single source of data
  • Supposedly reliable data (ITU and individual country postal/telephone unions) turns out to be unreliable (data not updated, or new schemes not implemented on time)
  • Formats differ widely between countries/regions
  • Some countries even lack clear structure
  • Some countries (e.g., UK) use many different formats
  • Some countries use different formats depending on language/script being used
    • Chinese/Japanese/Korean addresses start with the biggest unit (country) when written in ideographic script, but with the smallest unit (street) when written in Latin script

I have looked at these issues a few times in the past, and each time the team decided that we didn’t really need this information (translation: there was no way in hell we were going to be able to get the manpower to gather the information and implement a way to process it). Since Google does in fact have a business model that makes it very important to be able to parse these elements and format them correctly for display (targeted ads and Android, to name a couple of cases), it makes sense that they bit the bullet.

They deserve a lot of kudos, however, for also going ahead and open-sourcing both the data and the APIs that came out of that major undertaking.

Check it out:

According to the IUC34 presentation, the phone number APIs will format and validate numbers for 184 regions, and parse numbers from all regions. The address APIs provide detailed validation for 38 regions, with layout and basic validation for all regions.
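
The phone number library is what is now known as libphonenumber; the original release was Java, with C++ and JavaScript implementations added later. A sketch of typical use, assuming the JavaScript port as packaged for npm (google-libphonenumber):

// Parse, validate and format a number relative to a default region
var libphonenumber = require('google-libphonenumber');
var phoneUtil = libphonenumber.PhoneNumberUtil.getInstance();

var number = phoneUtil.parse('202-456-1111', 'US');
phoneUtil.isValidNumber(number); // true
phoneUtil.format(number, libphonenumber.PhoneNumberFormat.INTERNATIONAL);
// "+1 202-456-1111"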