Full Unicode repertoire in programming languages? October 29, 2010

Posted by globalizer in Programming languages, Programming practices, Unicode.

Would we be better off if we used a programming language that allowed “the entire gamut of Greek letters, mathematical and technical symbols, brackets, brockets, sprockets, and weird and wonderful glyphs such as ‘Dentistry symbol light down and horizontal with wave’ (0x23c7)”?

I am usually as gung-ho about Unicode as you can get, but I have to admit I’m a little wary about this. Mind you, it would presumably spur the adoption of UTF-8 as the default encoding in development environments on all platforms, something that’s long overdue. How can MacRoman still be the default encoding for text files in Eclipse on Macs??
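For what it is worth, Java already goes part of the way: letters from any script are allowed in identifiers, while symbols are not. A minimal sketch (the class and method names are mine, purely for illustration) that asks the runtime which code points could start an identifier:

```java
class IdentifierCheck {
    // Returns true if the given code point may start a Java identifier.
    static boolean canStartIdentifier(int codePoint) {
        return Character.isJavaIdentifierStart(codePoint);
    }

    public static void main(String[] args) {
        int greekPi = 0x03C0;          // GREEK SMALL LETTER PI, a letter
        int dentistrySymbol = 0x23C7;  // the dentistry symbol from the quote, a symbol
        System.out.println("pi: " + canStartIdentifier(greekPi));
        System.out.println("dentistry symbol: " + canStartIdentifier(dentistrySymbol));
    }
}
```

So Greek letters are fine in Java today, provided the compiler is told the right source encoding, but the dentistry symbol is not.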

Via Computerworld.

More i18n pot holes to be filled by Google October 26, 2010

Posted by globalizer in Internationalization, JavaScript, Locales, Programming languages.

Yesterday I mentioned the big step forward that Google’s open-sourcing of phone and address libraries represents. For Java these libraries fill a major hole in an otherwise fairly smooth road, while for JavaScript they could be seen as the only smooth part of an otherwise almost completely unpaved i18n road.

As Ćirić and Shin, the authors of Google’s proposed i18n enhancements for JavaScript, (charitably) put it,

current EcmaScript I18N APIs do not meet the needs of I18N sufficiently

This has obviously been a problem for a very long time, but until Ajax gave JavaScript a second chance and the web browser became the dominant application delivery system, nobody thought it awful enough to fix properly. I remember discussions about this issue back in the 1990s, and at that time nobody in IBM was using JavaScript enough to squawk about the lack of support.

Well, things change. And with Google now being a serious player in the browser market, they seem to have found it important enough to propose a set of i18n APIs that would provide JavaScript with support similar to that found in languages like Java, covering

  • Locale support
  • Collation
  • Timezone handling
  • Number, date and time formatting
  • Number, date and time parsing
  • Message formatting
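
For comparison, here is a rough sketch of the kind of support Java has had for years in two of these areas, number formatting and collation. The class and method names are hypothetical; the `java.text` and `java.util` calls are the standard ones.

```java
import java.text.Collator;
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.Locale;

class LocaleDemo {
    // Formats a number using the conventions of the given locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Returns a sorted copy of the input, ordered by locale-aware collation
    // rather than by raw char values.
    static String[] sortForLocale(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234567.5, Locale.US));
        System.out.println(formatNumber(1234567.5, Locale.GERMANY));

        String[] words = {"banana", "apple", "Cherry"};
        String[] plain = words.clone();
        Arrays.sort(plain); // code unit order puts the uppercase word first
        System.out.println(Arrays.toString(plain));
        System.out.println(Arrays.toString(sortForLocale(words, Locale.ENGLISH)));
    }
}
```

The grouping and decimal separators swap between the US and German output, and the collator puts "Cherry" after "banana" where a plain sort puts it first.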

The proposal calls for using native data sources (ICU, glibc, native Windows calls), mainly because of the size of some of the data tables needed for collation, for instance. While not optimal, that is understandable.

The proposed message formatting is another variation of the plural and gender formatting capabilities that are all the rage these days. People who have read my previous posts on this topic will know that I am no fan of this type of formatting, and my most recent experiences with email templates using plural formatting have not changed my view. Exposing stuff like this in translatable files is just utter folly, IMHO:

var pattern = '{WHO} invited {NUM_PEOPLE, plural, offset:1 other {{PERSON} and # other people}} to {GENDER, select, female {her circle} other {his circle}}'

I did hear support for this viewpoint at IUC34, and the suggestion that these strings should not be exposed in the translation files – instead, those files should contain the full set of “expanded” string variations (male/female pronouns, singular/plural cases).

But if that is the goal, I see very little point in using the message formatters in the first place. I guess it forces the developer to think about the variations, and it would keep the strings co-located in the translation files, but that’s about all.
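
To make the “expanded” alternative concrete, here is a minimal Java sketch. Everything in it is an assumption for illustration: the key names, the English wording, and above all the two-way one/many split, which is an English-only simplification (other languages need more plural categories, and the translation file for each would carry its own set of keys). The map stands in for a translation file.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ExpandedMessages {
    // Stand-in for a translation file: one complete sentence per variation,
    // so the translator never sees plural or select syntax.
    static final Map<String, String> MESSAGES = new HashMap<>();
    static {
        MESSAGES.put("invite.female.one", "{0} invited {1} to her circle");
        MESSAGES.put("invite.female.many", "{0} invited {1} and {2} other people to her circle");
        MESSAGES.put("invite.other.one", "{0} invited {1} to his circle");
        MESSAGES.put("invite.other.many", "{0} invited {1} and {2} other people to his circle");
    }

    // The variation is picked in code; numPeople counts everyone invited,
    // including the named person, so the remainder is numPeople - 1.
    static String invite(String who, String person, int numPeople, boolean female) {
        String key = "invite." + (female ? "female" : "other")
                + "." + (numPeople <= 1 ? "one" : "many");
        MessageFormat format = new MessageFormat(MESSAGES.get(key), Locale.ENGLISH);
        return format.format(new Object[] {who, person, numPeople - 1});
    }

    public static void main(String[] args) {
        System.out.println(invite("Alice", "Bob", 3, true));
        System.out.println(invite("Tom", "Ann", 1, false));
    }
}
```

The selection logic has simply moved from the pattern into the code, which is more or less my point: once the strings are expanded, the formatter itself is doing very little.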

That’s nitpicking, however, considering the huge step forward this would represent, with an experimental implementation targeted for Q4.

Android string arrays and localization January 18, 2010

Posted by globalizer in Android, Localization, Programming languages, Programming practices.

I really wish the Android developer guide would not (at least implicitly) recommend using localizable strings directly as array items.

In my opinion, any examples showing translatable strings as array items should in fact be accompanied by flashing red warning signs. Here’s why:

If the default version of a string array is updated with the addition of an array item, but the localized version has not been updated yet (very common scenario), then the menu item in question would simply be missing from the localized version (no fallback to the default version, since the string array represents the whole array). This can potentially result in serious functional issues in the localized versions of the software.

Instead, the examples should show array items as references to resources.
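
Something along these lines is what I have in mind. This is only a sketch with made-up resource names. It assumes the array definition lives only in the default `values` folder, so that each item is resolved individually and can fall back to the default string when a translation is missing:

```xml
<!-- res/values/strings.xml: translatable, one entry per menu item (hypothetical names) -->
<resources>
    <string name="menu_open">Open</string>
    <string name="menu_save">Save</string>
</resources>

<!-- res/values/arrays.xml: not sent for translation, items are references only -->
<resources>
    <string-array name="menu_items">
        <item>@string/menu_open</item>
        <item>@string/menu_save</item>
    </string-array>
</resources>
```

With this layout a newly added item shows up in every locale right away, in the default language until its translation arrives, instead of silently disappearing.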

Use a little imagination, for crying out loud December 17, 2008

Posted by globalizer in Eclipse, Programming languages.

We can probably all agree that having software fill in sensible defaults for us is a good thing – it means that we are not forced to type in or browse for directories, file names, what not.

Having said that – in the world of software development it would also be nice if everybody didn’t just blindly accept all the defaults that are offered. One example is translation file names in the Eclipse development IDE. I have yet to see a plugin where the developer did not use the default file name plugin.properties. And while that is a fine name, once you have several hundred files with that same base name it becomes just a tad tedious to sit and squint at long, long file paths that are also almost identical to try to find the one file you need.

This is probably mostly a problem for software translators, since everybody else involved with software development tends to only deal with one or two different plugins at a time. But the poor translators get to translate hundreds of plugins, and thus juggle file lists with hundreds (or even thousands) of these files. So dear developers: have a heart, use your imagination, come up with a new file name once in a while!

Can we agree that all those ISO8859_1 hacks are just that – hacks September 26, 2007

Posted by globalizer in Java, Programming languages, Unicode, web applications.

The amount of misunderstanding and misinformation surrounding the use of ISO-8859-1 character encoding in Java-based web applications (actually not just Java-based, but I will be looking at Java code here) is just incredible, and it keeps proliferating in forum postings and code snippets. Various people (I among them) keep flailing or swatting at them, but invariably someone else comes along suggesting how to make the hack work again.

The original hacks were probably legitimate since they were designed to work around the very unfortunate 8859-1 defaults used for many years in web servers, etc. (and even in parts of Java such as property files). These days, however, there is absolutely no reason to use them, and they will in fact almost inevitably cause problems at some point, when the incorrect code or configuration setting you are relying on is fixed.

Anyway, maybe I should explain which hack I am ranting about, before I go any further:

I am talking about code like this:

this.arabicDesc=new String(paymentTransactionTypeDetailVO.getArabicDesc().getBytes ("ISO8859_1"),"UTF8");

This code actually does very little, except pivot through a couple of encodings (this is relevant, but only in explaining how it can be used to “fix” strings that are incorrectly encoded to begin with). The code takes a string, creates a byte array from that string, and then creates a new string from the byte array.

Such code keeps getting posted on the Sun Java I18N forum (and all kinds of other fora), with statements like this (a composite paraphrase):

This code has been working perfectly to handle data in Russian (or Greek, or Arabic, etc.), but now that I have to do x, y or z with the data, the data appears garbled – please help.

So what is the problem, I hear you ask? The problem is that the only reason code like that “works” is that somewhere else, either in the code or in the environment (web server, Tomcat, etc.), ISO-8859-1 encoding is being used/defaulted.

Let me explain. Assume that we have a JSP-based web application, where we want to input some Arabic text. The text will be stored in a database encoded in UTF-8, and then retrieved and displayed in the JSPs again. The forum question that I link to above, and where I got the code snippet, states that the code is used when saving data to the database, and that the code below is used when retrieving data for display (and that UTF-8 is otherwise used everywhere):

paymentTransactionTypeDetailVO.setArabicDesc(new String (arabicDesc.getBytes("UTF8"),"ISO8859_1"));

Now the problem with this is that it is just not possible to take correctly encoded Arabic data, run it through code like that, and end up with useful data.

Let’s look at a practical example. Let’s say I want to save and re-display this Arabic text:

أتتعامل

I should say up front that I have no idea what it says (copied and pasted from the Arabic translation of What is Unicode, so hopefully it should not be an insult or otherwise offensive).

Since the getBytes method uses a Java string as the starting point and creates a byte array from that string, in the encoding specified, we would have the following situation (assuming the Arabic text had been correctly stored in a Java string to begin with):

This getBytes call

this.arabicDesc=new String(paymentTransactionTypeDetailVO.getArabicDesc().getBytes ("ISO8859_1"),"UTF8");

would take the Unicode code points 0623 062A 062A 0639 0627 0645 0644 and try to convert them to the corresponding bytes in the ISO-8859-1 encoding. That would fail, however, since those Unicode code points have no equivalent values in ISO-8859-1. So Java fails to convert, and substitutes question marks instead. Once that failure has happened, the original characters are lost for good: the byte array contains nothing but question marks (and Java of course has no problem creating the new string from that byte array based on the specified UTF-8 encoding, since a question mark is the same byte in both encodings). So all you end up with is a string of question marks – if, that is, the original string was correctly encoded in the first place.
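
This is easy to verify with a few lines of standalone Java. The sketch below is my own reduction of the forum snippet, not the poster's code: the class and method names are invented, and I use the `StandardCharsets` constants rather than the "ISO8859_1" and "UTF8" charset names to avoid the checked exception. The Arabic text is written as Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {
    // Applies the same pivot as the forum snippet: encode as ISO-8859-1,
    // then decode the resulting bytes as UTF-8.
    static String pivot(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The seven code points of the sample text, correctly stored in a Java string.
        String arabic = "\u0623\u062A\u062A\u0639\u0627\u0645\u0644";
        // None of them exist in ISO-8859-1, so each one is replaced by '?'.
        System.out.println(pivot(arabic));
    }
}
```

Seven Arabic letters go in, seven question marks come out.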

The reason so many people claim that such code works is that they use incorrectly encoded strings as the starting point. Let’s look at the example above again and assume that the Arabic text was actually stored incorrectly in the original Java string, something that could happen if ISO-8859-1 is used as the default web server encoding, or if you explicitly use code like this (which was what I did to test this scenario):

request.setCharacterEncoding("ISO8859_1");

If we use the same Arabic text above and assume that the UTF-8 bytes making up that text are incorrectly taken to be ISO-8859-1, then the original Java string would instead be created based on these values (actually UTF-8 bytes, but thought by Java to be 8859-1 code points, because I told it so):

D8 A3 D8 AA D8 AA D8 B9 D8 A7 D9 85 D9 84

resulting in this string: أتتعاÙ?Ù?

If you then use that input to the getBytes method, you get a byte array in 8859-1 that is identical to the original UTF-8 input, and you can then create a correct string by specifying UTF-8 as the encoding for the byte array.
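
The same round trip can be simulated without a web server. In this sketch (again with invented names, and with `StandardCharsets` constants standing in for the string charset names), `misdecode` plays the part of the misconfigured environment and `hack` is the pivot from the forum snippet:

```java
import java.nio.charset.StandardCharsets;

class MojibakeRoundTrip {
    // Simulates the misconfigured environment: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // The hack: get the original bytes back via ISO-8859-1, then decode them as UTF-8.
    static String hack(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String arabic = "\u0623\u062A\u062A\u0639\u0627\u0645\u0644";
        byte[] wire = arabic.getBytes(StandardCharsets.UTF_8); // the 14 bytes listed above

        String garbled = misdecode(wire);
        System.out.println(garbled.length());              // one char per byte
        System.out.println(hack(garbled).equals(arabic));  // the hack undoes the damage
    }
}
```

The hack only succeeds because ISO-8859-1 maps every byte value to exactly one character and back, so the wrongly decoded string still carries the original bytes. Take away the misdecoding step and you are back to question marks.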

To sum up: this kind of code can “fix” code or strings that are incorrectly encoded to begin with, but using it is like building a house of cards. Fix the environment settings or the incorrect code – don’t perpetuate the hacks.

I can has Danish LOLCODE? July 28, 2007

Posted by globalizer in Danish, Language, Programming languages.

I for one am a cat person and subscribe to the notion that there can never be too many cute kittens on the web. Since the lolcat or “cat macro” phenomenon combines cats and language, what’s not to love about it? Add a programming angle, and you get LOLCODE, an entirely new programming language (which has already spawned an Eclipse plug-in, for instance).

However, on behalf of the software localization community I have to ask:

Where is the internationalization support??

We need the ability to externalize and translate the text in code like this:

HAI

CAN HAS STDIO?

VISIBLE "HAI WORLD!"

KTHXBYE

A modern programming language really can’t afford to not include i18n and l10n as part of the basic architecture, you know 🙂

Flex developers sure are agile though… March 2, 2007

Posted by globalizer in Locales, Localization, Programming languages.

I’m impressed to see a comment from a Flex developer the day after I posted about the localization support. And even more impressed to see that the next version will actually provide exactly the kind of support I was looking for.

If that kind of attentiveness is an indication of their support in general, then I give them a thumbs up…

Flex – not so flexible, it seems March 1, 2007

Posted by globalizer in Locales, Localization, Programming languages.

I’m taking responsibility for the globalization and localization of some new applications, and just trying to figure out what they do, the programming platform, file types, etc. It turns out that one of them is apparently using Adobe Flex to provide a rich web client interface (yes, I know – sounds like a marketing blurb).

I have to admit that until a few days ago I had not even heard of Flex, but I guess I will have to dig into it. My first discoveries are not promising, but also not unexpected: when was the last time you saw a new technology introduced which had good, built-in support for localization right from the get-go?

If I am reading the documentation and the fora correctly, then you have to compile a separate SWF file per locale, and there is no built-in way of selecting the appropriate SWF file based on a locale parameter. Um, that’s like creating a separate executable per language, right? You compile your translation files into your source code, so if you want to change your source, you have to recompile 26 language versions? Isn’t that so, what should I say, ’80s?

I may be unfair here, simply because I have not delved deeply enough into the documentation, and I sure would love to be proven wrong.