
Property resource bundle encoding February 8, 2007

Posted by globalizer in Java, Localization, Translation, Unicode.

Just yesterday I had a brief discussion with a development team about the encoding of Java properties files. The team had encoded the English source files in ISO-8859-1 and used various characters outside the range of the lower 128 ASCII set (sometimes referred to as “US-ASCII”, or just ASCII, since those are the characters from the original ASCII character encoding) in the files.

When I told the team that for translation purposes they needed to either remove the characters beyond the lower 128 set, or convert the files to UTF-8 if they absolutely needed those characters, they complained and pointed to the Java documentation here:

When saving properties to a stream or loading them from a stream, the ISO 8859-1 character encoding is used. For characters that cannot be directly represented in this encoding, Unicode escapes are used; however, only a single ‘u’ character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.

I can see how this could easily lead to a belief that property resource bundles have to be encoded in 8859-1, and also a belief that such files should be able to contain any character from that set.

There are actually two distinct issues involved here:

  1. The encoding of and specific characters used in translation files, and how those items impact translatability
  2. The Java specification on property resource bundle encoding, and what exactly that means with respect to the encoding of such files

With respect to translatability, the problem with 8859-1 is that it can represent only the Latin-1 character set (Western European languages, more or less). So if you need to translate into other languages like Chinese, Thai, Hebrew, etc., the non-ASCII characters will not be represented correctly in the code pages used by those languages. And since our translation tools use the English source file as the starting point, we need the source file either to contain only the ASCII set that is common across code pages, or to be encoded in Unicode.

The easiest solution is usually to simply use only the ASCII set in the source files (eliminate “fancy quotes”, for instance). If you do that, your source files can be used across legacy code pages, and they can also masquerade as UTF-8 files (since the ASCII set is unchanged in UTF-8).
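That masquerading works because, for pure-ASCII text, the encoded bytes are identical in US-ASCII, ISO-8859-1, and UTF-8. A minimal sketch illustrating the point (the property line is made up):

```java
import java.util.Arrays;

public class AsciiCompatDemo {
    public static void main(String[] args) throws Exception {
        // A properties line that uses only characters from the lower 128 ASCII set
        String line = "greeting=Hello, world";

        byte[] ascii = line.getBytes("US-ASCII");
        byte[] latin1 = line.getBytes("ISO-8859-1");
        byte[] utf8 = line.getBytes("UTF-8");

        // For pure ASCII, all three encodings produce the same byte sequence
        System.out.println(Arrays.equals(ascii, latin1)); // true
        System.out.println(Arrays.equals(ascii, utf8));   // true
    }
}
```

The moment a character outside the lower 128 appears (a curly quote, say), the three byte sequences diverge, and the file can no longer be read interchangeably.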

If for some reason you really really like those fancy quotes (or maybe need to use a copyright sign), then you need to convert your source file to Unicode, and then only update the files using a Unicode-enabled editor – an editor that is not Notepad, by the way, or any other editor that automatically inserts a BOM in UTF-8 files.

The second issue – the encoding Java uses to read and save properties – can easily cause confusion because of the way the specification is worded (and it is of course totally crazy that this was ever designed to use 8859-1 in the first place). All it really means, however, is that files encoded in 8859-1 can be read as-is, without any conversion. Files containing characters from other character sets have to have those characters converted to Unicode escape sequences (which can easily be done using native2ascii). And it would of course be impossible to provide Java applications in anything but Western European languages if properties files really could use only the 8859-1 character set.
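To see what the escaped form looks like in practice, here is a minimal sketch (the key name and text are my own invention) showing that Properties.load turns a \u00a9 escape back into the copyright sign when the file is read:

```java
import java.io.ByteArrayInputStream;
import java.util.Properties;

public class EscapeDemo {
    public static void main(String[] args) throws Exception {
        // One line of an ASCII-escaped properties file, as native2ascii would produce it.
        // In Java source, "\\u00a9" puts the six characters \u00a9 into the string.
        String source = "notice=Copyright \\u00a9 2007";

        Properties p = new Properties();
        // load() always reads the bytes as ISO 8859-1 and resolves \uXXXX escapes itself
        p.load(new ByteArrayInputStream(source.getBytes("ISO-8859-1")));

        // The loaded value is "Copyright © 2007"
        System.out.println(p.getProperty("notice"));
    }
}
```

Note that it is Properties.load, not the encoding, that resolves the escape: the file on disk stays pure ASCII, which is exactly why the escaped form travels safely across code pages.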

Update: With Java 6 you can construct a PropertyResourceBundle with a Reader, which lets you use UTF-8 directly.


Comments»

1. Josep Condal - February 16, 2007

We have found that what works best is to check localized files into the version control system directly in ASCII (i.e. with Unicode escape sequences).

The approach is to enforce that the output from the translation process is directly ASCII, preferably produced by the CAT tool itself whenever possible.

The advantages are:

1) Files can be read by the Java applications without developers or the people creating the installable packages having to be aware of the need to run native2ascii.

2) Developers are not tempted to tweak the files, especially for Asian languages. This way, changes to the files have to be made via the translation memory tool (and therefore important late changes are not lost for the next release).

Basically the whole idea is that when using Translation Memory systems (a must), the localized properties files are merely an output, i.e. they are not source files anymore (contrary to what developers think). In this case the real source file that needs to be maintained is the translation memory.

2. globalizer - February 16, 2007

Yes, agree totally with the statements about the localized files. I also check the ASCII escaped Unicode versions into our version control system. And never tire of telling developers that any changes they make to translated files will be lost the next time translators send back an updated file – since they work only on the basis of source files and are not even aware of any updates made to translation files outside of the TM system.

3. tony - March 12, 2007

I’m an i18n n00b… bear with me.

So what you’re saying is that
1) all properties files should be in pure ASCII, though the specification says 8859-1.
2) if you need something that can’t be represented in ASCII, then escape it using native2ascii.

4. globalizer - March 12, 2007

Tony,
Yes, that is in essence what I am saying.
The reason I am saying “don’t use 8859-1” is really only related to translation, however. If you have properties files that will not be translated into any language outside of the Latin-1 set (supported by 8859-1), then you can go ahead and keep them in 8859-1, that will work. But since it is just as easy to maintain the files in UTF-8, and since that will cover all languages, why not do it from the outset? Implementing a step that runs native2ascii in your build is a 2-minute operation, so it shouldn’t be an issue (and remember, the ASCII-escaped version is only needed for the version you use in your application).

5. Chinni - May 30, 2008

Can you give me an example of ASCII and Unicode text? Using ascii2native, is the Unicode generated?

6. globalizer - May 30, 2008

Chinni, I recommend that you look at the syntax for native2ascii here: http://java.sun.com/javase/6/docs/technotes/tools/windows/native2ascii.html

You can convert back and forth between native encodings/Unicode and the ASCII-escaped format.

7. Ivošek - February 1, 2009

You can also use http://itpro.cz/juniconv/ for converting Unicode to ASCII and vice versa.

