
Can we agree that all those ISO8859_1 hacks are just that – hacks

September 26, 2007

Posted by globalizer in Java, Programming languages, Unicode, web applications.

The amount of misunderstanding and misinformation surrounding the use of the ISO-8859-1 character encoding in Java-based web applications (not just Java-based, actually, but I will be looking at Java code here) is just incredible, and it keeps proliferating in forum postings and code snippets. Various people (myself among them) keep swatting at these hacks, but invariably someone else comes along suggesting how to make them work again.

The original hacks were probably legitimate since they were designed to work around the very unfortunate 8859-1 defaults used for many years in web servers, etc. (and even in parts of Java such as property files). These days, however, there is absolutely no reason to use them, and they will in fact almost inevitably cause problems at some point, when the incorrect code or configuration setting you are relying on is fixed.

Anyway, maybe I should explain which hack I am ranting about, before I go any further:

I am talking about code like this:

this.arabicDesc=new String(paymentTransactionTypeDetailVO.getArabicDesc().getBytes ("ISO8859_1"),"UTF8");

This code actually does very little, except pivot through a couple of encodings (this is relevant, but only in explaining how it can be used to “fix” strings that are incorrectly encoded to begin with). The code takes a string, creates a byte array from that string, and then creates a new string from the byte array.
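
Spelled out step by step, the one-liner boils down to something like this (just a sketch; the class and method names are mine, and a plain String parameter stands in for the value-object call in the original):

import java.io.UnsupportedEncodingException;

public class PivotSketch {
    // Equivalent, spelled-out version of the one-liner above (illustrative only)
    static String pivot(String original) throws UnsupportedEncodingException {
        byte[] latin1Bytes = original.getBytes("ISO8859_1"); // step 1: string -> bytes, encoded as ISO-8859-1
        return new String(latin1Bytes, "UTF8");              // step 2: bytes -> string, decoded as UTF-8
    }
}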

Such code keeps getting posted on the Sun Java I18N forum (and all kinds of other fora), with statements like this (a composite paraphrase):

This code has been working perfectly to handle data in Russian (or Greek, or Arabic, etc.), but now that I have to do x, y or z with the data, the data appears garbled – please help.

So what is the problem, I hear you ask? And the problem is that the only reason code like that “works” is that somewhere else, either in the code or in the environment (web server, Tomcat, etc.), ISO-8859-1 encoding is being used or defaulted to.

Let me explain. Assume that we have a JSP-based web application, where we want to input some Arabic text. The text will be stored in a database encoded in UTF-8, and then retrieved and displayed in the JSPs again. The forum question that I link to above, and where I got the code snippet, states that the code is used when saving data to the database, and that the code below is used when retrieving data for display (and that UTF-8 is otherwise used everywhere):

paymentTransactionTypeDetailVO.setArabicDesc(new String (arabicDesc.getBytes("UTF8"),"ISO8859_1"));

Now the problem with this is that it is just not possible to take correctly encoded Arabic data, run it through code like that, and end up with useful data.

Let’s look at a practical example. Let’s say I want to save and re-display this Arabic text:

أتتعامل

I should say up front that I have no idea what it says (copied and pasted from the Arabic translation of What is Unicode, so hopefully it should not be an insult or otherwise offensive).

Since the getBytes method uses a Java string as the starting point and creates a byte array from that string, in the encoding specified, we would have the following situation (assuming the Arabic text had been correctly stored in a Java string to begin with):

This getBytes method
this.arabicDesc=new String(paymentTransactionTypeDetailVO.getArabicDesc().getBytes ("ISO8859_1"),"UTF8");
would take the Unicode code points 0623 062A 062A 0639 0627 0645 0644 and try to map each of them to ISO-8859-1. That fails, however, since none of those characters exist in ISO-8859-1, so Java substitutes a question mark for each of them. Once that has happened, the original data is gone for good (and Java of course has no problem decoding the resulting byte array of plain question marks as UTF-8). So all you end up with is a string of question marks – if, that is, the original string was correctly encoded in the first place.
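
You can see this in a few lines of stand-alone Java (the class name is mine, and the Arabic word is written as Unicode escapes so the source file encoding does not matter):

import java.util.Arrays;

public class HackOnCorrectData {
    public static void main(String[] args) throws Exception {
        // The Arabic word from above, correctly stored as a Java string
        String arabic = "\u0623\u062A\u062A\u0639\u0627\u0645\u0644";

        // None of these characters exist in ISO-8859-1, so the encoder
        // substitutes a question mark (0x3F) for every one of them
        byte[] latin1Bytes = arabic.getBytes("ISO8859_1");
        System.out.println(Arrays.toString(latin1Bytes)); // [63, 63, 63, 63, 63, 63, 63]

        // Decoding plain ASCII question marks as UTF-8 works fine,
        // but the Arabic is already gone
        System.out.println(new String(latin1Bytes, "UTF8")); // ???????
    }
}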

The reason so many people claim that such code works is that they use incorrectly encoded strings as the starting point. Let’s look at the example above again and assume that the Arabic text was actually stored incorrectly in the original Java string, something that could happen if ISO-8859-1 is used as the default web server encoding, or if you explicitly use code like this (which is what I did to test this scenario):

request.setCharacterEncoding("ISO8859_1");

If we use the same Arabic text above and assume that the UTF-8 byte sequence making up that text is incorrectly interpreted as ISO-8859-1, then the original Java string would instead be created from these bytes (actually UTF-8, but treated by Java as ISO-8859-1, because I told it so):

D8 A3 D8 AA D8 AA D8 B9 D8 A7 D9 85 D9 84

resulting in this string: Ø£ØªØªØ¹Ø§Ù?Ù?

If you then feed that string to the getBytes method, you get a byte array in 8859-1 that is identical to the original UTF-8 input, and you can then create a correct string by specifying UTF-8 as the encoding for the byte array.
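
Here is that whole round trip in stand-alone Java, first simulating the misconfiguration and then applying the hack (class name mine, Arabic again written as Unicode escapes):

public class HackOnMojibake {
    public static void main(String[] args) throws Exception {
        // Simulate the broken setup: the browser sends UTF-8 bytes,
        // but the container decodes them as ISO-8859-1
        byte[] utf8Bytes = "\u0623\u062A\u062A\u0639\u0627\u0645\u0644".getBytes("UTF8");
        String mojibake = new String(utf8Bytes, "ISO8859_1");
        System.out.println(mojibake); // the garbled string shown above

        // The hack: ISO-8859-1 maps every byte to a character and back without loss,
        // so re-encoding hands back the original UTF-8 bytes, and decoding those
        // as UTF-8 produces the correct Arabic
        byte[] recovered = mojibake.getBytes("ISO8859_1");
        System.out.println(new String(recovered, "UTF8")); // أتتعامل
    }
}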

To sum up: this kind of code can “fix” code or strings that are incorrectly encoded to begin with, but using it is like building a house of cards. Fix the environment settings or the incorrect code – don’t perpetuate the hacks.
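
For completeness, here is a sketch of the kind of fix I mean instead of the hack: make sure the request is actually decoded as UTF-8 before any parameter is read, for example with a simple servlet filter (the class name is mine, and in a real application you would of course also take care of the response encoding, the JSP page encoding and the database connection):

import java.io.IOException;
import javax.servlet.*;

// Illustrative filter that forces UTF-8 decoding of request parameters,
// so no re-encoding hack is needed further down the stack
public class Utf8EncodingFilter implements Filter {
    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Must run before the first call to getParameter(), or it has no effect
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(request, response);
    }

    public void destroy() {}
}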


Comments»

1. warrior - July 9, 2009

Hi Elsebeth Flarup,

I think you have written very good article.
I was very much clear that all the characters are a codepoints in computer and that codepoint must always be accompanied with the encoding type. Like, suppose a codepoint D8 but if we donot know the encoding in which the sender has sent it, we will not be able to get exactly what it should be from the senders point of view.

So, whenever I have a byte array, I must know its encoding beforehand in order to end up with a correct Unicode string in Java.

Then suddenly I started seeing code like the one you mention, new String(String.getBytes("ISO_8859-1"), "UTF8"); and this confused me, because I could not find an answer to why it works.

Somebody told me that getBytes("ISO_8859-1") simply cuts the string into individual bytes, with each byte value being a code point, and that this is why it works.

But I think your explanation sounds more logical.

Please do reply – I would like to join you in making this mystery clearer to the mass of Java people.

2. Arralaalitomy - December 10, 2009

I’m frequently looking for brand-new posts on the internet about this issue. Thanks!!

3. Mohammed Hewedy - April 28, 2010

Very good explanation.
BTW, the Arabic word is not an insult; it means “deals with”.

4. globalizer - April 28, 2010

Thanks Mohammed – good to know 🙂

