UTF-8 and string length limitations January 16, 2007

Posted by globalizer in International requirements, Java, Unicode.

One of the more intractable problems you invariably run into when implementing multilingual web applications with UTF-8 as the database encoding is length limits on input strings: how to enforce them, and above all how to communicate them to the end user.

Let’s say we have an input field where we allow a user to enter a name of their choice, which we will then go ahead and store in our database. The database column that will store the user input will be defined as a VARCHAR column (I am using DB2 terms here, but other databases have similar definitions), and let’s say that we define it with a maximum length of 16. This is of course entirely arbitrary, but useful for demo purposes (and we might have a back-office application with this kind of ridiculous length limitation). Since our database is defined with UTF-8 as the character encoding, this means that we have a maximum length of 16 bytes, not 16 characters. And this, in turn, is what causes us trouble when we try to communicate the length restriction to our end-user.

First off, we can’t tell users how many characters they can enter, since that depends entirely on which characters they enter. A standard US English user entering nothing but US-ASCII characters gets 16 characters. However, as soon as we move to languages like French or Spanish, with accented characters sprinkled here and there, we get the odd 2-byte character to account for, and with East Asian or Indic characters we move into 3-byte territory. I doubt anybody in their right mind would want to present users with a statement like this:

You can enter up to a maximum of 16 characters, depending on the type of characters you use. Each US ASCII character counts as one, each accented or special character from Western and Eastern European languages (Latin1 and Latin2) counts as 2, as do all Greek, Cyrillic and Turkic characters, while…

– well, you get the point.
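The byte widths behind that hypothetical message can be checked with a short sketch. The class and method names here (Utf8Widths, utf8Length) are made up for illustration; the only library call is String.getBytes with StandardCharsets.UTF_8. The sample characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): how many bytes does a string need in UTF-8?
class Utf8Widths {

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "a",            // US-ASCII letter: 1 byte
            "\u00E9",       // e with acute accent: 2 bytes
            "\u0444",       // Cyrillic ef: 2 bytes
            "\u0918",       // Devanagari gha: 3 bytes
            "\u5927",       // CJK ideograph: 3 bytes
            "\uD800\uDF45"  // Gothic letter U+10345, a supplementary character: 4 bytes
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + utf8Length(sample) + " byte(s)");
        }
    }
}
```

A 16-byte column therefore holds anywhere between 4 and 16 of these characters, depending on the mix.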

So, we can say nothing up front, and just count the bytes as the user enters the text (which should be doable with some kind of AJAX scripting), displaying an error message when the count exceeds the limit of 16 bytes. The problem is that while it is easy to detect when the byte limit has been exceeded, telling the user what to do about it is again difficult.

Let’s say the user entered the following string: øen på 大笔

This is just fine: at 15 bytes it is within the limit. Then the user adds a character from an Indic script, and the string (øen på 大笔घ) is suddenly 18 bytes long. Your validation routine kicks in, and you display a message saying that the text exceeds the length limit and that the user needs to shorten it by one character. Your user dutifully goes back and removes not the 3-byte character just added but the 1-byte character ‘n’, so the string now looks like this: øe på 大笔घ. It is 17 bytes long, and still over the limit.

Not a happy experience for either you or your users…
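The byte counts in this scenario are easy to reproduce. The following is only a sketch, assuming the 16-byte limit from the example; the names ByteLimitCheck, MAX_BYTES and bytesOver are invented here and are not part of any framework.

```java
import java.nio.charset.StandardCharsets;

// Illustrative check (hypothetical names) for the 16-byte column in the example.
class ByteLimitCheck {

    static final int MAX_BYTES = 16; // assumed limit, taken from the VARCHAR(16) example

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes by which the string exceeds the limit, or 0 if it fits. */
    static int bytesOver(String s, int maxBytes) {
        return Math.max(0, utf8Length(s) - maxBytes);
    }

    public static void main(String[] args) {
        String original = "\u00F8en p\u00E5 \u5927\u7B14";         // the first string: 15 bytes
        String withIndic = original + "\u0918";                     // Devanagari gha added: 18 bytes
        String afterFix = "\u00F8e p\u00E5 \u5927\u7B14\u0918";   // 'n' removed: 17 bytes

        for (String s : new String[] {original, withIndic, afterFix}) {
            System.out.println(utf8Length(s) + " bytes, " + bytesOver(s, MAX_BYTES) + " over the limit");
        }
    }
}
```

Reporting the overflow in bytes is accurate, but it is not something a user can act on without knowing how many bytes each character takes, which is exactly the communication problem described above.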

As an alternative, we could simply ask ourselves: what is the maximum number of characters we can fit into this field if every character takes up the maximum number of bytes? The answer here is 4, since a character occupies at most 4 bytes in UTF-8 and 16 / 4 = 4. However, this would ensure that almost no user could ever take full advantage of the field’s capacity, since only supplementary characters (code points above U+FFFF) occupy 4 bytes, and their use is still incredibly rare. It would also give users of East Asian scripts an advantage, since each of their characters conveys much more meaning than, say, a Latin character: a Chinese string of 4 characters can make up an entire sentence, while the same is rarely true of a 4-character English string.

On the other hand, it would be an easy limit to implement and convey to the user. Simply say “Maximum number of characters: 4” on the screen, and implement a check that verifies the number of characters, based on something like this:

String s = request.getParameter("name");
out.println("s = " + s);
out.println("<br/>" + "s.length() = " + s.length());
char[] charArr = s.toCharArray();
out.println("<br/>" + "charArr.length = " + charArr.length);

Or actually, an amended version of this code – since it does not handle supplementary characters correctly. If you implement the code in a servlet and input a string like this:

瘌фγε

with a nice mix of Chinese, Cyrillic and Greek characters, the code will correctly report that you entered 4 characters. But if you add a character like this: 𐍅 , a Gothic character, it will suddenly report that you entered 6 characters. This happens because the length() method counts char units (UTF-16 code units), not code points, and a supplementary character takes two of them (a surrogate pair). So to get the supplementary Gothic character counted correctly, we would need to add a method for counting code points (codePointCount, added in Java 5), like this:

String s = request.getParameter("name");
out.println("s = " + s);
// length() counts UTF-16 code units (Java chars): a supplementary character counts as 2
int characterCount = s.length();
// codePointCount() counts code points: a supplementary character counts as 1
int codepCount = s.codePointCount(0, characterCount);
out.println("<br/>" + "s.length() = " + s.length());
out.println("<br/>" + "codePointCount() = " + codepCount);
char[] charArr = s.toCharArray();
out.println("<br/>" + "charArr.length = " + charArr.length);

That should produce output like this:

s = 瘌фγε𐍅
s.length() = 6
codePointCount() = 5
charArr.length = 6
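Turning that count into the actual check could look roughly like the sketch below. It assumes the 16-byte column and the worst case of 4 bytes per code point discussed above; the class, constant and method names (CodePointLimit, MAX_CODE_POINTS, fitsLimit) are hypothetical. Note also that it counts code points, which are not always what a user perceives as one character (a base letter followed by a combining accent is two code points).

```java
// Illustrative validator (hypothetical names): enforce a limit in code points
// that is guaranteed to fit the byte limit of the column.
class CodePointLimit {

    static final int MAX_BYTES = 16;               // assumed column size from the example
    static final int MAX_BYTES_PER_CODE_POINT = 4; // worst case in UTF-8
    static final int MAX_CODE_POINTS = MAX_BYTES / MAX_BYTES_PER_CODE_POINT; // 4

    /** True if the string has at most MAX_CODE_POINTS code points. */
    static boolean fitsLimit(String s) {
        return s.codePointCount(0, s.length()) <= MAX_CODE_POINTS;
    }

    public static void main(String[] args) {
        String four = "\u760C\u0444\u03B3\u03B5"; // the 4-character Chinese/Cyrillic/Greek string
        String five = four + "\uD800\uDF45";      // plus the Gothic letter U+10345
        System.out.println(fitsLimit(four)); // true
        System.out.println(fitsLimit(five)); // false: 5 code points, although length() is 6
    }
}
```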

So, to sum up:

You can either waste a large number of the bytes that you have defined for each field that allows user entry, or you can try to communicate some of the intricate issues related to character encoding to your users.

In practical terms, even if you want to use the strict character limit, based on maximum possible string length, you should be able to use a multiplier of 3 rather than 4 – since the supplementary characters are not likely to be used in practice.

Update, Nov. 2010: High time to obsolete that last statement – supplementary characters are probably very much found in the wild now, so ignoring them won’t cut it.
