
Unicode – does size matter? December 7, 2006

Posted by globalizer in Unicode.

The quick answer to that is: no – because it’s not as big as you think it is.

This is one of the follow-up posts on excuses for not using Unicode that I promised here. And it is probably the one excuse heard most often – “we can’t use Unicode because our database/files/web site will get too big”.

You might hear this from people who currently use only US ASCII characters, and who complain that migrating to Unicode will make everything twice as big. The response there is of course “use UTF-8”, since that will change absolutely nothing: the size and even the content of the files will remain exactly the same (as long as you don’t use software that automatically inserts a byte order mark, or BOM, in UTF-8 files; see why you shouldn’t do that here).
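A minimal sketch in Python of the ASCII case. The sample string is made up for illustration; `utf-8-sig` is the name of Python's codec that prepends a BOM, used here only to show the 3 extra bytes such software adds.

```python
import codecs

# Hypothetical sample: any text that uses only US ASCII characters.
text = "Plain US ASCII text, nothing else."

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII content the two encodings are byte-for-byte identical.
print(ascii_bytes == utf8_bytes)

# A BOM-inserting writer adds 3 bytes (EF BB BF) at the start of the file.
with_bom = text.encode("utf-8-sig")
print(with_bom[:3] == codecs.BOM_UTF8)
print(len(with_bom) - len(utf8_bytes))

# The "twice as big" worry only applies to UTF-16 (2 bytes per character here).
utf16_bytes = text.encode("utf-16-le")
print(len(utf16_bytes), "bytes in UTF-16 vs", len(utf8_bytes), "in UTF-8")
```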

For other scripts the issue becomes more real. If we stick to UTF-8, which is definitely the Unicode encoding that dominates on the web, and which is increasingly also the preferred database encoding (with UTF-16 remaining the preferred processing encoding), then various scripts will take up more space in Unicode than in the various legacy encodings:

Cyrillic, Greek, Arabic, Hebrew, Syriac, Thaana and Armenian characters each take up 2 bytes, as do Latin script characters outside the US ASCII range. For some of these languages UTF-8 thus means a doubling of bytes needed for storage, compared to legacy encodings, while for others (Latin-based languages, where only some of the characters take up 2 bytes), there would be a modest increase.

Since legacy encodings for the East Asian scripts used to write Chinese, Korean and Japanese were already multibyte, going to UTF-8 means (in general) an increase from 2 bytes per character to 3 bytes per character.

For the rest of the world’s scripts (those used for Indic languages, Georgian, Mongolian, Tibetan, Vietnamese, and so on) the impact is worse: from 1 byte (in the cases where a legacy encoding exists) to 3 bytes per character. For Vietnamese, which is written in Latin script, this applies only to the many tone-marked letters; the unaccented letters stay at 1 byte. For practical purposes we don’t need to talk about the characters that need 4 bytes in UTF-8, since they are so rare.

So it would seem that migrating to UTF-8 from legacy encodings could be a major issue from a storage and size perspective, right? Well, no, not really. Because:

  1. Storage is cheap these days.
  2. Compared to images, text takes up small amounts of space.
  3. Even if your content is in one of the languages where each character uses 3 bytes, most of your text will actually be ASCII – in the form of markup, formatting and code.
  4. Some languages, because of the way their writing systems are structured, use fewer characters to convey the same meaning.

At the latest Internationalization and Unicode Conference, Mark Davis presented some interesting data from Google that illustrates item # 3. You can find a link to the presentation Unicode at Google on his web site. Pages 15 and 16 contain the interesting bits in this connection: approximately 60% of the text on the web consists of markup, another 15 to 20% is “common” (punctuation, etc.).
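To see what that means for total size, here is a back-of-the-envelope sketch. It assumes (my assumption, not something stated in the presentation) that the markup and “common” shares can be treated as the fraction of a page's legacy-encoded bytes that is plain ASCII, and that only the remaining script-specific text grows. The function name `growth_factor` and the 0.75 figure are illustrative.

```python
def growth_factor(ascii_fraction: float,
                  legacy_bytes_per_char: int,
                  utf8_bytes_per_char: int) -> float:
    """Estimated size of a page in UTF-8 relative to its legacy encoding.

    ascii_fraction: share of the legacy-encoded bytes that is ASCII
    (markup, punctuation, code) and therefore does not grow at all.
    """
    script_fraction = 1.0 - ascii_fraction
    return ascii_fraction + script_fraction * (utf8_bytes_per_char / legacy_bytes_per_char)


# Assumed: about 75% of the bytes are ASCII (60% markup plus 15% "common").
ASCII_SHARE = 0.75

print("1 -> 2 bytes (e.g. Cyrillic):", growth_factor(ASCII_SHARE, 1, 2))
print("2 -> 3 bytes (e.g. CJK):     ", growth_factor(ASCII_SHARE, 2, 3))
print("1 -> 3 bytes (e.g. Indic):   ", growth_factor(ASCII_SHARE, 1, 3))
```

Under these assumptions even the worst case (1 byte to 3 bytes per character) makes the page about 1.5 times as large rather than 3 times, and the CJK case comes out at roughly 1.125 times.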

Mark Davis also provided an interesting set of figures related to item # 4, in a posting to the unicode mailing list a few months ago. As you can see from his example – where he looked at the declaration of human rights in various languages, stripped out all markup and formatting, and then calculated how many characters and how many bytes in UTF-16 and UTF-8 respectively each language used – there is not necessarily a one-to-one correlation between “large size in bytes” for each translation and “many bytes per character in UTF-8” for the language.
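The same kind of count is simple to reproduce on any text. The sketch below is not Mark Davis's data or script; it just applies the idea to two short sample sentences that I believe are the opening of Article 1 of the declaration in English and in Chinese. `size_report` is a hypothetical helper name.

```python
def size_report(text: str) -> dict:
    """Count characters (code points) and encoded bytes for a piece of text."""
    return {
        "characters": len(text),
        # "utf-16-le" avoids counting a 2-byte BOM in the total.
        "utf16_bytes": len(text.encode("utf-16-le")),
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Sample sentences chosen for illustration only.
english = "All human beings are born free and equal in dignity and rights."
chinese = "人人生而自由,在尊严和权利上一律平等。"

for name, text in (("English", english), ("Chinese", chinese)):
    print(name, size_report(text))
```

Although the Chinese text needs about 3 bytes per character in UTF-8 and the English text only 1, the Chinese sentence uses so many fewer characters that it ends up with fewer UTF-8 bytes overall.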

Taking all these items into account (and of course all the advantages that you get from using Unicode) it is very difficult to take the size issue seriously. At least as long as we are not talking about some of the really constrained platforms, such as mobile phones, etc. In those environments you do have to look at almost every byte and be miserly about it (which doesn’t mean you shouldn’t use Unicode, you just have to be more concerned with size).

Update Dec. 22, 2006: Mark Liberman made some of the same points concerning cost and ratio of text to images etc. on Language Log a long time before I posted this. He also has some comparisons of specifically Chinese and English.


Comments»

1. Matt Giuca - July 9, 2008

Excellent post. Someone needs to convince programmers to write Unicode-aware software. (It really helps that modern languages are implementing built-in Unicode support).

Your list of 4 reasons why UTF-8 is better than legacy encodings is good. I might also add that any compression scheme worth its salt will typically compress a UTF-8 document in a “higher plane” script into about the same size as any other encoding.

2. globalizer - July 9, 2008

Good point about compression, I agree.

