Ask not “why Unicode?” – ask “why not Unicode?” August 31, 2006

Posted by globalizer in International requirements, Unicode.

That, in my opinion, is how any new software project should be approached. If you are starting from a blank slate there really are no good excuses for not using Unicode (I will talk about some of those excuses in a later post). On the other hand, there are a slew of good reasons in favor of using Unicode – here are a few of the most important ones:

  • If you want to be able to handle languages such as Hindi, Telugu, Georgian and Armenian on standard platforms such as Microsoft Windows, you have no choice – those languages are supported as Unicode-only locales and have no “legacy” code page support.
  • If you want to be able to store or display data from several different character sets on the same page/in the same database in a remotely sensible fashion (see Note 1), then you have no choice.
  • If you use Unicode in all tiers and across all components, you don’t have to worry about changing charset tags or environment settings for various countries or performing conversions – you can use the same encoding everywhere.
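The first two points can be checked with a small Python sketch. The sample strings and the short list of legacy code pages are my own picks for illustration, not an exhaustive survey; the point is only that none of these single-byte encodings can hold the text, while UTF-8 holds all of it.

```python
# Hypothetical sample strings; any text in these scripts should behave the same way.
samples = {
    "Hindi": "हिन्दी",
    "Georgian": "ქართული",
    "Greek + Cyrillic": "Ελληνικά и русский",
}

# A few legacy single-byte code pages, using Python's codec names
# (Windows Western European, ISO Cyrillic, ISO Greek).
legacy_codecs = ["cp1252", "iso8859_5", "iso8859_7"]


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Map each sample to {codec name: whether the sample fits in that codec}.
results = {
    name: {codec: can_encode(text, codec) for codec in legacy_codecs + ["utf-8"]}
    for name, text in samples.items()
}

for name, row in results.items():
    print(name, row)
```

Note the mixed Greek and Cyrillic sample: each half fits in its own ISO code page, but no single legacy code page in the list can store the whole string.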

Even if you have to support a bunch of legacy applications/environments it makes good sense to use Unicode for new components and simply create interfaces to the old legacy systems. Apparently this is not yet universally accepted, however, as exemplified by this plea for help in defending Unicode that I saw a few months ago:

We are in the process of implementing a major Data management system. Our existing database is Unicode the target server that IT manages is not. IT refuses to import the data saying they will not switch to Unicode. What is the disadvantage of building our new oracle server to accept unicode? We have been battling this for 3 weeks and our IT oracle DBA’s are not wanting to change. There argument is we never had to use unicode before.

Even though part of my job is to “evangelize” globalization and localization across development organizations, I have not been in the position of having to defend the use of Unicode as such for quite some time now; nobody questions that premise. The discussions have moved on to implementation-specifics – should the DB encoding be UTF-8 or UTF-16, what should the data model look like for multilingual data, etc. So it is very easy to forget that Unicode is still viewed as something that is superfluous or optional by many people, even IT professionals.

Note 1: The legacy of ISO-8859-1 being the de facto default encoding on the web for many years is still being felt, so unfortunately it is still not uncommon to see workarounds that use Numeric Character References to store non-Latin1 characters in databases defined with 8859-1 as the character set. The additional storage requirement this imposes (probably at least 3 or 4 times higher than UTF-8) is one issue, but more important would be search and collation problems.
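A rough illustration of the note above, in Python. The sample word and the two-letter sort example are hypothetical, and the exact size ratio depends on the script (this one happens to be Cyrillic); treat it as a sketch of the problem rather than a measurement.

```python
import html

text = "Привет"  # hypothetical sample: six Cyrillic letters

# What an ISO-8859-1 database ends up holding when NCRs are the workaround.
ncr_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
utf8_bytes = text.encode("utf-8")

ncr_size = len(ncr_bytes)    # 42: each letter becomes 7 bytes, e.g. b"&#1055;"
utf8_size = len(utf8_bytes)  # 12: each letter is 2 bytes in UTF-8

# Search: a plain substring match against the stored form finds nothing
# unless the stored value is decoded (or the query is NCR-encoded) first.
stored = ncr_bytes.decode("iso-8859-1")
naive_hit = "ив" in stored
decoded_hit = "ив" in html.unescape(stored)

# Collation: stored NCRs sort by the digits of the reference, so the order
# can differ even from plain code point order (Ω is U+03A9, ж is U+0436).
letters = ["ж", "Ω"]
by_code_point = sorted(letters)
by_stored_ncr = sorted(
    letters,
    key=lambda ch: ch.encode("iso-8859-1", errors="xmlcharrefreplace"),
)

print(ncr_size, utf8_size, naive_hit, decoded_hit)
print(by_code_point, by_stored_ncr)
```

For this sample the NCR form is 3.5 times the size of the UTF-8 form, the naive search misses, and the two sort orders disagree.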

Comments»

1. a name - September 16, 2006

Unicode usage is yet another one of those “U.S.” vs. the world issues that’s in the same ballpark as other good but unpopular ideas such as universal metrication, A4 paper, ISO dates, 24-hour time (not to mention UTC time), etc.

2. globalizer - October 6, 2006

Fortunately Unicode usage is rapidly being adopted “under the covers” by all large companies involved in IT. I can’t think of a single major company that has not adopted Unicode for everything but their legacy systems (and even in that area some are making major conversion efforts).

3. Unicode - does size matter? « Musings on software globalization - December 7, 2006

[…] This is one of the follow-up posts on excuses for not using Unicode that I promised here. And it is probably the one excuse heard most often – “we can’t use Unicode because our database/files/web site will get too big”. […]

