
So you think that charset= value is going to help you?

March 8, 2007

Posted by globalizer in encoding, Language, Localization, web applications.

I just ran across a real-life example of the pitfalls involved in relying on the charset value in HTML pages to tell you what the encoding is.

It is not unreasonable, of course, to assume that the encoding of the text actually corresponds to the declared charset value; it is just not very realistic.

In their presentations at the most recent Unicode conference, both Mark Davis from Google and Addison Phillips from Yahoo highlighted the fact that a great many pages are either untagged or mistagged.

A recent question on Sun’s Java i18n forum about an “incorrect” conversion from gb2312 into Unicode made me suspicious, and wouldn’t you know it, the culprit was an incorrect charset value. This Chinese page is tagged as being encoded in gb2312, but it seems to really be in GBK. What makes this slightly tricky is that GBK is a superset of gb2312, and in a text using everyday Chinese the differences between the two encodings would be minimal: very few characters would be in GBK that would not be in gb2312. So a scan to detect the actual encoding might not even have caught the mismatch.
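To make the problem concrete, here is a minimal Python sketch (not the code from the forum question, which was about Java). The helper `decode_declared` and the `SUPERSETS` table are hypothetical names made up for this illustration; only the standard `gb2312` and `gbk` codecs are real. It tries the declared charset first and falls back to a known superset only if strict decoding fails.

```python
from typing import Tuple

# Hypothetical lookup table for this sketch: a declared charset mapped to a
# superset encoding that is worth trying when strict decoding fails.
SUPERSETS = {"gb2312": "gbk"}


def decode_declared(data: bytes, declared: str) -> Tuple[str, str]:
    """Decode with the declared charset, falling back to a known superset.

    Returns the decoded text and the name of the codec that actually worked.
    """
    try:
        return data.decode(declared), declared
    except UnicodeDecodeError:
        fallback = SUPERSETS.get(declared.lower())
        if fallback is None:
            raise  # no superset known for this charset, so give up
        return data.decode(fallback), fallback


# Everyday text uses only gb2312 characters, so both codecs produce the
# same bytes and the wrong label goes unnoticed.
everyday = "你好，世界"
same_bytes = everyday.encode("gbk") == everyday.encode("gb2312")

# The character 镕 exists in GBK but not in gb2312, so a page that contains
# it cannot be decoded with the declared charset.
name = "朱镕基"
text, used = decode_declared(name.encode("gbk"), "gb2312")
```

Note that the fallback only triggers when the page happens to contain one of the GBK-only characters. For a page of everyday text the declared value "works", which is exactly why this kind of mistagging is so easy to miss.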

