BOMs in UTF-8? December 7, 2006

Posted by globalizer in Java, Unicode.

During the last few days there’s been a flurry of emails on the www-international@w3.org mailing list about whether to include a byte order mark or ‘BOM’ in UTF-8 encoded files. The emails were sparked by the following advice on the W3C DTD validator:

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.

This was viewed as “surprising” advice because

the use of a BOM with UTF-8 files is

a) standards compliant, to Unicode and to XML and to CSS
b) common practice
c) allows text editors to auto-detect the encoding of a plain text document.

Various responders have pointed out that while it may be standards compliant, it is still true that a number of user agents have historically had problems with BOMs in UTF-8, and that PHP seems to still have problems with them in some cases.

I would add that omitting the BOM from UTF-8 files is equally standards compliant – the BOM is entirely optional in UTF-8. So I fail to see how warning about potential real-life problems would be wrong.

If you work in a Java environment, there are very good reasons to stay away from BOMs in your UTF-8 files. Here’s one of them.
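To make the Java concern concrete, here is a minimal sketch of one well-known pitfall: Java's UTF-8 decoder does not strip a leading BOM, so it arrives in the decoded text as the character U+FEFF. The class and method names (`BomDemo`, `withBom`, `decode`, `stripBom`) and the sample `greeting=hello` line are made up for illustration, and this is not necessarily the same problem the linked example describes.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // UTF-8 encoding of U+FEFF, the byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Simulates what a BOM-writing editor saves: BOM bytes followed by the text.
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    // Plain UTF-8 decoding; the BOM is kept as a leading U+FEFF character.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical workaround: drop a single leading U+FEFF if present.
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        String decoded = decode(withBom("greeting=hello"));
        System.out.println(decoded.length());                         // 15, not 14
        System.out.println(decoded.startsWith("greeting"));           // false
        System.out.println(stripBom(decoded).startsWith("greeting")); // true
    }
}
```

Because the extra character is invisible in most editors, the first key or token read from such a file silently fails to match what the code expects, which is why the simplest fix is usually to keep the BOM out of the file in the first place.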

So for now, this remains my advice (in fact, an iron-clad rule) for developers on my projects: do not include BOMs in UTF-8 files. This also means that using Windows Notepad to save files is a no-no, but I would hope that they would be using more appropriate text editors for their development work in any case.



1. Property resource bundle encoding « Musings on software globalization - June 10, 2010

[…] If for some reason you really really like those fancy quotes (or maybe need to use a copyright sign), then you need to convert your source file to Unicode, and then only update the files using a Unicode-enabled editor – an editor that is not Notepad, by the way, or any other editor that automatically inserts a BOM in UTF-8 files. […]
