
A slippery Unicode slope? December 23, 2008

Posted by globalizer in Unicode.

The recent proposal to encode a whole bunch of so-called “emoji” images in Unicode has caused quite a brouhaha over the last few days on the otherwise fairly staid Unicode mailing list.

I have never delved deeply into the policies governing encoding decisions in the UTC, but I have to admit that the proposal to encode more than 600 emoji symbols does seem to be a giant step away from the encoding of plain text out onto a very slippery slope indeed. The WG2 “Principles and Procedures” document has this wording about “obscure or questionable usage symbols”:

Obscure or questionable usage symbols
The characters are part of a small or large collection that is not yet deciphered, or not completely
understood, or not well attested by substantial literature or the scholarly community. Or they are
symbols that are not normally used in in-line text, that are merely drawings, that are used only in
two-dimensional diagrams, or that may be composed (such as, a slash through a symbol to
indicate forbidden). Examples include Phaistos, Indus, Rongo-rongo, logos, pictures of cows,
circuit components, and weather chart symbols.

But hey, who hasn’t desperately needed to be able to refer to a “CAT FACE WITH RAISED EYEBROWS AND POUTING MOUTH”, a “BEATING HEART”, a “BOMB” or a “LOVE HOTEL” in plain text? I come across this kind of gaping hole in the available character repertoire all the time, so I fully understand why we need the emojis 🙂

On a slightly more serious note, the proposal does seem rather difficult to reconcile with the WG2 language, something that was noted on the Unicode mailing list and met with this response:

> N3452 specifically mentions “pictures of cows” and “stop sign” as examples of symbols that should not be encoded.  Naturally it is a bit of a surprise to see so much official and expert support behind the encoding of COW and TRAFFIC LIGHT.
Right. And as I wrote before, subject to change. Therefore, a future revision of this document is likely to use different examples. The Unicode Standard has contained language trying to define the scope. This language has had to be changed over time, because the understanding of what is and isn’t plain text has evolved. It’s still the case that one doesn’t need the catalog of street signs as Unicode, because nobody is using this full set to communicate in text. The STOP sign is a different matter – it’s becoming something that I can definitely imagine being used in interchange without literally being an encoding of a traffic sign.

As somebody else on the mailing list said: “Flexible principles are a good sign of instability.”

Based on the responses (or mostly lack of responses) from key people on the mailing list, I think that this proposal is a done deal, and that only minor adjustments with respect to names, etc. will be accepted. It apparently doesn’t hurt to have the clout of major Japanese wireless companies, plus Google, behind you when you make proposals like these, even though the mobile phone companies in question used User-Defined Characters in the Shift-JIS encoding (and different ones per company, to boot), which should entail no need for them to be encoded in Unicode. That tells me that the real mover behind this is probably Google, since they are the ones sucking up content from everywhere, and need to be able to store it in Unicode.

But if the concern is emojis “leaking” into databases (mentioned in a few messages) and polluting them, then I think a reasonable response to that would be: use transcoders which convert characters from private use areas into substitution characters before you stuff them into your databases. This will not make the databases searchable for emojis, true, but since when is it a major loss to be unable to search for images (which these basically are)? I may use smileys in messages, but do I think it essential to be able to search for them? Not really, no.
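To make that suggestion concrete, here is a minimal sketch of such a scrubbing step in Python. The function name `scrub_private_use` is made up for illustration, and the choice of U+FFFD REPLACEMENT CHARACTER as the substitute is my assumption; none of the mailing list messages prescribe a particular one. It also assumes the carrier-specific Shift-JIS user-defined characters have already been mapped into Unicode private use code points by whatever decoder sits in front of it.

```python
import unicodedata

# Assumption: U+FFFD is used as the substitution character.
REPLACEMENT = "\ufffd"


def scrub_private_use(text: str, substitute: str = REPLACEMENT) -> str:
    """Replace every private-use code point with `substitute`.

    Private-use code points have the Unicode general category "Co",
    which covers U+E000..U+F8FF as well as the private use ranges
    in planes 15 and 16.
    """
    return "".join(
        substitute if unicodedata.category(ch) == "Co" else ch
        for ch in text
    )


if __name__ == "__main__":
    # U+E63E is simply an arbitrary code point in the BMP private use area.
    message = "See you at 8 \ue63e"
    print(scrub_private_use(message))
```

Ordinary text, including legitimately encoded symbols, passes through untouched; only code points with no standardized meaning are replaced.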

Discussion about the proposal as such (is it a good idea to encode stuff like emojis at all) has in fact been discouraged:

>> What is needed most, at this juncture, is not further opinionizing about the value of these proposed characters, but the detailed work of sorting them into the standard. There are enough hard questions to be answered:
> So, in other words, the decision to encode the entire set has been made, and resistance is futile.
I’m not in the position to speak for the UTC, or even vote in the UTC. But, yes, I think “resistance” is not helpful.

So it is very difficult to see what type of “thingies” would not be candidates for encoding in the future, assuming you have the clout/energy to encode them in conveniently available space in an encoding like Shift-JIS, which has plenty of room for user-defined characters. (As best I can tell, the emoji proposal contains no information about actual usage, just the fact that the images are encoded by several major companies.)

And what’s the deal with the selection of flags in the current set of emojis? Japan, China, USA and a handful of other countries are represented, but where, may I ask, is Denmark? I can’t wait for the fun we will have when Taiwan’s flag comes up for encoding.

I have been (quietly) critical of previous proposals to encode what I considered frivolous characters like the interrobang (why would I need a new character to write the sequence ‘?!’?), but I have to admit that I agree with this sentiment from the Unicode list:

> You think features on Japanese cell phones are not subject to sudden
> swings of fashion?

Indeed. Considering how hard it is to get actually useful *writing* stuff encoded, I really feel sad about this.

Indeed. And how to justify the rejection of Klingon now?

Once you have stepped out onto the slippery slope it is going to be very difficult to get off, and the doomsayers who have been predicting that Unicode will run out of available code points to encode new characters will turn out to be right. Not because we have finally met intelligent aliens with new languages that require encoding, but because we have frittered them away encoding images. Sad.

The discussion continues on the mailing list, and my brief excerpts don’t do it justice. So, read the whole thing, as they say.



1. Wheee! « Musings on software globalization - June 8, 2010

[…] the emoji encoding discussion back in 2008? Well, Unicode 6.0 containing the new “characters” only just came out in […]

2. bebenajib - June 9, 2010

If you press me to say why I loved him, I can say no more than because he was he, and I was I.

