Character Encoding --- a Never Ending Story

Character encoding really is a never ending story. I am a huge fan of Unicode and UTF-8, but it is still not the default everywhere. This blog entry is about the problems I encountered while making Klatschbase work correctly with UTF-8.

First of all, SBCL - my Common Lisp implementation of choice - supports UTF-8 out of the box. So, no problem here.

Hunchentoot, Edi Weitz's wonderful HTTP server, supports UTF-8, but uses ISO-8859-1 by default. It is easy to change the default behaviour, but it is also easy to overlook this.
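A sketch of the change, assuming a Hunchentoot version that exposes these special variables, looks roughly like this:

    ;; Sketch: switch Hunchentoot's defaults from ISO-8859-1 to UTF-8
    ;; (variable names assume a reasonably recent Hunchentoot).
    (setf hunchentoot:*hunchentoot-default-external-format*
          (flexi-streams:make-external-format :utf-8))
    (setf hunchentoot:*default-content-type* "text/html; charset=utf-8")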

Now at least the HTTP body content is handled correctly. But the next pitfall for me was the hashing of passwords. I use Ironclad for this, and Ironclad's helper functions only support ASCII. I somehow failed to notice this problem until reading Leslie Polzer's blog, where I made a blatantly stupid comment. I first wrote a solution to the problem using the encoding translation library Babel, which is easy to use. But I did not want to introduce another dependency, so I rewrote it to use Flexi Streams.
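The Flexi Streams variant boils down to converting the password to UTF-8 octets before handing it to Ironclad. The sketch below assumes SHA-256 as the digest, which is not necessarily what Klatschbase uses; the point is the explicit :utf-8 conversion:

    ;; Sketch: hash a password as UTF-8 octets instead of relying on
    ;; Ironclad's ASCII-only string helpers. The :sha256 digest is an
    ;; assumption for illustration.
    (defun hash-password (password)
      (ironclad:byte-array-to-hex-string
       (ironclad:digest-sequence
        :sha256
        (flexi-streams:string-to-octets password :external-format :utf-8))))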

Now with this working, I stumbled upon another interesting fact: the JavaScript function escape uses a non-standard approach to encode non-ASCII Unicode characters. Wikipedia's percent-encoding entry is enlightening. This is not so important here, because jQuery takes care of these details. Just for the record: I don't know why the encoding used by the escape function was rejected by the W3C, but I think it makes sense to reject it, because it uses a fixed size of 16 bits. Since Unicode already needs more than that and keeps growing, this is not sensible.
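To illustrate the difference in Lisp terms (the function below is just a throwaway sketch of mine): percent-encoding the UTF-8 octets of a string, as encodeURIComponent does, yields one %XX triplet per octet, whereas escape emits a single fixed 16-bit %uXXXX unit for a character like the euro sign:

    ;; Sketch: percent-encode a string's UTF-8 octets.
    ;; (percent-encode-utf-8 "€") => "%E2%82%AC", while JavaScript's
    ;; escape("€") returns "%u20AC".
    (defun percent-encode-utf-8 (string)
      (with-output-to-string (out)
        (loop for octet across (flexi-streams:string-to-octets
                                string :external-format :utf-8)
              do (format out "%~2,'0X" octet))))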

But still, the characters were not decoded correctly on the server side when sent via the HTTP auth header. The problem this time was in Hunchentoot's authorization function. It uses cl-base64's base64-string-to-string, which does not respect the encoding in use (it does not even seem to have a way to specify one). So I fixed this, again with Flexi Streams.
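The fix amounts to decoding the base64 data to octets first and only then converting the octets to a string with the right external format, instead of letting base64-string-to-string guess. A minimal sketch:

    ;; Sketch: decode a Basic auth payload as UTF-8 by going through
    ;; octets rather than cl-base64's base64-string-to-string.
    (defun decode-auth-header (base64-string)
      (flexi-streams:octets-to-string
       (cl-base64:base64-string-to-usb8-array base64-string)
       :external-format :utf-8))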

It is always interesting how much trouble character encodings cause. I just hope that Unicode and UTF-8 become much more common. Just as a side note: a colleague recently stumbled over a problem with a version of sed that did not respect the UTF-8 setting of the system, leading to umlauts being interpreted as two characters.