Character encoding really is a never-ending story. I am a huge fan of Unicode and UTF-8, but it is still not the default everywhere. This blog entry is about the problems I encountered while getting Klatschbase to work correctly with UTF-8.
First of all, SBCL, my Common Lisp implementation of choice, supports UTF-8 out of the box. So, no problem here.
Hunchentoot, Edi Weitz’s wonderful HTTP server, supports UTF-8 but uses ISO-8859-1 by default. It is easy to change the default behaviour, but it is just as easy to overlook that you have to.
    (setf hunchentoot:*default-content-type* "text/html; charset=utf-8")
    (setf hunchentoot:*hunchentoot-default-external-format*
          (flexi-streams:make-external-format :utf-8))
Now at least the HTTP body content is encoded correctly. The next pitfall for me was the hashing of passwords. I use Ironclad for this, but Ironclad’s helper functions only support ASCII. I somehow even failed to notice this problem until I read Leslie Polzer’s blog and left a blatantly stupid comment there. My first solution used the encoding translation library Babel, which is easy to use. But I did not want to introduce another dependency, so I rewrote it to use Flexi Streams.
    (defun sha256 (str)
      (let* ((utf8 (flexi-streams:make-external-format :utf-8))
             (str* (flexi-streams:string-to-octets str :external-format utf8)))
        (ironclad:digest-sequence :sha256 str*)))
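To see why the encoding step matters, here is a minimal sketch of what happens to a non-ASCII character before hashing. It assumes Quicklisp is available for loading the libraries; `utf8-octets` and `sha256-hex` are hypothetical helper names for illustration and are not part of Klatschbase, Ironclad or Flexi Streams.

```lisp
;; Assumption: Quicklisp is installed; any other way of loading
;; flexi-streams and ironclad works just as well.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload '(:flexi-streams :ironclad) :silent t))

(defun utf8-octets (string)
  "Hypothetical helper: encode STRING as a vector of UTF-8 octets."
  (flexi-streams:string-to-octets string :external-format :utf-8))

(defun sha256-hex (string)
  "Hypothetical helper: SHA-256 of STRING's UTF-8 octets as a hex string."
  (ironclad:byte-array-to-hex-string
   (ironclad:digest-sequence :sha256 (utf8-octets string))))

;; U+00FC (u with umlaut) is a single character but two octets in UTF-8.
(utf8-octets (string (code-char #xFC)))   ; => #(195 188)

;; The digest is 32 octets, so the hex string has 64 characters.
(length (sha256-hex "abc"))               ; => 64
```

The hex string is only there to make the digest easy to print or store; the hashing itself is the same `digest-sequence` call as above.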
But still, the characters were not decoded correctly on the server side when sent via the HTTP Authorization header. The problem this time was in Hunchentoot’s authorization function. It uses cl-base64’s base64-string-to-string, which does not respect the encoding in use (it does not even seem to have a way to specify one). So I fixed this, again with Flexi Streams: decode the Base64 data to octets first, then convert the octets to a string using the configured external format.
    (defun authorization (&optional (request *request*))
      "Returns as two values the user and password \(if any) as encoded in
    the 'AUTHORIZATION' header.  Returns NIL if there is no such header."
      (let* ((authorization (header-in :authorization request))
             (start (and authorization
                         (> (length authorization) 5)
                         (string-equal "Basic" authorization :end2 5)
                         (scan "\\S" authorization :start 5))))
        (when start
          (let* ((auth-octets (base64:base64-string-to-usb8-array
                               (subseq authorization start)))
                 (auth (octets-to-string
                        auth-octets
                        :external-format *hunchentoot-default-external-format*)))
            (destructuring-bind (&optional user password)
                (split ":" auth)
              (values user password))))))
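The effect of the external format can be reproduced outside of Hunchentoot. The following is a rough sketch, assuming Quicklisp for loading and a source file that is read as UTF-8; `encode-basic-credentials` and `decode-basic-credentials` are hypothetical helpers made up for this illustration, and the client is assumed to send its credentials as UTF-8.

```lisp
;; Assumption: Quicklisp is installed; any other way of loading
;; cl-base64 and flexi-streams works just as well.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload '(:cl-base64 :flexi-streams) :silent t))

(defun encode-basic-credentials (user password)
  "Hypothetical helper: the Base64 payload a UTF-8 client would send."
  (cl-base64:usb8-array-to-base64-string
   (flexi-streams:string-to-octets (format nil "~A:~A" user password)
                                   :external-format :utf-8)))

(defun decode-basic-credentials (base64 external-format)
  "Hypothetical helper: Base64 -> octets -> string, using EXTERNAL-FORMAT."
  (flexi-streams:octets-to-string
   (cl-base64:base64-string-to-usb8-array base64)
   :external-format external-format))

(let ((payload (encode-basic-credentials "jürgen" "geheim")))
  (list
   ;; Decoded as UTF-8: 13 characters, the umlaut survives.
   (length (decode-basic-credentials payload :utf-8))
   ;; Decoded as ISO-8859-1: 14 characters, the umlaut turns into two.
   (length (decode-basic-credentials payload :iso-8859-1))))
```

Going through octets is the whole point: Base64 only transports bytes, so the step from bytes to characters has to name an encoding explicitly.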
It is always interesting how much trouble character encodings cause. I just hope that Unicode and UTF-8 become much more common. As a side note: a colleague recently stumbled over a problem with a version of sed that did not respect the UTF-8 setting of the system, which led to umlauts being interpreted as two characters.