Character Encoding --- Part 2: HTTP Auth

My last blog entry was about the character encoding problems I had with Klatschbase. Of course, there has to be a follow-up, since encoding problems are so ubiquitous.

I made a change to Hunchentoot’s basic authorization method to allow UTF-8 login names and passwords. As Edi Weitz pointed out on the Hunchentoot mailing list, this change is not standards-compliant. Thinking about it, my assumption was quite naive: why should a header sent by the client care about the default encoding of the server? If the encoding could depend on anything, it would be the content encoding. But that applies to the body, and certainly not to a header field that may even appear before the content encoding is declared.

Anyway, Edi Weitz’s post pointed to the relevant parts of the RFCs involved. RFC 2617 defines Basic authentication itself, while RFC 2616 defines how words of TEXT must be encoded: in essence, non-ISO-8859-1 characters must use the encoding from RFC 2047.
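For concreteness, here is a minimal sketch of an RFC 2047 “B” encoded word in Python (the helper name encode_word is mine, not part of CL-RFC2047):

```python
import base64

def encode_word(text, charset="utf-8"):
    # RFC 2047 "B" encoding: =?charset?B?base64-data?=
    data = base64.b64encode(text.encode(charset)).decode("ascii")
    return "=?%s?B?%s?=" % (charset, data)

print(encode_word("Jürgen"))  # -> =?utf-8?B?SsO8cmdlbg==?=
```

Note the double encoding already visible here: in a Basic auth header, this encoded word would then be Base64-encoded a second time together with the colon and the password.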

The consequences are disturbing. RFC 2047 defines an encoding scheme that may itself use Base64, so you end up with a Base64-encoded string inside a Base64-encoded string. That is some overhead, but it is not the bad part. The big problem with RFC 2047, in my opinion, is the restriction that encoded words must be no longer than 75 characters. If you want to encode more, you must split the text across multiple encoded words. Combined with a variable-length encoding, this is messy: if you delegate the actual encoding of characters to octets to another library, your only option is to assume the worst case, i.e., that each character takes up the maximum number of bytes it can (4 for UTF-8 and UTF-16). In sum, this yields a lot of overhead.
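The worst-case arithmetic can be sketched like this (a Python illustration under the stated 4-bytes-per-character assumption; the function name encode_words is mine):

```python
import base64

def encode_words(text, charset="utf-8", limit=75):
    """Split text into RFC 2047 "B" encoded words of at most `limit`
    characters each, budgeting for the worst case of 4 bytes per
    character."""
    # Fixed overhead per word: "=?utf-8?B?" plus the trailing "?=".
    overhead = len("=?%s?B?" % charset) + len("?=")
    # Base64 turns 3 bytes into 4 characters; round the available
    # characters down to a multiple of 4 to get the byte budget.
    max_bytes = (limit - overhead) // 4 * 3
    # Worst case: every character needs 4 bytes.
    chars_per_word = max_bytes // 4
    words = []
    for i in range(0, len(text), chars_per_word):
        chunk = text[i:i + chars_per_word].encode(charset)
        words.append("=?%s?B?%s?=" % (
            charset, base64.b64encode(chunk).decode("ascii")))
    return words
```

Splitting by characters rather than bytes guarantees that no multi-byte sequence is ever cut in half, but since most characters take fewer than 4 bytes, each word is usually under-filled — which is exactly the overhead complained about above.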

And I somehow doubt that many HTTP clients and servers implement this correctly, or at all. Most HTTP software probably just assumes ISO-8859-1-encoded credentials. But with the rise of more and more REST APIs, this will become a real issue.

I have started an ASDF-installable library called CL-RFC2047 to handle encoding and decoding according to the RFC. It only works on strings, but you probably don’t want to handle data in such an encoding that does not easily fit into memory anyway. I also ported the encoding function to JavaScript, so I can use it in Klatschbase on the client side.

For the server side, I once again patched Hunchentoot’s authentication method:

(defun authorization (&optional (request *request*))
  "Returns as two values the user and password \(if any) as encoded in
the 'AUTHORIZATION' header.  Returns NIL if there is no such header."
  (let* ((authorization (header-in :authorization request))
         (start (and authorization
                     (> (length authorization) 5)
                     (string-equal "Basic" authorization :end2 5)
                     (scan "\\S" authorization :start 5))))
    (when start
      (destructuring-bind (&optional user password)
          (split ":" (base64:base64-string-to-string
                      (subseq authorization start)))
        (labels ((decode (str)
                   ;; Only apply RFC 2047 decoding when the value looks
                   ;; like an encoded word; otherwise pass it through.
                   (if (and (> (length str) 2)
                            (string= "=?" str :end2 2))
                       (cl-rfc2047:decode str)
                       str)))
          (values (decode user) (decode password)))))))
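For comparison, the same logic can be sketched in Python (the helper names are mine; email.header.decode_header from the standard library does the RFC 2047 part):

```python
import base64
from email.header import decode_header

def _decode(s):
    # Apply RFC 2047 decoding only when the value looks like an
    # encoded word; otherwise return it unchanged.
    if s.startswith("=?"):
        raw, charset = decode_header(s)[0]
        return raw.decode(charset or "ascii")
    return s

def basic_credentials(header_value):
    """Return (user, password) from a Basic Authorization header value,
    or None if the scheme is not Basic."""
    scheme, _, b64 = header_value.partition(" ")
    if scheme.lower() != "basic":
        return None
    # The outer Base64 layer; assume ISO-8859-1 for unencoded text.
    creds = base64.b64decode(b64).decode("iso-8859-1")
    user, _, password = creds.partition(":")
    return _decode(user), _decode(password)
```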

There is still a problem here: the RFC defines username and password as *TEXT, which means they can consist of multiple words, so encoded and unencoded text can be mixed. The method above simply assumes that the whole text (i.e., username or password) is either entirely unencoded or entirely encoded.

And by the way: another consequence is that the username must not contain a colon. I think that could have been avoided if username and password had been two separately Base64-encoded fields joined by a colon, instead of Base64-encoding the whole thing.
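A small illustration of why the colon is lost (plain Python, nothing library-specific):

```python
import base64

# Suppose the intended username is "a:b" and the password "secret".
encoded = base64.b64encode(b"a:b:secret").decode("ascii")

# After decoding, the server can only split at the first colon, so the
# username is truncated and the rest leaks into the password.
decoded = base64.b64decode(encoded).decode("iso-8859-1")
user, _, password = decoded.partition(":")
print(user, password)  # -> a b:secret
```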

I do not want to rant about the HTTP protocol here; it is a really great protocol. I just wanted to comment on a problem I ran into.