Daily Static

Technology, Dynamic Programming and Entrepreneurship

Tuesday, June 28, 2005
Python Unicode Woes

OK, so I love using Python at work. It's a much more fun and open language than C# or, even worse, Visual Basic (shivers). The one thing that causes pain, though, is working with international strings. The most frustrating aspect is strings that use characters in the 128-255 range. Python's default codec is ASCII, which is 7 bit, 0 - 127. If it has to convert a string containing a character at 128 or above, it pukes and asks for a codepage: it says the 'ascii' codec can't decode the byte. The irritating thing about this is that pretty much every other programming language I've used just deals with these types of strings. C#, VB, Lisp, Ruby, Perl: I've never run into these problems with any of them.
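A minimal sketch of the failure, assuming a current Python 3 interpreter (where the decode step is written out explicitly rather than happening behind your back). The helper name `decode_or_none` is hypothetical, made up for this illustration. A byte of 128 or above is rejected by the ASCII codec and accepted once a codepage is named:

```python
# Sketch only: assumes Python 3; the helper name is hypothetical.
raw = b"caf\xe9"  # "café" as Latin-1 bytes; 0xE9 is in the 128-255 range


def decode_or_none(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


print(decode_or_none(raw, "ascii"))    # ASCII only covers 0-127, so this fails
print(decode_or_none(raw, "latin-1"))  # naming the codepage makes it work
```

The point is only that the bytes themselves are fine; what is missing is the name of the encoding they were written in.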

If that wasn't enough, Python is constantly touting that it deals with Unicode, which is great, but the solution at the present time is lacking. There are some basic and well-known Byte Order Marks (BOMs) for Unicode files. For instance, if you run a Windows box (most of the world), your Unicode files will typically be encoded Little Endian, whereas files that come from big-endian machines (some *nix boxes, for example) will be encoded Big Endian. The byte order mark for Little Endian is FF FE, which means the first two bytes of the file will be those values. For Big Endian it is FE FF.
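To see those two marks on actual bytes, here is a small sketch, assuming Python 3.8 or later (for `bytes.hex` with a separator). It encodes the single letter A both ways and prepends the matching mark from the standard `codecs` module:

```python
import codecs

# Sketch only: assumes Python 3.8+ for bytes.hex() with a separator.
text = "A"  # U+0041

little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(little.hex(" "))  # ff fe 41 00
print(big.hex(" "))     # fe ff 00 41
```

Same character, same mark, and the only difference is the order of the two bytes in each pair.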

To complicate matters a bit, those byte order marks are for UTF-16 encoding. There are several other types of encoding (UTF-8, UTF-32, etc.). The thing is, though, the BOMs are specified, they are published, they are known. Why then does Python insist on making the programmer know the specific encoding when they open the file? I've had to write my own smart wrapper functions that read the different byte order marks and use the correct encoding in order to read the data appropriately.

Here are the Byte Order Marks:
UTF-8: EF BB BF
UTF-16 Big Endian: FE FF
UTF-16 Little Endian: FF FE
UTF-32 Big Endian: 00 00 FE FF
UTF-32 Little Endian: FF FE 00 00
UTF-7: 2B 2F 76 and one of the following byte sequences [ 38 | 39 | 2B | 2F | 38 2D ]
UTF-EBCDIC: DD 73 66 73
BOCU-1: FB EE 28
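
For illustration, a BOM-sniffing wrapper along these lines might look like the sketch below. It is not the wrapper described in this post. It assumes Python 3, and `sniff_bom` and `read_text` are hypothetical names rather than anything in the standard library. It handles only the UTF-8 (EF BB BF), UTF-16 and UTF-32 marks, and leaves out UTF-7, UTF-EBCDIC and BOCU-1 to keep things short:

```python
import codecs

# Sketch only: `sniff_bom` and `read_text` are hypothetical names, not a
# standard-library API. Longer marks come first because the UTF-32 Little
# Endian mark (FF FE 00 00) begins with the UTF-16 Little Endian mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]


def sniff_bom(head, default="utf-8"):
    """Return a codec name for the leading bytes, or `default` if no BOM."""
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return default


def read_text(path, default="utf-8"):
    """Read a whole file, choosing the codec from its byte order mark."""
    with open(path, "rb") as handle:
        data = handle.read()
    # The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM themselves.
    return data.decode(sniff_bom(data[:4], default))
```

One caveat with this approach: a UTF-16 Little Endian file whose first character is U+0000 starts with FF FE 00 00 and would be taken for UTF-32, so the guess is a heuristic, not a guarantee. A file with no mark at all falls back to whatever `default` you pass in.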

You can find more information at this link:

I don't expect a language to be perfect (although I would like it), but what I do expect is that when a solution to an obvious and recurring problem is apparent, it should be implemented in the language. To me and to many others, this has been a major source of frustration.
