Once upon a time…
Things were nice and simple: all characters were the same size, and that size was 8 bits. Strings were easy to handle, for the most part. Much of C’s string handling is rooted in this era, with functions like isalpha inexorably tied to the English language.
Of course, since then the world has changed.
Things got ugly: Multi-byte character sets
Now, these aren’t all bad; for most purposes, many can be treated like the one-byte-per-character sets that preceded them. They maintain the invariant that no byte of a multi-byte character can be mistaken for a single-byte character. This does, of course, make your text take more space, but it keeps legacy programs working.
Some encodings, however, are not so kind. A good example of this is Shift-JIS, which reuses the single-byte character range in the second byte of its multi-byte characters. This means that if you call strchr(str, 'e'), the match you get back might not actually be an ‘e’ at all; it may be the trail byte of an entirely different character.
Now things are getting really messy, because C doesn’t provide standard functions for dealing with these character sets. The other problem is that dealing with the thousands of character sets in existence is complex in its own right.
The solution: Unicode
Unicode, as you probably know, provides an incredible repertoire of characters, and is capable of representing pretty much every character in common (and uncommon) use.
Unfortunately, there is a great big problem.
C’s Unicode support sucks
C99 doesn’t actually define any official support for Unicode; it defines support for “Wide Characters”, which every implementation defines to be Unicode because this is the only sane option.
The problem is that the C standard requires the wide character set to be a fixed-width encoding – which in practice means UCS-4/UTF-32 – but almost nobody else uses UTF-32.
Worse, C’s wide character handling functions (quite rightfully) provide only a small set of features – and so everybody ignores them.
You see, everyone has their own opinion as to which Unicode encoding to use:
- Unix – UTF-8 for backwards compatibility with traditional APIs
- Windows – UTF-16 (Predates Unicode 2.0)*
- Java – UTF-16 (Predates Unicode 2.0)
- ICU – UTF-16
The end result is that the world’s most popular programming languages (C and C++) are left with Unicode support that is quite frankly useless.
C1X saves our bacon here, and C++0x follows suit: both add the types char16_t and char32_t, which map naturally onto UTF-16 and UTF-32. Both also allow character and string constants in those encodings, which will vastly simplify their use.
It will also make using libraries like ICU feel a lot less like stabbing oneself in the eye.
In fact, there are only two major issues I see with C1X’s Unicode support: first, no conversion to or from wchar_t is defined; second, probably as a consequence of the first, there is no support for outputting Unicode strings to the console. Perhaps with time this will change.
* Microsoft defines wchar_t to be 16 bits. This is definitely non-compliant today, but it wasn’t when they did it: at the time, 16 bits covered the entirety of Unicode. Changing it now, of course, would be impossible, as pretty much the entire Windows API is built on passing around wchar_t values.
[Correction 2010-07-07: Mistake in my recollection of C++0x removed - C++0x does in fact add char16_t/char32_t.]