Some things are created with the best of intentions.
C’s wide character support, for example. It was introduced as an amendment to the ISO C90 standard, intended to add support for handling multilingual text to C. It added a new character type – “wide characters”, provided by the type wchar_t – which was intended to hold one unit per character, something the legacy char simply cannot guarantee.
Some systems ran with this. For example, when the Windows NT team were designing the Win32 API, they standardized on Unicode everywhere. All the system functions which dealt with strings took and returned Unicode strings. They acknowledged that nobody would port their legacy applications if they didn’t provide any backwards compatibility, so they also provided non-Unicode versions of all these functions, but internally they just called through to the Unicode versions.
Most others, however, took different paths. Unix, for example, moved to the newly introduced UTF-8 encoding, avoiding the need for a new set of APIs, though temporarily making it more difficult to work with the old legacy encodings alongside it. (Sites which run Unix machines with non-UTF-8 encodings can still be found today.)
Mac OS Classic introduced Unicode support at the UI layer in order to support multilingual text. Mac OS X extended this with Foundation’s NSString (inherited from NeXTStep) and CFStringRef being “encoding agnostic”: they expose their internal storage only if the application asks for the internally used encoding, and otherwise translate as needed on access. The underlying system, however, inherited the Unix fondness for UTF-8.
Two issues affect people trying to use these features:
- The size of wchar_t differs between common platforms. Win32’s API lock-in means that wchar_t is now forever stuck at 16 bits there.
- Conformance issues. Most Unixes migrated to UCS-4 when Unicode 2.0 added the supplementary planes, in order to conform to the C standard’s requirement that any character fit in a single wchar_t. Confined by this history, Windows applications will never be able to safely use the standard wide character functions.
What this has meant is that most people needing Unicode support have rolled their own solutions. The older of these went for UTF-16, because at the time Unicode was a 16-bit character set, and they stuck with that encoding for compatibility. The newer ones have tended towards UTF-8, on the basis that once you’re dealing with a variable-width encoding anyway, you may as well use the one which is more compact for most texts.
In spite of the best of intentions, the wide character routines remain unused. Outside of Windows, the narrow I/O routines satisfy the vast majority of users. On Windows, the wide I/O routines in both Microsoft’s and MinGW’s C runtime libraries are broken (when working with the console, at least) to such an extent that they’re useless.
Today, if you need to actually do string processing, and it must be done in C (or C++), you’re far better off turning to an external library to do the work.
Context: I’m in the process of implementing said routines.