When Character Sets Attack
Unicode shares its character repertoire with ISO/IEC 10646 and springs from the ISO 646 heritage (of which ASCII is the American national variant), so it's not surprising that Unicode and ASCII get along at least middling well.
UTF-8, invented by Ken Thompson and Rob Pike of Bell Labs for Plan 9 (where a Unicode character is called a "rune"), is still a remarkably clever solution to the question "How do you shoehorn Unicode into an ASCII-based system without driving yourself mad with complexity?" Basically, UTF-8 is an encoding of Unicode (which has several other encodings, such as UTF-16, suitable for other uses) with two very important characteristics:
- Every ASCII character 'n', when converted to the equivalent Unicode character U+n and encoded as UTF-8, returns to its original encoding bit-for-bit.
- When any Unicode U+n, where n > 127 and thus not also an ASCII character, is encoded as UTF-8, every resulting octet is > 127 (that is, it has its high bit set), so none can be mistaken for an ASCII character.
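Both properties are easy to check by brute force. Here is a minimal sketch in Python, leaning on the built-in UTF-8 codec; the function names are mine, made up for illustration:

```python
def utf8_octets(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8."""
    return chr(code_point).encode("utf-8")


def ascii_round_trips() -> bool:
    # Property 1: each ASCII code n encodes to the single octet n.
    return all(utf8_octets(n) == bytes([n]) for n in range(128))


def non_ascii_is_high_bit_only() -> bool:
    # Property 2: every octet of every non-ASCII code point is > 127.
    # Surrogates (U+D800..U+DFFF) are skipped; they cannot be encoded.
    return all(
        octet > 127
        for n in range(128, 0x110000)
        if not 0xD800 <= n <= 0xDFFF
        for octet in utf8_octets(n)
    )


print(ascii_round_trips())           # True
print(non_ascii_is_high_bit_only())  # True

# A few concrete encodings: U+0041 -> 41, U+00E9 -> C3 A9, U+20AC -> E2 82 AC
for ch in "A\u00e9\u20ac":
    octets = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {octets}")
```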
Together, those two conspire to go a long way towards automatically converting an 8-bit-clean but ASCII-only system into a UTF-8 Unicode system. All ASCII files are also Unicode files. All ASCII filenames are also Unicode filenames. Most importantly for Unix and Unix-like systems, all APIs defined in terms of NUL-terminated ASCII strings can be made valid for both UTF-8 Unicode and ASCII strings simultaneously, requiring no changes in Unicode-naïve applications; most such APIs are automatically Unicode-supporting -- only the ones that need to do actual text processing need to know anything about UTF-8 and Unicode.
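To make the NUL-terminated point concrete, here is a rough sketch in Python that pretends to be a Unicode-naïve C program. The `c_strlen` helper and the path are hypothetical, invented for the illustration; the comparison with UTF-16 is my own addition to show what goes wrong with an encoding that lacks these properties.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C's strlen(): count the octets before the first NUL."""
    return buf.index(b"\x00")


name = "/home/zo\u00eb/\u65e5\u672c\u8a9e.txt"  # hypothetical path, illustration only

utf8_buf = name.encode("utf-8") + b"\x00"          # NUL-terminated, as C expects
utf16_buf = name.encode("utf-16-le") + b"\x00\x00"

# UTF-8 never produces a zero octet except for U+0000 itself,
# so strlen() sees the whole string.
print(c_strlen(utf8_buf) == len(name.encode("utf-8")))  # True

# UTF-16LE encodes '/' as 2F 00, so strlen() stops after one octet.
print(c_strlen(utf16_buf))  # 1

# Splitting on the ASCII '/' octet is safe too: no multi-octet
# sequence can contain 0x2F, so each piece is still valid UTF-8.
parts = utf8_buf[:-1].split(b"/")
print([p.decode("utf-8") for p in parts])
```

The naïve code never learns that it handled anything other than ASCII, which is the whole trick.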
All that's fine and well for Unix, but all the world's not ASCII, or even ISO-646. What about those poor IBM Mainframe EBCDIC users trying to transition to Unicode?
Well, there's UTF-EBCDIC (Unicode Technical Report #16).
Reading this, I can see that it works, and I can see that it would be an important advance to EBCDIC users, as important as UTF-8 is to the rest of us...and yet, and yet...once again, I'm very glad I'm not an IBM Mainframe guy.
The 64 control characters (U+0000 to U+001F, U+0080 to U+009F), the ASCII DELETE character (U+007F), the 95 ASCII graphic characters (including the SPACE character) (U+0020 to U+007E) are mapped respecting EBCDIC conventions, as defined in IBM Character Data Representation Architecture, CDRA, with one exception -- the pairing of EBCDIC Line Feed and New Line control characters are swapped from their CDRA default pairings to ISO/IEC 6429 Line Feed (U+000A) and Next Line (U+0085) control characters (to be in line with IBM OS/390 UNIX Services, or Open MVS practice and preference, stemming from the hard-coding of X'0A' as the New Line in most ASCII-C compilers.).
The map preserves the invariance for a set of 82 graphic characters (including SPACE) (known as the IBM Syntactic Graphic Character set), and maintains consistency with the IBM MVS Open Systems Code page (CPGID 1047) for the variant characters from within the ASCII repertoire.
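For anyone who has never met EBCDIC, it helps to see why UTF-8 can't simply be dropped onto such a system and why a separate mapping is needed at all. A small sketch in Python, assuming the built-in `cp037` codec as a stand-in EBCDIC code page (it is not the CPGID 1047 page named in the quote, but only plain letters are used here):

```python
text = "Hello"

ascii_octets = text.encode("ascii")
ebcdic_octets = text.encode("cp037")  # stand-in EBCDIC code page (assumption)

print(ascii_octets.hex())   # 48656c6c6f
print(ebcdic_octets.hex())  # c885939396

# Every EBCDIC letter lives above 127, exactly the range UTF-8
# reserves for the octets of multi-octet sequences.
print(all(b > 127 for b in ebcdic_octets))  # True

# So a UTF-8 decoder cannot make sense of plain EBCDIC text.
try:
    ebcdic_octets.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)  # False
```

The "high bit means non-ASCII" convention that makes UTF-8 so painless on Unix is already spoken for in EBCDIC, hence the extra layer of mapping described above.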
2003-04-15 18:37:00 | Computers::CharacterSets