--- utf-8.7~ 2011-02-27 18:26:48.000000000 +0100
+++ utf-8.7 2011-02-27 18:24:22.000000000 +0100
@@ -42,8 +42,10 @@
 parts of many 16-bit characters bytes like \(aq\\0\(aq or \(aq/\(aq
 which have a special meaning in filenames and other C library function
 arguments.
-In addition, the majority of UNIX tools expects ASCII files and can't
-read 16-bit words as characters without major modifications.
+In addition, the majority of UNIX tools expects
+.B ASCII
+files and can't read 16-bit words as characters without major
+modifications.
 For these reasons,
 .B UCS-2
 is not a suitable external encoding of
@@ -51,7 +53,9 @@
 in filenames, text files, environment variables, etc.
 The
 .BR "ISO 10646 Universal Character Set (UCS)" ,
-a superset of Unicode, occupies even a 31-bit code space and the obvious
+a superset of
+.BR Unicode ,
+occupies even a 31-bit code space and the obvious
 .B UCS-4
 encoding for it (a sequence of 32-bit words) has the same problems.
@@ -73,10 +77,13 @@
 .B UCS
 characters 0x00000000 to 0x0000007f (the classic
 .B US-ASCII
-characters) are encoded simply as bytes 0x00 to 0x7f (ASCII
+characters) are encoded simply as bytes 0x00 to 0x7f
+.RB ( ASCII
 compatibility).
 This means that files and strings which contain only
-7-bit ASCII characters have the same encoding under both
+7-bit
+.B ASCII
+characters have the same encoding under both
 .B ASCII
 and
 .BR UTF-8 .
@@ -85,7 +92,8 @@
 All
 .B UCS
 characters greater than 0x7f are encoded as a multibyte sequence
-consisting only of bytes in the range 0x80 to 0xfd, so no ASCII
+consisting only of bytes in the range 0x80 to 0xfd, so no
+.B ASCII
 byte can appear as part of another character and there are no
 problems with, for example, \(aq\\0\(aq or \(aq/\(aq.
 .TP
@@ -95,7 +103,9 @@
 strings is preserved.
 .TP
 *
-All possible 2^31 UCS codes can be encoded using
+All possible 2^31
+.B UCS
+codes can be encoded using
 .BR UTF-8 .
 .TP
 *
@@ -104,7 +114,8 @@
 encoding.
 .TP
 *
-The first byte of a multibyte sequence which represents a single non-ASCII
+The first byte of a multibyte sequence which represents a single non-
+.B ASCII
 .B UCS
 character is always in the range 0xc0 to 0xfd and indicates how long
 this multibyte sequence is.
@@ -119,12 +130,15 @@
 .B UCS
 characters may be up to six bytes long, however the
 .B Unicode
-standard specifies no characters above 0x10ffff, so Unicode characters
-can only be up to four bytes long in
+standard specifies no characters above 0x10ffff, so
+.B Unicode
+characters can only be up to four bytes long in
 .BR UTF-8 .
 .SS Encoding
 The following byte sequences are used to represent a character.
-The sequence to be used depends on the UCS code number of the character:
+The sequence to be used depends on the
+.B UCS
+code number of the character:
 .TP 0.4i
 0x00000000 \- 0x0000007F:
 .RI 0 xxxxxxx
@@ -168,15 +182,19 @@
 .PP
 The
 .B UCS
-code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and
-0xffff (UCS noncharacters) should not appear in conforming
+code values 0xd800\(en0xdfff
+.RB ( UTF-16
+surrogates) as well as 0xfffe and 0xffff
+.RB ( UCS
+noncharacters) should not appear in conforming
 .B UTF-8
 streams.
 .SS Example
 The
 .B Unicode
-character 0xa9 = 1010 1001 (the copyright sign) is encoded
-in UTF-8 as
+character 0xa9 = 1010 1001 (the copyright sign) is encoded in
+.B UTF-8
+as
 .PP
 .RS
 11000010 10101001 = 0xc2 0xa9
@@ -256,8 +274,12 @@
 ("\\x1b%G").
 The corresponding return sequence from
 .B UTF-8
-to ISO 2022 is ESC % @ ("\\x1b%@").
-Other ISO 2022 sequences (such as
+to
+.B ISO 2022
+is ESC % @ ("\\x1b%@").
+Other
+.B ISO 2022
+sequences (such as
 for switching the G0 and G1 sets) are not applicable in UTF-8 mode.
 .PP
 It can be hoped that in the foreseeable future,