UTF-1
UTF-1 was the first UCS transformation format for Unicode and ISO 10646. It was published in 1993 in Annex G of the original edition of ISO 10646 but is no longer part of the standard. UTF-1 is compatible with ISO 2022.
ASCII characters and the C0 and C1 control characters are encoded unchanged (1:1), as in ISO 8859. All other characters are encoded as sequences of 2, 3 or 5 bytes using modulo 190 arithmetic, which is relatively expensive to compute. Bytes in the ASCII range can also appear inside these multi-byte sequences. This has the disadvantage that, for example, the byte value of the slash (2F hex) can occur within such a sequence, so the encoding is unsuitable for file names on systems that treat the slash as a path separator.
Because of this disadvantage, another encoding for Unicode was later developed. It was initially called "FSS-UTF" ("file system safe") and is now generally accepted under the name UTF-8.
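The slash problem can be made concrete with a small Python sketch. It handles only the two-byte range and applies the rule given in the coding tables of this article; the function name `utf1_two_byte` is a hypothetical helper chosen for illustration.

```python
def utf1_two_byte(cp: int) -> bytes:
    """Encode a code point in 0x100..0x4015 as a two-byte UTF-1 sequence."""
    if not 0x100 <= cp <= 0x4015:
        raise ValueError("only the two-byte range is handled in this sketch")
    y = cp - 0x100
    lead = 0xA1 + y // 190          # lead byte A1..F5
    digit = y % 190                 # base-190 "digit"
    # Map the digit to 21..7E or A0..FF (the function T of the article).
    trail = digit + 0x21 if digit < 0x5E else digit + 0x42
    return bytes([lead, trail])


encoded = utf1_two_byte(0x010E)     # U+010E LATIN CAPITAL LETTER D WITH CARON
print(encoded.hex(" "))             # a1 2f
print(b"/" in encoded)              # True: the second byte is an ASCII slash
print(b"/" in "\u010e".encode("utf-8"))  # False: UTF-8 never reuses ASCII bytes
```

A program that splits path names at every 2F byte would therefore cut this character in half, which is the situation UTF-8 was designed to avoid.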
Coding
Code range (hex) | Encoding (hex) | Remarks |
---|---|---|
0…9F | 0…9F | 1:1 encoding of ASCII and the C0 and C1 control characters |
A0…FF | A0 x | x is the original octet |
100…4015 | A1…F5 p | 2-byte sequence |
4016…38E2D | F6…FB p q | 3-byte sequence |
≥ 38E2E | FC…FF p q r s | 5-byte sequence |
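As a rough illustration of the range table, the following Python sketch returns the number of bytes UTF-1 needs for a code point and prints it next to the UTF-8 length. The name `utf1_length` and the upper limit of 7FFFFFFF hex (the 31-bit UCS range in use when UTF-1 was defined) are assumptions made for this example.

```python
def utf1_length(cp: int) -> int:
    """Number of bytes UTF-1 uses for code point cp, per the range table."""
    if not 0 <= cp <= 0x7FFFFFFF:   # assumed 31-bit UCS limit
        raise ValueError("code point outside the 31-bit UCS range")
    if cp <= 0x9F:
        return 1                    # ASCII, C0 and C1: encoded as-is
    if cp <= 0x4015:
        return 2                    # A0..FF via lead byte A0, 100..4015 via A1..F5
    if cp <= 0x38E2D:
        return 3                    # lead byte F6..FB
    return 5                        # lead byte FC..FF


# The range limits follow from the base-190 digits:
# 85 lead bytes (A1..F5) times 190 trail values, then 6 lead bytes times 190**2.
assert 0x100 + 85 * 190 == 0x4016
assert 0x4016 + 6 * 190**2 == 0x38E2E

for cp in (0x7F, 0x80, 0x7FF, 0x800, 0x4016, 0x38E2E):
    utf8_len = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: UTF-1 {utf1_length(cp)} bytes, UTF-8 {utf8_len} bytes")
```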
To generate the byte sequences, the character code is written as a base-190 number, and the "digits" of this representation are converted into bytes by a lookup function T, so that only bytes in the ranges 21 hex … 7E hex and A0 hex … FF hex occur, which provides compatibility with ISO 2022:
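UTF-1: function T(x)

x | T(x) formula | T(x) result | Remarks |
---|---|---|---|
00…5D | x + 21 | 21…7E | Only these values appear in the modulo 190 calculation. |
5E…BD | x + 42 | A0…FF | Only these values appear in the modulo 190 calculation. |
BE…DE | x - BE | 00…20 | For completeness only. These values cannot occur in modulo 190 arithmetic. |
DF…FF | x - 60 | 7F…9F | For completeness only. These values cannot occur in modulo 190 arithmetic. |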
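The function T can be written down directly from its four ranges. The Python sketch below is one possible rendering; the name `T` mirrors the article, while the range check and error message are additions for this example. The two assertions confirm that the 190 possible digits only produce bytes in 21…7E and A0…FF, and that T as a whole is a permutation of the 256 byte values.

```python
def T(z: int) -> int:
    """Map a value 0..FF to a byte as defined by the UTF-1 function T."""
    if not 0 <= z <= 0xFF:
        raise ValueError("T is only defined for 0..FF")
    if z <= 0x5D:
        return z + 0x21     # 00..5D -> 21..7E
    if z <= 0xBD:
        return z + 0x42     # 5E..BD -> A0..FF
    if z <= 0xDE:
        return z - 0xBE     # BE..DE -> 00..20 (never reached by base-190 digits)
    return z - 0x60         # DF..FF -> 7F..9F (never reached by base-190 digits)


# Digits 0..189 avoid the C0, SPACE, DEL and C1 byte values entirely.
digit_bytes = [T(z) for z in range(190)]
assert all(0x21 <= b <= 0x7E or 0xA0 <= b <= 0xFF for b in digit_bytes)

# Over the full input range, T hits every byte value exactly once.
assert sorted(T(z) for z in range(256)) == list(range(256))
```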
The values of the individual bytes of a sequence are given by the following table. The modulo operation is denoted by %, and division is integer division (the remainder is discarded). All numbers are hexadecimal.
x (hex) | Auxiliary variable | Byte sequence |
---|---|---|
0…9F | | x |
A0…FF | | A0 x |
100…4015 | y = x - 100 | A1 + y / BE, T(y % BE) |
4016…38E2D | y = x - 4016 | F6 + y / BE², T((y / BE) % BE), T(y % BE) |
≥ 38E2E | y = x - 38E2E | FC + y / BE⁴, T((y / BE³) % BE), T((y / BE²) % BE), T((y / BE) % BE), T(y % BE) |
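Putting the table into code, a minimal encoder might look like the following Python sketch. The names `utf1_encode` and `_t` are hypothetical, and the upper limit of 7FFFFFFF hex is an assumption based on the 31-bit UCS range; the arithmetic itself follows the rows of the table above.

```python
def _t(z: int) -> int:
    """Function T restricted to base-190 digits (0..BD hex)."""
    return z + 0x21 if z < 0x5E else z + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode a single code point as a UTF-1 byte sequence."""
    if not 0 <= cp <= 0x7FFFFFFF:       # assumed 31-bit UCS limit
        raise ValueError("code point outside the 31-bit UCS range")
    if cp < 0xA0:
        return bytes([cp])              # ASCII, C0, C1 unchanged
    if cp < 0x100:
        return bytes([0xA0, cp])        # A0 followed by the original octet
    if cp < 0x4016:
        y = cp - 0x100
        return bytes([0xA1 + y // 0xBE, _t(y % 0xBE)])
    if cp < 0x38E2E:
        y = cp - 0x4016
        return bytes([0xF6 + y // 0xBE**2,
                      _t(y // 0xBE % 0xBE),
                      _t(y % 0xBE)])
    y = cp - 0x38E2E
    return bytes([0xFC + y // 0xBE**4,
                  _t(y // 0xBE**3 % 0xBE),
                  _t(y // 0xBE**2 % 0xBE),
                  _t(y // 0xBE % 0xBE),
                  _t(y % 0xBE)])


print(utf1_encode(0x07FF).hex(" "))     # aa 72
print(utf1_encode(0x10FFFF).hex(" "))   # fc 21 39 6e 6c
```

Each multi-byte branch performs several integer divisions and modulo operations per character, which illustrates why the scheme is considered computationally expensive compared with the shifts and masks of UTF-8.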
Coding examples
The following table shows the encoding of some Unicode characters in UTF-8 and UTF-1.
Note: UCS and Unicode have since been restricted to the range up to U+10FFFF. When UTF-1 and UTF-8 were developed, this restriction did not exist.
Codepoint | UTF-8 | UTF-1 | Remarks |
---|---|---|---|
U+007F | 7F | 7F | |
U+0080 | C2 80 | 80 | |
U+009F | C2 9F | 9F | |
U+00A0 | C2 A0 | A0 A0 | |
U+00BF | C2 BF | A0 BF | |
U+00C0 | C3 80 | A0 C0 | |
U+00FF | C3 BF | A0 FF | |
U+0100 | C4 80 | A1 21 | The 2nd octet in UTF-1 is in the range of ASCII codes. |
U+015D | C5 9D | A1 7E | |
U+015E | C5 9E | A1 A0 | |
U+01BD | C6 BD | A1 FF | |
U+01BE | C6 BE | A2 21 | |
U+07FF | DF BF | AA 72 | largest code point that UTF-8 can encode in 2 bytes |
U+0800 | E0 A0 80 | AA 73 | |
U+0FFF | E0 BF BF | B5 48 | |
U+1000 | E1 80 80 | B5 49 | |
U+4015 | E4 80 95 | F5 FF | largest code point that UTF-1 can encode in 2 bytes |
U+4016 | E4 80 96 | F6 21 21 | |
U+FFFF | EF BF BF | F7 65 AF | |
U+10000 | F0 90 80 80 | F7 65 B0 | |
U+38E2D | F0 B8 B8 AD | FB FF FF | largest code point that UTF-1 can encode in 3 bytes |
U+38E2E | F0 B8 B8 AE | FC 21 21 21 21 | from here on, UTF-1 requires 5 bytes and is therefore less efficient than UTF-8 |
U+FFFFF | F3 BF BF BF | FC 21 37 B2 7A | |
U+100000 | F4 80 80 80 | FC 21 37 B2 7B | |
U+10FFFF | F4 8F BF BF | FC 21 39 6E 6C | largest code point that is allowed in Unicode today |
U+7FFFFFFF | FD BF BF BF BF BF | FD BD 2B B9 40 | |
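The UTF-1 column of the table can be checked by running the encoding rules backwards. The article does not describe a decoder, so the following Python sketch is only an inversion of the rules given above; the names `utf1_decode_one` and `_u` are hypothetical, and the strict validity checks are assumptions rather than requirements taken from the standard.

```python
def _u(b: int) -> int:
    """Inverse of T for the byte values that base-190 digits can produce."""
    if 0x21 <= b <= 0x7E:
        return b - 0x21
    if 0xA0 <= b <= 0xFF:
        return b - 0x42
    raise ValueError(f"byte {b:02X} is not a valid UTF-1 trail byte")


def utf1_decode_one(seq: bytes) -> int:
    """Decode exactly one UTF-1 sequence to its code point."""
    if not seq:
        raise ValueError("empty input")
    lead, trail = seq[0], seq[1:]
    if lead < 0xA0:                      # single byte, encoded 1:1
        if trail:
            raise ValueError("unexpected trail bytes")
        return lead
    if lead == 0xA0:                     # A0 x, x is the original octet
        if len(trail) != 1 or trail[0] < 0xA0:
            raise ValueError("A0 must be followed by one octet A0..FF")
        return trail[0]
    if lead <= 0xF5:
        count, offset, high = 1, 0x100, lead - 0xA1
    elif lead <= 0xFB:
        count, offset, high = 2, 0x4016, lead - 0xF6
    else:
        count, offset, high = 4, 0x38E2E, lead - 0xFC
    if len(trail) != count:
        raise ValueError("wrong number of trail bytes")
    y = high
    for b in trail:                      # accumulate the base-190 digits
        y = y * 190 + _u(b)
    return offset + y


for hex_bytes in ("AA 72", "F7 65 AF", "FC 21 39 6E 6C"):
    cp = utf1_decode_one(bytes.fromhex(hex_bytes))
    print(f"{hex_bytes} -> U+{cp:04X}")
```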
Web links
- http://www.czyborra.com/utf/
- 5 The universal charset (archived February 11, 2012 at the Internet Archive)
- http://www.std.com/obi/Standards/Network/UTF/utf.c
References
- http://kikaku.itscj.ipsj.or.jp/ISO-IR/178.pdf (archived March 18, 2015 at the Internet Archive)