UTF-1

UTF-1 was the first UCS transformation format for Unicode and ISO 10646. It was published in Annex G of the original 1993 edition of ISO 10646 but is no longer part of the standard. UTF-1 is compatible with ISO 2022.

ASCII characters and the C0 and C1 control characters are encoded unchanged (1:1), as in ISO 8859. All other characters are encoded as sequences of 2, 3, or 5 bytes using modulo-190 arithmetic, which is relatively expensive to compute. Bytes from the ASCII range can also appear inside these sequences. The drawback is that, for example, the byte for the slash (2F hex) can occur within such a sequence, so the encoding is unsuitable for file names.
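The slash problem can be illustrated with a minimal Python sketch. It covers only the two-byte range and follows the rule from the calculation table in the "Encoding" section; the function name `utf1_two_byte` is chosen here for illustration and does not come from the standard.

```python
def utf1_two_byte(cp: int) -> bytes:
    """Encode a code point in the range 0x100..0x4015 as two UTF-1 bytes."""
    if not 0x100 <= cp <= 0x4015:
        raise ValueError("outside the two-byte UTF-1 range")
    y = cp - 0x100
    digit = y % 190
    # Digits 0x00..0x5D map to 0x21..0x7E (printable ASCII),
    # digits 0x5E..0xBD map to 0xA0..0xFF.
    trail = digit + 0x21 if digit <= 0x5D else digit + 0x42
    return bytes([0xA1 + y // 190, trail])


encoded = utf1_two_byte(0x010E)  # U+010E LATIN CAPITAL LETTER D WITH CARON
print(encoded.hex(" "))          # a1 2f
print(b"/" in encoded)           # True
```

A program that splits a path at every 2F byte would therefore cut this character in half.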

Because of this drawback, another encoding for Unicode was later developed. It was initially called "UTF-FSS" ("file system safe") and is now in general use under the name UTF-8.

Encoding

UTF-1: encoding ranges
Code range (hex)	Encoding (hex)	Remarks
0…9F	0…9F	1:1 encoding of ASCII and the C0 and C1 control characters
A0…FF	A0 x	x is the original octet
100…4015	A1…F5 p	2-byte sequence
4016…38E2D	F6…FB p q	3-byte sequence
≥ 38E2E	FC…FF p q r s	5-byte sequence

To generate the byte sequences, the character code is represented as a base-190 number, and the "digits" of this representation are converted to bytes by a lookup function T. For compatibility with ISO 2022, the function produces only bytes in the ranges 21…7E hex and A0…FF hex:

UTF-1: function T(x)
x (hex)	Formula	T(x) (hex)	Remarks
00…5D	x + 21	21…7E	Only these values appear in the modulo-190 calculation.
5E…BD	x + 42	A0…FF	Only these values appear in the modulo-190 calculation.
BE…DE	x - BE	00…20	For completeness only; these values cannot occur in the modulo-190 arithmetic.
DF…FF	x - 60	7F…9F	For completeness only; these values cannot occur in the modulo-190 arithmetic.
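The table can be transcribed directly into code. The following Python sketch is one way to write it; the name `t` is an arbitrary choice, and the range check is an added assumption rather than part of the table.

```python
def t(x: int) -> int:
    """Map a value 0x00..0xFF to an octet according to the T(x) table."""
    if not 0 <= x <= 0xFF:
        raise ValueError("x must be in the range 0x00..0xFF")
    if x <= 0x5D:
        return x + 0x21   # 0x21..0x7E
    if x <= 0xBD:
        return x + 0x42   # 0xA0..0xFF
    if x <= 0xDE:
        return x - 0xBE   # 0x00..0x20, never reached by a base-190 digit
    return x - 0x60       # 0x7F..0x9F, never reached by a base-190 digit


# The four ranges together cover every octet exactly once.
print(sorted(t(x) for x in range(256)) == list(range(256)))  # True
```

Because a base-190 digit is always below BE hex, only the first two branches are exercised during encoding.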

The values of the individual bytes of the sequence are given in the following table. The modulo operation is written %, and / denotes integer division (the remainder is discarded). All numbers are hexadecimal.

UTF-1: calculation of the byte sequences
x (hex)	Auxiliary variable	Byte sequence
0…9F		x
A0…FF		A0, x
100…4015	y = x - 100	A1 + y / BE, T(y % BE)
4016…38E2D	y = x - 4016	F6 + y / BE², T((y / BE) % BE), T(y % BE)
≥ 38E2E	y = x - 38E2E	FC + y / BE⁴, T((y / BE³) % BE), T((y / BE²) % BE), T((y / BE) % BE), T(y % BE)
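The rows of this table translate into a short encoder. The Python sketch below is an illustration under stated assumptions: the names `utf1_encode` and `_t` are made up here, the input is limited to the 31-bit range 0…7FFFFFFF, and `_t` implements only the two branches of T that a base-190 digit can reach.

```python
def _t(digit: int) -> int:
    """T(x) restricted to base-190 digits (0x00..0xBD)."""
    return digit + 0x21 if digit <= 0x5D else digit + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode one code point as a UTF-1 byte sequence."""
    if not 0 <= cp <= 0x7FFFFFFF:  # assumed upper limit: 31-bit UCS-4
        raise ValueError("code point out of range")
    if cp < 0xA0:
        return bytes([cp])
    if cp < 0x100:
        return bytes([0xA0, cp])
    if cp < 0x4016:
        y = cp - 0x100
        return bytes([0xA1 + y // 190, _t(y % 190)])
    if cp < 0x38E2E:
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2,
                      _t((y // 190) % 190),
                      _t(y % 190)])
    y = cp - 0x38E2E
    return bytes([0xFC + y // 190**4,
                  _t((y // 190**3) % 190),
                  _t((y // 190**2) % 190),
                  _t((y // 190) % 190),
                  _t(y % 190)])


print(utf1_encode(0xFFFF).hex(" "))    # f7 65 af
print(utf1_encode(0x10FFFF).hex(" "))  # fc 21 39 6e 6c
```

The decimal constant 190 is BE hex; every division and modulo step corresponds to one column of the table.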

Encoding examples

The following table shows the encoding of some Unicode characters in UTF-8 and UTF-1.

Note: UCS and Unicode have since been restricted to the range up to U+10FFFF. This restriction did not exist when UTF-1 and UTF-8 were developed.

Code point	UTF-8	UTF-1	Remarks
U+007F	7F	7F
U+0080	C2 80	80
U+009F	C2 9F	9F
U+00A0	C2 A0	A0 A0
U+00BF	C2 BF	A0 BF
U+00C0	C3 80	A0 C0
U+00FF	C3 BF	A0 FF
U+0100	C4 80	A1 21	The 2nd octet in UTF-1 is in the range of ASCII codes.
U+015D	C5 9D	A1 7E	The 2nd octet in UTF-1 is in the range of ASCII codes.
U+015E	C5 9E	A1 A0
U+01BD	C6 BD	A1 FF
U+01BE	C6 BE	A2 21
U+07FF	DF BF	AA 72	Largest code point that UTF-8 can encode in 2 bytes
U+0800	E0 A0 80	AA 73
U+0FFF	E0 BF BF	B5 48
U+1000	E1 80 80	B5 49
U+4015	E4 80 95	F5 FF	Largest code point that UTF-1 can encode in 2 bytes
U+4016	E4 80 96	F6 21 21
U+FFFF	EF BF BF	F7 65 AF
U+10000	F0 90 80 80	F7 65 B0
U+38E2D	F0 B8 B8 AD	FB FF FF	Largest code point that UTF-1 can encode in 3 bytes
U+38E2E	F0 B8 B8 AE	FC 21 21 21 21	From here on, UTF-1 requires 5 bytes and is therefore less efficient than UTF-8
U+FFFFF	F3 BF BF BF	FC 21 37 B2 7A
U+100000	F4 80 80 80	FC 21 37 B2 7B
U+10FFFF	F4 8F BF BF	FC 21 39 6E 6C	Largest code point that is allowed in Unicode today
U+7FFFFFFF	FD BF BF BF BF BF	FD BD 2B B9 40
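The UTF-1 column can also be read in the opposite direction. The following Python sketch decodes a single byte sequence by inverting the calculation rules given above. It is derived here from those tables and is not taken from the standard; the names `utf1_decode_one` and `_u` and the error handling are assumptions made for illustration.

```python
def _u(octet: int) -> int:
    """Inverse of T for the octets that can occur as trailing bytes."""
    if 0x21 <= octet <= 0x7E:
        return octet - 0x21
    if 0xA0 <= octet <= 0xFF:
        return octet - 0x42
    raise ValueError(f"invalid UTF-1 trailing byte {octet:#04x}")


def utf1_decode_one(data: bytes) -> int:
    """Decode exactly one complete UTF-1 sequence into a code point."""
    if not data:
        raise ValueError("empty input")
    lead, trail = data[0], data[1:]
    if lead == 0xA0:
        # A0 x: x is the original octet in the range A0..FF.
        if len(trail) != 1 or trail[0] < 0xA0:
            raise ValueError("malformed A0 sequence")
        return trail[0]
    if lead < 0xA0:
        need, offset, acc = 0, 0, lead
    elif lead <= 0xF5:
        need, offset, acc = 1, 0x100, lead - 0xA1
    elif lead <= 0xFB:
        need, offset, acc = 2, 0x4016, lead - 0xF6
    else:
        need, offset, acc = 4, 0x38E2E, lead - 0xFC
    if len(trail) != need:
        raise ValueError("wrong sequence length")
    for octet in trail:
        acc = acc * 190 + _u(octet)  # rebuild the base-190 number
    return offset + acc


print(hex(utf1_decode_one(bytes.fromhex("f765af"))))      # 0xffff
print(hex(utf1_decode_one(bytes.fromhex("fdbd2bb940"))))  # 0x7fffffff
```

The lead byte alone determines the sequence length, so a decoder never has to look ahead to find where a character ends.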

Web links

http://www.czyborra.com/utf/
5 The universal charset (archived February 11, 2012, in the Internet Archive)
http://www.std.com/obi/Standards/Network/UTF/utf.c

References

http://kikaku.itscj.ipsj.or.jp/ISO-IR/178.pdf (archived March 18, 2015, in the Internet Archive)
