Standard Compression Scheme for Unicode

The Standard Compression Scheme for Unicode (SCSU) is a character encoding for text consisting of Unicode characters. In contrast to most other encodings, it is designed to require as little storage space as possible.

History

The encoding was originally developed by Reuters. The authors of the method, which is described in the technical standard UTS #6, are Misha Wolf, Ken Whistler, Charles Wicksteed, Mark Davis, Asmus Freytag and Markus Scherer. It was first published in May 1997; since May 2005 the standard has remained unchanged at revision 4.

Idea

Traditional pre-Unicode character sets such as the ISO 8859 family required only one byte per character; character sets for East Asian scripts required two bytes. With Unicode, the storage requirement usually increases: UTF-32 needs four bytes per character, UTF-16 two or four, and UTF-8 between one and four.

Ordinary texts use only a very small part of the characters available in Unicode. Most of the characters used lie either in the ASCII range (especially punctuation marks) or in a small contiguous range that often corresponds to a Unicode block. The algorithm therefore uses a dynamically positioned window of 128 consecutive characters. Characters in this window are encoded by a byte in the range 0x80 to 0xFF, characters in the ASCII range (with the exception of most control characters) by a byte in the range 0x20 to 0x7F. The remaining byte values serve as commands that reposition the window or switch to an uncompressed mode in which the following bytes are interpreted as UTF-16. This mode is particularly useful when the text draws on a large number of characters from a range wider than 128 characters, as in Chinese.
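The storage figures for the Unicode encoding forms can be checked with a few lines of Python. This is only an illustrative sketch; the helper name encoded_sizes and the sample words are chosen here and do not come from the standard.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes needed for text in three Unicode encoding forms."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-be" variants are used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }


if __name__ == "__main__":
    for sample in ("Enzyklopädie", "Βικιπαίδεια"):
        print(sample, encoded_sizes(sample))
```

A mostly ASCII word stays close to one byte per character in UTF-8, while a Greek word already needs two bytes per character in both UTF-8 and UTF-16. Text of the second kind is what the window mechanism is aimed at.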

Algorithm

This idea is implemented by the following procedure. The standard defines only decoding, that is, how a text of Unicode characters is obtained from an SCSU byte stream. For encoding, various algorithms can be used, as long as their result decodes correctly. How such an algorithm is designed depends, among other things, on whether fast encoding or good compression matters more.

Windows

The algorithm distinguishes two types of windows: static windows, which are predefined by the algorithm, and dynamic windows, whose position can be changed as needed. There are eight of each type, numbered from 0 to 7. Each window covers 128 consecutive characters, so its position is fully described by the code point of its first character.

Static windows

The eight static windows are defined as follows:

Window number	Start	Characters contained
0	U+0000	Basic Latin
1	U+0080	Latin-1 Supplement
2	U+0100	Latin Extended-A
3	U+0300	Combining Diacritical Marks
4	U+2000	General Punctuation and superscripts
5	U+2080	Subscripts, Currency Symbols, and Combining Diacritical Marks for Symbols
6	U+2100	Letterlike Symbols and Number Forms
7	U+3000	CJK Symbols and Punctuation

Dynamic windows

The starting positions of the eight dynamic windows are as follows:

Window number	Start	Characters contained
0	U+0080	Latin-1 Supplement
1	U+00C0	Parts of Latin-1 Supplement and Latin Extended-A
2	U+0400	Cyrillic
3	U+0600	Arabic
4	U+0900	Devanagari
5	U+3040	Hiragana
6	U+30A0	Katakana
7	U+FF00	Fullwidth forms

Dynamic window 0 is active at the beginning.

Various commands are available to change the position of a dynamic window. The two simple definition commands (SDn and UDn) determine the new position of the window from a single byte according to the following table:

Byte (hex)	Window start	Note
00	reserved	reserved for internal use
01-67	U+0080 to U+3380	the byte is multiplied by 0x80
68-A7	U+E000 to U+FF80	the byte is multiplied by 0x80 and 0xAC00 is added
A8-F8	reserved	reserved for future use
F9	U+00C0	Parts of Latin-1 Supplement and Latin Extended-A
FA	U+0250	IPA Extensions
FB	U+0370	Greek
FC	U+0530	Armenian
FD	U+3040	Hiragana
FE	U+30A0	Katakana
FF	U+FF60	Halfwidth Katakana
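The table translates directly into a small lookup function. The following is a sketch of that mapping; the names SPECIAL_STARTS and window_start are chosen here for illustration and are not taken from the standard.

```python
# Byte values F9 to FF select fixed window starts that are not multiples of 0x80.
SPECIAL_STARTS = {
    0xF9: 0x00C0,  # parts of Latin-1 Supplement and Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}


def window_start(b: int) -> int:
    """Map the argument byte of an SDn/UDn command to the first code point of the window."""
    if 0x01 <= b <= 0x67:
        return b * 0x80
    if 0x68 <= b <= 0xA7:
        return b * 0x80 + 0xAC00
    if b in SPECIAL_STARTS:
        return SPECIAL_STARTS[b]
    raise ValueError(f"reserved window definition byte 0x{b:02X}")
```

For example, window_start(0x08) gives 0x0400, the start of the Cyrillic block, and window_start(0xFB) gives 0x0370 for Greek.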

The two extended commands (SDX and UDX) for window definition use two bytes. The top three bits give the number of the window. The remaining 13 bits are multiplied by 0x80 and added to 0x10000; the result is the first character of the window.
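The bit layout of the extended definition can be sketched as follows. The sketch follows UTS #6, in which the 13-bit value counts in units of 0x80 (half-blocks of 128 characters) above U+10000. The function name and parameter names are chosen here for illustration.

```python
def extended_window(hbyte: int, lbyte: int) -> tuple[int, int]:
    """Split the two argument bytes of SDX/UDX into (window number, window start)."""
    window = hbyte >> 5                        # top three bits: window number 0..7
    units = ((hbyte & 0x1F) << 8) | lbyte      # remaining 13 bits
    return window, 0x10000 + units * 0x80      # windows start on multiples of 0x80


# Example: place dynamic window 1 at U+10300, which covers the Gothic block at U+10330.
window, start = extended_window(0x20, 0x06)
```

With this layout the highest possible window start is U+10FF80, so the last window ends exactly at U+10FFFF, the end of the Unicode code space.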

Modes

The algorithm uses two different modes. It starts in one-byte mode, in which characters are encoded by a single byte. Byte values in the range 0x20 to 0x7F as well as 0x00 (NUL), 0x09 (horizontal tab), 0x0A (LF) and 0x0D (CR) are interpreted as characters in static window 0, values in the range 0x80 to 0xFF as characters in the active dynamic window. All other bytes are interpreted as commands.

The other mode is a two-byte mode. With a few exceptions, all byte pairs are interpreted as UTF-16BE-encoded characters; only a few byte values represent commands.

Commands

In the one-byte mode, the following byte values represent commands:

Byte (hex)	Name	Meaning
01-08	SQ0-SQ7	changes the window for the following byte only: 0x00 to 0x7F are interpreted as characters in static window n, 0x80 to 0xFF as characters in dynamic window n
0B	SDX	uses the following two bytes for the extended definition of a dynamic window; this window then becomes active
0C	reserved	reserved for future use
0E	SQU	interprets the following two bytes as a UTF-16 encoded character
0F	SCU	switches to two-byte mode
10-17	SC0-SC7	makes dynamic window n the active window
18-1F	SD0-SD7	uses the following byte as a simple definition for dynamic window n; this window then becomes active

If a control character has to be encoded whose byte value is also a command, the SQ0 command can be used to quote it.

In two-byte mode, the following byte values represent commands, provided they appear where the first byte of a byte pair is expected:

Byte (hex)	Name	Meaning
E0-E7	UC0-UC7	switches to one-byte mode and activates dynamic window n
E8-EF	UD0-UD7	uses the following byte as a simple definition for dynamic window n, activates this window and switches to one-byte mode
F0	UQU	interprets the following two bytes as a UTF-16 encoded character
F1	UDX	uses the following two bytes for the extended definition of a dynamic window, activates this window and switches to one-byte mode
F2	reserved	reserved for future use

If a character has to be encoded whose first byte is also a command (this only affects characters from the private use area), the UQU command can be used to quote it.
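The windows, modes and commands described above are enough to write a decoder. The following is a minimal sketch rather than a reference implementation: the names are chosen here, truncated input is not handled gracefully, and the extended window definition follows UTS #6 (13 bits in units of 0x80 above U+10000).

```python
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
INITIAL_DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SPECIAL = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
           0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}


def simple_start(b: int) -> int:
    """Window start for the one-byte argument of SDn/UDn."""
    if 0x01 <= b <= 0x67:
        return b * 0x80
    if 0x68 <= b <= 0xA7:
        return b * 0x80 + 0xAC00
    if b in SPECIAL:
        return SPECIAL[b]
    raise ValueError(f"reserved window definition byte 0x{b:02X}")


def extended_start(hi: int, lo: int) -> tuple[int, int]:
    """(window number, window start) for the two-byte argument of SDX/UDX."""
    return hi >> 5, 0x10000 + (((hi & 0x1F) << 8) | lo) * 0x80


def scsu_decode(data: bytes) -> str:
    dynamic = list(INITIAL_DYNAMIC)
    active = 0        # dynamic window 0 is active at the start
    single = True     # decoding starts in one-byte mode
    values = []       # code points and UTF-16 code units, in stream order
    stream = iter(data)
    for b in stream:
        if single:
            if b >= 0x80:                                      # active dynamic window
                values.append(dynamic[active] + b - 0x80)
            elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):   # static window 0
                values.append(b)
            elif 0x01 <= b <= 0x08:                            # SQn: quote one character
                n, q = b - 0x01, next(stream)
                values.append(dynamic[n] + q - 0x80 if q >= 0x80 else STATIC[n] + q)
            elif b == 0x0B:                                    # SDX
                active, start = extended_start(next(stream), next(stream))
                dynamic[active] = start
            elif b == 0x0E:                                    # SQU
                values.append(next(stream) << 8 | next(stream))
            elif b == 0x0F:                                    # SCU
                single = False
            elif 0x10 <= b <= 0x17:                            # SCn
                active = b - 0x10
            elif 0x18 <= b <= 0x1F:                            # SDn
                active = b - 0x18
                dynamic[active] = simple_start(next(stream))
            else:
                raise ValueError(f"reserved byte 0x{b:02X} in one-byte mode")
        elif 0xE0 <= b <= 0xE7:                                # UCn
            active, single = b - 0xE0, True
        elif 0xE8 <= b <= 0xEF:                                # UDn
            active, single = b - 0xE8, True
            dynamic[active] = simple_start(next(stream))
        elif b == 0xF0:                                        # UQU
            values.append(next(stream) << 8 | next(stream))
        elif b == 0xF1:                                        # UDX
            active, start = extended_start(next(stream), next(stream))
            dynamic[active], single = start, True
        elif b == 0xF2:
            raise ValueError("reserved byte 0xF2 in two-byte mode")
        else:                                                  # plain UTF-16BE code unit
            values.append(b << 8 | next(stream))
    # Join surrogate halves that arrived as separate UTF-16 code units.
    text = "".join(map(chr, values))
    return text.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
```

For instance, scsu_decode(bytes.fromhex("18 FB A2 C9 CA C9 D0 C1 BF C4 C5 C9 C1")) first redefines dynamic window 0 to start at U+0370 and then reads eleven Greek letters from it.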

Properties

The method has some properties that were chosen deliberately:

Texts that consist exclusively of Latin-1 characters without control characters are left unchanged.
For texts without characters from the private use area, it is always possible to switch to two-byte mode at the cost of one additional byte, so that the storage requirement in this case corresponds to that of UTF-16.
Even in the worst case, the storage requirement is at most 1.5 times that of UTF-16.
With optimal encoding, ordinary texts are stored more compactly than in UTF-8 or UTF-16. How large the savings are depends on the language: while SCSU requires just as much space as UTF-8 for English and French texts, the size is reduced to 85% for Korean, 70% for Chinese, 55% for Greek, Russian, Arabic, Hebrew and Japanese, and even 40% for Hindi.

The following properties can be problematic in some applications:

Zero bytes can occur in the compressed byte stream, which is one of the reasons why the encoding is not MIME-compatible. BOCU-1 can be used instead in such cases.
The same text can be encoded in different ways.
Texts that use few distinct characters spread over several disjoint ranges cannot be compressed well. This is the case for Vietnamese, for example.

Possible encodings

Sequences of characters from the ASCII range and from the predefined dynamic windows are encoded most efficiently in one-byte mode. If there is no suitable predefined window, a dynamic window that is not needed can be redefined. Apart from the Chinese and Korean characters, most ranges can be covered by a dynamic window.

For sequences of characters that do not fall within a small range, two-byte mode should be used.

Individual characters in a window that is currently not active can be encoded with an SQn command; individual characters outside every possible window can be encoded with the SQU command.
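One very simple encoding strategy built from these rules can be sketched as follows. It is deliberately naive and only one of many valid encoders: it never redefines a window, never enters two-byte mode, and rejects characters outside the Basic Multilingual Plane. The function names are chosen here for illustration.

```python
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def find_window(starts, cp):
    """Return the number of the first window that contains cp, or None."""
    for n, start in enumerate(starts):
        if start <= cp < start + 0x80:
            return n
    return None


def scsu_encode_naive(text: str) -> bytes:
    out = bytearray()
    active = 0                                   # dynamic window 0 is active initially
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("this sketch only handles the Basic Multilingual Plane")
        if cp in (0x00, 0x09, 0x0A, 0x0D) or 0x20 <= cp <= 0x7F:
            out.append(cp)                       # passes through unchanged
        elif cp < 0x20:
            out += bytes([0x01, cp])             # SQ0 quotes other control characters
        elif DYNAMIC[active] <= cp < DYNAMIC[active] + 0x80:
            out.append(0x80 + cp - DYNAMIC[active])
        elif (n := find_window(DYNAMIC, cp)) is not None:
            active = n                           # SCn: switch to a predefined window
            out += bytes([0x10 + n, 0x80 + cp - DYNAMIC[n]])
        elif (n := find_window(STATIC, cp)) is not None:
            out += bytes([0x01 + n, cp - STATIC[n]])   # SQn: quote from a static window
        else:
            out += bytes([0x0E, cp >> 8, cp & 0xFF])   # SQU: quote as UTF-16
    return bytes(out)
```

A better encoder would look ahead, redefine unused dynamic windows with SDn, and switch to two-byte mode for long runs of CJK characters; this sketch spends three bytes on every such character.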

Examples

German

To encode the text "Wikipedia – die freie Enzyklopädie" ("Wikipedia, the free encyclopedia", with a typographic dash) with SCSU, the predefined windows are sufficient: only the dash and the ä lie outside the ASCII range. The ä is in the active dynamic window, the dash in static window 4. The result is the following hexadecimal byte sequence:

57 69 6B 69 70 65 64 69 61 20 05  13 20 64 69 65 20 66 72 65 69 65 20
W  i  k  i  p  e  d  i  a     SQ4 –     d  i  e     f  r  e  i  e

45 6E 7A 79 6B 6C 6F 70 E4 64 69 65
E  n  z  y  k  l  o  p  ä  d  i  e

Except for the dash, the encoding corresponds to ISO 8859-1.

Greek

All characters of "Βικιπαίδεια", the Greek word for Wikipedia, are in the Unicode block for Greek. The word can therefore be encoded by first placing a dynamic window over this block and then encoding the letters through that window.

18  FB A2 C9 CA C9 D0 C1 BF C4 C5 C9 C1
SD0    Β  ι  κ  ι  π  α  ί  δ  ε  ι  α

The encoding needs only two bytes more than ISO 8859-7, but the letter bytes are shifted by 0x20 relative to it.
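The relationship to ISO 8859-7 can be checked with Python's built-in codec of that name. This is a small sketch for this one word; the variable names are chosen here, and the window start 0x0370 is the one selected by the byte FB in the example above.

```python
import unicodedata

# Normalize so that the accented iota is the single code point U+03AF.
word = unicodedata.normalize("NFC", "Βικιπαίδεια")

GREEK_WINDOW = 0x0370                      # window start selected by SD0 with byte FB
scsu = bytes([0x18, 0xFB]) + bytes(0x80 + ord(c) - GREEK_WINDOW for c in word)

iso = word.encode("iso8859_7")             # one byte per letter
shifted = bytes(b - 0x20 for b in iso)     # ISO 8859-7 bytes moved down by 0x20

print(scsu.hex(" ").upper())
print(scsu[2:] == shifted, len(scsu) - len(iso))
```

The two extra bytes are the SD0 command and its argument; every following byte equals the ISO 8859-7 byte minus 0x20.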

Japanese

The Japanese Wikipedia article on Wikipedia begins like this:

"ウィキペディア（英: Wikipedia）は、ウィキメディア財団が運営するインターネット百科事典である。"

- Wikipedia authors: "ウィキペディア", version of January 26, 2013

Several scripts are used:

Latin letters and punctuation marks, which are in static window 0
Katakana from dynamic window 6
occasional hiragana from dynamic window 5
CJK characters, which are not in any possible window
Fullwidth punctuation marks from dynamic window 7
CJK punctuation from static window 7
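Which predefined windows contain a given character can be determined mechanically. The sketch below lists every static and dynamic window (at its initial position) that covers a character; the function name containing_windows is chosen here for illustration.

```python
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def containing_windows(ch: str) -> list[tuple[str, int]]:
    """Return all predefined windows, as (kind, number), that contain the character."""
    cp = ord(ch)
    hits = []
    for kind, starts in (("static", STATIC), ("dynamic", DYNAMIC)):
        for n, start in enumerate(starts):
            if start <= cp < start + 0x80:   # every window spans 128 code points
                hits.append((kind, n))
    return hits


if __name__ == "__main__":
    for ch in "Wウは英（、":
        print(f"U+{ord(ch):04X}", containing_windows(ch))
```

Because windows may overlap, some characters have more than one home: the first part of the katakana block, for example, is also covered by dynamic window 5, which starts at U+3040 and extends to U+30BF. An empty result, as for 英, means the character has to be quoted with SQU or encoded in two-byte mode.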

The following tables show one of many possible encodings. Most of the time, dynamic window 6 (Katakana) is used. Individual characters from other ranges are encoded without a permanent change of window. For longer sequences of CJK characters, the encoder switches to two-byte mode; it switches back to one-byte mode only when a longer sequence of hiragana or katakana has to be encoded.

Byte	16	86	83	8D	BA	A7	83	82	08	88	0E	82	F1	3A	20	57	69	6B	69	70	65	64	69	61	08	89
Character or command	SC6	ウ	ィ	キ	ペ	デ	ィ	ア	SQ7	（	SQU	英		:		W	i	k	i	p	e	d	i	a	SQ7	）
Code point (U+)		30A6	30A3	30AD	30DA	30C7	30A3	30A2		FF08		82F1		003A	0020	0057	0069	006B	0069	0070	0065	0064	0069	0061		FF09

Byte	06	AF	08	01	86	83	8D	C1	A7	83	82	0F	8C	A1	56	E3	30	4C	90	4B	55	B6
Character or command	SQ5	は	SQ7	、	ウ	ィ	キ	メ	デ	ィ	ア	SCU	財		団		が		運		営
Code point (U+)		306F		3001	30A6	30A3	30AD	30E1	30C7	30A3	30A2		8CA1		56E3		304C		904B		55B6

Byte	E5	99	CB	16	84	D3	9F	DC	AD	A3	A8	0F	76	7E	79	D1	4E	8B	51	78	E5	A7	82	CB	08	02
Character or command	UC5	す	る	SC6	イ	ン	タ	ー	ネ	ッ	ト	SCU	百		科		事		典		UC5	で	あ	る	SQ7	。
Code point (U+)		3059	308B		30A4	30F3	30BF	30FC	30CD	30C3	30C8		767E		79D1		4E8B		5178			3067	3042	308B		3002

Use

SCSU never gained acceptance in practice. Only a few programs use this encoding, among them Microsoft SQL Server and Symbian.

One of the main difficulties of the method is finding a good compression algorithm and executing it. Since saving computing time usually matters more than saving storage space, the effort of compressing with SCSU is not worthwhile for most applications compared with UTF-8 or UTF-16. In addition, the lack of SCSU support in application programs meant that SCSU was hardly used, which in turn meant that support for it was not added. Because misinterpretation by programs that do not support SCSU can lead to unexpected behavior and even security problems, the use of SCSU in HTML5 is explicitly ruled out.

Sources

Asmus Freytag et al.: Unicode Technical Standard #6: A Standard Compression Scheme for Unicode. (online)
Doug Ewell: Unicode Technical Note #14: A Survey of Unicode Compression. (online)

References

Asmus Freytag et al.: Unicode Technical Standard #6: A Standard Compression Scheme for Unicode. Revision 1.0
Measured against "What is Unicode" in different languages, in: Markus W. Scherer, Mark Davis: Unicode Technical Note #6: BOCU-1. BOCU-1 performance
Unicode Compression Implementation, accessed January 26, 2013.
Forum Nokia Library: Compressed Unicode resource format (page no longer available), accessed January 26, 2013.
HTML Standard: Character Encodings, accessed December 3, 2015.

Web links

ICU User Guide: Compression
