Standard Compression Scheme for Unicode
The Standard Compression Scheme for Unicode (SCSU) is a character encoding for text consisting of Unicode characters. Unlike most other encodings, it is designed to require as little storage space as possible.
History
The encoding was originally developed by Reuters. The authors of the method, which is described in Unicode Technical Standard #6 (UTS #6), are Misha Wolf, Ken Whistler, Charles Wicksteed, Mark Davis, Asmus Freytag and Markus Scherer. It was first published in May 1997; the standard has remained unchanged at revision 4 since May 2005.
Idea
Traditional pre-Unicode character sets such as the ISO 8859 family required only one byte per character, and character sets for East Asian scripts required two. With Unicode, the storage requirement usually increases: UTF-32 uses four bytes per character, UTF-16 two or four, and UTF-8 between one and four.
Ordinary texts use only a very small part of the characters available in Unicode. Most characters used come either from the ASCII range (especially punctuation marks) or from a small contiguous range that often corresponds to a Unicode block.
The algorithm therefore uses a dynamically positioned window of 128 consecutive characters. Characters in this window are encoded as one byte in the range 0x80 to 0xFF, and characters in the ASCII range (except most control characters) as one byte in the range 0x20 to 0x7F. The remaining byte values serve as commands that reposition the window or switch to an uncompressed mode in which the following bytes are interpreted as UTF-16. This mode is particularly useful when the text draws on more characters than fit into a range of 128 consecutive characters, as in Chinese.
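The size differences described above can be checked with Python's built-in codecs. The following is only an illustration; the sample word is an arbitrary choice and not taken from the standard.

```python
# Compare how many bytes one short Russian word needs in several encodings.
text = "Википедия"  # 9 Cyrillic letters (sample chosen for illustration)

sizes = {
    "ISO 8859-5": len(text.encode("iso8859_5")),   # legacy one-byte charset
    "UTF-8": len(text.encode("utf-8")),
    "UTF-16": len(text.encode("utf-16-be")),       # "-be" avoids a byte order mark
    "UTF-32": len(text.encode("utf-32-be")),
}

for name, size in sizes.items():
    print(f"{name:10} {size:2d} bytes")
```

With SCSU the same word could be stored in 10 bytes: one command byte that selects a window covering the Cyrillic letters, followed by one byte per letter.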
Algorithm
The idea is implemented as follows. The standard defines only how a text of Unicode characters is obtained from an SCSU byte stream, that is, the decoding. An encoder may use any algorithm as long as its output can be decoded correctly. How such an algorithm is designed depends, among other things, on whether fast encoding or good compression matters more.
Windows
The algorithm distinguishes two types of windows: static windows, which are predefined and fixed, and dynamic windows, whose position can be changed as needed. There are eight of each type, numbered 0 to 7. The position of a window is given by the code point of its first character.
Static windows
The eight static windows are defined as follows:
Window number | Start | Characters contained |
---|---|---|
0 | U+0000 | Basic Latin |
1 | U+0080 | Latin-1 Supplement |
2 | U+0100 | Latin Extended-A |
3 | U+0300 | Combining Diacritical Marks |
4 | U+2000 | General Punctuation and superscripts |
5 | U+2080 | Subscripts, Currency Symbols, and Combining Diacritical Marks for Symbols |
6 | U+2100 | Letterlike Symbols and Number Forms |
7 | U+3000 | CJK Symbols and Punctuation |
Dynamic windows
The starting positions of the eight dynamic windows are as follows:
Window number | Start | Characters contained |
---|---|---|
0 | U+0080 | Latin-1 Supplement |
1 | U+00C0 | Parts of Latin-1 Supplement and Latin Extended-A |
2 | U+0400 | Cyrillic |
3 | U+0600 | Arabic |
4 | U+0900 | Devanagari |
5 | U+3040 | Hiragana |
6 | U+30A0 | Katakana |
7 | U+FF00 | Fullwidth forms |
Dynamic window 0 is active initially.
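The two tables can be written down as plain lists. The sketch below uses them to find which predefined windows contain a given character; the names `STATIC_WINDOWS`, `DYNAMIC_DEFAULTS` and `windows_containing` are illustrative and not part of the standard.

```python
# Start code points of the predefined windows, indexed by window number.
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_DEFAULTS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
WINDOW_SIZE = 0x80  # every window covers 128 consecutive code points


def windows_containing(ch):
    """Return (kind, number) for every predefined window that contains ch."""
    cp = ord(ch)
    hits = []
    for kind, starts in (("static", STATIC_WINDOWS), ("dynamic", DYNAMIC_DEFAULTS)):
        for number, start in enumerate(starts):
            if start <= cp < start + WINDOW_SIZE:
                hits.append((kind, number))
    return hits


print(windows_containing("–"))   # en dash, U+2013
print(windows_containing("ウ"))  # katakana U, U+30A6
print(windows_containing("財"))  # CJK ideograph, U+8CA1
```

Windows may overlap, so a character can be reachable through more than one window, while most CJK ideographs are covered by none of the predefined windows.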
Several commands can change the position of a dynamic window. The two simple definition commands (SDn and UDn) specify the new position with a single byte, according to the following table:
Byte (hex) | Start | Note |
---|---|---|
00 | reserved | reserved for internal use |
01-67 | U+0080 to U+3380 | the byte is multiplied by 0x80 |
68-A7 | U+E000 to U+FF80 | the byte is multiplied by 0x80 and 0xAC00 is added |
A8-F8 | reserved | reserved for future use |
F9 | U+00C0 | Parts of Latin-1 Supplement and Latin Extended-A |
FA | U+0250 | IPA Extensions |
FB | U+0370 | Greek |
FC | U+0530 | Armenian |
FD | U+3040 | Hiragana |
FE | U+30A0 | Katakana |
FF | U+FF60 | Halfwidth katakana |
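The table translates directly into a small lookup function. This is a minimal sketch; the names `FIXED_STARTS` and `window_start` are chosen here for illustration.

```python
# Offset bytes F9-FF select fixed, hand-picked window positions.
FIXED_STARTS = {
    0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
    0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60,
}


def window_start(offset_byte):
    """Map the byte following SDn or UDn to the first code point of the window."""
    if 0x01 <= offset_byte <= 0x67:
        return offset_byte * 0x80             # U+0080 to U+3380
    if 0x68 <= offset_byte <= 0xA7:
        return offset_byte * 0x80 + 0xAC00    # U+E000 to U+FF80
    if offset_byte in FIXED_STARTS:
        return FIXED_STARTS[offset_byte]
    raise ValueError(f"reserved offset byte 0x{offset_byte:02X}")


print(f"U+{window_start(0xFB):04X}")  # Greek window
```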
The two extended commands (SDX and UDX) for window definition use two bytes. The top three bits give the number of the window. The remaining 13 bits are multiplied by 0x80 and 0x10000 is added; the result is the first character of the window, which therefore always lies outside the Basic Multilingual Plane.
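The bit layout of the two argument bytes can be sketched as follows. The sketch assumes, following UTS #6, that the 13-bit value counts steps of 0x80 code points above U+10000; the function name is illustrative.

```python
def decode_extended_window(hi, lo):
    """Split the two bytes after SDX/UDX into (window number, window start)."""
    window = hi >> 5                      # top three bits of the first byte
    index = ((hi & 0x1F) << 8) | lo       # remaining 13 bits
    start = 0x10000 + index * 0x80        # assumption: steps of 128 code points
    return window, start


window, start = decode_extended_window(0xA0, 0x06)
print(window, f"U+{start:04X}")
```

With 13 bits the largest reachable window starts at U+10FF80, which is the last 128-character range of the Unicode code space.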
Modes
The algorithm uses two different modes. It starts in one-byte mode, in which characters are encoded by a single byte. Byte values in the range 0x20 to 0x7F as well as 0x00 (NUL), 0x09 (horizontal tab), 0x0A (LF) and 0x0D (CR) are interpreted as characters in static window 0, and values in the range 0x80 to 0xFF as characters in the active dynamic window. All other bytes are interpreted as commands.
The other mode is a two-byte mode. With a few exceptions, all byte pairs are interpreted as UTF-16BE code units; only a few byte values represent commands.
Commands
In the one-byte mode, the following byte values represent commands:
Byte (hex) | Name | Meaning |
---|---|---|
01-08 | SQ0-SQ7 | changes the window for the following byte only: 0x00 to 0x7F are interpreted as characters in static window n, 0x80 to 0xFF as characters in dynamic window n |
0B | SDX | uses the following two bytes for the extended definition of a dynamic window; this window then becomes active |
0C | reserved | reserved for future use |
0E | SQU | interprets the following two bytes as a UTF-16 code unit |
0F | SCU | switches to two-byte mode |
10-17 | SC0-SC7 | makes dynamic window n the active window |
18-1F | SD0-SD7 | uses the following byte as a simple definition for dynamic window n; this window then becomes active |
A control character whose byte value is used as a command can be encoded with the SQ0 command.
In two-byte mode, the following byte values represent commands, provided they occur as the first byte of a potential byte pair:
Byte (hex) | Name | Meaning |
---|---|---|
E0-E7 | UC0-UC7 | switches to one-byte mode and activates dynamic window n |
E8-EF | UD0-UD7 | uses the following byte as a simple definition for dynamic window n, activates this window and switches to one-byte mode |
F0 | UQU | interprets the following two bytes as a UTF-16 code unit |
F1 | UDX | uses the following two bytes for the extended definition of a dynamic window, activates this window and switches to one-byte mode |
F2 | reserved | reserved for future use |
A character from the private-use area whose first byte coincides with a command byte can be encoded with the UQU command.
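Taken together, the windows, modes and commands described above are enough for a compact decoder. The following is a minimal sketch rather than a reference implementation: all names are illustrative, error handling is kept to a minimum, and the extended window definition assumes steps of 0x80 code points above U+10000 as in UTS #6.

```python
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
FIXED = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
         0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}


def window_start(b):
    """Window start for the offset byte that follows SDn or UDn."""
    if 0x01 <= b <= 0x67:
        return b * 0x80
    if 0x68 <= b <= 0xA7:
        return b * 0x80 + 0xAC00
    if b in FIXED:
        return FIXED[b]
    raise ValueError(f"reserved offset byte 0x{b:02X}")


def scsu_decode(data):
    dyn = list(DYNAMIC)   # current positions of the dynamic windows
    active = 0            # dynamic window 0 is active at the start
    single = True         # decoding starts in one-byte mode
    units = []            # UTF-16 code units collected so far
    it = iter(data)

    def emit(cp):
        if cp >= 0x10000:  # supplementary character: store a surrogate pair
            cp -= 0x10000
            units.extend((0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)))
        else:
            units.append(cp)

    def define_extended():
        hi, lo = next(it), next(it)
        dyn[hi >> 5] = 0x10000 + (((hi & 0x1F) << 8) | lo) * 0x80
        return hi >> 5

    for b in it:
        if single:
            if b >= 0x80:                                    # active window
                emit(dyn[active] + b - 0x80)
            elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):  # ASCII
                emit(b)
            elif 0x01 <= b <= 0x08:                          # SQn
                n, c = b - 0x01, next(it)
                emit(STATIC[n] + c if c < 0x80 else dyn[n] + c - 0x80)
            elif b == 0x0B:                                  # SDX
                active = define_extended()
            elif b == 0x0E:                                  # SQU
                units.append(next(it) << 8 | next(it))
            elif b == 0x0F:                                  # SCU
                single = False
            elif 0x10 <= b <= 0x17:                          # SCn
                active = b - 0x10
            elif 0x18 <= b <= 0x1F:                          # SDn
                active = b - 0x18
                dyn[active] = window_start(next(it))
            else:
                raise ValueError("reserved byte 0x0C")
        elif 0xE0 <= b <= 0xE7:                              # UCn
            active, single = b - 0xE0, True
        elif 0xE8 <= b <= 0xEF:                              # UDn
            active, single = b - 0xE8, True
            dyn[active] = window_start(next(it))
        elif b == 0xF0:                                      # UQU
            units.append(next(it) << 8 | next(it))
        elif b == 0xF1:                                      # UDX
            active, single = define_extended(), True
        elif b == 0xF2:
            raise ValueError("reserved byte 0xF2")
        else:                                                # plain UTF-16BE pair
            units.append(b << 8 | next(it))

    return b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")


# SD0 with offset byte FB (Greek), followed by eleven window bytes.
print(scsu_decode(bytes.fromhex("18 FB A2 C9 CA C9 D0 C1 BF C4 C5 C9 C1")))  # Βικιπαίδεια
```

The decoder collects UTF-16 code units instead of characters so that surrogate pairs arriving through two-byte mode, SQU or UQU are joined correctly at the end.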
Properties
The method has some deliberately chosen properties:
- Texts that consist exclusively of Latin-1 characters without control characters are encoded exactly as in ISO 8859-1.
- For texts without private-use characters, it is always possible to switch to two-byte mode at the cost of one additional byte, so that the storage requirement then matches that of UTF-16.
- Even in the worst case, the storage requirement is at most 1.5 times that of UTF-16.
- With optimal encoding, ordinary texts are stored more compactly than in UTF-8 or UTF-16. The size of the savings depends on the language: for English and French texts SCSU needs just as much space as UTF-8, whereas relative to UTF-8 the size drops to about 85% for Korean, 70% for Chinese, 55% for Greek, Russian, Arabic, Hebrew and Japanese, and as low as 40% for Hindi.
The following properties can be problematic in some applications:
- Null bytes can occur in the compressed byte stream, which is one reason why the encoding is not MIME-compatible. BOCU-1 can be used instead in such cases.
- The same text can be encoded in different ways.
- Texts that use few distinct characters spread over several disjoint ranges do not compress well. This is the case for Vietnamese, for example.
Possible encodings
Runs of characters from the ASCII range and from the predefined dynamic windows are encoded most efficiently in one-byte mode. If no suitable predefined window exists, a dynamic window that is no longer needed can be redefined. Apart from Chinese and Korean characters, most ranges can be covered by a dynamic window.
For runs of characters that are not confined to a small range, the encoder should switch to two-byte mode.
Isolated characters in a window that is not currently active can be encoded with the SQn command; isolated characters outside any possible window can be encoded with the SQU command.
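A deliberately simple encoder illustrates some of these choices. The sketch below is an assumption-laden illustration, not an optimal or complete encoder: it never leaves one-byte mode, only ever redefines dynamic window 0 with SD0, falls back to SQU for characters that no simple window can reach, and rejects supplementary characters. All names are illustrative.

```python
def window_offset_byte(cp):
    """Offset byte of a simple window containing cp, or None if there is none."""
    if 0x0080 <= cp < 0x3400:
        return cp >> 7                   # window starts at byte * 0x80
    if 0xE000 <= cp <= 0xFFFF:
        return (cp - 0xAC00) >> 7        # window starts at byte * 0x80 + 0xAC00
    return None


def scsu_encode_simple(text):
    out = bytearray()
    start = 0x0080                       # dynamic window 0 initially starts here
    for ch in text:
        cp = ord(ch)
        offset = window_offset_byte(cp)
        if 0x20 <= cp <= 0x7F or cp in (0x00, 0x09, 0x0A, 0x0D):
            out.append(cp)                               # ASCII passes through
        elif cp < 0x20:
            out += bytes((0x01, cp))                     # SQ0 quotes a control character
        elif start <= cp < start + 0x80:
            out.append(cp - start + 0x80)                # inside the active window
        elif offset is not None:
            start = offset * 0x80 + (0xAC00 if offset >= 0x68 else 0)
            out += bytes((0x18, offset, cp - start + 0x80))  # SD0, offset, character
        elif cp <= 0xFFFF:
            out += bytes((0x0E, cp >> 8, cp & 0xFF))     # SQU plus UTF-16 code unit
        else:
            raise ValueError("supplementary characters are outside this sketch")
    return bytes(out)


print(scsu_encode_simple("Βικιπαίδεια").hex(" "))  # 18 07 92 b9 ba b9 c0 b1 af b4 b5 b9 b1
```

The sketch places the window at U+0380 instead of using the predefined Greek position U+0370, so the result differs from other valid encodings of the same word while having the same length. A better encoder would use SQn for isolated characters and two-byte mode for long runs of ideographs.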
Examples
German
To encode the German text "Wikipedia – die freie Enzyklopädie" ("Wikipedia, the free encyclopedia", written with a typographic dash) with SCSU, the predefined windows suffice: only the dash and the ä lie outside the ASCII range. The ä is in the active dynamic window, and the dash is in static window 4. The result is the following hexadecimal byte sequence:
57 69 6B 69 70 65 64 69 61 20 05 13 20 64 69 65 20 66 72 65 69 65 20
(W i k i p e d i a, space, SQ4 –, space, d i e, space, f r e i e, space)
45 6E 7A 79 6B 6C 6F 70 E4 64 69 65
(E n z y k l o p ä d i e)
Except for the dash, the encoding is identical to ISO 8859-1.
Greek
All characters of the Greek word for Wikipedia, "Βικιπαίδεια", are in the Unicode block for Greek. The word can therefore be encoded by first positioning a dynamic window on this block and then encoding the letters relative to that window.
18 FB A2 C9 CA C9 D0 C1 BF C4 C5 C9 C1
(SD0 with offset byte FB, then Β ι κ ι π α ί δ ε ι α)
The encoding needs only two bytes more than ISO 8859-7, but its letter values are shifted by 0x20 relative to that character set.
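The byte values of this example can be reproduced with a few lines of arithmetic. The sketch relies on Python's built-in `iso8859_7` codec for the comparison; the variable names are illustrative.

```python
text = "Βικιπαίδεια"
WINDOW = 0x0370  # window start selected by offset byte FB

# SD0 (0x18), the offset byte, then one byte per letter relative to the window.
scsu = bytes([0x18, 0xFB]) + bytes(ord(c) - WINDOW + 0x80 for c in text)
iso = text.encode("iso8859_7")

print(scsu.hex(" ").upper())  # 18 FB A2 C9 CA C9 D0 C1 BF C4 C5 C9 C1
print(len(scsu) - len(iso))   # 2
print(all(s + 0x20 == i for s, i in zip(scsu[2:], iso)))  # True
```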
Japanese
The Japanese Wikipedia article on Wikipedia begins as follows:
"ウィキペディア（英: Wikipedia）は、ウィキメディア財団が運営するインターネット百科事典である。"
Several groups of characters are used:
- Latin letters and punctuation marks that are in the static window 0
- Katakana from the dynamic window 6
- occasional hiragana from the dynamic window 5
- CJK characters that are not in any possible window
- Full-width punctuation marks from the dynamic window 7
- CJK punctuation from the static window 7
The following tables show one of many possible encodings. Dynamic window 6 (katakana) is active most of the time. Individual characters from other ranges are encoded without a permanent window change. For longer runs of CJK characters, the encoder switches to two-byte mode and returns to one-byte mode only when a longer run of hiragana or katakana has to be encoded.
Bytes | 16 | 86 | 83 | 8D | BA | A7 | 83 | 82 | 08 88 | 0E 82 F1 | 3A | 20 | 57 | 69 | 6B | 69 | 70 | 65 | 64 | 69 | 61 | 08 89 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Character / command | SC6 | ウ | ィ | キ | ペ | デ | ィ | ア | SQ7 （ | SQU 英 | : | (space) | W | i | k | i | p | e | d | i | a | SQ7 ） |
Code point (U+) | | 30A6 | 30A3 | 30AD | 30DA | 30C7 | 30A3 | 30A2 | FF08 | 82F1 | 003A | 0020 | 0057 | 0069 | 006B | 0069 | 0070 | 0065 | 0064 | 0069 | 0061 | FF09 |
Bytes | 06 AF | 08 01 | 86 | 83 | 8D | C1 | A7 | 83 | 82 | 0F | 8C A1 | 56 E3 | 30 4C | 90 4B | 55 B6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Character / command | SQ5 は | SQ7 、 | ウ | ィ | キ | メ | デ | ィ | ア | SCU | 財 | 団 | が | 運 | 営 |
Code point (U+) | 306F | 3001 | 30A6 | 30A3 | 30AD | 30E1 | 30C7 | 30A3 | 30A2 | | 8CA1 | 56E3 | 304C | 904B | 55B6 |
Bytes | E5 | 99 | CB | 16 | 84 | D3 | 9F | DC | AD | A3 | A8 | 0F | 76 7E | 79 D1 | 4E 8B | 51 78 | E5 | A7 | 82 | CB | 08 02 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Character / command | UC5 | す | る | SC6 | イ | ン | タ | ー | ネ | ッ | ト | SCU | 百 | 科 | 事 | 典 | UC5 | で | あ | る | SQ7 。 |
Code point (U+) | | 3059 | 308B | | 30A4 | 30F3 | 30BF | 30FC | 30CD | 30C3 | 30C8 | | 767E | 79D1 | 4E8B | 5178 | | 3067 | 3042 | 308B | 3002 |
Use
SCSU never gained widespread adoption. Only a few programs use the encoding, among them Microsoft SQL Server and Symbian.
One of the main problems with the method is finding and running a good compression algorithm. Since saving computing time usually matters more than saving storage space, the effort of compressing with SCSU is not worthwhile for most applications compared with UTF-8 or UTF-16. In addition, the lack of support for SCSU in application software meant that it was hardly used, which in turn gave little incentive to add support. Because misinterpretation by programs that do not support SCSU can lead to unexpected behavior and even security problems, the use of SCSU in HTML5 is explicitly ruled out.
Sources
- Asmus Freytag et al.: Unicode Technical Standard #6: A Standard Compression Scheme for Unicode. (online)
- Doug Ewell: Unicode Technical Note #14: A Survey of Unicode Compression. (online)
References
- Asmus Freytag et al.: Unicode Technical Standard #6: A Standard Compression Scheme for Unicode. Revision 1.0
- Measured against "What is Unicode" in different languages, in: Markus W. Scherer, Mark Davis: Unicode Technical Note #6: BOCU-1. BOCU-1 performance
- Unicode Compression Implementation, accessed January 26, 2013.
- Forum Nokia Library: Compressed Unicode resource format (page no longer available), accessed January 26, 2013.
- HTML Standard: Character Encodings, accessed December 3, 2015.
Web links
- ICU User Guide: Compression