Standard Compression Scheme for Unicode

from Wikipedia, the free encyclopedia

The Standard Compression Scheme for Unicode (SCSU) is a character encoding for Unicode text which, in contrast to most other encodings, is designed to require as little storage space as possible.

History

The encoding was originally developed by Reuters. The authors of the method, which is described in the technical standard UTS #6, are Misha Wolf, Ken Whistler, Charles Wicksteed, Mark Davis, Asmus Freytag and Markus Scherer. It was first published in May 1997; the standard has remained unchanged at revision 4 since May 2005.

Idea

Traditional pre-Unicode character sets such as the ISO 8859 character sets required only one byte per character; character sets for East Asian scripts required two bytes. With Unicode, the storage requirement usually increases: to four bytes per character with UTF-32, two or four bytes with UTF-16, and between one and four bytes with UTF-8.

Ordinary texts use only a very small part of all the characters available in Unicode. Most of the characters used lie either in the ASCII range (especially punctuation marks) or in one small contiguous range that often corresponds to a Unicode block. The algorithm therefore uses a dynamically positioned window that contains 128 consecutive characters. Characters in this window are encoded by a byte in the range 0x80 to 0xFF, characters in the ASCII range (with the exception of most control characters) by a byte in the range 0x20 to 0x7F. The remaining byte values serve as commands that reposition the window or switch to an uncompressed mode in which the following bytes are interpreted as UTF-16. This mode is particularly useful when the text draws many characters from a range larger than 128 consecutive characters, as in Chinese.
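The storage figures for the standard Unicode encoding forms can be checked with a few lines of Python. This is only an illustrative sketch: the function name `encoded_sizes` and the sample words are chosen here and do not come from the standard.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in UTF-8, UTF-16 and UTF-32.

    The big-endian codecs are used so that no byte order mark is counted.
    """
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-be")),
        "UTF-32": len(text.encode("utf-32-be")),
    }


# Sample words (assumed for illustration): Latin, Greek and Katakana.
samples = {
    "Latin": "Wikipedia",
    "Greek": "\u0392\u03b9\u03ba\u03b9\u03c0\u03b1\u03af\u03b4\u03b5\u03b9\u03b1",
    "Katakana": "\u30a6\u30a3\u30ad\u30da\u30c7\u30a3\u30a2",
}

for script, word in samples.items():
    print(script, len(word), "characters:", encoded_sizes(word))
```

For the Greek and Katakana words every character lies in one 128-character range, which is the situation the window mechanism is designed to exploit.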

Algorithm

This idea is implemented by the following procedure. The standard defines only how a text of Unicode characters is recovered from an SCSU byte stream. Different algorithms can be used for encoding, as long as their output decodes correctly. How such an encoder is designed depends, among other things, on whether fast encoding or good compression matters more.

Windows

The algorithm uses two types of windows: static windows, which are predefined and fixed, and dynamic windows, whose position can be changed as needed. There are eight of each type, numbered 0 to 7. The position of a window is given by the code point of its first character.

Static windows

The eight static windows are defined as follows:

Window number  Start   Contents
0              U+0000  Basic Latin
1              U+0080  Latin-1 Supplement
2              U+0100  Latin Extended-A
3              U+0300  Combining Diacritical Marks
4              U+2000  General Punctuation and superscripts
5              U+2080  Subscripts, Currency Symbols, and Combining Diacritical Marks for Symbols
6              U+2100  Letterlike Symbols and Number Forms
7              U+3000  CJK Symbols and Punctuation

Dynamic windows

The starting positions of the eight dynamic windows are as follows:

Window number  Start   Contents
0              U+0080  Latin-1 Supplement
1              U+00C0  Parts of Latin-1 Supplement and Latin Extended-A
2              U+0400  Cyrillic
3              U+0600  Arabic
4              U+0900  Devanagari
5              U+3040  Hiragana
6              U+30A0  Katakana
7              U+FF00  Fullwidth forms

Dynamic window 0 is active at the beginning.

Several commands change the position of a dynamic window. The two simple definition commands (SDn and UDn) take one byte that determines the new position of the window according to the following table:

Byte (hex)  Start          Note
00          reserved       reserved for internal use
01-67       U+0080-U+3380  the byte is multiplied by 0x80
68-A7       U+E000-U+FF80  the byte is multiplied by 0x80 and 0xAC00 is added
A8-F8       reserved       reserved for future use
F9          U+00C0         Parts of Latin-1 Supplement and Latin Extended-A
FA          U+0250         IPA Extensions
FB          U+0370         Greek
FC          U+0530         Armenian
FD          U+3040         Hiragana
FE          U+30A0         Katakana
FF          U+FF60         Halfwidth katakana
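The table translates directly into a small function. The following Python sketch is one possible way to write it; the names `SPECIAL_STARTS` and `window_start_from_byte` are assumptions made for this example.

```python
# Fixed window starts for the definition bytes F9 to FF.
SPECIAL_STARTS = {
    0xF9: 0x00C0,  # parts of Latin-1 Supplement and Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth katakana
}


def window_start_from_byte(b: int) -> int:
    """Map the argument byte of SDn/UDn to the start of the dynamic window."""
    if 0x01 <= b <= 0x67:
        return b * 0x80            # U+0080 .. U+3380
    if 0x68 <= b <= 0xA7:
        return b * 0x80 + 0xAC00   # U+E000 .. U+FF80
    if b in SPECIAL_STARTS:
        return SPECIAL_STARTS[b]
    raise ValueError(f"reserved window definition byte 0x{b:02X}")


print(f"U+{window_start_from_byte(0xFB):04X}")  # Greek window
```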

The two extended commands (SDX and UDX) for window definition use two bytes. The top three bits give the number of the window; the remaining 13 bits are multiplied by 0x80, 0x10000 is added, and the result is taken as the first character of the window. These commands therefore position windows outside the Basic Multilingual Plane.
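The bit layout of the extended definition can be sketched in Python as follows. Following UTS #6, the 13-bit value is scaled by the window granularity 0x80 before 0x10000 is added. The function name `parse_extended_definition` is a hypothetical choice for this illustration.

```python
def parse_extended_definition(high: int, low: int):
    """Split the two argument bytes of SDX/UDX into (window number, window start)."""
    value = (high << 8) | low          # 16-bit argument, big-endian
    window = value >> 13               # top three bits: window number 0..7
    index = value & 0x1FFF             # remaining 13 bits
    start = 0x10000 + index * 0x80     # windows start on multiples of 0x80
    return window, start


# Hypothetical example: window 2 positioned at U+10300.
window, start = parse_extended_definition(0x40, 0x06)
print(window, f"U+{start:04X}")
```

With all 13 bits set the window starts at U+10FF80, so its last character is U+10FFFF, the highest Unicode code point.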

Modes

The algorithm uses two different modes. It starts in one-byte mode, in which characters are encoded by a single byte. Byte values in the range 0x20 to 0x7F as well as 0x00 (NUL), 0x09 (horizontal tab), 0x0A (LF) and 0x0D (CR) are interpreted as characters in static window 0, values in the range 0x80 to 0xFF as characters in the active dynamic window. All other bytes are interpreted as commands.

The other mode is a two-byte mode. With a few exceptions, all byte pairs are interpreted as UTF-16BE-encoded characters; only a few byte values represent commands.

Commands

In one-byte mode, the following byte values represent commands:

Byte (hex)  Name     Meaning
01-08       SQ0-SQ7  changes the window for the following byte only: 0x00 to 0x7F is interpreted as a character in static window n, 0x80 to 0xFF as a character in dynamic window n
0B          SDX      uses the following two bytes for the extended definition of a dynamic window; this window is then active
0C          reserved reserved for future use
0E          SQU      interprets the following two bytes as a UTF-16 encoded character
0F          SCU      changes to two-byte mode
10-17       SC0-SC7  makes dynamic window n the active window
18-1F       SD0-SD7  uses the following byte as a simple definition for dynamic window n; this window is then active

If a control character is to be encoded whose byte value would otherwise be interpreted as a command, it can be quoted with the SQ0 command.

In two-byte mode, the following byte values represent commands, provided they appear in the first position of a possible byte pair:

Byte (hex)  Name     Meaning
E0-E7       UC0-UC7  changes to one-byte mode and activates dynamic window n
E8-EF       UD0-UD7  uses the following byte as a simple definition for dynamic window n, activates this window and changes to one-byte mode
F0          UQU      interprets the following two bytes as a UTF-16 encoded character
F1          UDX      uses the following two bytes for the extended definition of a dynamic window, activates this window and changes to one-byte mode
F2          reserved reserved for future use

If a character is to be encoded whose first byte would otherwise be interpreted as a command (this affects only characters from the Private Use Area), it can be quoted with the UQU command.
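The windows, modes and commands described so far can be combined into a compact decoder. The following Python code is a minimal sketch under stated assumptions, not a reference implementation: the name `decode_scsu` and the helper names are chosen here, the extended definitions scale their 13-bit value by 0x80 as in UTS #6, and truncated input is not handled gracefully.

```python
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
SPECIAL = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
           0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}


def simple_start(b: int) -> int:
    """Window start for the argument byte of SDn/UDn."""
    if 0x01 <= b <= 0x67:
        return b * 0x80
    if 0x68 <= b <= 0xA7:
        return b * 0x80 + 0xAC00
    if b in SPECIAL:
        return SPECIAL[b]
    raise ValueError(f"reserved window definition byte 0x{b:02X}")


def decode_scsu(data: bytes) -> str:
    dynamic = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
    active = 0                 # dynamic window 0 is active initially
    one_byte_mode = True       # decoding starts in one-byte mode
    utf16 = bytearray()        # output is collected as UTF-16BE
    i = 0

    def put(code_point: int) -> None:
        utf16.extend(chr(code_point).encode("utf-16-be"))

    def define_extended(high: int, low: int) -> int:
        value = (high << 8) | low
        dynamic[value >> 13] = 0x10000 + (value & 0x1FFF) * 0x80
        return value >> 13

    while i < len(data):
        b = data[i]
        i += 1
        if one_byte_mode:
            if b >= 0x80:                                   # active dynamic window
                put(dynamic[active] + b - 0x80)
            elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
                put(b)                                      # static window 0
            elif b <= 0x08:                                 # SQn: quote one byte
                n, q = b - 0x01, data[i]
                i += 1
                put(dynamic[n] + q - 0x80 if q >= 0x80 else STATIC[n] + q)
            elif b == 0x0B:                                 # SDX
                active = define_extended(data[i], data[i + 1])
                i += 2
            elif b == 0x0E:                                 # SQU
                utf16.extend(data[i:i + 2])
                i += 2
            elif b == 0x0F:                                 # SCU
                one_byte_mode = False
            elif 0x10 <= b <= 0x17:                         # SCn
                active = b - 0x10
            elif 0x18 <= b <= 0x1F:                         # SDn
                active = b - 0x18
                dynamic[active] = simple_start(data[i])
                i += 1
            else:                                           # 0x0C
                raise ValueError("reserved byte 0x0C")
        elif 0xE0 <= b <= 0xE7:                             # UCn
            active, one_byte_mode = b - 0xE0, True
        elif 0xE8 <= b <= 0xEF:                             # UDn
            active, one_byte_mode = b - 0xE8, True
            dynamic[active] = simple_start(data[i])
            i += 1
        elif b == 0xF0:                                     # UQU
            utf16.extend(data[i:i + 2])
            i += 2
        elif b == 0xF1:                                     # UDX
            active, one_byte_mode = define_extended(data[i], data[i + 1]), True
            i += 2
        elif b == 0xF2:
            raise ValueError("reserved byte 0xF2")
        else:                                               # plain UTF-16BE unit
            utf16.extend(data[i - 1:i + 1])
            i += 1
    return utf16.decode("utf-16-be")


# SD0 with argument FB (Greek window), followed by eleven window bytes.
print(decode_scsu(bytes.fromhex("18 FB A2 C9 CA C9 D0 C1 BF C4 C5 C9 C1")))
```

Collecting the output as UTF-16BE keeps surrogate pairs from two-byte mode intact; they are combined only by the final `decode` call.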

Properties

The scheme has several deliberately chosen properties:

  • Texts that consist exclusively of Latin-1 characters without control characters are left unchanged.
  • For texts without characters from the Private Use Area, a single additional byte suffices to switch to two-byte mode, so that the storage requirement then corresponds to that of UTF-16.
  • Even in the worst case, the storage requirement is at most 1.5 times that of UTF-16.
  • With optimal encoding, ordinary texts are stored more compactly than in UTF-8 or UTF-16. How large the savings are depends on the language: SCSU requires just as much space as UTF-8 for English and French texts, but the size relative to UTF-8 drops to 85% for Korean, to 70% for Chinese, to 55% for Greek, Russian, Arabic, Hebrew and Japanese, and even to 40% for Hindi.

The following properties can be problematic in some applications:

  • Zero bytes can occur in the compressed byte stream, which is one of the reasons why the encoding is not MIME-compatible. BOCU-1 can be used instead.
  • The same text can be encoded in different ways.
  • Texts that use few distinct characters spread over several non-contiguous ranges cannot be compressed well. This is the case for Vietnamese, for example.

Possible encodings

Sequences of characters from the ASCII range and from the predefined dynamic windows are encoded most efficiently in one-byte mode. If no suitable predefined window exists, a dynamic window that is not needed can be redefined. Apart from the Chinese and Korean characters, most ranges can be selected as dynamic windows.

For sequences of characters that do not fall within such small ranges, the encoder should switch to two-byte mode.

Individual characters from a window that is not currently active can be encoded with an SQn command; individual characters outside every possible window can be encoded with the SQU command.
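To illustrate the quoting commands, the following Python sketch shows a deliberately naive encoder. It stays in one-byte mode with dynamic window 0 active and never switches or redefines a window, so its output is valid but usually far from optimal. The names `encode_scsu_naive` and `quote` are hypothetical, and characters outside the Basic Multilingual Plane are outside the scope of this sketch.

```python
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
ACTIVE = 0  # this sketch never leaves the initial dynamic window


def quote(code_point: int) -> bytes:
    """Encode one character that is not directly representable."""
    for n, start in enumerate(DYNAMIC):
        if 0 <= code_point - start < 0x80:      # SQn + byte 0x80..0xFF
            return bytes([0x01 + n, 0x80 + code_point - start])
    for n, start in enumerate(STATIC):
        if 0 <= code_point - start < 0x80:      # SQn + byte 0x00..0x7F
            return bytes([0x01 + n, code_point - start])
    if code_point <= 0xFFFF:                    # SQU + UTF-16BE code unit
        return bytes([0x0E]) + code_point.to_bytes(2, "big")
    raise ValueError("supplementary characters are not covered by this sketch")


def encode_scsu_naive(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F or cp in (0x00, 0x09, 0x0A, 0x0D):
            out.append(cp)                              # passes through unchanged
        elif 0 <= cp - DYNAMIC[ACTIVE] < 0x80:
            out.append(0x80 + cp - DYNAMIC[ACTIVE])     # active dynamic window
        else:
            out.extend(quote(cp))                       # everything else is quoted
    return bytes(out)


# An en dash (static window 4) and an a-umlaut (active dynamic window).
print(encode_scsu_naive("a \u2013 \u00e4").hex(" ").upper())
```

A practical encoder would additionally look ahead, switch windows with SCn or SDn for longer runs, and fall back to two-byte mode for long ideographic passages.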

Examples

German

To encode the German text "Wikipedia – die freie Enzyklopädie" (with a typographic dash) in SCSU, the predefined windows suffice: only the dash and the ä lie outside the ASCII range. The ä is in the active dynamic window, the dash in static window 4. The result is the following hexadecimal byte sequence:

57 69 6B 69 70 65 64 69 61 20 05  13 20 64 69 65 20 66 72 65 69 65 20
W  i  k  i  p  e  d  i  a     SQ4 –     d  i  e     f  r  e  i  e
45 6E 7A 79 6B 6C 6F 70 E4 64 69 65
E  n  z  y  k  l  o  p  ä  d  i  e

Except for the dash, the encoding corresponds to ISO 8859-1.

Greek

All characters of the Greek word for Wikipedia, "Βικιπαίδεια", are in the Greek Unicode block. The word can therefore be encoded by first positioning a dynamic window over this block and then encoding the letters through that window.

18  FB A2 C9 CA C9 D0 C1 BF C4 C5 C9 C1
SD0    Β  ι  κ  ι  π  α  ί  δ  ε  ι  α

The encoding needs only two bytes more than ISO 8859-7, but the byte values are shifted by 0x20 relative to it.
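The relationship to ISO 8859-7 can be verified with Python's built-in `iso8859_7` codec. The sketch below is only an illustration; the function name `scsu_from_iso8859_7` is hypothetical, and the function is meaningful only for Greek letters that lie both in ISO 8859-7 and in the 128-character window starting at U+0370.

```python
def scsu_from_iso8859_7(word: str) -> bytes:
    """Build the SCSU bytes for a Greek word from its ISO 8859-7 encoding.

    SD0 (0x18) with argument 0xFB places dynamic window 0 at U+0370.
    A letter at U+0370 + x is then the byte 0x80 + x, while ISO 8859-7
    stores the same letter as 0xA0 + x, hence the shift by 0x20.
    """
    iso = word.encode("iso8859_7")
    return bytes([0x18, 0xFB]) + bytes(b - 0x20 for b in iso)


# The Greek word for Wikipedia, written with escapes to fix the exact code points.
word = "\u0392\u03b9\u03ba\u03b9\u03c0\u03b1\u03af\u03b4\u03b5\u03b9\u03b1"
print(scsu_from_iso8859_7(word).hex(" ").upper())
# 18 FB A2 C9 CA C9 D0 C1 BF C4 C5 C9 C1
```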

Japanese

The Japanese Wikipedia article on Wikipedia begins like this:

"ウィキペディア（英: Wikipedia）は、ウィキメディア財団が運営するインターネット百科事典である。"

- Wikipedia authors: "ウィキペディア" in the version of January 26, 2013

Different scripts are used:

  • Latin letters and punctuation marks that are in the static window 0
  • Katakana from the dynamic window 6
  • occasional hiragana from the dynamic window 5
  • CJK characters that are not in any possible window
  • Full-width punctuation marks from the dynamic window 7
  • CJK punctuation from the static window 7

The following shows one of the many possible encodings. Dynamic window 6 (Katakana) is active most of the time. Individual characters from other ranges are quoted without permanently changing the window. For longer sequences of CJK characters the encoder switches to two-byte mode; it switches back to one-byte mode only when longer sequences of hiragana or katakana have to be encoded.

Bytes (hex)                       Command  Characters      Code points (U+)
16                                SC6      -               -
86 83 8D BA A7 83 82              -        ウィキペディア  30A6 30A3 30AD 30DA 30C7 30A3 30A2
08 88                             SQ7      （              FF08
0E 82 F1                          SQU      英              82F1
3A 20 57 69 6B 69 70 65 64 69 61  -        : Wikipedia     003A 0020 0057 0069 006B 0069 0070 0065 0064 0069 0061
08 89                             SQ7      ）              FF09
06 AF                             SQ5      は              306F
08 01                             SQ7      、              3001
86 83 8D C1 A7 83 82              -        ウィキメディア  30A6 30A3 30AD 30E1 30C7 30A3 30A2
0F                                SCU      -               -
8C A1 56 E3 30 4C 90 4B 55 B6     -        財団が運営      8CA1 56E3 304C 904B 55B6
E5                                UC5      -               -
99 CB                             -        する            3059 308B
16                                SC6      -               -
84 D3 9F DC AD A3 A8              -        インターネット  30A4 30F3 30BF 30FC 30CD 30C3 30C8
0F                                SCU      -               -
76 7E 79 D1 4E 8B 51 78           -        百科事典        767E 79D1 4E8B 5178
E5                                UC5      -               -
A7 82 CB                          -        である          3067 3042 308B
08 02                             SQ7      。              3002

Use

In practice, SCSU never gained wide acceptance. Only a few programs use the encoding, among them Microsoft SQL Server and Symbian.

One of the main difficulties of the method is finding and running a good compression algorithm. Since saving computing time is usually more worthwhile than saving storage space, the effort of compressing with SCSU does not pay off for most applications compared with UTF-8 or UTF-16. In addition, because application programs lacked support for SCSU, it was hardly used, which in turn meant that support continued to be lacking. Since misinterpretation by programs that do not support SCSU can lead to unexpected behavior and even security problems, the use of SCSU in HTML5 is expressly prohibited.

Sources

  • Asmus Freytag et al.: Unicode Technical Standard #6: A Standard Compression Scheme for Unicode. (online)
  • Doug Ewell: Unicode Technical Note #14: A Survey of Unicode Compression. (online)

References

  1. Asmus Freytag et al.: Unicode Technical Standard #6: A Standard Compression Scheme for Unicode. Revision 1.0
  2. Measured against "What is Unicode" in different languages, in: Markus W. Scherer, Mark Davis: Unicode Technical Note #6: BOCU-1. BOCU-1 performance
  3. Unicode Compression Implementation, accessed January 26, 2013.
  4. Forum Nokia Library: Compressed Unicode resource format (page no longer available), accessed January 26, 2013.
  5. HTML Standard: Character Encodings, accessed December 3, 2015.
