Binary Ordered Compression for Unicode

Binary Ordered Compression for Unicode (BOCU for short) is a family of encodings for texts made up of Unicode characters. It is designed to keep storage requirements as small as possible while preserving the binary order of the text. The best-known representative, BOCU-1, is also directly compatible with MIME. However, the scheme has not become established in practice.

History

BOCU was developed in 2001 by Mark Davis and Markus Scherer for the ICU project. BOCU-1 is an application of this principle and is described in UTN #6. However, there is no formal definition; the encoding is specified only by the code of a C program. The BOCU algorithm is patented in the United States.

Idea

The idea behind BOCU is that the code points of consecutive characters usually differ only slightly. If each character is encoded as its difference from the previous character, with small differences represented by one byte and larger differences by more bytes, storage space is saved. In fact, instead of the difference from the last character, BOCU uses the difference from a base character, which can be determined in various ways. For example, the middle character of the most recently used Unicode block can serve as the base character, which avoids long jumps from one end of the block to the other. It is also possible not to change the base character immediately when the block changes. For spaces or punctuation marks from the ASCII range that appear between characters from another block, this avoids a long jump back.
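The effect of the different base-character strategies can be illustrated with a minimal Python sketch. It only computes the differences and does not encode any bytes. The function names, the block size of 128 code points, the starting base of 0x40 and the `sticky_space` option are assumptions made for this illustration and are not taken from the BOCU definition.

```python
def deltas_from_previous(text, start=0x40):
    """Difference of each code point from the immediately preceding one."""
    prev = start
    deltas = []
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - prev)
        prev = cp
    return deltas


def deltas_from_block_middle(text, block_size=128, sticky_space=False):
    """Difference of each code point from the middle of the last used block.

    With sticky_space=True the base is not moved by a space, so the
    character after the space is still measured against the old block.
    """
    base = block_size // 2  # assumed start: middle of the first block
    deltas = []
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - base)
        if not (sticky_space and ch == " "):
            base = (cp // block_size) * block_size + block_size // 2
    return deltas


print(deltas_from_previous("мир"))                         # [1020, -4, 8]
print(deltas_from_block_middle("мир"))                     # [1020, -8, 0]
print(deltas_from_block_middle("а б"))                     # [1008, -1056, 1009]
print(deltas_from_block_middle("а б", sticky_space=True))  # [1008, -1056, -15]
```

In the last two calls, keeping the base in the Cyrillic block across the space turns the jump back from 1009 into -15, which is the kind of saving the paragraph above describes.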

In BOCU-1, the set of byte values used to encode the differences is restricted so that compatibility with MIME is guaranteed. Control characters and the space character are also encoded directly.

Properties

Due to their construction, BOCU and BOCU-1 have the following properties:

BOCU preserves the binary order. If a list of character strings is sorted in binary order by code points, the BOCU-encoded byte sequences are in the same order.
In contrast to SCSU, BOCU is deterministic: each text has exactly one encoding. However, the same character can be encoded differently in different places.
BOCU-1 is MIME compatible: the ASCII control characters NUL (0x00), LF (0x0A), CR (0x0D) and nine others are encoded as in ASCII, and these byte values are used only to encode these control characters.
BOCU-1 allows random access to a limited extent.
For typical texts, BOCU requires about as much storage space as traditional pre-Unicode character sets or SCSU.
BOCU-1 requires a maximum of 4 bytes per character.
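Some of these properties can be checked on a small model. The following Python sketch is a toy difference encoder, not BOCU-1: the byte layout (one byte for differences from -64 to 63, otherwise a lead byte plus three bytes), the lead byte values and the name `toy_encode` are assumptions chosen for simplicity. It only illustrates why a deterministic difference encoding with ordered lead bytes keeps the binary order, and why the same character can be encoded differently depending on its position.

```python
def toy_encode(text):
    """Toy order-preserving difference encoder (illustration only, NOT BOCU-1)."""
    base = 0x40  # assumed initial base character
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        delta = cp - base
        if -64 <= delta <= 63:
            # Small difference: one byte in the range 0x40..0xBF.
            out.append(0x80 + delta)
        elif delta > 0:
            # Large positive difference: lead byte above all single bytes.
            out.append(0xC0)
            out += delta.to_bytes(3, "big")
        else:
            # Large negative difference: lead byte below all single bytes;
            # adding 2**24 keeps more negative values smaller.
            out.append(0x3F)
            out += (delta + (1 << 24)).to_bytes(3, "big")
        # New base: middle of the 128-code-point block of this character.
        base = (cp & ~0x7F) + 0x40
    return bytes(out)


words = ["zebra", "Äpfel", "apple", "мир", "日本", "app"]

# Python compares str values by code point, so sorted(words) is the
# binary code point order; the encoded bytes sort the same way.
assert sorted(words) == sorted(words, key=toy_encode)

# The same character is encoded differently depending on the current base:
# the first "я" needs four bytes, the second only one.
print(toy_encode("яя").hex())  # c000040f8f
```

Because the base after each character depends only on the characters already seen, two texts with a common prefix are in the same state at the first differing character, and the larger code point yields the larger difference and therefore the larger byte sequence.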

A number of properties have a negative effect on practical usability:

Although the BOCU algorithm was expressly designed to be simpler than SCSU, it is significantly slower in practice.
BOCU-1 is not backward compatible with ASCII. Texts that consist exclusively of ASCII characters require the same storage space in BOCU-1, but are represented by different byte values. This is a particular problem when the character encoding is to be declared in the document itself, as in XML.
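The ASCII incompatibility can be made concrete with a small Python sketch. It covers only pure ASCII input and assumes two constants as they appear in the ICU reference code, an initial base character of 0x40 and a middle byte value of 0x90; these values and the helper name `bocu1_ascii_only` should be treated as assumptions and checked against UTN #6 before being relied on.

```python
BOCU1_ASCII_PREV = 0x40  # assumed base character for the ASCII block
BOCU1_MIDDLE = 0x90      # assumed byte value that represents a difference of 0


def bocu1_ascii_only(text):
    """Sketch of BOCU-1 output for pure ASCII text (hypothetical helper).

    Control characters and the space are copied unchanged; every other
    ASCII character becomes a single byte holding its difference from
    the base character, which stays at 0x40 within the ASCII block.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0x7F:
            raise ValueError("this sketch handles ASCII input only")
        if cp <= 0x20:
            out.append(cp)
        else:
            out.append(BOCU1_MIDDLE + (cp - BOCU1_ASCII_PREV))
    return bytes(out)


text = "<?xml version"
encoded = bocu1_ascii_only(text)

print(len(encoded) == len(text))        # True: same storage space
print(encoded == text.encode("ascii"))  # False: different byte values
print(encoded[:5].hex())                # 8c8fc8bdbc instead of 3c3f786d6c
```

Under these assumptions, a parser that expects the ASCII bytes of `<?xml` at the start of a document would not recognize them in the encoded form, which is the problem the paragraph above describes.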

Sources

Markus W. Scherer, Mark Davis: Unicode Technical Note #6: BOCU-1: MIME-compatible Unicode Compression.
Doug Ewell: Unicode Technical Note #14: A Survey of Unicode Compression.
Patent application US6737994: Binary-ordered compression for unicode. Filed May 13, 2002, published November 13, 2003. Applicant: IBM. Inventors: Mark Edward Davis, Markus Walter Scherer.

Web links

Mark Davis, Markus Scherer: BOCU: Binary-Ordered Compression for Unicode
ICU User Guide: Compression