Control characters in Unicode

from Wikipedia, the free encyclopedia

Control characters in Unicode are not displayable characters themselves; instead, they affect the display and formatting of other characters. Because Unicode encodes a large number of writing systems, each with its own requirements for correct display, it is sometimes necessary to steer the display algorithms with such invisible control characters.

For example, control characters can influence the display of ligatures. Depending on whether a program forms ligatures automatically, it may be necessary to use certain control characters either to request that two letters be joined into a ligature or to prevent such a join.

General properties of control characters

Most control characters, with a few exceptions, are identified as such by their general category. The value Cc stands for general control characters and Cf for format control characters. Many control characters are also marked as default ignorable, which means that programs that cannot process them correctly should ignore them.
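The general category can be inspected with Python's standard unicodedata module. The following is a minimal sketch; the helper name control_kind is hypothetical and chosen only for illustration.

```python
import unicodedata

def control_kind(ch: str) -> str:
    """Classify a single character by its Unicode general category."""
    category = unicodedata.category(ch)
    if category == "Cc":
        return "control"   # general control character (C0/C1)
    if category == "Cf":
        return "format"    # format control character
    return "other"

# Tab (U+0009), zero width joiner (U+200D), word joiner (U+2060), letter A
for ch in ["\u0009", "\u200D", "\u2060", "A"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), control_kind(ch))
```

The unicodedata module does not expose the default ignorable property, so this sketch covers the general category only.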

C0 and C1 control character ranges

The characters U+0000 to U+001F (decimal 0–31) and U+007F (decimal 127) form the C0 ("C zero") range; the C1 range runs from U+0080 to U+009F (decimal 128–159). As a superset of ASCII and Latin-1, Unicode adopts the C0 and C1 control characters of these standards without assigning its own interpretation to them. Only some of these characters have a function defined in the Unicode Standard, in particular the characters for line breaks.
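The two ranges translate directly into range checks. A short sketch, with the hypothetical helper names is_c0 and is_c1, that also confirms these code points carry the general category Cc:

```python
import unicodedata

def is_c0(cp: int) -> bool:
    """True for U+0000..U+001F and U+007F."""
    return 0x0000 <= cp <= 0x001F or cp == 0x007F

def is_c1(cp: int) -> bool:
    """True for U+0080..U+009F."""
    return 0x0080 <= cp <= 0x009F

c0 = [cp for cp in range(0x100) if is_c0(cp)]
c1 = [cp for cp in range(0x100) if is_c1(cp)]
print(len(c0), len(c1))

# Every C0 and C1 code point has the general category Cc.
print(all(unicodedata.category(chr(cp)) == "Cc" for cp in c0 + c1))
```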

Line breaking

For line breaking and for dividing text into individual characters, words or sentences, Unicode defines the line breaking algorithm and a number of segmentation algorithms. In addition to the classic control characters that force the end of a line, there are control characters that tell these algorithms where in the text a break must not occur and where one is additionally permitted.

To prevent a break, the word joiner (U+2060) is normally used, unless a dedicated non-breaking variant of the character exists, as is the case for spaces. Before this control character was introduced in Unicode 3.2, the zero width no-break space (U+FEFF) was used for this purpose; today it serves mainly as the byte order mark.

Conversely, to allow a break, the zero width space (U+200B) or the soft hyphen (U+00AD) is used.

For the end of a line and of a paragraph, Unicode provides the characters line separator (U+2028) and paragraph separator (U+2029). In contrast to most other control characters, their general categories (Zl and Zp) classify them as separators, and they count as whitespace.
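The characters for line breaking can be handled like any other characters in a string. The sketch below uses the hypothetical helpers forbid_break and allow_breaks to insert U+2060 and U+200B; whether a renderer honors them depends on its line breaking implementation.

```python
import unicodedata

WORD_JOINER = "\u2060"          # forbids a break
ZERO_WIDTH_SPACE = "\u200B"     # allows a break
SOFT_HYPHEN = "\u00AD"          # allows a break, shown as a hyphen if used
LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"

def forbid_break(left: str, right: str) -> str:
    """Join two pieces of text so that no line break may occur between them."""
    return left + WORD_JOINER + right

def allow_breaks(text: str, after: str = "/") -> str:
    """Permit a line break after every occurrence of `after`."""
    return text.replace(after, after + ZERO_WIDTH_SPACE)

for ch in (WORD_JOINER, ZERO_WIDTH_SPACE, SOFT_HYPHEN,
           LINE_SEPARATOR, PARAGRAPH_SEPARATOR):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())

# Python's str.splitlines() treats U+2028 and U+2029 as line boundaries.
print(f"one{LINE_SEPARATOR}two{PARAGRAPH_SEPARATOR}three".splitlines())
```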

Script and ligatures

In some writing systems, such as Arabic, the characters within a word are joined to their neighbors, so a character can look different depending on its position. Two adjacent characters may also be rendered as a single ligature. To force or prevent the joining of two adjacent characters in such cases, the Unicode Standard defines control characters that influence the corresponding algorithms.

These are the zero width non-joiner (U+200C) and the zero width joiner (U+200D).
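In plain text, U+200C and U+200D are simply placed between the two characters concerned. A minimal sketch with the hypothetical helpers prevent_join and request_join follows; the visible effect depends on the font and the rendering engine.

```python
import unicodedata

ZWNJ = "\u200C"  # zero width non-joiner: prevents a join or ligature
ZWJ = "\u200D"   # zero width joiner: requests a join or ligature

def prevent_join(left: str, right: str) -> str:
    """Place U+200C between two characters to prevent joining."""
    return left + ZWNJ + right

def request_join(left: str, right: str) -> str:
    """Place U+200D between two characters to request joining."""
    return left + ZWJ + right

no_ligature = prevent_join("f", "i")
print([unicodedata.name(ch) for ch in no_ligature])
```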

Combining grapheme joiner

The combining grapheme joiner (CGJ; U+034F) is formally not a control character but a combining character. It can be used to influence the display of diacritical marks and the sorting of digraphs under the Unicode Collation Algorithm.
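One effect of U+034F on diacritical marks can be observed without a renderer. The sketch below assumes that the relevant mechanism is canonical reordering during normalization: CGJ has combining class 0, so combining marks are not reordered across it.

```python
import unicodedata

CGJ = "\u034F"
DIAERESIS = "\u0308"   # combining class 230 (above)
DOT_BELOW = "\u0323"   # combining class 220 (below)

print(unicodedata.category(CGJ), unicodedata.combining(CGJ))

# Without CGJ, normalization sorts the marks by combining class.
plain = "a" + DIAERESIS + DOT_BELOW
print(unicodedata.normalize("NFD", plain) == "a" + DOT_BELOW + DIAERESIS)

# With CGJ between the marks, their original order is preserved.
guarded = "a" + DIAERESIS + CGJ + DOT_BELOW
print(unicodedata.normalize("NFD", guarded) == guarded)
```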

Bidirectional texts

For bidirectional text there are a number of special control characters that force a particular writing direction and thereby influence the display.
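The text above does not list the individual characters. The following sketch uses a selection of directional marks, embeddings, overrides and isolates from the Unicode Character Database; the helper names isolate and strip_bidi_controls are hypothetical.

```python
import unicodedata

# A selection of bidirectional control characters (not an exhaustive list).
BIDI_CONTROLS = (
    "\u200E\u200F"              # left-to-right mark, right-to-left mark
    "\u202A\u202B\u202C"        # embeddings and pop directional formatting
    "\u202D\u202E"              # overrides
    "\u2066\u2067\u2068\u2069"  # isolates and pop directional isolate
)

def isolate(text: str) -> str:
    """Wrap text in first strong isolate ... pop directional isolate."""
    return "\u2068" + text + "\u2069"

def strip_bidi_controls(text: str) -> str:
    """Remove the bidirectional controls listed above from a string."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

for ch in BIDI_CONTROLS:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch))
```

Stripping or at least detecting these characters can be useful when displaying untrusted text, since overrides change the visual order of the characters around them.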

Deprecated format characters

Some control characters are marked as deprecated; their use is not recommended. These are the following characters:

U+206A (inhibit symmetric swapping) and U+206B (activate symmetric swapping) switch off or on the normal behavior of the Unicode Bidirectional Algorithm whereby mirrorable characters (such as brackets) are displayed mirrored in right-to-left text.

U+206C (inhibit Arabic form shaping) and U+206D (activate Arabic form shaping) switch off or on the behavior, which is off by default, of replacing Arabic compatibility characters for particular letter shapes with the shape that is correct in the respective context.

U+206E (national digit shapes) and U+206F (nominal digit shapes) switch on or off the replacement, which is not otherwise performed, of the digits 0 to 9 with the digit forms customary in the user's language (Arabic-Indic, etc.).
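Because the use of the six characters U+206A to U+206F is not recommended, a program may want to detect or remove them. A minimal sketch, with the hypothetical helper strip_deprecated:

```python
import unicodedata

# U+206A..U+206F: the six deprecated format characters described above.
DEPRECATED_FORMAT = {chr(cp) for cp in range(0x206A, 0x2070)}

def strip_deprecated(text: str) -> str:
    """Remove the deprecated format characters from a string."""
    return "".join(ch for ch in text if ch not in DEPRECATED_FORMAT)

for ch in sorted(DEPRECATED_FORMAT):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```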

Variation selectors

Variation selectors make it possible to request specific glyph variants even in plain text, without meta-information about the desired font. Formally, variation selectors are combining characters: they directly follow the character for which they select a particular variant. A total of 259 variation selectors are defined: U+180B to U+180D are intended for use with Mongolian characters, while U+FE00 to U+FE0F and U+E0100 to U+E01EF are for general use. The exact effects of the variation selectors are specified in two documents, the Unicode Ideographic Variation Database and the file StandardizedVariants.txt. For example, the variation selector U+FE00 following the union sign U+222A specifies that the sign should be displayed with serifs.
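The three ranges given above can be checked with simple arithmetic. The sketch below counts the selectors and builds the union example; the helper name is_variation_selector is hypothetical.

```python
# The three ranges named in the text, as inclusive (first, last) pairs.
VS_RANGES = [(0x180B, 0x180D), (0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]

def is_variation_selector(ch: str) -> bool:
    """True if the character lies in one of the ranges above."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in VS_RANGES)

total = sum(last - first + 1 for first, last in VS_RANGES)
print(total)  # 3 + 16 + 240

# The selector directly follows the character it modifies.
union_with_serifs = "\u222A\uFE00"
print([is_variation_selector(ch) for ch in union_with_serifs])
```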

Noncharacter code points

Some code points are permanently reserved and will never be assigned a character. In addition to the last two code points of each plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, …, U+10FFFE, U+10FFFF), these are the code points in the range U+FDD0 to U+FDEF. The value FFFE must remain unassigned because it is the byte-swapped counterpart of the byte order mark (U+FEFF), so that the byte order of a data stream can be recognized; the value FFFF (all 16 bits set) cannot be distinguished from a missing signal in various data transmissions. The other code points correspond to bit patterns that are needed for internal processing purposes. These code points are therefore not control characters in the narrower sense. Programs may use them internally as needed, but they are not suitable for the interchange and display of text. They should not be confused with currently unassigned code points, which may be given a character in later versions.
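The description above can be turned into a small predicate. In this sketch, the hypothetical function is_reserved_code_point tests for U+FDD0 to U+FDEF and for the last two code points of each of the 17 planes.

```python
def is_reserved_code_point(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in every plane."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The low 16 bits are FFFE or FFFF exactly when (cp & 0xFFFE) == 0xFFFE.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

reserved = [cp for cp in range(0x110000) if is_reserved_code_point(cp)]
print(len(reserved))  # 32 in the BMP range plus 2 per plane
```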

Byte order mark

In addition to its original role in line breaking, the character U+FEFF now serves as the byte order mark: it indicates the byte order of a text and makes it easier to detect the encoding automatically.
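Automatic detection can be sketched by comparing the first bytes of a data stream with the encoded forms of U+FEFF. The example uses the BOM constants from Python's codecs module; the function name detect_bom is hypothetical.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: the UTF-32-LE mark (FF FE 00 00)
# begins with the UTF-16-LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading byte order mark, if any."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None

print(detect_bom("\uFEFFhello".encode("utf-16-be")))
print(detect_bom(b"hello"))
```

A UTF-16-LE text that starts with U+FEFF followed by U+0000 would be reported as UTF-32-LE by this sketch, so the result is a hint rather than a proof.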

Interlinear annotation characters

The characters in the range U+FFF9 to U+FFFB from the Unicode block Specials make it possible to insert annotations into text, which are usually displayed above the annotated text. They can be used, for example, to mark furigana as such. U+FFF9 (interlinear annotation anchor) introduces the annotated text, U+FFFA (interlinear annotation separator) separates it from the annotation that follows, and U+FFFB (interlinear annotation terminator) marks the end of the annotation.
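The anchor, annotated text, separator, annotation and terminator appear in that order. A minimal sketch with the hypothetical helpers annotate and strip_annotations, which handles non-nested annotations only:

```python
import re

ANCHOR = "\uFFF9"      # introduces the annotated text
SEPARATOR = "\uFFFA"   # separates the annotated text from the annotation
TERMINATOR = "\uFFFB"  # ends the annotation

def annotate(base: str, annotation: str) -> str:
    """Attach an interlinear annotation (e.g. furigana) to a base text."""
    return ANCHOR + base + SEPARATOR + annotation + TERMINATOR

# Matches one non-nested annotation and captures the annotated base text.
_ANNOTATION = re.compile(
    ANCHOR + "([^" + SEPARATOR + TERMINATOR + "]*)"
    + SEPARATOR + "[^" + TERMINATOR + "]*" + TERMINATOR
)

def strip_annotations(text: str) -> str:
    """Keep the base text and drop the annotations and their delimiters."""
    return _ANNOTATION.sub(r"\1", text)

sample = annotate("漢字", "かんじ")
print(len(sample), strip_annotations(sample))
```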

Deprecated tag characters

The Tags Unicode block (U+E0000 to U+E007F) contains characters that were originally intended to convey language and other meta-information in plain text by means of tags. These characters are now deprecated in favor of higher-level protocols such as XML. 95 of them correspond to the printable characters of ASCII; a few further characters define the type of meta-information or end its effect. The sequence <U+E0001, U+E006A, U+E0061> declares that the following text is Japanese: U+E0001 introduces a language tag, and the next two characters, after subtracting E0000 (hexadecimal), can be read in ASCII as ja, the ISO 639 language code for Japanese.
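The subtraction of hexadecimal E0000 described above can be written out directly. In this sketch the function names encode_language_tag and decode_language_tag are hypothetical; it reproduces the Japanese example and is not meant to encourage use of these deprecated characters.

```python
TAG_BASE = 0xE0000
LANGUAGE_TAG = "\U000E0001"  # introduces a language tag

def encode_language_tag(code: str) -> str:
    """Map an ASCII language code to U+E0001 followed by tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in code):
        raise ValueError("only printable ASCII can be mapped to tag characters")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in code)

def decode_language_tag(sequence: str) -> str:
    """Recover the ASCII language code by subtracting 0xE0000."""
    if not sequence.startswith(LANGUAGE_TAG):
        raise ValueError("sequence does not start with U+E0001")
    return "".join(chr(ord(c) - TAG_BASE) for c in sequence[1:])

japanese = "\U000E0001\U000E006A\U000E0061"
print(decode_language_tag(japanese))
```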

Literature

  • Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8. Chapter 16: Special Areas and Format Characters. (online; PDF; 426 kB)

References

  1. Ken Lunde, Richard Cook, John H. Jenkins: Unicode Technical Standard #37: Unicode Ideographic Variation Database. (online)
  2. StandardizedVariants, with visual representation, in the Unicode Character Database (archived copy of May 4, 2016 in the Internet Archive).