Bidirectional control character
Unicode defines several bidirectional control characters, i.e. control characters that influence the writing direction. They are used in computer typesetting for bidirectional text, i.e. text that contains both characters from scripts written from left to right (such as those used for German, English or Russian) and characters from scripts written from right to left (such as Arabic and Hebrew).
In Unicode, each letter is assigned a writing direction, whereas punctuation marks, for example, adopt the writing direction of the surrounding text. Where runs of text with different writing directions meet at such a character, the algorithm that determines the writing direction for display can produce the wrong result. In that case, the appropriate control characters can be used to influence the Unicode Bidirectional Algorithm and correct the writing direction.
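The direction assigned to each character can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch; the sample characters and the helper name `bidi_class` are chosen here purely for illustration.

```python
import unicodedata

# Sample characters: a Latin letter, a Hebrew letter, an Arabic letter,
# and two characters without a strong direction of their own.
samples = ["A", "\u05D0", "\u0647", "+", "!"]

def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {bidi_class(ch)}")
```

Letters report a strong class (`L`, `R` or `AL`), while characters such as `+` and `!` report weak or neutral classes whose display direction depends on their context.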
Character encoding
While the left-to-right and right-to-left marks are also available in other character sets, the remaining control characters exist only in Unicode. Except for the ALM, which is in the Arabic Unicode block, all of these characters are in the General Punctuation Unicode block.
Code point | Description | Official name | Abbreviation | Meaning |
---|---|---|---|---|
U+061C (1564) | Arabic letter mark | ARABIC LETTER MARK | ALM | is treated like an invisible Arabic letter |
U+200E (8206) | Left-to-right mark | LEFT-TO-RIGHT MARK | LRM | is treated like a letter in a script written from left to right |
U+200F (8207) | Right-to-left mark | RIGHT-TO-LEFT MARK | RLM | is treated like a letter in a script written from right to left |
U+202A (8234) | Left-to-right embedding | LEFT-TO-RIGHT EMBEDDING | LRE | the base writing direction of the following text is from left to right |
U+202B (8235) | Right-to-left embedding | RIGHT-TO-LEFT EMBEDDING | RLE | the base writing direction of the following text is from right to left |
U+202C (8236) | Pop directional formatting | POP DIRECTIONAL FORMATTING | PDF | terminates the effect of one of the characters LRE, RLE, RLO, LRO |
U+202D (8237) | Left-to-right override | LEFT-TO-RIGHT OVERRIDE | LRO | all following characters are treated like characters in a script written from left to right |
U+202E (8238) | Right-to-left override | RIGHT-TO-LEFT OVERRIDE | RLO | all following characters are treated like characters in a script written from right to left |
U+2066 (8294) | Left-to-right isolate | LEFT-TO-RIGHT ISOLATE | LRI | the base writing direction of the following text is from left to right, without affecting any characters outside |
U+2067 (8295) | Right-to-left isolate | RIGHT-TO-LEFT ISOLATE | RLI | the base writing direction of the following text is from right to left, without affecting any characters outside |
U+2068 (8296) | First strong isolate | FIRST STRONG ISOLATE | FSI | the following text is treated in isolation from the rest; its base writing direction is taken from its first strong character |
U+2069 (8297) | Pop directional isolate | POP DIRECTIONAL ISOLATE | PDI | terminates the effect of one of the characters LRI, RLI, FSI |
The left-to-right mark (LEFT-TO-RIGHT MARK), the right-to-left mark (RIGHT-TO-LEFT MARK) and the Arabic letter mark (ARABIC LETTER MARK) are called implicit control characters; the others are called explicit.
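The split between implicit and explicit control characters also shows up in the Unicode character data. The sketch below looks up the listed code points with Python's `unicodedata` module; it assumes a Python version whose Unicode database is at least version 6.3, and the names `IMPLICIT`, `EXPLICIT` and `describe` are illustrative only.

```python
import unicodedata

IMPLICIT = ["\u061C", "\u200E", "\u200F"]              # ALM, LRM, RLM
EXPLICIT = [chr(cp) for cp in range(0x202A, 0x202F)]   # LRE, RLE, PDF, LRO, RLO
EXPLICIT += [chr(cp) for cp in range(0x2066, 0x206A)]  # LRI, RLI, FSI, PDI

def describe(ch: str):
    """Return (code point, official name, bidi class) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))

for ch in IMPLICIT + EXPLICIT:
    print(*describe(ch), sep=" | ")
```

The implicit characters carry the bidirectional class of an ordinary strong letter (`AL`, `L`, `R`), whereas each explicit character has a class of its own that matches its abbreviation.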
HTML has named entities for the left-to-right and right-to-left marks: &lrm; and &rlm;. According to a recommendation by the Unicode Consortium, the other characters should not be used on websites; instead, HTML provides the dir attribute with the values "ltr" or "rtl" as well as the tags <bdi> and <bdo>.
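As an illustration of the HTML side, the following sketch uses Python's standard `html` module to confirm which code points the two named entities stand for, and builds a `<bdi>` element as markup in place of explicit control characters. The helper `wrap_bdi` is hypothetical and not part of any library.

```python
import html
from html.entities import name2codepoint

# The two named entities decode to the LRM and RLM code points.
lrm = html.unescape("&lrm;")
rlm = html.unescape("&rlm;")

def wrap_bdi(text: str, direction: str = "") -> str:
    """Wrap text in a <bdi> element, optionally with dir="ltr" or dir="rtl".

    Hypothetical helper for illustration; the text is HTML-escaped first.
    """
    if direction not in ("", "ltr", "rtl"):
        raise ValueError("direction must be 'ltr', 'rtl' or empty")
    attr = f' dir="{direction}"' if direction else ""
    return f"<bdi{attr}>{html.escape(text)}</bdi>"

print(f"U+{ord(lrm):04X}", f"U+{ord(rlm):04X}")
print(wrap_bdi("C++", "ltr"))
```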
The explicit control characters can also be nested, up to a depth of 125 levels. If no PDF or PDI follows, their effect ends at the end of the paragraph.
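Nesting and termination can be illustrated with a small scanner that tracks which explicit control characters are still open within one paragraph. This is a deliberately simplified sketch, not the procedure from the Unicode Bidirectional Algorithm: it counts one level per opening character and omits the overflow bookkeeping of the real algorithm. The function name `scan_paragraph` is made up for this example.

```python
LRE, RLE, PDF, LRO, RLO = "\u202A", "\u202B", "\u202C", "\u202D", "\u202E"
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

EMBEDDING_OPENERS = {LRE, RLE, LRO, RLO}
ISOLATE_OPENERS = {LRI, RLI, FSI}
MAX_DEPTH = 125  # nesting limit mentioned in the text

def scan_paragraph(paragraph: str):
    """Return (max_depth, still_open) for the explicit controls in a paragraph.

    Simplified assumption: every opener adds exactly one level.
    """
    stack = []  # "E" for embedding/override, "I" for isolate
    max_depth = 0
    for ch in paragraph:
        if ch in EMBEDDING_OPENERS or ch in ISOLATE_OPENERS:
            if len(stack) < MAX_DEPTH:  # openers beyond the limit are ignored
                stack.append("I" if ch in ISOLATE_OPENERS else "E")
                max_depth = max(max_depth, len(stack))
        elif ch == PDF:
            # PDF only closes an embedding or override that is innermost.
            if stack and stack[-1] == "E":
                stack.pop()
        elif ch == PDI:
            # PDI closes the innermost isolate and everything opened inside it.
            if "I" in stack:
                while stack.pop() != "I":
                    pass
    return max_depth, len(stack)

# An override that is never closed simply stays open until the paragraph ends.
print(scan_paragraph(RLO + "abc"))
```

A non-zero second value means that some control characters were left open and would only be terminated by the end of the paragraph.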
Text passages enclosed by embedding or override control characters influence their neighboring characters in the same way as characters of the corresponding writing direction would. The isolate control characters, newly introduced in Unicode 6.3, instead keep the enclosed text separate from its surroundings, so that it has no influence on them.
Example
An Arabic text about the programming language C++ could start (reading from right to left) with:
C ++ هي لغة برمجة تستخدم ...
The ++, which has no fixed writing direction of its own, stands between the C (a character from a script written from left to right) and the Arabic text. The web browser therefore falls back on the base writing direction of the paragraph and displays the ++ from right to left, i.e. incorrectly to the left of the C.
If a left-to-right mark is inserted after the ++, the ++ is surrounded by two characters that are both written from left to right, so the browser also lays out the ++ from left to right and thus displays it to the right of the C:
C ++ هي لغة برمجة تستخدم ...
Alternatively, an LRE could be inserted before the C and a PDF after the ++ to indicate that C++ is a term embedded in the Arabic text and written from left to right. The control characters LRI or FSI before and PDI after C++ can be used as well. The display is identical in all of these cases; if the text contained further weak characters, however, the choice of control characters could also affect how they are displayed.
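The alternatives described in this example can be written out as strings. The sketch below only constructs the character sequences; how they are displayed depends on the renderer. The names `variants` and `strip_controls` are illustrative.

```python
LRM = "\u200E"
LRE, PDF = "\u202A", "\u202C"
LRI, FSI, PDI = "\u2066", "\u2068", "\u2069"
CONTROLS = {LRM, LRE, PDF, LRI, FSI, PDI}

term = "C++"
rest = " هي لغة برمجة تستخدم ..."

variants = {
    "plain": term + rest,                # ++ may end up left of the C
    "LRM": term + LRM + rest,            # mark inserted after the ++
    "LRE/PDF": LRE + term + PDF + rest,  # term embedded left to right
    "LRI/PDI": LRI + term + PDI + rest,  # term isolated, left to right
    "FSI/PDI": FSI + term + PDI + rest,  # term isolated, direction from first strong character
}

def strip_controls(text: str) -> str:
    """Remove the control characters used above, leaving the visible text."""
    return "".join(ch for ch in text if ch not in CONTROLS)

for name, text in variants.items():
    print(name, [f"U+{ord(ch):04X}" for ch in text[:5]])
```

All variants contain the same visible characters and differ only in the invisible control characters around the term.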
References
- Mark Davis: UAX #9: Unicode Bidirectional Algorithm. 2.7 Markup and Formatting Characters. May 14, 2017, accessed March 29, 2018.
- Mark Davis: UAX #9: Unicode Bidirectional Algorithm. 3.1 Definitions. May 14, 2017, accessed March 29, 2018.