Bidirectional control character

from Wikipedia, the free encyclopedia

Unicode defines several bidirectional control characters, i.e. control characters that influence the writing direction. They are used in computer typesetting for bidirectional text, i.e. text that contains both characters from scripts written from left to right (as used for German, English or Russian) and characters from scripts written from right to left (as used for Arabic and Hebrew).

In Unicode, each letter is assigned a writing direction, while characters such as punctuation marks adapt to the writing direction of the surrounding text. If runs of text with different writing directions meet at such a character, the algorithm that determines the writing direction for display can produce the wrong result. In this case, the control characters can be used to influence the Unicode Bidirectional Algorithm and correct the writing direction.
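The directional property described above can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch; the sample characters and the helper name `bidi_class` are chosen here for illustration only.

```python
import unicodedata

def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)

# Illustrative samples: letters carry a strong direction,
# punctuation and symbols do not.
samples = {
    "C": "Latin letter",
    "\u0416": "Cyrillic letter ZHE",
    "\u05d0": "Hebrew letter ALEF",
    "\u0647": "Arabic letter HEH",
    "+": "plus sign",
    "!": "exclamation mark",
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {bidi_class(ch)}")
```

The letters report a strong class (`L`, `R` or `AL`), while the plus sign and the exclamation mark report the weak or neutral classes `ES` and `ON`, which is why their displayed direction depends on context.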

Character encoding

While the left-to-right mark and the right-to-left mark are also available in other character sets, the remaining control characters exist only in Unicode. Except for ALM, which is in the Arabic Unicode block, all of these characters are in the General Punctuation Unicode block.
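As a hedged illustration of this point, the sketch below checks which control characters can be encoded by a given codec. The choice of Python's `cp1255` and `cp1256` codecs (Windows code pages for Hebrew and Arabic) is an assumption made for this example; the text above does not name specific character sets.

```python
LRM, RLM, LRE = "\u200e", "\u200f", "\u202a"

def encodable(ch: str, codec: str) -> bool:
    """Report whether a character can be represented in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Assumed example codecs, not taken from the article.
for codec in ("cp1255", "cp1256", "utf-8"):
    print(codec, {name: encodable(ch, codec)
                  for name, ch in (("LRM", LRM), ("RLM", RLM), ("LRE", LRE))})
```

Under these assumptions, the two marks are encodable in the legacy code pages, while an explicit control such as LRE is only encodable in a Unicode encoding.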

Encoding in Unicode

| Code point (decimal) | Description | Official name | Abbreviation | Meaning |
|---|---|---|---|---|
| U+061C (1564) | Arabic letter mark | ARABIC LETTER MARK | ALM | is treated like an invisible Arabic letter |
| U+200E (8206) | Left-to-right mark | LEFT-TO-RIGHT MARK | LRM | is treated like a letter in a script written from left to right |
| U+200F (8207) | Right-to-left mark | RIGHT-TO-LEFT MARK | RLM | is treated like a letter in a script written from right to left |
| U+202A (8234) | Left-to-right embedding | LEFT-TO-RIGHT EMBEDDING | LRE | the base writing direction of the following text is from left to right |
| U+202B (8235) | Right-to-left embedding | RIGHT-TO-LEFT EMBEDDING | RLE | the base writing direction of the following text is from right to left |
| U+202C (8236) | Pop directional formatting | POP DIRECTIONAL FORMATTING | PDF | terminates the effect of one of the characters LRE, RLE, RLO, LRO |
| U+202D (8237) | Left-to-right override | LEFT-TO-RIGHT OVERRIDE | LRO | all following characters are treated like characters in a script written from left to right |
| U+202E (8238) | Right-to-left override | RIGHT-TO-LEFT OVERRIDE | RLO | all following characters are treated like characters in a script written from right to left |
| U+2066 (8294) | Left-to-right isolate | LEFT-TO-RIGHT ISOLATE | LRI | the base writing direction of the following text is from left to right, without affecting any characters outside |
| U+2067 (8295) | Right-to-left isolate | RIGHT-TO-LEFT ISOLATE | RLI | the base writing direction of the following text is from right to left, without affecting any characters outside |
| U+2068 (8296) | First strong isolate | FIRST STRONG ISOLATE | FSI | the following text is treated in isolation from its surroundings; its base direction is determined by its first strong character |
| U+2069 (8297) | Pop directional isolate | POP DIRECTIONAL ISOLATE | PDI | terminates the effect of one of the characters LRI, RLI, FSI |
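The code points and official names listed above can be cross-checked against Python's `unicodedata` module. This is a small sketch; the dictionary name `BIDI_CONTROLS` and the helper `describe` are hypothetical names introduced here, and the result depends on the Unicode version bundled with the interpreter (ALM and the isolate characters require Unicode 6.3 or later).

```python
import unicodedata

# Abbreviation -> character, following the list above.
BIDI_CONTROLS = {
    "ALM": "\u061c", "LRM": "\u200e", "RLM": "\u200f",
    "LRE": "\u202a", "RLE": "\u202b", "PDF": "\u202c",
    "LRO": "\u202d", "RLO": "\u202e",
    "LRI": "\u2066", "RLI": "\u2067", "FSI": "\u2068", "PDI": "\u2069",
}

def describe(abbr: str) -> tuple[str, str, str]:
    """Return (code point, official name, bidi class) for an abbreviation."""
    ch = BIDI_CONTROLS[abbr]
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch)

for abbr in BIDI_CONTROLS:
    print(abbr, *describe(abbr))
```

The bidi class reported for the three marks is that of a strong letter (`AL`, `L`, `R`), whereas each explicit control character has a class of its own named after its abbreviation.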

The left-to-right mark (LEFT-TO-RIGHT MARK), the right-to-left mark (RIGHT-TO-LEFT MARK) and the Arabic letter mark (ARABIC LETTER MARK) are called implicit control characters; the others are called explicit.

HTML has named entities for the left-to-right and right-to-left marks: &lrm; and &rlm;. According to a recommendation by the Unicode Consortium, the other characters should not be used on websites; instead, HTML provides the dir attribute with the values "ltr" or "rtl" as well as the elements <bdi> and <bdo>.
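A short sketch of how these HTML facilities relate to the characters, using Python's standard `html` module. The helper names `bdi` and `span_dir` are hypothetical and only illustrate generating the markup mentioned above.

```python
import html
from html.entities import name2codepoint

# The named entities resolve to the two marks.
LRM_FROM_ENTITY = html.unescape("&lrm;")
RLM_FROM_ENTITY = html.unescape("&rlm;")

def bdi(text: str) -> str:
    """Wrap text in a <bdi> element (hypothetical helper), escaping markup."""
    return f"<bdi>{html.escape(text)}</bdi>"

def span_dir(text: str, direction: str) -> str:
    """Wrap text in a <span> with an explicit dir attribute (hypothetical helper)."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    return f'<span dir="{direction}">{html.escape(text)}</span>'

print(f"U+{ord(LRM_FROM_ENTITY):04X}", f"U+{ord(RLM_FROM_ENTITY):04X}")
print(bdi("C++"))
print(span_dir("C++", "ltr"))
```

Generating markup this way keeps the invisible control characters out of the page source, which is the approach the recommendation favors.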

The explicit control characters can also be nested, up to a depth of 125 levels. If no PDF or PDI follows, their effect ends at the end of the paragraph.
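The nesting and implicit termination can be sketched with a simple scope counter. This is a deliberately simplified model and not the full Unicode Bidirectional Algorithm: it only counts open scopes, does not compute embedding levels, and does not enforce the depth limit. The function name `scan_paragraph` is a hypothetical name used for illustration.

```python
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

EMBEDDING_INITIATORS = {LRE, RLE, LRO, RLO}
ISOLATE_INITIATORS = {LRI, RLI, FSI}

def scan_paragraph(text: str) -> tuple[int, int]:
    """Return (scopes still open at paragraph end, deepest nesting seen).

    Simplified: PDF closes the innermost scope only if it is an embedding
    or override; PDI closes the innermost isolate together with any
    embeddings opened inside it. Unmatched PDF or PDI are ignored.
    """
    stack: list[str] = []
    deepest = 0
    for ch in text:
        if ch in EMBEDDING_INITIATORS:
            stack.append("embedding")
        elif ch in ISOLATE_INITIATORS:
            stack.append("isolate")
        elif ch == PDF:
            if stack and stack[-1] == "embedding":
                stack.pop()
        elif ch == PDI:
            if "isolate" in stack:
                # Pop until the matching isolate itself has been removed.
                while stack.pop() != "isolate":
                    pass
        deepest = max(deepest, len(stack))
    return len(stack), deepest

print(scan_paragraph(LRE + "C++" + PDF))      # properly terminated
print(scan_paragraph(RLI + RLE + "x" + PDI))  # PDI also closes the inner RLE
print(scan_paragraph(LRE + "abc"))            # left open until paragraph end
```

A nonzero first value means the paragraph ends with scopes still open, which the paragraph boundary then closes implicitly.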

Text passages enclosed in embedding or override control characters affect their neighboring characters in the same way as a character with the corresponding writing direction. The isolate control characters, which were newly introduced with Unicode 6.3, instead keep the enclosed text separate from its surroundings, so that it has no influence on them.

Example

An Arabic text about the programming language C++ could start as follows (read from right to left):

C ++ هي لغة برمجة تستخدم ...

The ++, which itself has no fixed writing direction, stands between the C (a character from a script written from left to right) and the Arabic text. The web browser therefore falls back on the base writing direction of the paragraph and displays the ++ from right to left, i.e. incorrectly to the left of the C.

If a left-to-right mark is inserted after the ++, the ++ is surrounded by two characters that are both treated as written from left to right, so the browser also renders the ++ from left to right and thus displays it to the right of the C:

C ++ هي لغة برمجة تستخدم ...

Alternatively, an LRE could be inserted in front of the C and a PDF after the ++ to indicate that C++ is a term embedded in the Arabic text and written from left to right. The control characters LRI or FSI before and PDI after C++ can also be used. The display is identical in all these cases; however, if the text contained other weak characters, the choice of control characters could also influence how they are displayed.
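The alternatives described above amount to wrapping the term in a pair of control characters. The helpers below are a minimal sketch with hypothetical names (`embed_ltr`, `isolate_ltr`, `isolate_auto`, `strip_controls`); they only build the strings and do not perform any display reordering.

```python
LRE, PDF = "\u202a", "\u202c"
LRI, FSI, PDI = "\u2066", "\u2068", "\u2069"

def embed_ltr(term: str) -> str:
    """Mark the term as a left-to-right embedding: LRE ... PDF."""
    return LRE + term + PDF

def isolate_ltr(term: str) -> str:
    """Mark the term as a left-to-right isolate: LRI ... PDI."""
    return LRI + term + PDI

def isolate_auto(term: str) -> str:
    """Isolate the term and let its first strong character set the direction."""
    return FSI + term + PDI

def strip_controls(text: str) -> str:
    """Remove the wrapping characters used above, leaving the visible text."""
    return "".join(ch for ch in text if ch not in {LRE, PDF, LRI, FSI, PDI})

for wrap in (embed_ltr, isolate_ltr, isolate_auto):
    wrapped = wrap("C++")
    print(wrap.__name__, [f"U+{ord(ch):04X}" for ch in wrapped])
```

All three variants leave the visible characters unchanged; they differ only in the invisible characters placed around the term.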

References

  1. Mark Davis: UAX #9: Unicode Bidirectional Algorithm. 2.7 Markup and Formatting Characters. May 14, 2017, accessed March 29, 2018.
  2. Mark Davis: UAX #9: Unicode Bidirectional Algorithm. 3.1 Definitions. May 14, 2017, accessed March 29, 2018.