Bidirectional control character

Unicode defines several bidirectional control characters, i.e. control characters that influence the writing direction. They are used in computer typesetting for bidirectional text, i.e. text that contains both characters from scripts written from left to right (as used for German, English or Russian) and characters from scripts written from right to left (as used for Arabic and Hebrew).

In Unicode, each letter is assigned a writing direction, whereas punctuation marks, for example, adopt the writing direction of the surrounding text. If runs of text with different writing directions meet at such a character, the algorithm that determines the direction for display can produce the wrong result. In this case, the corresponding control characters can be used to influence the Unicode Bidi algorithm and correct the writing direction.
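
The direction assigned to each character can be inspected with Python's standard unicodedata module. The following is a minimal sketch; the helper name bidi_classes is hypothetical and only serves this illustration.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, bidirectional class) pairs for every character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# "L" is a strong left-to-right class, "R" and "AL" are strong right-to-left
# classes; "ES" (European separator) and "ON" (other neutral) have no fixed
# direction of their own and depend on their surroundings.
for ch, cls in bidi_classes("C+!\u0647\u05D0"):
    print(f"U+{ord(ch):04X} {cls}")
```

Letters report a strong class, while characters such as + or ! report a weak or neutral class, which is why their displayed direction depends on the neighboring text.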

Character encoding

While the left-to-right and right-to-left marks are also available in various other character sets, the remaining control characters exist only in Unicode. Except for the ALM, which is in the Arabic Unicode block, all of these characters are in the General Punctuation Unicode block.

Encoding in Unicode
Code point (decimal)	Description	Official name	Abbreviation	Meaning
U+061C (1564)	Arabic letter mark	ARABIC LETTER MARK	ALM	is treated like an invisible Arabic letter
U+200E (8206)	Left-to-right mark	LEFT-TO-RIGHT MARK	LRM	is treated like a letter in a script written from left to right
U+200F (8207)	Right-to-left mark	RIGHT-TO-LEFT MARK	RLM	is treated like a letter in a script written from right to left
U+202A (8234)	Left-to-right embedding	LEFT-TO-RIGHT EMBEDDING	LRE	the base writing direction of the following text is from left to right
U+202B (8235)	Right-to-left embedding	RIGHT-TO-LEFT EMBEDDING	RLE	the base writing direction of the following text is from right to left
U+202C (8236)	Pop directional formatting	POP DIRECTIONAL FORMATTING	PDF	terminates the effect of one of the characters LRE, RLE, LRO, RLO
U+202D (8237)	Left-to-right override	LEFT-TO-RIGHT OVERRIDE	LRO	all following characters are treated like characters in a script written from left to right
U+202E (8238)	Right-to-left override	RIGHT-TO-LEFT OVERRIDE	RLO	all following characters are treated like characters in a script written from right to left
U+2066 (8294)	Left-to-right isolate	LEFT-TO-RIGHT ISOLATE	LRI	the base writing direction of the following text is from left to right, without affecting any characters outside
U+2067 (8295)	Right-to-left isolate	RIGHT-TO-LEFT ISOLATE	RLI	the base writing direction of the following text is from right to left, without affecting any characters outside
U+2068 (8296)	First strong isolate	FIRST STRONG ISOLATE	FSI	the following text is treated in isolation from the rest; its base direction is taken from its first strongly directional character
U+2069 (8297)	Pop directional isolate	POP DIRECTIONAL ISOLATE	PDI	terminates the effect of one of the characters LRI, RLI, FSI

The left-to-right mark (LEFT-TO-RIGHT MARK), the right-to-left mark (RIGHT-TO-LEFT MARK) and the Arabic letter mark (ARABIC LETTER MARK) are called implicit control characters; the others are called explicit.
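
The official names and bidirectional classes of these characters can be looked up programmatically. The sketch below uses Python's unicodedata module; the constants IMPLICIT and EXPLICIT and the helper describe are hypothetical names chosen for this illustration, and the lookup assumes a Python version whose Unicode database already includes the isolate characters (Unicode 6.3 or later).

```python
import unicodedata

# Grouping as described in the text: three implicit marks, nine explicit controls.
IMPLICIT = "\u061C\u200E\u200F"
EXPLICIT = "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"

def describe(ch):
    """Return (code point, official name, bidi class, kind) for a control character."""
    kind = "implicit" if ch in IMPLICIT else "explicit"
    return (f"U+{ord(ch):04X}", unicodedata.name(ch),
            unicodedata.bidirectional(ch), kind)

for ch in IMPLICIT + EXPLICIT:
    print(*describe(ch), sep="\t")
```

The implicit marks carry the same class as an ordinary letter of the respective direction, whereas each explicit control has a bidirectional class of its own.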

HTML has named entities for the left-to-right and right-to-left marks: &lrm; and &rlm;. According to a recommendation by the Unicode Consortium, the other characters should not be used on websites; instead, HTML provides the dir attribute with the values "ltr" or "rtl" as well as the tags <bdi> and <bdo>.
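
When HTML is generated by a program, the markup-based approach might look like the following minimal sketch. The helper names bdi and with_dir are hypothetical, and wrapping the text in a span element is an assumption made for this illustration; any element that accepts the dir attribute would serve.

```python
from html import escape

def bdi(text):
    """Wrap text in <bdi> so that it is isolated from the surrounding direction."""
    return f"<bdi>{escape(text)}</bdi>"

def with_dir(text, direction):
    """Wrap text in a span with an explicit dir attribute ("ltr" or "rtl")."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    return f'<span dir="{direction}">{escape(text)}</span>'

print(bdi("C++"))
print(with_dir("C++", "ltr"))
```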

The explicit control characters can be nested to a depth of up to 125 levels. If no PDF or PDI follows, their effect ends at the end of the paragraph.
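
Because explicit controls are closed implicitly at the end of the paragraph, it can be useful to check which ones were left open. The following is a simplified sketch, not a full implementation of the Bidi algorithm: the function name unclosed_controls is hypothetical, and the depth limit of 125 levels is not modeled.

```python
EMBEDDING_OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202C"
PDI = "\u2069"

def unclosed_controls(paragraph):
    """Return the explicit controls that are still open at the end of the paragraph."""
    stack = []
    for ch in paragraph:
        if ch in EMBEDDING_OPENERS or ch in ISOLATE_OPENERS:
            stack.append(ch)
        elif ch == PDF:
            # PDF closes the innermost embedding or override,
            # but does not reach across an open isolate.
            if stack and stack[-1] in EMBEDDING_OPENERS:
                stack.pop()
        elif ch == PDI:
            # PDI closes the innermost isolate together with any
            # embeddings or overrides opened inside it.
            if any(c in ISOLATE_OPENERS for c in stack):
                while stack.pop() not in ISOLATE_OPENERS:
                    pass
    return stack

print(unclosed_controls("\u202Aabc\u202C"))  # balanced: nothing left open
print(unclosed_controls("\u2067abc"))        # the RLI is still open
```

An unmatched PDF or PDI is simply ignored in this sketch.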

Text passages enclosed in embedding or override control characters affect their neighboring characters in the same way as characters of the corresponding writing direction. The isolate control characters, newly introduced in Unicode 6.3, instead keep the enclosed text separate from its surroundings, so that it has no influence on them.

Example

An Arabic text about the programming language C++ could start as follows (read from right to left):

C++ هي لغة برمجة تستخدم ...

The ++, which has no fixed writing direction of its own, stands between the C (a character from a script written from left to right) and the Arabic text. The web browser therefore falls back on the base writing direction of the paragraph and displays the ++ as written from right to left, i.e. incorrectly to the left of the C.

If you insert a left-to-right mark after the ++, the ++ is surrounded by two characters that are both treated as written from left to right, so that the browser also treats the ++ as left-to-right and displays it to the right of the C:

C++‎ هي لغة برمجة تستخدم ...

Alternatively, you could insert an LRE before the C and a PDF after the ++ to indicate that C++ is a term embedded in the Arabic text and written from left to right. The control characters LRI or FSI before and PDI after C++ can also be used. The display is identical in all these cases; however, if the text contained other weak characters, the choice of control characters could also affect how they are displayed.
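
The variants described above can be built as strings in a program. This is a minimal sketch; the helper names mark_ltr, embed_ltr and isolate_ltr are hypothetical, and how each result is finally displayed depends on the rendering software.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def mark_ltr(term):
    """Append an LRM so that trailing weak characters sit between two LTR characters."""
    return term + LRM

def embed_ltr(term):
    """Enclose the term in LRE ... PDF (left-to-right embedding)."""
    return LRE + term + PDF

def isolate_ltr(term):
    """Enclose the term in LRI ... PDI (left-to-right isolate)."""
    return LRI + term + PDI

rest = " هي لغة برمجة تستخدم ..."
for variant in (mark_ltr, embed_ltr, isolate_ltr):
    sentence = variant("C++") + rest
    # Show the code points of the term with its control characters.
    print(variant.__name__, [f"U+{ord(c):04X}" for c in sentence[: len(sentence) - len(rest)]])
```

Replacing LRI with FSI (U+2068) would let the term take its direction from its first strong character, here the C.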

Web links

Bidi algorithm in the appendix of the Unicode standard (English)

References

Mark Davis: UAX #9: Unicode Bidirectional Algorithm. 2.7 Markup and Formatting Characters. May 14, 2017, accessed March 29, 2018.
Mark Davis: UAX #9: Unicode Bidirectional Algorithm. 3.1 Definitions. May 14, 2017, accessed March 29, 2018.