Unicode bidi algorithm

from Wikipedia, the free encyclopedia

The Unicode Bidirectional Algorithm (UBA for short) is the algorithm published by the Unicode Consortium for displaying bidirectional text, i.e. text that contains both characters from scripts written from left to right and characters from scripts written from right to left.

Requirements

While a string of Unicode characters is always stored in memory in logical order, the parts that come from right-to-left scripts must be reversed for output. Note that numbers are written from left to right even in right-to-left text. In addition, some characters, such as brackets, must be output mirrored when the writing direction is reversed.

History

Mark Davis, Aharon Lanin and Andrew Glass are named as the authors of the current formulation of the algorithm. The algorithm was originally described directly in the Unicode Standard but was later moved to an annex. Up to the first official version of this annex on February 8, 1999, the algorithm was repeatedly corrected and otherwise changed; the revisions that followed mainly clarified imprecise wording. Only with Revision 29, for Unicode 6.3.0, was the algorithm substantially extended by several new control characters and new rules. Previously, one bracket of a pair could be treated as left-to-right text and the other as right-to-left text, which led to illegible output. In addition, the maximum nesting depth was roughly doubled, from 61 to 125.

Basics

Several bidirectional control characters are defined to influence the algorithm, in particular the left-to-right mark and the right-to-left mark.

A bidi class (Bidi_Class) is also assigned to each Unicode character. The classes are divided into four categories:

Strong characters have a clear writing direction. There are the following values:

  • L for characters from scripts written from left to right, in particular all Latin letters. The left-to-right mark also has this value.
  • R for characters from scripts written from right to left, e.g. Hebrew. The right-to-left mark also has this value. Characters of the Arabic, Syriac and Thaana scripts, as well as the Arabic letter mark introduced with Unicode 6.3, share the separate value AL.

The following values occur among the so-called weak characters:

  • EN for the European digits (the Hindu-Arabic digits used in Europe), ES for characters that can appear within a number made of these digits (plus and minus signs) and ET for characters that can appear at the beginning or end of a number, such as currency symbols.
  • AN for Arabic-Indic digits and punctuation used in such numbers.
  • CS for characters that can separate both European and Arabic digits, such as period and comma.
  • NSM for combining characters such as accents, which are combined with the preceding character when displayed.
  • BN for characters such as the soft hyphen that do not appear in the output.

There are also neutral characters with the values B, S, WS and ON; these are the various kinds of whitespace and other neutral characters (exclamation mark etc.).

Finally, there are the explicit control characters. They are divided into control characters for embedding and overriding (LRE, RLE, LRO, RLO, PDF) and control characters for isolating (LRI, RLI, FSI, PDI). The last group was newly introduced with Unicode 6.3.0. Every explicit control character has its own bidi class, which matches its short name.

The property Bidi_Paired_Bracket_Type identifies opening and closing brackets, and Bidi_Paired_Bracket indicates the respective counterpart.

In addition, some characters (such as brackets) are marked as mirrored (Bidi_Mirrored); for many of these characters there is also a Unicode character that represents the mirror image (Bidi_Mirroring_Glyph). For the other characters, the rendering program itself must generate a mirror image. Mirror image does not always mean an exact reflection: in the cube root sign, for example, only the radical should be mirrored, not the 3.
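As a rough illustration, Python's standard unicodedata module exposes the Bidi_Class and Bidi_Mirrored properties described above. The helper name describe is made up for this sketch, and the classes reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def describe(text):
    """Return (character, Bidi_Class, Bidi_Mirrored) for each character."""
    return [(ch, unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch)))
            for ch in text]

# Latin a, Hebrew alef, Arabic alef, digit, plus, dollar, comma, bracket, space
for ch, bidi_class, mirrored in describe("a\u05d0\u06271+$,( "):
    print(f"U+{ord(ch):04X} {bidi_class:3} mirrored={mirrored}")
```

The module does not expose Bidi_Mirroring_Glyph or the paired-bracket properties; those would have to be read from the Unicode data files.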

Algorithm

First, the text is broken down into paragraphs. The rest of the algorithm treats each paragraph individually.

Next, the characters are assigned to levels according to the explicit control characters, and the Bidi_Class property of each character is determined. This value is then changed step by step until every character is marked as L, R, EN or AN. To do this, the weak characters are adjusted first: separators between digits are treated as digits, and European digits that follow left-to-right text are treated as L. After the weak characters, the neutral characters are resolved; they adapt to the writing direction of the surrounding text. In this step, digits that were not given the type L in the previous step are treated as R.

Based on these resolved types, the level assignment is adjusted so that (starting from level 0) even-numbered levels contain text written from left to right, while odd-numbered levels run from right to left.

Finally, the text is displayed: the paragraph is divided into lines and the text is reordered according to the levels. First all runs at the highest level are reversed, then all runs at the second-highest level and above, and so on, until finally all runs with a level greater than 0 have been reversed. Characters that are written from right to left and can be mirrored are replaced by their mirror image.

Details

Paragraphs

  • Break the text down into paragraphs and apply the following steps to each paragraph individually.
  • In each paragraph, find the first character of type L, R or AL. Characters within an isolated range are ignored.
  • If such a character exists and it is of type R or AL, the base writing direction is right-to-left and level counting starts at 1. Otherwise, the paragraph is left-to-right and level counting starts at 0.
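The search for the base level can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name paragraph_level is made up, and the sketch assumes the interpreter's Unicode data already knows the isolate controls (Unicode 6.3 or later).

```python
import unicodedata

def paragraph_level(paragraph):
    """Return 1 if the first strong character outside isolates is R or AL, else 0."""
    isolate_depth = 0
    for ch in paragraph:
        cls = unicodedata.bidirectional(ch)
        if cls in ("LRI", "RLI", "FSI"):
            isolate_depth += 1            # entering an isolated range
        elif cls == "PDI":
            if isolate_depth > 0:
                isolate_depth -= 1        # leaving an isolated range
        elif isolate_depth == 0 and cls in ("L", "R", "AL"):
            return 1 if cls in ("R", "AL") else 0
    return 0                              # no strong character found

print(paragraph_level("abc \u05d0"))      # Latin first
print(paragraph_level("123 \u0627 abc"))  # digits are not strong, Arabic letter first
```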

Explicit control characters

  • First push the determined base level onto a stack that can hold up to 128 entries, together with the override status "neutral" and the isolate status "false". In the following steps, entries are repeatedly pushed onto this stack. If the stack is full or the level would be greater than 125, the entry is discarded.
  • For an RLE, determine the next higher odd level and push it together with the override status "neutral" and the isolate status "false".
  • For an LRE, determine the next higher even level and push it together with the override status "neutral" and the isolate status "false".
  • For an RLO, determine the next higher odd level and push it together with the override status "right-to-left" and the isolate status "false".
  • For an LRO, determine the next higher even level and push it together with the override status "left-to-right" and the isolate status "false".
  • For an RLI, determine the next higher odd level and push it together with the override status "neutral" and the isolate status "true".
  • For an LRI, determine the next higher even level and push it together with the override status "neutral" and the isolate status "true".
  • An FSI is treated like an RLI or an LRI, depending on the type of the first strong character that follows, which is determined in the same way as for paragraphs.
  • For a PDI or PDF, reset the level, override status and isolate status to the values that were in effect before the associated control character (RLI, LRI or FSI for PDI; RLE, LRE, RLO or LRO for PDF). If there is no such character, the PDI or PDF is ignored.
  • Assign the current level and its bidi class to every other character that is not of type BN or B. If the override status is not neutral, it determines the bidi class (L or R). The control characters RLI, LRI, FSI and PDI also receive a level in this way; the three initiators belong to the preceding level, the terminating PDI to the level that follows. The other explicit control characters are removed along with the characters of type BN.
  • Break the paragraph down into sequences of characters on the same level that are either consecutive or separated only by isolated text, and apply the following steps to these sequences. The start and end of each sequence are treated as if there were a character with the writing direction of the higher of the two adjacent levels. At the start and end of the paragraph, the base level takes the role of the other level.
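A simplified sketch of the stack handling follows. To stay short it covers only embeddings and overrides (RLE, LRE, RLO, LRO, PDF); the isolate controls, the isolate status and the splitting into sequences are deliberately left out. All names (explicit_levels, next_odd, next_even, the overflow counter) are assumptions of this sketch. It works on a list of bidi classes for one paragraph rather than on characters.

```python
MAX_DEPTH = 125  # highest level that may be pushed, as stated in the list above

def next_odd(level):
    return level + 1 if level % 2 == 0 else level + 2

def next_even(level):
    return level + 2 if level % 2 == 0 else level + 1

def explicit_levels(classes, base_level=0):
    """Map bidi classes to (level, class) pairs; removed characters map to None."""
    stack = [(base_level, None)]   # entries are (level, override direction or None)
    overflow = 0                   # embeddings discarded because of the depth limit
    result = []
    for cls in classes:
        level, override = stack[-1]
        if cls in ("RLE", "LRE", "RLO", "LRO"):
            new_level = next_odd(level) if cls[0] == "R" else next_even(level)
            if new_level <= MAX_DEPTH and overflow == 0:
                stack.append((new_level, cls[0] if cls[2] == "O" else None))
            else:
                overflow += 1      # remember the discarded entry for its PDF
            result.append(None)
        elif cls == "PDF":
            if overflow:
                overflow -= 1      # closes a discarded embedding
            elif len(stack) > 1:
                stack.pop()        # closes the innermost valid embedding
            result.append(None)    # an unmatched PDF is simply ignored
        elif cls == "BN":
            result.append(None)
        else:
            # an active override replaces the character's own class
            result.append((level, override or cls))
    return result

print(explicit_levels(["L", "RLE", "R", "LRE", "L", "PDF", "PDF", "L"]))
```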

Weak characters

  • Change NSM to the value of the preceding character or, at the start of a level run or isolate where it cannot combine with the preceding character, to ON.
  • For each European digit (EN), determine the preceding strong character. If it is of type AL, change EN to AN.
  • Change AL to R.
  • Change EN-ES-EN to EN-EN-EN, EN-CS-EN to EN-EN-EN and AN-CS-AN to AN-AN-AN.
  • Change every sequence of ET that is adjacent to an EN to EN.
  • Change all remaining ES, ET and CS to ON.
  • For each EN, determine the preceding strong character. If it is of type L, change EN to L.
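The digit and separator steps can be sketched on a list of bidi classes for one sequence. The NSM step is omitted for brevity. The function name resolve_weak and the parameter sos (the assumed direction at the start of the sequence, "L" or "R") are illustrative choices, not part of the source.

```python
def resolve_weak(classes, sos="L"):
    """Apply the EN/AL/separator/terminator steps to one sequence of bidi classes."""
    t = list(classes)
    n = len(t)

    # EN after a preceding strong AL becomes AN
    strong = sos
    for i, c in enumerate(t):
        if c in ("L", "R", "AL"):
            strong = c
        elif c == "EN" and strong == "AL":
            t[i] = "AN"

    # AL becomes R
    t = ["R" if c == "AL" else c for c in t]

    # a single separator between two digits of the same kind becomes that kind
    for i in range(1, n - 1):
        if t[i] == "ES" and t[i - 1] == "EN" and t[i + 1] == "EN":
            t[i] = "EN"
        elif t[i] == "CS" and t[i - 1] == t[i + 1] and t[i - 1] in ("EN", "AN"):
            t[i] = t[i - 1]

    # a run of ET adjacent to EN becomes EN
    i = 0
    while i < n:
        if t[i] == "ET":
            j = i
            while j < n and t[j] == "ET":
                j += 1
            if (i > 0 and t[i - 1] == "EN") or (j < n and t[j] == "EN"):
                t[i:j] = ["EN"] * (j - i)
            i = j
        else:
            i += 1

    # remaining separators and terminators become ON
    t = ["ON" if c in ("ES", "ET", "CS") else c for c in t]

    # EN after a preceding strong L becomes L
    strong = sos
    for i, c in enumerate(t):
        if c in ("L", "R"):
            strong = c
        elif c == "EN" and strong == "L":
            t[i] = "L"
    return t

# Hebrew letter, space, "16.5", space, Hebrew letter
print(resolve_weak(["R", "WS", "EN", "EN", "CS", "EN", "WS", "R"]))
```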

Neutral characters

  • Since Unicode 6.3, brackets are resolved first. Opening and closing brackets that belong together are uniformly converted to R or L. Normally the writing direction of the current level is chosen; the two brackets take the opposite direction only if their content and at least one adjacent character have that opposite direction. Neutral characters are skipped here, and EN and AN are treated as if they were R. If there is no strong character at all within the brackets, no change is made at this stage.
  • Change every sequence of neutral characters that is delimited on both sides by characters of the same type to that type. EN and AN are treated like R.
  • Change all remaining neutral characters to the base writing direction.
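The last two steps can be sketched as follows; the bracket-pair step is not included. The names resolve_neutrals, sos, eos and embedding_dir are assumptions of this sketch: sos and eos stand for the assumed directions at the two ends of the sequence, embedding_dir for the direction used as the fallback.

```python
NEUTRAL = {"B", "S", "WS", "ON"}

def resolve_neutrals(classes, sos="L", eos="L", embedding_dir="L"):
    """Resolve runs of neutral classes; the input holds L, R, EN, AN and neutrals."""
    def direction(cls):
        # digits count as R when they delimit a run of neutrals
        return "R" if cls in ("R", "EN", "AN") else "L"

    t = list(classes)
    n = len(t)
    i = 0
    while i < n:
        if t[i] in NEUTRAL:
            j = i
            while j < n and t[j] in NEUTRAL:
                j += 1
            before = direction(t[i - 1]) if i > 0 else sos
            after = direction(t[j]) if j < n else eos
            # same direction on both sides wins, otherwise fall back
            fill = before if before == after else embedding_dir
            t[i:j] = [fill] * (j - i)
            i = j
        else:
            i += 1
    return t

# Latin, two neutrals, Hebrew, two neutrals, digit, neutral, Latin
print(resolve_neutrals(["L", "ON", "ON", "R", "ON", "WS", "EN", "WS", "L"]))
```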

Adjusted level assignment

  • In even levels, increase the level by 1 for characters of type R, and by 2 for characters of type EN and AN.
  • In odd levels, increase the level by 1 for characters of type L, EN and AN.
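Written as a lookup, the two rules look like this. The function name implicit_level is made up for the sketch, and it expects a class that has already been resolved to L, R, EN or AN.

```python
def implicit_level(level, cls):
    """Return the adjusted level for a character with a resolved class."""
    if level % 2 == 0:
        increase = {"L": 0, "R": 1, "EN": 2, "AN": 2}
    else:
        increase = {"L": 1, "R": 0, "EN": 1, "AN": 1}
    return level + increase[cls]

print([implicit_level(0, c) for c in ("L", "R", "EN", "AN")])
print([implicit_level(1, c) for c in ("L", "R", "EN", "AN")])
```

In both cases the result is even for L and odd for R, while digits always end up on an even level above any right-to-left text that surrounds them.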

Rearrangement

  • Break the text down into lines.
  • In each line, first reverse all runs at the highest level, then all runs at the highest and second-highest levels, and so on, until all runs at level 1 and above have been reversed.
  • In odd levels, put combining characters after their associated base characters.
  • Replace mirrorable characters in odd levels by their corresponding mirror image.
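A minimal sketch of the reversal and mirroring steps for a single line is shown below. The function name reorder_line is made up, the MIRROR table is only a tiny illustrative subset of the mirroring data, and the repositioning of combining characters is left out. Uppercase Latin letters stand in for right-to-left letters in the usage example.

```python
# Illustrative subset; a real implementation would use the full mirroring data.
MIRROR = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}

def reorder_line(chars, levels):
    """Return the characters of one line in visual order (left to right)."""
    # mirror characters on odd levels, keep each character paired with its level
    items = [(MIRROR.get(c, c) if lvl % 2 else c, lvl)
             for c, lvl in zip(chars, levels)]
    for current in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(items):
            if items[i][1] >= current:
                j = i
                while j < len(items) and items[j][1] >= current:
                    j += 1
                items[i:j] = items[i:j][::-1]   # reverse one contiguous run
                i = j
            else:
                i += 1
    return [c for c, _ in items]

chars = list("a X(Y) 12 Z.")
levels = [0, 0, 1, 1, 1, 1, 1, 2, 2, 1, 1, 0]
print("".join(reorder_line(chars, levels)))
```

Mirroring is applied before the reversals here only because each character is then still paired with its own level; the result is the same as mirroring afterwards.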

Higher protocols

The algorithm allows higher-level protocols to influence it. Examples are the dir attribute and the <bdo> tag in HTML, as well as the unicode-bidi and direction properties in CSS, with which the base writing direction can be set or the same effect can be achieved as with the explicit control characters.

Implementation

The algorithm does not prescribe any particular implementation, as long as the result matches what one would get by strictly following the algorithm. For example, it is possible to check first whether characters from right-to-left writing systems occur in the text at all, and to skip the algorithm otherwise. This variant is implemented in the Firefox web browser, among others.
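Such a preliminary check might look like the following sketch. The function name needs_bidi and the exact set of classes tested are assumptions made for illustration; they are not taken from Firefox or any other implementation.

```python
import unicodedata

# Classes that can introduce right-to-left runs (an assumed, conservative set)
RTL_CLASSES = {"R", "AL", "AN", "RLE", "RLO", "RLI"}

def needs_bidi(text):
    """Return True if the text contains anything that could require reordering."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

print(needs_bidi("hello 123"))
print(needs_bidi("abc \u05d0"))
```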

A Unicode-conformant program does not necessarily have to implement the Unicode Bidirectional Algorithm in full. For example, an integrated development environment for programming languages could output all text from left to right and thus ignore the UBA completely. It is also possible to leave some of the explicit control characters unsupported. The changes in Unicode 6.3 mean that older programs no longer implement the algorithm correctly.

Examples

Example 1

At the top right it says: "דניאל (ראובן) קזין נפל בעמק הירדן ז׳ אייר תש״ח 16.5.1948 בן 22 במותו"

The following image caption is considered:

At the top right it says: "דניאל (ראובן) קזין נפל בעמק הירדן ז׳ אייר תש״ח 16.5.1948 בן 22 במותו"

The base writing direction of the paragraph is determined from the first character and runs from left to right; since there are no explicit control characters, all characters are provisionally assigned to level 0. First, the bidi classes of the individual characters are determined (second line of the following table; N stands for any neutral character).

Now the weak characters, i.e. the digits, are handled (third line). The periods directly between the digits are converted to type EN, i.e. treated like the digits themselves from then on. The nearest strong character preceding the digits is of type R, so the digits are left as they are.

Then it is the turn of the neutral characters (fourth line). Where L and R meet, the neutral characters are converted to L, just as at the end of the text, since this corresponds to the base writing direction. Since digits count as R when neutral characters are resolved, the neutral characters between R and EN are converted to R.

This finally yields the adjusted levels (fifth line). Starting from level 0, characters of type R are assigned to level 1 and characters of type EN to level 2.

A t ... s : " ד נ י א ל   ( ר א ו ב ן )   ק ... ח   1 6 . 5 . 1 9 4 8   ב ... ו "
L L ... L N N R R R R R N N R R R R R N N R ... R N EN EN CS EN CS EN EN EN EN N R ... R N
L L ... L N N R R R R R N N R R R R R N N R ... R N EN EN EN EN EN EN EN EN EN N R ... R N
L L ... L L L R R R R R R R R R R R R R R R ... R R EN EN EN EN EN EN EN EN EN R R ... R L
0 0 ... 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 2 2 2 2 2 2 2 2 2 1 1 ... 1 0

For display, the run at level 2 is reversed first:

A t ... s : " ד נ י א ל   ( ר א ו ב ן )   ק ... ח   8 4 9 1 . 5 . 6 1   ב ... ו "

Then the runs at levels 1 and 2 are reversed together:

A t ... s : " ו ... ב   1 6 . 5 . 1 9 4 8   ח ... ק   ) ן ב ו א ר (   ל א י נ ד "

Finally, the two brackets in the right-to-left text are replaced by their respective mirror images:

A t ... s : " ו ... ב   1 6 . 5 . 1 9 4 8   ח ... ק   ( ן ב ו א ר )   ל א י נ ד "

This gives the following display:

At the top right it says: "דניאל (ראובן) קזין נפל בעמק הירדן ז׳ אייר תש״ח 16.5.1948 בן 22 במותו"

If line breaks occur within the text, as in the caption, the text is broken into lines first and the reversal is applied afterwards.

Example 2

A biographical text about Reuven Rivlin could begin with:

Reuven Rivlin (ראובן ריבלין; * 1939 in Jerusalem) has been President of Israel since 2014.

This obviously leads to an incorrect display: the year of birth comes before the Hebrew spelling of the name instead of after it. To understand where the problem lies, one can step through the algorithm by hand.

As in the first example, the base writing direction of the paragraph runs from left to right, and the bidi classes of the individual characters are determined again (second line). Since the nearest strong character before the year of birth has the value R, the EN of these digits is not converted to L, unlike the other digits in this sentence (third line). Therefore, in the next step, in which the neutral characters are resolved, these digits are treated like R (fourth line). This yields the adjusted levels (fifth line). Since the characters between the Hebrew name and the year of birth are now at an odd level, the result is the incorrect display.

... R i v l i n   ( ר א ו ב ן   ר י ב ל י ן ;   *   1 9 3 9   i n ... e   2 0 1 4 .
... L L L L L L N N R R R R R N R R R R R R N N N N EN EN EN EN N L L ... L N EN EN EN EN CS
... L L L L L L N N R R R R R N R R R R R R N N N N EN EN EN EN N L L ... L N L L L L N
... L L L L L L L L R R R R R R R R R R R R R R R R EN EN EN EN L L L ... L L L L L L L
... 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 0 0 0 ... 0 0 0 0 0 0 0

To solve this problem, a left-to-right mark (LRM) is inserted immediately after the Hebrew name. The digits are then preceded by a character of type L instead of one of type R. Therefore the type of the digits is changed to L, and the characters between the name and the year are handled correctly. If the text is HTML, using the named entity &lrm; also gives the correct display in the source text, because there, too, the year is preceded by a character of type L (here the m).

Reuven Rivlin (ראובן ריבלין; * 1939 in Jerusalem) has been President of Israel since 2014.
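The effect of the inserted mark can be checked with a small sketch that looks for the nearest strong character before the year. The helper name preceding_strong is made up for this illustration; the sample strings contain only the Hebrew name and the year of birth.

```python
import unicodedata

def preceding_strong(text, index):
    """Return the bidi class of the nearest strong character before text[index]."""
    for ch in reversed(text[:index]):
        cls = unicodedata.bidirectional(ch)
        if cls in ("L", "R", "AL"):
            return cls
    return None

name = "\u05e8\u05d0\u05d5\u05d1\u05df \u05e8\u05d9\u05d1\u05dc\u05d9\u05df"
plain = "(" + name + "; * 1939"
fixed = "(" + name + "\u200e; * 1939"   # U+200E LEFT-TO-RIGHT MARK after the name

print(preceding_strong(plain, plain.index("1")))
print(preceding_strong(fixed, fixed.index("1")))
```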

Example 3

Screenshot of the example, once according to the current form of the algorithm (above, Firefox), once according to the old version (below, Google Chrome)

As an example of the new rule for bracket pairs introduced with Unicode 6.3, consider the following text, which gives a character and its code point:

Alif (sign: ا): 0627

The name Alif should be followed by the Arabic letter in brackets, followed by the code point. In browsers that still use the old algorithm, however, the output is illegible: both brackets look like opening brackets, and the order of the individual elements is chaotic.

Here, too, there is only one level, level 0. The bidi classes are given in the second line. Since the European digits are preceded by an Arabic letter, they are treated like Arabic digits (AN), and the Arabic letter is then assigned the value R (line 3). Under the old rule, the neutral characters between the Arabic letter and the digits would have been treated as R, the others as L (line 4). Under the new rule, however, the brackets are considered separately first. Their content includes left-to-right text, which matches the direction of the level, so both brackets are treated as L, and with them the remaining neutral characters (line 5). The resulting adjusted levels are given in lines 6 and 7.

A l i f   ( s i g n :   ا ) :   0 6 2 7
L L L L N N L L L L N N AL N N N EN EN EN EN
L L L L N N L L L L N N R N N N AN AN AN AN
L L L L L L L L L L L L R R R R AN AN AN AN (old)
L L L L L L L L L L L L R L L L AN AN AN AN (new)
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 2 2 2 2 (old)
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 2 2 2 (new)

Under the old algorithm, the successive reversals and the mirroring produce this result:

A l i f   ( s i g n :   ا ) :   7 2 6 0
A l i f   ( s i g n :   0 6 2 7   : ) ا
A l i f   ( s i g n :   0 6 2 7   : ( ا

With the new method, the desired result is obtained, and the brackets do not have to be mirrored:

A l i f   ( s i g n :   ا ) :   7 2 6 0
A l i f   ( s i g n :   ا ) :   0 6 2 7
