Unicode line break algorithm

from Wikipedia, the free encyclopedia

The Unicode Line Breaking Algorithm is an algorithm for line breaking published by the Unicode Consortium. It determines the positions at which a line break is mandatory, permitted, or prohibited.

History

The authors of the algorithm are Asmus Freytag and Andy Heninger. The first draft was published on May 19, 1998. As of September 2012, version 30 of the algorithm is available.

Scope of application

The algorithm does not perform the line breaking itself. It only identifies the positions at which a break is possible and leaves open how the actual breaks are chosen from this information. It does, however, mention two possible approaches to selecting break points: the simple procedure of always breaking at the latest possible position, and the more complex procedure implemented in TeX, which tries to optimize the breaks globally. Steps that follow breaking, such as adjusting word spacing in justified text, are also mentioned but are not part of the algorithm.
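The simple "latest possible position" procedure can be sketched as follows. This is a minimal illustration, not code from the standard: the function names are hypothetical, line width is measured in characters rather than rendered width, and the break opportunities come from a stand-in rule (break after spaces) instead of the full algorithm.

```python
def opportunities_after_spaces(text):
    """Hypothetical stand-in for the real algorithm: allow a break
    at every position that follows a space and precedes a non-space."""
    return [i for i in range(1, len(text))
            if text[i - 1] == " " and text[i] != " "]


def greedy_lines(text, opportunities, width):
    """Fill each line up to the latest break opportunity that still fits.

    `opportunities` is a sorted list of indices; index i means a break
    is permitted between text[i - 1] and text[i]. Trailing spaces are
    not counted toward the width and are dropped from the output.
    """
    lines = []
    candidates = list(opportunities) + [len(text)]  # end of text always ends a line
    start = 0
    i = 0
    while start < len(text):
        best = None
        # Advance over every candidate that still fits on the current line.
        while (i < len(candidates)
               and len(text[start:candidates[i]].rstrip(" ")) <= width):
            best = candidates[i]
            i += 1
        if best is None:
            # Not even the first segment fits: let it overflow.
            best = candidates[i]
            i += 1
        lines.append(text[start:best].rstrip(" "))
        start = best
    return lines


sample = "The quick brown fox jumps"
print(greedy_lines(sample, opportunities_after_spaces(sample), 10))
```

A globally optimizing approach such as the one in TeX would take the same list of opportunities as input but choose among them by minimizing a cost over the whole paragraph.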

Automatic hyphenation is also out of scope. It can, however, be achieved simply by inserting soft hyphens (U+00AD) at the appropriate places before the algorithm is applied.
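As a rough sketch of this preprocessing step, the following inserts soft hyphens into a word at given positions. The function name is hypothetical, and the hyphenation points are supplied by hand here; in practice they would come from a separate hyphenation dictionary or pattern engine, which is assumed and not shown.

```python
SOFT_HYPHEN = "\u00AD"  # invisible unless a line is actually broken here


def insert_soft_hyphens(word, points):
    """Return `word` with a soft hyphen inserted before each index in `points`."""
    pieces = []
    previous = 0
    for point in sorted(points):
        pieces.append(word[previous:point])
        previous = point
    pieces.append(word[previous:])
    return SOFT_HYPHEN.join(pieces)


# Hyphenation points for "al-go-rithm", supplied manually for illustration.
prepared = insert_soft_hyphens("algorithm", [2, 4])
print(repr(prepared))
```

The prepared text can then be passed to the line breaking algorithm, which treats each soft hyphen as a possible break position.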

The default behavior of the algorithm is designed mainly for Western writing systems, but it can be tailored through configuration. The algorithm cannot be used on its own for some scripts, such as Thai, that do not use spaces between words and need to be broken at syllable boundaries.

Higher-level protocols can influence the algorithm. In HTML, for example, the <pre> tag prevents line breaks, as does the corresponding CSS property white-space.

Algorithm

Each Unicode character is assigned a Line_Break property that describes its behavior in line breaking.

This property can take many possible values. Some of them cover large classes of characters, while others apply to a single character to handle a special case. For example, the line feed and the carriage return each have their own value, so that either character alone produces a line break, while the two in combination also produce only one break, which accommodates the different end-of-line conventions. Another character with its own class is the hyphen-minus (U+002D), which in most cases represents the orthographic hyphen (e.g. in word processing), but is interpreted as a minus sign in a numeric context (e.g. in a spreadsheet).
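The treatment of carriage return and line feed can be illustrated with a small sketch. It is a simplification that handles only these two characters (the full algorithm recognizes further characters that force a break), and the function name is hypothetical.

```python
def mandatory_break_positions(text):
    """Return the indices after which a line break is mandatory.

    CR alone and LF alone each force a break; the sequence CR LF
    forces only a single break, placed after the LF.
    """
    positions = []
    for i, ch in enumerate(text):
        if ch == "\n":
            positions.append(i + 1)
        elif ch == "\r":
            followed_by_lf = i + 1 < len(text) and text[i + 1] == "\n"
            if not followed_by_lf:
                positions.append(i + 1)
            # Otherwise: no break between CR and LF; the LF handles it.
    return positions


# Three line endings in three conventions: CR LF, LF, and CR.
print(mandatory_break_positions("a\r\nb\nc\rd"))
```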

Ordinary letters and digits belong to classes that prevent a break that would separate the character from another character of the same class. Classes that allow a break after the character include, among others, various spaces and hyphens. Punctuation marks that end a sentence or clause belong to classes that prevent a break before the character, as do combining characters. Punctuation marks that open a sentence or phrase, such as opening brackets or the inverted question and exclamation marks used in Spanish, belong instead to a class that prevents a break after the character.

Classes that prevent a break both before and after the character exist for non-breaking spaces and hyphens, and also for quotation marks. A break could in principle be allowed before opening and after closing quotation marks, but because these marks are used differently in different languages, it is much simpler to prohibit all breaks around them.

Conversely, Chinese characters, for example, allow a break both before and after the character.
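To make the classes described above more concrete, the following sketch assigns a class abbreviation to a handful of characters. Python's standard library does not expose the Line_Break property, so the function below is a hypothetical approximation: it hard-codes a few well-known characters and guesses the rest from the general category, which does not match the real property data in every case. A real implementation would read the values from the LineBreak.txt data file.

```python
import unicodedata

# A few characters with their class abbreviations, hard-coded for illustration.
SPECIAL_CASES = {
    "\u0020": "SP",  # space
    "\u002D": "HY",  # hyphen-minus, a class of its own
    "\u2010": "BA",  # hyphen: break allowed after
    "\u00AD": "BA",  # soft hyphen: break allowed after
    "\u00A0": "GL",  # no-break space: "glue", no break before or after
    "\u2011": "GL",  # non-breaking hyphen
    "\"": "QU",      # quotation mark: no break before or after
    "(": "OP",       # opening punctuation: no break after
    "\u00BF": "OP",  # inverted question mark
    "\u00A1": "OP",  # inverted exclamation mark
    "!": "EX",       # exclamation mark: no break before
    "?": "EX",
}


def approximate_line_break_class(ch):
    """Approximate the Line_Break class of a single character."""
    if ch in SPECIAL_CASES:
        return SPECIAL_CASES[ch]
    category = unicodedata.category(ch)
    if category.startswith("M"):
        return "CM"  # combining mark (approximation)
    if category == "Nd":
        return "NU"  # decimal digit (approximation)
    if "\u4E00" <= ch <= "\u9FFF":
        return "ID"  # CJK ideograph: break allowed before and after
    if category.startswith("L"):
        return "AL"  # ordinary letter (approximation)
    return "XX"      # unknown in this sketch


for ch in ["a", "7", "\u00A0", "\u4E2D", "\u0301", "\u00BF"]:
    print(f"U+{ord(ch):04X}", approximate_line_break_class(ch))
```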

Based on this property, the algorithm determines for each position whether a break is mandatory, permitted, or prohibited. The rules are tested in order, and the first one that matches is applied. Most rules specify where no break may occur; the final rule then allows a break at every position not excluded by an earlier rule.

The order in which the rules are tested is therefore crucial. For example, the rule that prohibits a break before an exclamation mark precedes the rule that allows a break after a space. This makes it possible to place a space before the exclamation mark, as is customary in French, without creating an unwanted break opportunity.
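The effect of rule ordering can be shown with a toy rule list in which the first matching rule wins. The four rules and all names below are a hypothetical, heavily reduced selection chosen to mirror the French example; they are not the standard's actual rule set. Classes are given as abbreviations: SP for space, EX for exclamation mark, AL for letter.

```python
PROHIBITED = "prohibited"
ALLOWED = "allowed"


def no_break_before_exclamation(before, after):
    return PROHIBITED if after == "EX" else None


def break_after_space(before, after):
    return ALLOWED if before == "SP" else None


def no_break_between_letters(before, after):
    return PROHIBITED if before == "AL" and after == "AL" else None


def break_everywhere_else(before, after):
    return ALLOWED  # final catch-all rule


RULES = [
    no_break_before_exclamation,  # must come before break_after_space
    break_after_space,
    no_break_between_letters,
    break_everywhere_else,
]


def decide(before, after, rules=RULES):
    """Return the verdict of the first rule that matches the class pair."""
    for rule in rules:
        verdict = rule(before, after)
        if verdict is not None:
            return verdict
    raise ValueError("the rule list needs a final catch-all rule")


# "Bonjour !": the position between the space and the exclamation mark.
print(decide("SP", "EX"))

# With the first two rules swapped, the same position becomes a break opportunity.
swapped = [break_after_space, no_break_before_exclamation,
           no_break_between_letters, break_everywhere_else]
print(decide("SP", "EX", swapped))
```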

Implementation

The algorithm does not prescribe any particular implementation, as long as the result is correct. Extensive test files are available for checking this. Because most rules consider only the two characters immediately before and after a potential break, and not the wider context, some implementations essentially look up in a table whether a break is allowed at a given position and examine the surrounding context only for the more complex rules.
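A table-driven implementation of this kind might look roughly like the sketch below. The table covers only six classes, its entries are a simplified illustration rather than data copied from the standard, and rules that need more context than one pair of classes (for example, anything involving runs of spaces) are deliberately left out.

```python
# Column order for the table rows below (class of the character after the position).
COLUMNS = ["AL", "BA", "EX", "GL", "ID", "OP"]

# Row key: class of the character before the position.
# "x" = break prohibited, "_" = break allowed.
PAIR_TABLE = {
    #      AL   BA   EX   GL   ID   OP
    "AL": ["x", "x", "x", "x", "_", "x"],
    "BA": ["_", "x", "x", "_", "_", "_"],
    "EX": ["_", "x", "x", "x", "_", "_"],
    "GL": ["x", "x", "x", "x", "x", "x"],
    "ID": ["_", "x", "x", "x", "_", "_"],
    "OP": ["x", "x", "x", "x", "x", "x"],
}


def break_allowed(before, after):
    """Look up a single pair of classes in the table."""
    return PAIR_TABLE[before][COLUMNS.index(after)] == "_"


def break_opportunities(classes):
    """Return the indices i where a break is allowed before classes[i]."""
    return [i for i in range(1, len(classes))
            if break_allowed(classes[i - 1], classes[i])]


# Classes for a text such as "ab" + hyphen (U+2010) + "c!".
print(break_opportunities(["AL", "AL", "BA", "AL", "EX"]))
```

The appeal of this design is that the common case costs a single lookup per position; only the rules that cannot be expressed as a pair of classes need extra code.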

Besides the implementation in the ICU project, which also includes some language-specific tailorings, many others exist, for example a Perl module.

Because the algorithm permits extensive tailoring of the default behavior, applications can deviate considerably from its recommendations without violating the standard. Internet Explorer, for example, follows the default rules very closely, while Mozilla Firefox deviates from them substantially.
