Unicode segmentation algorithm

The Unicode segmentation algorithms are a group of algorithms published by the Unicode Consortium for breaking text into segments such as words. The algorithms find the points in a text at which it can be divided into segments; word processors, for example, can use them to control cursor movement, word selection and the like. The algorithms are deliberately kept general so that they work well in as many languages as possible. As a result, the boundaries found do not always match expectations, and the algorithms may need to be tailored accordingly. The Common Locale Data Repository offers a number of language-specific tailorings.

History

The author of Unicode Standard Annex #29, which describes the various segmentation algorithms, is Mark Davis. The first draft was published on March 11, 2001, and the first version recognized as a standard followed on April 17, 2003. As of November 2012, the annex is at version 21.

Characters

Characters in the sense of code points do not always correspond to what the user perceives as characters, the graphemes. This is most noticeable with combining characters: a letter followed by a combining diacritical mark is perceived as a single character, as are Korean syllable blocks and the characters of Indic scripts, which in Unicode can likewise be composed of several code points.
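The gap between code points and user-perceived characters can be seen with a few lines of Python. This is only an illustration using the standard `unicodedata` module; the sample strings are chosen here and do not come from the standard.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent (U+0301).
decomposed = "e\u0301"
# The same user-perceived character as a single precomposed code point.
precomposed = unicodedata.normalize("NFC", decomposed)

# A Korean syllable block (U+D55C) and its decomposition into conjoining jamo.
syllable = "\ud55c"
jamo = unicodedata.normalize("NFD", syllable)

# len() counts code points, not graphemes.
print(len(decomposed), len(precomposed))  # 2 1
print(len(syllable), len(jamo))           # 1 3
```

In each pair a reader sees one character, yet the number of code points differs.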

The standard describes two different algorithms for breaking a text into individual graphemes, one of which exists primarily for backward compatibility.

Both algorithms make use of the Grapheme_Cluster_Break property. A series of rules determines, step by step, at which points a grapheme ends or does not end. Each rule names combinations of Grapheme_Cluster_Break values that two consecutive characters can have and states whether there is a grapheme boundary between the two characters in that case.

An implementation can take various forms, such as a lookup table or a regular expression.
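The pairwise approach can be sketched as follows. This is a deliberately reduced illustration, not the algorithm of the standard: Python's standard library does not expose Grapheme_Cluster_Break, so the hypothetical helper `gcb_class` approximates a handful of its values from the general category, and Hangul, regional indicators and emoji sequences are not handled at all.

```python
import unicodedata

def gcb_class(ch):
    """Rough stand-in for Grapheme_Cluster_Break (an assumption, not the real property)."""
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if ch == "\u200d":
        return "ZWJ"
    category = unicodedata.category(ch)
    if category in ("Mn", "Me", "Mc"):  # combining marks
        return "Extend"
    if category == "Cc":                # remaining control characters
        return "Control"
    return "Other"

def is_boundary(prev, nxt):
    """Decide from two adjacent characters whether a grapheme boundary lies between them."""
    a, b = gcb_class(prev), gcb_class(nxt)
    if (a, b) == ("CR", "LF"):
        return False                    # keep CR LF together
    if a in ("CR", "LF", "Control") or b in ("CR", "LF", "Control"):
        return True                     # always break around controls
    if b in ("Extend", "ZWJ"):
        return False                    # never break before a combining mark or ZWJ
    return True                         # otherwise break everywhere

def graphemes(text):
    """Split text into (approximate) grapheme clusters."""
    clusters, start = [], 0
    for i in range(1, len(text)):
        if is_boundary(text[i - 1], text[i]):
            clusters.append(text[start:i])
            start = i
    if text:
        clusters.append(text[start:])
    return clusters

print(graphemes("e\u0301a"))  # two clusters: "e" + accent, then "a"
```

A table-driven implementation would replace the chain of `if` statements with a lookup keyed by the pair of property values.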

The original algorithm used the two properties Grapheme_Base and Grapheme_Extend. The first was absorbed into the Grapheme_Cluster_Break property; the second proved impractical and is no longer used.

Words

The algorithm for determining word boundaries proceeds in a similar manner. The Word_Break property determines how a character behaves when the text is broken into words. A series of rules determines for which combinations of values a word boundary is recognized. In some cases, however, the rules take into account not just two adjacent characters but a larger context.

In many cases, the algorithm has to be tailored to the language in use. This applies especially to languages that are written without spaces. Whether and when hyphens and apostrophes should act as word separators is also a difficult question.

Breaking the text at the word boundaries found yields not only individual words but also numbers and individual punctuation marks. Depending on the application, for example when counting the words in a text, all segments found by the algorithm that do not contain any letters must be filtered out.
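The filtering step can be sketched in a few lines. The function name `count_words` and the sample segment list are hypothetical; the list stands in for the output of a word-boundary pass and was written by hand for this illustration.

```python
def count_words(segments):
    """Count only the segments that contain at least one letter."""
    return sum(1 for seg in segments if any(ch.isalpha() for ch in seg))

# Hand-written segments for the text: Call 555 now, OK?
segments = ["Call", " ", "555", " ", "now", ",", " ", "OK", "?"]
print(count_words(segments))  # 3
```

The number, the punctuation marks and the spaces are dropped, leaving three words.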

Example

The following sentence serves as an example application of the algorithm:

The "fast" runner needed 2.5 minutes for the route.

The algorithm first places a word boundary at the beginning and at the end of the text. There is no word boundary between two letters, nor between digits and characters that can appear within numbers, such as the decimal point in this case. The algorithm finds a word boundary at all other positions, so each space also forms a segment of its own. Leaving out the segments that consist only of a space, the sentence is broken down as follows:

The " fast " runner needed 2.5 minutes for the route .
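The breakdown of the example sentence can be reproduced with a much reduced sketch of the word rules. It is not the full algorithm of the standard: the hypothetical helper `word_class` only approximates a few Word_Break values from Python's built-in character tests, and rules for combining marks, Katakana, Hebrew, emoji and runs of spaces are left out. It does show how some rules look one character beyond the adjacent pair.

```python
import unicodedata

def word_class(ch):
    """Rough stand-in for Word_Break (an assumption, not the real property)."""
    if ch.isalpha():
        return "ALetter"
    if unicodedata.category(ch) == "Nd":
        return "Numeric"
    if ch in ".'":
        return "MidNumLet"
    if ch in ",;":
        return "MidNum"
    if ch == ":":
        return "MidLetter"
    return "Other"

def split_words(text):
    """Split text at (approximate) word boundaries; spaces become segments too."""
    cls = [word_class(c) for c in text]
    n = len(text)

    def get(i):
        return cls[i] if 0 <= i < n else None

    segments, start = [], 0
    for i in range(1, n):
        before2, before, after, after2 = get(i - 2), get(i - 1), get(i), get(i + 1)
        # No break inside runs of letters and digits.
        if before in ("ALetter", "Numeric") and after in ("ALetter", "Numeric"):
            continue
        # No break around a mid-word character that sits between two letters.
        if before == "ALetter" and after in ("MidLetter", "MidNumLet") and after2 == "ALetter":
            continue
        if before2 == "ALetter" and before in ("MidLetter", "MidNumLet") and after == "ALetter":
            continue
        # No break around a separator that sits between two digits, as in 2.5.
        if before == "Numeric" and after in ("MidNum", "MidNumLet") and after2 == "Numeric":
            continue
        if before2 == "Numeric" and before in ("MidNum", "MidNumLet") and after == "Numeric":
            continue
        # Otherwise there is a boundary at this position.
        segments.append(text[start:i])
        start = i
    if text:
        segments.append(text[start:])
    return segments

sentence = 'The "fast" runner needed 2.5 minutes for the route.'
visible = [s for s in split_words(sentence) if not s.isspace()]
print(" ".join(visible))
```

The period inside 2.5 stays attached because digits stand on both sides of it, while the period after "route" has nothing following it and becomes a segment of its own.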

Sentences

Breaking a text into sentences works in the same way. Here too there is a property, Sentence_Break, which a series of rules uses to determine sentence boundaries. The main difficulty lies in deciding whether a period belongs to an abbreviation or ends the sentence. Punctuation in direct speech is also difficult in some languages.
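The abbreviation problem is easy to reproduce with a naive heuristic. The sketch below does not implement the Sentence_Break rules; it simply splits after a terminator that is followed by whitespace and an ASCII capital letter, a rule chosen here for illustration. The function name `split_sentences` is hypothetical, and the whitespace between sentences is dropped.

```python
import re

# Split after ., ! or ? when whitespace and an uppercase ASCII letter follow.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Naive sentence splitter, for illustration only."""
    return _BOUNDARY.split(text)

# A lowercase continuation keeps the abbreviation inside its sentence.
print(split_sentences("It costs approx. five euros. Cheap."))
# An abbreviation before a capitalized name is wrongly treated as a sentence end.
print(split_sentences("Mr. Smith arrived. He sat down."))
```

The second call returns three pieces, with "Mr." split off on its own, which is the kind of result that language-specific tailoring, for example a list of known abbreviations, is meant to prevent.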

Lines

The places at which a text can be broken into lines are determined by the Unicode line breaking algorithm, which is defined separately.

References

Version 1 of Unicode Standard Annex #29: Unicode Text Segmentation
Version 4 of Unicode Standard Annex #29: Unicode Text Segmentation
Version 21 of Unicode Standard Annex #29: Unicode Text Segmentation

Web links

Official formulation of the algorithms
Demonstration of the algorithms
Boundary Analysis in the ICU User Guide
