Unicode Collation Algorithm

from Wikipedia, the free encyclopedia

The Unicode Collation Algorithm (short UCA ) is the Unicode Consortium published algorithm for strings from Unicode to compare character and so be alphabetical . It is deliberately kept so open that language-specific features and special user requests can be taken into account. A table with standard values ​​for sorting is also available for the algorithm, the Default Unicode Collation Element Table (DUCET). In addition, the Common Locale Data Repository provides corresponding tables for many other languages ​​that are used, for example, in the implementation of the ICU project.

conditions

The large number of characters that are encoded in Unicode makes it difficult to sort strings. For example, it is not clear whether words made from Greek letters should precede those made from Cyrillic letters or vice versa. Also, the order in which the individual characters are coded does not always correspond to the desired sorting order.

In different languages ​​there are different ideas about the order of individual letters. For example, in German lexicons, words with å are sorted as if there were an ordinary a, while in Scandinavian languages the å is a separate letter that comes after the z. Such differences can even occur within a language, for example in German, where the ä is sometimes treated like an a, sometimes like an ae.

It also happens that it is not easy to sort by individual characters. In traditional Spanish , the digraph ch is treated as a separate letter that comes between c and d.

Furthermore, depending on the application, certain additional requirements can be made, for example St. like Santa could be sorted.

history

The authors of the algorithm are Mark Davis and Ken Whistler. The first version was released on March 30, 1997. As of September 2012, version 26 of the algorithm is available. With ISO 14651 there is a similar algorithm of the ISO , which however offers fewer possibilities. With newer versions, more and more configuration options were added, such as the ability to sort numbers numerically.

algorithm

To compare two strings, the algorithm proceeds in different steps and compares the two strings on different levels. The number and meaning of the levels can in principle be freely selected, but the standard are three levels with the following meanings:

Level 1: basic letters

On the first level the character strings are compared according to their basic letters. Accents, upper and lower case, punctuation marks and the like are usually ignored. So on this level the words garbage and garbage are considered the same, but the word mule comes before them.

Level 2: accents

If the words match in the basic letters, the next step is to compare the accents. In almost all languages, the first difference is searched for from left to right and then sorted according to a fixed order: letters without an accent first, the other accents in a fixed order. This results in the order cote - coté - côte - côté. Canadian French is an exception: traditionally, the last difference is sorted here: cote - côte - coté - côté.

Level 3: upper and lower case

If the words also match in the accents, then uppercase and lowercase letters are used, whereby the lowercase letters are usually sorted before the uppercase letters.

More levels

If necessary, further levels can follow to enable an even finer differentiation. Often the conclusion is sorted according to the individual code points. This ensures that two different strings are always sorted in the same order.

Sorting weights

In order to compare character strings, the algorithm supplies a binary sort key for a character string made up of Unicode characters. This key is then used in the actual sorting algorithm for the comparison. To determine the sort key, a table is used that lists binary weights for individual characters or character combinations for all levels, a so-called collation element table . Several weights per level can be assigned to an entry. A weight of 0000means that the corresponding character should be ignored at this level. A weight must be calculated for characters that are not listed in the table. For the standard table , this is only the case with CJKV characters ; an algorithm is also given for calculating the weights.

First, the character string is broken down into pieces as long as possible, which have an entry in the table, and the associated weights are read from the table. These weights are initially attached to each other for each level, 0000omitting them. For a Canadian-French sort, the order in level 2 is reversed. The keys for the individual levels are finally 0000attached to one another to form a single sorting key.

Examples

Most of these examples for entries in a collation element table are taken from DUCET version 6.1.0. The weights of the first three levels are given here as hexadecimal numbers . The weights are separated by periods and enclosed in square brackets.

Simple letters

character Weights annotation
a [15D4.0020.0002] 15D4 is the weight for the basic letter a.
A. [15D4.0020.0008] The capital A differs from the small a on the third level.
b [15EA.0020.0002] 15EA is the weight for the basic letter b, this comes after a (with some space for user-specific additions).
c [1602.0020.0002]
z [187A.0020.0002]
α [190E.0020.0002] The letters of the Latin alphabet are followed by those of the other alphabets, such as the Greek.

Accented letters, ligatures and combinations of letters

Letters with diacritical marks are broken down into a basic character and the following combining characters .

character Weights annotation
à [15D4.0020.0002], [0000.0035.0002] After the weight for a comes the weight for the grave accent , which is not taken into account on level 1 ( 0000).
å [15D4.0020.0002], [0000.0043.0002] Usually the å is seen as a variant of the a.
å [187B.0020.0002] In Swedish a different weight is used, here follows the å as a separate letter after the z.
Ä [15D4.0020.0002], [0000.0047.0002]
æ [15D4.0020.0004], [0000.0139.0004], [1631.0020.001F] æ is treated like ae on the first level.
ch [1603.0020.0002] In traditional Spanish , ch is treated like a letter that comes after c.

Control characters, symbols and digits

character Weights annotation
LRM [0000.0000.0000] Control characters such as B. the left-to-right characters are completely ignored.
$ [15A4.0020.0002]
[15BC.0020.0002]
1 [15CB.0020.0002]
2 [15CC.0020.0002]
² [15CC.0020.0014] Like capital letters, superscript digits only differ from the basic digit on the third level.
9 [15D3.0020.0002]

Punctuation marks and spaces

For punctuation marks and spaces, there are various options for choosing the weights.

In many implementations, such as PHP , these characters are weighted as shown in the table.

character Weights annotation
ordinary space [020A.0020.0002]
non-breaking space [020A.0020.001B] The non-breaking space is only distinguished from the usual at level 3.
Hyphen-minus (-) [020E.0020.0002]
Semicolon (-) [0216.0020.0002]
Comma (,) [0221.0020.0002]
Colon (:) [0237.0020.0002]
Exclamation mark (!) [025E.0020.0002]
Period (.) [0273.0020.0002]
Apostrophe (') [02EA.0020.0002]
Apostrophe (') [02EC.0020.0002]
Quotation marks (") [02F1.0020.0002]
Quotation marks (") [02F2.0020.0002]
Quotation marks (") [02F4.0020.0002]
opening bracket (() [02FB.0020.0002]
closing bracket ()) [02FC.0020.0002]
Slash (/) [0372.0020.0002]

The original plan was not to consider these characters as standard behavior at all, i.e. [0000.0000.0000]to give them weight like control characters .

In the current version of the algorithm, something similar is required as the standard behavior: Here, too, the weight is [0000.0000.0000]selected, but a fourth level is added in which the value actually specified for level 1 is used as the weight, while other characters use the value for this level get the highest possible value FFFF.

There are also other variants that the user can choose between.

Adaptation

The standard table can be customized in many ways:

  • Language-specific changes can be made to the individual weights and additional character combinations with weights can be added. Appropriately adapted tables are already available for many languages. There is a separate syntax for user-defined adjustments to be specified, which can be translated into a table with appropriate sorting weights by suitable programs.
  • If necessary, you can specify that sorting should be carried out on the second level from the end of the character string, as is customary in Canadian French. In principle, this is also possible for other levels, even if it does not make sense.
  • You can choose the number of levels at which the comparison should take place. This number is known as strength. The default is 3, but if the table gives weights for more levels, a larger number can be chosen. However, a smaller number can also be selected if a short sort key has priority over a detailed sort.
  • For certain characters (mostly spaces and punctuation marks) you can choose between different variants for the weights.

The standard describes many other options.

variants

There are various methods of changing the algorithm, for example to obtain shorter binary keys. So it is possible to do without the separators 0000, as long as the weights decrease from level to level. There are also other variants that lead to greater savings.

example

The terms Nina , Nino , NINO , Niño and Ninu are to be put in alphabetical order. They are broken down into individual letters, their weights are determined and the sorting key is then put together. In the key, the part that belongs to the first level is highlighted in blue, the second level is green, and the third is yellow.

Nina comes first, the word already differs on the first level from the following word Nino (underlining). This in turn corresponds to NINO on the first two levels; a difference only results in the third level. With the next word Niño, the key is longer than with the other words because of the tilde in the second and third levels; this tilde also provides the first difference to the previous word. Last is Ninu, which again differs from the preceding words on the first level.

If one were to sort these words according to code points , the sequence NINO - Nina - Nino - Ninu - Niño would result instead.

word disassembled Sorting key
Nina N i n a 1734.16B2.1734.15D4.0000.0020.0020.0020.0020.0000.0008.0002.0002.0002
[1734.0020.0008] [16B2.0020.0002] [1734.0020.0002] [15D4.0020.0002]
Nino N i n O 1734.16B2.1734.1756.0000.0020.0020.0020.0020.0000.0008.0002.0002.0002
[1734.0020.0008] [16B2.0020.0002] [1734.0020.0002] [1756.0020.0002]
NINO N I. N O 1734.16B2.1734.1756.0000.0020.0020.0020.0020.0000.0008.0008.0008.0008
[1734.0020.0008] [16B2.0020.0008] [1734.0020.0008] [1756.0020.0008]
Niño N i ñ O 1734.16B2.1734.1756.0000.0020.0020.0020.004E.0020.0000.0008.0002.0002.0002.0002
[1734.0020.0008] [16B2.0020.0002] [1734.0020.0002], [0000.004E.0002] [1756.0020.0002]
Ninu N i n u 1734.16B2.1734.181B.0000.0020.0020.0020.0020.0000.0008.0002.0002.0002
[1734.0020.0008] [16B2.0020.0002] [1734.0020.0002] [181B.0020.0002]

Search algorithm

Parts of the algorithm can also be used for text searches , for example if a search for ss should also find words with ß. In this case, a match should be recognized when the search word and the possible hit match on the first level.

Web links

Individual evidence

  1. ^ First revision of the UCA
  2. Unicode FAQ: Collation What are the differences between the UCA and ISO 14651?
  3. allkeys.txt , version 6.1.0
  4. PHP manual : The Collator class
  5. UCA, Version 1 , Section Variable Collation Elements