Lexicographical order

from Wikipedia, the free encyclopedia

The lexicographical ordering is a method of deriving a linear ordering for simple objects, for example alphabetically arranged letters, into a linear ordering for compound objects, for example words composed of letters. The naming example is the arrangement of the words in a lexicon: They are first sorted according to their first letter, then the words with the same first letter after the second letter, etc. If a word is contained entirely in another part than the first part (such as "Tal" in "Talent"), the shorter word is listed first.

The lexicographical order above the standard alphabet is generally taken for granted in such a way that one speaks briefly and simply of the alphabetical order . A search term can be found extremely quickly in a very large number of search terms if it is sorted (according to a total order relation).

definition

A quasi-ordered alphabet is given , i.e. i. a lot of signs with the relation . A string of characters of this alphabet is lexicographically smaller than a string , that is located in the sort before , when the component-wise comparison character by character

  1. the character of with the lowest index , in which the two strings differ, is (real) smaller than the corresponding character of , i.e. ,
  2. or if a start of (i.e. for all available ) but is shorter.

Usually the same comparison symbol (or for the associated strict order ) is used for the composite relation as in the alphabet , since the latter relation is a restriction of the former and thus contradictions cannot arise.

The lexicographical composition turns a quasi-order in the alphabet into a quasi-order on the set of strings, from a total quasi-order again into a total quasi-order, from a partial order into a partial order and from a total order into a total order (the most common case).

A special case of this combination is the lexicographical ordering of sequences of a fixed finite length . Then the 2nd requirement is not required. For example, an ordered pair is lexicographically smaller than a pair if

  • either
  • or and

applies.

Examples

An example of such an order is the time sequence for triples of numbers (year, month, day): A date X is earlier than another date Y if

  • either the year of X is less than the year of Y
  • or the years are the same, but X is in a month earlier in the year
  • or the years and months are the same but the day of X is less than the day of Y.

Another example is the usual ranking within a medal table , in which the first criterion is the number of gold medals , if the number of gold medals is the same, the number of silver medals and if there is another tie, the number of bronze medals:

country gold Gold medal-2008OB.svg silver Silver medal-2008OB.svg bronze Bronze medal-2008OB.svg
Country 1 10 5 7th
Country 2 8th 7th 4th
Country 3 8th 5 7th
Country 4 5 3 7th
Country 5 5 3 2

Mathematical use

Infinite consequences

The lexicographical order can be continued to infinite sequences: A sequence is lexicographically smaller than a sequence if both sequences are the same before an index , but is. If the alphabet consists e.g. B. from the digits , the sequence can be interpreted as a decimal fraction that represents a real number between and . The lexicographical order of the sequences essentially corresponds to the real order in the interval . However, the ( countable infinitely many) breaking decimal fractions , which therefore only have digits or at their ends, have two lexicographically different archetypes, for example , but lexicographically .

Further generalization

The principle can be extended further to sequences in which the index range is any well-ordered set . In this case one defines for functions (whereby it is linearly ordered), if applies to the minimal element of the domain for which and differ . The resulting order on the functions is again linearly ordered.

Application: chains in the power set of an ordinal number

In set theory, the special case is often considered in which the index set is an ordinal number and the sequence members only take the values or . This basic set is denoted by and it has a bijective relationship to the power set of . An ordinal number is always seen as the set of its predecessor ordinals. The function for if and if can be assigned to a subset of . Conversely, one comes from a function with the set again to a subset of . We now consider with the lexicographical order as defined above. This linear order can be used for combinatorial questions about infinite cardinal numbers . The following applies:

For every well-ordered subset of true .

For the proof by induction we assume that the statement is already given for all ordinal numbers . If so we consider the restrictions of the functions on the subset . The sets are then well-ordered subsets of the lexicographically ordered sets . From the induction hypothesis it follows that . Now we take an f again in the well-ordered set and also consider the direct successor . We define as the smallest with . Then applies to always as well as and . Two functions and in with have to differ from below . Let us assume that . Then , , and . It follows that in the lexicographical order also and applies and thus and so . The quantities for a given one are therefore mapped to a subsets of by restricting to injective and thus have only one thickness . But it has been proven.

Use in computer science

The main memory of a computer knows a smallest addressable unit , also called a "memory location". An example is the byte consisting of 8 bits , but it can also consist of a different number of bits or, if the machine calculates in the decimal system , it can accommodate a decimal digit.

The contents of two storage locations (bytes) on the lowest machine level are expediently always comparable with one another (in the sense of a total order). Appropriately, the numbers are resp. Letters are assigned to the bit combinations of a byte in such a way that this order with the usual order in the number system, respectively. Alphabet matches. Based on this basic component of a comparison, data types composed by lexicographical composition, for example multi-digit character strings, can be compared with one another.

If the lexicographic indexing correlates with the memory addresses, i.e. if the higher-ranking byte in the comparison has the lower address, then the comparison is done in big-endian style, and in little-endian style if the higher-ranking byte has the higher address. Since, in the best case, the lexicographical comparison is already decided in the first, highest-ranking byte, it is faster if this first byte is in direct access.

Comparison of long numerical data

On many newer machines, numeric data types of fixed length (hardware-based) are stored in little-endian format. (For the motivation and the machine types see the article byte order .) For these short aggregates (mostly 2, 4, 8 or 16 bytes long) there are corresponding machine instructions for the comparisons. These instructions are not composed so that the lexicographical principle does not apply.

The high-precision arithmetic supports arithmetic with integers of any length, a length that is only limited by the available memory. The numerical comparison, as will the lexicographical both operands at the highest-end. The rank of a position is not determined by its distance from it, but by its distance from the lower end, the "ones position". Before the comparison is made, the lengths of the two operands must be matched by padding them with leading zeros. Then (for numerical and lexicographic comparison) the same ranks are compared in both operands.

A selection of long number arithmetic implementations can be found in the section Long number arithmetic # programming languages . Compared to the four basic arithmetic operations, comparing in these software packages is more of a waste product. As with division, the processing direction is from high-ranking to low-ranking, but reversed for addition, subtraction and multiplication. Both types of storage, big and little endian, can be well supported in these five operations, provided that both ends of the chain can be accessed efficiently.

Use with bit chains

Conceptually, bit strings are nothing more than strings of characters over the two-digit alphabet . If a bit number is stored in (at least) one byte, the implementation (and the lexicographical comparison) corresponds to the usual character strings .

In a more compact implementation with 1 bit per digit, for example 8 bit digits per byte or 32 per word , it can happen that since all words contain exactly the same number of bits and a chain can have any non-negative length unspecified bits occur in the last (lower-ranking) word, so-called filler bits .

As with the character strings, the lexicographical comparison of bit strings starts at the beginning of the chain, with the highest-ranking component (this is big-endian style and is independent of the endianness of the machine). On big-endian machines, the highest-ranking word has the lowest index and the comparison goes to the higher addresses as with the strings. On little-endian machines, however, if the comparison instructions offered by the machine are to be used (per word), the components must be arranged in exactly the opposite way and the highest-ranking word must have the highest index: the lexicographical comparison must begin there and continue to the bits of lower rank (and lower address).

Use with strings

With the character string data type , the highest-ranking component on every machine type is the (first) with the lowest address. Character strings are also stored in big-endian style on little-endian machines.

There is quite good support for the lexicographical comparison of strings in C , C ++ with:

  • strcmp, lexicographical comparison taking into account different lengths in the sense of the 2nd requirement , but restriction to one character set (alphabet) without the terminatornull .
  • memcmp, lexicographical comparison with the same (byte) lengths of the two character strings, full character set.
  • wcscmp, Lexicographical comparison of 16-bit wide string s with the basic type wchar_tunder consideration of different lengths in the sense of 2. proviso , but limited to a 2-byte character set (alphabet) without the terminator (wide) null.

Application in microeconomics

→ See also: preference relation

Either by using the bundle of goods / alternative thereto and with the alternative ( is accordingly, for example, the amount of good 2 in the bundle of goods ). A preference-indifference relation R is called lexicographical, if then and only if, either or and at the same time . In other words, in a lexicographical preference-indifference relation, a bundle of goods is only weakly preferred to a second (i.e. viewed as at least as good as this one) if it contains more units of the first good or, alternatively, if both bundles of goods have the same number of units of this good if it contains more units of the second good.

Properties of the lexicographical order of preferences:

  1. A lexicographical order of preferences is complete, asymmetric (and consequently also antisymmetric), negatively transitive and transitive. (For a definition of the properties, see the article Preference Relation.)
  2. A lexicographical order of preferences cannot be represented by a utility function . (Debreu 1959)

See also

literature

Remarks

  1. As a rule, there will usually be a total order . A quasi-order is only the minimum requirement here.
  2. D. h. set theoretic is a " word " of the Kleene shell of the alphabet , as well as .
  3. The countable infinite sequences of characters from the alphabet are denoted by, see: = .
  4. Assuming that leading zeros are suppressed, the (unsigned) numeric order corresponds to the quasi-lexicographic (in English quasi-lexicographic , radix , length-plus-lexicographic or shortlex order ).
  5. The conversion of a bit string into a binary long number (with the lowest-ranking bit in the ones place) requires a right shift by the number of filler bits on the data side . Otherwise only the description has to be changed. As with binary long numbers, shifts and logical links can be defined for bit strings ; plus the chain operations such as chaining , reversing and forming partial chains.
  6. strcmp . en.cppreference.com. Retrieved March 31, 2017.
  7. memcmp . en.cppreference.com. Retrieved March 31, 2017.
  8. On many binary-oriented big-endian machines, a character string can also be unsignedinterpreted as an unsigned binary number. The associated numerical comparison, however, arranges the content in a slightly different way than the lexicographical one of the character strings (see #Comparison of long numerical data ).
  9. wcscmp . en.cppreference.com. Retrieved March 31, 2017.
  10. ^ The C99 standard draft + TC3 . Retrieved March 31, 2017.
  11. See Mas-Colell / Whinston / Green 1995, p. 46.
  12. See Moore 2007, p. 14.
  13. ^ Gerard Debreu: Theory of value. An axiomatic analysis of economic equilibrium. John Wiley & Sons, New York 1959.