Sequitur

Sequitur is an algorithm for lossless data compression , which was described in 1997 in the work "Identifying hierarchical structure in sequences: A linear-time algorithm" by Craig Nevill-Manning and Ian Witten from the University of Waikato ( New Zealand ).

description

Sequitur replaces repeated strings in character strings (e.g. texts) with the help of grammatical rules. This process is performed recursively . As a result, Sequitur provides a hierarchical representation of the original sequence, which gives insight into its lexical structure. The scope of the grammar is reduced and the sequence is structured as a “by-product”. The advantage of Sequitur is the iterative approach.

Sequitur generates one of several possible context-free grammars for a given string. This grammar does not necessarily have to be the optimal one for the purpose of compression. In addition, Sequitur can be very memory-intensive, which is why this algorithm is not as suitable for compression as other common methods.

functionality

Using the Sequitur algorithm is a string in a context-free grammar with converted. The grammar is built up step by step. For this, it is read in character by character. If a partial sequence occurs twice, this partial sequence is replaced by a variable that is entered in a dictionary (as a rule of grammar). In the course of the construction of the grammar, not only partial sequences of the original character string that occur several times, but also partial sequences consisting of variables (in whole or in part) can be replaced by variables, thus removing redundancies . ${\ displaystyle z \ in \ Sigma ^ {*}}$ ${\ displaystyle G}$ ${\ displaystyle L (G) = \ {z \}}$ ${\ displaystyle G}$ ${\ displaystyle z}$ ${\ displaystyle z}$

As a result, the Sequitur algorithm delivers a grammar that represents the input string with fewer symbols. The compression rate depends on the coding of the result (the grammar). For example, arithmetic coding is used for this.

Strings or substrings are called " n-grams ". An “n-gram” of length 2 is called a “bigram” or “digram”.

The following mandatory rules of the algorithm generate the hierarchical structure of the result string already mentioned:

Digram uniqueness : Each digram in the string to be compressed may occur at most once. Otherwise, it is replaced by a (existing or newly created) variable.

Variable usefulness : Every variable that replaces a substring of the original character string must be used at least twice.

Generation of a Sequitur grammar

It starts with (see: formal grammar ). ${\ displaystyle G: = (\ {S \}, \ Sigma, S \ rightarrow \ varepsilon, S)}$

The symbols from are added one after the other to the right-hand side of the associated production (= string of characters read so far).

{\ displaystyle z}

{\ displaystyle S}

If step 2 turns two symbols into neighbors, a digram is created . There are two options for this digram: ${\ displaystyle \ alpha, \ beta \ in V \ cup \ Sigma}$ ${\ displaystyle \ alpha \ beta}$

${\ displaystyle \ alpha \ beta}$ is clear in the resulting grammar.

${\ displaystyle \ alpha \ beta}$ occurs exactly one more time without overlap. In case 2 there are again two cases:

${\ displaystyle \ alpha \ beta}$ is right side of a production (i.e. there is already a dictionary entry for ). Replace newly created variables with existing ones . Jump to step 4, then step 3 again. ${\ displaystyle X}$ ${\ displaystyle \ alpha \ beta}$ ${\ displaystyle \ alpha \ beta}$ ${\ displaystyle X}$

${\ displaystyle \ alpha \ beta}$ is not right side of a production. Add a new rule to the productions and replace both occurrences of with the new variable . Jump to step 4, then step 3 again. ${\ displaystyle X \ rightarrow \ alpha \ beta}$ ${\ displaystyle \ alpha \ beta}$ ${\ displaystyle X}$

If a digram is replaced by a variable, there are two possibilities again: ${\ displaystyle \ alpha \ beta}$

${\ displaystyle \ alpha \ in \ Sigma \ land \ beta \ in \ Sigma}$ ( and do not contain any variables) ${\ displaystyle \ alpha}$ ${\ displaystyle \ beta}$

${\ displaystyle \ alpha \ in V \ lor \ beta \ in V}$ ( or or and already contain variables). Differentiate two cases for all variables (variables that are contained in or ): ${\ displaystyle \ alpha}$ ${\ displaystyle \ beta}$ ${\ displaystyle \ alpha}$ ${\ displaystyle \ beta}$ ${\ displaystyle \ gamma \ in \ {\ alpha, \ beta \}}$ ${\ displaystyle \ alpha}$ ${\ displaystyle \ beta}$

Variable occurs in other places. ${\ displaystyle \ gamma}$

Otherwise the variable will no longer appear. Remove the rule from the productions and replace the only occurrence of with the right-hand side (i.e. with the dictionary entry for the corresponding variable). Skip to step 3. ${\ displaystyle \ gamma}$ ${\ displaystyle \ gamma \ rightarrow r \ gamma}$ ${\ displaystyle \ gamma}$ ${\ displaystyle r \ gamma}$

The places highlighted in yellow symbolize replacement processes and therefore always cause step 4 to be called. If a previously used variable no longer occurs as a result of one of these processes, the required variable usability must be restored.

All three places marked in color cause recursive calls to step 3, since a replacement process always results in new digrams being created. It must therefore always be checked whether the required digram uniqueness is still given, or this must be restored if necessary.

example

The following table shows a simple example of how the Sequitur algorithm works. In the example, the input string "abcdbcabcd" is compressed without loss using the Sequitur algorithm. It is shown step by step how the input string is run through and the new grammar is generated.

number	Previous string	grammar	annotation
1	a	S = a
2	from	S = from
3	ABC	S = abc
4th	abcd	S = abcd
5	abcdb	S = abcdb
6th	abcdbc	S = abcdbc	bc double
	abcdbc	S = aAdA A = bc	Establish digram uniqueness
7th	abcdbca	S = aAdAa A = bc
8th	abcdbcab	S = aAdAab A = bc
9	abcdbcabc	S = aAdAabc A = bc	bc double
		S = aAdAaA A = bc	Establish digram uniqueness aA double
		S = BdAB A = bc B = aA	Establish digram uniqueness
10	abcdbcabcd	S = BdABd A = bc B = aA	Bd double
		S = CAC A = bc B = aA C = Bd	Establish digram uniqueness B is only used once
		S = CAC A = bc C = aAd	Variable utility established

The string "abcdbcabcd" is read in character by character and run through. So at the beginning only the character "a" is considered. Since the checked string only consists of one character at this point in time, there can of course be no repetitions, so there can be no replacement by a variable.

String a
dictionary (empty)

Then the character “b” is added. In the string “ab” there is still no character sequence at least twice, so no substitution is possible here either. The same applies to the strings "abc", "abcd" and "abcdb".

String ab, abc, abcd, abcdb
Dictionary (empty)

After reading in the 6th character, the string "abcdbc" is created. The string “bc” occurs twice in this string (“a bc d bc ”). The digram "bc" is now replaced by the character "A". A new variable “A → bc” is created and the string “abcdbc” is saved as “aAdA”.

String aAdA
dictionary {A → bc}

Now the character “a” is read in. The result is the string "aAdAa". No new string occurs twice in this string.

String aAdAa
dictionary {A → bc}

After reading in the 8th character, the string "aAdAab" is created. Again, no new string occurs at least twice.

String aAdAab
dictionary {A → bc}

Now the character “c” is read in. The result is the string "aAdAabc". This string contains the character sequence “bc” already entered in the dictionary as variable “A”. It will now be replaced by the variable. The new string "aAdAaA" is created. The digram "aA" now appears twice and is replaced by the character "B". A dictionary entry "B → aA" is created. The variable “B” now stands for the character string “abc”. The string is saved as "BdAB".

String BdAB
dictionary {A → bc}, {B → aA}

Finally, the last character “d” is read. The result is the string “BdABd”. The string “Bd” occurs twice in this string. The new variable C (C → Bd) is defined for this character string. The string "Bd" is replaced by the variable C. The new string "CAC" is created. The variable "B" no longer appears in this string and only occurs once in the dictionary. The condition of variable usefulness (r2) is therefore no longer fulfilled. The variable "B" is therefore deleted. The variable “C”, which was previously saved as “Bd”, is changed to “aAd” when the variable “B” is omitted (this results in a longer dictionary entry).

String CAC
dictionary {A → bc}, {C → aAd}

The version of the character string "abcdbcabcd" compressed losslessly with Sequitur is therefore "CAC" with the dictionary {A → bc}, {C → aAd}.

Compressed string: CAC

Dictionary entries: A → bc C → aAd

Reconstructed original string: abcdbcabcd

analysis

In order to be able to implement the individual replacement processes more quickly if a digram occurs several times, the productions of the grammar are stored as linked lists . These lists are also linked to one another. This allows the definition of the variable (i.e. the dictionary entry) to be found quickly based on the place where the variable was used. In addition, a digram index is administered. This enables you to quickly check whether a digram already exists. The index is saved as a hash table . This makes it possible to find the positions of all occurrences in the productions for a digram in constant time. With an implementation of the Sequitur algorithm as described here, the running time and required memory are linearly dependent on the length of the output string. The runtime behavior corresponds to a length of ${\ displaystyle O (n),}$ ${\ displaystyle n =}$ ${\ displaystyle z.}$

Web links

Official homepage for Sequitur