# Chomsky hierarchy

The Chomsky hierarchy, occasionally called the Chomsky–Schützenberger hierarchy (named after the linguist Noam Chomsky and the mathematician Marcel Schützenberger), is a concept from theoretical computer science. It is a hierarchy of classes of formal grammars that generate formal languages, first described by Noam Chomsky in 1956. The levels of the hierarchy differ in how strict the restrictions on the form of permitted production rules are at the respective level: type 0 grammars are unrestricted, while higher types are progressively more restricted.

Grammars of a lower type are more expressive than grammars of a higher type. A language generated by a grammar of type k is called a language of type k. In this sense there is, alongside the Chomsky hierarchy of grammars, a Chomsky hierarchy of languages.

## Background

Formal languages have the advantage that they can be analyzed precisely with mathematical methods. A formal language is based on a predefined alphabet (a character set), often denoted ${\displaystyle \Sigma }$. Arbitrarily long but finite sequences of elements of the alphabet are called words over the alphabet, and sets of such words are formal languages. The most comprehensive formal language over a given alphabet is infinite: it contains every word that can be formed by combining characters of the alphabet arbitrarily, and it can be described as the Kleene closure ${\displaystyle \Sigma ^{*}}$ of the alphabet. The smallest formal language contains no words at all. Other formal languages consist of only a few words and can thus be written down as finite sets; for example, over the alphabet ${\displaystyle \Sigma =\{a,b\}}$, the language of all words of length two is ${\displaystyle L=\{aa,ab,ba,bb\}}$.

Understandably, infinite languages cannot be written down by enumerating their words. To describe them exactly, a concept is needed that can also define infinite sets. In the context of formal languages, generative processes are mainly used for this purpose.

Starting from an existing word over an alphabet, sets of words can be specified by semi-Thue systems, which arise from the repeated application of replacement rules. A replacement rule of the form ${\displaystyle \alpha \rightarrow \beta }$ stipulates that in a word containing the segment ${\displaystyle \alpha }$, that segment may be replaced by ${\displaystyle \beta }$. Semi-Thue systems thus define a derivation relation between words of formal languages: two words are related if one word can be derived from the other by a replacement rule.
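
The rewriting step of a semi-Thue system can be sketched in a few lines of Python (an illustrative sketch, not part of any standard library; rules are modeled as plain string pairs):

```python
def rewrite_steps(word, rules):
    """Yield every word reachable from `word` in one rewriting step.

    `rules` is a list of (alpha, beta) pairs: any occurrence of the
    segment alpha in the word may be replaced by beta.
    """
    for alpha, beta in rules:
        start = 0
        while True:
            i = word.find(alpha, start)
            if i < 0:
                break
            yield word[:i] + beta + word[i + len(alpha):]
            start = i + 1

# One step of the system {ab -> ba} applied to "abab":
print(sorted(rewrite_steps("abab", [("ab", "ba")])))  # ['abba', 'baab']
```

The derivation relation of the system is then the reflexive-transitive closure of this one-step relation.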

Formal grammars, a concept that extends semi-Thue systems and is more powerful, are well suited to describing formal languages. They differ from semi-Thue systems in that they distinguish so-called terminal symbols from nonterminal symbols. The terminal symbols of a grammar are exactly the characters of its alphabet ${\displaystyle \Sigma }$. Every derivation starts from a distinguished nonterminal start symbol, and the result of a sequence of replacements counts as a word of the generated formal language only if it contains no nonterminal symbols.

The following example describes a language in which sums of the digits 1, 2 and 3 of arbitrary length can be expressed. The appropriate alphabet is ${\displaystyle \Sigma =\{+,1,2,3\}}$. In the rules below, terminal symbols are underlined and nonterminals are set in italics.

${\displaystyle {\begin{aligned}{\mathit {sum}}&\rightarrow {\mathit {number}}\\{\mathit {sum}}&\rightarrow {\mathit {sum}}\ {\underline {+}}\ {\mathit {number}}\\{\mathit {number}}&\rightarrow {\underline {1}}\\{\mathit {number}}&\rightarrow {\underline {2}}\\{\mathit {number}}&\rightarrow {\underline {3}}\end{aligned}}}$

The start symbol of this language is *sum*. Starting from it, any number of rules can be applied in succession to produce a valid expression:

${\displaystyle {\begin{aligned}&{\mathit {sum}}&\quad &{\text{(start symbol)}}\\&{\mathit {sum}}\ {\underline {+}}\ {\mathit {number}}&&{\text{(by the second rule)}}\\&{\mathit {number}}\ {\underline {+}}\ {\mathit {number}}&&{\text{(by the first rule)}}\\&{\underline {1+}}\ {\mathit {number}}&&{\text{(by the first number rule)}}\\&{\underline {1+2}}&&{\text{(by the second number rule)}}\end{aligned}}}$
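
Such derivations can be replayed mechanically. The following sketch enumerates all terminal words of the sum grammar up to a given length by breadth-first search; the one-letter encodings S (for sum) and N (for number) are an assumption made here for compactness:

```python
from collections import deque

# The example grammar, with S = sum and N = number:
RULES = {"S": ["N", "S+N"], "N": ["1", "2", "3"]}

def generate(start="S", max_len=3):
    """Enumerate all terminal words of length <= max_len."""
    words, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        nt = next((c for c in form if c in RULES), None)
        if nt is None:          # no nonterminal left: a word of the language
            words.add(form)
            continue
        i = form.index(nt)      # expand the leftmost nonterminal
        for rhs in RULES[nt]:
            new = form[:i] + rhs + form[i + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(words)

print(generate())  # the 12 sums over {1,2,3} with at most one '+'
```

Pruning by length is safe here because no rule of this grammar shortens the sentential form.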

Not all infinite languages can be described as formal languages by this generation principle, but no more suitable concept is known. Another common formalism for describing languages are machine models, in particular Turing machines. Unrestricted formal grammars and Turing machines are equally powerful for describing formal languages: for every language generated by a formal grammar there is a Turing machine that accepts exactly that language, and vice versa.

Just as different machine models have been defined, Chomsky defined different types of grammar in his work. With three successively stricter restrictions on the permitted replacement rules, he established a hierarchy of four classes of formal grammars, in which the less restricted classes properly contain the more restricted ones. The same holds for the language classes described by the respective grammar classes: they too form a hierarchy.

Languages of more restricted grammar types are usually easier to process algorithmically, at the price of being less expressive. Regular expressions, which are used, for example, to define patterns for searching text, correspond to the heavily restricted Chomsky grammars of type 3 (regular grammars), and such text searches are efficient and fast. Because of their simplicity, however, type 3 grammars are unsuitable for describing programming languages, for which the less restricted type 2 grammars are usually used.
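
For instance, the sum language from the example above is itself regular, so a regular expression suffices to recognize it (a sketch using Python's `re` module):

```python
import re

# Sums of the digits 1-3, e.g. "1+2+3": a type 3 (regular) pattern.
sum_pattern = re.compile(r"[123](\+[123])*")

print(bool(sum_pattern.fullmatch("1+2+3")))  # True
print(bool(sum_pattern.fullmatch("1+")))     # False
```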

## The hierarchy

The Chomsky hierarchy comprises four types of formal grammars, numbered 0 to 3. Type 0 comprises all formal grammars whatsoever, i.e. without any restriction on the form of admissible production rules. Grammars of type 1 to type 3 are then increasingly restricted. By definition, a grammar of type 1 is also of type 0, a grammar of type 2 is also of type 1, and a grammar of type 3 is also of type 2; the converses do not hold. Therefore these classes form a proper hierarchy. For the corresponding language classes, counterexamples show that the converses do not hold either, which is why they also form a proper hierarchy.

For grammars of types 1, 2 and 3, Chomsky demanded that the right-hand side of a production rule must not be shorter than its left-hand side, which in particular excludes the empty word as a right-hand side. Today the definitions are often less restrictive: they are formulated so that the hierarchy of the languages is not disturbed, but grammars of type 1 (context-sensitive grammars) are allowed to generate the empty word through an exceptional rule, and types 2 (context-free grammars) and 3 (regular grammars) even allow the empty word as the right-hand side of replacement rules without restriction.

### Type 0 grammar (general Chomsky grammar or phrase structure grammar)

#### Definition

Type 0 grammars are also called unrestricted grammars. This is the class of all formal grammars of the form ${\displaystyle G=(V,T,P,S)}$, where ${\displaystyle V=T\cup N}$ is a vocabulary consisting of the disjoint union of a finite alphabet ${\displaystyle T}$ (terminal symbols) and a set ${\displaystyle N}$ of nonterminals (variables). The distinguished symbol ${\displaystyle S\in N}$ is called the start symbol, and ${\displaystyle P\subset (V^{*}\setminus T^{*})\times V^{*}=V^{*}NV^{*}\times V^{*}}$ is a set of production rules, i.e. the left-hand side of each production rule contains at least one nonterminal symbol. For an individual rule ${\displaystyle (\alpha ,\beta )\in P}$ one usually writes ${\displaystyle \alpha \rightarrow \beta }$, and for the set of rules ${\displaystyle \{(\alpha ,\beta _{1}),\ldots ,(\alpha ,\beta _{n})\}\subseteq P}$ with the same left-hand side ${\displaystyle \alpha }$ also ${\displaystyle \alpha \rightarrow \beta _{1}|\ldots |\beta _{n}}$.

To express membership in the class of type 0 grammars, one writes ${\displaystyle G\in {\mbox{Type}}_{0}}$.

#### Languages generated by type 0 grammars

Every type 0 grammar generates a language that can be accepted by a Turing machine, and conversely, for every language that can be accepted by a Turing machine there is a type 0 grammar that generates it. These languages are known as the recursively enumerable languages (often also called semi-decidable languages): the associated Turing machine accepts every word of the language, but on a word that is not in the language it may fail to terminate, and thus provides no information about the word's membership.
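
This semi-decidability can be made concrete: a breadth-first enumeration of all derivations of a grammar answers "yes" for every word of the language, but on other inputs it may search forever. The sketch below (an illustration, not a full Turing machine) therefore gives up after a fixed budget of expansions:

```python
from collections import deque

def semi_decide(word, rules, start="S", budget=1_000):
    """Breadth-first search over sentential forms.

    Returns True as soon as `word` is derived.  On a word outside the
    language the search would never halt, so this sketch returns None
    (no information) once `budget` forms have been expanded.
    """
    queue, seen, steps = deque([start]), {start}, 0
    while queue and steps < budget:
        form = queue.popleft()
        steps += 1
        if form == word:
            return True
        for lhs, rhs in rules:
            i = form.find(lhs)
            while i >= 0:
                new = form[:i] + rhs + form[i + len(lhs):]
                if new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return None  # budget exhausted: no information about membership

# S -> aSb | ab generates a^n b^n:
print(semi_decide("aaabbb", [("S", "aSb"), ("S", "ab")]))  # True
```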

Note that this set of languages differs from the set of recursive languages (often called decidable languages), which can be decided by Turing machines.

### Type 1 grammar (context-sensitive grammar)

#### Definition

Type 1 grammars are also called context-sensitive grammars. These are type 0 grammars in which all production rules have the form ${\displaystyle \alpha A\beta \rightarrow \alpha \gamma \beta }$ or ${\displaystyle S\rightarrow \varepsilon }$, where ${\displaystyle A}$ is a nonterminal and ${\displaystyle \alpha ,\gamma ,\beta }$ are words consisting of terminals (${\displaystyle T}$) and nonterminals (${\displaystyle N}$); ${\displaystyle \varepsilon }$ denotes the empty word. The words ${\displaystyle \alpha }$ and ${\displaystyle \beta }$ may be empty, but ${\displaystyle \gamma }$ must contain at least one symbol (i.e. a terminal or a nonterminal). This property is called the length restriction.

The only exception to this requirement is the rule ${\displaystyle S\rightarrow \varepsilon }$, which is permitted if the start symbol does not appear on the right-hand side of any production. This means that context-sensitive languages can also contain the often desired empty word, which then only ever arises in a one-step derivation from the start symbol itself, without disturbing the property that in every other derivation the sentential form grows or stays the same length at each step.

If a grammar ${\displaystyle G}$ is context-sensitive, one writes ${\displaystyle G\in {\mbox{Type}}_{1}}$.

In contrast to context-free grammars, the left-hand sides of productions in context-sensitive grammars may consist of entire symbol sequences rather than single symbols, provided they contain at least one nonterminal symbol.

#### Languages generated by type 1 grammars

Context-sensitive grammars generate exactly the context-sensitive languages. That means: every type 1 grammar generates a context-sensitive language, and for every context-sensitive language there is a type 1 grammar that generates it.

The context-sensitive languages are precisely those languages that can be recognized by a nondeterministic linearly bounded Turing machine; that is, by a nondeterministic Turing machine whose tape is linearly bounded by the length of the input (there is a constant ${\displaystyle a}$ such that the tape of the Turing machine has at most ${\displaystyle a\cdot x}$ cells, where ${\displaystyle x}$ is the length of the input word).

#### Monotone grammars

Some authors call a grammar context-sensitive if, with the exception of ${\displaystyle S\rightarrow \varepsilon }$ (see above), all production rules ${\displaystyle w_{1}\rightarrow w_{2}}$ merely satisfy the condition ${\displaystyle \left|w_{1}\right|\leq \left|w_{2}\right|}$, i.e. the right-hand side of each production is not shorter than its left-hand side. More commonly, however, one finds the term monotone grammar or non-contracting grammar for this.

It turns out that monotone grammars again generate exactly the context-sensitive languages, which is why the two classes of grammars are considered equivalent and some authors treat only one or the other class. But not every monotone replacement rule is also a context-sensitive one, which is why not every monotone grammar is context-sensitive.
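
Because monotone rules never shorten a sentential form, membership can be tested by an exhaustive search that discards every form longer than the input word. A sketch (the grammar below is a standard monotone grammar for ${\displaystyle \{a^{n}b^{n}c^{n}\,|\,n\geq 1\}}$; uppercase letters are nonterminals):

```python
from collections import deque

# A monotone grammar for { a^n b^n c^n : n >= 1 }:
MONO_RULES = [("S", "aSBC"), ("S", "aBC"), ("CB", "BC"),
              ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]

def derivable(word):
    """Exhaustive search; monotonicity bounds every form by len(word)."""
    queue, seen = deque(["S"]), {"S"}
    while queue:
        form = queue.popleft()
        if form == word:
            return True
        for lhs, rhs in MONO_RULES:
            i = form.find(lhs)
            while i >= 0:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= len(word) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False

print(derivable("aabbcc"))  # True
print(derivable("aabbc"))   # False
```

Since only finitely many forms of bounded length exist, the search always terminates, mirroring the decidability of the word problem for context-sensitive languages.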

### Type 2 grammar (context-free grammar)

#### Definition

Type 2 grammars are also called context-free grammars. These are grammars for which ${\displaystyle \forall (w_{1}\rightarrow w_{2})\in P:(w_{1}\in N)\land \left(w_{2}\in V^{+}\right)}$ must hold: in every rule of the grammar, the left-hand side must consist of exactly one nonterminal symbol, and the right-hand side may be any non-empty sequence of terminal and nonterminal symbols from the entire vocabulary ${\displaystyle V=N\cup T}$.

In addition, as with type 1 grammars, the exceptional rule ${\displaystyle S\rightarrow \varepsilon }$ can be allowed if ${\displaystyle S}$ does not appear on the right-hand side of any rule.

One writes ${\displaystyle G\in {\mbox{Type}}_{2}}$.

Context-free grammars are often defined so that the right-hand side may also be empty, i.e. ${\displaystyle (w_{1}\rightarrow \varepsilon )\in P}$. Such grammars no longer fulfill all the properties of type 2 grammars and would strictly speaking no longer be part of the hierarchy defined by Chomsky. Nor do they meet the requirements of monotone grammars, since a rule with an empty right-hand side shortens the sentential form.

#### Languages generated by type 2 grammars

Context-free grammars (with the ${\displaystyle \varepsilon }$ exception rule) generate exactly the context-free languages; every type 2 grammar therefore generates a context-free language, and for every context-free language there is a type 2 grammar that generates it.

The context-free languages ​​are exactly those languages ​​that can be recognized by a non-deterministic push-down automaton (NPDA). A subset of these languages ​​forms the theoretical basis for the syntax of most programming languages .
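
The stack discipline of a pushdown automaton can be illustrated with the classic context-free language ${\displaystyle \{a^{n}b^{n}\,|\,n\geq 1\}}$. The following sketch simulates a (here even deterministic) pushdown recognizer with an explicit stack:

```python
def accepts_anbn(word):
    """Pushdown-style recognizer for { a^n b^n : n >= 1 }."""
    stack, i = [], 0
    while i < len(word) and word[i] == "a":    # push one marker per 'a'
        stack.append("A")
        i += 1
    if not stack:
        return False
    while i < len(word) and word[i] == "b" and stack:
        stack.pop()                            # pop one marker per 'b'
        i += 1
    return i == len(word) and not stack        # input consumed, stack empty

print(accepts_anbn("aaabbb"))  # True
print(accepts_anbn("aabbb"))   # False
```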

See also: Backus–Naur form (BNF) / Extended Backus–Naur form (EBNF), another, equivalent notation scheme.

### Type 3 grammar (regular grammar)

#### Definition

Type 3 grammars are also called regular grammars. These are type 2 grammars in which the right-hand side of a production may contain exactly one terminal symbol and at most one additional nonterminal symbol. Moreover, the permitted position of this nonterminal symbol must be consistently before or consistently after the terminal symbol across all productions; accordingly one speaks more precisely of left-regular and right-regular grammars. They coincide with the left-linear and right-linear grammars, whereas linear grammars in general do not coincide with the regular grammars.

Left-regular type 3 grammars satisfy the condition

${\displaystyle \forall (w_{1}\rightarrow w_{2})\in P:(w_{1}\in N)\land (w_{2}\in T\cup NT)}$.

They may therefore contain only left-regular productions (nonterminal symbol in front position on the right-hand side).

As with context-free grammars, rules with an empty right-hand side, i.e. ${\displaystyle (w_{1}\rightarrow \varepsilon )\in P}$, are usually also allowed for regular grammars.

Right-regular grammars, on the other hand, satisfy the analogous condition

${\displaystyle \forall (w_{1}\rightarrow w_{2})\in P:(w_{1}\in N)\land (w_{2}\in T\cup TN\cup \{\varepsilon \})}$.

These thus contain only right-regular productions (nonterminal symbol, if present, in final position on the right-hand side). Here the permission of empty right-hand sides is already expressed in the condition.

Left-regular and right-regular grammars generate exactly the same class of languages; thus for every left-regular grammar there is a right-regular grammar that generates the same language, and vice versa.

Note that a Chomsky grammar in which both left-regular and right-regular productions occur is not regular. Grammars mixing left-regular and right-regular productions generate a strictly larger class of languages.

Some authors, in the definitions of regular, left-regular and right-regular grammars, allow any non-empty terminal word wherever only a single terminal symbol is permitted above. This does not change the generative power of the classes.

For regular grammars ${\displaystyle G}$ one writes ${\displaystyle G\in {\mbox{Type}}_{3}}$.

#### Languages generated by type 3 grammars

Regular grammars generate exactly the regular languages; that is, every type 3 grammar generates a regular language, and for every regular language there is a type 3 grammar that generates it.

Alternatively, regular languages can be described by regular expressions, and the regular languages are precisely those languages that can be recognized by finite automata. They are often used to define search patterns or the lexical structure of programming languages.
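
A finite automaton needs no memory beyond its current state. A minimal sketch (the language chosen here, words over {a, b} with an even number of a's, is an illustrative example):

```python
# Transition table of a two-state DFA over the alphabet {a, b}:
TRANSITIONS = {("even", "a"): "odd", ("even", "b"): "even",
               ("odd", "a"): "even", ("odd", "b"): "odd"}

def dfa_accepts(word, start="even", accepting=("even",)):
    """Run the DFA and accept iff it ends in an accepting state."""
    state = start
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in accepting

print(dfa_accepts("abab"))  # True  (two 'a's)
print(dfa_accepts("ab"))    # False (one 'a')
```

The same language is described by the regular expression `(b*ab*a)*b*`.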

### Overview

The following table lists, for the four grammar types, the form of their rules, the names of the languages they generate, and the types of machines that recognize these languages, i.e. that at least semi-decide the word problem (word in the language: the machine halts in an accepting final state; word not in the language: the machine never halts or halts in a non-accepting state; the machine is thus only guaranteed to halt when the word is in the language). Since the language sets of the Chomsky hierarchy stand in a proper subset relation (see next section), a Turing machine, for example, also handles the word problem for languages of types 1 to 3, where the word problem is even decidable (word in the language: the machine halts in an accepting final state; word not in the language: the machine halts in a non-accepting state; the machine thus halts at some point in every case, and the question of membership is decided). In addition, the table notes the operations under which the generated languages are closed.

| Grammar | Rules | Languages | Decidability | Automaton | Closure properties | Time complexity |
|---|---|---|---|---|---|---|
| Type 0 (any formal grammar) | ${\displaystyle \alpha \rightarrow \beta }$ with ${\displaystyle \alpha \in V^{*}\setminus T^{*},\;\beta \in V^{*}}$ | recursively enumerable (not "only" recursive, those would be decidable!) | – | Turing machine (deterministic or nondeterministic) | ${\displaystyle \circ ,\cap ,\cup ,\ast }$ | unbounded |
| Type 1 (context-sensitive grammar) | ${\displaystyle \alpha A\beta \rightarrow \alpha \gamma \beta }$ with ${\displaystyle A\in N,\;\alpha ,\beta \in V^{\ast },\;\gamma \in V^{+}}$; ${\displaystyle S\rightarrow \varepsilon }$ is allowed if there is no rule ${\displaystyle \alpha \rightarrow \beta S\gamma }$ in ${\displaystyle P}$ | context-sensitive | word problem | linearly space-bounded nondeterministic Turing machine | ${\displaystyle \complement ,\circ ,\cap ,\cup ,\ast }$ | ${\displaystyle O(2^{n})}$ |
| Type 2 (context-free grammar) | ${\displaystyle A\rightarrow \gamma }$ with ${\displaystyle A\in N,\;\gamma \in V^{\ast }}$ | context-free | word problem | nondeterministic pushdown automaton | ${\displaystyle \circ ,\cup ,\ast }$ | ${\displaystyle O(n^{3})}$ |
| Type 3 (regular grammar) | ${\displaystyle A\rightarrow aB}$ (right-regular) or ${\displaystyle A\rightarrow Ba}$ (left-regular), ${\displaystyle A\rightarrow a}$, ${\displaystyle A\rightarrow \varepsilon }$ with ${\displaystyle A,B\in N,\;a\in T}$; only left-regular or only right-regular productions | regular | word problem | finite automaton (deterministic or nondeterministic) | ${\displaystyle \complement ,\circ ,\cap ,\cup ,\ast }$ | ${\displaystyle O(n)}$ |
Legend for the Rules column
• ${\displaystyle T}$ … finite alphabet (the set of terminal symbols; often also denoted ${\displaystyle \Sigma }$, which, however, some authors use for the entire vocabulary instead)
• ${\displaystyle N}$ … the set of nonterminal symbols, disjoint from ${\displaystyle T}$ (sometimes also called variables and denoted ${\displaystyle V}$; here ${\displaystyle V:=N\cup T}$ instead denotes the entire vocabulary)
• ${\displaystyle S\in N}$ … start symbol
• ${\displaystyle \varepsilon }$ … empty word (often also denoted ${\displaystyle \lambda }$)
• ${\displaystyle P}$ … set of production rules (as a subset of a Cartesian product, a relation)
• ${\displaystyle \setminus }$ … set difference
• ${\displaystyle ^{\ast }}$ … Kleene closure (Kleene star), see Kleene and positive closure
• ${\displaystyle ^{+}}$ … positive closure; e.g. ${\displaystyle \gamma \in (T\cup N)^{+}=V^{+}}$ means that ${\displaystyle \gamma }$ must contain at least one symbol (a terminal or a nonterminal)

In the table above, nonterminal symbols are therefore written with Latin capital letters (${\displaystyle A,B\in N}$), terminal symbols with Latin lower-case letters (${\displaystyle a\in T}$), and Greek lower-case letters are used where both nonterminal and terminal symbols may be involved.

Note: under ${\displaystyle ^{\ast }}$ and ${\displaystyle ^{+}}$, a Greek lower-case letter can stand for a word of several terminal or nonterminal symbols!

Legend for the closure properties column
• ${\displaystyle \complement }$ … complement
• ${\displaystyle \circ }$ … concatenation
• ${\displaystyle \cap }$ … intersection
• ${\displaystyle \cup }$ … union
• ${\displaystyle ^{\ast }}$ … Kleene closure

## Chomsky hierarchy for formal languages

Subset relationship of the language classes in the Chomsky hierarchy

A formal language ${\displaystyle L}$ is of type ${\displaystyle i}$ according to the hierarchy of grammars if it is generated by a grammar of type ${\displaystyle i}$. Formally: ${\displaystyle L}$ is of type ${\displaystyle i\in \{0,\ldots ,3\}}$ if there is a grammar ${\displaystyle G\in {\mbox{Type}}_{i}}$ with ${\displaystyle L=L\left(G\right)}$. One then writes ${\displaystyle L\in {\mathcal {L}}_{i}}$.

In the Chomsky hierarchy of formal languages there is a proper subset relationship between the language sets of adjacent levels. Every context-sensitive language is recursively enumerable, but there are recursively enumerable languages that are not context-sensitive. Likewise, every context-free language is also context-sensitive but not conversely, and every regular language is context-free, but not every context-free language is regular.

Expressed formally, this means for the classes of languages generated by the above grammars: ${\displaystyle {\mathcal {L}}_{3}\subset {\mathcal {L}}_{2}\subset {\mathcal {L}}_{1}\subset {\mathcal {L}}_{0},}$ for which occasionally the notation ${\displaystyle \mathrm {REG} \subset \mathrm {CF} \subset \mathrm {CS} \subset \mathrm {RE} }$ is also used.

Examples of languages in the respective difference sets are:

• ${\displaystyle L_{1}=\left\{a^{n}b^{n}c^{n}\,|\,n\geq 1\right\}}$ is of type 1 but not of type 2
• ${\displaystyle L_{2}=\left\{a^{n}b^{n}\,|\,n\geq 1\right\}}$ is of type 2 but not of type 3
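
Membership in these two example languages is of course easy to test directly; what the hierarchy classifies is which kind of grammar can generate them. A small sketch of such membership tests:

```python
def in_L1(w):
    """a^n b^n c^n, n >= 1: context-sensitive but not context-free."""
    n = len(w) // 3
    return n >= 1 and w == "a" * n + "b" * n + "c" * n

def in_L2(w):
    """a^n b^n, n >= 1: context-free but not regular."""
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n

print(in_L1("aabbcc"), in_L2("aabb"))  # True True
print(in_L1("aabbc"), in_L2("abab"))   # False False
```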

Proofs that certain languages do not belong to the language classes ${\displaystyle {\mathcal {L}}_{2}}$ and ${\displaystyle {\mathcal {L}}_{3}}$, as is the case here, are often carried out using the pumping lemma.

## Natural languages

Although Chomsky pursued his research with the goal of finding a mathematical description of natural languages, to this day no correct and complete formal grammar has been established for any natural language. The problem lies, among other things, in the interplay of the various parts of a grammar that model individual linguistic phenomena. But in the practical use of formal grammars in computational linguistics, too, ambiguities can arise at different levels of linguistic analysis; these must be resolved (e.g. in machine translation) on the basis of context.

## Literature

• Noam Chomsky: Three models for the description of language. In: IRE Transactions on Information Theory. Vol. 2, 1956, pp. 113–124 (PDF).
• Noam Chomsky: On certain formal properties of grammars. In: Information and Control. Vol. 2, 1959, pp. 137–167 (PDF).
• Noam Chomsky, Marcel P. Schützenberger: The algebraic theory of context-free languages. In: P. Braffort, D. Hirschberg (eds.): Computer Programming and Formal Systems. North Holland, Amsterdam 1963, pp. 118–161.
• Peter Sander, Wolffried Stucky, Rudolf Herschel: Automata, Languages, Computability. Teubner, Stuttgart 1992, ISBN 3-519-02937-5.

## References

1. In connection with formal grammars, the character ${\displaystyle T}$ is used here for the 'target alphabet' of terminal symbols instead of the usual ${\displaystyle \Sigma }$. Note: some authors use ${\displaystyle \Sigma }$ with a different meaning, namely for the entire vocabulary ${\displaystyle N\cup T}$ (here denoted ${\displaystyle V}$).
2. Uwe Schöning: Theoretical Computer Science – in brief. 5th edition. Spektrum Akademischer Verlag, Heidelberg 2008, ISBN 978-3-8274-1824-1, p. 81.