Chomsky hierarchy
Chomsky hierarchy , occasionally ChomskySchützenberger hierarchy (named after the linguist Noam Chomsky and the mathematician Marcel Schützenberger ), is a term from theoretical computer science . It is a hierarchy of classes of formal grammars that produce formal languages and was first described by Noam Chomsky in 1956. The hierarchy levels differ in how rigid the restrictions on the form of permitted production rules are at the respective level; with type 0 grammars they are unrestricted, with higher levels progressively more restricted.
Lowertype grammars are more productive than highertype grammars. A language produced by a grammar of type k is called a language of type k . In addition to the Chomsky hierarchy of grammars, there is a Chomsky hierarchy of languages in this sense.
backgrounds
Formal languages have the advantage that they can be precisely analyzed mathematically. Formal languages are based on a predefined alphabet (a character set), which is often referred to as. Arbitrarily long, finite sequences of elements of the alphabet are called words above the alphabet. After all, sets of such words are formal languages . The most comprehensive formal language over a given alphabet is infinitely large; it contains all the words that can be formed by arbitrarily combining characters from the alphabet. It can be described by the Kleenesche shell of the alphabet, so . The smallest formal language contains no words at all. Other formal languages only consist of a few words and can thus be noted as a finite set; for example, the alphabet , the language of all words of length two: .
Understandably, infinite languages cannot be noted by enumerating them. In order to describe them exactly, a concept is necessary that can also define infinite quantities. For this purpose, production processes are mainly used in the context of formal languages .
If one assumes an existing word over an alphabet, sets of words can be specified by semiThue systems , which result from the repeated application of replacement rules . Replacement rules of the form stipulate that in a word that contains a segment , that segment is replaced by. SemiThue systems therefore indicate a derivation relation between words in formal languages: Two words are related if one word can be derived from the other word by a substitution rule.
Formal grammars are suitable for describing formal languages , a concept that is expanded and more powerful than SemiThue systems. They differ from SemiThue systems in that they know what are known as terminal symbols and non  terminal symbols . Terminal symbols in a grammar are all characters in its character set . The first rule that applies always assumes a nonterminal start symbol, and the result of a substitution is considered a word in your formal language only if it does not contain any nonterminal symbols.
The following example describes a language that can be used to express sums of the digits 1, 2 and 3 of any length. The appropriate alphabet for this is . The corresponding terminal symbols are underlined, nonterminals in italics. The following lines represent the rules.
The start symbol of this language is sum . On this basis, you can apply any number of rules one after the other to produce a valid expression:
Not all infinite languages can be described as formal languages with this generation principle, but no more suitable concept is known. Another common formalism for describing languages are machine models , especially Turing machines . Unrestricted formal grammars and Turing machines are equally powerful when describing formal languages; H. For every language generated by a formal grammar there is a Turing machine that accepts exactly this, and vice versa.
Just as different machine models were defined, Chomsky defined different types of grammar in his work. With three continuously stricter restrictions on the substitution rules permitted therein, he established a hierarchy of four classes of formal grammars, whereby the less restricted classes really include the more regulated classes. The same applies to the language classes described by the respective grammar classes: they too form a hierarchy.
Languages of more limited grammar types are usually easier to process algorithmically  at the price of being less expressive. Regular expressions , which are used to define patterns for searching text, for example, correspond to the very restricted Chomsky grammars of type 3 (regular grammars) and such text searches are effective and fast. On the other hand, because of their simplicity, type 3 grammars are not suitable for describing programming languages for which less restricted type 2 grammars are usually used.
The hierarchy
The Chomsky hierarchy includes four types of formal grammars, counted from 0 to 3. Type 0 includes all formal grammars at all, i.e. without restriction to the form of admissible generation rules. Grammars of type 1 to type 3 are then increasingly restricted. By definition, a grammar of type 1 is also of type 0, a grammar of type 2 is also of type 1, and a grammar of type 3 is also of type 2; the inversions do not hold. Therefore these classes form a real hierarchy. In the case of the corresponding language classes, counterexamples show that the inversions also do not apply, which is why they also form a real hierarchy.
For type 1, type 2 and type 3 grammars, Chomsky demanded that the righthand side of production rules cannot be shorter than the lefthand side, which also excludes the empty word on each righthand side of a production rule. Today one is often less restrictive; the definitions are often formulated in such a way that the hierarchy of the languages is not disturbed, but they also allow grammars of type 1 (contextsensitive grammars) to generate the empty word through an exception rule, and types 2 (contextfree grammars) and 3 ( regular grammars) even allow the empty word as the right side of substitution rules without restriction.
Type 0 grammar (general Chomsky grammar or phrase structure grammar)
definition
Type 0 grammars are also called unbounded grammars. It is the class of all formal grammars of the form , where is a vocabulary consisting of the disjoint union of a finite alphabet (terminal symbols) and a set of nonterminal (variables) . The distinguished symbol is called the start symbol and is a set of production rules , i.e. H. on the left side of each production rule there is at least one nonterminal symbol. For individual rules you write instead of mostly , for the set of rules with on the left side too .
To express belonging to the class of type 0 grammars, one writes
Languages generated from type 0 grammars
Every type 0 grammar produces a language that can be accepted by a Turing machine , and conversely, for every language that can be accepted by a Turing machine, there is a type 0 grammar that produces that language. These languages are also known as the recursively enumerable languages (often also called semidecidable languages ); H. the associated Turing machine accepts every word in the language. With a word that is not in the language, the Turing machine may not terminate, i.e. H. does not provide any information about the affiliation of the word to the language.
Note that this set of languages differs from the set of recursive languages (often called decidable languages ) that can be decided by Turing machines .
Type 1 grammar (contextsensitive grammar)
definition
Type 1 grammars are also called contextsensitive grammars . These are type 0 grammars, in which all production rules have the form or , where a nonterminal and words consisting of terminals ( ) and nonterminals ( ) are. denotes the empty word. The words and can be empty, but must contain at least one symbol (i.e. a terminal or a nonterminal). This property is called length constraint.
The only exception to this requirement is if the start symbol does not appear anywhere on the right side of a production. This means that contextsensitive languages can also contain the often desired empty word, which then only ever arises in a onestep derivation from the start symbol itself, and without disturbing the property that the sentence form is longer in each partial step for all other derivations becomes or remains the same length.
If a grammar is contextsensitive, it is written .
In contrast to contextfree grammars, on the lefthand side of the productions of contextsensitive grammars, instead of individual symbols, there can be entire symbol sequences, provided they contain at least one nonterminal symbol.
Languages generated from type 1 grammars
The contextsensitive grammars produce exactly the contextsensitive languages . That means: Every type 1 grammar creates a contextsensitive language and for every contextsensitive language there is a type 1 grammar that creates it.
The contextsensitive languages are precisely those languages that can be recognized by a nondeterministic , linearly restricted Turing machine ; that is, from a nondeterministic Turing machine whose band is linearly bounded by the length of the input (that is, there is a constant number , so that the band of the Turing machine has at most fields, where the length of the input word is).
Monotonous grammars
Some authors describe a grammar as contextsensitive if, with the exception (see above), all production rules only meet the condition , i.e. that is, the right side of such a production is not shorter than its left side. More often, however, one finds the term monotonous grammar or nonshortening grammar.
It turns out that the monotonous grammars exactly generate the contextsensitive languages again, which is why the two classes of grammars are considered equivalent and some authors only treat one or the other grammar class at all. But not every monotonous replacement rule is also a contextsensitive one, which is why not every monotonous grammar is also contextsensitive.
Type 2 grammar (contextfree grammar)
definition
Type 2 grammars are also called contextfree grammars . There are grammars that must apply: In every rule of the grammar there must be exactly one nonterminal symbol on the left side and any nonempty sequence of terminal and nonterminal symbols from the entire vocabulary can be on the right side .
In addition, as with type 1 grammars, the exception rule can be allowed if there is no righthand side of a rule.
One writes
Contextfree grammars are often defined so that the right side may be empty too, so . Such grammars no longer fulfill all the properties of type 2 grammars and would no longer be part of the hierarchy defined by Chomsky. But they meet the requirements of monotonous grammars.
Languages generated from type 2 grammars
Contextfree grammars (with an exception rule) generate exactly the contextfree languages , every type 2 grammar therefore generates a contextfree language and for every contextfree language there is a type 2 grammar that generates it.
The contextfree languages are exactly those languages that can be recognized by a nondeterministic pushdown automaton (NPDA). A subset of these languages forms the theoretical basis for the syntax of most programming languages .
See also: BackusNaurForm (BNF) / Extended BackusNaurForm (EBNF): another, equivalent scheme of the designations.
Type 3 grammar (regular grammar)
definition
Type 3 grammars are also called regular grammars . These are type 2 grammars in which exactly one terminal symbol and a maximum of one additional nonterminal symbol may appear on the righthand side of productions. The permitted position of such nonterminal symbols must also always be in front of or always behind the terminal symbol across all productions , depending on which one speaks more precisely of leftregular and right regular grammars . They correspond to the left and right linear grammars, whereas linear grammars do not correspond to the regular grammars.
For leftregular type 3 grammars, the condition that
.
They are therefore only allowed to contain regular productions on the left (nonterminal symbol on the right in front ).
Usually, rules with an empty righthand side are allowed for regular grammars, as well as for contextfree grammars .
Legal grammars , on the other hand, meet the analogous condition
.
So these contain only right regular productions (nonterminal symbol on the right side if necessary in rear position). This condition already expresses the permission for empty righthand pages.
Leftregular and rightregular grammars generate exactly the same language class, so for every leftregular grammar there is also a rightregular grammar that generates the same language, and vice versa.
Note that if leftregular and right regular productions occur in the same Chomsky grammar, this is not regular. The grammars with both leftregular and rightregular productions create a really larger language class.
Some authors allow in the definitions for regular / leftregular / rightregular grammars wherever only a single nonterminal character is allowed in productions, also any nonempty terminal string. This does not change anything about the power of generation of the classes.
You write for regular grammars .
Languages generated from type 3 grammars
Regular grammars create exactly the regular languages , that is, every type 3 grammar creates a regular language and for every regular language there is a type 3 grammar that creates it.
Alternatively, regular languages can also be described by regular expressions and the regular languages are precisely those languages that can be recognized by finite automata . They are often used to define search patterns or the lexical structure of programming languages.
Overview
The following table lists the form of their rules for the four grammar types, what names the languages generated have and what types of machines recognize these languages, i.e. at least semidecide the word problem (word in language: machine stops in accepting final state, word not in Language: machine never stops or stops in a nonaccepting state → the machine only stops safely if the word is in the language). Since in the Chomsky hierarchy there is a real subset relationship for the language sets (see next section), e.g. B. Turing machines, of course, also the word problem for languages of types 1 to 3 (deciding means: word in language: machine stops in an accepting end state, word not in language: machine stops in a nonaccepting state → at some point the machine and the problem (word in language?) is decided) In addition, it is noted against which operations the generated languages are closed .
grammar  regulate  languages  Decidability  Vending machines  Seclusion  Time estimate 

Type0 Any formal grammar 

recursively enumerable (not "only" recursively , they would be decidable!)    Turing machine (regardless of whether it is deterministic or nondeterministic)  unlimited  
Type1 context sensitive grammar 

context sensitive  Word problem  linear spaceconstrained nondeterministic Turing machine  
Type2 context free grammar 

context free  Word problem ,  nondeterministic basement automaton  
Type3 regular grammar 
(right regular) or (left regular) Only leftwing or only rightwing regular productions 
regular 
Word problem ,
Emptiness problem , finiteness problem , equivalence problem 
Finite automaton (regardless of whether it is deterministic or nondeterministic) 
 Legend for the Rules column
 ... finite alphabet (set of terminal symbols, often also referred to with , but which can also mean)
 ... the disjoint set of nonterminal symbols (sometimes also called variables and denoted by, instead of ... entire vocabulary )
 ... start symbol
 ... empty word (often also marked with )
 ... set of production rules (as a subset of a Cartesian product, a relation )
 ... amount difference formation
 ... Kleenescher closure (shell), see Kleenesche and positive shell
 ... positive envelope, e.g. B. thinks ... must contain at least one symbol (terminal or a nonterminal)
In the table above, therefore, nonterminal symbols are represented with Latin capital letters, terminal symbols are used with Latin lower case letters, and Greek lower case letters are used when both non terminal and terminal symbols are involved .
Attention: With and a Greek lower case letter can stand for words from several terminal or nonterminal symbols!
 Legend for the isolation column
 ... complement formation
 ... concatenation
 ... intersection
 ... union set
 ... Kleenescher graduation (cover)
Chomsky hierarchy for formal languages
A formal language is of the type according to the hierarchy for grammars when it is generated from a type grammar. Expressed formally this means: is of the type if there is a grammar with . Then you write
In the Chomsky hierarchy for formal languages, there is a true subset relationship between the language sets of adjacent levels. Any contextsensitive language is recursively enumerable, but there are recursively enumerable languages that are not contextsensitive. Likewise, every contextfree language is also contextsensitive, but not the other way around, and every regular language is contextfree, but not every contextfree language is regular.
Expressed formally, this means for the classes of the languages generated by the above grammars: whereby the following symbols are occasionally used:
Examples of languages in the respective difference sets are:
 is of type 1 but not of type 2
 is of type 2 but not of type 3
Evidence that certain languages do not belong to the language classes and, as is the case here, are often provided with the loop theorem.
Natural languages
Although Chomsky pursued his research with the aim of finding a mathematical description of natural languages , to date no natural language has been able to prove a correct and complete formal grammar. The problem may be a. in the interplay of the various grammar parts that model individual linguistic phenomena. But even with the practical use of formal grammars in computational linguistics , ambiguities can arise on different levels of language consideration; these have to be resolved (e.g. in machine translation ) based on the context .
literature
 Noam Chomsky : Three models for the description of language . In: IRE Transactions on Information Theory . Vol. 2, 1956, pp. 113124 ( PDF ).
 Noam Chomsky: On certain formal properties of grammars . In: Information and Control . Vol. 2, 1959, pp. 137167 ( PDF ).
 Noam Chomsky, Marcel P. Schützenberger: The algebraic theory of context free languages, computer programming and formal languages . Ed .: P. Braffort, D. Hirschberg. North Holland, Amsterdam 1963, p. 118161 .
 Peter Sander, Wolffried Stucky, Rudolf Herschel: Automata, Languages, Predictability . Teubner, Stuttgart 1992, ISBN 3519029375 .
Individual evidence
 ↑ In connection with formal grammars, the character is used here instead of  as usual  for the 'target alphabet' of the terminal symbols . Attention: Some authors use different names for the entire vocabulary (here marked with ).
 ↑ Uwe Schöning : Theoretical Computer Science  in brief . 5th edition. Spektrum Akademischer Verlag, Heidelberg 2008, ISBN 9783827418241 , p. 81 .