Formal grammars are mathematical models of grammars that are used to uniquely generate and describe formal languages . They are used in theoretical computer science , especially in computability theory , and in compiler construction, on the one hand, to clearly determine whether a word is an element of a language and, on the other hand, to examine or prove properties of these formal languages. Formal grammars are classified in the Chomsky hierarchy given using Semi-Thue systems .
With a formal grammar, starting from a start symbol (also called start variable ), production rules from a set of rules can be applied, which generate new character strings ( words ) from the start symbol , which in turn can be further replaced. This process is also called derivation .
The vocabulary of a grammar, consisting of the disjoint union of an alphabet of terminal symbols with a set of non-terminal symbols , specifies which symbols can be used for this. The set of terminal symbols defines which characters are made up of words that cannot be further derived. Taken together, these words make up the formal language described by the grammar. The start symbol, however, must be a non-terminal symbol. Additional non-terminal symbols allow more differentiated rules.
By definition, production rules are ordered pairs that are also written. It is applied to a character string by replacing one occurrence of the character string in with . There must always be at least one non-terminal symbol on the left side of the rule. A lot of rules with the same left-hand side, so , are also abbreviated as .
For example, one can use the rule set to either derive or to derive the string .
Just as several rules can be applied to a given string, there does not always have to be just one place in the string that a rule fits. Formal grammars do not dictate any order. All words that consist only of terminal symbols and that can be derived from the start symbol belong to the language described by the grammar.
A formal grammar is represented by the 4- tuple , in which:
- , a finite set of symbols , which is called a symbol set or vocabulary ,
- , a subset of , also called the alphabet and whose elements are called terminal symbols,
- , a finite set of production rules , as well
- , the start symbol .
Here, the Kleenesche shell describes a set of characters (or words).
The set is the set of non-terminal symbols (also called non-terminal or meta symbols ), in particular the start character belongs to it. The word on the left side of the rule pairs must not consist exclusively of terminal characters, which can also be expressed by a concatenation :
It doesn't make much sense if the word on the right has the start symbol. Some authors take this into account by restricting the associated set accordingly, i. H. by replacing.
Some authors alternatively refer to the quadruple as grammar . There are also the following different designations:
- for the terminal characters , here ,
- for the entire vocabulary ( set of symbols ) of all symbols , here ,
- for the non-terminal characters ( variables ), here ,
- for the empty word, here .
The language described by a grammar
A rule of a given grammar says that in a word with R as an infix , R can be replaced by Q, so that a new word with an infix is created. One also speaks of the fact that in goes over with the grammar or through the rule , or that it was derived from. This can be noted through (often instead of ). If it is only to be expressed that the word from can arise in the grammar without naming a rule, one writes instead ( if the grammar is obvious from the context, also simple ). Accordingly, such a transition from into a transition relation is a natural extension of all possible contexts, namely:
where here denotes the concatenation of words.
If there is a sequence of words in which it applies that for every natural number with the word in passes over ( ), then what is represented by can be derived in steps from . Such a word sequence is called a derivative or calculation of in length . Furthermore, in means derivable if there is at least one such that it can be derived in steps from . If in can be derived, this is represented by the notation . It is also defined that for every word that is, so that the relation is the reflexive-transitive envelope of the relation .
Now the formal language generated by the grammar is the language that consists of all words that, on the one hand, consist only of terminal symbols and, on the other hand, can be derived from the start symbol with a finite number of steps:
It does not matter in which order the production rules are applied to the derived words, or whether there are several options for deriving a word from . Two grammars and are equivalent if and only if the languages generated by and are equal:
If all terminal characters appear in the words of the formal languages, then the terminal characters must match. The non-terminal characters, on the other hand, are completely free.
be a grammar with the terminal symbols , the non-terminal symbols , the start symbol and with the rules
Here is the empty word , which is a word of length 0. This grammar defines the language of all words of the form with . So, due to the following derivations, the words , and elements of the language described by :
- , using rule (2),
- , using rules (1), (4) and (6),
- , by means of rules (1), (1), (4), (3), (5), (6) and (7).
But there are other ways to derive the word from .
Another grammar that describes the same language is context-free grammar with the rules:
Classes of grammars
The Chomsky hierarchy divides the grammars according to the type of production rules into classes from type 0 to type 3:
- Type 0 grammars: Phrase structure grammars are unrestricted formal grammars.
- Type 1 grammars: Context-sensitive grammars may only consist of rules in which exactly one non-terminal symbol is replaced by a character string. This symbol may also be surrounded by other symbols on the left side of the rule, which indicate a context within which the replacement must take place.
- Type 2 grammars: In context-free grammars , on the other hand, there may only be exactly one non-terminal symbol on the left side of the rules. The symbol can then be replaced regardless of the context.
- Type 3 grammars: In the case of regular grammars , the left-hand sides of the rules also only contain exactly one non-terminal symbol. In left-regular grammars, the right-hand side of the rule may consist of at most one non-terminal symbol, followed by at most one terminal symbol (e.g. ). In the case of legal grammars, on the other hand, the right-hand side may consist of at most one terminal symbol followed by a maximum of one non-terminal (e.g. ).
The associated language classes are decreasing in size. The following real inclusion relationship exists :
For the language classes by type with the following applies: .
Other language classes
Apart from the Chomsky hierarchy, other classes of grammars have become established:
- Monotonous grammars describe the same language class as the context-sensitive grammars. The increasingly context-sensitive grammars that describe a subclass of the context-sensitive language class are somewhat stricter .
- Deterministic context-free grammars describe the deterministic context-free languages . They are also described by the LR (k) grammars , which are important for building compilers. Other grammars known in compiler construction are LL (k) grammars and LF (k) grammars .
- Graph grammar
- Backus-Naur-Form and Extended Backus-Naur-Form
- Syntax theory on (formal) grammars in linguistics
- Katrin Erk, Lutz Priese: Theoretical Computer Science. A comprehensive introduction . 2nd expanded edition. Springer-Verlag, Berlin et al. 2002, ISBN 3-540-42624-8 , pp. 53-61 .
- Peter Bachmann: Fundamentals of Theoretical Computer Science . Cottbus 2001, p. 47 ( PDF [accessed on February 12, 2011] lecture notes).
- While the meaning of and or in the given case is clear, one has to clarify what is meant with (as well as the frequently used one ) by checking the context; where one-quadruple grammar to the not able to leave.
- Klaus Reinhardt: Priority payer machines and the synchronization of half-track languages ( memento of the original from January 17, 2018 in the Internet Archive ) Info: The archive link was inserted automatically and not yet checked. Please check the original and archive link according to the instructions and then remove this notice. , Faculty of Computer Science at the University of Stuttgart; Doctoral thesis 1994. The grammar quadruple is literally specified here, which means in the designation chosen here .