Context-free grammar

In the theory of formal languages one is context-free grammar ( English context-free grammar , CFG ) is a formal grammar that only those replacement rules contains, where exactly one non-terminal symbol on an arbitrarily long sequence of Nichtterminal- and terminal symbols is derived. The replacement rules have the following form (with a non-terminal symbol and a character string consisting of non-terminal and / or terminal symbols). ${\ displaystyle V \ rightarrow w}$ ${\ displaystyle V}$ ${\ displaystyle w}$

Because the left side of a rule consists of a single nonterminal symbol , its applicability to a character string only depends on whether the nonterminal symbol occurs in the character string, not on the context in which it is located, i.e. H. which characters are to the left and / or right of it. So the rules are context-free. ${\ displaystyle V}$ ${\ displaystyle V}$

The context-free grammars are identical to the type 2 grammars of the Chomsky hierarchy .

definition

A context-free grammar is a 4- tuple with the following properties: ${\ displaystyle G}$ ${\ displaystyle (V, T, P, S)}$

{\ displaystyle V}

is a finite set, called a vocabulary ,

a subset of terminal symbols (also called terminals for short ), ${\ displaystyle T \ subset V}$

This includes the difference set of non-terminal symbols (also called non-terminals or variables for short ). and are disjoint alphabets

{\ displaystyle N: = V \ setminus T}

{\ displaystyle N}

{\ displaystyle T}

a finite set of production rules (short productions ) , ${\ displaystyle P \ subset N \ times V ^ {*}}$
a start symbol . ${\ displaystyle S \ in N}$

Here the Kleenesche describes the shell . ${\ displaystyle ^ {*}}$

Explanation

Some authors alternatively refer to the quadruple as grammar , with the requirement that and are two finite, disjoint sets, and . ${\ displaystyle (N, T, P, S)}$ ${\ displaystyle G}$ ${\ displaystyle N}$ ${\ displaystyle T}$ ${\ displaystyle V: = N \ cup T}$

Occasionally, the non-terminals (variables) are designated differently with and the terminals or the entire vocabulary with . ${\ displaystyle V}$ ${\ displaystyle \ Sigma}$

A rule is usually noted in the form . ${\ displaystyle (\ alpha, \ beta) \ in P}$ ${\ displaystyle \ alpha \ rightarrow \ beta}$

According to the definition, the following applies to a rule , that is, that there is exactly one non-terminal on the left side of the replacement rule . As a rule, it is not surrounded by other characters on the left-hand side, and therefore the same rules are always available for each character string that contains this nonterminal, regardless of which characters surround the nonterminal in a character string. In short, the choice of rules is independent of the context of . ${\ displaystyle \ alpha \ rightarrow \ beta}$ ${\ displaystyle \ alpha \ in N}$ ${\ displaystyle \ alpha}$ ${\ displaystyle \ alpha}$

Speech produced by G

The context-free grammars produce exactly the context-free languages ; That is, every type 2 grammar creates a context-free language and for every context-free language there is a type 2 grammar that creates it.

The production rules are applied in such a way that in a word with R as an infix (partial word, English substring ), this can be replaced by Q, so that a new word is created with as an infix. The set (as a subset of a Cartesian product, a relation ) is thereby expanded to ${\ displaystyle R \ rightarrow Q \ in P}$ ${\ displaystyle w \ in V ^ {\ ast}}$ ${\ displaystyle w ^ {\ prime}}$ ${\ displaystyle Q}$ ${\ displaystyle P}$

{\ displaystyle \ rightsquigarrow _ {G} \;: = \; \ {(uRv, uQv) \ mid u, v \ in V ^ {*} \ land (R, Q) \ in P \}}

.

These substitutions can be made several times: If a word results from a word by using n times , one writes, if this is the case with arbitrary finite application, then . The relation ( derivative ) stands for any finite sequence of rule applications with regard to grammar . See also: Homogeneous Relations . ${\ displaystyle v}$ ${\ displaystyle u}$ ${\ displaystyle \ rightsquigarrow}$ ${\ displaystyle u {\ rightsquigarrow _ {G}} ^ {n} v}$ ${\ displaystyle u {\ rightsquigarrow _ {G}} ^ {*}}$ ${\ displaystyle {\ rightsquigarrow _ {G}} ^ {*}}$ ${\ displaystyle G}$

The context-free language that is generated by the context-free grammar is then defined as the set of all words that can be derived from the start symbol in this way and that only consist of terminals: ${\ displaystyle L (G)}$ ${\ displaystyle G}$

{\ displaystyle L (G) = \ {w \ mid w \ in T ^ {*} \ land S {\ rightsquigarrow _ {G}} ^ {*} w \}}

.

From the start symbol, non-terminals must be replaced with the help of the rules until only terminals are left. Apparently applies . ${\ displaystyle S}$ ${\ displaystyle L (G) \ subseteq T ^ {*}}$

The context-free languages are exactly the languages of a non-deterministic pushdown automata accepted be. If a deterministic push-down automaton also exists , the language is also called deterministic context-free . This real subset of context-free languages forms the theoretical basis for the syntax of most programming languages .

Context-free languages can contain the empty word, e.g. B. by a production rule . However, some sentences about context-free grammars additionally require that the empty word must not be generated by it. So there are z. B. an equivalent grammar in Greibach normal form only for context-free grammars if the empty word cannot be generated by it, since exactly one terminal is generated in each derivation step. ${\ displaystyle (S \ rightarrow \ varepsilon)}$

Normal forms

Various normal forms are defined for context-free grammars. Under the Chomsky Normal Form (CNF) the right-hand sides of the non-terminal productions are restricted, i. H. on the right side there may be either a single terminal symbol or exactly two non-terminal symbols. If the start symbol is on the left, the right side of the production can also be the empty word. Any context-free grammar can be transferred to the CNF using an algorithm .

A context-free grammar is in the Greibach normal form (GNF) if it does not generate the empty word and the right-hand sides of the productions begin with a maximum of one terminal symbol and otherwise only contain non-terminal symbols. Any context-free grammar that does not generate the empty word can be transferred to the GNF with an algorithm.

properties

Word problem

The word problem for context-free languages, i.e. the problem of whether a word can be generated by a context-free grammar, is decidable . On the way to solving the word problem, a derivation tree can also be generated. This derivation tree is also called a parse tree, and a program that creates a parse tree is a parser . A parser can be generated automatically for every context-free grammar (see also CYK algorithm ). The worst-case runtime complexity of a parser for any context-free grammar is (see Landau symbols ). For subclasses of context-free grammars , parsers can be generated whose runtime is in . A typical application of an efficient context-free parser linear term is the parsing of a programming language - the source text by a compiler . ${\ displaystyle w}$ ${\ displaystyle {\ mathcal {O}} \ left (n ^ {3} \ right)}$ ${\ displaystyle {\ mathcal {O}} (n)}$

If a word of the language L ( ) can be generated by the grammar in several different ways, then that grammar is ambiguous . In the case of an ambiguous grammar, a parser can generate not just one but several derivation trees for a given word. Ambiguity is not a problem if only the word problem is to be solved. If, however, a different meaning is assigned to the different derivation trees, then a word can have several different meanings in an ambiguous grammar. An example of the need for a unique context-free grammar is a compiler that must generate deterministically and uniquely executable target code for each valid input. ${\ displaystyle w}$ ${\ displaystyle w \ in L (G)}$ ${\ displaystyle G}$

ambiguity

The problem of whether (any) context-free grammar is ambiguous or non-ambiguous cannot be decided. However, there are test procedures that can determine ambiguity or non-ambiguity for certain sub-classes of context-free grammars. Depending on the test procedure, the ambiguity test does not terminate or the test returns that the ambiguity cannot be determined if the context-free input grammar is not an element of a certain subclass of context-free grammars.

equivalence

The problem of whether two context-free grammars and generate the same language (i.e. whether ) cannot be decided. ${\ displaystyle G_ {1}}$ ${\ displaystyle G_ {2}}$ ${\ displaystyle L (G_ {1}) = L (G_ {2})}$

Subset

The problem of whether the language generated by a context-free grammar is also generated by a context-free grammar (i.e. whether ) cannot be decided. ${\ displaystyle G_ {1}}$ ${\ displaystyle G_ {2}}$ ${\ displaystyle L (G_ {1}) \ subseteq L (G_ {2})}$

Union

The union of the languages of two context-free grammars and can also be generated by a context-free grammar, namely ${\ displaystyle L (G_ {1}) \ cup L (G_ {2})}$ ${\ displaystyle G_ {1} = (V_ {1}, T_ {1}, P_ {1}, S_ {1})}$ ${\ displaystyle G_ {2} = (V_ {2}, T_ {2}, P_ {2}, S_ {2})}$

{\ displaystyle G_ {1} \ cup G_ {2}: = (\ {S \} \ cup V_ {1} \ cup V_ {2}, T_ {1} \ cup T_ {2}, P_ {1} \ cup P_ {2} \ cup \ {S \ rightarrow S_ {1}, S \ rightarrow S_ {2} \}, S)}

.

It is assumed that the two non-terminal sets and are disjoint ( ), and any additional character is ( ), but this can be achieved for all . ${\ displaystyle N_ {1} = V_ {1} \ setminus T_ {1}}$ ${\ displaystyle N_ {2} = V_ {2} \ setminus T_ {2}}$ ${\ displaystyle N_ {1} \ cap N_ {2} = \ emptyset}$ ${\ displaystyle S}$ ${\ displaystyle S \ notin N_ {1} \ cup N_ {2} \ cup T_ {1} \ cup T_ {2}}$ ${\ displaystyle G_ {1}, G_ {2}}$

cut

The problem of whether the intersection of the languages of two context-free grammars is also generated by a context-free grammar cannot be decided. ${\ displaystyle G_ {1}, G_ {2}}$

complement

The complement of a context-free grammar is generally not context-free.

Examples

Be a context-free grammar with and ${\ displaystyle G = (V, T, P, S)}$ ${\ displaystyle N = V \ setminus T}$

${\ displaystyle T = \ {x, y, z \}}$

${\ displaystyle N = \ {S, A, B \}}$

${\ displaystyle P}$ contains 4 productions or production rules:

${\ displaystyle {\ begin {aligned} S & \ rightarrow & A \\ A & \ rightarrow & xAy \\ A & \ rightarrow & xBy \\ B & \ rightarrow & z \ end {aligned}}}$

${\ displaystyle w_ {1} = xxzyy}$ can be generated by the grammar with the following derivation: ${\ displaystyle G}$

${\ displaystyle t (w_ {1}) = S (A (x, A (x, B (z), y), y))}$

${\ displaystyle t (w_ {1})}$ is the derivation tree in term notation. The root and the inner nodes are labeled with non-terminal symbols and the leaves with terminal symbols.

So is . ${\ displaystyle w_ {1} \ in L (G)}$

The example word with is not part of the language because the non-terminal is not the start symbol and every word in the language must be enclosed by the terminal symbols and via the start symbol . In formula notation: ${\ displaystyle w_ {2}}$ ${\ displaystyle w_ {2} = z}$ ${\ displaystyle L (G)}$ ${\ displaystyle B}$ ${\ displaystyle x}$ ${\ displaystyle y}$

${\ displaystyle w_ {2} \ notin L (G)}$

Grammar is not ambiguous. ${\ displaystyle G}$

Language of the palindromes

The grammar with given as creates the language of all palindromes above the alphabet . ${\ displaystyle G (\ {S, a, b \}, \ {a, b \}, P, S)}$ ${\ displaystyle P}$ ${\ displaystyle S \ rightarrow \ varepsilon | a | b | aSa | bSb}$ ${\ displaystyle \ {a, b \}}$

Ambiguous example

An example of an ambiguous grammar is with and ${\ displaystyle G_ {2} = (V_ {2}, T_ {2}, P_ {2}, S_ {2})}$ ${\ displaystyle N_ {2} = V_ {2} \ setminus T_ {2}}$

${\ displaystyle T_ {2} = \ {x, y \}}$

${\ displaystyle N_ {2} = \ {S_ {2}, A \}}$

${\ displaystyle P_ {2}}$ contains the following productions:

${\ displaystyle {\ begin {aligned} S_ {2} & \ rightarrow & A \\ A & \ rightarrow & AA \\ A & \ rightarrow & xAy \\ A & \ rightarrow & \ varepsilon \ end {aligned}}}$

For , among other things, there are the derivatives , and . So is ambiguous. ${\ displaystyle w_ {3} = xy}$ ${\ displaystyle S_ {2} (A (x, A (\ varepsilon), y))}$ ${\ displaystyle S_ {2} (A (A (\ varepsilon), A (x, A (\ varepsilon), y)))}$ ${\ displaystyle S_ {2} (A (A (x, A (\ varepsilon), y), A (\ varepsilon)))}$ ${\ displaystyle G_ {2}}$

extension

An extension of context-free grammars are stochastic context-free grammars (SCFG), also known as probabilistic context-free grammars (PCFG). Here, each production rule is assigned a probability of occurrence: so that there is even for each . ${\ displaystyle \ rho \ colon P \ rightarrow \ mathbb {R} ^ {\ geq 0}}$ ${\ displaystyle \ alpha '\ in N}$ ${\ displaystyle \ sum _ {\ begin {smallmatrix} {\ beta} \\ (\ alpha ', \ beta) \ in P \ end {smallmatrix}} \ rho (\ alpha', \ beta) = 1}$

These probabilities of occurrence of the individual rules induce a probability distribution on the set of words generated by the grammar.

A stochastically context-free grammar can be used, for example, to calculate the most likely parse in a syntactically ambiguous grammar for an input word. Another application is the stochastic sampling of derivation trees under the given rule probabilities of an ambiguous grammar. The language generated by an SCFG is defined in exactly the same way as the language of a CFG. SCFGs are e.g. B. used in bioinformatics and computational linguistics .

literature

John E. Hopcroft , Jeffrey D. Ullman : Introduction to automata theory, languages, and computation . Addison-Wesley, 1979, ISBN 0-201-02988-X , pp. 77 ff .
Taylor L. Booth and Richard A. Thomson: Applying probability measures to abstract languages . In: IEEE Transactions on Computers . C-22, no. 5 , 1973, p. 442-450 , doi : 10.1109 / TC.1973.223746 .
J. Baker: Trainable grammars for speech recognition . In: JJ Wolf and DH Klatt (Eds.): Speech communication papers presented at the 97th meeting of the Acoustical Society of America . MIT, Cambridge, MA June 1979, pp. 547-550 (JASA Vol. 65, issue S1, p. S132 is only the abstract in an abstract volume).
Uwe Schöning : Theoretical Computer Science - in a nutshell . 4th edition. Spektrum Akademischer Verlag, Berlin 2001, ISBN 3-8274-1099-1 , p. 13, 51 .

Individual evidence

↑ Uwe Schöning : Theoretical Computer Science - in short . 5th edition. Spektrum Akademischer Verlag, Heidelberg 2008, ISBN 978-3-8274-1824-1 , p. 13 .
^ Alfred V. Aho and Jeffrey D. Ullman: The Theory of Parsing, Translation, and Compiling. Volume 1: Parsing . Prentice-Hall, 1972, ISBN 0-13-914556-7 , pp. 202 .
↑ HJS Basten: Ambiguity Detection Methods for Context-Free Grammars . August 17, 2007 ( cwi.nl [PDF] Master Thesis).
↑ ^a ^b ^c Schöning, 2001, p. 137.

[1] Uwe Schöning : Theoretical Computer Science - in short . 5th edition. Spektrum Akademischer Verlag, Heidelberg 2008, ISBN 978-3-8274-1824-1 , p. 13 .

[2] Alfred V. Aho and Jeffrey D. Ullman: The Theory of Parsing, Translation, and Compiling. Volume 1: Parsing . Prentice-Hall, 1972, ISBN 0-13-914556-7 , pp. 202 .

[3] HJS Basten: Ambiguity Detection Methods for Context-Free Grammars . August 17, 2007 ( cwi.nl [PDF] Master Thesis).

[Schöning-S137-4] Schöning, 2001, p. 137.