Syntax tree

A syntax , derivation or parse tree is a term from theoretical computer science and linguistics . It describes a hierarchical representation of the breakdown of a text. Syntax trees are used both as an aid for graphical visualization of the breakdown and, in the form of a data structure , to display this breakdown for machine processing, e.g. B. in a compiler or translator .

The various terms are not used uniformly in the literature. Only the term derivation tree , which is based on the term derivation, is formally precisely defined . Other names for different types of trees can then be technically defined in more detail, as described below, if necessary.

In contrast to computer science, in which languages can also be defined according to the technical possibilities, linguistics finds more difficult conditions when dealing with natural languages , mainly because the order of the components in a sentence can vary.

introduction

Representation of a derivation tree

In the (mechanical) analysis of natural language sentences or formal texts (e.g. computer programs), directly after the lexical analysis (the breakdown into tokens or symbols ), the symbols are often hierarchically combined to form related parts of sentences ( constituents ) or sections of the formal Text instead. Conversely, this can also be seen as a dissection of the text. The result is a tree like the one shown on the right. In addition to the graphic form, bracketed representations are also used for syntax trees: ${\ displaystyle [_ {S} \ [_ {\ mathit {NP}} \ John] \ [_ {\ mathit {VP}} \ [_ {V} \ hit] \ [_ {\ mathit {NP}} \ [_ {\ mathit {Det}} \ the] \ [_ {N} \ ball]]]]].}$

Technically, the tree on the left is also called a concrete derivation tree , as it shows the resulting structure exactly based on the concrete text. In linguistics, however, models are also common that provide several layers of representation (e.g. surface and deep structure ).

The nodes of the tree are often enriched with attributes (in linguistics these are mainly morphological categories ). This gives you an attributed syntax tree with the associated attributed grammar . While a context-free grammar is used in the first two tree representations , the context dependency comes into play in the latter . These differences are reflected in the Chomsky hierarchy . In such cases, the term semantic analysis is used in compiler construction .

Derivation trees

Consider a context-free grammar . A derivation tree for this is a tree whose nodes are labeled with symbols from (i.e. terminal and non-terminal symbols and the empty word ). The tree is ordered , i.e. H. the children of each node have a fixed order, and the following applies to the labeling: ${\ displaystyle G = \ left (N, \ Sigma, P, S \ right)}$ ${\ displaystyle \ Sigma \ cup N \ cup \ {\ varepsilon \}}$

The root is labeled with the start symbol . This property is sometimes not required. A tree that satisfies them is called a complete derivation tree . ${\ displaystyle S}$
If the children of an inner node labeled with are labeled with the symbols (in that order), the grammar must contain the rule . ${\ displaystyle A}$ ${\ displaystyle z_ {1}, \ ldots, z_ {m}}$ ${\ displaystyle A \ to z_ {1} \ ldots z_ {m}}$
The leaves of the tree are labeled with symbols . ${\ displaystyle \ Sigma \ cup \ {\ varepsilon \}}$
If a leaf is marked with , it is the only successor to its predecessor node. ${\ displaystyle \ varepsilon}$

Only non-terminal symbols can be used as inner nodes, and only the terminal symbols or the empty word for the leaves.

Construction of derivation trees

Possible syntax trees / diagrams can often be easily created for short texts by following the production rules. There are many mechanical methods available for longer texts .

For example, the syntax diagram shown in the introduction u. a. the following rules apply:

${\ displaystyle {\ begin {array} {lll} S & \ rightarrow & NP \ VP \\ NP & \ rightarrow & {\ mathtt {John}} \\ NP & \ rightarrow & Det \ N \\ VP & \ rightarrow & V \ NP \\\ end {array}} \ quad \ quad {\ begin {array} {lll} V & \ rightarrow & {\ mathtt {hit}} \\ Det & \ rightarrow & {\ mathtt {the}} \\ N & \ rightarrow & {\ mathtt {ball}} \ end {array}}}$

In order to generate a derivation tree, the rules can be applied step-by-step from the root by systematically replacing one nonterminal on the left side of the rule with the symbols on the right until only terminals are left:

${\ displaystyle S \ \ Rightarrow \ NP \ VP \ \ Rightarrow \ {\ mathtt {John}} \ VP \ \ Rightarrow \ {\ mathtt {John}} \ V \ NP \ \ Rightarrow \ {\ mathtt {John \ hit }} \ NP \ \ Rightarrow \ {\ mathtt {John \ hit}} \ Det \ N \ Rightarrow \ {\ mathtt {John \ hit \ the}} \ N \ Rightarrow \ {\ mathtt {John \ hit \ the \ ball}}}$

With each of the steps you draw a piece of the syntax tree from top to bottom. But you can also apply the rules the other way around and start with the written sentence and build up the tree step by step from bottom to top.

Derivation trees for unambiguous and ambiguous grammars

If there is more than one derivation tree for a word in the language of a grammar, one speaks of an ambiguous grammar , otherwise of an unambiguous one. For example, the following grammar is ambiguous

${\ displaystyle {\ begin {array} {lll} S & \ rightarrow & S \ S \\ S & \ rightarrow & a \ end {array}}}$

because you can divide "aa a" into two different ways: "[aa] a" and "a [aa]". However, only one possible classification allows the following grammar

${\ displaystyle {\ begin {array} {lll} S & \ rightarrow & a \ S \\ S & \ rightarrow & a \ end {array}}}$

In the case of ambiguous grammars, the number of possible derivation trees for one and the same word can increase sharply with the length of the word. In this case, derivation trees are no longer a suitable representation for the totality of possible derivations. In the case of formal languages, the concrete (surface) grammar is usually formulated clearly. Abstract grammars, on the other hand, are often ambiguous, whereby the uniqueness of the abstract derivation tree then results from the concrete through the course of the analysis.

Abstract syntax trees

For the representation of syntax trees as a data structure in a computer, the term abstract syntax tree (AST) is now used quite uniformly, although the terminology here also fluctuates and z. B. also of abstract derivation trees , operator trees or the like. can be talked about. An exact connection between abstract syntax tree and concrete derivation tree is shown in the literature e.g. T. indicated. However, in addition to a coarsening of the derivation tree, requirements for further processing are also included in the structure, so that a direct formal derivation from the surface grammar is usually unsatisfactory as a result.

The context-free surface grammar is then opposed to an abstract grammar , which in the narrower sense is mostly an algebraic data type . The syntax trees are then technically represented as versatile terms . The analysis is in the transition between grammatical and algebraic-logical terms, so that one can speak fluently here of non-terminals and types or of trees and terms.

example

Concrete (left) and abstract (right) syntax tree for the expression

{\ displaystyle a \ times (b + 3)}

The illustration opposite shows concrete and abstract syntax trees for the following grammars.

concrete grammar	abstract grammar	algebraic type
E :: E "+" T - expression :: T T :: T "*" F - term :: F F :: V factor :: N :: "(" E ")" V - variable N - number	E :: E "+" E :: E "*" E :: V :: N	type E = add (E, E); mul (E, E); var (V); num (N)

The concrete grammar in this example must regulate the order in which the operators are applied to the (partial) expressions - that is, dot before dash and the partial expressions of the same priority are to be grouped together from left to right. The possibility of creating a different summary is also offered with expressions in brackets. Together with certain terminals (here "(", ")", "+", "*") these are merely properties of the syntactic surface that no longer play a role in later analysis and further processing. In particular, the distinction between different types of expressions (here E, T and F) and the key words can be completely dispensed with, as can be seen from the abstract syntax tree, which is also much closer to the "content" of the expression. Furthermore, due to these surface details, concrete derivation trees not only quickly become confusing, but also take up more storage space than necessary as a data structure in the computer due to their details. This is also reflected in the runtime and complexity of the programs that are later to process the derivation tree. For technical reasons, the breakdown of a source text is therefore usually not represented by a specific derivation tree.

Representation of abstract syntax trees

In addition to the graphical representation as (operator) tree shown in the example, abstract syntax trees are also technically noted as terms , e.g. B .: mul(var('a'), add(var('b'), num(3))).

Abstract grammar

While abstract syntax trees are data structures and algebraic types take on the role of grammar in them, in the literature, especially in connection with calculi, often only a coarse, ambiguous grammar is given, which, as shown in the example above, has the same structure as the terms but still contain keywords. This form enables a pleasant writing of abstract syntax trees, which is often very close to the actual source. Usually it is pointed out in the introduction that brackets may be used to disambiguate. An abstract syntax tree for the above example would then actually be a * (b + 3)written down as. In the context of this literature, however, the focus is always on the term. As mentioned, the boundaries between grammar and algebra are blurred by playing with form.

A typical example are the expressions in the lambda calculus , the abstract grammar of which is often only just written down as. The same technique is also used for extensive grammars. ${\ displaystyle E: = \ lambda VE \, | \, EE \, | \, V}$

literature

Ingo Wegener : Theoretical Computer Science . An algorithm-oriented introduction. BG Teubner, Stuttgart, ISBN 3-519-02123-4 , 6.1 Examples of context-free languages and syntax trees, p. 147-148 .
Uwe Schöning : Theoretical Computer Science - in a nutshell . 5th edition. Spectrum Akademischer Verlag, Heidelberg, ISBN 978-3-8274-1824-1 , 1.1.4 Syntax trees, p. 15-17 .
Juraj Hromkovič : Theoretical Computer Science . Formal languages, predictability, complexity theory, algorithms, communication and cryptography. 3. Edition. BG Teubner Verlag, Heidelberg, ISBN 978-3-8351-0043-5 , 10.4 Context-free grammars and push-down automata, p. 378 .
Hans Zima: Compiler I . Analysis. Bibliographisches Institut, Mannheim / Vienna / Zurich 1982, ISBN 3-411-01644-2 , 4.3 Abstract trees and their attribution, p. 216-229 .
Stefan Müller : Grammatical Theory. From transformational grammar to constraint-based approaches. 2nd Edition. Language Science Press, Berlin 2018, ISBN 978-3-96110-074-3 , chap. 2 ( langsci-press.org ).

Individual evidence

↑ Müller (2018), p. 59f.

[1] Müller (2018), p. 59f.