Tomita parser

A Tomita parser (after Masaru Tomita ) is a parsing method for context-free grammars that is a generalization of the LR (k) method. The procedure is therefore also called the GLR (k) procedure (for Generalized LR (k)).

The starting point of the Tomita parser is the table creation process of the LR (k) method. For grammars that do not have the LR (k) property (including, but not limited to, ambiguous grammars), this process leads to multiple entries, so-called conflicts :

Shift-Reduce conflict : it is possible to put the next input symbol on the stack of the parser or to replace a recognized right side of a production rule with the left side of the rule.
Reduce-Reduce conflict : there are at least two production rules that can be used to achieve a reduction.

The Tomita parser's algorithm follows up these conflicts in a pseudo-parallel manner. The data structure is a so-called. Graph stack ( graph structured stack ) - a directed acyclic graph - used represents all the already completed Parsoperationen.

Graph stack

The used representation of the parsing results is done - similar to the chart parser - by means of a directed acyclic graph.

Fig. 1 .: Graphstack for the sentence She watches the burglar with binoculars

Fig. 1 shows such a graph after the parsing process has been completed for the example sentence “ She observes the intruder with binoculars ”.

The ambiguous grammar used is:

S → DP VP
VP → Vbar
VP → VP PP
Vbar → V _trans DP
DP → Det NP
DP → pron
NP → Nbar
Nbar → N
Nbar → Nbar PP
PP → P DP

For the example sentence, it allows two different syntactic analyzes . Because of this ambiguity, it does not have the LR (k) property, so it leads to multiple entries in the par table.

Each graph node has a unique node name (this starts with "n"). The red nodes contain elements from the grammar vocabulary, i.e. on the one hand non-terminal symbols and on the other hand words ( terminal symbols ). The blue nodes, on the other hand, contain status numbers from the LR par table. One can clearly see how in node n21 two previously different analyzes come together again when integrating the preposition “with”. The following prepositional phrase "with binoculars" is analyzed only once.

Parsing algorithm

As with the LR (k) method, the parsing process consists of a series of table-controlled shift or reduction steps. In the shift step, a word is removed from the input and placed on the stack. In a reduction step, the right-hand side (γ) of a production rule A → γ, which is placed on the stack in reverse order, is replaced by the left-hand rule side (A). In contrast to the LR (k) method, the reduced part is not deleted from the stack, but is retained. This means that the stack is growing monotonically.

The process is controlled by the GLR (k) table. The parser is in a defined state at all times (see push-button machine ). With this status and the next symbol (s) of the input, the GLR (k) table is consulted and the next operations (shift, reduce, accept, error) and the subsequent status are determined. In the case of multiple entries (conflicts), these are followed virtually in parallel. However, subsequent shift operations can re-synchronize these parallel processing threads; this is important for the time complexity of the procedure.

Relationship to other parsing methods

The Tomita parser is a precompiled chart parser .

literature

Dick Grune, Jacobs, Ceriel JH: Parsing Techniques . Springer Science + Business Media, 2008, ISBN 978-0-387-20248-8 .

Masaru Tomita: LR parsers for natural languages . In: Coling 1984: 10th International Conference on Computational Linguistics . 1984, p. 354-357 .

Masaru Tomita: An efficient context-free parsing algorithm for natural languages . In: International Joint Conference on Artificial Intelligence . 1985, p. 756-764 .