Earley's algorithm

from Wikipedia, the free encyclopedia

The Earley algorithm or Earley parser is an algorithm in computer science that decides whether a word can be generated by a context-free grammar. It was developed by Jay Earley in 1970. It is similar to the Cocke–Younger–Kasami (CYK) algorithm and, like it, solves the word problem for context-free languages. It uses the method of dynamic programming.

Use

One of the tasks of a compiler or parser is to check whether the entered source text follows the rules of the relevant grammar, i.e. whether it conforms to the syntax of the programming language. This is equivalent to solving the word problem. Since most programming languages are context-sensitive, certain conditions are initially ignored, so that only the much simpler (not NP-complete) word problem for context-free languages has to be solved. The context-sensitive side conditions, such as the completeness of the variable declarations, must then be checked with a different algorithm. The first step of the syntax check is thus reduced to the word problem for context-free languages. Both the Earley algorithm and the CYK algorithm solve it in O(n³) time; the Earley algorithm moreover runs in O(n²) for unambiguous grammars and in O(n) for many specific grammars. Both are optimal for solving the problem for general context-free languages. However, the Earley algorithm has the advantage that no conversion of the grammar into Chomsky normal form is necessary. Its disadvantage is the restriction to epsilon-free grammars; epsilon rules can, however, always be eliminated by reformulating the grammar.

In practice, attempts are usually made to avoid or reduce the relatively high computational effort of the two algorithms. The compiler or parser is optimized specifically for the programming language used, which often results in less computational effort. Particularly great improvements can be achieved if the syntax of the programming language is restricted to such an extent that it has LR(1) or even LL(1) properties. This is taken into account when developing newer programming languages. For such programming languages there are algorithms that are faster in practice and require less memory than the Earley parser. For general context-free grammars, however, the Earley parser and related algorithms are superior.

Algorithm

The algorithm takes as input a context-free grammar and a word over the same alphabet. It then decides whether the grammar can generate the word. Unlike the CYK algorithm, it does not work backwards from the start symbol of the grammar, but tries to generate the word character by character. In each calculation step it tries to advance by one character of the word until the whole word has been generated; in that case the word can be generated by the grammar. If the word cannot be generated (i.e. is not contained in the language), the algorithm aborts because it arrives at a character that it cannot predict. For an input word of length n, the algorithm uses state sets Q0, ..., Qn. A state (or Earley state, Earley item, also dotted rule) of the algorithm is a production whose right side is divided by a division point, the dot (written • below). All symbols before this dot are considered to have already been checked. Such a state is contained in a state set Qi and identified by an additional counter. This counter records the set in which the state originally arose, so that the algorithm can later quickly generate a syntax tree with reconstruction steps.
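Such a dotted rule with its origin counter can be sketched as a small record. The following is only an illustration; the class and field names are not part of the algorithm's definition:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EarleyItem:
    """A dotted rule: head -> body[:dot] (dot) body[dot:], begun in set `origin`."""
    head: str        # left-hand side nonterminal
    body: tuple      # right-hand side symbols of the production
    dot: int         # number of symbols already recognized
    origin: int      # index of the state set in which this item was created

    def is_final(self) -> bool:
        # A final state has its dot at the end of the right side.
        return self.dot == len(self.body)

# E -> E • + T, originating in state set 0:
item = EarleyItem("E", ("E", "+", "T"), 1, 0)
```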

At the beginning, Q0 is initialized with the state (S → •α, 0) for every rule S → α, where S is the start symbol of the grammar. The algorithm then runs through the word character by character and, in the i-th step, applies the following three rules repeatedly until no further states can be added. Note that additions to a state set can cause earlier rules to apply again; for example, the predictor must be called again for states that were added by the completer.

Prediction (P)
(predictor)
If (A → α•Bβ, j) is contained in Qi, add for every rule B → γ of the grammar the state (B → •γ, i) to Qi.
Scanning (S)
(scanner)
If (A → α•aβ, j) is contained in Qi and a equals the (i+1)-th character of the input word, add (A → αa•β, j) to Qi+1.
Completion (C)
(completer)
If a final state (B → γ•, j) exists in Qi, add for all states (A → α•Bβ, k) in Qj the state (A → αB•β, k) to Qi.

The three steps are also called predictive expansion, lexical consumption, and constituent completion. In these definitions, lowercase letters denote terminal symbols (also lexical category symbols, terminals), uppercase letters denote nonterminal symbols (also complex syntactic category symbols, non-terminals), and Greek letters denote arbitrary sequences of symbols on the right side of a rule.

The grammar can generate the word precisely when a final state of the start rule, (S → α•, 0), is contained in the last state set Qn at the end.
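The three rules can be turned into a compact recognizer. The following is a minimal sketch, not an optimized implementation: items are plain tuples, the sets are processed with a naive worklist loop, and the example grammar of the next section is hard-coded (the rule T → T*F for multiplication is an assumed addition; the remaining rules appear in the pseudocode output further below). An epsilon-free grammar is assumed, in line with the restriction mentioned above.

```python
# A minimal Earley recognizer sketch for an epsilon-free grammar.
# Items are tuples (head, body, dot, origin).

GRAMMAR = {
    "S": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("n",)],
}

def recognize(word):
    """Decide whether GRAMMAR generates the given sequence of symbols."""
    n = len(word)
    Q = [[] for _ in range(n + 1)]          # state sets Q0 .. Qn
    for rule in GRAMMAR["S"]:
        Q[0].append(("S", rule, 0, 0))      # start rules with the dot in front
    for i in range(n + 1):
        pos = 0
        while pos < len(Q[i]):              # worklist: new states are appended
            head, body, dot, origin = Q[i][pos]
            pos += 1
            if dot < len(body) and body[dot] in GRAMMAR:
                # Prediction: expand the nonterminal after the dot.
                for rule in GRAMMAR[body[dot]]:
                    item = (body[dot], rule, 0, i)
                    if item not in Q[i]:
                        Q[i].append(item)
            elif dot < len(body):
                # Scanning: move the dot over a matching input symbol.
                if i < n and word[i] == body[dot]:
                    item = (head, body, dot + 1, origin)
                    if item not in Q[i + 1]:
                        Q[i + 1].append(item)
            else:
                # Completion: advance every state waiting for `head` in Q[origin].
                for h2, b2, d2, o2 in Q[origin]:
                    if d2 < len(b2) and b2[d2] == head:
                        item = (h2, b2, d2 + 1, o2)
                        if item not in Q[i]:
                            Q[i].append(item)
    # Accept iff a completed start rule with counter 0 is in the last set.
    return any(h == "S" and d == len(b) and o == 0 for h, b, d, o in Q[n])

print(recognize("n+n"))   # prints True
```

The membership tests on plain lists keep the sketch short; a practical implementation would use sets of items per state set to avoid the quadratic duplicate checks.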

The individual states then have to be linked to one another again using a suitable recursive search algorithm ( walker ) in order to generate the syntax tree.

Example: simple math expression

The following grammar (application example from [1])

  S → E
  E → E + T
  E → T
  T → T * F
  T → F
  F → n

describes simple mathematical expressions. The symbols stand for start (S), expression (E), term (T), factor (F) and number (n, a placeholder for numbers; here, additional rules for the exact structure of numbers could be inserted). As an example, the expression n + n is to be recognized. The state sets in memory after the Earley algorithm has run are shown in the tables Q0 to Q3 below.

Q0 (start):
   S → •E    , 0
P: E → •E+T  , 0
P: E → •T    , 0
P: T → •T*F  , 0
P: T → •F    , 0
P: F → •n    , 0

Q1 (after reading n):
S: F → n•    , 0
C: T → F•    , 0
C: E → T•    , 0
C: T → T•*F  , 0
C: E → E•+T  , 0
C: S → E•    , 0

Q2 (after reading +):
S: E → E+•T  , 0
P: T → •T*F  , 2
P: T → •F    , 2
P: F → •n    , 2

Q3 (after reading n):
S: F → n•    , 2
C: T → F•    , 2
C: E → E+T•  , 0
C: T → T•*F  , 2
C: S → E•    , 0
C: E → E•+T  , 0

They were generated by repeated application of the three steps prediction (P), scanning (S) and completion (C), as indicated. The final states are those whose dot has reached the end of the rule; up to the corresponding position, the expression matches the given rule. However, only if, as in this example, the final state of the start rule (S → E•, 0) is contained in the last state set has the entire expression been successfully recognized, and is consequently generated by the grammar. With a recursive search starting from this last state, the path can now be traced back to the beginning and a syntax tree can be generated.

The syntax tree being sought results as:

  S
  └ E
    ├ E
    │ └ T
    │   └ F
    │     └ n
    ├ +
    └ T
      └ F
        └ n

Construction of the syntax tree

For the computation of the syntax tree, J. Earley suggested in his original work in 1970 that the states should be traced back, with each state storing a pointer to the state that created it. Tomita showed in 1986 that this approach leads to incorrect syntax trees for some ambiguous grammars. To compute syntax trees for general context-free grammars, a recursive search must be performed.

This starts with the final start rule, which is contained in the last state set. With a recursive search, the state sets and the input string are traversed backwards. This can in turn be implemented with a kind of backward scanner, backward predictor and backward completer, until the initial start rule has been reproduced. The path back to the beginning reflects the syntax tree being sought. In ambiguous situations, all options must be traversed recursively. If several correct syntax trees are to be computed for an ambiguous grammar, each state found is saved and the recursion continues until all ambiguities have been processed at all recursion levels.

In the following, an algorithm is described that only requires the final states and the input string for the recursion; non-final states can be deleted beforehand. It is a modification of the algorithm described in D. Grune's textbook, which does not need the input string for the recursion but instead keeps all non-final states. Both algorithms should in principle be equivalent.

[Figure Earleytree.svg: the stack of nested rules for the example]

In the figure, the recursive search is equivalent to walking from right to left across the stack of nested rules. The top recursion level of the search corresponds to the right end of the lower start rule, and the bottom recursion level to its left end; the recursion levels of the search therefore run vertically in the figure. Initially the dot is at the far right, at the end of every state, and is pushed back to the beginning during tree construction. If the dot is at the beginning of all states that are part of the tree, a complete tree has been found. Each step across the stack involves one of three cases:

Backward predictor (one level up)
This case occurs when a nonterminal symbol precedes the dot. The still empty gap is filled with a suitable state that corresponds to the symbol in front of the dot. A state may only be inserted if either there is no gap between the start positions or at least as many positions remain free as there are symbols in between; otherwise incorrect trees can arise. If there are several states to choose from, one has to return to this point later and try the next alternative. Once a new state has been successfully placed, the search moves up one level.
Backward scanner (one step to the left)
This case occurs when a terminal symbol precedes the dot. Its agreement with the input must be checked again. If it matches, the dot of the state is shifted one position to the left, and one step to the left is taken in the figure.
Backward completer (one level down)
This case occurs when the dot has reached the beginning of a rule. In this case the search moves one level down. In the state in which it arrives, the dot is shifted back by one position.

The algorithm can get stuck along the way if the backward scanner detects a mismatch or the backward predictor has no alternative left. In that case one has to go back to the previous backward predictor and try the next alternative there. If one finally exits the stack at the top left, a possible syntax tree has been found. To find the further trees of an ambiguous grammar, the process continues with the last backward predictor, where the search is resumed until the next complete alternative.

Pseudocode

The following pseudocode describes the entire sequence of the algorithm:

Given the symbols s1...sN to be parsed
(Q0,...,QN) = ParseText(s1...sN)
If the final start rule "S -> ...•" is contained in QN:
 i = N
 j = 0
 b = list of Earley states with the final start rule as its only element
 TreeRecursion(i,j,b,(Q0,...,QN))
Else:
 report a syntax error
Function ParseText(s1...sN):
 Create empty state sets Qi with i=0..N
 Add the start rule "S -> •..." to Q0
 For i=0..N:
   Add Earley states to Qi according to the three rules predictor, scanner
       and completer, until no more can be added
 Return the state sets (Q0,...,QN)
Function TreeRecursion(i,j,b,(Q0,...,QN)):
 //i: index of the current state set Qi
 //j: index of the current state bj in the tree b
 While i > 0:
   //Backward predictor
   If a nonterminal precedes the dot in state bj:
     Wj = symbol of the nonterminal before the dot of state bj
     kj = start index of state bj
     pj = dot position of state bj (0 corresponds to the beginning)
     For all final states z in Qi with symbol W, start index k and dot position p
         such that W = Wj and ((pj-1 = 0 and k = kj) or (pj-1 > 0 and k-kj >= pj-1)):
       Append a copy of z to the end of b (alternatively: insert z after bj)
       TreeRecursion(i,j+1,b,(Q0,...,QN))
       Delete the last element of b (alternatively: delete element bj+1)
     End the function call (return)
   //Backward scanner
   If a terminal precedes the dot in state bj:
     If the terminal before the dot equals si:
       Move the dot of bj back by one position
       i = i-1
   //Backward completer
   If the dot of bj is at the beginning:
     j = j-1
     Move the dot of bj back by one position
 If bj is the start rule:
   Output the syntax tree b that was found
//Output for the grammar shown above and the example n+n
b =
  S -> •E   , 0
  E -> •E+T , 0
  E -> •T   , 0
  T -> •F   , 0
  F -> •n   , 0
  T -> •F   , 2
  F -> •n   , 2

Evaluation of the syntax tree

In order to evaluate a mathematical expression of the example grammar shown, a computation rule can be generated directly with the help of a stack. To do this, the determined list of states is reversed (or traversed backwards) and all rules that are not relevant for the calculation are deleted or ignored. In each step, the corresponding number of input values of the operation is popped from the stack, or the result of the operation is pushed back onto the stack. For the example above this results in:

F -> •n   , 2: push n onto the stack
T -> •F   , 2
F -> •n   , 0: push n onto the stack
T -> •F   , 0
E -> •T   , 0
E -> •E+T , 0: pop two values from the stack, add
        them and push the result onto the stack
S -> •E   , 0: pop the value from the stack and output it

In this way, arbitrarily complex and nested mathematical expressions can be parsed and evaluated.
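The stack evaluation described above can be written out directly. The following is a small sketch; the function name `evaluate`, the rule strings and the concrete values for the two n placeholders are illustrative assumptions:

```python
# Evaluate the reversed rule list of the example n+n with a stack.
# Rules that are irrelevant for the computation (E -> T, T -> F) are ignored.

def evaluate(rules, values):
    """rules: reversed parse rules; values: numbers for the n placeholders."""
    stack = []
    inputs = list(values)
    result = None
    for rule in rules:
        if rule == "F -> n":
            # Push the next number (reversed order: the last n comes first).
            stack.append(inputs.pop())
        elif rule == "E -> E+T":
            b, a = stack.pop(), stack.pop()  # pop two values from the stack,
            stack.append(a + b)              # add them, push the result back
        elif rule == "S -> E":
            result = stack.pop()             # pop the final value
    return result

# The reversed state list from the example, with 2 + 3 for the two n's:
total = evaluate(
    ["F -> n", "T -> F", "F -> n", "T -> F", "E -> T", "E -> E+T", "S -> E"],
    [2, 3],
)
print(total)   # prints 5
```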

Literature

  • Jay Earley: An efficient context-free parsing algorithm. In: Communications of the Association for Computing Machinery. Vol. 13, No. 2, 1970, pp. 94–102 (ccl.pku.edu.cn [PDF; 902 kB]).
  • John Aycock, R. Nigel Horspool: Practical Earley Parsing. In: The Computer Journal. Vol. 45, No. 6, 2002, pp. 620–630 (cs.uvic.ca [PDF; 162 kB]).
  • Dick Grune, Ceriel J. H. Jacobs: Parsing Techniques. A Practical Guide. 1st edition. Ellis Horwood, New York 1990, ISBN 0-13-651431-6, pp. 149–163 (ftp.cs.vu.nl [PDF; 1.9 MB]).

References

  1. J. Aycock, N. Horspool: Directly-Executable Earley Parsing. In: Lecture Notes in Computer Science. Vol. 2027, 2001, pp. 229–243, doi:10.1007/3-540-45306-7 (rsdn.ru PDF).
  2. Masaru Tomita: Efficient Parsing for Natural Language. Kluwer Academic Publishers, Boston 1986, p. 201.
  3. Dick Grune, Ceriel J. H. Jacobs: Parsing Techniques. A Practical Guide. 1st edition. Ellis Horwood, New York 1990, ISBN 0-13-651431-6, chapter 7.2.1.2, pp. 153–154 (ftp.cs.vu.nl [PDF; 1.9 MB]).