Chart parser

from Wikipedia, the free encyclopedia

A chart parser (also written chartparser) is a parser for context-free grammars that records partial analyses (partial constituents) in a table, the chart. This caching and reuse of partial analyses improves efficiency considerably and makes parsing of context-free languages a problem that can be solved in polynomial time.

Chart parsing is an umbrella term for all parsing methods that use such a table. Depending on the parsing algorithm used, a distinction is made between different subtypes:

  • Top-down chart parser ( Earley parser )
  • Left corner chart parser
  • Island chart parser

Chart

The chart is an n × n matrix, where n is the length of the sentence to be analyzed. The positions between the words of this sentence are numbered from 0 to n. Each chart cell holds so-called dotted rules (cf. LR parser).

Formally, a chart can be described as a set of 3-tuples <i, j, A → α . β>, where:

  • i (0 ≤ i ≤ n) is the position from which a constituent of category A is expected.
  • j (j ≥ i) is the position up to which α has been recognized.
  • A → α . β is a dotted rule in which β still has to be recognized; β is therefore also called the open part of the rule, α the closed part. α and β are arbitrary (possibly empty) sequences of terminal and non-terminal symbols of the context-free grammar, i.e. α and/or β may also be equal to ε.

Example

A single chart entry can look like this, for example:

<2, 5, VP → V NP . NP>

This means:

  • A verb phrase (VP) is expected from sentence position 2.
  • The analysis of this VP has progressed up to sentence position 5; the closed part V NP of the VP rule lies between positions 2 and 5.
  • A further NP is expected from position 5 onward, provided the VP rule in question can be applied at all.
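
As an illustration (not part of the original description), such a chart entry could be encoded in Python as a tuple (i, j, lhs, closed, open), where closed and open are the parts of the right-hand side before and after the dot; the same assumed representation is used in the sketches further below:

    # The example entry <2, 5, VP → V NP . NP> as an (i, j, lhs, closed, open) tuple:
    # i = 2: the VP is expected from sentence position 2,
    # j = 5: the analysis has reached sentence position 5,
    # closed = ("V", "NP"): the part of the rule before the dot,
    # open = ("NP",): the part still to be recognized.
    entry = (2, 5, "VP", ("V", "NP"), ("NP",))

    i, j, lhs, closed, open_ = entry
    assert closed + open_ == ("V", "NP", "NP")   # the complete rule VP → V NP NP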

Parser operations

Chart parsers typically use three different operations during analysis:

  1. Prediction ( predict )
  2. Integration of a terminal symbol ( scan )
  3. Combination of two constituents ( complete )

Prediction

If <i, j, A → α . B β> ∈ Chart and

B → γ is a rule of the grammar,

then add

<j, j, B → . γ>

to the chart, if this tuple does not yet exist.

A constituent of category B is expected from sentence position j. The grammar contains a rule for expanding B with right-hand side γ, so a new expectation is created to find γ starting at position j.
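
A minimal Python sketch may make this rule concrete; the tuple-based chart representation and all names (predict, grammar, ...) are assumptions of the sketch (see the chart-entry encoding above), not prescribed by the algorithm:

    # Sketch of the predict step: chart entries are (i, j, lhs, closed, open)
    # tuples, `grammar` maps a category to the right-hand sides of its rules.
    def predict(chart, grammar):
        """Return the new entries <j, j, B → . γ> licensed by this rule."""
        new_entries = set()
        for (i, j, lhs, closed, open_part) in chart:
            if open_part:                         # dot is not at the end
                b = open_part[0]                  # expected category B
                for gamma in grammar.get(b, ()):  # grammar rules B → γ
                    entry = (j, j, b, (), tuple(gamma))
                    if entry not in chart:
                        new_entries.add(entry)
        return new_entries

    # Example: from <0, 0, S' → . S> and the rule S → NP VP we obtain
    # the new expectation <0, 0, S → . NP VP>.
    chart = {(0, 0, "S'", (), ("S",))}
    print(predict(chart, {"S": [("NP", "VP")]}))  # {(0, 0, 'S', (), ('NP', 'VP'))}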

Integration of a terminal symbol

If <i, j, A → α . w β> ∈ Chart (w is a terminal or preterminal symbol) and

w is the j-th word of the sentence to be analyzed, s = w0 w1 ... wn,

then add

<i, j + 1, A → α w . β>

to the chart, if this tuple does not yet exist.

The analysis has progressed to the point that a terminal symbol or a word category (such as verb) is expected after position j. If the j-th word actually is w (or is of part of speech w), this word can be integrated into the analysis; the dot is then moved over the recognized word.
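
The scan step can be sketched in the same assumed tuple representation; treating the lexicon as a mapping from preterminals (parts of speech) to word sets is likewise an assumption of this sketch:

    # Sketch of the scan step: chart entries are (i, j, lhs, closed, open)
    # tuples, `lexicon` maps a preterminal to the words of that part of speech.
    def scan(chart, words, lexicon):
        """Return new entries <i, j+1, A → α w . β> for every entry whose
        expected symbol w matches the word after position j."""
        new_entries = set()
        for (i, j, lhs, closed, open_part) in chart:
            if open_part and j < len(words):
                w = open_part[0]                  # expected terminal or preterminal
                if w == words[j] or words[j] in lexicon.get(w, ()):
                    entry = (i, j + 1, lhs, closed + (w,), open_part[1:])
                    if entry not in chart:
                        new_entries.add(entry)
        return new_entries

    # Example: <1, 1, VP → . V NP>, the lexicon rule V → observed and the word
    # "observed" at position 1 yield <1, 2, VP → V . NP>.
    chart = {(1, 1, "VP", (), ("V", "NP"))}
    print(scan(chart, ["Donald", "observed", "Daisy"], {"V": {"observed"}}))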

Combination of two constituents

Note: the combination operation described here is the one used by the top-down chart parser; other chart-parsing methods work slightly differently at this point.

If <i, j, A → α . B β> ∈ Chart (B is a non-terminal symbol) and

<j, k, B → γ . > ∈ Chart,

then add

<i, k, A → α B . β>

to the chart, if this tuple does not yet exist.

During the analysis a complete constituent B has been found between positions j and k. The chart also contains a tuple that expresses the expectation of a constituent B from position j onward. The two can therefore be combined into a new tuple covering positions i to k; the dot is moved over the recognized constituent B.
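
The complete step can be sketched in the same assumed representation:

    # Sketch of the complete step: chart entries are (i, j, lhs, closed, open)
    # tuples as in the sketches above.
    def complete(chart):
        """Combine <i, j, A → α . B β> with a finished entry <j, k, B → γ . >
        into <i, k, A → α B . β>."""
        new_entries = set()
        for (j2, k, b, gamma, open2) in chart:
            if open2:                             # only finished constituents B
                continue
            for (i, j, lhs, closed, open_part) in chart:
                if open_part and open_part[0] == b and j == j2:
                    entry = (i, k, lhs, closed + (b,), open_part[1:])
                    if entry not in chart:
                        new_entries.add(entry)
        return new_entries

    # Example: <0, 1, S → NP . VP> and the finished <1, 3, VP → V NP . >
    # combine to <0, 3, S → NP VP . >.
    chart = {(0, 1, "S", ("NP",), ("VP",)), (1, 3, "VP", ("V", "NP"), ())}
    print(complete(chart))                        # {(0, 3, 'S', ('NP', 'VP'), ())}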

Parsing algorithm

Input: a sentence s = w0 w1 ... wn.

  1. Add <0, 0, S' → . S> to the chart (S is the start symbol of the grammar, S' a new non-terminal symbol that does not otherwise occur in the grammar).
  2. Apply the steps Predict, Scan and Complete described above until no further chart entries can be created.

Output: yes if <0, n, S' → S . > ∈ Chart, otherwise no.

Note: strictly speaking, this is only a recognition procedure. However, with some additional bookkeeping information the actual sentence structures can be reconstructed from the chart (a so-called shared packed parse forest).

The order in which the steps in 2. are applied is not fixed; it can be systematized using various search strategies (depth-first search, breadth-first search, best-first search).
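
Putting the operations together, the whole recognition procedure can be sketched as a single fixed-point loop over the chart. The encoding (tuples, dictionaries, the fresh start symbol written "S'") is again an assumption of the sketch, and lexical prediction and scanning are folded into one step, so intermediate lexical entries do not appear explicitly:

    # Self-contained sketch of an Earley-style top-down chart recognizer.
    # Entries are (i, j, lhs, closed, open) tuples; `grammar` maps a category
    # to its right-hand sides, `lexicon` maps a preterminal to its words.
    def recognize(words, grammar, lexicon, start="S"):
        n = len(words)
        chart = {(0, 0, "S'", (), (start,))}      # initialization with S' → . S
        changed = True
        while changed:                            # repeat until nothing new arises
            changed = False
            for (i, j, lhs, closed, open_) in list(chart):
                if not open_:
                    # complete: attach the finished constituent lhs (spanning i..j)
                    # to every entry that expects it from position i onward
                    new = {(i2, j, lhs2, c2 + (lhs,), o2[1:])
                           for (i2, j2, lhs2, c2, o2) in chart
                           if o2 and o2[0] == lhs and j2 == i}
                else:
                    x = open_[0]
                    # predict: expand the expected category X with its rules
                    new = {(j, j, x, (), tuple(rhs)) for rhs in grammar.get(x, ())}
                    # scan: integrate the next word via a lexicon rule X → word
                    if j < n and words[j] in lexicon.get(x, ()):
                        new.add((i, j + 1, lhs, closed + (x,), open_[1:]))
                if not new.issubset(chart):
                    chart |= new
                    changed = True
        return (0, n, "S'", (start,), ()) in chart

Since only finitely many distinct entries exist for a given sentence and grammar, the loop terminates; like the procedure described above, the sketch answers only the yes/no question.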

Example

Given a context-free grammar with the following production rules:

  1. S → NP VP
  2. S → S PP
  3. VP → V NP
  4. NP → NP PP
  5. NP → Art N
  6. PP → P NP

Lexicon rules

  1. NP → Donald
  2. NP → Daisy
  3. V → observed
  4. N → binoculars
  5. P → with
  6. Art → the

The sentence to be parsed is "Donald observed Daisy with the binoculars".
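
For illustration, the grammar, the lexicon and the example sentence could be encoded as follows for the recognize() sketch given above (the encoding is an assumption of that sketch):

    # The example grammar and lexicon in the encoding assumed by recognize().
    grammar = {
        "S":  [("NP", "VP"), ("S", "PP")],    # rules 1 and 2
        "VP": [("V", "NP")],                  # rule 3
        "NP": [("NP", "PP"), ("Art", "N")],   # rules 4 and 5
        "PP": [("P", "NP")],                  # rule 6
    }
    lexicon = {
        "NP":  {"Donald", "Daisy"},
        "V":   {"observed"},
        "N":   {"binoculars"},
        "P":   {"with"},
        "Art": {"the"},
    }
    words = "Donald observed Daisy with the binoculars".split()
    # recognize(words, grammar, lexicon) from the sketch above returns True.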

No.  Inserted chart entry          Reason
 1   <0, 0, S' → . S>              initialization
 2   <0, 0, S → . NP VP>           P 1, 1
 3   <0, 0, S → . S PP>            P 1, 2
 4   <0, 0, NP → . NP PP>          P 2, 4
 5   <0, 0, NP → . Art N>          P 2, 5
 6   <0, 1, NP → Donald . >        S 2, L1
 7   <0, 1, S → NP . VP>           C 2, 6
 8   <0, 1, NP → NP . PP>          C 4, 6
 9   <1, 1, VP → . V NP>           P 7, 3
10   <1, 1, PP → . P NP>           P 8, 6
11   <1, 2, V → observed . >       S 9, L3
12   <1, 2, VP → V . NP>           C 9, 11
13   <2, 2, NP → . NP PP>          P 12, 4
14   <2, 2, NP → . Art N>          P 12, 5
15   <2, 3, NP → Daisy . >         S 12, L2
16   <1, 3, VP → V NP . >          C 12, 15
17   <2, 3, NP → NP . PP>          C 13, 15
18   <0, 3, S → NP VP . >          C 7, 16
19   <3, 3, PP → . P NP>           P 17, 6
20   <3, 4, PP → with . NP>        S 19, L5
21   <4, 4, NP → . NP PP>          P 20, 4
22   <4, 4, NP → . Art N>          P 20, 5
23   <4, 5, Art → the . >          S 22, L6
24   <0, 3, S → S . PP>            C 3, 18
25   <4, 5, NP → Art . N>          C 22, 23
26   <5, 6, N → binoculars . >     S 25, L4
27   <4, 6, NP → Art N . >         C 25, 26
28   <3, 6, PP → with NP . >       C 20, 27
29   <2, 6, NP → NP PP . >         C 17, 28
30   <1, 6, VP → V NP . >          C 12, 29
31   <0, 6, S → NP VP . >          C 7, 30
32   <0, 6, S → S PP . >           C 24, 28
33   <0, 6, S' → S . >             C 1, 31

Explanation:

  • P n, m = prediction (predict) from entry n using production rule m,
  • S n, Lm = integration (scan) of the word predicted from entry n using lexicon rule m,
  • C n, m = combination (complete) of entries n and m.

The fact that entry 33 could also have been formed by combining entry 1 with entry 32 shows that the sentence can be parsed in two ways (i.e. it is ambiguous).

Extensions

Deletion rules (ε-rules)

Deletion rules are production rules of the form A → ε. Such rules are usually integrated into the chart parser by means of special preprocessing strategies.

Lookahead

The creation of superfluous chart entries can be prevented by adding k lookahead symbols to the chart tuples. This technique is also used in LR(k) parsers.

Separated grammar

For parsing natural languages, so-called separated grammars are generally used, in which the lexicon and the phrase-structure rules are kept apart. The right-hand sides of the rules of the context-free grammar then contain either only terminal symbols (words of the alphabet) or only non-terminal symbols. This makes the predict and scan steps more efficient, since they only need to descend to the level of the preterminals (parts of speech).

Robustness

Since the input to the parser is not always well-formed with respect to the grammar (consider the use in a grammar checker), it is useful to make the parser robust, i.e. tolerant of grammatical errors. This concerns, for example, unknown words, for which all plausible parts of speech are entered in the scan step, or missing or superfluous words, which are recognized by means of special error productions.

Complexity

O(n³) for arbitrary context-free grammars, O(n²) for unambiguous context-free grammars.

Applications

Chart parsers are mostly used for the syntactic analysis of natural languages, since they - along with the Tomita parser - have the best time complexity for arbitrary (i.e. also ambiguous) context-free grammars. For example, the grammar checker in Microsoft Word uses a chart parser. For programming languages, whose syntax has special properties, more efficient parsers such as LR(k) or LL(k) parsers are usually used.

Literature

  • Jay Earley: An Efficient Context-Free Parsing Algorithm. In: Communications of the ACM 13, 1970, ISSN 0001-0782, pp. 94-102.
  • Gerald Gazdar, Chris Mellish: Natural Language Processing in Prolog. An Introduction to Computational Linguistics. Addison-Wesley, Wokingham et al. 1989, ISBN 0-201-18053-7.