A parser [ ˈpɑːʁzɐ ] ( English to parse, "analyze", or Latin pars, "part"; in German occasionally also splitter ) is a computer program that is used in computer science for the decomposition and conversion of an input into one for further processing more suitable format is responsible. Often, parsers are used to develop the semantics of the input after the analysis process and then to carry out actions.
Compared to a recognizer , which analyzes the input and outputs whether it is correct or incorrect in terms of the specifications , the parser outputs the analysis of an input in a desired form and generates additional structure descriptions.
Syntax analysis (parsing) is also used outside of computer science, e.g. B. in the study of the structure of natural languages . In grammar , the syntax analysis of a sentence would correspond to breaking down the sentence into its grammatical components ( syntax ). See also linguistics .
Application and examples
In general, a parser is used to translate a text into a new structure, e.g. B. in a syntax tree , which expresses the hierarchy between the elements.
- HTML code consists of pure text. The parser contained in a web browser creates the logical structure as a data structure. The appearance of these elements is defined separately via CSS .
- RSS parsers convert RSS feeds into another data format, for example for an HTML page.
- XML parsers analyze XML documents and make the information contained therein (i.e. elements, attributes, etc.) available for further processing.
- URI parsers resolve schemes such as URLs in their hierarchical structure (see RFC 3986 ).
- Log file parsers are used to extract relevant information from web server log files, event logs and other information stored in log files for automated analysis.
- Search engines parse websites and crawl relevant text passages.
- Reading out a programming language . A compiler can then generate machine code or bytecode from the data structure obtained .
- A command line interpreter parses commands and their parameters for the correct execution of the user's instructions (e.g. via COMMAND.COM ).
- In text adventures , the character is controlled by entering commands in natural language, e.g. B. "Unlock the front door with the house key". The parser accesses a database of all manipulable objects in the game and analyzes which interaction with which objects in the game world the player meant with his command input.
Parsers usually use a separate lexical scanner (also called a lexer ) to analyze the text . This breaks down the input data ( present as a simple string of characters) into tokens (input symbols or "words" that the parser understands); Because the tokenization follows a regular grammar , the scanner is usually a finite automaton . These tokens serve as atomic input characters for the parser. Parsers that do not use a separate scanner are called scannerless parsers .
The actual parser as an implementation of an abstract machine (usually implemented as a pushdown automaton ) cares contrast to the grammar of the input, performs a syntax check the input data, and usually created from the data a derivation tree (similar to the English occasionally as Parse -Tree ). This is then used for further processing of the data; Typical applications are semantic analysis , code generation in a compiler or execution by an interpreter .
With HTML, a lexical scanner would break the HTML file into HTML tags and running text and pass these components on to the parser - i.e. H. the scanner is only “interested” in the appearance of the syntax elements (“if it is in angle brackets, it is an HTML tag”). The parser, on the other hand, processes the syntactic relationships, i.e. H. examines which pairs of tags belong together or how the tags are nested within one another; The parser is not interested in the meaning of the tags in terms of content, but is only taken into account by the subsequent processing.
A parser is the software that checks, processes and forwards the instructions in the user's source text.
A distinction is made between different parse methods. The general procedure, i.e. the distinction according to the order in which the nodes of the derivation tree are created ( top-down , also theory-driven parsing or bottom-up , also input-driven parsing, as well as left corner ), specific procedure (LL, LR, SLR, LALR, LC, ...) and implementation technique ( recursively descending , recursively ascending or table-controlled). A distinction is also made according to the type of grammar.
Context-free grammar parser
Here are a few methods based on context-free grammars :
- Top-down parser
- Bottom-up parser
- Chart parser
- Left corner parser
Parser for context-sensitive grammars
- Packrat Parser (Parsing Expression Grammars)
The parsing of well-defined artificial languages (see formal languages , programming languages ) is less complex than the parsing of freely grown natural languages such as English or German, which are characterized by a multitude of ambiguities , irregularities and inconsistencies. See also computational linguistics .
Note: The term parse should not be confused with compile . The latter generates a target code from a source code, including parsing, but other actions also take place.
Parsers are often used to create a tree structure from a sequence of symbols. A typical example of this are mathematical expressions such as
This expression, as it stands here, initially only consists of a series of symbols. It is the task of a tokenizer as part of a parser to identify and classify these symbols (for example, from left to right in reading direction ). The result is a list , which is shown below as a table and can be read line by line:
|sin||Symbol name||(here: the sine function)|
|π||Symbol name||(here: the circle number π)|
The (further) task of the parser is to recognize the underlying structure of this symbol sequence. Often this happens in the form of a parse tree ( abstract syntax tree ), which in this case can look like this:
This is the output of a simple parser. This output can now be analyzed by other programs.
- AW Appel: Modern Compiler Implementation in Java. Cambridge 1998.
- Paul Bennett: A Course in Generalized Phrase Structure Grammar. London: UCL Press 1995. ISBN 1-85728-217-5 .
- Robert D. Borsley: Syntactic Theory. A unified approach. London: Edward Arnold 1991. 2nd revised edition 1998, ISBN 0-340-70610-4 . German translation: Syntax theory. A consolidated approach. Tübingen: Niemeyer Verlag 1997. ISBN 3-484-22055-4 .
- Sven Naumann, Hagen Langer: Parsing. Teubner Verlag 1994.