Tokenizer

A tokenizer (also lexical scanner, or lexer for short) is a computer program that decomposes plain text (for example, source code) into sequences of logically related units called tokens. As such, it is often part of a compiler.

Basics

Breaking an input down into a sequence of logically related units, the so-called tokens, is also referred to as lexical analysis. Typically, the decomposition follows the rules of regular grammars, and the tokenizer is implemented as a set of finite automata. The Berry-Sethi method as well as the Thompson construction can be used to convert a regular expression into a nondeterministic finite automaton, which can then be converted into a deterministic finite automaton using the power set construction.
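As a minimal sketch of the last step, the following Python code performs the power set construction on an NFA given as plain transition tables (delta for labelled moves, eps for epsilon moves); this data representation and the small example NFA are assumptions made for this illustration, not a fixed interface.

    from collections import deque

    def epsilon_closure(states, eps):
        # All NFA states reachable from `states` using only epsilon moves.
        closure = set(states)
        stack = list(states)
        while stack:
            s = stack.pop()
            for t in eps.get(s, ()):
                if t not in closure:
                    closure.add(t)
                    stack.append(t)
        return frozenset(closure)

    def subset_construction(nfa_start, nfa_accept, delta, eps, alphabet):
        # Power set construction: every DFA state is a frozenset of NFA states.
        # delta maps (nfa_state, symbol) -> set of NFA states,
        # eps maps nfa_state -> set of NFA states reachable by an epsilon move.
        start = epsilon_closure({nfa_start}, eps)
        dfa_delta = {}
        dfa_accept = set()
        queue = deque([start])
        seen = {start}
        while queue:
            state = queue.popleft()
            if state & set(nfa_accept):
                dfa_accept.add(state)
            for symbol in alphabet:
                target = set()
                for s in state:
                    target |= delta.get((s, symbol), set())
                target = epsilon_closure(target, eps)
                dfa_delta[(state, symbol)] = target
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return start, dfa_delta, dfa_accept

    # Example: NFA for the language a(b|c); no epsilon moves are needed here.
    delta = {(0, "a"): {1}, (1, "b"): {2}, (1, "c"): {2}}
    start, dfa_delta, dfa_accept = subset_construction(0, {2}, delta, {}, {"a", "b", "c"})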

A tokenizer can be part of a parser and serves as a preprocessing step: it recognizes keywords, identifiers, operators and constants within the input. Each of these consists of several characters but forms a logical unit, a so-called token. The tokens are passed on to the parser for further processing (i.e. syntactic analysis).
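A minimal sketch of such a tokenizer, written here in Python using regular expressions, might look as follows; the token classes, patterns and keyword set are illustrative assumptions rather than part of any particular language definition.

    import re

    # Token classes typically recognized by a tokenizer.
    TOKEN_SPEC = [
        ("NUMBER",     r"\d+(?:\.\d+)?"),   # integer or decimal constant
        ("IDENTIFIER", r"[A-Za-z_]\w*"),    # names; keywords are filtered below
        ("OPERATOR",   r"[+\-*/=<>]"),
        ("LPAREN",     r"\("),
        ("RPAREN",     r"\)"),
        ("SKIP",       r"\s+"),             # whitespace is discarded
    ]
    KEYWORDS = {"if", "else", "while", "return"}  # example keyword set

    MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def tokenize(text):
        # Yield (token class, lexeme) pairs for the parser.
        for match in MASTER_RE.finditer(text):
            kind, lexeme = match.lastgroup, match.group()
            if kind == "SKIP":
                continue
            if kind == "IDENTIFIER" and lexeme in KEYWORDS:
                kind = "KEYWORD"
            yield (kind, lexeme)

    # list(tokenize("if x1 > 42 return x1")) yields
    # ("KEYWORD", "if"), ("IDENTIFIER", "x1"), ("OPERATOR", ">"),
    # ("NUMBER", "42"), ("KEYWORD", "return"), ("IDENTIFIER", "x1")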

Tokenizer generators

If a formal description of the vocabulary to be recognized can be given, a tokenizer can be generated automatically. The Lex program included in Unix operating systems and Flex, which was developed as free software, perform exactly this function: from the formal description they generate a function that determines and returns the next token from an input text. This function is then usually used in a parser.
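The following Python sketch only imitates this idea: from a hypothetical formal description, given as (token name, regular expression) pairs, it builds a function that determines and returns the next token from an input text. Lex and Flex themselves generate C code from their own description format; the function below is merely an analogy.

    import re

    def make_next_token(spec):
        # Build a next-token function from a formal description of the
        # vocabulary, given as (token_name, regular_expression) pairs.
        compiled = [(name, re.compile(pattern)) for name, pattern in spec]

        def next_token(text, pos):
            # Return (token_name, lexeme, new_position), preferring the longest
            # match ("maximal munch"); returns None at the end of the input.
            if pos >= len(text):
                return None
            best = None
            for name, regex in compiled:
                m = regex.match(text, pos)
                if m and (best is None or m.end() > best[2]):
                    best = (name, m.group(), m.end())
            if best is None:
                raise ValueError(f"unexpected character at position {pos}: {text[pos]!r}")
            return best

        return next_token

    # Hypothetical description of a tiny vocabulary:
    spec = [("NUMBER", r"\d+"), ("NAME", r"[A-Za-z]+"), ("WS", r"\s+")]
    next_token = make_next_token(spec)
    print(next_token("foo 42", 0))   # ('NAME', 'foo', 3)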
