Tokenization


In computational linguistics, tokenization refers to the segmentation of a text into units at the word level (sometimes also sentences, paragraphs, etc.). Tokenization of a text is a prerequisite for its further processing, for example for syntactic analysis by parsers, in text mining, or in information retrieval.

In computer science, the term refers to the breakdown of a computer program written in a programming language into its smallest units; see token (compiler construction) and token-based compression.

Tokenization issues

Usually a text is broken down into its words during tokenization. Whitespace tokenization is the simplest form of such a decomposition: the text is separated at spaces and punctuation marks. It cannot be used with non-segmenting scripts such as Chinese or Japanese, because they do not place spaces between words.
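
A minimal sketch of whitespace tokenization, assuming a simple Python implementation in which punctuation marks are also split off as separate tokens (the function name and the set of punctuation marks are chosen only for illustration):

 import re

 def whitespace_tokenize(text):
     # Split at whitespace, then detach common punctuation marks as separate tokens.
     tokens = []
     for chunk in text.split():
         tokens.extend(t for t in re.split(r"([.,;:!?])", chunk) if t)
     return tokens

 print(whitespace_tokenize("Punctuation marks, like commas, are split off."))
 # ['Punctuation', 'marks', ',', 'like', 'commas', ',', 'are', 'split', 'off', '.']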

In an alternative tokenization scheme, every sequence of letters forms a token, as does every sequence of digits; each remaining character forms a token on its own.
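
This scheme can be sketched with a single regular expression, assuming a Python implementation (the pattern below is an assumption chosen for illustration):

 import re

 def character_class_tokenize(text):
     # Letter sequences and digit sequences each form one token;
     # every other non-space character forms a token on its own.
     return re.findall(r"[^\W\d_]+|\d+|[^\w\s]|_", text)

 print(character_class_tokenize("Route 66 costs $2.50."))
 # ['Route', '66', 'costs', '$', '2', '.', '50', '.']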

Both methods are problematic in the case of multi-word expressions, especially proper names, currency amounts, etc. For the sentence "Klaus-Rüdiger buys Fish'n'Chips in New York for $2.50.", a segmentation into the following token sequence would be more adequate from a linguistic point of view:

 Klaus-Rüdiger
 buys
 Fish'n'Chips
 in
 New York
 for
 $2.50
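
Neither of the simple methods above produces this sequence: whitespace tokenization splits New York into two tokens and separates $2.50 at the decimal point, while the character-class scheme additionally breaks up Klaus-Rüdiger and Fish'n'Chips. The following is a hypothetical sketch of how a rule-based tokenizer could approximate the desired segmentation, assuming a small hand-written lexicon of multi-word expressions and a currency pattern (both are assumptions for illustration, not a standard algorithm):

 import re

 MULTIWORD = ["New York", "Fish'n'Chips"]    # assumed example lexicon
 CURRENCY = r"\$\d+(?:\.\d{2})?"             # e.g. $2.50

 def lexicon_tokenize(text):
     # Try multi-word expressions and currency amounts first, then fall back
     # to splitting at spaces and detaching sentence-final periods.
     pattern = "|".join([re.escape(m) for m in MULTIWORD] + [CURRENCY, r"[^\s.]+", r"\."])
     return re.findall(pattern, text)

 print(lexicon_tokenize("Klaus-Rüdiger buys Fish'n'Chips in New York for $2.50."))
 # ['Klaus-Rüdiger', 'buys', "Fish'n'Chips", 'in', 'New York', 'for', '$2.50', '.']

The order of the alternatives is the essential design choice in this sketch: Python's regular expressions take the first matching alternative, so the multi-word entries and the currency pattern must come before the general fallback.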
