Text classification

from Wikipedia, the free encyclopedia

Text classification is an important criterion in the field of information extraction.

Differently structured texts call for different extraction processes, which differ from one another in features such as complexity, restrictions, and the extraction procedure itself. Examples include a language-based procedure (Perl) and a wrapper-induction-based procedure. The texts to be analyzed therefore have to be classified first.

The texts are divided according to their structure:

  • Natural and unstructured plain texts,
  • Structured information,
  • Semi-structured texts.
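As a rough illustration of this three-way division, the following sketch sorts a text into one of the categories using two crude heuristics. The function name `classify_structure`, the tag pattern, and the delimiter test are assumptions made for this example and are not taken from any particular information extraction system; real systems would need far more robust checks.

```python
import re

# Assumed heuristic: anything that looks like an HTML tag marks the text
# as semi-structured.
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^>]*>")


def classify_structure(text, delimiter=","):
    """Return 'semi-structured', 'structured', or 'unstructured'."""
    if TAG_PATTERN.search(text):
        return "semi-structured"

    # Assumed heuristic: a table-like text has several non-empty lines that
    # all contain the same, non-zero number of delimiters.
    lines = [line for line in text.splitlines() if line.strip()]
    counts = {line.count(delimiter) for line in lines}
    if len(lines) > 1 and len(counts) == 1 and counts != {0}:
        return "structured"

    return "unstructured"


print(classify_structure("<p>Hello</p> world"))
print(classify_structure("a,b\n1,2\n3,4"))
print(classify_structure("Just a sentence, with a comma.\nAnother sentence."))
```

The order of the checks is a design choice in this sketch: markup is tested first because an HTML page may also contain comma-separated lines.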

Natural and unstructured plain texts

Natural, unstructured plain texts are processed with systems that perform morphological and syntactic analysis. This procedure is very complex and sometimes unnecessary, because in such cases the information sought can be found using simple patterns.
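The point about simple patterns can be sketched with a regular expression. The example below assumes, purely for illustration, that the information sought is a date written as DD.MM.YYYY; the pattern, the function name `extract_dates`, and the sample sentence are hypothetical.

```python
import re

# Hypothetical pattern: a date written as DD.MM.YYYY.
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def extract_dates(text):
    """Return a (day, month, year) tuple for every date-like match in text."""
    return [tuple(int(g) for g in m.groups()) for m in DATE_PATTERN.finditer(text)]


sample = "The meeting was moved from 3.4.2021 to 17.05.2021 after a short discussion."
print(extract_dates(sample))  # [(3, 4, 2021), (17, 5, 2021)]
```

No morphological or syntactic analysis is involved here; the pattern alone locates the values, which is why such an approach can be sufficient when the target has a fixed surface form.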

Structured information

Structured information consists mainly of tables and relational databases. No linguistic analysis is required here: recognizing the structure is enough to find the information sought.
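A minimal sketch of extraction from structured data, assuming the table is available as comma-separated text with a header row. The table contents, column names, and the helper `lookup` are invented for illustration.

```python
import csv
import io

# Hypothetical table; the header row defines the structure.
TABLE = """item,price,currency
lamp,19.99,EUR
chair,45.00,EUR
desk,120.00,USD
"""


def lookup(table_text, key_column, key_value, target_column):
    """Return target_column values from rows where key_column equals key_value."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [row[target_column] for row in reader if row[key_column] == key_value]


print(lookup(TABLE, "currency", "EUR", "item"))  # ['lamp', 'chair']
```

Once the columns are known, the lookup is a plain comparison on field values and requires no analysis of the language in the cells.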

Semi-structured texts

HTML documents are referred to as semi-structured texts and pose a major challenge for information extraction systems. Their structure is inconsistent: some parts are marked up with HTML tags, others are natural-language text. To extract information, a system must recognize both the HTML structure and the text patterns. The HTML tags provide an important clue to the structure.
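The combination of tag structure and text patterns can be sketched with Python's standard `html.parser` module. The class `CellAndTextExtractor`, the price pattern, and the sample page are assumptions for this example: table cells are read directly from the markup, while free text outside the cells is searched with a regular expression.

```python
from html.parser import HTMLParser
import re


class CellAndTextExtractor(HTMLParser):
    """Collect text inside <td> cells separately from all other text."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []      # text found inside <td> ... </td>
        self.free_text = []  # text found anywhere else

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        (self.cells if self.in_cell else self.free_text).append(text)


# Hypothetical text pattern: an amount followed by "EUR".
PRICE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*EUR")

page = """<html><body>
<table><tr><td>Lamp</td><td>19.99 EUR</td></tr></table>
<p>Shipping costs 4.50 EUR within the country.</p>
</body></html>"""

parser = CellAndTextExtractor()
parser.feed(page)
parser.close()

prices_in_text = [p for t in parser.free_text for p in PRICE_PATTERN.findall(t)]
print(parser.cells)     # ['Lamp', '19.99 EUR']
print(prices_in_text)   # ['4.50']
```

This sketch does not handle nested tables or malformed markup; it only shows how the tags serve as a structural clue while patterns cover the natural-language parts.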
