Porter-Stemmer algorithm

The Porter-Stemmer algorithm is a widely used algorithm in computational linguistics for automatically reducing words to their stem (stemming). It is based on a set of shortening rules that are applied to a word until it has reached a minimum number of syllables. Originally developed for English, the algorithm can be ported to other languages with relative ease.

Functionality

Determination of the number of syllables

Strictly speaking, what matters is not the number of syllables but the number of vowel-consonant sequences. Every word can be written as a character string of the form [C](VC)^m[V], where C stands for a sequence of one or more consonants and V for a sequence of one or more vowels; square brackets mark optional parts. The measure m is the number of vowel-consonant sequences between the optional leading consonants and the optional trailing vowels.

Examples:

  • tr-ee, to (m = 0)
  • w-eb, ant (m = 1)
  • b-etw-een (m = 2)
  • W-ik-ip-ed-ia (m = 3)
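
The measure m can be computed by counting how often a vowel is directly followed by a consonant. The following is a minimal sketch, not a reference implementation: the function names `is_consonant` and `measure` are chosen here for illustration, and the treatment of "y" (a vowel when it follows a consonant) is taken from Porter's 1980 paper rather than from the description above.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] counts as a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Assumption from Porter's paper: 'y' is a consonant at the start
        # of a word or after a vowel, and a vowel after a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word: str) -> int:
    """Count the VC sequences (m) in the form [C](VC)^m[V]."""
    word = word.lower()
    m = 0
    prev_was_vowel = False
    for i in range(len(word)):
        cons = is_consonant(word, i)
        # Each switch from a vowel run to a consonant run closes one VC pair.
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m


for w in ["tree", "to", "web", "ant", "between", "Wikipedia"]:
    print(w, measure(w))
```

Run on the example words, this sketch gives 0 for "tree" and "to", 1 for "web" and "ant", 2 for "between" and 3 for "Wikipedia".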

Shortening rules

The shortening rules consist of pairs of a condition and a replacement for various suffixes (word endings). The rules are organized into groups that are processed one after the other. At most one rule from each group is applied.

Example: The first group contains, among others, the suffix rules "sses" → "ss", "ies" → "i" and "s" → "", which yield, for example, "libraries" → "librari" and "Wikis" → "Wiki". A later group consists of the rule "y" → "i", so that the word "library" is reduced to the same stem ("library" → "librari").
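
The group-wise processing can be illustrated with a short sketch covering only the two groups from the example. It is deliberately incomplete and is not the full Porter stemmer. The names `STEP_1A`, `apply_group`, `has_vowel`, `step_1c` and `stem_sketch` are hypothetical. The rule list follows Porter's 1980 paper, in which the first rule reads "sses" → "ss" and an additional rule "ss" → "ss" protects words such as "caress"; the vowel check for the "y" rule is also taken from that paper and is simplified here.

```python
# First rule group (plural endings), checked in this order.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_group(word: str, rules) -> str:
    """Apply the first matching rule of a group; at most one rule fires."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def has_vowel(stem: str) -> bool:
    """Simplified vowel check: a, e, i, o, u, or 'y' after a non-vowel letter."""
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            return True
        if ch == "y" and i > 0 and stem[i - 1] not in "aeiou":
            return True
    return False


def step_1c(word: str) -> str:
    """Later group: 'y' -> 'i', on condition that the stem contains a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


def stem_sketch(word: str) -> str:
    """Run only the two illustrated groups, one after the other."""
    word = word.lower()
    word = apply_group(word, STEP_1A)
    return step_1c(word)


for w in ["libraries", "library", "Wikis", "caresses", "by"]:
    print(w, "->", stem_sketch(w))
```

In this sketch, "libraries" and "library" both end up as "librari", "Wikis" becomes "wiki" (the input is lowercased), "caresses" becomes "caress", and "by" is left unchanged because its stem "b" contains no vowel.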

Implementations

Implementations in several programming languages are available on the website of the Porter-Stemmer algorithm. The string processing language "Snowball", developed by Martin Porter for describing stemming algorithms such as the Porter Stemmer, is available at snowballstem.org. A Porter Stemmer for the German language can also be found there.

Remarks

The stems produced by the algorithm often do not correspond to the linguistically correct word stems. This does not matter, however, because the goal of stemming is not linguistic analysis but the reduction of related words to the same character string.

Like practically all stemming algorithms, the Porter Stemmer is not one hundred percent accurate: for some words it cuts off too much (overstemming) or too little (understemming). In practice, however, it is good enough. (For further background, see the article Stemming.)

Literature

  • M. F. Porter: An algorithm for suffix stripping. In: Program, 14(3), pp. 130-137, July 1980.

References

  1. ^ Martin Porter: Snowball: A language for stemming algorithms. Accessed February 11, 2019.