Stop word

from Wikipedia, the free encyclopedia

In information retrieval, stop words are words that are ignored in full-text indexing because they occur very frequently and are usually not relevant for capturing a document's content. Older search engines typically kept stop words in a list, removed them from the text, and did not index them. Today most Internet search engines use full indexing: stop words are indexed and displayed, but generally do not contribute to the search.
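The older approach, in which listed words are removed before indexing, can be sketched as follows. This is a minimal illustration and not the implementation of any particular search engine; the names `STOP_WORDS`, `tokenize`, and `build_index`, the short word list, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Illustrative stop word list (assumption); a real list would be much longer.
STOP_WORDS = {"a", "and", "i", "it", "of", "the", "you"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def build_index(docs, stop_words=STOP_WORDS):
    """Map each non-stop word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token not in stop_words:  # stop words never reach the index
                index[token].add(doc_id)
    return dict(index)


# Hypothetical sample documents.
docs = {
    1: "The history of the search engine",
    2: "You and I read the report",
}
index = build_index(docs)
```

With this index, a query for "search" finds document 1, while a query for "the" finds nothing, because the word was never indexed.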

Stop words are usually the most common words in a language; in that case one speaks of a "fixed stop word list". If they are instead the most common words in a particular set of documents (for example files and reports), one speaks of a "calculated stop word list". What all stop words have in common is that they primarily serve grammatical or syntactic functions and therefore allow no conclusions to be drawn about the content of a document. They also share a high frequency: they occur very often within each document and appear in a great many documents, so indexing them would require a great deal of effort.

Recognizing stop words makes search engines more efficient. If stop words were considered in a search, almost every document would be a hit, and such a result would be useless to the user. However, it does not always make sense to ignore stop words completely. Examples are the rock group "The Who" in English, "Die Ärzte" in German, and people with the surname "Weil". With full indexing it is therefore now possible to search for such combinations. In the past, most search engines required an explicit operator for this, for example "+" or a phrase search.
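A calculated stop word list can be derived from the documents themselves. The sketch below treats a word as a stop word if it appears in a large share of the documents; the function name `calculated_stop_words`, the threshold `min_doc_fraction`, and the sample reports are assumptions for illustration, and other criteria (such as taking the top N words by total count) are equally plausible.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def calculated_stop_words(docs, min_doc_fraction=0.8):
    """Return the words that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for text in docs:
        # Use a set so that each word counts once per document.
        doc_freq.update(set(tokenize(text)))
    threshold = min_doc_fraction * len(docs)
    return {word for word, df in doc_freq.items() if df >= threshold}


# Hypothetical sample collection.
reports = [
    "The report covers the budget of the city",
    "The audit of the harbour is complete",
    "The minutes of the council meeting",
    "The schedule of the school holidays",
]
stop_words = calculated_stop_words(reports)
```

In this small collection only "the" and "of" appear in every report, so they form the calculated list, while content words such as "budget" or "harbour" remain searchable.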

Common stop words in German-language documents are definite articles ('der', 'die', 'das'), indefinite articles ('ein', 'eine'), conjunctions (e.g. 'und', 'oder', 'aber', 'weil', meaning 'and', 'or', 'but', 'because'), frequently used prepositions (e.g. 'an', 'in', 'von'), and the negation 'nicht' ('not'). In English, 'a', 'of', 'the', 'I', 'it', 'you' and 'and' are stop words. Depending on the documents to be indexed, stop word lists can also cover several languages. Although they should properly be called stop characters, the period (.), comma (,) and semicolon (;) are also often referred to as stop words. The free software library NLTK includes stop word lists for 21 languages and ready-made methods for using them.
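Filtering a mixed-language text might look like the following sketch. It uses only the Python standard library and two tiny hand-written lists built from the examples above; the names `STOP_WORDS` and `remove_stop_words` are assumptions for the example. In practice, the much longer lists that NLTK provides through `nltk.corpus.stopwords.words("german")` could be substituted, provided the corpus has been downloaded.

```python
import string

# Tiny illustrative lists (assumption); real lists are far longer.
STOP_WORDS = {
    "german": {"der", "die", "das", "ein", "eine", "und", "oder", "aber",
               "weil", "an", "in", "von", "nicht"},
    "english": {"a", "of", "the", "i", "it", "you", "and"},
}


def remove_stop_words(text, languages=("german", "english")):
    """Return the lowercased tokens of text that are not stop words."""
    # Merge the lists of all requested languages into one lookup set.
    combined = set().union(*(STOP_WORDS[lang] for lang in languages))
    kept = []
    for raw in text.split():
        # Strip "stop characters" such as '.', ',' and ';' from the token edges.
        token = raw.strip(string.punctuation).lower()
        if token and token not in combined:
            kept.append(token)
    return kept


result = remove_stop_words("Der Bericht, the report; und die Anlage.")
```

Here `result` contains only `bericht`, `report`, and `anlage`; the articles, the conjunction, and the punctuation are gone.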

Hans Peter Luhn, one of the pioneers of information retrieval, coined the concept of stop words and used it in the design and implementation of the KWIC indexer.

Stop words must be distinguished from so-called blacklists, which are lists of prohibited words. The occurrence of a blacklisted word does not lead to that word being excluded from indexing, but to the exclusion of the entire document.
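The difference between the two kinds of list can be made concrete with a short sketch. The names `STOP_WORDS`, `BLACKLIST`, and `index_terms`, as well as the placeholder entry in the blacklist, are hypothetical and chosen only for illustration.

```python
# Illustrative lists (assumptions for this example).
STOP_WORDS = {"the", "of", "a", "and"}
BLACKLIST = {"forbiddenword"}  # hypothetical placeholder for a prohibited word


def index_terms(text):
    """Return the terms to index, or None if the whole document is rejected."""
    tokens = text.lower().split()
    # A single blacklisted word eliminates the entire document.
    if any(token in BLACKLIST for token in tokens):
        return None
    # A stop word only removes itself; the rest of the document is indexed.
    return [token for token in tokens if token not in STOP_WORDS]


accepted = index_terms("the history of indexing")
rejected = index_terms("a forbiddenword appears here")
```

The first call yields the terms `history` and `indexing`, while the second returns `None` because the document as a whole is dropped.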
