Vector space retrieval

Vector space retrieval (also vector space model, VSM) is a method of information retrieval in which documents and queries are represented as points in a high-dimensional metric vector space. The mathematical distance between the query vector and the document vector is used to assess how well a document matches the query. The vector space model was first implemented in the SMART system, developed under the direction of Gerard Salton at Cornell University.

Simplified description

In very simplified terms, the model underlying this form of information retrieval can be pictured as follows: each word that can occur in a document is assigned a dimension. To determine the point of a document (or a query) in this vector space, a very simple variant of the vector space model counts how often each word appears in the document. The point of the document in vector space (the document vector) then corresponds to the frequencies of these words. For example, the one-sentence document “The explosion destroys the vegetation” could be described as the vector (0,…, 2,…, 1,…, 1,…, 1,…): the word “the” occurs twice, and “explosion”, “destroys” and “vegetation” once each; other words do not appear (0 times).
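The counting variant described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `tokenize` and `count_vector` and the five-word toy vocabulary are assumptions made for the example, and a real vocabulary would have one dimension for every word in the collection.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_vector(text, vocabulary):
    """Return the word-frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    # A Counter yields 0 for words that do not occur in the text.
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary: one dimension per word, in a fixed order.
vocabulary = ["acid", "destroys", "explosion", "the", "vegetation"]

doc_vector = count_vector("The explosion destroys the vegetation", vocabulary)
print(doc_vector)  # [0, 1, 1, 2, 1]
```

The order of the dimensions is arbitrary, but it must be the same for every document and every query so that the vectors are comparable.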

Search queries can be encoded in the same way. A fictitious search query “Will the explosion destroy the vegetation?” would, if word forms are normalized so that “destroy” and “destroys” count as the same word and the auxiliary “will” is ignored, correspond to exactly the same (query) vector (0,…, 2,…, 1,…, 1,…, 1,…), because the word distribution is then the same. The problem of finding the documents that best match a search query can therefore be solved with the vector space model by looking for those documents whose vector is as "similar" as possible to the query vector. A simple approach is, for example, to search for document vectors that are parallel to the query vector or deviate from it only by a small angle.
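One common way to measure the angle between two vectors is the cosine of that angle, which equals 1 for parallel vectors and 0 for vectors that share no words. The text does not name a specific measure, so cosine similarity is used here as an assumption, and the three example vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document or query is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over the same five dimensions.
query = [0, 1, 1, 2, 1]
doc_a = [0, 2, 2, 4, 2]  # parallel to the query (same words, twice as often)
doc_b = [3, 0, 0, 2, 0]  # shares only one word with the query

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a > sim_b)  # True: doc_a deviates from the query by a smaller angle
```

Because the cosine depends only on the direction of the vectors, a long document that repeats the same words is not preferred over a short one with the same word distribution.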

In practice, vector space models are considerably more complex and take into account, for example, how frequently a word occurs across the whole document collection. Words like “the” or “is” appear in almost every English-language document and therefore carry little meaning, whereas words like “deoxyribonucleic acid” are less common and therefore potentially better suited to distinguishing a document from other content.
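A widely used weighting of this kind is tf-idf, which multiplies the frequency of a word in a document by the logarithm of the inverse fraction of documents containing it. The text does not prescribe a particular formula; the variant below (raw count times log(N/df)), the function name `tf_idf_vectors`, and the three toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    """Weight each word count by log(N / df), one common tf-idf variant."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: in how many documents does each word occur?
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocabulary])
    return vocabulary, idf, vectors

# Hypothetical toy collection, already split into words.
docs = [
    "the explosion destroys the vegetation".split(),
    "the acid is in the cell".split(),
    "the vegetation is green".split(),
]
vocabulary, idf, vectors = tf_idf_vectors(docs)

print(idf["the"])  # 0.0, because "the" occurs in every document
print(idf["explosion"] > idf["vegetation"])  # True: rarer words weigh more
```

With this weighting, a word that occurs in every document contributes nothing to any document vector, while a word confined to a single document receives the highest weight.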

Method

Some preliminary work is necessary to enable vector space retrieval. The first step consists of building a document vector space and indexing the documents, so that each document in the collection is mapped onto exactly one point (its document vector) in the document vector space. A large number of feature weighting models exist for this, all of which are based on the frequency of features such as terms, lemmas or n-grams in individual documents as well as in the entire document collection.
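The choice of feature determines what the dimensions of the vector space are. The following sketch contrasts two of the feature types named above, word terms and character n-grams; both function names are hypothetical, and lemmas are left out because they require a language-specific lemmatizer.

```python
import re

def term_features(text):
    """Features are the lowercased words of the text."""
    return re.findall(r"[a-z]+", text.lower())

def char_ngram_features(text, n=3):
    """Features are overlapping character sequences of length n."""
    normalized = " ".join(term_features(text))
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]

print(term_features("The cat"))        # ['the', 'cat']
print(char_ngram_features("The cat"))  # ['the', 'he ', 'e c', ' ca', 'cat']
```

Either feature list can then be counted and weighted in the same way; only the meaning of a dimension changes.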

Retrieval in the vector space model begins with query indexing, in which the query is mapped to a vector in the same vector space. The subsequent retrieval function determines a subset of the document vectors that have a certain similarity to the query vector, and the ranking function maps this subset onto an ordered list of document vectors. The user who issued the query is presented with the list of documents that corresponds to this list of document vectors.
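The steps described above (document indexing, query indexing, retrieval function, ranking function) can be sketched end to end as follows. This is a minimal sketch under stated assumptions: plain word counts as weights, cosine similarity as the similarity measure, and a hypothetical similarity threshold of 0.1; all function names and the three documents are invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def index_documents(documents):
    """Document indexing: build the vocabulary and one count vector per document."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocabulary])
    return vocabulary, vectors

def index_query(query, vocabulary):
    """Query indexing: map the query into the same vector space."""
    counts = Counter(tokenize(query))
    return [counts[t] for t in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def retrieve(query_vector, doc_vectors, threshold=0.1):
    """Retrieval function: keep documents whose similarity reaches the threshold."""
    scored = [(i, cosine(query_vector, v)) for i, v in enumerate(doc_vectors)]
    return [(i, s) for i, s in scored if s >= threshold]

def rank(hits):
    """Ranking function: order the retrieved documents by descending similarity."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

documents = [
    "The explosion destroys the vegetation",
    "Deoxyribonucleic acid stores genetic information",
    "The vegetation recovers after the fire",
]
vocabulary, doc_vectors = index_documents(documents)
query_vector = index_query("explosion vegetation", vocabulary)
ranked = rank(retrieve(query_vector, doc_vectors))

for doc_id, score in ranked:
    print(f"{score:.3f}  {documents[doc_id]}")
```

For this query the first document ranks above the third, and the second document is not returned at all because it shares no word with the query.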

Software implementing the VSM

Apache Lucene is a Java library for full-text search.
Elasticsearch is a search engine based on Lucene.
Gensim is a library based on Python and NumPy for vector space modeling.
Weka is a software tool that provides various techniques from the areas of machine learning and data mining.
Word2vec is a group of models consisting of shallow, two-layer artificial neural networks that are trained to capture linguistic relationships between words.

Literature

Baeza-Yates, Ricardo; Ribeiro-Neto, Berthier: Modern Information Retrieval. ACM Press, New York, 1999, ISBN 0-201-39829-X.
Ferber, Reginald: Information Retrieval - Search Models and Data Mining Methods for Text Collections and the Web. Heidelberg, 2003, ISBN 3-89864-213-5.
Grossman, D. A.; Frieder, O.: Information Retrieval. Springer, Netherlands, 2nd edition, 2004, ISBN 1-4020-3004-5.
Kowalski, Gerald; Maybury, M. T.: Information Storage and Retrieval Systems. Kluwer, Boston, 2000.
Panyr, Jiří: Automatic classification and information retrieval. Tübingen, 1986.
Panyr, Jiří: Vector space model and cluster analysis in information retrieval systems. In: News for Documentation 38, pp. 13-20, 1987.
Salton, Gerard; McGill, M. J.: Information Retrieval. McGraw-Hill, 1987.

References

[1] The European Technology Platform on Smart Systems Integration (EPoSS)
[2] Software Framework for Topic Modeling with Large Corpora. In: gensim. Retrieved February 3, 2019.
[3] A Beginner's Guide to Word2Vec and Neural Word Embeddings. skymind.ai, accessed on February 3, 2019.
