Latent Semantic Analysis

Latent Semantic Indexing (LSI for short) is an information retrieval method, no longer patent-protected, that was first described in 1990 by Deerwester et al. Methods such as LSI are of particular interest for searching large amounts of data such as the Internet. The goal of LSI is to find the principal components of documents. These principal components (concepts) can be thought of as general terms. For example, horse is a concept that includes terms such as mare or nag. The method is therefore suitable, for example, for picking out from a large number of documents (such as those found on the Internet) the ones that deal thematically with 'cars', even if the word car does not appear in them explicitly. Furthermore, LSI can help to distinguish articles that are really about cars from those in which the word car is merely mentioned (such as pages where a car is advertised as a prize).

Math background

The name LSI reflects the fact that the term-document matrix (hereinafter TD matrix) is approximated by means of a singular value decomposition. This carries out a dimension reduction to the units of meaning (concepts) of a document, which simplifies the further computation.

LSI is an additional method built on top of vector space retrieval. The TD matrix known from that model is further processed by LSI in order to reduce its size. This is particularly useful for larger document collections, where the TD matrices are generally very large. For this purpose, the TD matrix is decomposed using the singular value decomposition, and "unimportant" parts of it are then cut off. This reduction in size saves complexity and thus computing time in the retrieval step (comparison of documents or queries).

At the end of the algorithm there is a new, smaller TD matrix, in which the terms of the original TD matrix are generalized into concepts.
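As a minimal sketch of this idea in Python (the small term-document matrix and the choice k = 2 below are invented purely for illustration), the TD matrix can be approximated by keeping only its k largest singular values:

    import numpy as np

    # Hypothetical term-document matrix: rows = terms, columns = documents.
    A = np.array([
        [2., 0., 1., 0.],
        [1., 0., 2., 0.],
        [0., 3., 0., 1.],
        [0., 1., 0., 2.],
    ])

    # Singular value decomposition: A = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Dimension reduction: keep only the k largest singular values (concepts).
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(A_k, 2))  # rank-k approximation of the original TD matrix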

The semantic space

Illustration: a typical term-document matrix (1), specifying for each term (word) in which documents it occurs. The original 5×7 dimensions can be reduced to 2×7 here (2): brain and lung are combined into one concept; data, information, and retrieval into another.

Using the singular value decomposition, the TD matrix is split into matrices built from its eigenvectors and eigenvalues. The idea is that the TD matrix (as a representation of the documents) consists of main dimensions (words that are important for the meaning of a document) and less important dimensions (words that contribute relatively little to its meaning). The former should be retained, while the latter can be neglected. Concepts, i.e. words that are similar in meaning (ideally synonyms), can also be merged here, so LSI generalizes the meaning of words. In this way, the usual number of dimensions can be reduced significantly, since only the concepts are compared. If the examined documents (texts) consisted of the four words horse, steed, door, and gate, then horse and steed would be combined into one concept, and door and gate into another. The number of dimensions is thereby reduced from 4 (in the original TD matrix) to 2 (in the generalized TD matrix). One can easily imagine that with large TD matrices the savings are enormous in favorable cases. This dimension-reduced, approximating TD matrix is called the semantic space.
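A small worked sketch of this four-word example (the document counts below are invented for illustration) shows how terms with similar co-occurrence patterns end up on the same concept axis:

    import numpy as np

    terms = ["horse", "steed", "door", "gate"]
    # Hypothetical 4x6 TD matrix: horse/steed occur together in the first
    # three documents, door/gate together in the last three.
    A = np.array([
        [2., 1., 1., 0., 0., 0.],   # horse
        [1., 2., 1., 0., 0., 0.],   # steed
        [0., 0., 0., 1., 2., 2.],   # door
        [0., 0., 0., 1., 1., 2.],   # gate
    ])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # The first two left singular vectors define the two concept axes.
    for term, row in zip(terms, U[:, :2]):
        print(f"{term:6s} -> concept coordinates {np.round(row, 2)}")
    # horse and steed load on one concept axis, door and gate on the other.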

Algorithm

  • The term-document matrix is calculated and, if necessary, weighted, e.g. by means of tf-idf.
  • The term-document matrix $A$ is now broken down into three components (singular value decomposition):
$A = U \Sigma V^T$.
The two orthogonal matrices $U$ and $V$ contain the eigenvectors of $A A^T$ and $A^T A$, respectively; $\Sigma$ is a diagonal matrix with the square roots of the eigenvalues of $A^T A$, also called the singular values.
  • The dimension reduction can now be controlled via the singular values in the generated matrix $\Sigma$. This is done by successively omitting the smallest singular values up to a chosen limit $k$.
  • In order to process a search query $q$ (q for query), it is mapped into the semantic space; $q$ is viewed as a special case of a document. The (possibly weighted) query vector is mapped with the following formula:
$\hat{q} = \Sigma_k^{-1} U_k^T q$
Here $U_k$ consists of the first $k$ columns of $U$ and $\Sigma_k$ of the first $k$ diagonal elements of $\Sigma$.
  • Each document $d$ is mapped into the semantic space in the same way, as $\hat{d} = \Sigma_k^{-1} U_k^T d$. Then, for example, a comparison between $\hat{q}$ and the document $\hat{d}$ can be made via the cosine similarity or the scalar product (see the sketch after this list).
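A compact sketch of these steps in Python (the toy documents, the rank k = 2, and the use of raw counts instead of tf-idf weighting are all assumptions made for illustration):

    import numpy as np

    # 1. Hypothetical term-document matrix (rows = terms, columns = documents).
    terms = ["car", "engine", "road", "flower", "garden"]
    A = np.array([
        [2., 1., 0., 0.],   # car
        [1., 2., 0., 0.],   # engine
        [1., 1., 0., 1.],   # road
        [0., 0., 2., 1.],   # flower
        [0., 0., 1., 2.],   # garden
    ])

    # 2. Singular value decomposition A = U diag(s) Vt.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # 3. Dimension reduction: keep only the k largest singular values.
    k = 2
    U_k, S_k_inv = U[:, :k], np.diag(1.0 / s[:k])

    # 4. Map a query into the semantic space: q_hat = S_k^{-1} U_k^T q.
    q = np.array([1., 0., 1., 0., 0.])        # query containing "car" and "road"
    q_hat = S_k_inv @ U_k.T @ q

    # 5. Map every document the same way and compare via cosine similarity.
    docs_hat = (S_k_inv @ U_k.T @ A).T        # one row per document
    cos = (docs_hat @ q_hat) / (
        np.linalg.norm(docs_hat, axis=1) * np.linalg.norm(q_hat) + 1e-12)
    print("cosine similarities:", np.round(cos, 2))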

Advantages and disadvantages of the procedure

The semantic space (i.e. the TD matrix reduced to meanings) reflects the structure underlying the documents, that is, their semantics. The approximate position of the vectors from vector space retrieval is retained. The projection onto the concept dimensions then indicates the affiliation to a concept (steps 4 and 5 of the algorithm). Latent Semantic Indexing elegantly solves the synonym problem, but only partially solves polysemy, i.e. the fact that the same word can have different meanings. The algorithm is very computationally expensive: the cost of the singular value decomposition grows polynomially in $n$, the number of documents plus the number of terms, and in the number of dimensions $k$. This problem can be mitigated by using the Lanczos method to economically compute a reduced TD matrix from the start (see the sketch below). The singular value decomposition must also be repeated whenever new terms or documents are added. A further difficulty is the dimension problem: to how many dimensions should the term-document matrix be reduced, i.e. how large should $k$ be chosen?
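As a sketch of this shortcut (the random sparse matrix and the choice of k are placeholders), a Lanczos-type solver such as SciPy's svds computes only the k largest singular values and vectors instead of the full decomposition:

    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Hypothetical sparse term-document matrix (2,000 terms x 500 documents).
    A = sparse_random(2_000, 500, density=0.01, format="csr", random_state=0)

    # Truncated SVD via a Lanczos-type method (ARPACK): only the k largest
    # singular values/vectors are computed, not the full decomposition.
    k = 50
    U_k, s_k, Vt_k = svds(A, k=k)
    print("reduced document representation:", Vt_k.shape)  # (k, number of documents)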

Sources

  1. US4839853 and CA1306062, patent protection has expired
  2. Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, Richard Harshman: Indexing by Latent Semantic Analysis. In: Journal of the American Society for Information Science. 1990.
  3. E. Leopold: On Semantic Spaces. In: LDV-Forum. Zeitschrift für Computerlinguistik und Sprachtechnologie. Vol. 20, Issue 1, 2005, pp. 63-86.