Weighted information retrieval

Weighted information retrieval, also weighted retrieval, is a method of information science for acquiring information with search terms (terms for short). It is divided into term-based methods and weighted Boolean retrieval. Modern methods build on the theories of Stephen Robertson and Karen Spärck Jones.

Search engines such as Google use such methods to rank their hit lists.

Weighted Boolean Retrieval

In "An Investigation of the Laws of Thought" (1854), George Boole described the theory that logical propositions are always binary, distinguishable into true and false. Boolean retrieval systems work accordingly with precisely specified terms. The operators AND, NOT and OR combine several search terms without assigning priorities based on relevance (relevance ranking). Placeholders for exactly one character or for any number of characters must be specified manually; the systems have no fault tolerance.

In weighted Boolean retrieval, the users weight the terms by assigning appropriate values, for example information retrieval with a value of 0.75 and the Journal of the American Society for Information Science with 0.25:

<"Information Retrieval"; 0.75> AND <"Journal of the American Society for Information Science"; 0.25>
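
How such a weighted query might be evaluated is illustrated by the following minimal sketch (Python). The scoring rule and all names are illustrative assumptions, since weighted Boolean systems differ in how they combine weights; one common interpretation scores AND with the minimum and OR with the maximum of the weighted term matches:

    # Minimal sketch of weighted Boolean scoring; one of several
    # possible interpretations, with illustrative names only.
    def term_score(document_terms, term, weight):
        # A term contributes its user-assigned weight if it occurs
        # in the document, otherwise nothing.
        return weight if term in document_terms else 0.0

    def score_and(scores):
        # AND: the weakest weighted match dominates.
        return min(scores)

    def score_or(scores):
        # OR: the strongest weighted match dominates.
        return max(scores)

    doc = {"information retrieval", "library science"}
    s = score_and([
        term_score(doc, "information retrieval", 0.75),
        term_score(doc, "journal of the american society for information science", 0.25),
    ])
    # s == 0.0: the AND query fails because the second term is missing.

Under this interpretation a document satisfies the AND query only if every term occurs; the weights then determine how strongly the match counts in the ranking.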

Term-based weighted retrieval distinguishes three types of terms: language elements, proper names and units of meaning. The relevant terms are identified for each document, and a weight is determined for each term, for example from its frequency in the document. This refinement increases the quality of the results.
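
One simple way to derive such weights is the relative frequency of a term in the document, as in the following sketch (Python). Plain whitespace tokenization is an assumption here; real systems would first normalize language elements, proper names and units of meaning:

    from collections import Counter

    def term_weights(text):
        # Weight each term by its relative frequency in the document
        # (one simple choice; tf-idf and other schemes are also common).
        tokens = text.lower().split()
        counts = Counter(tokens)
        total = sum(counts.values())
        return {term: count / total for term, count in counts.items()}

    weights = term_weights("retrieval weights a term by the frequency of the term")
    # 'term' occurs twice among ten tokens, so its weight is 0.2.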

Probabilistic information retrieval

Spärck Jones developed a model with probabilistic statements that determines the relevance of a search query across all available documents. Sought is the conditional probability that a document d is relevant, that is, the probability of the event R that the user deems d relevant; this event depends on the search query and on user assessments. Bayes' theorem allows this relevance to be calculated in advance from conditional probabilities:

P(R | d) = P(d | R) · P(R) / P(d)

The simplest approach, without assessments, is rarely used in practice. After the search query, a feedback loop is needed to determine whether the documents found were actually relevant. If the system makes this assessment itself, this is called pseudo-relevance feedback. A subsequent search query delivers better results if the assessments are evaluated cleverly. This loop checks relevance either through the user or through the system. It is expedient to evaluate the documents as a function of one another, i.e. relatively, and to calculate relevance from that.
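
Such a loop could look like the following sketch (Python). Here 'search' and 'reweight' are hypothetical placeholders for the system's retrieval and weighting functions, and treating the top-ranked documents as relevant is exactly the pseudo-relevance-feedback assumption:

    def pseudo_relevance_feedback(search, reweight, query, rounds=2, top_k=5):
        # Initial retrieval for the original query.
        results = search(query)
        for _ in range(rounds):
            # The system assesses relevance itself: the top-ranked
            # documents are simply assumed to be relevant.
            assumed_relevant = results[:top_k]
            # Adjust the term weights based on these assessments ...
            query = reweight(query, assumed_relevant)
            # ... and search again with the improved query.
            results = search(query)
        return results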

For example, a user is presented with the same document in two situations. If he already has other documents that cover his information need, the presented document will be considered less relevant. If he has only irrelevant documents, the same document may appear extremely relevant.

Relevance is a relative measure of the importance of the documents from the user's perspective. Each search query leads to a list of N documents, of which R are relevant and N − R are not relevant. The second criterion checks whether a document contains the search term: of the n documents that contain the term, r are also relevant. Classifying all documents by these two criteria leads to the following table:

Term / document     relevant documents   not relevant documents   total
Term included       r                    n − r                    n
Term not included   R − r                N − n − R + r            N − n
total               R                    N − R                    N

The relevance weight w of a term can be calculated using a formula developed by Robertson and Spärck Jones, which Croft, Metzler and Strohman improved in 2010. With the smoothing constant 0.5 it reads:

w = log( ((r + 0.5) · (N − n − R + r + 0.5)) / ((n − r + 0.5) · (R − r + 0.5)) )

A numerical example, calculated in Microsoft Excel, gives the following relevance weights for documents that are relevant and contain the search term, depending on the number r of relevant documents found:

r (relevant documents with search term)    0      1      2      3      4      5
w (relevance weight of the documents)     −2.1   −1.0   −0.3   0.3    1.0    2.1
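
The table can be reproduced with the following short computation (Python). The parameters are not stated in the source; the rounded values match the formula above under the assumption of N = 10 documents in total, R = 5 relevant documents, n = 5 documents containing the term, and a base-10 logarithm:

    import math

    def rsj_weight(r, n, R, N):
        # Robertson/Sparck Jones relevance weight with 0.5 smoothing.
        return math.log10(((r + 0.5) * (N - n - R + r + 0.5))
                          / ((n - r + 0.5) * (R - r + 0.5)))

    for r in range(6):
        print(r, round(rsj_weight(r, n=5, R=5, N=10), 1))
    # Output: 0 -2.1, 1 -1.0, 2 -0.3, 3 0.3, 4 1.0, 5 2.1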

The relevance determined in this way can be stored for every document in the metadata, because it compactly summarizes the knowledge gained from this search query. In a later search query, internal algorithms can then present the documents in an optimized order, since each document carries an improved relevance value.

Evaluation

Weighted information retrieval lays the foundation for the ranking of documents. Output of a search query sorted by relevance enables a quick and precise response to a user's information need. The models examined indicate that a basic retrieval system must be supplemented by a weighting of the search terms in order to weight documents. The aim is to combine the advantages of classic retrieval systems with those of weighted information retrieval.

Boolean systems

In practice, Boolean systems raise difficulties with user-friendliness, because the end user has to apply the search operators correctly and assign a weight to every search term. This can be unintuitive and cumbersome for laypeople, and even classic Boolean systems have so far not been able to establish themselves with end users. Adding weights to the individual search terms makes intuitive operation even more difficult, so it can be assumed that weighted Boolean retrieval systems will not find acceptance outside of professional research.

Probabilistic model

The probabilistic information retrieval of Robertson and Spärck Jones is the best documented and best understood model. In a modified version, it enables automated pseudo-relevance feedback, which can sort the output of a search query by relevance without a user evaluating the individual documents. These systems are not yet fully developed and have a relatively high error rate, since no documents marked as "not relevant" are passed to the algorithm. Meanwhile, classic document retrieval is fading into the background, and the retrieval of images, videos and music is gaining in importance.

Search results sorted by relevance represent a quality factor, and a similar development is to be expected for the new multimedia retrieval systems. Only the weighting and sorting of search results in information retrieval led to today's standards in dealing with search engines. The weighting methods are therefore of great importance for the development of search engines.

References

  1. Stock, W. G., & Stock, M. (2013). Handbook of Information Science. Berlin: De Gruyter Saur.
  2. Robertson, S. E., Walker, S., & Beaulieu, M. (1999). Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track. In The 7th Text REtrieval Conference (TREC 7). Gaithersburg, MD: National Institute of Standards and Technology. (NIST Special Publication 500-242.)
  3. Sparck Jones, K., Walker, S., & Robertson, S. E. (2000a). A probabilistic model of information retrieval: Development and comparative experiments, Part 1. Information Processing and Management, 36, 779-808.
  4. Sparck Jones, K., Walker, S., & Robertson, S. E. (2000b). A probabilistic model of information retrieval: Development and comparative experiments, Part 2. Information Processing and Management, 36, 809-840.
  5. Croft, W. B., Metzler, D., & Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Boston, MA: Addison Wesley.
  6. Mittendorf, E., Mateev, B., & Schäuble, P. (2000). Using the Co-occurrence of Words for Retrieval Weighting. Alphen aan den Rijn: Kluwer Academic Publishers.