Extended Boolean Retrieval

from Wikipedia, the free encyclopedia

Extended Boolean Retrieval is a modification of Boolean Retrieval , which allows a more flexible handling of the search terms and an evaluation of the search results.

With classic Boolean retrieval, the query determines which terms should appear in the search results. The request divides the documents into two sets: one of the documents fulfills the request, the other does not. This brings with it two problems:

  1. A document that does not contain a required term will not be found. Still, the document could be relevant. It may simply name the term you are looking for with another name ( synonymy ). The other search terms are perhaps well represented.
  2. The documents that match the search criteria cannot be sorted by relevance.

The extended Boolean model attempts to address these problems by overriding the binary nature of Boolean algebra (true - false) and instead allowing values ​​that range in between. The values ​​are mathematically defined over an interval [0.1], where zero stands for "false" and one for "true".

Mathematical definition

An extended Boolean model is defined by the 4- tuple ( T , Q , D , rank (.,.) ) With

  • T = { t i | i = 1,…, n}: set of index terms that describe documents and queries .
  • Q = { q j | i = 1,…}: Set of all permitted queries.
  • D = { d k | k = 1,…, m}: set of documents. In the extended Boolean models, each term t ki of a document d k has a weight w ki ∈ [0,1], which represents the importance of the term in the document. A document d k thus has the structure d k = (( t k1 , w k1 ),…, ( t kn , w kn )).
  • rank (.,.) : Ranking function , which describes the similarity of a document to a query. It is generally defined by: rank ( d k , q j ): D x Q → [0,1]: ( d k , q j ) | → rank ( d k , q j ) ∈ [0,1].

The ranking function describes the different classes of extended Boolean models, whereby all models differentiate between the use of the logical operators AND and OR in the query.

Extended Boolean models without query weights

In this class of models, each query is represented as a combination of the index terms and the logical operators AND, OR, NOT with the possible use of expressions in parentheses. It is assumed that all terms have the same meaning, which is represented by missing term weights in the query. The following extended Boolean models can be distinguished:

Simple fuzzy set model (Buell 1981, Bookstein 1980, Radecki 1979)

rank ( d k , t 1 AND t 2 ) = MIN { w k1 , w k2 };
rank ( d k , t 1 OR t 2 ) = MAX { w k1 , w k2 }.

This simple model has the limitation that it can only evaluate two terms, in contrast to the following models, which can process any number of terms.

Waller Kraft model (Waller & Kraft 1979)

rank ( d k , t 1 AND… AND t n ) = (1 - γ) MIN { w k1 ,…, w kn } + γ MAX { w k1 ,…, w kn }, 0 ≤ γ ≤ 0, 5;
rank ( d k , t 1 OR… OR t n ) = (1 - γ) · MIN { w k1 ,…, w kn } + γ · MAX { w k1 ,…, w kn }, 0.5 ≤ γ ≤ 1.

Paice model (Paice 1984)

In an AND operation, first order the weights w ki with increasing values, ie w k1 ≤… ≤ w kn , and then calculate

rank ( d k , t 1 AND ... AND t n ) = (Σ i = 1 n ( r i-1 · w ki )) / (Σ i = 1 n r i-1 ), 0 ≤ r ≤. 1

In the case of an OR operation, first order the weights w ki with descending values, ie w k1 ≥… ≥ w kn , and then calculate

rank ( d k , t 1 OR ... OR t n ) = (Σ i = 1 n ( r i-1 · w ki )) / (Σ i = 1 n r i-1 ), 0 ≤ r ≤. 1

P-norm model (Salton et al. 1983)

rank ( d k , t 1 AND… AND t n ) = 1 - (1 / n i = 1 n (1 - w ki ) p ) 1 / p , 1 ≤ p <∞,
rank ( d k , t 1 OR… OR t n ) = 1 - (1 / n · ∑ i = 1 n ( w ki ) p ) 1 / p , 1 ≤ p <∞.

Infinite One model (Smith 1990)

rank ( d k , t 1 AND ... AND t n ) = γ · (1 - MAX {1 - w k1 , ..., 1 - w kn }) + (1 - γ) · (1 / n · ∑ i = 1 n w ki ), 0 γ 1;
rank ( d k , t 1 OR… OR t n ) = γ · MAX { w k1 ,…, w kn } + (1 - γ) · (1 / n · ∑ i = 1 n w ki ), 0 ≤ γ ≤ 1.

Extended Boolean models with query weights

The retrieval -Effektivität can be increased by the terms are associated with important factors in the form of weights. A query q j thus gets a structure analogous to a document d k with tuples of terms and weights. If terms that do not occur in the original query are coded with a weight of zero, each original query can be recoded into a query that takes into account all terms occurring in T:

q j = (( t q (j) 1 , w q (j) 1 ), ..., ( t q (j) n , w q (j) n )).

In the relevance-weighting approaches (see Buell (1981)) the ranking functions are reformulated in such a way that the weights of the documents and queries are multiplied, ie:

rank ( d k , ( t q (j) , w q (j) )) = w k · w q (j) .

Under this assumption, the P-Norm and Infinite-One models of the presented extended Boolean models are able to evaluate the weights in the documents and a query:

P-norm model with query weights

rank ( d k , ( t q (j) 1 , w q (j) 1 ) AND… AND ( t q (j) n , w q (j) n )) = 1 - ((∑ i = 1 n ( 1 - w ki ) p · w q (j) i p ) / (Σ i = 1 n w ki p )) 1 / p , 1 ≤ p <∞,
rank ( d k , ( t q (j) 1 , w q (j) 1 ) OR ... OR ( t q (j) n , w q (j) n )) = 1 - ((∑ i = 1 n w ki p · w q (j) i p ) / (Σ i = 1 n w ki p )) 1 / p , 1 ≤ p <∞.

Infinite One Model with Query Weights

rank ( d k , ( t q (j) 1 , w q (j) 1 ) AND ... AND ( t q (j) n , w q (j) n )) = γ · (1 - (( MAX {( 1 - w k1 ) w q (j) 1 ,…, (1 - w kn ) w q (j) n }) / ( MAX { w q (j) 1 , ..., w q (j) n } )) + (1 - γ) · ((∑ i = 1 n ( w ki · w q (j) i ) / (∑ i = 1 n w q (j) i )), 0 ≤ γ ≤ 1;
rank ( d k , ( t q (j) 1 , w q (j) 1 ) OR ... OR ( t q (j) n , w q (j) n )) = γ · ( MAX { w k1 · w q (j) 1 , ..., w k n · w q (j) n }) / ( MAX { w q (j) 1 , ..., w q (j) n }) + (1 - γ) · ((Σ i = 1 n ( w ki · w q (j) i )) / (Σ i = 1 n w q (j) i )), 0 ≤ γ ≤. 1

literature

  • DA Buell: A general model for query processing in information retrieval system. In: Information Processing and Management. 17, 1981, pp. 249-262.
  • A. Bookstein: Fuzzy requests: an approach to weighted boolean searches. In: Journal of the American Society for Information Science. 31, 1980, pp. 240-247.
  • EA Fox, S. Betrabet, M. Koushik, W. Lee: Extended boolean models. In: WB Frankes, RB Yates (Ed.): Information Retrieval Data Structures and Algorithms. Prentice Hall. 1992, pp. 393-418.
  • MH Kim, JH Lee, YJ Lee: Analysis of fuzzy operators for high quality information retrieval. In: Information Processing Letters. 46 (5), 1993, pp. 251-256.
  • JH Lee, MH Kim, YJ Lee: Ranking documents in thesaurus-based boolean retrieval systems. In: Information Processing and Management. 30, 1994, pp. 79-91.
  • JH Lee, MII Kim, YJ Lee: Enhancing the fuzzy set model for high quality document rankings. In: Proceedings of the 19th Euromicro Conference. 1992, pp. 337-344.
  • JH Lee, WY Kim, MH Kim, YJ Lee: On the evaluation of boolean operators in the extended boolean retrieval framework. In: SIGIR. 1993, pp. 291-297.
  • Joon Ho Lee: Properties of extended Boolean models in information retrieval. In: Croft & Rijsbergen: SIGIR. 1994, pp. 182-190.
  • CP Paice: Soft evaluation of boolean search queries in information retrieval systems. In: Information Technology: Research and Development. 3 (1), 1984, pp. 33-42.
  • T. Radecki: Fuzzy set theoretical approach to document retrieval. In: Information Processing and Management. 15, 1979, pp. 247-259.
  • WM Sachs: An approach to associative retrieval through the theory of fuzzy sets. In: Journal of the American Society for Information Science. 27, 1976, pp. 85-87.
  • G. Salton, EA Fox, H. Wu: Extended boolean information retrieval. In: Communication of the ACM. 26 (11), 1983, pp. 1022-1036.
  • ME Smith: Aspect of the p-norm model of information retrieval: syntactic query generation, efficiency, and theoretical properties. PhD thesis. Cornell University, 1990.
  • WG Waller, DH Kraft: A mathematical model for weighted Boolean retrieval systems. In: Information Processing and Management. 15, 1979, pp. 235-245.