Fuzzy Retrieval

The fuzzy retrieval has evolved since the 1970s. Here, fuzzy information retrieval refers to information retrieval that is based on fuzzy logic .

The fuzzy IR model

The fuzzy IR model is to be defined with a quadruple , where ${\ displaystyle \ langle T, Q, D, F \ rangle}$

${\ displaystyle T = \ {t_ {1}, t_ {2}, \ dotsc, t_ {n} \}}$ a set of index termen that queries describe and documents.
${\ displaystyle Q = \ {q_ {1}, q_ {2}, \ dotsc, q_ {m} \}}$ a set of queries made up of index terms. The index terms can be linked by logical operations AND, OR and NOT.
${\ displaystyle D = \ {d_ {1}, d_ {2}, \ dotsc, d_ {k} \}}$ a lot of documents. Each is to be represented by, where the importance of term in represents and takes a value from the interval . ${\ displaystyle d_ {j} \ in D, j = 1,2, \ dotsc, k}$ ${\ displaystyle \ {(t_ {1}, w_ {j1}), \ dotsc, (t_ {n}, w_ {jn}) \}}$ ${\ displaystyle w_ {ji} (i = 1,2, \ dotsc, n)}$ ${\ displaystyle t_ {i}}$ ${\ displaystyle d_ {j}}$ ${\ displaystyle [0,1]}$
${\ displaystyle F}$ is a ranking function

{\ displaystyle F \ colon D \ times Q \ to [0,1]}

,

{\ displaystyle F (d, q) \ in [0,1]}

.

The value represents the similarity between the document and the query . ${\ displaystyle d}$ ${\ displaystyle q}$

The following applies to a query:

A query is a well-formed, propositional formula. ${\ displaystyle q}$
An individual index term is a query: . This type of query is called an atomic query. ${\ displaystyle q = t_ {i}}$
If is a query, then (the negation of ) is also a query. ${\ displaystyle q}$ ${\ displaystyle \ neg q}$ ${\ displaystyle q}$
If and are queries, ( or ) and ( and ) are also queries. ${\ displaystyle q}$ ${\ displaystyle q '}$ ${\ displaystyle q \ cup q '}$ ${\ displaystyle q}$ ${\ displaystyle q '}$ ${\ displaystyle q \ cap q '}$ ${\ displaystyle q}$ ${\ displaystyle q '}$

The fuzzy set operations are used as follows:

{\ displaystyle F (d_ {j}, t_ {1} {\ text {AND}} t_ {2}) = \ min (w_ {j1}, w_ {j2})}

{\ displaystyle F (d_ {j}, t_ {1} {\ text {OR}} t_ {2}) = \ max (w_ {j1}, w_ {j2})}

{\ displaystyle F (d_ {j}, t_ {1} ') = 1-w_ {j1}}

An example will now be given to illustrate the application of the fuzzy IR model. The query is:

{\ displaystyle q_ {1} = {\ text {Golden AND Silver}}}

There are two documents:

{\ displaystyle d_ {1} = \ {({\ text {Golden}}, \, 0 {,} 4), \, ({\ text {Silver}}, \, 0 {,} 4) \}}

{\ displaystyle d_ {2} = \ {({\ text {Golden}}, \, 0 {,} 4), \, ({\ text {Silver}}, \, 0 {,} 7) \}}

After the operation, the result is:

{\ displaystyle F (d_ {1}, t_ {1} {\ text {AND}} t_ {2}) = \ min (0 {,} 4, \ 0 {,} 4) = 0 {,} 4}

{\ displaystyle F (d_ {2}, t_ {1} {\ text {AND}} t_ {2}) = \ min (0 {,} 4, \ 0 {,} 7) = 0 {,} 4}

The same results at and state that the similarity between and with that between and is the same. But most people would decide that the more similar than would be. The undesirable result here is due to the fact that the operation only takes one term weight into account. In addition, the simple fuzzy set operations are limited to only two terms. In the following, two developed fuzzy models are presented that can evaluate any number of terms. Furthermore, a parameter can be introduced into the models as a “softness factor” to solve the above-mentioned problem (the result that depends on a weight) ${\ displaystyle d_ {1}}$ ${\ displaystyle d_ {2}}$ ${\ displaystyle d_ {1}}$ ${\ displaystyle q_ {1}}$ ${\ displaystyle d_ {2}}$ ${\ displaystyle q_ {1}}$ ${\ displaystyle d_ {2}}$ ${\ displaystyle q_ {1}}$ ${\ displaystyle d_ {1}}$

Extended fuzzy IR models

The Waller Kraft model

${\ displaystyle F (d_ {j}, t_ {1} {\ text {AND}} \ dots {\ text {AND}} t_ {n}) = (1- \ gamma) \ cdot \ min \ {w_ { j1}, \ dots, w_ {jn} \} + \ gamma \ cdot \ max {w_ {j1}, \ dots, w_ {jn}}}$ , ; ${\ displaystyle 0 \ leqq \ gamma \ leqq 0 {,} 5}$

${\ displaystyle F (d_ {j}, t_ {1} {\ text {OR}} \ dots {\ text {OR}} t_ {n}) = (1- \ gamma) \ cdot \ min \ {w_ { j1}, \ dots, w_ {jn} \} + \ gamma \ cdot \ max \ {w_ {j1}, \ dots, w_ {jn} \}}$ , . ${\ displaystyle 0 {,} 5 \ leqq \ gamma \ leqq 1}$

The model mixes the operation maximum with minimum and is more effective than the simple fuzzy model.

The Paice model

In the case of an AND operation: sorted according to size in ascending order, i. H. ${\ displaystyle w_ {ji}}$ ${\ displaystyle w_ {j1} \ leqq \ dots \ leqq w_ {jn}}$

${\ displaystyle F (d_ {j}, t_ {1} {\ text {AND}} \ dots {\ text {AND}} t_ {n}) = \ left [\ sum _ {i = 1} ^ {n } (r ^ {i-1} \ cdots w_ {ji}) \ right] / \ left [\ sum _ {i = 1} ^ {n} r ^ {i-1} \ right]}$ , . ${\ displaystyle 0 \ leqq r \ leqq 1}$

In the case of an OR link: sorted according to size in descending order, i. H. ${\ displaystyle w_ {ji}}$ ${\ displaystyle w_ {j1} \ geqq \ dots \ geqq w_ {jn}}$

${\ displaystyle F (d_ {j}, t_ {1} {\ text {OR}} \ dots {\ text {OR}} t_ {n}) = \ left [\ sum _ {i = 1} ^ {n } (r ^ {i-1} \ cdot w_ {ji}) \ right] / \ left [\ sum _ {i = 1} ^ {n} r ^ {i-1} \ right]}$ , . ${\ displaystyle 0 \ leqq r \ leqq 1}$

This model takes into account all term weights when calculating the similarity. But it requires more computational effort than the Waller-Kraft model.

comparison

The following table compares the results of and with the simple fuzzy IR model, Waller-Kraft model and Paice model. ${\ displaystyle d_ {1}}$ ${\ displaystyle d_ {2}}$

${\ displaystyle q_ {1} = t_ {1} {\ text {AND}} t_ {2}}$	Simple fuzzy IR model	Waller force model ( ) ${\ displaystyle \ gamma = 0 {,} 3}$	Paice model ( ) ${\ displaystyle r = 0 {,} 3}$
${\ displaystyle d_ {1} = \ left ((t_ {1}, \, 0 {,} 4), \, (t_ {2}, 0 {,} 4) \ right)}$	${\ displaystyle 0 {,} 4}$	${\ displaystyle (1-0 {,} 3) \ cdot 0 {,} 4 + 0 {,} 3 \ cdot 0 {,} 4 = 0 {,} 4}$	${\ displaystyle (0 {,} 3 ^ {0} \ cdot 0 {,} 4 + 0 {,} 3 ^ {1} \ cdot 0 {,} 4) / (0 {,} 3 ^ {0} + 0 {,} 3 ^ {1}) = 0 {,} 4}$
${\ displaystyle d_ {2} = \ left ((t_ {1}, \, 0 {,} 4), \, (t_ {2}, 0 {,} 7) \ right)}$	${\ displaystyle 0 {,} 4}$	${\ displaystyle (1-0 {,} 3) \ cdot 0 {,} 4 + 0 {,} 3 \ cdot 0 {,} 7 = 0 {,} 49}$	${\ displaystyle (0 {,} 3_ {0} \ cdot 0 {,} 4 + 0 {,} 3 ^ {1} \ cdot 0 {,} 7) / (0 {,} 3 ^ {0} +0 {,} ^ {1}) = 0 {,} 47}$

The degree of similarity between and is the same for the three models, which is understandable. The difference arises in the results of , where those of the two extended models are larger than those of the simple fuzzy IR model, which is more in line with the expectation. Therefore, it can be said that the two models are more effective in finding than the simple fuzzy IR model. ${\ displaystyle d_ {1}}$ ${\ displaystyle q_ {1}}$ ${\ displaystyle d_ {2}}$

The Waller-Kraft model mixes the maximum with the minimum, but it only considers these two term weights, which can lead to problems with queries with more than two terms. Example:

{\ displaystyle q_ {2} = t_ {1} {\ text {OR}} t_ {2} {\ text {OR}} t_ {3} {\ text {OR}} t_ {4} {\ text {OR }} t_ {5}}

{\ displaystyle d_ {3} = \ {(t_ {1}, \, 0 {,} 1), \, (t_ {2}, \, 0 {,} 5), \, (t_ {3}, \, 0 {,} 5), \, (t_ {4}, \, 0 {,} 5), \, (t_ {5}, \, 0 {,} 8) \}}

{\ displaystyle d_ {4} = \ {(t_ {1}, \, 0 {,} 1), \, (t_ {2}, \, 0 {,} 2), \, (t_ {3}, \, 0 {,} 2), \, (t_ {4}, \, 0 {,} 2), \, (t_ {5}, \, 0 {,} 8) \}}

It is clear that the degree of similarity between and is greater than that between and . But according to the equation at Waller-force model same results are obtained and calculated, which value for the parameter is also determined because it in this model only to the and -Termgewicht is taken into consideration. Thus the problem arises. In comparison, the Paice model is more complex, but it takes all term weights into account in the calculation and therefore avoids this problem. ${\ displaystyle d_ {3}}$ ${\ displaystyle q_ {2}}$ ${\ displaystyle d_ {4}}$ ${\ displaystyle q_ {2}}$ ${\ displaystyle d_ {3}}$ ${\ displaystyle d_ {4}}$ ${\ displaystyle \ gamma}$ ${\ displaystyle \ min}$ ${\ displaystyle \ max}$

The introduction of the term weight in the query

The models just shown do not consider the weights of terms in queries, where all terms have the same importance in queries. It is known that introducing the weights of terms into the queries can improve retrieval efficiency. The query is represented by the term weight: , . In retrieval, the weights of terms in queries and documents are multiplied, that is ${\ displaystyle q_ {k} = \ {(t_ {1}, w_ {k1}), \ dots, (t_ {n}, w_ {kn}) \}}$ ${\ displaystyle w_ {k} \ in [0,1]}$

{\ displaystyle F {\ bigl (} d_ {j}, (t_ {i}, w_ {ki}) {\ bigr)} = w_ {ji} \ cdot w_ {ki}}

.

A query without a term weight is like a query in which the weights of all terms are "1". A term is removed when its weight is zero, which means that the term has no influence on the query.

Although the Waller-Kraft model and the Paice model do not offer a method for evaluating the term weights in queries, the P-Norm model has formulas for calculating the term weights in queries.

Fuzzy IR model with query weights

The P-norm model with query weights

${\ displaystyle F {\ bigl (} d_ {j}, (t_ {q (k) 1}, w_ {q (k) 1}) {\ text {AND}} \ dots {\ text {AND}} ( t_ {q (k) n}, w_ {q (k) n}) {\ bigr)} = 1- \ left [\ left [\ sum _ {i = 1} ^ {n} (1-w_ {ji }) ^ {p} \ cdot w_ {q (k) i} ^ {p} \ right] / \ left [\ sum _ {i = 1} ^ {n} w_ {ji} ^ {p} \ right] \ right] ^ {1 / p}}$ , , ${\ displaystyle 1 \ leqq p <\ infty}$

${\ displaystyle F {\ bigl (} d_ {j}, (t_ {q (k) 1}, w_ {q (k) 1}) {\ text {OR}} \ dots {\ text {OR}} ( t_ {q (k) n}, w_ {q (k) n}) {\ bigr)} = 1- \ left [\ left [\ sum _ {i = 1} ^ {n} w_ {ji} ^ { p} \ cdot w_ {q (k) i} ^ {p} \ right] / \ left [\ sum _ {i = 1} ^ {n} w_ {ji} ^ {p} \ right] \ right] ^ {1 / p}}$ , . ${\ displaystyle 1 \ leqq p <\ infty}$

Here is the parameter and represents the level of accuracy. "1" means little precise, while means very precise. ${\ displaystyle p}$ ${\ displaystyle \ infty}$

Term relations

Fuzzy term relations are called fuzzy thesauruses. Here this relation means a fuzzy relation on a fuzzy set which has the interpretation of a fuzzy graph . Formally it is assumed: is a set of terms and a set of documents. A (general) term relation is defined by a fuzzy relation on . (Here terms and documents are combined into a whole set, although it is called a term relation.) Three types of relations are involved: ${\ displaystyle T = \ {t_ {1}, t_ {2}, \ dotsc, t_ {m} \}}$ ${\ displaystyle D = \ {d_ {1}, d_ {2}, \ dotsc, d_ {n} \}}$ ${\ displaystyle T \ cup D: R (x, y), x, y \ in T \ cup D}$

A relation between two terms, , ${\ displaystyle R (t, t '), t, t' \ in T}$
A relationship between two documents , ${\ displaystyle R (d, d '), d, d' \ in D}$
A relation between a term and a document: or . ${\ displaystyle R (t, d)}$ ${\ displaystyle R (d, t), t \ in T, d \ in D}$

The problems in term relations mentioned below are then discussed:

concrete examples for term relations,
Method of obtaining and creating term relations,
Method of using term relations in Information Retrieval.

Examples of term relations

The thesauruses and their fuzzy versions are typical examples of term relations, whereby the fuzzy relation is not defined on but on . Different types of fuzzy thesauruses are considered. For example, Reisinger sees fuzzy equivalence and fuzzy ordering relations as natural generalizations of sharply categorical and hierarchical relations. Tahani also mentions partial fuzzy ordering. Redecki suggests the use of a fuzzy equivalence relation together with a subset of the elementary terms and a term generalization relation. ${\ displaystyle R}$ ${\ displaystyle T \ cup D}$ ${\ displaystyle T}$

In the research of fuzzy thesauruses, symmetrical and asymmetrical fuzzy relations as well as fuzzy transitivity are taken into account, but their assumption leads to a problem because in reality no fuzzy transitivity can be found directly. This problem can be solved by considering fuzzy graphs (undirected graphs) and digraphs. A fuzzy relation that does not have to be transitive is given. This relation can be represented by a fuzzy digraph , and a “transitive closure” is reconsidered, ( where the - composition implies). means the degree of accessibility on the digraph, namely the -value of -cut, where on the sharp digraph from is reachable. ${\ displaystyle R}$ ${\ displaystyle R ^ {*} = R \ cup R_ {2} \ cup \ dots \ cup R_ {k} \ cup \ cdots}$ ${\ displaystyle R ^ {k} = R_ {k-1} \ circ R}$ ${\ displaystyle \ circ}$ ${\ displaystyle \ max}$ ${\ displaystyle \ min}$ ${\ displaystyle R ^ {*}}$ ${\ displaystyle R * (x, y)}$ ${\ displaystyle \ max}$ ${\ displaystyle \ alpha}$ ${\ displaystyle x}$ ${\ displaystyle y}$

The operations and properties of fuzzy relations mentioned above are summarized here:

Two fuzzy relations and are given , which are defined on. The - -Komposition: . ${\ displaystyle R}$ ${\ displaystyle S}$ ${\ displaystyle T}$ ${\ displaystyle \ max}$ ${\ displaystyle \ min}$ ${\ displaystyle (R \ circ S) (x, z) = \ max _ {y \ in T} \ min [R (x, y), S (y, z)]}$
A relation on a set is called ${\ displaystyle R}$ $R.$ ${\ displaystyle T}$ $T$
1. reflexive if for all , , , ${\ displaystyle x}$ ${\ displaystyle x \ in T}$ ${\ displaystyle R (x, x) = 1}$
2. symmetric if for all and , , , ${\ displaystyle x}$ ${\ displaystyle y}$ ${\ displaystyle x, y \ in T}$ ${\ displaystyle R (x, y) = R (y, x)}$
3. transitive if for all and , , . ${\ displaystyle x}$ ${\ displaystyle y}$ ${\ displaystyle x, y \ in T}$ ${\ displaystyle R (x, y) \ leqq \ max _ {z \ in T} \ min [R (x, z), R (z, y)]}$

Construction of the term relations

Different researches deal with the methods of automatic construction of the fuzzy relation of terms or of documents under different assumptions. A typical way to do this is to use Document Term Matrix , which represents the weight of Term in the document . Here it is assumed: is the fuzzy set that corresponds to the term . A symmetrical relation and an unsymmetrical relation are defined by ${\ displaystyle A = (a_ {ij})}$ ${\ displaystyle a_ {ij}}$ ${\ displaystyle t_ {j}}$ ${\ displaystyle d_ {i}}$ ${\ displaystyle \ gamma _ {j} = \ sum _ {i} a_ {ij} / d_ {i}}$ ${\ displaystyle t_ {j}}$ ${\ displaystyle R_ {s} (t_ {j}, t_ {k})}$ ${\ displaystyle R_ {n} (t_ {j}, t_ {k})}$

{\ displaystyle R_ {s} (t_ {j}, t_ {k}) = \ vert \ gamma _ {j} \ cap \ gamma _ {k} \ vert / \ vert \ gamma _ {j} \ cap \ gamma _ {k} \ vert}

,

{\ displaystyle R_ {n} (t_ {j}, t_ {k}) = \ vert \ gamma _ {j} \ cap \ gamma _ {k} \ vert / \ vert \ gamma _ {j} \ vert}

,

where is the sum. This method is based on the assumption that the meaning of the two terms is also similar if the two patterns of and are similar. Accepting is that a narrower meaning than has when the is included. ${\ displaystyle \ vert \ gamma _ {j} \ vert = a_ {1j} + a_ {2j} + \ dots + a_ {nj}}$ ${\ displaystyle \ sum}$ ${\ displaystyle \ gamma _ {j}}$ ${\ displaystyle \ gamma _ {k}}$ ${\ displaystyle R_ {n} (t_ {j}, t_ {k})}$ ${\ displaystyle \ gamma _ {j}}$ ${\ displaystyle \ gamma _ {k}}$ ${\ displaystyle \ gamma _ {j}}$ ${\ displaystyle \ gamma _ {k}}$

Use of term relations in the real numbers

There are two basic methods of using term relations in Information Retrieval . When a term relation is enabled as a network in which the documents are terminal nodes and a query is an original node, the retrieval is performed by the tracing of the network. On the other hand, if a term relation is given together with a fuzzy relation and a fuzzy query vector , a simple standard method for retrieval of the documents is to calculate a fuzzy set by using MAX-MIN composition of the fuzzy -Relations. ${\ displaystyle R}$ ${\ displaystyle T}$ ${\ displaystyle F (d, t)}$ ${\ displaystyle q = \ sum \ nolimits _ {j} {\ frac {w_ {j}} {t_ {j}}}}$ ${\ displaystyle \ delta = F \ circ R \ circ q}$

literature

Lee, Joon Ho: Properties of extended Boolean models in information retrieval. In: Croft & Rijsbergen (1994): SIGIR 1994. Pages 182-190.
Miyamoto, Sadaaki: 1989, Two approaches for information retrieval through fuzzy associations IEEE Trans. Syst. Man cybernet. SMC-19 123-30 - 1990b Fuzzy Sets in Information Retrieval and Cluster Analysis (Dordrecht: Kluwer).
Miyamoto, Sadaaki: Information Retrieval. In: Ruspini, Enrique H .; Bonissone, Piero P .; Pedrycz, Witold (eds.): Handbook of fuzzy computation. Bristol, Institute of Physics Publ. 1998: F.4.2.
Paice, CP: Soft evaluation of boolean search queries in information retrieval systems. In: Information Technology: Research and Development , 3 (1), 1984, pages 33-42.
Panyr, Jiří : The theory of fuzzy sets and information retrieval systems. In: Nachrichten für Documentation 37 , 1986. Pages 163-168.
Salton, G .; Fox, EA; Wu, H .: Extended boolean information retrieval. In: Communication of the ACM , 26 (11), 1983, pp. 1022-1036.
Waller, WG; Kraft, DH: A mathematical model for weighted Boolean retrieval systems. In: Information Processing and Management , 15, 1979, pages 235-245.

swell

↑ Salton et al., 1983
^ Reisinger, 1974
^ Tahani, 1976
↑ Redecki, 1976
↑ Miyamoto, 1990b, p. 30
↑ Miyamoto, 1990b, p. 195.

[1] Salton et al., 1983

[2] Reisinger, 1974

[3] Tahani, 1976

[4] Redecki, 1976

[5] Miyamoto, 1990b, p. 30

[6] Miyamoto, 1990b, p. 195.