Tf-idf measure

The tf-idf measure (short for term frequency and inverse document frequency) is used in information retrieval to assess the relevance of terms in the documents of a document collection.

With the weight computed in this way for a word with respect to the document that contains it, the documents returned by a word-based search can be ranked better in the hit list than would be possible using, for example, the term frequency alone.

Frequency of occurrence

The frequency of occurrence $\#(t, D)$ (also called search term density) indicates how often the term $t$ occurs in the document $D$. For example, if the document $D_i$ is the sentence

The red car stops at the red light.

then $\#(\text{red}, D_i) = 2$.

To avoid distorting the result for long documents, the absolute frequency of occurrence $\#(t, D)$ can be normalized. To do this, the number of occurrences of term $t$ in document $D$ is divided by the maximum frequency of any term in $D$, which gives the relative frequency of occurrence $\operatorname{tf}(t, D)$:

$$\operatorname{tf}(t, D) = \frac{\#(t, D)}{\max_{t' \in D} \#(t', D)}$$
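The normalization can be illustrated with a minimal Python sketch applied to the example sentence. The helper names `tokenize` and `term_frequency` are hypothetical, and the tokenization (lowercasing, splitting on whitespace, stripping punctuation) is a simplifying assumption that the formula itself does not prescribe.

```python
from collections import Counter

PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (assumed scheme)."""
    words = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in words if word]


def term_frequency(term, tokens):
    """Count of `term` divided by the count of the most frequent term in the document."""
    counts = Counter(tokens)
    if not counts:
        return 0.0  # empty document: no term occurs
    return counts[term] / max(counts.values())


tokens = tokenize("The red car stops at the red light.")
absolute_red = Counter(tokens)["red"]          # absolute frequency of "red"
relative_red = term_frequency("red", tokens)   # "red" ties with "the" as most frequent
relative_car = term_frequency("car", tokens)   # "car" occurs half as often

print(absolute_red, relative_red, relative_car)
```

In this sentence "red" and "the" both occur twice, so both receive the maximum relative frequency of 1, while a term that occurs once receives 0.5.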

Other approaches use Boolean frequency (i.e. only checking whether the word occurs or not) or a logarithmically scaled frequency.
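The two alternatives can be sketched as follows. The function names are hypothetical, and the logarithmic variant is shown in one common form, log(1 + count), which is an assumption here since the text does not fix a particular scaling.

```python
import math
from collections import Counter


def boolean_tf(term, tokens):
    """1 if the term occurs in the document at all, otherwise 0."""
    return 1 if term in tokens else 0


def log_tf(term, tokens):
    """Logarithmically scaled count; log(1 + count) is an assumed variant."""
    return math.log(1 + Counter(tokens)[term])


tokens = ["the", "red", "car", "stops", "at", "the", "red", "light"]

print(boolean_tf("red", tokens), boolean_tf("blue", tokens))
print(log_tf("red", tokens), log_tf("car", tokens))
```

Both variants dampen the effect of repeated occurrences: the Boolean form ignores them entirely, and the logarithmic form lets each further occurrence add less than the previous one.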

Inverse document frequency

The inverse document frequency measures the specificity of a term with respect to the entire set of documents considered. A match on a rare term is more indicative of relevance than a match on a very common word (e.g. "and" or "a").

The inverse document frequency $\operatorname{idf}(t)$ of a term $t$ does not depend on the individual document, but on the document corpus (the set of all documents in the retrieval scenario):

$$\operatorname{idf}(t) = \log \frac{N}{\sum_{D: t \in D} 1}$$

Here $N$ is the number of documents in the corpus and $\sum_{D: t \in D} 1$ is the number of documents that contain the term $t$.
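As a rough illustration, the inverse document frequency can be computed over a small, made-up corpus of pre-tokenized documents. The function name, the toy corpus, the use of the natural logarithm, and raising an error for terms that occur nowhere (where the formula is undefined) are all assumptions of this sketch.

```python
import math


def inverse_document_frequency(term, corpus):
    """log(N / n_t), with N documents in total and n_t documents containing `term`."""
    n_docs = len(corpus)
    n_containing = sum(1 for tokens in corpus if term in tokens)
    if n_containing == 0:
        # The formula divides by zero here; raising is one possible choice.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / n_containing)  # natural logarithm (assumed base)


# Hypothetical corpus: each document is a list of tokens.
corpus = [
    ["the", "red", "car", "stops", "at", "the", "red", "light"],
    ["the", "light", "is", "green"],
    ["the", "car", "and", "the", "bus"],
]

idf_the = inverse_document_frequency("the", corpus)  # occurs in every document
idf_car = inverse_document_frequency("car", corpus)  # occurs in two of three
idf_red = inverse_document_frequency("red", corpus)  # occurs in one of three

print(idf_the, idf_car, idf_red)
```

A term that appears in every document gets an inverse document frequency of 0, so it contributes nothing to a ranking, while the rarest term gets the largest value.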

TF-IDF

According to TF-IDF, the weight $\operatorname{tf}.\operatorname{idf}(t, D)$ of a term $t$ in the document $D$ is then the product of the term frequency and the inverse document frequency:

$$\operatorname{tf}.\operatorname{idf}(t, D) = \operatorname{tf}(t, D) \cdot \operatorname{idf}(t)$$

In most applications, repeated occurrences of a term should not increase its relevance in direct proportion to their number. In practice, the TF value is therefore usually normalized.
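Putting the pieces together, the following minimal sketch computes the weight of every term in one document of a small, made-up corpus. It assumes the max-normalized term frequency and a natural-logarithm inverse document frequency; the function names and the choice to return 0 for terms that occur in no document are assumptions, not part of the definition.

```python
import math
from collections import Counter


def tf(term, tokens):
    """Term count divided by the count of the most frequent term in the document."""
    counts = Counter(tokens)
    return counts[term] / max(counts.values()) if counts else 0.0


def idf(term, corpus):
    """log(N / n_t); returns 0.0 for terms found in no document (assumed convention)."""
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing) if n_containing else 0.0


def tf_idf(term, tokens, corpus):
    """Weight of `term` in one document: tf(t, D) * idf(t)."""
    return tf(term, tokens) * idf(term, corpus)


# Hypothetical corpus: each document is a list of tokens.
corpus = [
    ["the", "red", "car", "stops", "at", "the", "red", "light"],
    ["the", "light", "is", "green"],
    ["the", "car", "and", "the", "bus"],
]

document = corpus[0]
weights = {term: tf_idf(term, document, corpus) for term in set(document)}
top_term = max(weights, key=weights.get)

for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term:6s} {weight:.3f}")
```

Although "the" and "red" are equally frequent in the first document, "the" occurs in every document and ends up with weight 0, whereas "red" occurs only in this document and receives the highest weight.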

Software

The tf-idf calculation is implemented in scikit-learn, a free software library for the Python programming language.

Literature

Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval. Addison-Wesley, Harlow et al. 1999, ISBN 0-201-39829-X, pp. 29-30.

References

[1] sklearn.feature_extraction.text.TfidfTransformer. In: scikit-learn documentation. Retrieved April 9, 2019.