Within-document frequency

Within-document frequency (WDF) is the document-specific weight of a word.

The formula for the document-specific word weighting was developed by Donna Harman to assign each word that appears in a document a weight that can be used in information science. This weight can be combined, for example, with the inverse document frequency (IDF) and the weighting value P to form a simple weighting formula. The WDF is not the relative frequency of a word in the document but a logarithmically compressed value that is easier to work with. The more often a word occurs in a document, the higher its WDF.

The formula

$\mathrm{WDF}(i) = \frac{\log_2(\mathrm{Freq}(i,j) + 1)}{\log_2(L)}$

i =: word
j =: document
L =: total number of words in document j
Freq(i, j) =: frequency of word i in document j

Explanation of the "+1": if Freq(i, j) = 0, the "+1" ensures that the numerator is log₂(1) = 0, so the WDF of a word that does not occur in the document is 0.
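The formula and the role of the "+1" can be sketched in a few lines of Python. This is only a minimal illustration: the function name `wdf`, its parameter names, and the input checks are assumptions made here and are not part of Harman's formulation.

```python
import math


def wdf(freq: int, length: int) -> float:
    """Return log2(freq + 1) / log2(length) for one word in one document.

    freq   -- Freq(i, j), how often word i occurs in document j
    length -- L, the total number of words in document j
    """
    if freq < 0:
        raise ValueError("freq must be non-negative")
    if length < 2:
        # log2(1) = 0 would put a zero in the denominator (assumed guard).
        raise ValueError("length must be at least 2")
    return math.log2(freq + 1) / math.log2(length)


# A word that never occurs: the numerator is log2(0 + 1) = 0.
print(wdf(0, 1024))  # 0.0

# A word that occurs 7 times in 1024 words: log2(8) / log2(1024) = 3 / 10.
print(wdf(7, 1024))  # 0.3
```

Without the "+1", the first call would have to evaluate log₂(0), which is undefined.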

Example

A document consists of 12,000 words, so L = 12,000. The word i occurs 23 times in this document, so Freq(i, j) = 23.
Substituting these values into the formula gives

$\mathrm{WDF}(i) = \frac{\log_2(23 + 1)}{\log_2(12\,000)}$

The result is the weighting value WDF(i) ≈ 0.34 (0.3 when rounded to one decimal place); for comparison, the relative frequency of the word i is about 0.1917% ($\frac{23}{12000}$).

In search engine optimization, the WDF weighting value is used to increase the relevance of a website for a search engine. Compared with a simple keyword density calculation, the logarithm in the WDF formula prevents a frequently repeated search word from being weighted too heavily.
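The damping effect of the logarithm can be illustrated by comparing keyword density with WDF for a growing word count. The following is a hedged sketch: the helper names `wdf` and `density` and the sample frequencies 46 and 230 are assumptions chosen for illustration. Only the document length of 12,000 words and the frequency 23 come from the example above.

```python
import math


def wdf(freq: int, length: int) -> float:
    """WDF as defined above: log2(freq + 1) / log2(length)."""
    return math.log2(freq + 1) / math.log2(length)


def density(freq: int, length: int) -> float:
    """Plain keyword density: the relative frequency freq / length."""
    return freq / length


L = 12000  # document length in words

# Repeat the word 1x, 2x and 10x as often as in the worked example.
for freq in (23, 46, 230):
    print(f"freq={freq:>3}  density={density(freq, L):.4%}  WDF={wdf(freq, L):.2f}")
```

Raising the frequency tenfold, from 23 to 230, raises the keyword density tenfold, whereas the WDF grows by less than a factor of two.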

WDF * IDF

The term WDF * IDF (which is similar to TF-IDF) has also become popular in search engine optimization. It compares the relevance of a document with that of competing documents. IDF denotes the inverse document frequency. The IDF value is calculated from the total number of indexed documents, i.e. documents known to the search engine, divided by the number of documents that contain the search term. The (logarithmically compressed) IDF value is therefore higher the fewer documents contain the search term. Conversely, the IDF value decreases towards 1 if the search term is already used on a large number of pages.

The WDF * IDF formula shows that a relevant document is weighted more highly the less its combination of topics has been covered so far, since it then adds new and potentially useful information to the existing content. Correspondingly, documents that are also relevant to the search term and thus have a high WDF value, but essentially only repeat what has already been written in other documents, receive a lower IDF value and thus a lower overall WDF * IDF weighting. With a value close to 1, the IDF factor in the WDF * IDF equation hardly matters as a ranking factor.
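The combination can be sketched as follows. The text does not give an exact IDF formula, so this sketch assumes the variant log₂(1 + N / n), where N is the number of indexed documents and n the number of documents containing the term. This variant is chosen only because it approaches 1 when nearly every document contains the term, as described above; other variants and logarithm bases are in use. The names `wdf`, `idf` and `wdf_idf` and all sample numbers are hypothetical.

```python
import math


def wdf(freq: int, length: int) -> float:
    """WDF as defined above: log2(freq + 1) / log2(length)."""
    return math.log2(freq + 1) / math.log2(length)


def idf(total_docs: int, docs_with_term: int) -> float:
    """Assumed IDF variant: log2(1 + N / n). Tends to 1 as n approaches N."""
    if not 0 < docs_with_term <= total_docs:
        raise ValueError("need 0 < docs_with_term <= total_docs")
    return math.log2(1 + total_docs / docs_with_term)


def wdf_idf(freq: int, length: int, total_docs: int, docs_with_term: int) -> float:
    """Product of the two weights for one term in one document."""
    return wdf(freq, length) * idf(total_docs, docs_with_term)


# Same document (23 occurrences in 12,000 words), two hypothetical corpora:
rare = wdf_idf(23, 12000, total_docs=1_000_000, docs_with_term=100)
common = wdf_idf(23, 12000, total_docs=1_000_000, docs_with_term=1_000_000)

print(rare > common)             # True: the rarer term yields the higher weight
print(idf(1_000_000, 1_000_000))  # 1.0: the IDF factor no longer changes the product
```

In the second case the product equals the plain WDF value, which corresponds to the IDF factor being hardly significant for ranking.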

Literature

Harman, Donna: Ranking algorithms. In: William B. Frakes; Ricardo Baeza-Yates (Eds.): Information Retrieval: Data Structures & Algorithms. Upper Saddle River, NJ: Prentice Hall PTR, 1992, pp. 363-392.
Sparck Jones, Karen; Galliers, Julia R.: Evaluating Natural Language Processing Systems. Lecture Notes in Computer Science, Vol. 1083. Berlin: Springer, 1996.