Markov spam filter

The Markov spam filter (named after Andrey Andreyevich Markov) is a spam filter based on a hidden Markov model and is a further development of the Bayesian spam filter. It calculates the probability that the word chains of the checked text match the word chains of typical spam texts. While a Bayesian spam filter calculates the probability from individual words, the Markov spam filter uses word chains and assigns a weight to each possible combination of words. If the word chains of the checked text are similar to those of typical spam texts, the checked text is classified as spam.

Example of weighting the possible combinations

Using the sentence "The quick brown fox jumps ..." as an example, the possible combinations and their weightings $2^{2N}$ in the Markov spam filter can be illustrated as follows, where N is the number of words present in the chain after the first word:

Word chain	Weighting	N
The	1	0
The quick	4	1
The <...> brown	4	1
The <...> <...> fox	4	1
The quick brown	16	2
The <...> brown fox	16	2
The quick <...> fox	16	2
The quick brown fox	64	3
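
The enumeration above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function name `weighted_chains` and the `SKIP` placeholder are hypothetical, the first word of the window is assumed to be always present, and trailing placeholders are dropped so that the chain with N = 0 consists of the first word alone.

```python
from itertools import product

SKIP = "<...>"  # placeholder for a skipped word (illustrative name)


def weighted_chains(window):
    """Return (chain, weight, n) for every chain that starts with window[0].

    n counts the words after the first one that are kept in the chain,
    and the weight is 2 ** (2 * n), as in the table.
    """
    first, rest = window[0], window[1:]
    rows = []
    for mask in product((False, True), repeat=len(rest)):
        n = sum(mask)
        kept = [word if keep else SKIP for word, keep in zip(rest, mask)]
        # Drop trailing placeholders: "The <...> <...>" is just "The".
        while kept and kept[-1] == SKIP:
            kept.pop()
        chain = " ".join([first] + kept)
        rows.append((chain, 2 ** (2 * n), n))
    # Order by n so the output follows the table's grouping.
    rows.sort(key=lambda row: row[2])
    return rows


if __name__ == "__main__":
    for chain, weight, n in weighted_chains(["The", "quick", "brown", "fox"]):
        print(f"{chain}\t{weight}\t{n}")
```

For a window of four words the sketch yields eight chains, one for each subset of the three words that follow "The".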

Formal representation of the probability calculation

While the Bayesian spam filter computes the local probability as

$$P_{\text{local}} = 0.5 + \frac{P_{\text{good}} - P_{\text{bad}}}{P_{\text{good}} + P_{\text{bad}} + 1}$$

the Markov spam filter uses

$$P_{\text{local}} = 0.5 + \frac{(P_{\text{good}} - P_{\text{bad}}) \cdot \text{weighting}}{(P_{\text{good}} + P_{\text{bad}} + 1) \cdot \text{weighting}_{\text{maximum}}}$$
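
As a minimal sketch, the two formulas can be written as a single Python function. The name `local_probability` and its parameter names are assumptions made for illustration, and the input values in the usage lines are made up. With `weight` equal to `max_weight` the expression reduces to the Bayesian form.

```python
def local_probability(p_good, p_bad, weight=1, max_weight=1):
    """Local probability following the formulas above.

    With weight == max_weight this is the Bayesian form; otherwise the
    deviation from 0.5 is scaled by weight / max_weight (Markov form).
    """
    numerator = (p_good - p_bad) * weight
    denominator = (p_good + p_bad + 1) * max_weight
    return 0.5 + numerator / denominator


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the effect of the weighting.
    print(local_probability(3, 1))                           # Bayesian form
    print(local_probability(3, 1, weight=16, max_weight=64)) # chain with N = 2
    print(local_probability(3, 1, weight=64, max_weight=64)) # full chain, N = 3
```

A chain with a lower weighting moves the result less far from the neutral value 0.5 than the full chain does, which is how longer matching chains gain more influence.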

Literature

Shalendra Chhabra, William S. Yerazunis, Christian Siefkes: Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas. In: Fourth IEEE International Conference on Data Mining (ICDM'04). 2004, pp. 347-350, doi:10.1109/ICDM.2004.10031.

Web links

CRM114 - the Controllable Regex Mutilator