# Bayesian classifier


A Bayesian classifier (pronunciation: [beɪz], named after the English mathematician Thomas Bayes) is a classifier derived from Bayes' theorem. It assigns each object to the class to which it most probably belongs, or to the class for which the classification incurs the least cost. Formally, it is a mathematical function that assigns a class to each point of a feature space.

To define the Bayesian classifier, a cost measure is needed that assigns a cost to every possible classification. The Bayesian classifier is precisely the classifier that minimizes the expected cost over all classifications. The cost measure is sometimes also called the risk function; one then says that the Bayesian classifier minimizes the risk of a wrong decision, and that it is defined by the minimum-risk criterion.

If a simple cost measure is used that only incurs a cost for wrong decisions, the Bayesian classifier minimizes the probability of a wrong decision. One then says that it is defined by the maximum a posteriori criterion.

Both forms assume that the probability that a point of the feature space belongs to a certain class is known, i.e. that each class is described by a probability density. In reality, however, these density functions are not known; they have to be estimated. For this purpose, one assumes a particular type of probability distribution for each class, usually a normal distribution, and tries to estimate its parameters from the available data.

In practice, the Bayesian classifier is more often used to evaluate other classifiers: some classes and their probability densities are designed artificially, a random sample is generated from this model, and the other classifier divides the objects in this sample into classes. The result is compared with the classification the Bayesian classifier would have produced. Since the Bayesian classifier is optimal in this setting, one obtains an estimate of how close the other classifier comes to the optimum. At the same time, the Bayesian classifier provides a lower bound on the error probability of all other classifiers in this scenario; they cannot be better than the optimal Bayesian classifier.
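This evaluation setup can be sketched in a few lines: two one-dimensional normal classes are designed artificially (the means, variance, and priors below are invented for illustration), a labelled sample is drawn from that model, and the error rate of the optimal Bayesian classifier on it is estimated. This estimated Bayes error is the lower bound that any other classifier on the same model would be compared against.

```python
import math
import random

random.seed(0)

# Two artificially designed classes: 1-D Gaussians with known parameters
# (invented for illustration).
MU = {0: -1.0, 1: 1.0}
SIGMA = 1.0
PRIOR = {0: 0.5, 1: 0.5}

def density(x, mu):
    # class-conditional normal density
    return math.exp(-(x - mu) ** 2 / (2 * SIGMA ** 2)) / (SIGMA * math.sqrt(2 * math.pi))

def bayes_classify(x):
    # Maximum a posteriori: pick the class with the larger posterior,
    # which is proportional to prior * class-conditional density.
    return max(PRIOR, key=lambda c: PRIOR[c] * density(x, MU[c]))

# Draw a labelled sample from the model and measure the Bayes error.
sample = [(c, random.gauss(MU[c], SIGMA)) for c in (0, 1) for _ in range(5000)]
errors = sum(1 for c, x in sample if bayes_classify(x) != c)
print(f"estimated Bayes error: {errors / len(sample):.3f}")
```

For these parameters the decision boundary lies at 0, and the true Bayes error is 1 − Φ(1) ≈ 0.159; the sample estimate should land near that value.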

## Naive Bayesian classifier

The naive Bayesian classifier is very popular because it can be computed quickly and achieves a good recognition rate. With the naive Bayesian classifier it is possible to determine the class (class attribute) to which an object belongs. It is based on Bayes' theorem. A naive Bayesian classifier can also be viewed as a star-shaped Bayesian network.

The naive basic assumption is that each attribute depends only on the class attribute. Although this is seldom the case in reality, naive Bayesian classifiers often achieve good results in practical applications as long as the attributes are not too strongly correlated.

In the case of strong dependencies between the attributes, it makes sense to extend the naive Bayesian classifier with a tree between the attributes. The result is called a tree-augmented naive Bayesian classifier.
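The naive assumption above can be sketched directly: the score of a class is its prior times the product of per-attribute conditional probabilities, each estimated by counting. The toy weather data and the add-one (Laplace) smoothing in the sketch are illustrative additions, not part of the text above.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    # class priors and per-(attribute, class) value counts
    priors = Counter(labels)
    cond = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return priors, cond

def classify(row, priors, cond, total):
    def score(c):
        p = priors[c] / total  # P(class)
        for i, v in enumerate(row):
            # naive assumption: attributes independent given the class;
            # add-one smoothing (assuming 2 values per attribute) avoids zeros
            p *= (cond[(i, c)][v] + 1) / (priors[c] + 2)
        return p
    return max(priors, key=score)

# invented toy data: (weather, temperature) -> activity
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cold")]
labels = ["out", "out", "home", "home"]
priors, cond = train(rows, labels)
print(classify(("sunny", "cold"), priors, cond, len(rows)))  # prints "out"
```

Even though ("sunny", "cold") was never observed, the smoothed per-attribute counts still yield a usable score for each class.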

## Mathematical definition

A Bayesian classifier $b$ is a function that maps vectors from the $f$-dimensional real-valued feature space to a set of classes $C$:

$b \colon \mathbb{R}^{f} \rightarrow C$

Usually one considers $C := \{0, 1\}$ or $C := \{-1, +1\}$ in the case of two classes, or $C := \{1, \dotsc, c\}$ if $c \geq 3$ classes are considered.

## Classification for normally distributed classes

If two classes are each described by a normal distribution, the decision boundary between them that results from the Bayesian classifier is quadratic. If the normal distributions additionally share the same covariance matrix, the decision boundary is even linear. In both cases the discriminant function has a particularly simple form, which makes the classification easy and efficient to compute.
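For illustration, a minimal sketch of the equal-covariance case in one dimension (means, variance, and priors invented): the log-posterior discriminant of each class is compared, and with equal variances and equal priors the decision boundary reduces to a single linear threshold at the midpoint of the means.

```python
import math

def g(x, mu, sigma, prior):
    # discriminant: log posterior up to an additive class-independent constant
    return -((x - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma) + math.log(prior)

mu1, mu2, sigma = 0.0, 4.0, 1.0  # invented class parameters, equal variance

def classify(x):
    # pick the class with the larger discriminant value
    return 1 if g(x, mu1, sigma, 0.5) >= g(x, mu2, sigma, 0.5) else 2

print(classify(1.9), classify(2.1))  # prints "1 2": threshold at (mu1 + mu2)/2 = 2
```

With unequal variances the quadratic terms in $x$ no longer cancel in $g_1(x) - g_2(x)$, which is exactly why the boundary becomes quadratic in the general case.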

## Example

In e-mail programs with (learning) naive Bayesian filters, spam e-mails are filtered out very efficiently. There are two classes of e-mail: spam and non-spam ($C = \{Spam, \overline{Spam}\}$). An e-mail consists of individual words $W_i$. From old, already classified e-mails, one can estimate the probability of each word $W_i$ appearing in a spam or non-spam e-mail, i.e.:

$P(W_i \mid Spam) = \dfrac{\text{number of spam e-mails containing the word } W_i}{\text{number of spam e-mails}}$

$P(W_i \mid \overline{Spam}) = \dfrac{\text{number of non-spam e-mails containing the word } W_i}{\text{number of non-spam e-mails}}$

The question to be answered for a new e-mail $W$ is: is the probability $P(Spam \mid W)$ greater or smaller than the probability $P(\overline{Spam} \mid W)$? If $P(Spam \mid W)$ is smaller, the new e-mail is classified as non-spam, otherwise as spam.

According to Bayes' theorem, the following holds for the probability $P(Spam \mid W)$:

$P(Spam \mid W) = \dfrac{P(Spam \cap W)}{P(W)} = \dfrac{P(W \mid Spam)\, P(Spam)}{P(W)}$.

$P(W)$ is the probability that the e-mail $W$ occurs. Since it enters $P(Spam \mid W)$ and $P(\overline{Spam} \mid W)$ with the same value, it can be neglected. This is why e-mail programs instead consider the expression

$Q = \dfrac{P(Spam \mid W)}{P(\overline{Spam} \mid W)} = \dfrac{P(W \mid Spam)\, P(Spam)}{P(W)} \cdot \dfrac{P(W)}{P(W \mid \overline{Spam})\, P(\overline{Spam})} = \dfrac{P(W \mid Spam)\, P(Spam)}{P(W \mid \overline{Spam})\, P(\overline{Spam})}$

If $Q$ is greater than 1, the e-mail is classified as spam, otherwise as non-spam. The probability that an e-mail is spam or non-spam at all can again be estimated from the old e-mails:

$P(Spam) = \dfrac{\text{number of spam e-mails}}{\text{number of all e-mails}}$ and

$P(\overline{Spam}) = \dfrac{\text{number of non-spam e-mails}}{\text{number of all e-mails}}$.

If the e-mail $W$ consists of the words $W_1, \dotsc, W_n$ and these words occur independently of one another, then

$P(W \mid Spam) = P(W_1 \cap \dotsb \cap W_n \mid Spam) = P(W_1 \mid Spam) \dotsm P(W_n \mid Spam)$.

The probabilities $P(W_i \mid Spam)$ were already given above, so the whole quotient can be calculated:

$Q = \dfrac{P(Spam \mid W)}{P(\overline{Spam} \mid W)} = \dfrac{P(W_1 \mid Spam) \dotsm P(W_n \mid Spam)\, P(Spam)}{P(W_1 \mid \overline{Spam}) \dotsm P(W_n \mid \overline{Spam})\, P(\overline{Spam})}$.
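The quotient $Q$ can be sketched with a handful of invented toy mails standing in for the archive of old, already classified e-mails. Add-one smoothing in the word probabilities is an addition not present in the formulas above, used so that a word unseen in one class does not zero out (or blow up) the product.

```python
# Invented toy training data: each mail is its set of words.
spam_mails = [{"viagra", "buy"}, {"viagra", "now"}, {"buy", "now"}]
ham_mails  = [{"meeting", "now"}, {"project", "buy"}, {"meeting", "project"}]

def p_word(word, mails):
    # P(W_i | class): fraction of mails of that class containing the word,
    # with add-one smoothing (an addition to the formula in the text)
    return (sum(word in m for m in mails) + 1) / (len(mails) + 2)

def quotient(words):
    # Q = P(Spam)/P(not Spam) * product of per-word likelihood ratios
    q = len(spam_mails) / len(ham_mails)
    for w in words:
        q *= p_word(w, spam_mails) / p_word(w, ham_mails)
    return q

print(quotient({"viagra", "now"}) > 1)       # True: classified as spam
print(quotient({"meeting", "project"}) > 1)  # False: classified as non-spam
```

In a real filter the products are usually computed as sums of logarithms to avoid numerical underflow with many words.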

Finally, three remarks:

1. In practice, an e-mail is only classified as spam if, for example, $Q > 10$, i.e. the probability of its being spam is much greater than that of being non-spam. The reason is that an e-mail classified as spam is usually moved automatically to a junk folder without the recipient ever seeing it again, which is fatal if the e-mail was wrongly classified as spam. One would then rather find the occasional spam mail in the inbox.
2. The filter is called a learning filter because the probabilities $P(W_i \mid Spam)$, $P(Spam)$, etc. change whenever new e-mails in the inbox are marked as junk.
3. Although the mathematical-statistical theory requires the independence of the words $W_i$, this is not fulfilled in practice; for example, the words Viagra and sex often occur together. Despite the violation of this requirement, naive Bayesian filters work very well in practice. The reason is that the exact probabilities $P(Spam \mid W)$ and $P(\overline{Spam} \mid W)$ are not needed at all; it only has to be ensured that one can correctly say which of the two is greater. This is why usually only about ten words of the e-mail are used for classification: the five with the highest probability of appearing in a spam e-mail and the five with the highest probability of appearing in a non-spam e-mail.
