Evidence Theory


The evidence theory of Dempster and Shafer, also called the theory of belief functions, is a mathematical theory from the field of probability theory. It is used to combine information from different sources by means of Dempster's rule of combination into an overall statement, whereby the credibility of these sources is taken into account in the calculation.

Presentation

In the 1960s, Arthur Pentland Dempster wrote various articles on a theory for combining evidence, e.g. Dempster (1967). Dempster's work was further developed by Glenn Shafer and finally published in 1976. The theory has since been known as the Dempster-Shafer theory of evidence.

An evidence can be regarded as an extension of a probability: instead of a one-dimensional measure, a two-dimensional one is used. It consists of the degree of belief, i.e. the degree to which the statement of a source is held to be true (mathematical symbol bel), and the plausibility of the event (mathematical symbol pl), that is, of a probability range with a lower bound bel and an upper bound pl.

Evidence theory is mainly used where uncertain statements from different sources have to be combined into one overall statement. It cannot be used in situations where the level of trust in a source cannot be quantified. However, there are applications, e.g. in pattern recognition, in which the statements of several unreliable algorithms can be combined by means of evidence theory in order to obtain a statement whose accuracy is better than that of each individual statement.

Illustrative example

Let us assume that the official weather report says that the weather will be fine tomorrow. Let us also assume that the weather report can be trusted to 80%, i.e. that in 80% of all cases the weather report provides a correct forecast. We can therefore assume with 80% that the weather will be fine tomorrow. But what about the other 20%? Using traditional probability theory, one would conclude that there is a 20% probability that the weather will not be nice tomorrow. This is where the theory of Dempster and Shafer differs from probability theory: in the remaining 20%, we do not know that the weather will not be nice, we know nothing at all. Since the forecast of the weather report is certainly correct in 80% of the cases, in the remaining 20% it is not certain; in these 20% the weather report's statement can be either correct or incorrect. The 20% is therefore not assigned to the complement of the hypothesis, but to the set of all hypotheses.

This difference between probability theory and the theory of Dempster and Shafer leads to further consequences. The theorem valid for a hypothesis $H$ in probability theory,

$P(H) + P(\neg H) = 1,$

no longer applies, and it is therefore no longer possible to use the probability that a hypothesis is correct to determine the probability that the hypothesis is incorrect. It is not enough, therefore, to express the belief for a hypothesis, combined in whatever way with the belief that the hypothesis is incorrect, by a single value. Instead of a single value, a probability range is used. The range is given by the lower bound $\operatorname{bel}(H)$ and the upper bound $\operatorname{pl}(H)$, i.e. by the probability with which the hypothesis can be assumed to be correct and the probability up to which it cannot be ruled out.

In the example above, the weather report leads to an evidence for good weather of bel(nice) = 0.8. There is no reason to assume that the weather will not be nice tomorrow, so bel(not nice) = 0 and therefore pl(nice) = 1 - bel(not nice) = 1. Based on the weather report, the probability of good weather lies in the range 0.8 to 1.0.
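A minimal Python sketch of this single-source interval, assuming the two-element hypothesis space {'nice', 'not nice'} from the example (the variable names are illustrative):

```python
# Hypothesis space: {'nice', 'not nice'}. The weather report (trusted 80%)
# supports 'nice'; the remaining 20% stays on the whole hypothesis space.
m = {frozenset({'nice'}): 0.8,
     frozenset({'nice', 'not nice'}): 0.2}

# bel: sum of masses of subsets of {'nice'}; pl: 1 minus belief in the complement
bel_nice = sum(v for s, v in m.items() if s <= frozenset({'nice'}))           # 0.8
pl_nice = 1.0 - sum(v for s, v in m.items() if s <= frozenset({'not nice'}))  # 1.0
print(bel_nice, pl_nice)  # the weather report alone gives the interval [0.8, 1.0]
```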

Combination of evidence

As before, the official weather report says that the weather should be nice tomorrow. A second source, for example a weatherman who can only be trusted to 60%, also thinks that the weather will be fine tomorrow. The following table shows how the two evidences can be combined:

                            | Weather report reliable (80%)  | Weather report unreliable (20%)
Weatherman reliable (60%)   | Nice weather: 80% * 60% = 48%  | Nice weather: 20% * 60% = 12%
Weatherman unreliable (40%) | Nice weather: 80% * 40% = 32%  | Weather indeterminate: 20% * 40% = 8%

In three of the four possible cases, the hypothesis is confirmed, because at least one of the two sources is reliable. It can therefore be assumed with 48% + 12% + 32% = 92% probability that the weather will be fine tomorrow. In the fourth case (8%), no statement can be made about the hypothesis. This now means bel = 0.92 and pl = 1.

In general, evidences $e_1, \dots, e_n$ that all support one hypothesis can be combined with:

$\operatorname{bel} = 1 - \prod_{i=1}^{n} (1 - e_i)$

and

$\operatorname{pl} = 1$

and accordingly evidences that all speak against a hypothesis with

$\operatorname{bel} = 0$

and

$\operatorname{pl} = \prod_{i=1}^{n} (1 - e_i)$
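A short Python sketch of these two combination rules, checked against the 92% from the table above (the function names are illustrative, and the opposing-evidence call only mirrors the same numbers for illustration):

```python
from math import prod

def combine_supporting(evidences):
    """Evidences that all support the hypothesis: bel grows, pl stays 1."""
    return 1.0 - prod(1.0 - e for e in evidences), 1.0

def combine_opposing(evidences):
    """Evidences that all speak against the hypothesis: bel stays 0, pl shrinks."""
    return 0.0, prod(1.0 - e for e in evidences)

print(combine_supporting([0.8, 0.6]))  # ~ (0.92, 1.0): weather report and weatherman
print(combine_opposing([0.8, 0.6]))    # ~ (0.0, 0.08): both sources speaking against
```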

If contradicting evidence is to be combined (suppose the weatherman now claims that the weather will not be nice), the following situation arises:

                            | Weather report reliable (80%)  | Weather report unreliable (20%)
Weatherman reliable (60%)   | Impossible!: 80% * 60% = 48%   | Weather not nice: 20% * 60% = 12%
Weatherman unreliable (40%) | Nice weather: 80% * 40% = 32%  | Weather indeterminate: 20% * 40% = 8%

The case in which both statements are true has now become impossible: if the two sources assert opposite things, they cannot both be reliable. According to Dempster and Shafer, all possible cases are rescaled so that the sum of their probabilities is again 1, i.e. all values for the possible cases are divided by the sum of all values for the possible cases. This results in:

                            | Weather report reliable (80%)  | Weather report unreliable (20%)
Weatherman reliable (60%)   | -                              | Weather not nice: approx. 23%
Weatherman unreliable (40%) | Nice weather: approx. 62%      | Weather indeterminate: approx. 15%

So one can assume with about 62% that the weather will be nice and with about 23% that the weather will not be nice. The probability range for nice weather is therefore bounded by the values 0.62 and 0.77 (1 - 0.23). Two contradicting evidences $e_p$ (for the hypothesis) and $e_c$ (against the hypothesis) can therefore be combined to:

$\operatorname{bel} = \frac{e_p \, (1 - e_c)}{1 - e_p \, e_c}$

and

$\operatorname{pl} = \frac{1 - e_c}{1 - e_p \, e_c}$
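A small Python sketch of this combination of one evidence for and one evidence against a hypothesis, reproducing the approximately 62% and 77% from the example (the function name is illustrative):

```python
def combine_contradicting(e_p, e_c):
    """One evidence e_p for and one evidence e_c against the hypothesis."""
    k = 1.0 - e_p * e_c                              # remove the impossible case, then rescale
    return e_p * (1.0 - e_c) / k, (1.0 - e_c) / k    # (bel, pl)

print(combine_contradicting(0.8, 0.6))  # ~ (0.615, 0.769), i.e. the 62% / 77% above
```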

Procedure for multiple hypotheses

In the previous examples, only the case of a single hypothesis was considered. In addition, two important conditions of the theory of Dempster and Shafer were not yet mentioned: the hypotheses must be mutually exclusive, and the hypothesis space must be complete, i.e. the correct hypothesis must be contained in the set of existing hypotheses. In the following, the procedure for the general case with several competing hypotheses is considered.

In the first example, the 20% in which the weather report is unreliable were not assigned to the complement of the hypothesis, but to the hypothesis itself together with its complement, since in these 20% the weather report can be right or wrong. If there are several hypotheses, the residual evidence, i.e. the portion of the evidence that cannot be assigned to any hypothesis, is assigned to the totality of the existing hypotheses (Θ). The assignment to the totality of the hypotheses makes sense because the hypothesis space is assumed to be complete.

If there are several hypotheses, then evidence for one hypothesis is also evidence against the other hypotheses, and vice versa. However, the evidence against the other hypotheses is, analogously to the residual evidence, not divided among the other hypotheses but assigned to the totality of all other hypotheses.

An example of a case with several hypotheses: Suppose a digitized image representing a character is to be classified. There are three hypotheses: the image can represent an 'A', an 'R' or a 'B', i.e. the hypothesis space consists of the set Θ = {'A', 'R', 'B'}. Let us also assume that one source provides evidence of 0.2 for 'R' and another provides evidence of 0.6 against 'A'. This leads to the residual evidences m1(Θ) = 1 - 0.2 = 0.8 and m2(Θ) = 1 - 0.6 = 0.4. The evidence against 'A' is evidence for 'R' or 'B': m2({'R','B'}) = 0.6. The following table summarizes the situation:

                  | m2({'R','B'}) = 0.6           | m2(Θ) = 0.4
m1({'R'}) = 0.2   | {'R'}: 0.2 * 0.6 = 0.12       | {'R'}: 0.2 * 0.4 = 0.08
m1(Θ) = 0.8       | {'R','B'}: 0.8 * 0.6 = 0.48   | Θ: 0.8 * 0.4 = 0.32
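The table can be reproduced with a small, generic Python sketch of Dempster's rule of combination over subsets of the hypothesis space (the function name dempster_combine is illustrative); mass falling on the empty set would be discarded and the rest rescaled, although no conflict occurs in this particular combination:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule: intersect the focal sets, multiply their masses,
    discard mass on the empty set and rescale the rest to sum to 1."""
    combined = {}
    for (s1, v1), (s2, v2) in product(m1.items(), m2.items()):
        s = s1 & s2
        if s:                                   # empty intersections are conflict
            combined[s] = combined.get(s, 0.0) + v1 * v2
    total = sum(combined.values())              # 1 - conflict
    return {s: v / total for s, v in combined.items()}

theta = frozenset({'A', 'R', 'B'})              # complete hypothesis space
m1 = {frozenset({'R'}): 0.2, theta: 0.8}        # evidence 0.2 for 'R'
m2 = {frozenset({'R', 'B'}): 0.6, theta: 0.4}   # evidence 0.6 against 'A'
print(dempster_combine(m1, m2))
# {'R'}: 0.2, {'R','B'}: 0.48, theta: 0.32
```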

The evidence of each intersection results from the product of the evidences of the two sets that are intersected. The probability range of a set of hypotheses $H$ is given by:

$\operatorname{bel}(H) = \sum_{X \subseteq H} m(X)$

and

$\operatorname{pl}(H) = 1 - \operatorname{bel}(\neg H)$

In words: the belief that the correct hypothesis lies in the set $H$ is given by the sum of the evidences of all subsets of $H$, and the plausibility by one minus the sum of the evidences of all subsets of the complement of $H$. For the example this leads to:

bel({'A'}) = 0 and pl({'A'}) = 1 - (0.2 + 0.48) = 0.32
bel({'R'}) = 0.2 and pl({'R'}) = 1 - 0 = 1
bel({'B'}) = 0 and pl({'B'}) = 1 - 0.2 = 0.8

So the belief interval for 'A' is 0 to 1 - 0.68, so 0 to 0.32. The one for 'R' is 0.2 to 1 - 0, i.e. 0.2 to 1; and that for 'B' is 0 to 1 - 0.2, so 0 to 0.8.
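A short Python sketch computing these belief intervals from the combined evidence of the example (the function names are illustrative):

```python
def bel(m, h):
    """Belief in the hypothesis set h: sum of the masses of all subsets of h."""
    return sum(v for s, v in m.items() if s <= h)

def pl(m, h, theta):
    """Plausibility of h: 1 minus the belief in its complement."""
    return 1.0 - bel(m, theta - h)

theta = frozenset({'A', 'R', 'B'})
# combined evidence from the table above
m = {frozenset({'R'}): 0.2, frozenset({'R', 'B'}): 0.48, theta: 0.32}
for h in ('A', 'R', 'B'):
    print(h, bel(m, frozenset({h})), pl(m, frozenset({h}), theta))
# A: [0, ~0.32]   R: [0.2, 1.0]   B: [0, 0.8]
```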

If further evidence is added, it is combined with the result above in the same way: the focal sets {'R'}, {'R','B'} and Θ (from the table above) are intersected with the focal sets of the new evidence, and the products of the corresponding evidences are formed. If the new evidence contradicts part of the previous result, some of these intersections are empty: there is then evidence for the empty set, which is not possible because the hypothesis space is assumed to be complete. As described above, the evidence of the possible (non-empty) sets is therefore rescaled so that it again sums to 1.
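A worked Python sketch of this rescaling. The concrete value of the further evidence used in the original table is not recoverable here, so the sketch assumes, purely for illustration, an additional evidence of 0.5 for 'A':

```python
# Assumption for illustration only: the further evidence is 0.5 for 'A'
# (not a value taken from the original article).
theta = frozenset({'A', 'R', 'B'})
m_old = {frozenset({'R'}): 0.2, frozenset({'R', 'B'}): 0.48, theta: 0.32}
m_new = {frozenset({'A'}): 0.5, theta: 0.5}

combined, conflict = {}, 0.0
for s1, v1 in m_old.items():
    for s2, v2 in m_new.items():
        s = s1 & s2
        if s:
            combined[s] = combined.get(s, 0.0) + v1 * v2
        else:
            conflict += v1 * v2          # mass that would fall on the empty set

# rescale the possible cases so that they again sum to 1
combined = {s: v / (1.0 - conflict) for s, v in combined.items()}
print(conflict, combined)
# conflict = 0.34; {'A'}: ~0.24, {'R'}: ~0.15, {'R','B'}: ~0.36, theta: ~0.24
```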

Calculation method

If there is a set of $n$ competing hypotheses, up to $2^n$ subsets (the elements of the power set of Θ) may have to be considered. If the evidences are combined in an arbitrary order, the effort of an algorithm becomes exponential. However, it can be shown (Lit.: Barnett 1981) that an algorithm with only linear effort results if the evidences are combined in a suitable order. The procedure is described below without proof.

For each of the competing hypotheses $H_i$, all evidences $e^{+}_{ij}$ for the hypothesis are combined:

$p_i = 1 - \prod_j \bigl(1 - e^{+}_{ij}\bigr)$

and accordingly all evidences $e^{-}_{ij}$ against the hypothesis into $c_i$. The collected evidence for and against the hypothesis can now be combined:

$f_i = \frac{p_i \, (1 - c_i)}{1 - p_i \, c_i}$

and

$a_i = \frac{c_i \, (1 - p_i)}{1 - p_i \, c_i}$

If there is only one hypothesis, then

$\operatorname{bel}(H_1) = f_1$

and

$\operatorname{pl}(H_1) = 1 - a_1$

If there are several hypotheses, it must be taken into account that every evidence for (against) one hypothesis also represents evidence against (for) all other hypotheses. This evidence is taken into account so that:

$\operatorname{bel}(H_i) = K \Bigl( f_i \, r_i + (1 - f_i - a_i) \prod_{j \neq i} a_j \Bigr)$

and

$\operatorname{pl}(H_i) = K \, (1 - a_i) \, r_i$

where

$r_i = \prod_{j \neq i} (1 - f_j)$

as well as

$K^{-1} = \sum_{i} f_i \, r_i + \prod_{j} (1 - f_j) - \prod_{j} a_j$

This algorithm can be used only if there is evidence for or against individual hypotheses, i.e. not if there is evidence for or against sets of hypotheses. In addition, the hypotheses must be mutually exclusive and the set of hypotheses must be complete.
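A minimal Python sketch of the procedure as reconstructed above, using the per-hypothesis quantities $f_i$ and $a_i$ and reproducing the weather example with the hypotheses 'nice' and 'not nice' (the function names are illustrative; the products over $j \neq i$ are written naively here, but could be obtained with linear effort from running products):

```python
from math import prod

def combine_pro_contra(pro, contra):
    """Combine all evidences for (pro) and against (contra) one hypothesis into
    the pair (f, a): f is the mass on the hypothesis itself, a the mass on its
    complement; the remainder 1 - f - a stays on the whole hypothesis space."""
    p = 1.0 - prod(1.0 - e for e in pro)      # combined evidence for
    c = 1.0 - prod(1.0 - e for e in contra)   # combined evidence against
    k = 1.0 - p * c                           # remove the conflict p*c, rescale
    return p * (1.0 - c) / k, c * (1.0 - p) / k

def belief_intervals(fa):
    """fa: list of (f_i, a_i) pairs, one per mutually exclusive hypothesis of a
    complete hypothesis set.  Returns one (bel, pl) interval per hypothesis."""
    n = len(fa)
    r = [prod(1.0 - f for j, (f, _) in enumerate(fa) if j != i) for i in range(n)]
    k_inv = (sum(f * r[i] for i, (f, _) in enumerate(fa))
             + prod(1.0 - f for f, _ in fa)
             - prod(a for _, a in fa))
    intervals = []
    for i, (f, a) in enumerate(fa):
        rest = prod(aj for j, (_, aj) in enumerate(fa) if j != i)
        bel = (f * r[i] + (1.0 - f - a) * rest) / k_inv
        pl = (1.0 - a) * r[i] / k_inv
        intervals.append((bel, pl))
    return intervals

# Weather example: hypothesis 1 = 'nice', hypothesis 2 = 'not nice'.
# The weather report (0.8) supports 'nice', the weatherman (0.6) 'not nice'.
fa = [combine_pro_contra([0.8], []), combine_pro_contra([0.6], [])]
print(belief_intervals(fa))   # ~ [(0.615, 0.769), (0.231, 0.385)]
```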

The last condition is often not fulfilled in practice, i.e. it is not guaranteed that the correct hypothesis is part of the solution set. As a solution to this problem, a further hypothesis can be added to the set of competing hypotheses which stands for all other possible hypotheses. This extension leads to a simplification of the algorithm described above if there is no evidence for or against this additional hypothesis, i.e. $f_{n+1} = a_{n+1} = 0$. This has the effect that all products in which $f_{n+1}$ or $a_{n+1}$ occurs become 0 and therefore do not have to be calculated.

Literature

  • Glenn Shafer: A Mathematical Theory of Evidence . Princeton University Press, 1976, ISBN 978-0-691-10042-5 .
  • Glenn Shafer: Dempster-Shafer Theory. In: Encyclopedia of Artificial Intelligence , Second Edition, Stuart C. Shapiro, editor. Wiley. 1992., pp. 330-331.
  • JA Barnett: Computational Methods for a Mathematical Theory of Evidence . In: Proc. IJCAI-81 . 1981, pp. 868-875.
  • Andreas Garzotto: Fully automatic recognition of characters in printed documents. Increased reliability by combining uncertain knowledge from competing sources . Dissertation, University of Zurich 1994
  • Jochen Heinsohn, Rolf Socher-Ambrosius: Knowledge Processing. An Introduction . Spektrum Akademischer Verlag, 2007, ISBN 978-3-8274-1844-9 .

References

  1. Dempster, A. P.: Upper and lower probabilities induced by a multivalued mapping . Ann. Math. Statist. 38 (1967), pp. 325-339
  2. Glenn Shafer: A Mathematical Theory of Evidence . Princeton University Press, 1976
  3. Andreas Garzotto: Fully automatic recognition of characters in printed documents. Increased reliability by combining uncertain knowledge from competing sources . Dissertation, University of Zurich 1994