Boosting

Figure: Classification into five classes. The classifier generated by boosting classifies into only two classes.

Boosting (English, roughly "amplifying") is an algorithm for automatic classification that merges several weak classifiers into a single good classifier.

The basic terms required for understanding are explained in the article Classification .

The idea of boosting was introduced by Robert Schapire in 1990.[1] In 1997, Yoav Freund and Schapire published the AdaBoost algorithm.[2] The name comes from the way the algorithm deals with the errors of the weak classifiers: it adapts to them ("adjusts adaptively").

Application areas

Boosting can be used wherever an automatic classification into two classes is required, for example to classify images of faces into “known” and “unknown” or to classify products on an assembly line as “OK” or “faulty”. The areas of application are almost as diverse as those of automatic classification itself.

Significance

Although there are far more sophisticated methods of designing classifiers, boosting is an acceptable alternative in many cases: the technique delivers usable results and can easily be implemented in a computer program that is economical in its memory requirements and fast at runtime.

How it works

A number of objects and a number of weak classifiers are given. We are looking for a classifier that divides the objects into two classes with as few errors as possible. Boosting combines the existing weak classifiers in such a way that the resulting new classifier makes as few errors as possible.

Weak classifiers, also called base classifiers or weak learners, have a very simple structure and usually take only a single feature of the objects into account. Taken on their own they therefore deliver poor results, but they can be evaluated very quickly. Boosting combines all the weak classifiers with weights in such a way that the stronger among them are given particular consideration, while the truly weak ones are ignored.

Basics

Given are a feature space M of arbitrary dimension and, within it, a training sample of size n, i.e. a set of pattern vectors x_1, …, x_n. For each of these pattern vectors it is known which class it belongs to, that is, for each x_i a value y_i ∈ {−1, +1} is given that indicates which of the two classes +1 or −1 the pattern vector belongs to. Furthermore, m primitive classifiers f_1, …, f_m are given, each of which splits the feature space into the two classes +1 and −1.

Sought are the weighting factors w_1, …, w_m of the classifier F, which is given via the sign function sgn by

F(x) = sgn( Σ_{s=1}^{m} w_s f_s(x) ).

The weighting factors should be optimized so that F makes as few errors as possible.

For the optimization, a so-called exponential loss function L, defined via the exponential function exp, can be used as the optimization criterion:

L(F) = Σ_{i=1}^{n} exp(−y_i F(x_i))

L(F) becomes smaller the fewer objects are incorrectly classified. The goal is therefore to choose the weighting factors w_1, …, w_m so that L(F) becomes minimal.
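
To make the two formulas concrete, here is a small numerical sketch in Python. The arrays weak_outputs, weights and labels are made-up placeholders for the values f_s(x_i), w_s and y_i; the snippet is only an illustration, not part of the original algorithm description.

```python
import numpy as np

# Hypothetical example: outputs f_s(x_i) of m = 3 weak classifiers on
# n = 4 pattern vectors (each entry is +1 or -1), their weighting
# factors w_s, and the true classes y_i.
weak_outputs = np.array([[+1, -1, +1, +1],
                         [+1, +1, -1, +1],
                         [-1, +1, +1, -1]])   # shape (m, n)
weights = np.array([0.7, 0.3, 0.5])
labels = np.array([+1, -1, +1, +1])

# Combined classifier: F(x_i) = sgn(sum over s of w_s * f_s(x_i))
F = np.sign(weights @ weak_outputs)

# Exponential loss: L(F) = sum over i of exp(-y_i * F(x_i))
L = np.exp(-labels * F).sum()

print(F)   # predicted classes of the four pattern vectors
print(L)   # becomes smaller the fewer objects are misclassified
```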

This optimization is carried out step by step over s = 1, …, m: first only w_1 is optimized, then w_2, then w_3, and so on, until all weighting factors are optimal. The optimization is explained in the next section.

Stepwise optimization

The stepwise optimization needs m passes to optimize all weighting factors of F. In each pass a classifier F_s is generated by adding one weak classifier to the previously generated classifier F_{s−1}. This means that the user can stop the computation after any pass if the intermediate result already meets his requirements.

Before each pass it is assessed which pattern vectors can be classified well with the classifier built so far and which cannot. Those pattern vectors that are not yet classified well receive particular attention in the next pass. For this purpose, n auxiliary variables t_{s,1}, …, t_{s,n} are needed in each pass s. The higher the value of t_{s,i}, the more strongly the pattern vector x_i enters into the current pass.

In the following, s denotes the number of the current pass:

1. Update weights.

In the first pass (s = 1) all auxiliary variables are set to the value 1/n: t_{1,1}, …, t_{1,n} := 1/n; thus all pattern vectors are considered equally in the first pass. In all following passes (s > 1) the auxiliary variables are set as follows:

t_{s,i} = t_{s−1,i} · exp(−w_{s−1} y_i f_{s−1}(x_i))

In this way all pattern vectors that were incorrectly classified by the weak classifier f_{s−1} considered in the previous pass receive a particularly high auxiliary weight in this pass, all others a particularly low one.

2. Determine weighted training error.

In this pass the weak classifier f_s is added. The weighted training error err_s is a measure of how badly this primitive classifier performs on its own. It sums, for every pattern vector x_i that f_s classifies incorrectly, the associated auxiliary variable t_{s,i}:

err_s = Σ_{i: f_s(x_i) ≠ y_i} t_{s,i} / Σ_{i=1}^{n} t_{s,i}

If the weighted training error is 0, then f_s classifies all pattern vectors correctly; if it is 1, f_s classifies everything incorrectly. If err_s = 1/2, then f_s classifies just as well as if it were merely guessing or flipping a coin for each pattern vector.

3. Optimize the next weighting factor.

The weighting factor w_s of the primitive classifier f_s used in this pass is determined from the following formula:

w_s = ½ · ln((1 − err_s) / err_s)

According to the formula, f_s is added to the final result with positive weight if and only if err_s < ½, that is, if the weak classifier is better than mere guessing. If exactly err_s = ½, then w_s = 0, that is, f_s is ignored. If, on the other hand, err_s > ½, the weak classifier is still quite useful; it is merely "wrongly polarized", that is, it classifies exactly the wrong way round. By adding it with a negative weight this formal error can be compensated and the inverted classifier added with a reinforcing effect.

4. Set up the intermediate result.

The intermediate result F_s is given by the formula:

F_s(x) = sgn( Σ_{r=1}^{s} w_r f_r(x) )

It is computed in exactly the same way as the actual goal F, except that instead of all m weak classifiers only the first s, which have already been optimized, are taken into account.

These steps are repeated in this order until all weak classifiers have been taken into account, i.e. s = m, or the user aborts the process (a code sketch of the complete loop is given below).
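
The four steps can be summarized in a compact sketch. The following Python code is only an illustration of the procedure described above, under simplifying assumptions: the pattern vectors are collected in a NumPy matrix X, and the weak classifiers are assumed to be given in advance as a list of functions that map such a matrix to a vector of +1/−1 decisions. The names boost, weak_classifiers and X are chosen for this example and do not come from the article.

```python
import numpy as np

def boost(X, y, weak_classifiers):
    """Stepwise optimization (steps 1-4 above), sketched for illustration.

    X: (n, d) matrix of pattern vectors
    y: array of n class labels in {-1, +1}
    weak_classifiers: list of m functions, each mapping X to n values in {-1, +1}
    """
    n = len(y)
    t = np.full(n, 1.0 / n)               # step 1, first pass: equal auxiliary weights
    weights = []
    for f in weak_classifiers:            # passes s = 1, ..., m
        pred = f(X)
        # Step 2: weighted training error of this weak classifier.
        err = t[pred != y].sum() / t.sum()
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard against log(0) / division by zero
        # Step 3: weighting factor; negative if the classifier is "wrongly polarized".
        w = 0.5 * np.log((1 - err) / err)
        weights.append(w)
        # Step 1 for the next pass: raise the weight of misclassified vectors.
        t = t * np.exp(-w * y * pred)

    # Step 4 / final result: F(x) = sgn(sum over s of w_s * f_s(x)).
    def F(X_new):
        votes = sum(w * f(X_new) for w, f in zip(weights, weak_classifiers))
        return np.sign(votes)
    return F
```

Combined with the decision stumps described in the next section as weak classifiers, such a sketch already yields a complete AdaBoost-like two-class classifier.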

Weak classifiers

Typical weak classifiers are so-called decision stumps. These functions compare the value of a single coordinate j with a threshold value l and base their decision for +1 or −1 on it. If x := (x_1, …, x_d) ∈ M is a pattern vector in the d-dimensional feature space M, such a primitive classifier f generally has the form:

f(x) = +1 if x_j ≥ l, and f(x) = −1 otherwise.

More precisely, f divides the feature space into two classes with a hyperplane perpendicular to the j-th coordinate axis.

The name alludes to the analogy with decision trees: the generated overall classifier F can be viewed as a decision tree. Each weak classifier is an inner node of this tree, from which a subtree (cf. English "stump", i.e. tree stump) hangs. The final classification in one of the leaves of the tree is reached as a sequence of binary decisions.

Such decision stumps are very popular as a basis for boosting because they are easy to handle and can be evaluated extremely quickly. In addition, they do not have to be specified from the start, but can be created while the algorithm is running.
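
As a small illustration, a decision stump of the form given above can be written in a few lines of Python; the helper make_stump and the example numbers are hypothetical and only demonstrate the comparison of a single coordinate with a threshold.

```python
import numpy as np

def make_stump(j, l):
    """Decision stump: compare coordinate j of each pattern vector with
    the threshold l and decide for +1 or -1."""
    return lambda X: np.where(X[:, j] >= l, 1, -1)

# Example: a stump on coordinate 0 with threshold 2.5
stump = make_stump(0, 2.5)
X = np.array([[1.0, 7.0],
              [3.0, 0.5],
              [4.2, 1.1]])
print(stump(X))   # -> [-1  1  1]
```

A list of such stumps can be passed directly as weak_classifiers to the boosting sketch above.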

Subtypes of boosting

See also

References

1. Robert Schapire: The strength of weak learnability. In: Machine Learning. Vol. 5, No. 2, 1990, pp. 197–227, doi:10.1007/BF00116037.
2. Yoav Freund, Robert E. Schapire: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In: Journal of Computer and System Sciences. Vol. 55, No. 1, 1997, pp. 119–139, doi:10.1006/jcss.1997.1504.