Discriminant function

from Wikipedia, the free encyclopedia

A discriminant function or separating function is a function that assigns a score to each observation during discriminant analysis. The group membership of each observation and the boundaries between the groups are determined from this score. If observations with known group membership are given, the feature variables are combined into a single discriminant variable with as little loss of information as possible.

The best-known discriminant function is Fisher's discriminant, which realizes Fisher's criterion. It was developed by R. A. Fisher in 1936 and describes a metric that measures the quality of the separability of two classes in a feature space; he published it in The Use of Multiple Measurements in Taxonomic Problems (1936).

Introduction

Given are $d$-dimensional feature vectors, of which $N_1$ belong to the class $\omega_1$ and $N_2$ to the class $\omega_2$. A discriminant function describes the equation of a hyperplane that optimally separates the classes from one another. There are linear and non-linear discriminant functions, depending on the separability of the classes, as illustrated in two dimensions in the following figure.

Examples of linearly and non-linearly separable features in two-dimensional space

Example

Good (blue) and bad (red) borrowers in a bank.

The graphic on the right shows good (blue) and bad (red) credit customers of a bank. The income is shown on the x-axis and the customer's loan amount (in thousands of EUR) on the y-axis. The discriminant function is a linear combination of the two feature variables,

$d_i = b_0 + b_1 \cdot \text{income}_i + b_2 \cdot \text{loan amount}_i$,

with the estimated coefficients given in the coefficient table in the section on standardized discriminant coefficients below.

The parallel black lines running from the bottom left to the top right are the lines of constant score, $d_i = \text{const}$.

The values of the discriminant function for each observation are given below the data points. One can see that the bad customers have high scores on the discriminant function, while the good customers have low scores. A rule derived from this for new customers could be: assign a new customer to the bad customers if their score exceeds a suitable cut-off value, and to the good customers otherwise.
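As a small illustration, the decision rule can be written down as follows; the coefficients for income and loan amount are the unstandardized values from the coefficient table further below, while the intercept and the cut-off value are not specified above and are chosen here purely for illustration:

```python
# Sketch of how the scores from the example are turned into a decision rule.
# The coefficients for income and loan amount are the unstandardised values
# from the coefficient table further below; the intercept and the cut-off
# value are not given above and are purely illustrative.

B0, B_INCOME, B_LOAN = 0.0, 0.048, -0.007
CUTOFF = 1.0          # illustrative cut-off separating the two score ranges

def score(income, loan_amount):
    """Discriminant score d_i of a single customer."""
    return B0 + B_INCOME * income + B_LOAN * loan_amount

def classify(income, loan_amount):
    """Assign the customer to the group whose observed scores lie on the same
    side of the cut-off (in the example above, the bad customers are the
    group with the high scores)."""
    return "high-score group" if score(income, loan_amount) > CUTOFF else "low-score group"

# Income and loan amount (in thousands of EUR) of two illustrative customers
for income, loan in [(15, 25), (60, 10)]:
    print(income, loan, round(score(income, loan), 3), classify(income, loan))
```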

Linear discriminant function

As the introductory example shows, we are looking for a direction in the data along which the groups are separated from one another as well as possible. This direction is indicated in the graphic by the dashed line. The dashed and black lines that cross at the black point form a new, rotated coordinate system for the data.

Such rotations are described by linear combinations of the feature variables. The canonical linear discriminant function for $p$ feature variables is therefore given by

$d = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$

with $d$ the discriminant variable, $x_j$ the feature variables and $b_j$ the discriminant coefficients. The discriminant coefficients are calculated similarly to those in multiple linear regression; however, instead of a squared error, a discriminant measure (the quotient defined below) is optimized.
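A minimal vectorized sketch of this canonical form (the coefficient values and data are purely illustrative):

```python
import numpy as np

def canonical_discriminant(X, b0, b):
    """Canonical linear discriminant function d = b0 + b1*x1 + ... + bp*xp,
    evaluated row-wise for an (n, p) matrix of feature variables."""
    return b0 + X @ np.asarray(b, dtype=float)

# Three observations of two feature variables, illustrative coefficients
X = np.array([[1.0, 2.0],
              [3.0, 1.5],
              [2.0, 4.0]])
print(canonical_discriminant(X, b0=0.5, b=[0.8, -0.3]))
```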

Good (blue) and bad (red) borrowers and projected data points (light blue and light red) on the dashed line.

For each possible direction, the data points (red and blue points) are projected onto the dashed line (light blue and light red points). Then the group centers (for the light red and light blue points) and the overall mean (black point) are determined.

On the one hand, the distance of each light red or light blue point to its group center is determined, and these squared distances are added up to $SS_{\text{intra}}$ (intravariance, within scatter). The smaller $SS_{\text{intra}}$ is, the closer the projected points lie to their group centers.

On the other hand, for each light red or light blue point the distance between the associated group center and the overall center is determined, and these squared distances are added up to $SS_{\text{inter}}$ (intervariance, between scatter). The larger $SS_{\text{inter}}$ is, the further apart the group means lie.

Therefore, the direction in the data is chosen such that the quotient

$\frac{SS_{\text{inter}}}{SS_{\text{intra}}}$

is maximal. The larger this quotient is, the more clearly the groups are separated from one another.
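The following sketch illustrates this search for a direction on illustrative data: for a candidate direction, the within-group and between-group sums of squares of the projected points are computed and combined into the quotient; the direction with the largest quotient separates the groups best. All names and values in the code are illustrative.

```python
import numpy as np

def scatter_ratio(X, y, direction):
    """Ratio of between-group to within-group sum of squares of the data
    projected onto `direction` (larger = better separation)."""
    w = direction / np.linalg.norm(direction)
    z = X @ w                          # projected data points
    overall_mean = z.mean()
    ss_within, ss_between = 0.0, 0.0
    for g in np.unique(y):
        zg = z[y == g]
        ss_within += ((zg - zg.mean()) ** 2).sum()               # point to group centre
        ss_between += len(zg) * (zg.mean() - overall_mean) ** 2  # group centre to overall mean
    return ss_between / ss_within

# Illustrative data: two groups in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 1], 0.3, (20, 2)), rng.normal([3, 2], 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

for d in ([1.0, 0.0], [1.0, 0.5], [0.0, 1.0]):
    print(d, round(scatter_ratio(X, y, np.array(d)), 2))
```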

Fisher's criterion

The calculation of the optimal separating hyperplane is still relatively simple in two dimensions, but quickly becomes a complex problem in higher dimensions. Fisher therefore uses a trick: the dimension is first reduced and then the discriminant function is calculated. To do this, the data are projected onto a single dimension, where the direction of the projection is of crucial importance.

The classes are separated from each other much better when the feature vectors are projected onto a suitable direction than onto an unsuitable one.

In order to write this fact down formally, a few definitions are needed.

Let $\mathbf{m}_i$ denote the mean of the class $\omega_i$ and $\mathbf{m}$ the mean of the entire feature space.

The matrix

$S_W = \sum_{i} \sum_{\mathbf{x} \in \omega_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T$

is called the intravariance (English: within scatter) and measures the variance within the classes, while the intervariance (English: between scatter)

$S_B = \sum_{i} N_i \, (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T,$

with $N_i$ the number of feature vectors in class $\omega_i$, describes the variance between the classes. The most suitable projection direction is then obviously the one that minimizes the intravariance of the individual classes while maximizing the intervariance between the classes.

This idea is formulated mathematically by Fisher's criterion in the form of a Rayleigh quotient:

$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$

This criterion measures the quality of the separability of the classes in the feature space: the projection direction $\mathbf{w}$ is optimal (in the sense of the separability of the classes) when $J(\mathbf{w})$ is maximal.
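A minimal numerical sketch of this criterion, assuming two labelled classes and illustrative data: it computes $S_W$, $S_B$ and the Rayleigh quotient $J(\mathbf{w})$; the closed-form maximizer $\mathbf{w} \propto S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ used at the end is the standard two-class result and is not derived above.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class scatter S_W and between-class scatter S_B."""
    m = X.mean(axis=0)                       # overall mean
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for g in np.unique(y):
        Xg = X[y == g]
        mg = Xg.mean(axis=0)                 # class mean
        S_W += (Xg - mg).T @ (Xg - mg)       # sum of (x - m_g)(x - m_g)^T over the class
        S_B += len(Xg) * np.outer(mg - m, mg - m)
    return S_W, S_B

def fisher_criterion(w, S_W, S_B):
    """Rayleigh quotient J(w) = (w^T S_B w) / (w^T S_W w)."""
    return (w @ S_B @ w) / (w @ S_W @ w)

# Two-class example: the maximiser of J is proportional to S_W^{-1} (m_1 - m_2)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 1], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
S_W, S_B = scatter_matrices(X, y)
m1, m2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
w_opt = np.linalg.solve(S_W, m1 - m2)
print("J(w_opt) =", fisher_criterion(w_opt, S_W, S_B))
print("J(e_1)   =", fisher_criterion(np.array([1.0, 0.0]), S_W, S_B))
```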

These explanations already show that Fisher's criterion can be extended not only to a discriminant function, but also to an optimization method for feature spaces. In the latter case, a projection method is conceivable that, similar to principal component analysis, projects a high-dimensional feature space into a lower dimension and at the same time separates the classes from one another as well as possible.
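Such a class-separating projection is not worked out above; one common realization (sketched here under the assumption that $S_W$ is invertible) keeps the leading eigenvectors of $S_W^{-1} S_B$ as projection directions, analogous to keeping the leading principal components in principal component analysis:

```python
import numpy as np

def lda_projection(X, y, n_components):
    """Projection directions that maximise between-class relative to
    within-class scatter: eigenvectors of S_W^{-1} S_B with the largest
    eigenvalues (at most n_classes - 1 useful directions)."""
    d = X.shape[1]
    m = X.mean(axis=0)
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for g in np.unique(y):
        Xg = X[y == g]
        mg = Xg.mean(axis=0)
        S_W += (Xg - mg).T @ (Xg - mg)
        S_B += len(Xg) * np.outer(mg - m, mg - m)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs.real[:, order[:n_components]]
    return X @ W          # data projected into the lower-dimensional space

# Example: project 4-dimensional data with 3 classes onto 2 directions
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 1.0, (30, 4))
               for c in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0])])
y = np.repeat([0, 1, 2], 30)
print(lda_projection(X, y, n_components=2).shape)   # (90, 2)
```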

Fisher's discriminant function

A discriminant function assigns objects to the respective classes. With Fisher's criterion, the optimal projection direction, more precisely the normal vector of the optimally separating hyperplane, can be determined. It then only remains to test, for each object, on which side of the hyperplane it lies.

For this purpose, the respective object $\mathbf{x}$ is first projected onto the optimal projection direction $\mathbf{w}$. Then the distance to the origin is compared with a previously determined threshold value $c$. Fisher's discriminant function therefore has the following form:

$d(\mathbf{x}) = \begin{cases} \omega_1, & \text{if } \mathbf{w}^T \mathbf{x} > c, \\ \omega_2, & \text{if } \mathbf{w}^T \mathbf{x} < c. \end{cases}$

A new object is thus assigned either to $\omega_1$ or to $\omega_2$, depending on the result. In the borderline case $\mathbf{w}^T \mathbf{x} = c$, it depends on the application which of the two classes is assigned.
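A minimal sketch of such a classifier for two classes; the choice of the threshold $c$ as the projected midpoint of the two class means is a common convention and an assumption here, not something prescribed above:

```python
import numpy as np

def fit_fisher(X, y):
    """Projection direction w ~ S_W^{-1}(m1 - m2) and a threshold c.
    The threshold chosen here (projected midpoint of the class means) is a
    common convention, not the only possible choice."""
    X1, X2 = X[y == 0], X[y == 1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m1 - m2)
    c = w @ (m1 + m2) / 2
    return w, c

def fisher_classify(x, w, c):
    """Class 0 if the projection w.x lies above the threshold, otherwise class 1.
    The borderline case w.x == c has to be resolved by convention."""
    return 0 if w @ x > c else 1

# Illustrative use with two Gaussian classes in two dimensions
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, (40, 2)), rng.normal([3, 1], 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
w, c = fit_fisher(X, y)
print(fisher_classify(np.array([0.2, -0.1]), w, c))   # most likely 0 (near the first class mean)
```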

Number of discriminant functions

To separate $k$ classes, a maximum of $k - 1$ discriminant functions can be formed that are orthogonal (that is, mutually perpendicular or uncorrelated) to one another. The number of discriminant functions can also not be greater than the number of feature variables $p$ that are used to separate the classes or groups:

number of discriminant functions $\le \min(k - 1,\, p)$.
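For example (numbers chosen purely for illustration): with $k = 3$ groups and $p = 5$ feature variables, at most

$\min(k - 1,\, p) = \min(2,\, 5) = 2$

discriminant functions can be formed.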

Standardized discriminant coefficients

As with linear regression, the standardized discriminant coefficients can be used to find out which feature variables have the greatest influence on the discriminant variable. For this purpose, the feature variables are standardized:

$x^*_{ij} = \frac{x_{ij} - \bar{x}_j}{s_{x_j}}$

with $\bar{x}_j$ the arithmetic mean and $s_{x_j}$ the standard deviation of the feature variable $x_j$. The coefficients are then recalculated for the standardized feature variables, and the relation

$b^*_j = b_j \cdot s_{x_j}$

holds between the original coefficients $b_j$ and the standardized coefficients $b^*_j$.
Variable       Coefficient    Standardized coefficient
Income         0.048          1.038
Loan amount    −0.007         −1.107

If one of the standardized coefficients in the example were close to zero, the discriminant function could be simplified by omitting that feature variable, at the cost of only slightly lower discriminatory power.
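A short sketch of the relation $b^*_j = b_j \cdot s_{x_j}$ as written above, applied to the two coefficient columns of the table; the implied standard deviations are back-calculated purely for illustration, and the data values used for standardization are made up:

```python
import numpy as np

# Relation between unstandardised and standardised discriminant coefficients,
# b_j* = b_j * s_j, applied to the two columns of the table above. The implied
# standard deviations are back-calculated purely for illustration.
b = np.array([0.048, -0.007])       # income, loan amount
b_std = np.array([1.038, -1.107])   # standardised coefficients from the table
s = b_std / b
print(s)                            # roughly [21.6, 158.1]

# Standardising an (illustrative) feature column before refitting:
income = np.array([20.0, 35.0, 60.0, 45.0])
income_std = (income - income.mean()) / income.std(ddof=1)
print(income_std)
```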

Example

A simple cuboid classifier is designed to determine, based on a person's age, whether or not that person is a teenager. The discriminant function is

$f(\text{age}) = \begin{cases} \text{teenager}, & \text{if } 13 \le \text{age} \le 19, \\ \text{not a teenager}, & \text{otherwise.} \end{cases}$

Since the feature space is one-dimensional (only age is used for classification), the separating surfaces are the points $\text{age} = 13$ and $\text{age} = 19$. In this case, it must be agreed that the separating surfaces also belong to the “teenager” class.
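A one-line sketch of this rule, using the boundaries 13 and 19 as reconstructed above and the convention that the boundary points count as teenagers:

```python
def is_teenager(age):
    """Cuboid classifier on the one-dimensional feature 'age': the separating
    points 13 and 19 count, by the convention in the example, as teenagers."""
    return 13 <= age <= 19

for age in (12, 13, 16, 19, 20):
    print(age, is_teenager(age))
```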

References

  1. ^ Backhaus, K., Erichson, B., Plinke, W., Weiber, R. (2008). Multivariate analysis methods. An application-oriented introduction. Springer, Berlin, p. 200. ISBN 978-3-540-85044-1

Literature

  • R. Kraft: Discriminant Analysis. (PDF; 99 kB) Technical University of Munich-Weihenstephan, June 8, 2000, accessed on October 24, 2012.
  • Christopher M. Bishop, Neural Networks for Pattern Recognition , Oxford University Press, 1995.
  • Richard O. Duda and Peter E. Hart, Pattern Classification and Scene Analysis , Wiley-Interscience Publication, 1974.
  • Keinosuke Fukunaga, Introduction to Statistical Pattern Recognition , Academic Press, 1990.