Softmax function

In mathematics, the softmax function, also known as the normalized exponential function, is a generalization of the logistic function that transforms a K-dimensional vector $\mathbf{z}$ of real components into a K-dimensional vector $\sigma(\mathbf{z})$ of real components in the range (0, 1) whose components add up to 1. The function is given by:

   $\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$   for $j = 1, \ldots, K$.
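
A minimal sketch of this definition in Python, assuming NumPy; the function name softmax is chosen here for illustration and is not part of the source:

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) is a standard numerical-stability trick;
    # the shift cancels in the quotient, so the result is unchanged.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# The outputs lie in (0, 1) and sum to 1:
print(softmax([1.0, 2.0, 3.0]))  # -> [0.09003057 0.24472847 0.66524096]
```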

In probability theory, the output of the softmax function can be used to represent a categorical distribution, i.e. a probability distribution over K different possible outcomes. In fact, it is the gradient-log-normalizer of the categorical probability distribution; the softmax function is therefore the gradient of the LogSumExp function.
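
This relationship can be checked directly: differentiating LogSumExp with respect to a single component recovers the corresponding softmax output.

   $\mathrm{LSE}(\mathbf{z}) = \log \sum_{k=1}^{K} e^{z_k}, \qquad \frac{\partial\, \mathrm{LSE}(\mathbf{z})}{\partial z_j} = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} = \sigma(\mathbf{z})_j$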

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression), multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks. In multinomial logistic regression and linear discriminant analysis in particular, the input to the function is the result of K distinct linear functions, and the predicted probability for the j-th class given a sample vector $\mathbf{x}$ and weight vectors $\mathbf{w}_1, \ldots, \mathbf{w}_K$ is:

   $P(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T} \mathbf{w}_j}}{\sum_{k=1}^{K} e^{\mathbf{x}^\mathsf{T} \mathbf{w}_k}}$

This can be viewed as the composition of the K linear functions $\mathbf{x} \mapsto \mathbf{x}^\mathsf{T} \mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T} \mathbf{w}_K$ and the softmax function (where $\mathbf{x}^\mathsf{T} \mathbf{w}$ denotes the inner product of $\mathbf{x}$ and $\mathbf{w}$). The operation is equivalent to applying a linear operator defined by $\mathbf{w}$ to the vectors $\mathbf{x}$, thus transforming the original, possibly high-dimensional, input into vectors in a K-dimensional space.
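
A minimal sketch of this composition, again assuming NumPy; the matrices X and W and their dimensions are hypothetical, chosen purely for illustration:

```python
import numpy as np

def softmax(z):
    # Row-wise numerically stable softmax.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # 5 hypothetical sample vectors x with 4 features
W = rng.normal(size=(4, 3))  # one weight vector w_k per class, K = 3

logits = X @ W           # the K linear functions x^T w_k, one column per class
probs = softmax(logits)  # predicted probabilities P(y = j | x)
print(probs.sum(axis=1))  # each row sums to 1
```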
