The softmax function is a generalization of the logistic function that normalizes an N-dimensional vector of arbitrary real values to values in the range (0, 1) that sum to 1. It is used to obtain the posterior distribution of a class for a given sample vector x with weights w:

$$P(y = c \mid x; w) = \frac{e^{w_c^\top x}}{\sum_{c'=1}^{N} e^{w_{c'}^\top x}}$$
The softmax function can be used as an activation function for artificial neural networks and is also used in softmax regression, where we are interested in multi-class classification as opposed to the binary classification handled by logistic regression. Below I describe the difference between softmax and the commonly used sigmoid function.
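To make this concrete, here is a minimal NumPy sketch of the class posterior above; the function name, the example numbers, and the max-shift stabilization are my own illustrative choices, not part of the original text:

```python
import numpy as np

def softmax_posterior(x, W):
    """Class posteriors P(y = c | x) for a weight matrix W with one row per class."""
    scores = W @ x                 # one linear score w_c^T x per class
    scores -= scores.max()         # shift keeps exp() from overflowing (an addition, not from the text)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical example: 3 classes, 2 features
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.3],
              [-1.0,  0.8]])
x = np.array([0.4, 1.2])
print(softmax_posterior(x, W))     # three probabilities that sum to 1
```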
Sigmoid
The function: $\sigma(x) = \frac{1}{1 + e^{-x}}$.
Its derivative: $\frac{d\sigma}{dx} = \sigma(x)\,(1 - \sigma(x))$.
The problem here is the exponential $e^{-x}$, which quickly overflows for large negative x, even though the result of σ is restricted to the interval (0, 1). The solution: the sigmoid can be expressed in terms of tanh: $\sigma(x) = \tfrac{1}{2}\bigl(1 + \tanh\tfrac{x}{2}\bigr)$.
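As a quick illustration of why the tanh form helps, here is a small NumPy sketch; the function names and test values are hypothetical:

```python
import numpy as np

def sigmoid_naive(x):
    # Direct formula: np.exp(-x) overflows for very negative x (e.g. x = -1000)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_tanh(x):
    # Equivalent identity sigma(x) = 0.5 * (1 + tanh(x / 2)): tanh is bounded, so no overflow
    return 0.5 * (1.0 + np.tanh(0.5 * x))

x = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])
print(sigmoid_naive(x))   # triggers an overflow warning on the first entry
print(sigmoid_tanh(x))    # same values, no warning
```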
Softmax
Softmax, which is more generally defined as $\mathrm{softmax}(u)_i = \frac{e^{u_i}}{\sum_j e^{u_j}}$ (where u is a vector), is a little more complicated than the sigmoid function when it comes to understanding it and taking its derivative. So the key here is to express softmax in terms of a log-sum function:
$\mathrm{softmax}(u)_i = \exp\!\bigl(u_i - \log \sum_j e^{u_j}\bigr)$.
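The practical payoff of the log-sum form is a numerically stable softmax. Below is a minimal sketch under the usual max-shift assumption (the shift is my addition, not stated above):

```python
import numpy as np

def log_sum_exp(u):
    # Subtracting the max before exponentiating keeps exp() finite;
    # the shift m cancels since log(exp(m) * s) = m + log(s).
    m = u.max()
    return m + np.log(np.exp(u - m).sum())

def softmax(u):
    # softmax(u)_i = exp(u_i - log_sum_exp(u))
    return np.exp(u - log_sum_exp(u))

u = np.array([1000.0, 1001.0, 1002.0])
print(softmax(u))   # ~[0.090, 0.245, 0.665], no overflow despite the huge inputs
```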
The matrix of partial derivatives, i.e. the Jacobian, of softmax is analogous to the sigmoid's derivative: $\frac{\partial\, \mathrm{softmax}(u)_i}{\partial u_j} = \mathrm{softmax}(u)_i\,(\delta_{ij} - \mathrm{softmax}(u)_j)$, which for $i = j$ reduces to $\mathrm{softmax}(u)_i\,(1 - \mathrm{softmax}(u)_i)$, mirroring $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.
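For completeness, a short sketch of that Jacobian; treating the full matrix of partials (rather than only the diagonal) is my reading of the formula above:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())        # max-shift for stability
    return e / e.sum()

def softmax_jacobian(u):
    # J[i, j] = s_i * (delta_ij - s_j)  ==  diag(s) - outer(s, s)
    s = softmax(u)
    return np.diag(s) - np.outer(s, s)

u = np.array([0.5, 1.0, -0.3])
J = softmax_jacobian(u)
print(np.allclose(J, J.T))               # symmetric
print(np.allclose(J.sum(axis=1), 0.0))   # each row sums to 0 (probabilities sum to 1)
```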