Softmax Function

The softmax function is a generalization of the logistic function that normalizes an N-dimensional vector of arbitrary real values to values in the range (0, 1) that sum to 1. It is used to obtain the posterior distribution over classes for a given sample vector x with per-class weight vectors w_j:

p(y=j | x) = \frac{e^{x^\top w_j}}{\sum_{i=1}^{N} e^{x^\top w_i}}

The softmax function can be used as an activation function in artificial neural networks and is also used in softmax regression, where we are interested in multi-class classification as opposed to the binary classification of logistic regression. Below I describe the difference between softmax and the commonly used sigmoid function.
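As a concrete illustration, here is a minimal NumPy sketch of the posterior computation above. The sample x and the weight matrix W (one column per class) are made-up values, and the scores are shifted by their maximum before exponentiating, which is the same stabilization idea discussed further below.

```python
import numpy as np

def softmax_posterior(x, W):
    """Posterior p(y = j | x) for each class j, where column j of W is w_j."""
    scores = x @ W                              # x^T w_j for every class j
    exp_scores = np.exp(scores - scores.max())  # shift by the max for stability
    return exp_scores / exp_scores.sum()

# Hypothetical example: 3 features, 4 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(3, 4))
p = softmax_posterior(x, W)
print(p, p.sum())  # four probabilities in (0, 1) that sum to 1
```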

Sigmoid

The function: \sigma (x) = \frac{1}{1+e^{-x}} .

Its derivative: \frac{d \sigma (x)}{d x} = (1-\sigma(x)) \times \sigma (x) .
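A short sketch of both formulas, with a finite-difference check at an arbitrary point (the point x0 and step h are just illustrative choices):

```python
import numpy as np

def sigmoid(x):
    """Logistic function: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Finite-difference check at an arbitrary point.
x0, h = 0.5, 1e-6
numeric = (sigmoid(x0 + h) - sigmoid(x0 - h)) / (2 * h)
print(sigmoid_grad(x0), numeric)  # both ~0.2350
```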

The problem here is the exponential e^{-x}, which overflows for large negative x, even though the result of σ is restricted to the interval (0, 1). The solution: the sigmoid can be expressed in terms of tanh: \sigma (x) = \frac{1}{2} (1+\tanh (\frac{x}{2})) .
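To see the difference, here is a small sketch comparing the naive formula with the tanh form on a few sample inputs (the test values are arbitrary):

```python
import numpy as np

def sigmoid_naive(x):
    # exp(-x) overflows for large negative x (e.g. x = -1000 -> exp(1000))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_tanh(x):
    # sigma(x) = (1 + tanh(x / 2)) / 2 -- tanh saturates instead of overflowing
    return 0.5 * (1.0 + np.tanh(0.5 * x))

x = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])
print(sigmoid_naive(x))  # triggers an overflow RuntimeWarning at x = -1000
print(sigmoid_tanh(x))   # clean: [0.0, ~4.5e-05, 0.5, ~0.99995, 1.0]
```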

Softmax

Softmax, more generally defined as \eta (u_{i}) = \frac{e^{u_{i}}}{\sum_{j=1}^{J} e^{u_{j}} }  (where u is a vector of length J), is a little harder to handle than the sigmoid, both in understanding it and in taking its derivative. The key is to express softmax in terms of a log-sum-exp function: \log \eta (u_{i}) = u_{i} - \log \sum_{j=1}^{J} e^{u_{j}} .
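Here is a minimal sketch of that log-sum-exp formulation in NumPy. Subtracting max(u) before exponentiating is the standard stabilization, and the input vector u is a made-up example chosen so that a naive exp(u) would overflow:

```python
import numpy as np

def log_softmax(u):
    """log eta(u_i) = u_i - log sum_j exp(u_j), via the log-sum-exp trick."""
    m = u.max()
    lse = m + np.log(np.sum(np.exp(u - m)))  # stable log-sum-exp
    return u - lse

def softmax(u):
    return np.exp(log_softmax(u))

u = np.array([1000.0, 1001.0, 1002.0])  # naive exp(u) would overflow
print(softmax(u))  # ~[0.090, 0.245, 0.665]
```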

The partial derivative of softmax with respect to its own input is analogous to the sigmoid: \frac{\partial}{\partial{u_{i}}} \eta(u_{i}) = (1-\eta(u_{i}))\times \eta(u_{i}). (These are the diagonal entries of the full Jacobian; the cross terms are \frac{\partial}{\partial{u_{j}}} \eta(u_{i}) = -\eta(u_{i})\eta(u_{j}) for j \neq i.)
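For completeness, here is a small sketch that builds the full Jacobian and checks that its diagonal matches the sigmoid-like form above (the input u is arbitrary):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def softmax_jacobian(u):
    """J[i, j] = d eta(u_i) / d u_j = eta(u_i) * (delta_ij - eta(u_j))."""
    s = softmax(u)
    return np.diag(s) - np.outer(s, s)

u = np.array([0.5, -1.0, 2.0])  # arbitrary input
J = softmax_jacobian(u)
s = softmax(u)
print(np.allclose(np.diag(J), s * (1.0 - s)))  # True: diagonal matches the formula
```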
