Softmax Preserves Order, Is Translation Invariant But Not Invariant Under Scaling.#

Twitter Handle LinkedIn Profile GitHub Profile Tag Tag

Problem Formulation#

Consider a classification problem with a dataset \(\mathcal{S}\) with \(N\) samples, defined formally below,

\[ \mathcal{S} \overset{\mathrm{iid}}{\sim} \mathcal{D} = \left \{ \left(\mathbf{x}^{(n)}, y^{(n)} \right) \right \}_{n=1}^N \in \left(\mathcal{X}, \mathcal{Y} \right)^N \]

where

  • \(\mathcal{X}\) is the input space, \(\mathcal{X} \subseteq \mathbb{R}^{D}\),

  • \(\mathcal{Y}\) is the output space, \(\mathcal{Y} \subseteq \mathbb{Z}\),

  • \(\mathbf{x}^{(n)} \in \mathcal{X}\) is the input feature vector of the \(n\)-th sample,

  • \(y^{(n)} \in \mathcal{Y}\) is the output label of the \(n\)-th sample,

  • \(\mathcal{D}\) is the underlying, true but unknown, data distribution,

  • \(\mathcal{S}\) is the dataset, a finite sample drawn \(\mathrm{iid}\) (the independent and identically distributed assumption) from \(\mathcal{D}\).

Given the classification setuo, the goal is to define a functional form \(f(\cdot)\) that maps an input vector \(\mathbf{x}\) to a predicted output label \(y\). This mapping is typically represented by a function \(f: \mathcal{X} \rightarrow \mathcal{Y}\), which takes an input \(\mathbf{x}\) from the input space \(\mathcal{X}\) and predicts an output label \(y\) from the output space \(\mathcal{Y}\). Consequently, for a classification problem with \(K\) discrete classes \(\mathcal{C}_{k}\), where \(k=1 , \ldots, K\), the output space \(\mathcal{Y}\) corresponds to the set of these \(K\) classes, often represented as \(\{1, 2, \ldots, K\}\) or \(\{0, 1, \ldots, K-1\}\), depending on the indexing convention. Therefore, \(y \in \mathcal{Y}\) is the class label assigned to the input vector \(\mathbf{x}\) [Bishop, 2007].

Functional Form of \(f\)#

The function \(f\) can be explicitly defined in terms of parameters that need to be learned from the data \(\mathcal{S}\). A common approach in machine learning is to use a parameterized model \(f_{\boldsymbol{\theta}}\), where \(\boldsymbol{\theta} \in \boldsymbol{\Theta}\) denotes the parameters of the model. This model aims to approximate the true but unknown function that maps inputs to outputs in the underlying data distribution \(\mathcal{D}\).

For a given input vector \(\mathbf{x}^{(n)}\), the functional form of the model can be expressed as:

\[ y^{(n)} = f_{\boldsymbol{\theta}}\left(\mathbf{x}^{(n)}\right) \]

and the functional form of the estimated label \(\hat{y}^{(n)}\) can be expressed as:

\[ \hat{y}^{(n)} = f_{\hat{\boldsymbol{\theta}}}\left(\mathbf{x}^{(n)}\right) \]

For simplicity, we would consider the family of linear models, where the functional form can be expressed as:

\[ f_{\boldsymbol{\theta}}\left(\mathbf{x}\right) = \boldsymbol{\theta}^{\top} \mathbf{x} + b \]

where \(\boldsymbol{\theta} \in \mathbb{R}^{D}\) is the weight vector, \(b \in \mathbb{R}\) is the bias term, and \(\boldsymbol{\theta} = \{\boldsymbol{\theta}, b\}\) are the parameters of the model. We then seek \(\hat{\boldsymbol{\theta}}\) that minimizes the discrepancy between the predicted label \(\hat{y}^{(n)}\) and the true label \(y^{(n)}\) for all samples in the dataset \(\mathcal{S}\) via an optimization procedure (learning algorithm) \(\mathcal{A}\) over some loss function \(\mathcal{L}\).

It is also worth noting the functional form is usually extended to include an activation (inverse is the link function in statistics theory) function \(\sigma(\cdot)\), which introduces non-linearity to the model, and more importantly, ensures the output of the model is a valid probability distribution over the classes (note ensuring the output is a valid probability distribution does not equate to the model being well-calibrated). In our context, we would treat the activation function as the softmax function and is only applied to the output layer of the model.

\[ f_{\boldsymbol{\theta}}\left(\mathbf{x}\right) = \sigma\left(\underbrace{\boldsymbol{\theta}^{\top} \mathbf{x} + b}_{\mathbf{z}}\right) \]

In the setting of multiclass classification, it is common to represent the predicted \(\hat{y}\) as a \(K\)-dimensional vector, where \(K\) is the number of classes. And thus, the output of the model can be expressed as:

\[ \hat{\boldsymbol{y}} := f_{\boldsymbol{\theta}}\left(\mathbf{x}\right) = \begin{bmatrix} \hat{y}_1 & \hat{y}_2 & \cdots & \hat{y}_K \end{bmatrix} \]

with \(\hat{y}_k = p\left(y = \mathcal{C}_k \mid \mathbf{x}\right)\), the probability of the input vector \(\mathbf{x}\) belonging to class \(\mathcal{C}_k\).

To this end, it would be incomplete not to mention the probabilistic interpretation of the classification problem. We state very briefly in the next section. For a more rigorous treatment, we refer the reader to chapter 4 of Pattern Recognition and Machine Learning by Bishop [Bishop, 2007].

Probabilistic Interpretation#

We can also divide the classification to either discriminative or generative models. Discriminative models learn the decision boundary directly from the data, while generative models learn the joint distribution of the input and output, and then use Bayes’ rule to infer the decision boundary.

For example, if we look through the multiclass problem through the lens of generative models, our goal is basically to find \(p\left(\mathcal{C}_k \mid \mathbf{x}\right)\). One approach to determine this conditional probability is to adopt a generative approach in which we model the class-conditional densities given by \(p\left(\mathbf{x} \mid \mathcal{C}_k\right)\), together with the prior probabilities \(p\left(\mathcal{C}_k\right)\) for the classes, and then we compute the required posterior probabilities using Bayes’ theorem [Bishop, 2007],

\[ p\left(\mathcal{C}_k \mid \mathbf{x} ; \boldsymbol{\theta} \right)=\frac{p\left(\mathbf{x} \mid \mathcal{C}_k\right) p\left(\mathcal{C}_k\right)}{p(\mathbf{x})} \]

Unnormalized Logits#

At this juncture, we would take a look at what happens before the activation function, in our case, the softmax function is applied. The linear projection layer, often the head/output layer, of the model \(f_{\boldsymbol{\theta}}\left(\mathbf{x}\right)\) (pre-softmax), yields what we often referred to as the logits or unnormalized scores. We often denote the logits as \(\mathbf{z}\), and it is a \(K\)-dimensional vector, where \(K\) is the number of classes.

\[ \mathbf{z} \in \mathbb{R}^{K} = \boldsymbol{\theta}^{\top} \mathbf{x} + b \]

where each element \(z_k\) of the vector \(\mathbf{z}\) is the unnormalized score of the \(k\)-th class.

Since the logits are unnormalized and unbounded, it follows that we need to induce the model to produce a valid probability distribution over the classes. This is where the softmax function comes into play.

Enough is Enough

I don’t even know enough to continue the theoretical blabbering. I would stop here and transit to the highlight and regurgitate the key points of the softmax function.

Softmax Function#

The softmax function (a vector function) takes as input a vector \(\mathbf{z} = \begin{bmatrix} z_1 & z_2 & \cdots & z_K \end{bmatrix}^{\top} \in \mathbb{R}^{K}\) of \(K\) real numbers, and normalizes it into a probability distribution consisting of \(K\) probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1 ; but after applying softmax, each component will be in the interval \((0,1)\), and the components will add up to \(1\), so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

For a vector \(\mathbf{z} \in \mathbb{R}^{K}\) of \(K\) real numbers, the standard (unit) softmax function \(\sigma: \mathbb{R}^K \mapsto(0,1)^K\), where \(K \geq 1\), is defined by the formula

\[ \sigma(\mathbf{z})_j=\frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \text { for } j=1, \ldots, K \]

More concretely, we can write explicitly the mapping as:

\[\begin{split} \begin{aligned} \sigma(\cdot): \mathbb{R}^{K} & \rightarrow(0,1)^{K} \\ \mathbf{z} & \mapsto \begin{bmatrix} \sigma(\mathbf{z})_1 & \sigma(\mathbf{z})_2 & \cdots & \sigma(\mathbf{z})_K \end{bmatrix}^{\top} \\ \begin{bmatrix} z_1 & z_2 & \cdots & z_K \end{bmatrix}^{\top} & \mapsto \begin{bmatrix} \frac{e^{z_1}}{\sum_{k=1}^K e^{z_k}} & \frac{e^{z_2}}{\sum_{k=1}^K e^{z_k}} & \cdots & \frac{e^{z_K}}{\sum_{k=1}^K e^{z_k}} \end{bmatrix}^{\top} \end{aligned} \end{split}\]

In particular, for any element \(z_j\) of the input vector \(\mathbf{z}\), the softmax function computes the exponential of the element and normalizes it by the sum of the exponentials of all the elements in the input vector.

Softmax Obeys The Three Axioms of Probability (Kolmogorov Axioms)#

First of all, softmax induces a valid probability distribution over the classes. Why so? We first need to understand the three axioms of probability, also known as the Kolmogorov axioms.

Consider the probability space defined over the triplet \(\left(\Omega, \mathcal{F}, \mathbb{P}\right)\), where \(\Omega\) is the sample space, \(\mathcal{F}\) is the sigma-algebra event space (collection of events), and \(\mathbb{P}\) is the probability function.

A probability function \(\mathbb{P}\) defined over the probability space must satisfy the three axioms below. Recall that the probability function in a well defined experiment is a function \(\mathbb{P}: \mathcal{F} \to [0, 1]\). Informally, for any event \(A\), \(\mathbb{P}(A)\) is defined as the probability of event \(A\) happening.

This probability function/law \(\mathbb{P}(A)\) must satisfy the following three axioms:

Axiom 1 (Non-Negativity)

\(\mathbb{P}(A) \geq 0\) for any event \(A \subseteq \S\).

Axiom 2 (Normalization)

\(\sum_{i=1}^{n}\mathbb{P}(A_i) = 1\) where \(A_i\) are all possible outcomes for \(i = 1, 2,..., n\).

Axiom 3 (Additivity)

Given a countable sequence of disjoint events \(A_1, A_2, ..., A_n,... \subset \S\), we have

\[ \mathbb{P}\left(\bigsqcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty}\mathbb{P}[A_i] \]

Non-Negativity#

To satisfy the first axiom, we need to show that \(\sigma(\mathbf{z})_j \geq 0\) for all \(j\).

It is easy to see that the exponential function \(e^{z_j}\) is always positive for any real number \(z_j\). Therefore, \(e^{z_j} > 0\) for all \(i\). The denominator is a sum of positive terms, thus also positive. A positive number divided by another positive number is positive, hence \(\sigma(\mathbf{z})_j > 0\) for all \(i\). This satisfies the non-negativity axiom.

Normalization#

To satisfy the second axiom, we need to prove that the sum of the softmax function outputs equals 1, i.e., \(\sum_{k=1}^{K} \sigma(\mathbf{z})_k = 1\).

By definition of softmax, for any element \(z_j\) of the input vector \(\mathbf{z}\), we have \(\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}\), and summing over all \(K\) classes, we have

\[\begin{split} \begin{aligned} \sum_{j=1}^{K} \sigma(\mathbf{z})_j & = \sum_{j=1}^{K} \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \\ & = \frac{\sum_{j=1}^{K} e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \\ & = 1 \end{aligned} \end{split}\]

This shows that the sum of the outputs of the softmax function equals 1, satisfying the normalization axiom.

Additivity#

This is evident because if we treat each instance \(z_j\) as a disjoint event since each \(z_j\) can only belong to one class, then the events (or classes) form a countable sequence of disjoint events.

Calibration#

While softmax ensures that the output of the model is a valid probability distribution over the classes, it does not guarantee that the model is well-calibrated. A well-calibrated model is one that produces predicted probabilities that are close to the true probabilities.

Implementation#

Here we show a simple implementation of the softmax function in PyTorch.

 1class Softmax:
 2    def __init__(self, dim: int | None = None) -> None:
 3        """
 4        Initialize the softmax function.
 5        """
 6        self.dim = dim
 7
 8    def __call__(self, z: torch.Tensor) -> torch.Tensor:
 9        """
10        Compute the softmax function for a given input.
11        """
12        numerator = torch.exp(z)
13        denominator = torch.sum(numerator, dim=self.dim, keepdim=True)
14        g = numerator / denominator
15        return g

We do a rough comparison of the softmax function implemented in PyTorch with the readily available implementation in PyTorch.

 1z = torch.randn((2, 5), requires_grad=True, dtype=torch.float32)
 2pytorch_softmax = nn.Softmax(dim=1)
 3pytorch_softmax_probs = pytorch_softmax(z)
 4pprint(pytorch_softmax_probs)
 5
 6my_softmax = Softmax(dim=1)
 7my_softmax_probs = my_softmax(z)
 8pprint(my_softmax_probs)
 9
10torch.testing.assert_close(
11    pytorch_softmax_probs, my_softmax_probs, rtol=1.3e-6, atol=1e-5, msg="Softmax function outputs do not match."
12)
tensor([[0.1661, 0.1143, 0.3998, 0.1204, 0.1994],
│   │   [0.1120, 0.1534, 0.0920, 0.1486, 0.4940]], grad_fn=<SoftmaxBackward0>)
tensor([[0.1661, 0.1143, 0.3998, 0.1204, 0.1994],
│   │   [0.1120, 0.1534, 0.0920, 0.1486, 0.4940]], grad_fn=<DivBackward0>)

Softmax Preserves Order (Monotonicity)#

Order preservation, in the context of mathematical functions, refers to a property where the function maintains the relative ordering of its input elements in the output. Specifically, for a function \(f\) to be considered order-preserving, the following condition must hold:

Given any two elements \(a\) and \(b\) in the domain of \(f\), if \(a > b\), then \(f(a) > f(b)\).

For the softmax function, which is defined for a vector \(\mathbf{z} = \begin{bmatrix} z_1 & z_2 & \cdots & z_K \end{bmatrix}^{\top}\) of \(K\) real numbers as:

\[ \sigma(\mathbf{z})_j=\frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \text { for } j=1, \ldots, K \]

and for it to be order-preserving, it must satisfy the condition that for any two elements \(z_i\) and \(z_j\) in the input vector \(\mathbf{z}\), if \(z_i > z_j\), then \(\sigma(\mathbf{z})_i > \sigma(\mathbf{z})_j\). The proof is relatively simple, since the exponential function is monotonically increasing.

We can show one example in code:

 1logits = torch.tensor([[2.0, 1.0, 3.0, 5.0, 4.0]], dtype=torch.float32)
 2my_softmax = Softmax(dim=1)
 3my_softmax_probs = my_softmax(logits)
 4
 5_, indices_logits = torch.sort(logits, descending=False)
 6_, indices_probs = torch.sort(my_softmax_probs, descending=False)
 7
 8pprint(indices_logits)
 9pprint(indices_probs)
10
11torch.testing.assert_close(
12    indices_logits, indices_probs, rtol=1.3e-6, atol=1e-5, msg="Softmax Ordering is not preserved?!"
13)
tensor([[1, 0, 2, 4, 3]])
tensor([[1, 0, 2, 4, 3]])

Indeed, if the logits is \(\mathbf{z} = \begin{bmatrix} 2.0 & 1.0 & 3.0 & 5.0 & 4.0 \end{bmatrix}\), with ordering of the elements \(z_2 < z_1 < z_3 < z_5 < z_4\) (indices 1, 0, 2, 4, 3), we show that the softmax function preserves the ordering of the elements in the output probability distribution.

Softmax Is Translation Invariance#

The softmax function showcases an important characteristic known as translation invariance. This means that if we translate each coordinate of the input vector \(\mathbf{z}\) by the same scalar value \(c\), the output of the softmax function remains unchanged. Mathematically, adding a constant vector \(\mathbf{c} = (c, \ldots, c)\) to the inputs \(\mathbf{z}\) results in \(\sigma(\mathbf{z} + \mathbf{c}) = \sigma(\mathbf{z})\), because this operation multiplies each exponent in the softmax function by the same factor \(e^c\), due to the property \(e^{z_i + c} = e^{z_i} \cdot e^c\). Consequently, the ratios of the exponents do not alter.

Given the input vector \(\mathbf{z} = \begin{bmatrix} z_1 & z_2 & \ldots & z_K \end{bmatrix}\) and a constant scalar \(c\), the translation invariance of the softmax function can be expressed as:

\[ \sigma(\mathbf{z} + \mathbf{c})_j = \frac{e^{z_j + c}}{\sum_{k=1}^{K} e^{z_k + c}} = \frac{e^{z_j} \cdot e^c}{\sum_{k=1}^{K} e^{z_k} \cdot e^c} = \sigma(\mathbf{z})_j \]

for \(j = 1, \ldots, K\), where \(K\) is the number of elements in vector \(\mathbf{z}\).

In code, we can demonstrate this property as follows:

1constant = 10
2translated_logits = logits + constant
3translated_probs = my_softmax(translated_logits)
4
5print("Original softmax  :", my_softmax_probs)
6print("Translated softmax:", translated_probs)
7
8torch.testing.assert_close(my_softmax_probs, translated_probs, rtol=1.3e-6, atol=1e-5, msg="Translation Invariance Violated!")
Original softmax  : tensor([[0.0317, 0.0117, 0.0861, 0.6364, 0.2341]])
Translated softmax: tensor([[0.0317, 0.0117, 0.0861, 0.6364, 0.2341]])

Numerical Instability of the Softmax Function#

Recall the softmax function as follows:

\[ \sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \quad \text{for } j = 1, \ldots, K \]

When elements of \(\mathbf{z}\) are very large (positive or negative), computing \(e^{z_j}\) can lead to numerical instability:

  • Overflow: If \(z_j\) is very large, \(e^{z_j}\) can exceed the range of floating-point numbers, leading to infinity (\(\infty\)).

  • Underflow: If \(z_j\) is very negative, \(e^{z_j}\) can become so small that it’s considered as zero in floating-point arithmetic.

Both situations can cause issues when computing the softmax, leading to inaccurate results or failures in computation due to divisions by infinity or zero.

To mitigate these issues, we leverage the invariance property of softmax to shifts in the input vector [Goodfellow et al., 2016]. Specifically, adding or subtracting the same scalar from every element of \(\mathbf{z}\) doesn’t change the output of the softmax function:

\[\sigma(\mathbf{z}) = \sigma(\mathbf{z} + c)\]

By choosing \(c = -\max_i z_i\), we subtract the maximum element of \(\mathbf{z}\) from every element. This operation ensures that the maximum value in the shifted vector is \(0\), which helps in avoiding overflow (since \(e^0 = 1\) is within range) and reduces the risk of underflow for other elements.

The numerically stable softmax function can thus be reformulated as:

\[ \sigma(\mathbf{z})_j = \frac{e^{z_j - \max_i z_i}}{\sum_{k=1}^K e^{z_k - \max_i z_i}} \quad \text{for } j = 1, \ldots, K \]

This reformulation achieves the following:

  • Maximal Value Normalization: By ensuring that the maximum exponentiated value is \(e^0 = 1\), we prevent overflow.

  • Relative Differences Preserved: The operation preserves the relative differences between the elements of \(\mathbf{z}\), ensuring the softmax output remains unchanged in terms of its distribution properties.

  • Improved Numerical Stability: The risk of underflow is also mitigated since subtracting a large value (\(\max_i z_i\)) from each \(z_j\) makes the values more manageable for computation without affecting the ratios that softmax aims to model.

To this end, a rough implementation of the numerically stable softmax function is shown below:

 1super_large_logits = torch.tensor([[1000.0, 2000.0, 3000.0, 4000.0, 5000.0]], dtype=torch.float32)
 2unstable_softmax = my_softmax(super_large_logits)
 3pprint(unstable_softmax)
 4
 5def stable_softmax(z: torch.Tensor) -> torch.Tensor:
 6    """
 7    Compute the numerically stable softmax function for a given input.
 8    """
 9    max_z = torch.max(z, dim=1, keepdim=True).values
10    numerator = torch.exp(z - max_z)
11    denominator = torch.sum(numerator, dim=1, keepdim=True)
12    g = numerator / denominator
13    return g
14
15stable_softmax_probs = stable_softmax(super_large_logits)
16pprint(stable_softmax_probs)
tensor([[nan, nan, nan, nan, nan]])
tensor([[0., 0., 0., 0., 1.]])

Softmax Is Not Invariant Under Scaling#

A function \(f: \mathbb{R}^D \rightarrow \mathbb{R}^D\) is said to be scale invariant if for any positive scalar \(c > 0\) and any vector \(\mathbf{x} \in \mathbb{R}^D\), the following condition holds:

\[ f(c\mathbf{x}) = f(\mathbf{x}) \]

This means that scaling the input vector \(\mathbf{x}\) by any positive factor \(c\) does not change the output of the function \(f\). Scale invariance implies that the function’s output depends only on the direction of the vector \(\mathbf{x}\) and not its magnitude.

In contrast, a function is not invariant under scaling if there exists at least one vector \(\mathbf{x} \in \mathbb{R}^D\) and one scalar \(c > 0\) such that \(f(c\mathbf{x}) \neq f(\mathbf{x})\).

The softmax function is not invariant under scaling. This is because the softmax output changes when the input vector \(\mathbf{z}\) is scaled by a positive scalar \(c\), as the exponential function magnifies differences in the scaled inputs.

The difference in softmax’s response to scaling can be attributed to the exponential operation applied to each element of the input vector before normalization. Since the exponential function is not linear, scaling the input vector \(\mathbf{z}\) by a scalar \(c\) does not merely scale the output by the same factor, but rather changes the relative weights of the output probabilities.

Consider the input vector \(\mathbf{z} = \begin{bmatrix} 10.0 & 20.0 & 30.0 \end{bmatrix}\), and its scaled version \(\tilde{\mathbf{z}} = c \cdot \mathbf{z} = \begin{bmatrix} 0.1 & 0.2 & 0.3 \end{bmatrix}\) where \(c = 0.01\). The softmax output for the larger input vector \(\mathbf{z}\) is approximately \(\begin{bmatrix} 2.0611e-09 & 4.5398e-05 & 9.9995e-01 \end{bmatrix}\), while the softmax output for the scaled input vector \(\tilde{\mathbf{z}}\) is approximately \(\begin{bmatrix} 0.3006 & 0.3322 & 0.3672 \end{bmatrix}\) which show that the softmax function is not invariant under scaling by comparing the softmax outputs for the original and scaled input vectors.

 1c = 0.01
 2input_large = torch.tensor([10, 20, 30], dtype=torch.float32)
 3input_small = c * input_large # torch.tensor([0.1, 0.2, 0.3], dtype=torch.float32)
 4
 5softmax_large = Softmax(dim=0)(input_large)
 6softmax_small = Softmax(dim=0)(input_small)
 7
 8print("Softmax of the larger input :", softmax_large)
 9print("Softmax of the smaller input:", softmax_small)
10
11plt.figure(figsize=(12, 6))
12
13plt.subplot(1, 2, 1)
14plt.bar(range(len(softmax_large)), softmax_large)
15plt.title('Softmax probabilities (original scale)')
16
17plt.subplot(1, 2, 2)
18plt.bar(range(len(softmax_small)), softmax_small)
19plt.title(f'Softmax probabilities (scale factor: {c})')
20
21plt.show()
Softmax of the larger input : tensor([2.0611e-09, 4.5398e-05, 9.9995e-01])
Softmax of the smaller input: tensor([0.3006, 0.3322, 0.3672])
../_images/72f686d3ae0103e48b6f3940df77622922ecd21daa6b30095f2a31d91c2350b5.svg

Another example:

 1c = 0.1
 2input_large = torch.tensor([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0], dtype=torch.float32)
 3input_small = c * input_large
 4
 5softmax_large = Softmax(dim=0)(input_large)
 6softmax_small = Softmax(dim=0)(input_small)
 7
 8plt.figure(figsize=(12, 6))
 9
10plt.subplot(1, 2, 1)
11plt.bar(range(len(softmax_large)), softmax_large)
12plt.title('Softmax probabilities (original scale)')
13
14plt.subplot(1, 2, 2)
15plt.bar(range(len(softmax_small)), softmax_small)
16plt.title(f'Softmax probabilities (scale factor: {c})')
17
18plt.show()
../_images/2c453cc57517b1ce3ed60e9dcec59fb66838232eaf46394a4ce2ca2d8f386f65.svg

Sharpening and Dampening the Softmax Distribution#

Continuing from our first example, the 3rd element of the softmax output has the largest weight (9.9995e-01), thereby suppressing the other elements. The sharp readers would have noticed that even though the 3rd element has the largest weight, it was at a whooping \(0.99995\), which is almost \(1\) - which means the rest of the elements are almost zero. As we shall briefly touch upon later, sampling from a multinomial distribution (number of trials is 1 because we are sampling one token at each time step) with the parameter \(\boldsymbol{\pi}\), where \(\boldsymbol{\pi}\) is the softmax output, almost surely will select the 3rd element because of the high probability. This is what we call a sharp softmax distribution.

On the other hand, if we scale the input vector by a factor \(c=0.01\), even though the ranking (order) is preserved, the softmax output is more uniformly distributed, with the weights of the elements more evenly spread out. We can see the 3rd element has a weight of \(0.3672\), which is still the maximum in the array, but the relative weights of the other elements are higher compared to the original softmax output. Similarly, sampling from a multinomial distribution with the parameter \(\boldsymbol{\pi}\) this time will be more diverse, because our \(\boldsymbol{\pi}\) is more uniform (at \(\begin{bmatrix} 0.3006 & 0.3322 & 0.3672 \end{bmatrix}\) and therefore the other two elements are likely to get selected as they are quite close to the max weight). This is what we call a dampened softmax distribution.

For people who has toyed around with the temperature parameter in the language model, we were told that when the temperature is high, the model is more random, and when the temperature is low, the model is more deterministic. This is because the temperature parameter effectively scales the logits by \(T\) (the temperature), where we can think of it as logits = logits / (T + epsilon). This is the same as scaling the logits by a factor \(c = 1/T\). When \(T\) is high, the softmax distribution is more uniform, and when \(T\) is low, the softmax distribution is more sharp and the highest weight displays an one-hot manner, where the rest of the weights are almost zero.

Temperature#

We start off this section with an extracted portion from the Wikipedia page that is relevant to our discussion on the temperature parameter in the softmax function.

The term “softmax” derives from the amplifying effects of the exponential on any maxima in the input vector. For example, the standard softmax of \((1,2,8)\) is approximately \((0.001,0.002,0.997)\), which amounts to assigning almost all of the total unit weight in the result to the position of the vector’s maximal element (of 8).

In general, instead of \(e\) a different base \(b>0\) can be used. If \(0<b<1\), smaller input components will result in larger output probabilities, and decreasing the value of \(b\) will create probability distributions that are more concentrated around the positions of the smallest input values. Conversely, as above, if \(b>1\) larger input components will result in larger output probabilities, and increasing the value of \(b\) will create probability distributions that are more concentrated around the positions of the largest input values.

Writing \(b=e^\beta\) or \(b=e^{-\beta[\mathbf{a}]}\) (for real \(\beta\)) yields the expressions:

\[ \sigma(\mathbf{z})_i=\frac{e^{\beta z_i}}{\sum_{j=1}^K e^{\beta z_j}} \text { or } \sigma(\mathbf{z})_i=\frac{e^{-\beta z_i}}{\sum_{j=1}^K e^{-\beta z_j}} \text { for } i=1, \ldots, K \]

The reciprocal of \(\beta\) is sometimes referred to as the temperature, \(T=1 / \beta\), with \(b=e^{1 / T}\). A higher temperature results in a more uniform output distribution (i.e. with higher entropy, and “more random”), while a lower temperature results in a sharper output distribution, with one value dominating.

Softmax - Wikipedia

More concretely, how temperature is implemented in large language models like GPT is by scaling the logits by the temperature parameter \(T\) before applying the softmax function.

logits = torch.randn((1, 3), dtype=torch.float32)
temperature = [0, 0.1, 0.5, 1, 2, 5, 10]

for t in temperature:
    scaled_logits = logits / t
    scaled_probs = my_softmax(scaled_logits)
    print(f"Temperature: {t}, Scaled Probs: {scaled_probs}")

So now we can define the input of the softmax function as \(\frac{z}{T}\) where \(T\) is the temperature parameter. Now our softmax would be like the below:

\[ \sigma\left(\frac{\mathbf{z}}{T}\right)_j = \frac{e^{z_j / T}}{\sum_{k=1}^K e^{z_k / T}} \]

which coincides with the definition from the Wikipedia page.

To conclude, the temperature modifies the “sharpness” of the probability distribution without altering the order of the probabilities, a desirable property:

  • At high temperatures (\(T > 1\)), the distribution becomes more uniform, but if \(z_a > z_b\), then \(\text{softmax}_T(z_a) > \text{softmax}_T(z_b)\) still holds. The probabilities become closer to each other, making the choice more “random” or “equally likely” among options, but the ranking remains the same.

  • At low temperatures (\(T < 1\)), the distribution becomes sharper, with a more pronounced difference between the higher and lower probabilities, amplifying the differences in likelihood as determined by the logits. The probability of the largest logit increases towards 1, while others decrease towards 0. Yet, the order of logits is preserved—higher logits translate to higher probabilities.

  • At \(T = 1\), the softmax function is the standard softmax function.

Sampling from the Softmax Distribution#

Below we show the effect of temperature on multinomial sampling from the softmax distribution. The experiements are repeated 1000 times for each temperature, and the distribution of sampled outcomes is visualized for each temperature.

Hide code cell source
 1import torch
 2import matplotlib.pyplot as plt
 3from collections import defaultdict
 4from typing import Dict, List, Tuple
 5import numpy as np
 6
 7
 8def demonstrate_multinomial_sampling_effect(
 9    logits: torch.Tensor, temperatures: List[float], num_experiments: int, epsilon: float = 1e-8
10) -> Tuple[Dict[float, np.ndarray], Dict[float, List[int]]]:
11    """
12    Demonstrates the effect of temperature on multinomial sampling from the softmax distribution.
13
14    Parameters
15    ----------
16    logits : torch.Tensor
17        The input logits vector.
18    temperatures : List[float]
19        A list of temperatures to apply to the softmax function.
20    num_experiments : int
21        The number of sampling experiments to run for each temperature.
22    epsilon : float, optional
23        A small value added to the temperature to prevent division by zero, by default 1e-8.
24
25    Returns
26    -------
27    Tuple[Dict[float, np.ndarray], Dict[float, List[int]]]
28        A tuple containing two dictionaries:
29        - The first maps each temperature to its corresponding softmax probabilities as a numpy array.
30        - The second maps each temperature to a list of sampled outcomes (indices of logits).
31
32    Examples
33    --------
34    >>> logits = torch.tensor([2.0, 1.0, 3.0, 5.0, 4.0], dtype=torch.float32)
35    >>> temperatures = [0.1, 1.0, 10.0]
36    >>> softmax_results, sampling_results = demonstrate_multinomial_sampling_effect(
37    ...     logits, temperatures, num_experiments=1000
38    ... )
39    """
40    softmax_results: Dict[float, np.ndarray] = {}
41    sampling_results: Dict[float, List[int]] = defaultdict(list)
42
43    for temperature in temperatures:
44        # Scale logits by temperature
45        scaled_logits = logits / (temperature + epsilon)
46        # Apply softmax to scaled logits
47        probs = torch.softmax(scaled_logits, dim=-1)
48        softmax_results[temperature] = probs.numpy()
49
50        for _ in range(num_experiments):
51            sample = torch.multinomial(probs, num_samples=1, replacement=True)
52            sampling_results[temperature].append(sample.item())
53
54    return softmax_results, dict(sampling_results)
55
56
57logits = torch.tensor([2.0, 1.0, 3.0, 5.0, 4.0], dtype=torch.float32)
58temperatures = [0.1, 0.5, 1.0, 10.0]
59
60num_experiments = 1000
61softmax_results, sampling_results = demonstrate_multinomial_sampling_effect(
62    logits, temperatures, num_experiments=num_experiments
63)
64
65outcome_counts = {
66    temp: np.bincount(outcomes, minlength=len(logits)) / num_experiments * 100
67    for temp, outcomes in sampling_results.items()
68}
69
70fig, axs = plt.subplots(len(temperatures), 1, figsize=(10, 5 * len(temperatures)))
71for ax, temperature in zip(axs, temperatures):
72    bars = ax.bar(range(len(logits)), outcome_counts[temperature], alpha=0.5, label=f"Temperature: {temperature}")
73    ax.legend()
74    ax.set_title(f"Distribution of sampled outcomes at temperature {temperature}")
75    ax.set_xlabel("Outcome")
76    ax.set_ylabel("Percentage (%)")
77
78    for bar in bars:
79        yval = bar.get_height()
80        ax.text(bar.get_x() + bar.get_width() / 2, yval + 0.01, f"{yval:.3f}%", ha="center", va="bottom")
81
82plt.tight_layout()
83plt.show()
../_images/23c20f541fc6362b0c572509b701a7af45d642e53020ceffafd60efabc93e251.svg

Softmax Is Smooth, Continuous and Differentiable#

The term “softmax” is a misnomer, as one would have thought it is a smooth max - a smooth approximation of the maximum function. However, the softmax function is not a smooth approximation of the maximum function, but rather a smooth approximation of the argmax function[1], where argmax is the function that outputs the index of the maximum element in a vector.

The argmax function is not continuous or differentiable, because it is a discrete function that jumps from one index to another as the input vector changes (i.e. not smooth). The softmax function, on the other hand, is a smooth approximation of the argmax function, as it outputs a probability distribution over the elements of the input vector, with the highest probability assigned to the maximum element - essentially a continuous and differentiable version of the argmax function (we call it a “softened” version of the argmax function) [Goodfellow et al., 2016]. Consequently, it is more appropriate to name the function “softargmax” instead of “softmax”.

To this end, it is worth mentioning that softmax gains its popularity in the deep learning space because it is smooth and normalizer over the input vector.

The smoothness assumption is a common one in the context of deep learning, because for when we say that the estimator function \(f_{\hat{\boldsymbol{\Theta}}}(\cdot)\) is smooth with respect to the parameter space \(\hat{\boldsymbol{\Theta}}\), it means that the estimator function \(f_{\hat{\boldsymbol{\Theta}}}(\cdot)\) is smooth with respect to the parameter space \(\hat{\boldsymbol{\Theta}}\) if the function is continuous and differentiable with respect to the parameter space \(\hat{\boldsymbol{\Theta}}\) up to a certain order (usually the first for SGD variants and second order for Newton).

What this implies is that the derivative of the function with respect to the parameter space \(\hat{\boldsymbol{\Theta}}\), denoted as \(\nabla_{\hat{\boldsymbol{\Theta}}} f_{\hat{\boldsymbol{\Theta}}}(\cdot)\) is continuous. Loosely, you can think of that a small perturbation in the parameter space \(\hat{\boldsymbol{\Theta}}\) will result in a small change in the output of the function \(f_{\hat{\boldsymbol{\Theta}}}(\cdot)\) - enabling gradient-based optimization algorithms to work effectively as if not, then taking a step in the direction of the gradient would not guarantee a decrease in the loss function, slowing down convergence.

Softmax Function via Exponential Family#

For a rigorous derivation of the softmax function - which will give rise to intuition on how sigmoid and softmax is derived from the exponential family of distributions - we refer the reader to chapter 2.4. The Exponential Family of Christopher Bishop’s Pattern Recognition and Machine Learning [Bishop, 2007] with reference here.

Gradient, Jacobian, and Hessian of Softmax#

Softmax as a Vector Function#

The softmax function \(\sigma\) represents a mapping from \(\mathbb{R}^K\) to \((0,1)^K\), where each output component \(\sigma(\mathbf{z})_j\) is a function of all input components \(z_1, z_2, \ldots, z_K\). Given its vector-to-vector nature, it does not make sense to talk about the derivative of softmax as a scalar-valued function. Instead, the analysis of its derivatives involves the following:

  1. Which component of the output vector we are interested in, and

  2. Which and how the input components affect that output component.

Derivatives of the Softmax Function#

As a result, the above considerations lands ourselves naturally into the realm of multivariable calculus, where we consider the partial derivatives of the softmax function - which measure the change in a specific output component in response to a small change in a specific input component. For the \(i\)-th output of softmax with respect to the \(j\)-th input, we denote the partial derivative as:

\[\frac{\partial \sigma(\mathbf{z})_i}{\partial z_j}\]

This notation precisely indicates the partial derivative of the \(i\)-th output with respect to the \(j\)-th input.

To this end, given the softmax function \(\sigma\) defined on a vector \(\mathbf{z} \in \mathbb{R}^K\), the goal is to compute the partial derivative of the \(i\)-th output of the softmax function with respect to the \(j\) th component of the input vector \(\mathbf{z}\).

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j}=\frac{\partial \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}}{\partial z_j} \]

The Jacobian Matrix of Softmax#

The formal representation of the derivatives for a vector function like softmax is the Jacobian matrix, \(\mathbf{J}_{\sigma}\), which contains all the first-order partial derivatives. For the softmax function, the Jacobian matrix is defined as:

\[\begin{split} \mathbf{J}_{\sigma}(\mathbf{z}) = \begin{bmatrix} \frac{\partial \sigma(\mathbf{z})_1}{\partial z_1} & \cdots & \frac{\partial \sigma(\mathbf{z})_1}{\partial z_K} \\ \vdots & \ddots & \vdots \\ \frac{\partial \sigma(\mathbf{z})_K}{\partial z_1} & \cdots & \frac{\partial \sigma(\mathbf{z})_K}{\partial z_K} \end{bmatrix} \end{split}\]

This matrix provides a full description of how the softmax function responds to changes in its inputs, indicating the sensitivity of each output component to each input component.

While “gradient” is a term often used in machine learning to describe the derivative of a function, it strictly applies to scalar-valued functions. For vector-valued functions like softmax, describing the comprehensive derivative structure requires the Jacobian matrix, \(\mathbf{J}_{\sigma}\), not a gradient. Therefore, in discussions involving softmax and its impact on learning in neural networks, it is more accurate to refer to the Jacobian matrix when considering its derivatives.

Derivative of the Softmax Function#

To compute the partial derivative of the \(i\)-th component of the softmax output with respect to the \(j\)-th component of the input vector \(\mathbf{z}\), we express this as:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j} = \frac{\partial \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}}{\partial z_j} \]

Since there is division involved, we’ll apply the quotient rule of derivatives, which for a function \(f(x) = \frac{g(x)}{h(x)}\) is given by:

\[f'(x) = \frac{g'(x)h(x) - h'(x)g(x)}{[h(x)]^2}\]

For our specific case, the functions \(g\) and \(h\) with respect to the \(i\)-th component are defined as:

\[\begin{split} \begin{aligned} g_i &= e^{z_i} \\ h &= \sum_{k=1}^K e^{z_k} \end{aligned} \end{split}\]

Some Derivatives Tips So Far…

Firstly, the derivative of \(h\) with respect to any \(z_j\) is always \(e^{z_j}\). Why? The function \(h\) represents the sum of the exponentials of all components of the input vector \(\mathbf{z}\). Mathematically, \(h\) is independent of the index \(i\) and is a function of all \(z_k\) in the vector \(\mathbf{z}\). When we differentiate \(h\) with respect to \(z_j\), we treat \(z_j\) as the variable and all other \(z_k\) (\(k \neq j\))$ as constants.

The derivative of \(e^{z_k}\) with respect to \(z_j\) is \(e^{z_j}\) when \(k=j\) because of the basic derivative rule \(d\left(e^x\right) / d x=e^x\), and it’s 0 for all \(k \neq j\) because the derivative of a constant is 0 . Therefore, when you sum up all these derivatives \(\left(e^{z_j}\right.\) for the term where \(k=j\) and 0 for all others), the derivative of the sum \(h_k\) with respect to \(z_j\) is simply \(e^{z_j}\).

\[ \frac{\partial h}{\partial z_j}=\frac{\partial}{\partial z_j}\left(\sum_{k=1}^K e^{z_k}\right)=e^{z_j} \]

Secondly, the derivative of \(g_i\) with respect to \(z_j\) is \(e^{z_j}\) if \(i=j\), and 0 otherwise. Why? The function \(g_i\) is defined as \(e^{z_i}\), which only involves a single component \(z_i\) of the input vector \(\mathbf{z}\). The derivative of \(g_i\) with respect to \(z_j\) depends on whether \(i\) equals \(j\)

  • If \(i=j\), then \(g_i=e^{z_j}\), and the derivative of \(e^{z_j}\) with respect to \(z_j\) is \(e^{z_j}\) itself, following the basic exponential derivative rule.

  • If \(i \neq j, z_i\) and \(z_j\) are considered independent variables. Since \(g_i\) does not involve \(z_j\) in this case, the derivative is 0, reflecting the principle that the derivative of a constant (or a term that does not involve the variable of differentiation) is 0.

\[\begin{split} \frac{\partial g_i}{\partial z_j}=\frac{\partial e^{z_i}}{\partial z_j}= \begin{cases}e^{z_j} & \text { if } i=j \\ 0 & \text { if } i \neq j\end{cases} \end{split}\]

Case 1: \(i = j\)#

We seek to find the derivative of \(\sigma(\mathbf{z})_i\) with respect to \(z_j\) and since \(i=j\), we seek to find the derivative of \(\sigma(\mathbf{z})_i\) with respect to \(z_i\). Note it doesn’t matter if we use \(i\) or \(j\) as the index because it applies to any index.

Given the softmax function’s \(i\)-th component:

\[ \sigma(\mathbf{z})_i=\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} \]

To find the derivative of \(\sigma(\mathbf{z})_i\) with respect to \(z_i\), we apply the quotient rule:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_i}=\frac{\left(e^{z_i} \cdot \sum_{k=1}^K e^{z_k}\right)-\left(e^{z_i} \cdot e^{z_i}\right)}{\left(\sum_{k=1}^K e^{z_k}\right)^2} \]

Simplifying this expression, where the denominator is the square of the sum of all exponentiated components of \(\mathbf{z}\), and recognizing that the numerator simplifies to the difference between the product of \(e^{z_i}\) and the sum of all exponentiated components minus the square of \(e^{z_i}\) :

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_i}=\frac{e^{z_i} \cdot \sum_{k=1}^K e^{z_k}-e^{2 z_i}}{\left(\sum_{k=1}^K e^{z_k}\right)^2} \]

Upon substituting the definition of \(\sigma(\mathbf{z})_i\) into this equation, we observe:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_i}=\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}-\left(\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}\right)^2 \]

Given that \(\sigma(\mathbf{z})_i=\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}\), it follows that the derivative simplifies to:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_i}=\sigma(\mathbf{z})_i-\sigma(\mathbf{z})_i^2 \]

Thus, expressing it in a more compact form:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_i} = \sigma(\mathbf{z})_i (1 - \sigma(\mathbf{z})_i) \]

Case 2: \(i \neq j\)#

We then consider the scenario where we aim to compute the derivative of the \(i\)-th component of the softmax output with respect to a different component \(z_j\), where \(i \neq j\). This examines how changes in one input logit affect the probability associated with a different class.

Given the \(i\)-th component of the softmax function:

\[ \sigma(\mathbf{z})_i=\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} \]

To compute \(\frac{\partial \sigma(\mathbf{z})_i}{\partial z_j}\) for \(i \neq j\), we again apply the quotient rule, but with an understanding that the derivative of the numerator \(e^{z_i}\) with respect to \(z_j\) is 0 , since \(z_i\) does not change when \(z_j\) changes:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j}=\frac{0-e^{z_i} \cdot e^{z_j}}{\left(\sum_{k=1}^K e^{z_k}\right)^2} \]

This expression can be simplified as:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j}=-\frac{e^{z_i} \cdot e^{z_j}}{\left(\sum_{k=1}^K e^{z_k}\right)^2} \]

Substituting the definition of \(\sigma(\mathbf{z})_i\) into this equation, we get:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j}=-\frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \cdot \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} \]

Given \(\sigma(\mathbf{z})_i=\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}\) and similarly, the term \(\frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}\) can be recognized as \(\sigma(\mathbf{z})_j\) , the expression simplifies to:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j}=-\sigma(\mathbf{z})_i \sigma(\mathbf{z})_j \]

This formulation shows that the rate of change of the probability of one class \((i)\) with respect to the logit of a different class \((j)\) is negative, indicating that as the logit \(z_j\) increases, the probability of class \(i\) decreases, assuming \(i \neq j\).

Thus, for \(i \neq j\), the derivative of the \(i\)-th softmax component with respect to \(z_j\) is:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j}=-\sigma(\mathbf{z})_i \sigma(\mathbf{z})_j \]

Kronecker Delta#

The partial derivative of the softmax output with respect to its input, considering both cases, can be written in cases:

\[\begin{split} \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j} = \begin{cases} \sigma(\mathbf{z})_i (1 - \sigma(\mathbf{z})_i) & \text{if } i=j \\ -\sigma(\mathbf{z})_i \sigma(\mathbf{z})_j & \text{if } i \neq j \end{cases} \end{split}\]

And it is common to represent the above case either as indicator functions or the Kronecker delta \(\delta_{ij}\),

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j} = \sigma(\mathbf{z})_i (\delta_{ij} - \sigma(\mathbf{z})_j) \]

where

\[\begin{split} \delta_{ij} = \begin{cases} 1 & \text{if } i=j \\ 0 & \text{if } i \neq j \end{cases} \end{split}\]

And if we were to use indicator functions, the above equation can be written as:

\[ \frac{\partial \sigma(\mathbf{z})_i}{\partial z_j} = \sigma(\mathbf{z})_i \left(\mathbb{1}_{i=j} - \sigma(\mathbf{z})_j\right) \]

where \(\mathbb{1}_{i=j}\) is the indicator function that evaluates to 1 if \(i=j\) and 0 otherwise.

Representing Derivative of Softmax as a Jacobian Matrix#

Consider the softmax output vector \(S\) for an input vector \(\mathbf{z} = \begin{bmatrix} z_1 & z_2 & \ldots & z_K \end{bmatrix}^\top\), where \(K\) is the number of classes or dimensions of the softmax output. The softmax function is defined as:

\[ S_i = \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} \quad \text{for } i=1, \ldots, K \]

Definition 59 (Softmax Output as Vector)

The softmax output vector \(S\) can be explicitly represented as:

\[\begin{split} S = \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{bmatrix} = \begin{bmatrix} \frac{e^{z_1}}{\sum_{k=1}^K e^{z_k}} \\ \frac{e^{z_2}}{\sum_{k=1}^K e^{z_k}} \\ \vdots \\ \frac{e^{z_K}}{\sum_{k=1}^K e^{z_k}} \end{bmatrix} \end{split}\]

Definition 60 (Jacobian Matrix of Softmax Function)

The Jacobian matrix \(J_S\) of the softmax function captures the partial derivatives of each component of \(S\) with respect to each component of \(\mathbf{z}\). Specifically, the element at the \(i\)-th row and \(j\)-th column of \(J_S\), denoted as \((J_S)_{ij}\), is the partial derivative of \(S_i\) with respect to \(z_j\):

\[(J_S)_{ij} = \frac{\partial S_i}{\partial z_j}\]

For \(i = j\), we have:

\[\frac{\partial S_i}{\partial z_i} = S_i (1 - S_i)\]

For \(i \neq j\), we have:

\[\frac{\partial S_i}{\partial z_j} = -S_i S_j\]

Putting it all together, the Jacobian matrix \(J_S\) can be written as:

\[\begin{split} J_S = \begin{bmatrix} S_1 (1 - S_1) & -S_1 S_2 & \cdots & -S_1 S_K \\ -S_2 S_1 & S_2 (1 - S_2) & \cdots & -S_2 S_K \\ \vdots & \vdots & \ddots & \vdots \\ -S_K S_1 & -S_K S_2 & \cdots & S_K (1 - S_K) \end{bmatrix} \end{split}\]

Definition 61 (Matrix Formulation)

We can express \(J_S\) using matrix operations for compactness:

  • \(\text{diag}(S)\) is a diagonal matrix where each diagonal element is \(S_i\), the softmax output for class \(i\).

  • \(S S^T\) is the outer product of \(S\) with itself, creating a matrix where each element \((i, j)\) is \(S_i S_j\).

Thus, the Jacobian matrix \(J_S\) can be expressed as:

\[ J_S = \text{diag}(S) - S S^T \]

Implementation of the Jacobian Matrix#

 1import torch
 2from rich.pretty import pprint
 3from torch import nn
 4
 5class Softmax:
 6    def __init__(self, dim: int = 1):
 7        self.dim = dim
 8
 9    def __call__(self, z: torch.Tensor) -> torch.Tensor:
10        max_z = torch.max(z, dim=self.dim, keepdim=True).values
11        numerator = torch.exp(z - max_z)
12        denominator = torch.sum(numerator, dim=self.dim, keepdim=True)
13        g = numerator / denominator
14        return g
15
16    def gradient(self, z: torch.Tensor) -> torch.Tensor:
17        S = self.__call__(z)
18        diag_S = torch.diag_embed(S)
19        outer_S = torch.matmul(S.unsqueeze(2), S.unsqueeze(1))
20        gradient = diag_S - outer_S
21        return gradient.squeeze()
22
23
24z = torch.randn((2, 5), requires_grad=True, dtype=torch.float32)
25pprint(z)
26
27pytorch_softmax = nn.Softmax(dim=1)
28pytorch_softmax_outputs = pytorch_softmax(z)
29pprint(pytorch_softmax_outputs)
30
31my_softmax = Softmax(dim=1)
32my_softmax_outputs = my_softmax(z)
33pprint(my_softmax_outputs)
34
35torch.testing.assert_close(
36    pytorch_softmax_outputs, my_softmax_outputs, rtol=1.3e-6, atol=1e-5, msg="Softmax function outputs do not match."
37)
38
39B, K = z.size()
40
41pytorch_jacobian = []
42for k in range(K):
43    grad = torch.autograd.grad(pytorch_softmax_outputs[:, k].sum(), z, retain_graph=True)[0]
44    pytorch_jacobian.append(grad)
45
46pytorch_jacobian = torch.stack(pytorch_jacobian, dim=1)
47pprint(pytorch_jacobian)
48
49my_jacobian = my_softmax.gradient(z)
50pprint(my_jacobian)
51
52torch.testing.assert_close(
53    pytorch_jacobian, my_jacobian, rtol=1.3e-6, atol=1e-5, msg="Jacobian matrices do not match."
54)
tensor([[-0.1674, -0.2757, -0.4164, -0.7480,  1.7970],
│   │   [ 1.3310, -0.9053,  0.3516,  1.0953, -0.3567]], requires_grad=True)
tensor([[0.0965, 0.0866, 0.0752, 0.0540, 0.6878],
│   │   [0.4069, 0.0435, 0.1528, 0.3215, 0.0753]], grad_fn=<SoftmaxBackward0>)
tensor([[0.0965, 0.0866, 0.0752, 0.0540, 0.6878],
│   │   [0.4069, 0.0435, 0.1528, 0.3215, 0.0753]], grad_fn=<DivBackward0>)
tensor([[[ 0.0872, -0.0083, -0.0073, -0.0052, -0.0663],
│   │    [-0.0083,  0.0791, -0.0065, -0.0047, -0.0595],
│   │    [-0.0073, -0.0065,  0.0695, -0.0041, -0.0517],
│   │    [-0.0052, -0.0047, -0.0041,  0.0511, -0.0371],
│   │    [-0.0663, -0.0595, -0.0517, -0.0371,  0.2147]],
│   │   
│   │   [[ 0.2413, -0.0177, -0.0622, -0.1308, -0.0306],
│   │    [-0.0177,  0.0416, -0.0066, -0.0140, -0.0033],
│   │    [-0.0622, -0.0066,  0.1295, -0.0491, -0.0115],
│   │    [-0.1308, -0.0140, -0.0491,  0.2181, -0.0242],
│   │    [-0.0306, -0.0033, -0.0115, -0.0242,  0.0696]]])
tensor([[[ 0.0872, -0.0083, -0.0073, -0.0052, -0.0663],
│   │    [-0.0083,  0.0791, -0.0065, -0.0047, -0.0595],
│   │    [-0.0073, -0.0065,  0.0695, -0.0041, -0.0517],
│   │    [-0.0052, -0.0047, -0.0041,  0.0511, -0.0371],
│   │    [-0.0663, -0.0595, -0.0517, -0.0371,  0.2147]],
│   │   
│   │   [[ 0.2413, -0.0177, -0.0622, -0.1308, -0.0306],
│   │    [-0.0177,  0.0416, -0.0066, -0.0140, -0.0033],
│   │    [-0.0622, -0.0066,  0.1295, -0.0491, -0.0115],
│   │    [-0.1308, -0.0140, -0.0491,  0.2181, -0.0242],
│   │    [-0.0306, -0.0033, -0.0115, -0.0242,  0.0696]]],
grad_fn=<SqueezeBackward0>)

Derivative Of Softmax With Respect To Weight Matrix#

In the context of neural networks, the softmax function is often used in the output layer to compute the probabilities of each class. The output of the linear transformation can be represented as:

\[ \mathbf{z} = \boldsymbol{\theta}^{\top} \mathbf{x} + \mathbf{b} \]

where \(\boldsymbol{\theta}\) is the weight matrix, \(\mathbf{x}\) is the input vector, and \(\mathbf{b}\) is the bias vector.

Our predicted softmax output is expressed as:

\[ \hat{\mathbf{y}} = \sigma(\mathbf{z}) \quad \textrm{where}\quad \hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \]

For a classification task, a common choice is the Cross-Entropy Loss, defined between the predicted probability distribution \(\hat{\mathbf{y}}\) and the true distribution \(\mathbf{y}\). If \(\mathbf{y}\) is one-hot encoded, where \(y_i = 1\) for the correct class and \(y_i = 0\) for all others, the Cross-Entropy Loss \(\mathcal{L}\) over \(K\) classes can be defined as:

\[ \mathcal{L}\left(\mathbf{y}, \hat{\mathbf{y}}, \boldsymbol{\hat{\theta}}\right) = -\sum_{j=1}^{K} y_j \log(\hat{y}_j) \]

where we make explicit the dependence on the parameters \(\boldsymbol{\hat{\theta}}\) of the model because ultimately, the loss function is a function of the model’s parameters.

Step 1: Derivative of \(\mathcal{L}\) with respect to \(\mathbf{z}\)#

Given:

\[\mathcal{L} = -\sum_{j=1}^{K} y_j \log(\hat{y}_j)\]

And the softmax output:

\[\hat{y}_j = \sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}\]

The derivative of the Cross-Entropy Loss with respect to each logit \(z_j\) is:

\[\frac{\partial \mathcal{L}}{\partial z_j} = \hat{y}_j - y_j\]

Note that \(\frac{\partial \mathcal{L}}{\partial z_j}\) can be expressed as chain rule:

\[\begin{split} \begin{aligned} \frac{\partial \mathcal{L}}{\partial z_j} &= \frac{\partial \mathcal{L}}{\partial \hat{y}_j} \cdot \frac{\partial \hat{y}_j}{\partial z_j} \\ &= \frac{\partial \mathcal{L}}{\partial \sigma(\mathbf{z})_j} \cdot \frac{\partial \sigma(\mathbf{z})_j}{\partial z_j} \\ \end{aligned} \end{split}\]

Step 2: Derivative of \(\mathbf{z}\) with respect to \(\boldsymbol{\theta}\)#

The linear transformation that produces the logits is:

\[\mathbf{z} = \boldsymbol{\theta}^{\top} \mathbf{x} + \mathbf{b}\]

For a specific element \(z_j\) and a weight \(\theta_{ij}\) (the weight connecting input \(i\) to output class \(j\)), the derivative is:

\[\frac{\partial z_j}{\partial \theta_{ij}} = x_i\]

Step 3: Derivative of \(\mathcal{L}\) with respect to \(\boldsymbol{\theta}\)#

Applying the chain rule to connect the derivative of the loss with respect to \(\theta_{ij}\) through the logits:

\[ \frac{\partial \mathcal{L}}{\partial \theta_{ij}} = \frac{\partial \mathcal{L}}{\partial z_j} \cdot \frac{\partial z_j}{\partial \theta_{ij}} = (\hat{y}_j - y_j) \cdot x_i \]

The full gradient with respect to \(\boldsymbol{\theta}\) is constructed by aggregating these individual derivatives:

\[\begin{split} \begin{aligned} \frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} &= \left(\hat{\mathbf{y}} - \mathbf{y}\right) \mathbf{x}^{\top} \\ \end{aligned} \end{split}\]

Hessian Matrix#

The Hessian matrix represents the second-order partial derivatives of the softmax function’s components with respect to the components of the input vector \(\mathbf{z}\). The Hessian gives us insight into the curvature of the softmax function, which is particularly useful for understanding the optimization landscape. In fact, the Hessian matrix is a key component in monitoring the loss function \(\mathcal{L}\) where the first order partial derivatives measures the rate of change of the loss function with respect to the parameters, and the second order partial derivatives measures the curvature of the loss function - which says about convexity/concavity, saddle points, minima and maxima etc. See 12.1. Optimization and Deep Learning - Dive Into Deep Learning for more insights.

Given the softmax function \(\sigma(\mathbf{z})_i\) for component \(i\), defined as:

\[ \sigma(\mathbf{z})_i=\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} \]

Hessian Matrix for Softmax The Hessian matrix \(\left(H_\sigma\right)\) of the softmax function with respect to the input vector \(\mathbf{z}\) is a \(K \times K\) matrix where each element \((i, j)\) is the second-order partial derivative of the softmax output for class \(i\) with respect to the inputs \(z_i\) and \(z_j\), denoted as:

\[ \left(H_\sigma\right)_{i j}=\frac{\partial^2 \sigma(\mathbf{z})}{\partial z_i \partial z_j} \]

References and Further Readings#

Citations#

  • [1] D. A. Roberts, S. Yaida, B. Hanin, “Chapter 6.2.1 Bayesian Model Fitting” in “The Principles of Deep Learning Theory”, arXiv preprint arXiv:2106.10165, [Submitted on 18 Jun 2021 (v1), last revised 24 Aug 2021 (this version, v2)].

  • [2] I. Goodfellow, Y. Bengio, A. Courville, Chapter 6.2.2.3 Softmax Units for Multinoulli Output Distributions in “Deep Learning”, MIT Press, 2016. pp. 180-184

  • [3] C. M. Bishop, Chapter 4. Linear Models for Classification in Pattern Recognition and Machine Learning. New York: Springer, 2006.

  • [4] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, “Chapter 3.4. Softmax Regression” in Dive into Deep Learning, Cambridge University Press, 2023.

  • [5] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, “Chapter 12.1. Optimization and Deep Learning” in Dive into Deep Learning, Cambridge University Press, 2023.