Cross Entropy
1. Review of Entropy
Entropy is usually used to represent the average amount of self-information of a random variable. It is computed as follows:
$$H(X) = E\left[\log\frac{1}{p_i}\right] = -\sum_{i=1}^{n}p_i\log p_i$$
For example, suppose that we have a probability distribution X = [0.7, 0.3]. Then we have
$$H(X) = -0.7\log_2 0.7 - 0.3\log_2 0.3 \approx 0.88 \text{ bit}$$
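As a quick numerical check, here is a minimal NumPy sketch (assuming base-2 logarithms, since the result above is given in bits):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H(X) = -sum_i p_i * log2(p_i)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.7, 0.3]))  # ~0.8813
```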
Moreover, it is easy to see that $$0\le H(X) \le \log n$$ always holds.
The left inequality is immediate: since $$0\le p_i\le 1$$, every term $$-p_i\log p_i$$ is non-negative; equality holds if and only if some $$p_i=1$$.
As for the right-hand side, take $$n=3$$ as an example:
$$H(X) = -p_1\log p_1 - p_2\log p_2 - (1-p_1-p_2)\log(1-p_1-p_2)$$
Setting the partial derivatives to zero,
$$\frac{\partial H(X)}{\partial p_1} = \log\left(\frac{1-p_1-p_2}{p_1}\right)=0,\qquad \frac{\partial H(X)}{\partial p_2} = \log\left(\frac{1-p_1-p_2}{p_2}\right)=0$$
then we have,
$$p_1=p_2=1-p_1-p_2=\frac{1}{3}$$
and finally we get the maximum,
$$H(X)_{\max} = \log 3$$
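A short numerical check (a sketch rather than part of the derivation, again assuming base-2 logarithms) confirming that the uniform distribution attains this bound for $$n=3$$:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p)))

n = 3
print(entropy(np.full(n, 1 / n)))   # ~1.585 = log2(3), the maximum
print(np.log2(n))                   # the upper bound log2(n)
print(entropy([0.7, 0.2, 0.1]))     # ~1.157, strictly below the bound
```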
2. KL Divergence & Cross Entropy
KL divergence is used to measure how much one distribution differs from another:
$$D(p\|q) = \sum_{i=1}^{n}p_i\log\left(\frac{p_i}{q_i}\right)$$
From the formula we can see that in general $$D(p\|q) \ne D(q\|p)$$. Conventionally, $$p$$ denotes the true distribution and $$q$$ the predicted distribution.
As an example, let $$p = [1, 0, 0]$$ and $$q = [0.7, 0.2, 0.1]$$; then
$$D(p\|q) = 1\cdot\log_2\left(\frac{1}{0.7}\right) \approx 0.51 \text{ bit}$$
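A minimal sketch of this computation in NumPy (terms with $$p_i = 0$$ are treated as contributing zero, and base 2 is assumed since the result is quoted in bits):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log2(p_i / q_i), with 0 * log(0/q) treated as 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

print(kl_divergence([1, 0, 0], [0.7, 0.2, 0.1]))  # ~0.515 bit
```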
The relationship between KL divergence, cross entropy, and entropy is: $$D(p\|q) = H(p,q) - H(p)$$
Therefore, the formula of Cross Entropy is
$$H(p, q) = -\sum_{i=1}^{n}p_i\log q_i$$
It, too, measures the difference between two distributions. Since $$H(p)$$ is fixed by the true distribution, minimizing the cross entropy is equivalent to minimizing the KL divergence.
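With the same example as above, a short sketch verifying the decomposition (the helper name `cross_entropy` is just an illustrative choice):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i), with 0 * log(q) treated as 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p, q = [1, 0, 0], [0.7, 0.2, 0.1]
# For a one-hot p, H(p) = 0, so D(p||q) = H(p, q) - H(p) = H(p, q) here.
print(cross_entropy(p, q))  # ~0.515, matching the KL divergence above
```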
3. Softmax & Logistic Regression
For classification problems, suppose that we have an output from some neural network
$$X = [x_1, x_2, \dots, x_n]$$
then we can obtain the output probabilities by applying softmax:
$$Y = \left[\frac{e^{x_1}}{\sum_{i=1}^{n}e^{x_i}}, \frac{e^{x_2}}{\sum_{i=1}^{n}e^{x_i}}, \dots, \frac{e^{x_n}}{\sum_{i=1}^{n}e^{x_i}}\right] = [y_1, y_2, \dots, y_n] = \frac{\exp(X)}{\mathbf{1}^T\exp(X)}$$
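A minimal NumPy sketch of this softmax; subtracting the maximum before exponentiating is an added numerical-stability detail, not part of the formula above:

```python
import numpy as np

def softmax(x):
    """Y_i = exp(x_i) / sum_j exp(x_j), computed stably."""
    x = np.asarray(x, dtype=float)
    z = np.exp(x - np.max(x))        # shift by the max to avoid overflow
    return z / np.sum(z)

y = softmax([2.0, 1.0, 0.1])
print(y)         # approx [0.659, 0.242, 0.099]
print(y.sum())   # 1.0
```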
If the logits come from a linear layer, $$X = Wx$$, and we use the cross entropy loss with label vector $$\hat Y$$, then
$$loss = -\hat Y^T\log(Y) = -\hat Y^T\left(Wx - \mathbf{1}\log(\mathbf{1}^T\exp(Wx))\right) = -\hat Y^TWx + \log(\mathbf{1}^T\exp(Wx))$$
where $$\hat Y^T\mathbf{1} = 1$$ (the label probabilities sum to one). Since, for an element-wise function $$\sigma$$,
$$d\sigma(X)=\sigma'(X)\odot dX$$
and
$$\operatorname{tr}(A^T(B\odot C)) = \operatorname{tr}((A\odot B)^TC)$$
we have
$$d\,loss = -\hat Y^T\,dW\,x + \frac{\mathbf{1}^T\left(\exp(Wx)\odot(dW\,x)\right)}{\mathbf{1}^T\exp(Wx)} = -\hat Y^T\,dW\,x + \frac{\exp(Wx)^T\,dW\,x}{\mathbf{1}^T\exp(Wx)}$$
$$= \operatorname{tr}\left(-\hat Y^T\,dW\,x+\operatorname{softmax}(Wx)^T\,dW\,x\right)=\operatorname{tr}\left(x\left(\operatorname{softmax}(Wx)-\hat Y\right)^T\,dW\right)$$
thus
$$\frac{\partial loss}{\partial W} = \left(\operatorname{softmax}(Wx)-\hat Y\right)x^T$$
or equivalently, with $$a=Wx$$,
$$\frac{\partial loss}{\partial a} = \operatorname{softmax}(a)-\hat Y$$
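Below is a sketch that checks the closed-form gradient $$(\operatorname{softmax}(Wx)-\hat Y)x^T$$ against a finite-difference estimate; the shapes, random data, and one-hot label are illustrative assumptions rather than anything fixed by the derivation:

```python
import numpy as np

def loss_fn(W, x, y_hat):
    """Cross entropy loss: -y_hat^T (Wx) + log(1^T exp(Wx))."""
    a = W @ x
    return float(-y_hat @ a + np.log(np.sum(np.exp(a))))

rng = np.random.default_rng(0)
n_classes, n_features = 4, 3                      # assumed sizes, for illustration
W = rng.normal(size=(n_classes, n_features))
x = rng.normal(size=n_features)
y_hat = np.zeros(n_classes); y_hat[1] = 1.0       # assumed one-hot label

# closed-form gradient: (softmax(Wx) - y_hat) x^T
a = W @ x
p = np.exp(a - a.max()); p /= p.sum()
grad = np.outer(p - y_hat, x)

# finite-difference check on one entry of W
eps = 1e-6
W_pert = W.copy(); W_pert[2, 1] += eps
numeric = (loss_fn(W_pert, x, y_hat) - loss_fn(W, x, y_hat)) / eps
print(grad[2, 1], numeric)   # the two values should agree closely
```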
On the other hand, writing $$S_i = y_i$$ for the softmax outputs, the element-wise Jacobian of the softmax itself is
$$D_jS_i=\frac{\partial y_i}{\partial x_j} = \begin{cases}S_i(1-S_j) & i=j\\ -S_iS_j & i\ne j\end{cases}$$
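A short sketch building this Jacobian in matrix form as $$\operatorname{diag}(S) - SS^T$$ and comparing it with the case-by-case formula above:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([2.0, 1.0, 0.1])
S = softmax(x)

# matrix form: J[i, j] = S_i * (delta_ij - S_j) = diag(S) - S S^T
J = np.diag(S) - np.outer(S, S)

# case-by-case formula from the text
J_cases = np.array([[S[i] * (1 - S[j]) if i == j else -S[i] * S[j]
                     for j in range(len(S))] for i in range(len(S))])

print(np.allclose(J, J_cases))  # True
```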