begin quote from: Cross Validated (the statistics Stack Exchange).
Both the cross-entropy and the KL divergence are tools to measure the distance between two probability distributions, but what is the difference between them?
Moreover, it turns out that the minimization of KL divergence is equivalent to the minimization of cross-entropy.
I want to understand them intuitively.
4 Answers
You will need some conditions to claim the equivalence between minimizing cross-entropy and minimizing KL divergence. I will put your question in the context of classification problems that use cross-entropy as the loss function.
Let us first recall that entropy is used to measure the uncertainty of a system, which is defined as
$$S(v) = -\sum_i p(v_i)\log p(v_i), \tag{1}$$
for $p(v_i)$ the probabilities of the different states $v_i$ of the system. From an information-theory point of view, $S(v)$ is the amount of information needed to remove the uncertainty.

For instance, the event $A$ = *I will die within 200 years* is almost certain (we may solve the aging problem, hence the word *almost*), therefore it has low uncertainty: it requires only the information that *the aging problem cannot be solved* to make it certain. However, the event $B$ = *I will die within 50 years* is more uncertain than event $A$, thus it needs more information to remove its uncertainty. Here, entropy can be used to quantify the uncertainty of the distribution *When will I die?*, which can be regarded as the expectation of the uncertainties of individual events like $A$ and $B$.
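As a quick numerical illustration (a minimal NumPy sketch; the two example distributions are made up, not part of the answer), an almost-certain distribution has low entropy while a more spread-out one has higher entropy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy S(p) = -sum_i p_i * log(p_i), ignoring zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -np.sum(p * np.log(p))

almost_certain  = [0.99, 0.01]        # little information needed to remove its uncertainty
quite_uncertain = [0.5, 0.3, 0.2]     # more information needed

print(entropy(almost_certain))        # ~0.056 nats
print(entropy(quite_uncertain))       # ~1.03 nats
```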
Now look at the definition of the KL divergence between distributions $A$ and $B$:
$$D_{KL}(A \parallel B) = \sum_i p_A(v_i)\log p_A(v_i) - \sum_i p_A(v_i)\log p_B(v_i), \tag{2}$$
where the first term on the right-hand side is the negative entropy $-S_A$ of distribution $A$, and the second term can be interpreted as the expectation of $-\log p_B$ under distribution $A$. So $D_{KL}$ describes how different $B$ is from $A$, from the perspective of $A$.

To relate cross-entropy to entropy and KL divergence, we formalize the cross-entropy in terms of distributions $A$ and $B$ as
$$H(A, B) = -\sum_i p_A(v_i)\log p_B(v_i). \tag{3}$$
From the definitions, we can easily see that
$$H(A, B) = D_{KL}(A \parallel B) + S_A. \tag{4}$$
If $S_A$ is a constant, then minimizing $H(A, B)$ is equivalent to minimizing $D_{KL}(A \parallel B)$.

A further question follows naturally: how can the entropy be a constant? In a machine learning task, we start with a dataset (denoted $P(D)$) which represents the problem to be solved, and the learning purpose is to make the model's estimated distribution (denoted $P(model)$) as close as possible to the true distribution of the problem (denoted $P(truth)$). $P(truth)$ is unknown and is represented by $P(D)$. Therefore, in an ideal world, we expect
$$P(model) \approx P(D) \approx P(truth)$$
and we minimize $D_{KL}(P(D) \parallel P(model))$. Luckily, in practice $D$ is given, which means its entropy $S(D)$ is fixed as a constant.
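To make identity (4) concrete, here is a small NumPy sketch (the two distributions are invented for illustration; think of them as stand-ins for $P(D)$ and $P(model)$):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(A, B) = -sum_i p_A(v_i) log p_B(v_i)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def kl(p, q):
    # D_KL(A || B) = sum_i p_A(v_i) (log p_A(v_i) - log p_B(v_i))
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

p_data  = np.array([0.7, 0.2, 0.1])   # stand-in for P(D), fixed by the dataset
p_model = np.array([0.6, 0.3, 0.1])   # stand-in for P(model), what we tune

# Identity (4): H(A, B) = D_KL(A || B) + S_A. Since S_A is fixed by the data,
# both lines below print the same number, and the two losses share a minimizer.
print(cross_entropy(p_data, p_model))
print(kl(p_data, p_model) + entropy(p_data))
```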
- Thank you for your answer. It deepened my understanding. So when we have a dataset, it is more effective to minimize cross-entropy rather than KL, right? However, I cannot understand the proper use of them. In other words, when should I minimize KL or cross-entropy? Commented Jul 19, 2018 at 14:00
- After reading your answer, I think there is no use in minimizing KL because we always have a dataset, P(D). Commented Jul 19, 2018 at 14:03
- Ideally, one would choose KL divergence to measure the distance between two distributions. In the context of classification, the cross-entropy loss usually arises from the negative log-likelihood, for example, when you choose a Bernoulli distribution to model your data. – doubllle Commented Jul 19, 2018 at 14:14
- You might want to look at this great post. The symmetry is not a problem in classification, as the goal of machine learning models is to make the predicted distribution as close as possible to the fixed P(D), though regularizations are usually added to avoid overfitting. – doubllle Commented Jul 19, 2018 at 14:35
- Re: "For instance, the event A *I will die eventually* is almost certain, therefore it has low entropy". Not sure what you meant to write here, but technically speaking an event has no entropy. You can define its information, and you can measure the entropy of the distribution or the system. The statement *I will die eventually* isn't an event either. Commented May 30, 2020 at 20:23
I suppose it is because the models usually work with the samples packed in mini-batches. For KL divergence and cross-entropy, their relation can be written as
$$H(p, q) = D_{KL}(p \parallel q) + H(p),$$
so
$$D_{KL}(p \parallel q) = H(p, q) - H(p).$$
From the equation, we can see that the KL divergence decomposes into the cross-entropy of $p$ and $q$ (the first part) minus the entropy of the ground truth $p$ (the second part).

In many machine learning projects, minibatches are involved to expedite training, where the $p'$ of a minibatch may be different from the global $p$. In such a case, cross-entropy is relatively more robust in practice, while KL divergence needs a more stable $H(p)$ to do its job.
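A small sketch of that point (plain NumPy; the minibatch target $p'$ and the two candidate model outputs are invented for illustration): for any model output $q$, cross-entropy and KL divergence differ only by $H(p')$, a term the model cannot influence, so gradient-based updates are identical and only the reported loss value shifts from batch to batch.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

p_batch = np.array([0.6, 0.3, 0.1])   # soft targets p' for one minibatch
q_a = np.array([0.5, 0.3, 0.2])       # two candidate model outputs
q_b = np.array([0.7, 0.2, 0.1])

# For every q, CE and KL differ by the same model-independent constant H(p'),
# so their gradients with respect to the model coincide; only the loss values
# reported across minibatches are offset by H(p').
print(cross_entropy(p_batch, q_a) - kl(p_batch, q_a))   # == H(p')
print(cross_entropy(p_batch, q_b) - kl(p_batch, q_b))   # == H(p')
print(entropy(p_batch))                                  # H(p') itself
```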
- This answer is what I was looking for. In my own current experience, which involves learning target probabilities, BCE is way more robust than KL. Basically, KL was unusable. KL and BCE aren't "equivalent" loss functions. Commented Nov 29, 2019 at 16:31
- When you said "the first part" and "the second part", which one was which? – Josh Commented May 30, 2020 at 20:27
- @zewen's answer can be misleading, as he claims that in mini-batch training CE can be more robust than KL. In most standard mini-batch training we use a gradient-based approach, and the gradient of $H(p)$ with respect to $q$ (which is a function of our model parameters) would be zero. So in these cases CE and KL as loss functions are identical. Commented Sep 23, 2021 at 13:41
- Are you sure the 1st formula is correct? It seems the $p$, $q$ are ordered wrong. Commented Sep 28, 2022 at 3:29
- I don't understand why the constant makes the training less robust. The gradient should still be exactly the same, no? So is it just that your loss curve may look a bit more jiggly, but your training is still unchanged? Commented Dec 9, 2023 at 19:33
This is how I think about it:
$$D_{KL}\big(p(y_i \mid x_i) \,\|\, q(y_i \mid x_i, \theta)\big) = H\big(p(y_i \mid x_i),\, q(y_i \mid x_i, \theta)\big) - H\big(p(y_i \mid x_i)\big), \tag{1}$$
where $p$ and $q$ are two probability distributions. In machine learning, we typically know $p$, which is the distribution of the target. For example, in a binary classification problem, $y_i \in \{0, 1\}$, so if $y_i = 1$, then $p(y_i = 1 \mid x_i) = 1$ and $p(y_i = 0 \mid x_i) = 0$, and vice versa. Given each $y_i$ for $i = 1, \dots, N$, where $N$ is the total number of points in the dataset, we typically want to minimize the KL divergence between the distribution of the target $p(y_i \mid x_i)$ and our predicted distribution $q(y_i \mid x_i, \theta)$, averaged over all $i$. (We do so by tuning our model parameters $\theta$. Thus, for each training example, the model is spitting out a distribution over the class labels $0$ and $1$.) For each example, since the target is fixed, its distribution never changes. Thus, $H(p(y_i \mid x_i))$ is constant for each $i$, regardless of what our current model parameters $\theta$ are. Thus, the minimizer of $D_{KL}(p \parallel q)$ is equal to the minimizer of $H(p, q)$.
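A tiny numerical check of that argument (NumPy; the target and prediction are made up for the example): for a one-hot target $p$, $H(p) = 0$, so KL divergence and cross-entropy are numerically identical and both reduce to the negative log-likelihood of the true class.

```python
import numpy as np

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p_i = 0 contribute nothing
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # 0 * log 0 is taken as 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

p_target = np.array([0.0, 1.0])        # one-hot target for an example with y_i = 1
q_model  = np.array([0.2, 0.8])        # model's predicted distribution over {0, 1}

# H(p_target) = 0, so KL and cross-entropy coincide exactly:
# both equal -log q(y_i = 1 | x_i, theta) = -log 0.8.
print(kl(p_target, q_model))            # ~0.223
print(cross_entropy(p_target, q_model)) # ~0.223
```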
If you had a situation where $p$ and $q$ were both variable (say, a setting with two latent variables, one distributed according to $p$ and the other according to $q$) and you wanted to match the two distributions, then you would have to choose between minimizing $D_{KL}(p \parallel q)$ and minimizing $H(p, q)$. This is because minimizing $D_{KL}(p \parallel q)$ implies maximizing $H(p)$, while minimizing $H(p, q)$ implies minimizing $H(p)$. To see the latter, we can solve equation (1) for $H(p, q)$:
$$H(p, q) = D_{KL}(p \parallel q) + H(p). \tag{2}$$
In VI, you must choose between minimizing $D_{KL}(q \parallel p)$ and $D_{KL}(p \parallel q)$, which are not equal since the KL divergence is not symmetric. If we once again treat $p$ as known, then minimizing $D_{KL}(q \parallel p)$ would result in a distribution $q$ that is sharp and focused on one or a few modes of $p$, while minimizing $D_{KL}(p \parallel q)$ would result in a $q$ that is wide and covers a broad range of the domain of $p$. The latter is because minimizing $D_{KL}(p \parallel q)$ heavily penalizes $q$ for assigning low probability anywhere $p$ has mass, so $q$ must spread out to cover $p$.
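To see the two behaviours numerically, here is a small grid-search sketch (plain NumPy; the bimodal target, the unimodal variational family, and the search grids are all invented for illustration, not part of the answer above): fitting a single bump to a two-mode target, the reverse KL $D_{KL}(q \parallel p)$ picks one mode and stays narrow, while the forward KL $D_{KL}(p \parallel q)$ stretches the bump to cover both modes.

```python
import numpy as np

x = np.linspace(-6, 6, 601)            # discretized support

def bump(mu, sigma):
    # A normalized, Gaussian-shaped distribution on the grid.
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

def kl(a, b, eps=1e-12):
    # Discrete KL divergence with a small epsilon to avoid log(0).
    return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

# A bimodal "true" distribution p that no single bump can match exactly.
p = 0.5 * bump(-2.0, 0.5) + 0.5 * bump(2.0, 0.5)

# Search the unimodal family q(mu, sigma) under each objective.
grid = [(mu, sigma)
        for mu in np.linspace(-3, 3, 61)
        for sigma in np.linspace(0.2, 4.0, 39)]
best_reverse = min(grid, key=lambda t: kl(bump(*t), p))   # minimize D_KL(q || p)
best_forward = min(grid, key=lambda t: kl(p, bump(*t)))   # minimize D_KL(p || q)

print("reverse KL fit (sharp, mode-seeking):", best_reverse)  # hugs one mode, small sigma
print("forward KL fit (wide, mass-covering):", best_forward)  # centered, large sigma
```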
- In equation (1), on the left side you don't have $\theta$ in $p$, whereas on the right side you have it. Why? Also, in the 5th row you should use $D_{KL}(q \parallel p)$ instead of $D_{KL}(p \parallel q)$. – Rodvi Commented May 19, 2020 at 13:45
- Also, will the entropy typically be constant in the case of generative classifiers, in the case of regression models, and in the case of non-parametric models (not assuming the latent-variable case)? – Rodvi Commented May 19, 2020 at 14:05
Some answers have already been provided, but I would like to point out something regarding the question itself:

*measure the distance between two probability distributions*

Neither cross-entropy nor KL divergence measures the *distance* between two distributions; instead, they measure the *difference* between two distributions [1]. They are not distances because of the asymmetry, i.e. $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$ and $H(P, Q) \neq H(Q, P)$.
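A two-line numerical check (NumPy; the distributions $P$ and $Q$ are arbitrary examples): swapping the arguments changes both quantities, so neither satisfies the symmetry required of a distance metric.

```python
import numpy as np

def kl(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(a * (np.log(a) - np.log(b)))

def cross_entropy(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return -np.sum(a * np.log(b))

P = np.array([0.9, 0.1])
Q = np.array([0.5, 0.5])

print(kl(P, Q), kl(Q, P))                        # ~0.368 vs ~0.511  -> not symmetric
print(cross_entropy(P, Q), cross_entropy(Q, P))  # ~0.693 vs ~1.204  -> not symmetric
```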
Reference:
[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, vol. 1. Cambridge, MA: MIT Press, 2016.