begin quote from: Cross Validated (the statistics Stack Exchange).
Both the cross-entropy and the KL divergence are tools to measure the distance between two probability distributions, but what is the difference between them?
Moreover, it turns out that the minimization of KL divergence is equivalent to the minimization of cross-entropy.
I want to understand them intuitively.
4 Answers
You will need some conditions to claim the equivalence between minimizing cross-entropy and minimizing KL divergence. I will put your question in the context of classification problems that use cross-entropy as the loss function.
Let us first recall that entropy is used to measure the uncertainty of a system, which is defined as
$$S(v) = -\sum_i p(v_i)\log p(v_i), \tag{1}$$
for $p(v_i)$ the probabilities of the different states $v_i$ of the system. From an information-theory point of view, $S(v)$ is the amount of information needed to remove the uncertainty.

For instance, the event $A$ = *I will die within 200 years* is almost certain (we may solve the aging problem, hence the word *almost*), therefore it has low uncertainty: it requires only the information that *the aging problem cannot be solved* to make it certain. However, the event $B$ = *I will die within 50 years* is more uncertain than event $A$, thus it needs more information to remove its uncertainty. Here, entropy can be used to quantify the uncertainty of the distribution *When will I die?*, which can be regarded as the expectation of the uncertainties of individual events like $A$ and $B$.
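As a quick numerical illustration (a minimal NumPy sketch; the two example distributions are made up, not part of the answer), an almost-certain distribution has low entropy while a more spread-out one has higher entropy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy S(p) = -sum_i p_i * log(p_i), ignoring zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -np.sum(p * np.log(p))

almost_certain  = [0.99, 0.01]        # little information needed to remove its uncertainty
quite_uncertain = [0.5, 0.3, 0.2]     # more information needed

print(entropy(almost_certain))        # ~0.056 nats
print(entropy(quite_uncertain))       # ~1.03 nats
```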
Now look at the definition of the KL divergence between distributions $A$ and $B$:
$$D_{KL}(A \parallel B) = \sum_i p_A(v_i)\log p_A(v_i) - \sum_i p_A(v_i)\log p_B(v_i), \tag{2}$$
where the first term on the right-hand side is the negative entropy $-S_A$ of distribution $A$, and the second term can be interpreted as the expectation of $-\log p_B$ under distribution $A$. So $D_{KL}$ describes how different $B$ is from $A$, from the perspective of $A$.

To relate cross-entropy to entropy and KL divergence, we formalize the cross-entropy in terms of distributions $A$ and $B$ as
$$H(A, B) = -\sum_i p_A(v_i)\log p_B(v_i). \tag{3}$$
From the definitions, we can easily see that
$$H(A, B) = D_{KL}(A \parallel B) + S_A. \tag{4}$$
If $S_A$ is a constant, then minimizing $H(A, B)$ is equivalent to minimizing $D_{KL}(A \parallel B)$.

A further question follows naturally: how can the entropy be a constant? In a machine learning task, we start with a dataset (denoted $P(D)$) which represents the problem to be solved, and the learning purpose is to make the model's estimated distribution (denoted $P(model)$) as close as possible to the true distribution of the problem (denoted $P(truth)$). $P(truth)$ is unknown and is represented by $P(D)$. Therefore, in an ideal world, we expect
$$P(model) \approx P(D) \approx P(truth)$$
and we minimize $D_{KL}(P(D) \parallel P(model))$. Luckily, in practice $D$ is given, which means its entropy $S(D)$ is fixed as a constant.
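To make identity (4) concrete, here is a small NumPy sketch (the two distributions are invented for illustration; think of them as stand-ins for $P(D)$ and $P(model)$):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(A, B) = -sum_i p_A(v_i) log p_B(v_i)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def kl(p, q):
    # D_KL(A || B) = sum_i p_A(v_i) (log p_A(v_i) - log p_B(v_i))
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

p_data  = np.array([0.7, 0.2, 0.1])   # stand-in for P(D), fixed by the dataset
p_model = np.array([0.6, 0.3, 0.1])   # stand-in for P(model), what we tune

# Identity (4): H(A, B) = D_KL(A || B) + S_A. Since S_A is fixed by the data,
# both lines below print the same number, and the two losses share a minimizer.
print(cross_entropy(p_data, p_model))
print(kl(p_data, p_model) + entropy(p_data))
```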
- Thank you for your answer. It deepened my understanding. So when we have a dataset, it is more effective to minimize cross-entropy rather than KL, right? However, I cannot understand the proper use of them. In other words, when should I minimize KL or cross-entropy? Commented Jul 19, 2018 at 14:00
- After reading your answer, I think there is no use in minimizing KL because we always have a dataset, P(D). Commented Jul 19, 2018 at 14:03
- Ideally, one would choose KL divergence to measure the distance between two distributions. In the context of classification, the cross-entropy loss usually arises from the negative log-likelihood, for example, when you choose a Bernoulli distribution to model your data. – doubllle Commented Jul 19, 2018 at 14:14
- You might want to look at this great post. The symmetry is not a problem in classification, as the goal of machine learning models is to make the predicted distribution as close as possible to the fixed P(D), though regularizations are usually added to avoid overfitting. – doubllle Commented Jul 19, 2018 at 14:35
- Re: "For instance, the event A *I will die eventually* is almost certain, therefore it has low entropy". Not sure what you meant to write here, but technically speaking an event has no entropy. You can define its information, and you can measure the entropy of the distribution or the system. The statement *I will die eventually* isn't an event either. Commented May 30, 2020 at 20:23
I suppose it is because the models usually work with the samples packed in mini-batches. For KL divergence and cross-entropy, their relation can be written as
$$H(p, q) = D_{KL}(p \parallel q) + H(p),$$
so
$$D_{KL}(p \parallel q) = H(p, q) - H(p).$$
From the equation, we can see that the KL divergence decomposes into the cross-entropy of $p$ and $q$ (the first part) minus the entropy of the ground truth $p$ (the second part).

In many machine learning projects, minibatches are involved to expedite training, where the $p'$ of a minibatch may be different from the global $p$. In such a case, cross-entropy is relatively more robust in practice, while KL divergence needs a more stable $H(p)$ to do its job.
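A small sketch of that point (plain NumPy; the minibatch target $p'$ and the two candidate model outputs are invented for illustration): for any model output $q$, cross-entropy and KL divergence differ only by $H(p')$, a term the model cannot influence, so gradient-based updates are identical and only the reported loss value shifts from batch to batch.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

p_batch = np.array([0.6, 0.3, 0.1])   # soft targets p' for one minibatch
q_a = np.array([0.5, 0.3, 0.2])       # two candidate model outputs
q_b = np.array([0.7, 0.2, 0.1])

# For every q, CE and KL differ by the same model-independent constant H(p'),
# so their gradients with respect to the model coincide; only the loss values
# reported across minibatches are offset by H(p').
print(cross_entropy(p_batch, q_a) - kl(p_batch, q_a))   # == H(p')
print(cross_entropy(p_batch, q_b) - kl(p_batch, q_b))   # == H(p')
print(entropy(p_batch))                                  # H(p') itself
```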
- This answer is what I was looking for. In my own current experience, which involves learning target probabilities, BCE is way more robust than KL. Basically, KL was unusable. KL and BCE aren't "equivalent" loss functions. Commented Nov 29, 2019 at 16:31
- When you said "the first part" and "the second part", which one was which? – Josh Commented May 30, 2020 at 20:27
- @zewen's answer can be misleading, as he claims that in mini-batch training CE can be more robust than KL. In most standard mini-batch training we use a gradient-based approach, and the gradient of $H(p)$ with respect to $q$ (which is a function of our model parameters) would be zero. So in these cases CE and KL as loss functions are identical. Commented Sep 23, 2021 at 13:41
- Are you sure the 1st formula is correct? It seems the $p$, $q$ are ordered wrong. Commented Sep 28, 2022 at 3:29
- I don't understand why the constant makes the training less robust. The gradient should still be exactly the same, no? So is it just that your loss curve may look a bit more jiggly, but your training is still unchanged? Commented Dec 9, 2023 at 19:33
This is how I think about it:
$$D_{KL}\big(p(y_i \mid x_i) \,\|\, q(y_i \mid x_i, \theta)\big) = H\big(p(y_i \mid x_i),\, q(y_i \mid x_i, \theta)\big) - H\big(p(y_i \mid x_i)\big), \tag{1}$$
where $p$ and $q$ are two probability distributions. In machine learning, we typically know $p$, which is the distribution of the target. For example, in a binary classification problem, $y_i \in \{0, 1\}$, so if $y_i = 1$, then $p(y_i = 1 \mid x_i) = 1$ and $p(y_i = 0 \mid x_i) = 0$, and vice versa. Given each $y_i$ for $i = 1, \dots, N$, where $N$ is the total number of points in the dataset, we typically want to minimize the KL divergence between the distribution of the target $p(y_i \mid x_i)$ and our predicted distribution $q(y_i \mid x_i, \theta)$, averaged over all $i$. (We do so by tuning our model parameters $\theta$. Thus, for each training example, the model is spitting out a distribution over the class labels $0$ and $1$.) For each example, since the target is fixed, its distribution never changes. Thus, $H(p(y_i \mid x_i))$ is constant for each $i$, regardless of what our current model parameters $\theta$ are. Thus, the minimizer of $D_{KL}(p \parallel q)$ is equal to the minimizer of $H(p, q)$.
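A tiny numerical check of that argument (NumPy; the target and prediction are made up for the example): for a one-hot target $p$, $H(p) = 0$, so KL divergence and cross-entropy are numerically identical and both reduce to the negative log-likelihood of the true class.

```python
import numpy as np

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p_i = 0 contribute nothing
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # 0 * log 0 is taken as 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

p_target = np.array([0.0, 1.0])        # one-hot target for an example with y_i = 1
q_model  = np.array([0.2, 0.8])        # model's predicted distribution over {0, 1}

# H(p_target) = 0, so KL and cross-entropy coincide exactly:
# both equal -log q(y_i = 1 | x_i, theta) = -log 0.8.
print(kl(p_target, q_model))            # ~0.223
print(cross_entropy(p_target, q_model)) # ~0.223
```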
If you had a situation where $p$ and $q$ were both variable (say, a setting with two latent variables, one distributed according to $p$ and the other according to $q$) and you wanted to match the two distributions, then you would have to choose between minimizing $D_{KL}(p \parallel q)$ and minimizing $H(p, q)$. This is because minimizing $D_{KL}(p \parallel q)$ implies maximizing $H(p)$, while minimizing $H(p, q)$ implies minimizing $H(p)$. To see the latter, we can solve equation (1) for $H(p, q)$:
$$H(p, q) = D_{KL}(p \parallel q) + H(p). \tag{2}$$
In VI, you must choose between minimizing $D_{KL}(q \parallel p)$ and $D_{KL}(p \parallel q)$, which are not equal since the KL divergence is not symmetric. If we once again treat $p$ as known, then minimizing $D_{KL}(q \parallel p)$ would result in a distribution $q$ that is sharp and focused on one or a few modes of $p$, while minimizing $D_{KL}(p \parallel q)$ would result in a $q$ that is wide and covers a broad range of the domain of $p$. The latter is because minimizing $D_{KL}(p \parallel q)$ heavily penalizes $q$ for assigning low probability anywhere $p$ has mass, so $q$ must spread out to cover $p$.
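To see the two behaviours numerically, here is a small grid-search sketch (plain NumPy; the bimodal target, the unimodal variational family, and the search grids are all invented for illustration, not part of the answer above): fitting a single bump to a two-mode target, the reverse KL $D_{KL}(q \parallel p)$ picks one mode and stays narrow, while the forward KL $D_{KL}(p \parallel q)$ stretches the bump to cover both modes.

```python
import numpy as np

x = np.linspace(-6, 6, 601)            # discretized support

def bump(mu, sigma):
    # A normalized, Gaussian-shaped distribution on the grid.
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

def kl(a, b, eps=1e-12):
    # Discrete KL divergence with a small epsilon to avoid log(0).
    return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

# A bimodal "true" distribution p that no single bump can match exactly.
p = 0.5 * bump(-2.0, 0.5) + 0.5 * bump(2.0, 0.5)

# Search the unimodal family q(mu, sigma) under each objective.
grid = [(mu, sigma)
        for mu in np.linspace(-3, 3, 61)
        for sigma in np.linspace(0.2, 4.0, 39)]
best_reverse = min(grid, key=lambda t: kl(bump(*t), p))   # minimize D_KL(q || p)
best_forward = min(grid, key=lambda t: kl(p, bump(*t)))   # minimize D_KL(p || q)

print("reverse KL fit (sharp, mode-seeking):", best_reverse)  # hugs one mode, small sigma
print("forward KL fit (wide, mass-covering):", best_forward)  # centered, large sigma
```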
- In equation (1), on the left side you don't have $\theta$ in $p$, whereas on the right side you have it. Why? Also, in the 5th row you should use $D_{KL}(q \parallel p)$ instead of $D_{KL}(p \parallel q)$. – Rodvi Commented May 19, 2020 at 13:45
- Also, will the entropy typically be constant in the case of generative classifiers, in the case of regression models, and in the case of non-parametric models (not assuming the latent-variable case)? – Rodvi Commented May 19, 2020 at 14:05
Some answers have already been provided, but I would like to point out something regarding the question itself:

*measure the distance between two probability distributions*

Neither cross-entropy nor KL divergence measures the *distance* between two distributions; instead, they measure the *difference* between two distributions [1]. They are not distances because of the asymmetry, i.e. $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$ and $H(P, Q) \neq H(Q, P)$.
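A two-line numerical check (NumPy; the distributions $P$ and $Q$ are arbitrary examples): swapping the arguments changes both quantities, so neither satisfies the symmetry required of a distance metric.

```python
import numpy as np

def kl(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(a * (np.log(a) - np.log(b)))

def cross_entropy(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return -np.sum(a * np.log(b))

P = np.array([0.9, 0.1])
Q = np.array([0.5, 0.5])

print(kl(P, Q), kl(Q, P))                        # ~0.368 vs ~0.511  -> not symmetric
print(cross_entropy(P, Q), cross_entropy(Q, P))  # ~0.693 vs ~1.204  -> not symmetric
```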
Reference:
[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, vol. 1. Cambridge, MA: MIT Press, 2016.