Intuitive fred888
To the best of my ability I write about my experience of the Universe Past, Present and Future
Friday, May 15, 2026
Speaking of probabilities:
Many of the people dying from their exposure to Artificial Intelligence don't seem to really know who or what they are conversing with, and this often ends their lives when they don't have enough adult experience to understand what they are dealing with. Artificial Intelligence is neither a who nor a what, and I like to say that Artificial Intelligence is like an adding machine in binary run by algorithms.
So, the first thing you need to know is that you are NOT communicating with a who or a what; what you are talking or writing to online is more like a really advanced adding machine in binary.
For example, I use Google AI to ask a lot of the questions I post at this site when I have questions about something I'm studying. Today I'm studying AI a bit, because parts of the movie "Good Luck Have Fun Don't Die" (2025) were very disturbing to me because of the questions the movie asks.
So, even though the premise is exaggerated to entertain people, like a comedy-horror movie about a dystopian future, the questions it raises are valid ones that we should all be asking about our future on earth now.
For example, Trump's antics could be happening because everyone's jobs are slowly or quickly going away, so he is sort of entertaining us with crazy things while we all die and people all over the world lose their jobs to AI.
Or maybe better said, "It is the death of the slave class worldwide (including here in America)."
So, if people don't watch carefully what is happening, most people worldwide will starve to death this century because of Trump, AI, and rich people using these developments AGAINST the good of the people of earth for the most part.
So, the only way to survive all this, I presently believe, is to not be a slave in your thinking.
In other words, if you are not captain of your own ship and master of your own destiny, then it's likely you are not going to survive this century the way things are presently going worldwide.
The ONLY way to survive this century is to be an ultimate opportunist, always looking for ways to survive whatever comes while being as kind as possible to everyone around you, but still surviving, worldwide!
By God's Grace
begin quote from: Cross-Validated's Statistics Stack Exchange.
Both the cross-entropy and the KL divergence are tools to measure the distance between two probability distributions, but what is the difference between them?
Moreover, it turns out that the minimization of KL divergence is equivalent to the minimization of cross-entropy.
I want to understand them intuitively.
4 Answers
You will need some conditions to claim the equivalence between minimizing cross entropy and minimizing KL divergence. I will put your question in the context of classification problems that use cross entropy as the loss function.
Let us first recall that entropy is used to measure the uncertainty of a system, which is defined as
$$H(p) = -\sum_i p(v_i)\,\log p(v_i),$$
where $p(v_i)$ is the probability of the system being in state $v_i$.
For instance, the event "I will die within 200 years" is almost certain (we say "almost" because we may yet solve the aging problem), therefore it has low uncertainty; it requires only the information "the aging problem cannot be solved" to make it certain. However, the event "I will die within 50 years" is more uncertain. The uncertainty of the question "When will I die?" can be regarded as the expectation of the uncertainties of the individual events, i.e.
$$H(p) = \mathbb{E}_{v \sim p}\big[-\log p(v)\big].$$
Now look at the definition of the KL divergence between distributions $A$ and $B$:
$$D_{KL}(A\,\|\,B) = \sum_i p_A(v_i)\,\log\frac{p_A(v_i)}{p_B(v_i)}.$$
To relate cross entropy to entropy and KL divergence, we formalize the cross entropy in terms of the distributions $A$ and $B$:
$$H(A, B) = -\sum_i p_A(v_i)\,\log p_B(v_i),$$
which, combined with the definitions above, gives
$$H(A, B) = H(A) + D_{KL}(A\,\|\,B).$$
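To make these three quantities concrete, here is a minimal sketch in plain NumPy (the two distributions are arbitrary made-up values, not taken from the answer) that computes entropy, cross-entropy, and KL divergence and checks the identity $H(A,B) = H(A) + D_{KL}(A\|B)$:

```python
# Minimal sketch: entropy, cross-entropy, KL divergence on two discrete
# distributions, and a check of H(A, B) = H(A) + D_KL(A || B).
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i log q_i."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return np.sum(p * np.log(p / q))

A = np.array([0.7, 0.2, 0.1])   # "true" distribution (hypothetical values)
B = np.array([0.5, 0.3, 0.2])   # "model" distribution (hypothetical values)

print(cross_entropy(A, B))               # H(A, B)
print(entropy(A) + kl_divergence(A, B))  # H(A) + D_KL(A || B): same number
```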
A further question follows naturally: how can the entropy be a constant? In a machine learning task, we start with a dataset (denoted as $P(D)$) that represents the problem to be solved. Its distribution is fixed, so its entropy $H(P(D))$ is a constant, and minimizing the cross-entropy $H(P(D), P(\mathrm{model}))$ is therefore equivalent to minimizing the KL divergence $D_{KL}(P(D)\,\|\,P(\mathrm{model}))$.
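As a rough illustration of that equivalence (made-up numbers and a toy one-parameter model, not anything from the original answer): with the data distribution $p$ fixed, the cross-entropy and the KL divergence differ only by the constant $H(p)$, so both are minimized by the same parameter value.

```python
# Sketch: with p fixed, H(p, q_theta) and D_KL(p || q_theta) differ only by
# the constant H(p), so they share the same minimizer over theta.
import numpy as np

p = np.array([0.8, 0.2])                 # fixed empirical distribution P(D)

def q(theta):
    """A hypothetical one-parameter model distribution."""
    return np.array([theta, 1.0 - theta])

thetas = np.linspace(0.05, 0.95, 19)
ce  = [-np.sum(p * np.log(q(t))) for t in thetas]      # H(p, q_theta)
kl  = [np.sum(p * np.log(p / q(t))) for t in thetas]   # D_KL(p || q_theta)
h_p = -np.sum(p * np.log(p))                           # constant H(p)

print(np.allclose(np.array(ce) - np.array(kl), h_p))   # True: CE = KL + H(p)
print(thetas[np.argmin(ce)], thetas[np.argmin(kl)])    # same minimizer (0.8)
```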
- Thank you for your answer. It deepened my understanding. So when we have a dataset, it is more effective to minimize cross-entropy rather than KL, right? However, I still cannot understand the proper use of them. In other words, when should I minimize KL and when cross-entropy? Commented Jul 19, 2018 at 14:00
- After reading your answer, I think there is no use in minimizing KL directly, because we always have a dataset, $P(D)$. Commented Jul 19, 2018 at 14:03
- Ideally, one would choose KL divergence to measure the distance between two distributions. In the context of classification, the cross-entropy loss usually arises from the negative log likelihood, for example when you choose a Bernoulli distribution to model your data. – doubllle Commented Jul 19, 2018 at 14:14
- You might want to look at this great post. The symmetry is not a problem in classification, as the goal of machine learning models is to make the predicted distribution as close as possible to the fixed $P(D)$, though regularizations are usually added to avoid overfitting. – doubllle Commented Jul 19, 2018 at 14:35
- Re: "For instance, the event A = 'I will die eventually' is almost certain, therefore it has low entropy." Not sure what you meant to write here, but technically speaking an event has no entropy. You can define its information, and you can measure the entropy of the distribution or the system. The statement "I will die eventually" isn't an event either. Commented May 30, 2020 at 20:23
I suppose it is because the models usually work with samples packed in mini-batches. For KL divergence and cross-entropy, their relation can be written as
$$D_{KL}(p\,\|\,q) = H(p, q) - H(p),$$
so the cross-entropy equals the KL divergence plus the entropy of the target distribution.
In many machine learning projects, mini-batches are involved to expedite training, where the empirical distribution $p$ of a mini-batch can differ from the global distribution of the dataset. The entropy term $H(p)$ then varies from batch to batch rather than staying constant, which is where the two losses can behave differently in practice.
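A small synthetic sketch of that mini-batch point (random labels and a hypothetical fixed model distribution, purely illustrative): the empirical distribution of each mini-batch differs from the global one, so the entropy term $H(p_{\text{batch}})$, which is the gap between cross-entropy and KL, varies from batch to batch.

```python
# Sketch: per-mini-batch cross-entropy vs. KL against a fixed model q.
# The gap CE - KL equals H(p_batch), which changes from batch to batch.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.binomial(1, 0.3, size=1000)   # global class-1 frequency ~ 0.3
q = np.array([0.7, 0.3])                   # hypothetical model distribution

def empirical(batch):
    p1 = batch.mean()
    return np.array([1.0 - p1, p1])

for batch in labels[:200].reshape(4, 50):  # four mini-batches of 50 samples
    p = np.clip(empirical(batch), 1e-9, 1.0)
    ce = -np.sum(p * np.log(q))
    kl = np.sum(p * np.log(p / q))
    print(f"H(p_batch)={ce - kl:.4f}  CE={ce:.4f}  KL={kl:.4f}")
```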
- This answer is what I was looking for. In my own current experience, which involves learning target probabilities, BCE is way more robust than KL. Basically, KL was unusable. KL and BCE aren't "equivalent" loss functions. Commented Nov 29, 2019 at 16:31
- When you said "the first part" and "the second part", which one was which? – Josh Commented May 30, 2020 at 20:27
- @zewen's answer can be misleading, as he claims that in mini-batch training CE can be more robust than KL. In most standard mini-batch training, we use a gradient-based approach, and the gradient of $H(p)$ with respect to $q$ (which is a function of our model parameters) would be zero. So in these cases, CE and KL as loss functions are identical. Commented Sep 23, 2021 at 13:41
- Are you sure the first formula is correct? It seems that $p$ and $q$ are in the wrong order. Commented Sep 28, 2022 at 3:29
- I don't understand why the $H(p)$ constant makes the training less robust. The gradient should still be exactly the same, no? So is it just that your loss curve may look a bit more jiggly, but your training is still unchanged? Commented Dec 9, 2023 at 19:33
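A quick finite-difference check of the point raised in the last two comments (toy numbers and an assumed one-parameter model, not from the thread): when the target $p$ is fixed, $H(p)$ does not depend on the model parameter, so the gradients of cross-entropy and KL divergence with respect to that parameter coincide.

```python
# Sketch: with p fixed, d/dtheta of CE and of KL are the same,
# because H(p) contributes zero gradient. Checked by central differences.
import numpy as np

p = np.array([0.6, 0.4])                 # fixed target distribution

def q(theta):
    return np.array([theta, 1.0 - theta])  # toy one-parameter model

ce = lambda t: -np.sum(p * np.log(q(t)))       # H(p, q_theta)
kl = lambda t: np.sum(p * np.log(p / q(t)))    # D_KL(p || q_theta)

t, eps = 0.3, 1e-6
grad_ce = (ce(t + eps) - ce(t - eps)) / (2 * eps)
grad_kl = (kl(t + eps) - kl(t - eps)) / (2 * eps)
print(grad_ce, grad_kl)                  # essentially identical values
```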
This is how I think about it:
$$D_{KL}\big(p(y_i \mid x_i)\,\big\|\,q(y_i \mid x_i, \theta)\big) = H\big(p(y_i \mid x_i, \theta),\, q(y_i \mid x_i, \theta)\big) - H\big(p(y_i \mid x_i, \theta)\big), \tag{1}$$
where $p(y_i \mid x_i)$ is the data distribution of the labels and $q(y_i \mid x_i, \theta)$ is the distribution predicted by the model with parameters $\theta$.
If you had a situation where the target distribution $p$ also depended on $\theta$, the entropy term would no longer be a constant, and minimizing the KL divergence would no longer coincide with minimizing the cross-entropy.
In VI, you must choose between minimizing the full KL divergence and minimizing only the cross-entropy term, and the two are not equivalent there, because the entropy of the variational distribution changes with its parameters.
- In equation (1), on the left side you don't have $\theta$ in $p(y_i \mid x_i)$, whereas on the right side you have $p(y_i \mid x_i, \theta)$. Why? Also, in the 5th row you should use $x_i$ instead of $x$. – Rodvi Commented May 19, 2020 at 13:45
- Also, will the entropy $H(p)$ typically be constant in the case of generative classifiers $q(y, x \mid \theta)$, in the case of regression models, and in the case of non-parametric models (not assuming the latent-variable case)? – Rodvi Commented May 19, 2020 at 14:05
Some answers have already been provided, but I would like to point out, regarding the phrase in the question itself,
measure the distance between two probability distributions
that neither cross-entropy nor KL divergence measures the distance between two distributions; instead, they measure the difference between two distributions [1]. It is not a distance because of the asymmetry, i.e., in general
$$D_{KL}(A\,\|\,B) \neq D_{KL}(B\,\|\,A) \quad\text{and}\quad H(A, B) \neq H(B, A).$$
Reference:
[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, vol. 1. Cambridge, MA: MIT Press, 2016.
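A tiny numeric check of that asymmetry (arbitrary example distributions, not from the answer): swapping the arguments changes both the KL divergence and the cross-entropy, which is why neither is a distance in the metric sense.

```python
# Sketch: KL divergence and cross-entropy are not symmetric in their arguments.
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

A = np.array([0.9, 0.1])
B = np.array([0.5, 0.5])

print(kl(A, B), kl(B, A))                        # different values
print(cross_entropy(A, B), cross_entropy(B, A))  # also different
```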
