- Mastering Machine Learning Algorithms
- Giuseppe Bonaccorso
Categorical cross-entropy
Categorical cross-entropy is the most widely used classification cost function, adopted by logistic regression and the majority of neural architectures. The generic analytical expression is:
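A standard form, writing y_ik for the one-hot target and ŷ_ik for the probability the model assigns to class k for the i-th of N samples over K classes, is:

\[
L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log \hat{y}_{ik}
\]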
This cost function is convex with respect to the predicted probabilities (and, for linear models such as logistic regression, with respect to the parameters as well), so it can be easily optimized using stochastic gradient descent techniques; moreover, it has another important interpretation. If we are training a classifier, our goal is to create a model whose distribution is as similar as possible to p_data. This condition can be achieved by minimizing the Kullback-Leibler divergence between the two distributions:
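For discrete distributions, the divergence between p_data and the model distribution p_M reads:

\[
D_{KL}(p_{data}\,\|\,p_M) = \sum_{x} p_{data}(x)\,\log \frac{p_{data}(x)}{p_M(x)}
\]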
In the previous expression, p_M is the distribution generated by the model. Now, if we rewrite the divergence, we get:
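Splitting the logarithm of the ratio yields:

\[
D_{KL}(p_{data}\,\|\,p_M) = \sum_{x} p_{data}(x)\,\log p_{data}(x) \;-\; \sum_{x} p_{data}(x)\,\log p_M(x) = -H(p_{data}) + H(p_{data}, p_M)
\]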
The first term is the (negative) entropy of the data-generating distribution, and it doesn't depend on the model parameters, while the second one is the cross-entropy. Therefore, if we minimize the cross-entropy, we also minimize the Kullback-Leibler divergence, forcing the model to reproduce a distribution that is very similar to p_data. This elegantly explains why the cross-entropy cost function is an excellent choice for classification problems.
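As a quick numerical check, the identity D_KL(p_data || p_M) = H(p_data, p_M) - H(p_data) can be verified with a minimal NumPy sketch; the probability vectors below are arbitrary illustrative values, not taken from the text:

import numpy as np

# Arbitrary discrete distributions over 3 classes (illustrative values only)
p_data = np.array([0.1, 0.6, 0.3])   # data-generating distribution
p_model = np.array([0.2, 0.5, 0.3])  # distribution produced by the model

entropy = -np.sum(p_data * np.log(p_data))                  # H(p_data): independent of the model
cross_entropy = -np.sum(p_data * np.log(p_model))           # H(p_data, p_M)
kl_divergence = np.sum(p_data * np.log(p_data / p_model))   # D_KL(p_data || p_M)

# The divergence and the cross-entropy differ only by a model-independent constant,
# so minimizing one minimizes the other
assert np.isclose(kl_divergence, cross_entropy - entropy)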