Sparse Categorical Cross-Entropy vs Categorical Cross-Entropy

Felipe A. Moreno
3 min read · Nov 13, 2021


Many of you have asked the following question: “In which situations should I use a specific loss function, such as categorical, sparse categorical, or binary cross-entropy?”

So, here is a summary of the common cases and the activation/loss pairs that go with them:

  • Regression (numerical output): linear activation + MSE
  • Binary classification (2 classes): sigmoid activation + binary cross-entropy
  • Multi-class, single-label classification: softmax activation + categorical (or sparse categorical) cross-entropy
  • Multi-class, multi-label classification: per-label sigmoid activation + binary cross-entropy

Regression, numerical value

There is not much to say here, since this is the most basic case and we don’t need to explain how it works. Just know that the activation function is linear (the identity) and the loss function is typically MSE (Mean Squared Error).
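As a minimal sketch in Keras (the toy dataset and the layer sizes below are placeholders, not part of the original post), a regression head is just a Dense layer with a linear activation trained with MSE:

```python
import numpy as np
import tensorflow as tf

# Minimal regression sketch: one hidden layer and a linear output,
# trained with mean squared error on toy data (y = 3x + noise).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1, activation="linear"),  # linear output for regression
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(256, 1)
y = 3 * x + 0.1 * np.random.randn(256, 1)
model.fit(x, y, epochs=5, verbose=0)
```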

Classification:

Classification tasks can be sub-divided into:

  • Binary classification (2 classes)
  • Single-label, multi-class classification
  • Multi-label, multi-class classification

For all of these cases we will use the Cross-Entropy Loss. The main difference between them is the input to this loss, i.e. the output of the activation function.

The formulation of the Cross-Entropy Loss is the following:
Binary (2 classes):

$$\mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

Multi-class ($C$ classes):

$$\mathrm{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

where $y$ is the true label and $\hat{y}$ is the predicted probability.
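To make the two formulas concrete, here is a small NumPy sketch that evaluates both expressions directly (the labels and probabilities are made up for illustration):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob):
    # -[y*log(p) + (1-y)*log(1-p)], averaged over samples
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def multiclass_cross_entropy(y_onehot, y_prob):
    # -sum_c y_c * log(p_c), averaged over samples
    y_onehot, y_prob = np.asarray(y_onehot), np.asarray(y_prob)
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=-1))

print(binary_cross_entropy([1, 0], [0.9, 0.2]))                  # ≈ 0.164
print(multiclass_cross_entropy([[0, 1, 0]], [[0.2, 0.5, 0.3]]))  # -log(0.5) ≈ 0.693
```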

Logistic Loss and Multinomial Logistic Loss are other names for Cross-Entropy loss.

Binary classification, binary cross-entropy loss function

In this case we have the vanilla classification problem (2 classes). The model has only one output value, between 0 and 1, so binary classification with a single output should use the sigmoid function as its activation.

Sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Also called Sigmoid Cross-Entropy loss. It is a Sigmoid activation plus a Cross-Entropy loss.
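In Keras, this corresponds to a single sigmoid output unit compiled with binary_crossentropy (a minimal sketch; the input shape and layer sizes are placeholders):

```python
import tensorflow as tf

# Binary classification sketch: a single sigmoid unit squashes the logit
# into (0, 1), and binary cross-entropy compares it with the 0/1 label.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # one output in (0, 1)
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```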

Non-Binary classification

Difference between Multi-class and Multi-label classification.

Multi-class classification assigns exactly one label to each sample, even if the sample contains multiple objects.
Multi-label classification can assign several labels to the same sample.

Multi-Class Single Label:

In this case, we can compute the loss using two different methods: Categorical Cross-Entropy and Sparse Categorical Cross-Entropy. We can explain them as follows:

  • categorical_crossentropy (cce) expects the targets as one-hot vectors, with a 1 at the index of the true category,
  • sparse_categorical_crossentropy (scce) expects the targets as integer indices of the true category (see the snippet below).
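The two target encodings carry the same information, and Keras can convert between them. A quick illustration (the class indices below are arbitrary):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Integer targets (what scce expects) vs. one-hot targets (what cce expects)
sparse_targets = np.array([1, 0, 3])  # class indices, 5 classes
onehot_targets = to_categorical(sparse_targets, num_classes=5)
print(onehot_targets)
# [[0. 1. 0. 0. 0.]
#  [1. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0.]]
```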

Consider a classification problem with 5 categories (or classes).

  • In the case of cce, the one-hot target may be [0, 1, 0, 0, 0] and the model may predict [.2, .5, .1, .1, .1] (probably right)
  • In the case of scce, the target is simply the index [1]; the model still predicts the same probability vector, and the loss only looks at the probability at that index (.5 here).

Consider now a classification problem with 3 classes.

  • In the case of cce, the one-hot target might be [0, 0, 1] and the model may predict [.5, .1, .4] (probably wrong, since it puts the most probability on the first class)
  • In the case of scce, the target is the index [2], and the loss looks at the predicted probability at that index (.4 here)

Many pipelines keep only the scce-style output (the winning index) because a single integer saves space, but this throws away a lot of information compared with the full probability vector (for example, in the 2nd example above, index 2, at .4, was very close to the winner at .5). I generally prefer the full cce-style output for model reliability.

There are, however, a number of situations where scce is the right choice, including:

  • when your classes are mutually exclusive, i.e. you don’t care at all about other close-enough predictions,
  • when the number of categories is so large that a one-hot encoding of the targets becomes unwieldy.

Also called Softmax Loss. It is a Softmax activation plus a Cross-Entropy loss.
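When the targets describe the same class, both losses compute exactly the same number; only the target encoding differs. A quick sanity check with tf.keras, reusing the 5-class example above:

```python
import tensorflow as tf

y_prob = [[0.2, 0.5, 0.1, 0.1, 0.1]]  # softmax output for one sample

cce = tf.keras.losses.CategoricalCrossentropy()
scce = tf.keras.losses.SparseCategoricalCrossentropy()

loss_cce = cce([[0., 1., 0., 0., 0.]], y_prob)  # one-hot target
loss_scce = scce([1], y_prob)                   # integer index target

print(float(loss_cce), float(loss_scce))  # both ≈ -log(0.5) ≈ 0.693
```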

Multi-Class Multiple Label:

For this case, the softmax loss needs a modification: the usual approach is to replace the softmax with one sigmoid per label and apply binary cross-entropy to each label independently, so that several labels can be active for the same sample.
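A minimal sketch of that setup in Keras (the number of labels and the input shape are placeholders):

```python
import tensorflow as tf

# Multi-label sketch: one independent sigmoid per label, with binary
# cross-entropy applied element-wise, so several labels can be "on" at once.
num_labels = 4  # hypothetical number of labels
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(num_labels, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# A target like [1, 0, 1, 0] marks labels 0 and 2 as both present.
```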

I hope you enjoyed this post :D
