Grad-CAM is a popular technique for visualizing where a convolutional neural network model is looking. Grad-CAM is class-specific, meaning it can produce a separate visualization for every class present in the image:

Example cat and dog Grad-CAM visualizations modified from Figure 1 of the Grad-CAM paper

Grad-CAM can be used for weakly-supervised localization, i.e. determining the location of particular objects using a model that was trained only on whole-image labels rather than explicit location annotations.

Grad-CAM can also be used for weakly-supervised segmentation, in which the model predicts all of the pixels that belong to particular objects, without requiring pixel-level labels for training:

Part of Figure 4 of the Grad-CAM paper showing predicted motorcycle and person segmentation masks obtained by using Grad-CAM heatmaps as the seed for a method called SEC (Seed, Expand, Constrain)

Finally, Grad-CAM can be used to gain better understanding of a model, for example by providing insight into model failure modes:

Figure 6 of the Grad-CAM paper, showing example model failures along with Grad-CAM visualizations that illustrate why the model made incorrect predictions.

The main reference for this post is the expanded version of the Grad-CAM paper: Selvaraju et al. “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization.” International Journal of Computer Vision 2019.

A previous version of the Grad-CAM paper was published in the International Conference on Computer Vision (ICCV) 2017.

Grad-CAM as Post-Hoc Attention

Grad-CAM is a form of post-hoc attention, meaning that it is a method for producing heatmaps that is applied to an already-trained neural network after training is complete and the parameters are fixed. This is distinct from trainable attention, which involves learning how to produce attention maps (heatmaps) during training by learning particular parameters. For a more in-depth discussion of post-hoc vs. trainable attention, see this post.

Grad-CAM as a Generalization of CAM

Grad-CAM does not require a particular CNN architecture. Grad-CAM is a generalization of CAM (class activation mapping), a method that does require using a particular architecture.

CAM requires an architecture that applies global average pooling (GAP) to the final convolutional feature maps, followed by a single fully connected layer that produces the predictions:

In the sketch above, the squares A1 (red), A2 (green), and A3 (blue) represent feature maps produced by the last convolutional layer of a CNN. To use the CAM method upon which Grad-CAM is based, we first take the average of each feature map to produce a single number per map. In this example we have 3 feature maps and therefore 3 numbers; the 3 numbers are shown as the tiny colored squares in the sketch. Then we apply a fully-connected layer to those 3 numbers to obtain classification decisions. For the output class “cat” the prediction will be based on 3 weights (w1, w2, and w3). To make a CAM heatmap for “cat”, we perform a weighted sum of the feature maps, using the “cat” weights of the final fully-connected layer:
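
The CAM computation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the feature maps and the fully connected weights are random stand-ins rather than values from a trained network, the sizes and class names are hypothetical, and the bias of the final layer is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: k=3 feature maps of height v=4 and width u=5, 2 classes.
k, v, u = 3, 4, 5
class_names = ["cat", "dog"]

# Random stand-in for the last convolutional layer's output A1, A2, A3.
feature_maps = rng.random((k, v, u))

# Random stand-in for the final fully connected layer: k weights per class.
fc_weights = rng.normal(size=(len(class_names), k))

# Global average pooling: one number per feature map.
pooled = feature_maps.mean(axis=(1, 2))                      # shape [k]

# Class scores from the fully connected layer (bias omitted for brevity).
scores = fc_weights @ pooled                                 # shape [num_classes]

# CAM for "cat": weighted sum of the feature maps using the "cat" weights.
cat_weights = fc_weights[class_names.index("cat")]           # w1, w2, w3
cat_cam = np.tensordot(cat_weights, feature_maps, axes=1)    # shape [v, u]
```

With the bias omitted, the spatial average of `cat_cam` equals the “cat” score, which is one way to see that the heatmap spreads the class score over image locations.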

Note that the number of feature maps doesn’t have to be three – it can be any arbitrary k. For a more detailed explanation of how CAM works, please see this post. Understanding CAM is important for understanding Grad-CAM, as the two methods are closely related.

Part of the motivation for the development of Grad-CAM was to come up with a CAM-like method that does not restrict the CNN architecture.

Grad-CAM Overview

The basic idea behind Grad-CAM is the same as the basic idea behind CAM: we want to exploit the spatial information that is preserved through convolutional layers, in order to understand which parts of an input image were important for a classification decision.

Similar to CAM, Grad-CAM uses the feature maps produced by the last convolutional layer of a CNN. The authors of Grad-CAM argue, “we can expect the last convolutional layers to have the best compromise between high-level semantics and detailed spatial information.”

Here is a sketch showing the parts of a neural network model relevant to Grad-CAM:

The CNN is composed of some convolutional layers (shown as “conv” in the sketch). The feature maps produced by the final convolutional layer are shown as A1, A2, and A3, the same as in the CAM sketch.

At this point, for CAM we would need to do global average pooling followed by a fully connected layer. For Grad-CAM, we can do anything – for example, multiple fully connected layers – which is shown as “any neural network layers” in the sketch. The only requirement is that the layers we insert after A1, A2, and A3 have to be differentiable so that we can get a gradient. Finally, we have our classification outputs for airplane, dog, cat, person, etc.

The difference between CAM and Grad-CAM is in how the feature maps A1, A2, and A3 are weighted to make the final heatmap. In CAM, we weight these feature maps using weights taken out of the last fully-connected layer of the network. In Grad-CAM, we weight the feature maps using “alpha values” that are calculated based on gradients. Therefore, Grad-CAM does not require a particular architecture, because we can calculate gradients through any kind of neural network layer we want. The “Grad” in Grad-CAM stands for “gradient.”
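
One way to see how closely the two weightings are related is to run the gradient-based recipe on a CAM-style architecture (global average pooling followed by one fully connected layer). In that special case, averaging the gradients over height and width should recover the fully connected weights up to a constant factor of 1/(u·v). The sketch below checks this numerically with finite differences. The feature maps and weights are random stand-ins, and the names `cat_score` and `cat_weights` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
k, v, u = 3, 4, 5
feature_maps = rng.random((k, v, u))      # random stand-in for A1, A2, A3
cat_weights = rng.normal(size=k)          # hypothetical "cat" weights w1, w2, w3

def cat_score(A):
    """CAM-style head: global average pooling, then one fully connected layer."""
    return cat_weights @ A.mean(axis=(1, 2))

# Numerical gradient of the "cat" score with respect to every feature map entry.
eps = 1e-6
base_score = cat_score(feature_maps)
gradients = np.zeros_like(feature_maps)
for index in np.ndindex(*feature_maps.shape):
    bumped = feature_maps.copy()
    bumped[index] += eps
    gradients[index] = (cat_score(bumped) - base_score) / eps

# Grad-CAM-style weights: average each gradient map over height and width.
alphas = gradients.mean(axis=(1, 2))

# On this architecture the alphas match the CAM weights divided by u * v.
matches = np.allclose(alphas, cat_weights / (v * u), atol=1e-6)
```

A real implementation would use automatic differentiation rather than finite differences. The loop is used here only to keep the example free of framework dependencies.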

The output of Grad-CAM is a “class-discriminative localization map”, i.e. a heatmap where the hot part corresponds to a particular class:

If there are 10 possible output classes, then for a particular input image, you can make 10 different Grad-CAM heatmaps, one heatmap for each class.

Grad-CAM Details

First, a bit of notation:

In other words, y^c is the raw output of the neural network for class c, before the softmax is applied to transform the raw score into a probability.
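
As a small illustration of the difference between the raw score and the probability, here is a sketch with made-up numbers. The class names and score values are hypothetical, chosen only to show which quantity is which.

```python
import numpy as np

# Hypothetical raw scores (logits) for three classes; y^c is one entry of this vector.
class_names = ["airplane", "dog", "cat"]
logits = np.array([1.0, 3.0, 0.5])

# Softmax turns raw scores into probabilities
# (subtracting the max first is a standard numerical-stability trick).
shifted = logits - logits.max()
probabilities = np.exp(shifted) / np.exp(shifted).sum()

c = class_names.index("dog")
y_c = logits[c]           # raw score for the class of interest, before softmax
p_c = probabilities[c]    # probability after softmax
```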

Grad-CAM is applied to a neural network that has finished training, so the weights of the neural network are fixed. We feed an image into the network to calculate the Grad-CAM heatmap for that image for a chosen class of interest.

Grad-CAM has three steps:
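
As a compact reference, the three steps can be written roughly as follows. The equations follow the Grad-CAM paper. The symbols i and j (row and column positions within a feature map), Z (the number of spatial positions), and L (the final heatmap) come from the paper and are not otherwise used in this post. Note that this post uses k both as the number of feature maps and as the index of a single map.

```latex
\begin{align}
\text{Step 1:}\quad & \frac{\partial y^c}{\partial A^k_{ij}} \\
\text{Step 2:}\quad & \alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}}, \qquad Z = u \cdot v \\
\text{Step 3:}\quad & L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c A^k \right)
\end{align}
```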

Step 1: Compute Gradient

The particular value of the gradient calculated in this step depends on the input image chosen, because the input image determines the feature maps A^k as well as the final class score y^c that is produced.

For a 2D input image, this gradient is 3D, with the same shape as the feature maps. There are k feature maps each of height v and width u, i.e. collectively the feature maps have shape [k, v, u]. This means that the gradients calculated in Step 1 are also going to be of shape [k, v, u].

In the sketch below, k=3 so there are three u x v feature maps and three u x v gradients:
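
To make the shapes concrete, here is a minimal NumPy sketch of Step 1 for a toy network head. Everything in it is an assumption for illustration: the feature maps and weights are random, the head (flatten, linear, ReLU, linear) is hypothetical, and the gradient is backpropagated by hand. In practice, a deep learning framework's automatic differentiation would compute this gradient for whatever layers follow the feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)
k, v, u = 3, 4, 5                     # three feature maps of height 4 and width 5
num_hidden, num_classes = 8, 4

feature_maps = rng.random((k, v, u))                 # random stand-in for A^k
W1 = rng.normal(size=(num_hidden, k * v * u))        # first fully connected layer
W2 = rng.normal(size=(num_classes, num_hidden))      # second fully connected layer

def class_scores(A):
    """Toy head: flatten -> linear -> ReLU -> linear. Returns the raw scores y."""
    hidden_pre = W1 @ A.reshape(-1)
    return W2 @ np.maximum(hidden_pre, 0.0)

def gradient_of_score(A, c):
    """Backpropagate y^c through the toy head; the result has the shape of A."""
    hidden_pre = W1 @ A.reshape(-1)
    grad_hidden_pre = W2[c] * (hidden_pre > 0)           # chain rule through the ReLU
    return (W1.T @ grad_hidden_pre).reshape(A.shape)     # chain rule through the linear layer

c = 2                                 # hypothetical class of interest
gradients = gradient_of_score(feature_maps, c)
print(gradients.shape)                # (3, 4, 5)
```

Changing `feature_maps` (i.e. feeding a different image) or changing `c` gives a different gradient, which is why the heatmap is specific to both the image and the class.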

Step 2: Calculate Alphas by Averaging Gradients

In this step, we calculate the alpha values. The alpha value for class c and feature map k is going to be used in the next step as a weight applied to the feature map A^k. (In CAM, the weight applied to the feature map A^k is the weight w_k in the final fully connected layer.)

Recall that our gradients have shape [k, v, u]. We apply global average pooling over the height v and the width u, so we end up with something of shape [k, 1, 1] or, to simplify, just [k]. These are our k alpha values.
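
The averaging in this step is a one-liner in NumPy. In the sketch below, the gradients are random stand-ins for the output of Step 1, so only the shapes and the averaging are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
k, v, u = 3, 4, 5

# Random stand-in for the Step 1 gradients, shape [k, v, u].
gradients = rng.normal(size=(k, v, u))

# Average over height and width: one alpha value per feature map.
alphas = gradients.mean(axis=(1, 2))                         # shape [k]

# Keeping the singleton axes gives the equivalent [k, 1, 1] form.
alphas_kept = gradients.mean(axis=(1, 2), keepdims=True)     # shape [k, 1, 1]
```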

Step 3: Calculate Final Grad-CAM Heatmap

Now that we have our alpha values, we use each alpha value as the weight of the corresponding feature map, and calculate a weighted sum of feature maps as the final Grad-CAM heatmap. We then apply a ReLU operation to emphasize only the positive values and turn all the negative values into 0.
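
Steps 2 and 3 together can be sketched as a small function. The name `grad_cam_heatmap` is hypothetical, and the inputs are random stand-ins for the feature maps and the Step 1 gradients of a real network.

```python
import numpy as np

rng = np.random.default_rng(0)
k, v, u = 3, 4, 5
feature_maps = rng.random((k, v, u))        # random stand-in for A^k
gradients = rng.normal(size=(k, v, u))      # random stand-in for the Step 1 gradients

def grad_cam_heatmap(feature_maps, gradients):
    """Weight each feature map by its alpha, sum over maps, then apply ReLU."""
    alphas = gradients.mean(axis=(1, 2))                          # Step 2, shape [k]
    weighted_sum = np.tensordot(alphas, feature_maps, axes=1)     # shape [v, u]
    return np.maximum(weighted_sum, 0.0)                          # ReLU zeroes the negatives

heatmap = grad_cam_heatmap(feature_maps, gradients)
```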

Won’t the Grad-CAM Heatmap Be Too Small?

The Grad-CAM heatmap is size u x v, which is the size of the final convolutional feature map:

You may wonder how this makes sense, since in most CNNs the final convolutional features are quite a bit smaller in width and height than the original input image.

It turns out it is okay if the u x v Grad-CAM heatmap is a lot smaller than the original input image size. All we need to do is up-sample the tiny u x v heatmap to match the size of the original image before we make the final visualization.

For example, here is a small 12 x 12 heatmap:

Now, here is the same heatmap upsampled to 420 x 420 using OpenCV (the Python package cv2):

The code to visualize the original small low-resolution heatmap and turn it into a big high-resolution heatmap is here:

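import cv2
import matplotlib.pyplot as plt

# CalculateGradCAM is a placeholder for your own Grad-CAM function. It should
# return a 2D NumPy array of floats, e.g. of shape (12, 12).
# "class" is a reserved word in Python, so the argument is named target_class.
small_heatmap = CalculateGradCAM(target_class='cat')

# Show the original low-resolution heatmap
plt.figure()
plt.imshow(small_heatmap, cmap='rainbow')

# Upsample the small_heatmap into a big_heatmap with cv2
# (dsize is given as (width, height)):
big_heatmap = cv2.resize(small_heatmap, dsize=(420, 420),
                         interpolation=cv2.INTER_CUBIC)

# Show the upsampled heatmap in a separate figure
plt.figure()
plt.imshow(big_heatmap, cmap='rainbow')
plt.show()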

Grad-CAM Implementation

A PyTorch implementation of Grad-CAM is available here.

More Grad-CAM Examples

Grad-CAM has been applied in numerous research areas and is particularly popular in medical imaging. Here are a few examples:

Yang et al. “Visual Explanations From Deep 3D Convolutional Neural Networks for Alzheimer’s Disease Classification”

Top row is CAM, bottom row is Grad-CAM. Kim et al. “Visual Interpretation of Convolutional Neural Network Predictions in Classifying Medical Image Modalities.”

Grad-CAM visualizations from Woo et al. “CBAM: Convolutional Block Attention Module.” This paper is an example of a trainable attention mechanism (CBAM) combined with a post-hoc attention mechanism for visualization (Grad-CAM).

Caveat: Explainability is Not Interpretability. Any Post-Hoc Attention Mechanism May Not Be Optimal for High-Stakes Decisions

“Explainability” is not the same as “interpretability.”

“Explainability” means that it’s possible to explain how a model made its decision, although the explanation is not guaranteed to make sense to humans, and the explanation is also not constrained to follow any known rules of the natural world. For example, a model may “explain” a boat classification by highlighting the water, or a model may “explain” a “severely ill” classification by highlighting a label within a medical image that indicates that the image was taken while the patient was lying down. The explanation is also not guaranteed to be fair or free from biases.

“Interpretability” means that a model has been designed from the beginning to produce a human-understandable relationship between the inputs and the outputs. For example, logistic regression is an interpretable model, in which the design of the model results in weights that show which inputs contribute more or less to the final prediction. Rule-based methods are also interpretable.

Grad-CAM is a technique for “explainability,” meaning that it is meant to explain what a trained CNN did. Grad-CAM does not make a model “interpretable.” While Grad-CAM heatmaps often make sense, they aren’t required to make sense, and they must be used carefully – especially for sensitive applications like medical image interpretation, an area in which Grad-CAM is particularly popular.

If you are working on weakly-supervised localization or weakly-supervised segmentation, Grad-CAM is definitely a useful method. If you are interested in “debugging” a model and gaining more insight into why the model is making certain mistakes, Grad-CAM is also useful. If you are working on an application with sensitive data used for real world, high-stakes decisions, any post-hoc attention mechanism (i.e. any method for making heatmaps that is “tacked on” after a network has been trained) including Grad-CAM is potentially inappropriate, depending on how it is going to be used. If you are interested in interpretable machine learning models, I recommend this excellent paper: Cynthia Rudin “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.”

Caveat: Use Vanilla Grad-CAM, Not Guided Grad-CAM

As another caveat to be aware of, the Grad-CAM paper mentions a variant of Grad-CAM called “Guided Grad-CAM” which combines Grad-CAM with another CNN heatmap visualization technique called “guided backpropagation.” I discuss guided backpropagation in this post and this post. The short summary is that recent work by Adebayo et al. and Nie et al. suggests that guided backpropagation is performing partial image recovery and acting like an edge detector, rather than providing insight into a trained model. Therefore, it is best not to use guided backpropagation.

The good news is that the vanilla Grad-CAM method discussed in this post (i.e., Grad-CAM without guided backpropagation) passes Adebayo et al.’s sanity checks and is a great option to use.

Summary

  • Grad-CAM is a popular technique for creating a class-specific heatmap based on a particular input image, a trained CNN, and a chosen class of interest.
  • Grad-CAM is closely related to CAM.
  • Grad-CAM is compatible with any CNN architecture as long as the layers are differentiable.
  • Grad-CAM can be used for understanding a model’s predictions, weakly-supervised localization, or weakly-supervised segmentation.
  • Grad-CAM is a method for explainability, not interpretability, and therefore should be used with caution in any sensitive domain.
  • Vanilla Grad-CAM is a better choice than Guided Grad-CAM.

About the Featured Image

The featured image is modified from Figures 1 and 20 of the Grad-CAM paper.