Where is the CNN looking? – Grad-CAM

Do you think that when you train a Convolutional Neural Network (CNN) to classify images, it understands the image the way we humans perceive it? It’s difficult to answer, because deep learning models are often considered black boxes: we feed in the data and get the output, and whatever happens in between is very difficult to debug. Even when the predictions are accurate, it does not follow that the model perceives images the same way we do.

Why does this happen?

Imagine a situation where you want to classify between elephants and penguins (I know, it’s a fairly easy task). You got the data, trained a CNN on the images, and deployed it. The model might perform well on most of the data, but there is still a chance of misclassification. One might dismiss such failures as corner cases. But what if the model fails even when the object is clearly distinguishable?

In such a dataset, it is very common to find plants/trees in the images of elephants and snow/water in the images of penguins. So it may well be that the model has learned to distinguish between shades of green and white rather than actually learning to recognize the objects.

In that case, simple image processing techniques like color channel statistics perform on par with the trained model. There is no real intelligence: the model is simply distinguishing between colors. Now you may ask, how do we find out where exactly the CNN is looking? The answer is Grad-CAM.
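To make the "color channel statistics" baseline concrete, here is a toy sketch (the threshold, rule, and labels are entirely hypothetical, for illustration only) of a "classifier" that looks at nothing but how green the image is:

```python
import numpy as np

def green_fraction(img):
    """Fraction of pixels where the green channel dominates.
    img is an HxWx3 RGB array."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    green_dominant = (g > r) & (g > b)
    return green_dominant.mean()

def naive_classify(img, threshold=0.5):
    # Hypothetical rule: lots of green -> "elephant" (jungle background),
    # otherwise -> "penguin" (snow/water background). No object is ever looked at.
    return "elephant" if green_fraction(img) > threshold else "penguin"
```

If a CNN's decision boundary is effectively this, it has learned the background, not the animal.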


Gradient-weighted Class Activation Mapping (Grad-CAM) is the technique we implement in this blog post. First, note that it is not the only such technique out there. The authors say,

Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say logits for ‘dog’ or even a caption), flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.

So, in simple terms: we take the final convolutional feature map and weigh every channel in it by the gradient of the class score with respect to that channel. In other words, we measure how intensely the input image activates each channel, weighted by how important that channel is for the class. The best part is that it requires no re-training or change to the existing architecture.

$$\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A_{ij}^k} \qquad\qquad S_{ij}^c = \sum_k \alpha_k^c \, A_{ij}^k$$

The spatial score map for a specific class, $S_{ij}^c$, is computed in two steps. First, for every channel $k$, we global-average-pool the gradient of the class output $y^c$ with respect to the feature map $A_{ij}^k$ over the two spatial dimensions $i$ and $j$; this gives a weight $\alpha_k^c$ measuring the importance of channel $k$ for class $c$. Then we multiply each channel of the feature map by its weight and sum over the channel axis $k$, which leaves a spatial score map of size $i \times j$.
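The pooling-and-weighting arithmetic above can be sketched in plain NumPy, with random arrays standing in for a real network's feature map and gradients (the shapes match VGG16's block5_conv3, used later in the post):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((14, 14, 512))      # feature map A^k
dy_dA = rng.standard_normal((14, 14, 512))  # gradients dy^c / dA^k (random stand-in)

# Step 1: global-average-pool the gradients over the spatial axes -> one weight per channel
alpha = dy_dA.mean(axis=(0, 1))             # shape (512,)

# Step 2: weight each channel of the feature map and sum over channels, then ReLU
score_map = np.maximum((A * alpha).sum(axis=-1), 0)
print(score_map.shape)  # (14, 14)
```

With a real network the only difference is where `dy_dA` comes from: it is computed by backpropagation, as in the Keras code below.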

A ReLU activation is then applied to the score map and the result is normalized, so that only regions with a positive influence on the class are highlighted.


For the purpose of this blog post, let’s take a pre-trained VGG model and start implementing by importing necessary packages.

from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image
import keras.backend as K
import numpy as np
import cv2
import sys

We use the VGG16 model that ships with Keras, and load a few helper functions required for loading and preprocessing our image.

model = VGG16(weights="imagenet")
img_path = sys.argv[1]
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

Let’s initialize our model and load the image from the command line argument. The VGG network expects an input of size $(224 \times 224 \times 3)$, so we resize our image to the required size. Since we are passing only one image through the network, we need to expand the first dimension, marking it as a batch of size $1$. We then normalize the image by subtracting the mean RGB values, using the helper function preprocess_input.

preds = model.predict(x)
class_idx = np.argmax(preds[0])
class_output = model.output[:, class_idx]
last_conv_layer = model.get_layer("block5_conv3")

Here, let’s see the map for the top prediction. We get the predictions for the image and take the top class index. Remember that we can compute the map for any class. Then we take the output of the final convolutional layer in VGG16, which is block5_conv3. The resulting feature map is of shape $14 \times 14 \times 512$.
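On the point that we can compute the map for any class: `class_idx` is just an integer, so instead of taking the argmax we can set it to any class we want to visualize. A toy sketch with a dummy prediction vector (indices 385/386 are the ImageNet elephant classes; the probabilities here are made up for illustration):

```python
import numpy as np

# Dummy stand-in for model.predict(x): a (1, 1000) probability-like vector.
preds = np.zeros((1, 1000))
preds[0, 386] = 0.9  # "African_elephant" in the ImageNet class index
preds[0, 385] = 0.1  # "Indian_elephant"

class_idx = int(np.argmax(preds[0]))  # top prediction, as in the post
print(class_idx)  # 386

# To visualize a different class, just pick its index directly:
other_idx = 385  # the map would then show regions supporting "Indian_elephant"
```

This is useful for checking what evidence the network sees for a runner-up class, not just the winner.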

grads = K.gradients(class_output, last_conv_layer.output)[0]
pooled_grads = K.mean(grads, axis=(0, 1, 2))
iterate = K.function([model.input], [pooled_grads, last_conv_layer.output[0]])

pooled_grads_value, conv_layer_output_value = iterate([x])

for i in range(512):
    conv_layer_output_value[:, :, i] *= pooled_grads_value[i]

As explained above, we compute the gradient of the class output value with respect to the feature map. Then we pool the gradients over all axes except the channel dimension. Finally, we weight each channel of the output feature map by its pooled gradient value.
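Note that K.gradients only works with the old graph-mode Keras backend; under TensorFlow 2's eager execution it raises an error. The same computation can be sketched with tf.GradientTape instead. The snippet below uses a tiny stand-in model (not VGG16) so it runs quickly; with the real network you would build `grad_model` from VGG16 and its "block5_conv3" layer:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model; replace with VGG16 and "block5_conv3" in practice.
inp = tf.keras.Input(shape=(8, 8, 3))
feat = tf.keras.layers.Conv2D(4, 3, activation="relu", name="conv")(inp)
out = tf.keras.layers.Dense(10)(tf.keras.layers.GlobalAveragePooling2D()(feat))
model = tf.keras.Model(inp, out)

# A model that returns both the conv feature map and the final predictions.
grad_model = tf.keras.Model(model.input,
                            [model.get_layer("conv").output, model.output])

x = np.random.rand(1, 8, 8, 3).astype("float32")
with tf.GradientTape() as tape:
    conv_out, preds = grad_model(x)
    class_idx = int(tf.argmax(preds[0]))
    class_output = preds[:, class_idx]

grads = tape.gradient(class_output, conv_out)         # d(class score)/d(feature map)
pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))  # one weight per channel
heatmap = tf.reduce_sum(conv_out[0] * pooled_grads, axis=-1)
heatmap = tf.maximum(heatmap, 0) / (tf.reduce_max(heatmap) + 1e-8)
print(heatmap.shape)  # (6, 6) for this toy model; (14, 14) for VGG16
```

The rest of the post (heatmap resizing and blending) is unchanged either way.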

heatmap = np.mean(conv_layer_output_value, axis=-1)
heatmap = np.maximum(heatmap, 0)
heatmap /= np.max(heatmap)

We then average the weighted feature map along the channel dimension, resulting in a heat map of size $14 \times 14$. We keep only the positive values (ReLU) and normalize the heat map so its values lie between $0$ and $1$.

img = cv2.imread(img_path)
heatmap = cv2.resize(heatmap, (img.shape[1], img.shape[0]))
heatmap = np.uint8(255 * heatmap)
heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
superimposed_img = cv2.addWeighted(img, 0.6, heatmap, 0.4, 0)
cv2.imshow("Original", img)
cv2.imshow("GradCam", superimposed_img)
cv2.waitKey(0)  # without this the windows close immediately

Finally, we use OpenCV to read the image and resize the heatmap to the image size. We apply a color map to the heatmap and blend it with the original image to superimpose the two. An example is shown below.

From the above image, it is clearly visible where the CNN is looking in order to distinguish between the classes.

This technique is not only useful for localization; it is also used for Visual Question Answering, image captioning, etc., as mentioned in the paper itself.

Also, it is very helpful for debugging the data requirements of an accurate model. Though hyperparameter tuning has little to do with this, we can make the model generalize better with extra data and data augmentation techniques.
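Augmentation here just means training on randomly perturbed copies of each image so the network cannot latch onto background statistics alone. Libraries like Keras’s ImageDataGenerator provide this out of the box; a minimal NumPy sketch of the idea (the particular flip/shift choices are arbitrary) looks like:

```python
import numpy as np

def augment(img, rng):
    """Minimal augmentation sketch: random horizontal flip and a small shift."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]          # horizontal flip
    shift = int(rng.integers(-10, 11))
    img = np.roll(img, shift, axis=1)  # crude horizontal shift
    return img

rng = np.random.default_rng(0)
sample = augment(np.zeros((224, 224, 3)), rng)  # shape is preserved: (224, 224, 3)
```

Re-running Grad-CAM after training with augmentation is a quick way to check whether the model has started attending to the object rather than its surroundings.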


  • conjuring

    Sorry for this noob question but I’m a bit stuck at this line ->

    class_output = model.output[:, class_idx]

    if I understood this line correctly then changing the above to

    class_output = model.output[]

      should return me the “output” of the last layer created, but if that is the case, then how is model.predict() different from model.output[]?