Image segmentation with Deep learning

Difficult, isn’t it? Image segmentation is a challenging but very exciting problem to solve, and one of the most important applications of computer vision. It powers industrial applications in areas such as medical imaging and aerial imagery. In this series of posts, you will learn how to build deep learning solutions to this problem.

image segmentation

Unlike object detection, which draws a bounding box around each object in an image and classifies it, segmentation assigns a class label to every pixel in the image. Accuracy is critical because the boundaries of each object must be classified precisely, which makes the task difficult.

In the above image, you can see how each pixel is classified. Since every pixel in the image needs to be assigned to one of the available object classes, segmentation involves heavy computation and complex components.

Traditional image processing methods help up to a point, but as the input images get more complicated it becomes very difficult to design reliable heuristics. So what do we do now? Well, deep learning comes into the picture.


Image (semantic) segmentation is a completely different problem from classification or object detection. In classification we predict a single class for the whole input image, and in object detection we predict box coordinates, whereas segmentation has to output a class label for each and every pixel in the image.

So it’s not practical to use an affine (fully connected) output layer, for the following reasons:

  1. It requires a number of parameters that grows with the size of the image. For example, if the input image is of size $(w, h)$, then we need $w \times h$ nodes in the final layer, and the last layer alone has $\omega_{n-1} \times (w \times h)$ weights, where $\omega_{n-1}$ is the number of nodes in layer $n-1$ and $n$ is the number of layers in the network.
  2. It is therefore very computationally expensive.
  3. The spatial and semantic relationships between groups of pixels or regions of the image are not preserved.
  4. It is sensitive to translation and rotation: each input position has its own independent weights, so the same object at a different position or orientation is handled by entirely different weights. This makes it unsuitable for image segmentation tasks.
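To get a feel for the first point, the count below compares an affine output layer with a single convolutional layer. The sizes used (4096 hidden nodes, a 224 x 224 image, 64 channels) are assumptions chosen for illustration, not values taken from a particular network.

```python
def dense_output_weights(prev_nodes, width, height):
    """Weights in an affine layer mapping prev_nodes inputs to one output node per pixel."""
    return prev_nodes * width * height


def conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a convolutional layer; the count does not depend on the image size."""
    return kernel_size * kernel_size * in_channels * out_channels


# Assumed sizes, for illustration only.
dense = dense_output_weights(prev_nodes=4096, width=224, height=224)
conv = conv_weights(kernel_size=3, in_channels=64, out_channels=64)

print(f"affine output layer: {dense:,} weights")  # 205,520,896
print(f"3x3 convolution:     {conv:,} weights")   # 36,864
```

The affine count also ignores that each pixel really needs one score per class, which would multiply it further.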

For these reasons, we use an encoder-decoder architecture to solve this problem. Convolution and pooling layers encode the input image into a lower-dimensional feature space, and convolution and unpooling layers then decode that representation back to the size of the input image.

Therefore, given an input image, we output a mask of the same size as the input, in which each pixel is assigned a class.
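As a small sketch of that last step, a network typically produces one score map per class, and the mask is obtained by picking the highest-scoring class at every pixel. The nested-list layout `scores[class][row][col]` and the helper name `scores_to_mask` are assumptions made for this example.

```python
def scores_to_mask(scores):
    """Turn per-class score maps scores[c][y][x] into a mask[y][x] of class indices."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Two hypothetical classes (0 = background, 1 = object) on a 2 x 3 image.
scores = [
    [[0.9, 0.2, 0.1], [0.8, 0.7, 0.3]],  # scores for class 0
    [[0.1, 0.8, 0.9], [0.2, 0.3, 0.7]],  # scores for class 1
]
mask = scores_to_mask(scores)
print(mask)  # [[0, 1, 1], [0, 0, 1]]
```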


This has been an active research area for many years. Let’s look at some of the published architectures and review them briefly.

Fully Convolutional Networks for Semantic Segmentation

This is one of the first papers to train a convolutional network end to end for semantic segmentation. It replaces the fully connected final layers with convolutions, so the network accepts inputs of variable size and predicts a map of classification scores instead of a single vector. It builds on pre-trained networks such as AlexNet, VGG, and GoogLeNet, and adds a convolutional layer with kernel size $1 \times 1$ and 21 output channels on top (the 20 PASCAL VOC classes plus background). A deconvolution layer then bilinearly upsamples the coarse outputs to the desired size for dense pixel-wise predictions.
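A $1 \times 1$ convolution is simply the same linear classifier applied independently at every spatial position. The sketch below illustrates this with plain nested lists; the layout `features[channel][row][col]` and the function name `conv1x1` are assumptions for this example, and it uses 3 output channels instead of 21 to keep the numbers readable.

```python
def conv1x1(features, weights, bias):
    """Apply a 1x1 convolution.

    features: [in_channels][height][width]
    weights:  [out_channels][in_channels]
    bias:     [out_channels]
    returns:  [out_channels][height][width]
    """
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                b + sum(w * features[k][y][x] for k, w in enumerate(w_row))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for w_row, b in zip(weights, bias)
    ]


# Hypothetical 2-channel feature map of size 1 x 2, scored into 3 classes.
features = [[[1, 2]], [[3, 4]]]
weights = [[1, 0], [0, 1], [1, 1]]
bias = [0, 0, 0.5]
class_scores = conv1x1(features, weights, bias)
print(class_scores)  # [[[1, 2]], [[3, 4]], [[4.5, 6.5]]]
```

Because the weights do not depend on the position, the same layer works for any input height and width.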

Furthermore, it introduced three variants of the network: FCN-32s, FCN-16s, and FCN-8s. FCN-32s upsamples the final output directly, while FCN-16s and FCN-8s combine the output of the final convolutional layer with the outputs of earlier pooling layers. Since the feature map sizes don’t match, the final output is first upsampled to match the earlier layer, and the two are then summed.
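The upsample-then-sum step can be sketched as follows. This is a simplification under stated assumptions: it works on a single score channel and uses nearest-neighbour 2x upsampling in place of the learned deconvolution, and the function names are made up for this example.

```python
def upsample_nearest_2x(feature_map):
    """Double the height and width of a 2D map by repeating every value."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(coarse_scores, fine_scores):
    """Upsample the coarse scores to the finer resolution and sum element-wise."""
    upsampled = upsample_nearest_2x(coarse_scores)
    return [
        [a + b for a, b in zip(up_row, fine_row)]
        for up_row, fine_row in zip(upsampled, fine_scores)
    ]


coarse = [[1, 2], [3, 4]]               # scores from the final layer (2 x 2)
fine = [[0.5] * 4 for _ in range(4)]    # scores from an earlier layer (4 x 4)
fused = fuse(coarse, fine)
for row in fused:
    print(row)
```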

The upsampling (deconvolution) filters are initialized to bilinear interpolation but have parameters, so they can be learned as gradients are backpropagated through them. A pixel-wise loss is used, and FCN-8s showed the highest accuracy, followed by FCN-16s and FCN-32s. FCN-8s most likely owes its accuracy to the finer-grained information that the earlier layers pass on to the final prediction.
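A common way to build such a bilinear filter for a deconvolution layer with a given upsampling factor is sketched below. The function name `bilinear_kernel` is an assumption for this example; a real implementation would copy the resulting 2D kernel into the layer's weight tensor.

```python
def bilinear_kernel(factor):
    """Return a 2D bilinear interpolation kernel for upsampling by `factor`."""
    size = 2 * factor - factor % 2
    # The kernel centre sits on a pixel for odd sizes and between pixels for even sizes.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    one_d = [1 - abs(i - center) / factor for i in range(size)]
    # The 2D kernel is the outer product of the 1D triangle filter with itself.
    return [[a * b for b in one_d] for a in one_d]


kernel = bilinear_kernel(2)
for row in kernel:
    print(row)
# First row for factor 2: [0.0625, 0.1875, 0.1875, 0.0625]
```

For factor 2 the kernel is 4 x 4 and its entries sum to 4, so each upsampled output pixel receives interpolation weights that sum to 1.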

Learning Deconvolution Network for Semantic Segmentation

It is also known as DeconvNet. The primary difference between FCN and DeconvNet is that FCN has almost no decoder: its final layer is upsampled to the input size almost immediately. DeconvNet instead uses a series of unpooling and deconvolution layers in the decoder block to upsample the feature maps to the actual image size in several steps.

It also performs instance-wise segmentation: object proposals are extracted, each proposed image region is passed through the network, and the outputs are aggregated to form the segmentation of the whole image.

Here, the convolution network (encoder) is a pre-trained VGG-16 with its final classification layer removed. Its 13 convolutional layers reduce the spatial size of the input image while increasing the number of channels. The deconvolution network (decoder) reconstructs the segmentation map from this compact representation by applying unpooling and deconvolution in several steps.

Unpooling can be viewed as the reverse of pooling: it places each activation back at the location of the maximum recorded during pooling, restoring activation maps of the original size. It therefore enlarges the input map, but the resulting activation maps are sparse. The deconvolution operation fixes this issue by converting the sparse activation map into a dense one. The filters learned in the deconvolution layers form the bases for object reconstruction.
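The pooling and unpooling pair can be sketched on a single 2D map as follows. This is a minimal illustration that assumes 2 x 2 windows with stride 2 and even input sizes; the function names are made up for this example.

```python
def max_pool_2x2(x):
    """2x2 max pooling that also records where each maximum came from."""
    pooled, switches = [], []
    for i in range(0, len(x), 2):
        pooled_row, switch_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj)) for di in (0, 1) for dj in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, height, width):
    """Place every pooled value back at its recorded position; all other entries stay 0."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (row, col) in zip(pooled_row, switch_row):
            out[row][col] = value
    return out


x = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [2, 6, 3, 4],
]
pooled, switches = max_pool_2x2(x)
restored = unpool(pooled, switches, 4, 4)
print(pooled)    # [[4, 5], [6, 7]]
for row in restored:
    print(row)   # mostly zeros: the sparse map that deconvolution later densifies
```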

It basically implements two-stage training.

  1. The first stage trains on easy examples, created by cropping object instances using the ground-truth segmentations so that each object is centered.
  2. The second stage trains on harder examples, built from object proposals that sufficiently overlap the ground-truth segmentations.

DeconvNet captures the fine details of an object, while FCN is good at extracting its overall shape. The final model is therefore an ensemble of DeconvNet and FCN, which increases accuracy.

U-Net: Convolutional Networks for Biomedical Image Segmentation

U-Net is a very effective network for image segmentation. It implements an encoder-decoder architecture in which feature maps from the encoder are fed into the decoder block at several stages. It achieves precise segmentation (good accuracy) without needing a huge amount of training data.

As you can see in the above image, the encoder decreases the spatial size of the input maps while increasing the number of channels. The decoder uses up-convolution operations with a decreasing number of filters, so each step increases the spatial dimensions of the feature map and halves its number of channels. The (cropped) feature maps from the encoder path are fed into the corresponding decoder stages, where the two feature maps are concatenated along the channel dimension.
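The crop-and-concatenate step can be sketched like this. The nested-list layout `feature[channel][row][col]` and the helper names are assumptions for this example; the tiny sizes only illustrate the mechanics.

```python
def center_crop(feature, target_h, target_w):
    """Crop every channel of feature[c][y][x] to the central target_h x target_w region."""
    height, width = len(feature[0]), len(feature[0][0])
    top = (height - target_h) // 2
    left = (width - target_w) // 2
    return [
        [row[left:left + target_w] for row in channel[top:top + target_h]]
        for channel in feature
    ]


def skip_concat(encoder_feature, decoder_feature):
    """Crop the encoder map to the decoder's spatial size, then stack along channels."""
    target_h, target_w = len(decoder_feature[0]), len(decoder_feature[0][0])
    return center_crop(encoder_feature, target_h, target_w) + decoder_feature


# Hypothetical maps: one 4 x 4 encoder channel and one 2 x 2 decoder channel.
encoder = [[[r * 4 + c for c in range(4)] for r in range(4)]]
decoder = [[[100, 101], [102, 103]]]
merged = skip_concat(encoder, decoder)
print(merged)  # [[[5, 6], [9, 10]], [[100, 101], [102, 103]]]
```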

Note that in the actual implementation the output segmentation map is somewhat smaller than the input, because the network uses unpadded convolutions. It is trained with a cross-entropy loss computed over a pixel-wise softmax of the output maps.
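The size reduction can be worked out by following one spatial dimension through the network. The sketch below assumes the layout described in the U-Net paper: four downsampling steps, each with two unpadded 3 x 3 convolutions followed by 2 x 2 max pooling, a bottleneck with two more convolutions, and four upsampling steps, each with a 2 x 2 up-convolution followed by two unpadded 3 x 3 convolutions. The function name is made up for this example.

```python
def unet_output_size(input_size, depth=4):
    """Trace one spatial dimension through a U-Net built from unpadded 3x3 convolutions."""
    size = input_size
    for _ in range(depth):
        size -= 4                  # two unpadded 3x3 convolutions trim 2 pixels each
        if size <= 0 or size % 2:
            raise ValueError("input size does not pool evenly at every level")
        size //= 2                 # 2x2 max pooling halves the size
    size -= 4                      # two convolutions at the bottleneck
    for _ in range(depth):
        size = size * 2 - 4        # 2x2 up-convolution doubles, two convolutions trim 4
    if size <= 0:
        raise ValueError("input size is too small for this depth")
    return size


print(unet_output_size(572))  # 388
```

So a 572 x 572 input tile yields a 388 x 388 segmentation map, which is why the encoder feature maps have to be cropped before concatenation.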

Mask R-CNN

Mask R-CNN was published by Facebook AI Research (FAIR). It is built on top of Faster R-CNN, an object detection framework that outputs a class label and bounding box offsets for each candidate object. Mask R-CNN adds a new branch that outputs an object mask. Discussing Faster R-CNN is outside the scope of this post, but remember that it has been extended to output an object mask for each candidate. So, it’s basically an instance segmentation network.

Here, RoIPool is replaced by RoIAlign, because RoIPool quantizes the region coordinates coarsely when extracting features, whereas masks need accurate pixel-level alignment. The loss is extended with an object mask term, and the network is trained with this combined loss. We will review this in more detail later in the series when we use Mask R-CNN.
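The difference between the two can be illustrated at a single sampling point. RoIPool snaps fractional coordinates to the feature-map grid, while RoIAlign reads the value at the exact location using bilinear interpolation. The sketch below is a simplification that samples only one point (the actual operator samples several points per output bin and aggregates them), and the function name is an assumption for this example.

```python
import math


def bilinear_sample(feature, y, x):
    """Read a 2D feature map at fractional coordinates (y, x) by bilinear interpolation."""
    height, width = len(feature), len(feature[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


feature = [[0, 10], [20, 30]]
y, x = 0.5, 0.5

quantized = feature[int(y)][int(x)]      # RoIPool-style: snap to the grid first
aligned = bilinear_sample(feature, y, x)  # RoIAlign-style: interpolate at the exact point
print(quantized, aligned)  # 0 15.0
```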

In the next posts, we will build and train such networks to solve image segmentation problems in various domains.

Have a nice day!
