Building a Reverse Image Search Engine

Ever wondered how Google's reverse image search takes in an image and returns the most similar images in a fraction of a second? Or how Pinterest lets you search for items visually similar to a selected object? Sounds interesting? If you want to understand and build this kind of system, you are in the right place.

Hello all, I’m very excited to write this blog post. Building image search engines is one of the most challenging problems and my favorite topic of interest. Before we dive in, it is worth understanding what such an engine is and the difficulties involved in building one.

What is a Reverse Image Search Engine?

A while back, I visited a hotel with a wonderful ambiance. One thing that caught my attention was a sofa; it was so attractive that I wanted it in my own home. I immediately asked the manager where they got it, and he replied that they had recently bought it from an e-commerce site. After reaching home I started searching for different kinds of sofas, but had no luck, as I could not find the exact design, pattern, and upholstery I had seen there. Then I remembered that I had taken a picture of the sofa. I used Google’s search-by-image option and, luckily, got many results of the same kind, along with links to buy the exact sofa.

Querying by image and getting similar images back, rather than querying by keyword, is both exciting and far more precise. That is exactly what a reverse image search engine does!

Where are they used?

Reverse image search is also known as Content-Based Image Retrieval (CBIR). There are many applications of this technique. In fashion e-commerce, for example, it is used to recommend visually similar items to a user who is browsing a product, giving them many options to choose from.

Image search engines are also used for detecting copyright violations involving images and logos. Aerial-imaging companies use them to find similar-looking regions across the entire globe: for example, if a user wants to know where all the helipads in the world are, a reference image is supplied to the engine, which then searches satellite imagery for regions similar to it. This application is, of course, very computationally expensive.

To learn more about these systems, please visit this link.

How to build these systems?

A lot of research is going on in this field to improve the efficiency and accuracy of these systems. The most popular and widely used classical model is the Bag of Visual Words. In this technique, key points in an image are detected and described using various descriptors; the extracted features are clustered to build a fixed visual vocabulary, which is then used to build the final feature vector for each image. Finally, a similarity metric is used to rank the images.

Though this technique is accurate, it is computationally expensive and somewhat hard to build, and its performance is very limited on in-class problems: it works well for finding similarities between out-of-class images but fails for in-class applications. Consider face retrieval: such a system is good at retrieving images that contain faces, but it fails to retrieve images of the exact same person rather than just any face.

The difficulties mentioned above can be resolved by using deep learning to build these systems, making them more accurate while being less computationally expensive.

Just imagine training a deep learning model that learns the representation of an image and builds a feature vector per image, such that the feature vectors corresponding to similar images are close in the feature space. It is more elegant because it learns the similarity metric itself. To summarize it in one sentence: “it takes an image and retrieves similar images as output.”

In this blog post, we will implement the technique described in the paper Learning Fine-grained Image Similarity with Deep Ranking.

The architecture of a CBIR system

When a user submits a query image, the engine extracts features from it, compares those query features with the features stored in the database, ranks the relevant results, and returns them.

Designing our network

We’ll be using Convolutional Neural Networks (CNNs) to build our reverse image search engine. If you are unfamiliar with CNNs, please refer to this link. Until now, you might have seen CNNs applied to classification or regression tasks, but this task is quite different. Consider this situation: we train a network to classify the images in our dataset, and then retrieve images of the same class as the user’s query image. Initially this seems fine, but as time goes on, a new class may come up, which means fine-tuning the network to adapt to a new set of classes all over again. This becomes more and more of a hassle as the number of classes increases, so framing this task as classification is a very bad idea.

It’s better to build a network that extracts a lower-dimensional feature vector from an image rather than classifying it. Passing an image through a convolutional neural network always yields a feature vector, but how do we know that the feature vector produced by the network is good enough, i.e. that the feature vectors corresponding to similar images are close together?

So it is clear that the input to our network should not be a single image. It’s better to design a network that takes in three images:

  1. Query Image
  2. Positive Image
  3. Negative Image

The network takes an image and produces a feature vector, but since a single image is not enough to train the network to learn good representations, we pass in three inputs: a query image, an image similar to the query (the positive), and an image not similar to the query (the negative), and get three feature vectors corresponding to the three inputs. We then use these feature vectors to train our model to learn the representations. A high-level view of the network is shown below.

Now that we have three feature vectors, we have to train the network with a loss function that simultaneously maximizes the distance between the query and negative images and minimizes the distance between the query and positive images. But how do we model this objective? The answer is the triplet loss.

Triplet loss

$$L(q,p,n) = max(0, \epsilon+D(q,p)-D(q,n))$$

where $D(q,p)$ is the distance between the query and positive feature vectors and $D(q,n)$ is the distance between the query and negative feature vectors. The triplet loss works as follows: if $D(q,p)$ is very small and $D(q,n)$ is very large, then $D(q,p)-D(q,n)$ is strongly negative and the final loss is $0$ (note that we take the maximum of $0$ and the margin term), which means the network has learned well. If it is the other way around, the loss is $>0$. So minimizing this loss function, i.e. minimizing $D(q,p)$ while maximizing $D(q,n)$, is exactly what makes our network learn the similarity representations between images.

Note that $\epsilon$ is just a margin constant used to tune training. The distance can be any metric you like, such as Euclidean, Manhattan, or cosine. We can even use the $\chi^2$ distance, but in that case we had better normalize our feature vectors first.
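To make the behavior concrete, here is the loss computed on toy vectors with NumPy (a sketch assuming Euclidean distance and a margin of $\epsilon = 1$):

```python
import numpy as np

def triplet_loss(q, p, n, margin=1.0):
    # hinge on the gap between the positive and negative distances
    d_qp = np.linalg.norm(q - p)
    d_qn = np.linalg.norm(q - n)
    return max(0.0, margin + d_qp - d_qn)

q = np.array([0.0, 0.0])
p = np.array([0.0, 1.0])   # close to the query
n = np.array([5.0, 0.0])   # far from the query

print(triplet_loss(q, p, n))  # 1 + 1 - 5 clamps to 0.0: already well separated
print(triplet_loss(q, n, p))  # 1 + 5 - 1 = 5.0: positive and negative swapped
```

When the positive is already much closer than the negative, the loss is zero and no gradient flows; only violating triplets push the network to improve.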

Let’s start building our engine

Before we start building, we need a dataset on which to train and test our engine. For this post, I decided to go with the popular UKBench dataset, which is well suited to evaluating image retrieval systems. Since the official website for downloading the dataset is no longer available, I got it from another web resource; please download it from this link. Credit goes to the people who worked hard to collect those images.

The dataset consists of 1000 images of 250 different objects, each captured from four different viewpoints, which makes it very good for training and evaluating our engine. Note that there are many datasets available for image retrieval systems, but this one is enough for our experiment.

The steps involved in building our engine are as follows.

  1. Preparing the dataset
  2. Extracting features from images
  3. Training the network
  4. Evaluating the network

Preparing the dataset

Believe it or not, preparing the data for this experiment is much harder than for a typical classification task, because we need to prepare image triplets. It becomes easier, however, when we have the corresponding labels. Let’s start coding by firing up a new file in the utils package; it will contain various methods for loading, preparing, and extracting our dataset. First, import the necessary packages.

import numpy as np
import h5py
import glob
import cv2

Now let’s write a function which randomly samples a triplet from the given features and labels.

def get_triplets(data, labels):
    # sample two *distinct* labels: one positive, one negative
    pos_label, neg_label = np.random.choice(np.unique(labels), 2, replace=False)

    pos_indexes = np.where(labels == pos_label)[0]
    neg_indexes = np.where(labels == neg_label)[0]

    # shuffle so that different views of the same object get sampled
    np.random.shuffle(pos_indexes)
    np.random.shuffle(neg_indexes)

    anchor = data[pos_indexes[0]]
    positive = data[pos_indexes[-1]]
    negative = data[neg_indexes[0]]

    return (anchor, positive, negative)

The function get_triplets takes in the features and labels and samples a feature triplet. We randomly sample a positive label and a distinct negative label, then use these labels to sample the corresponding features after shuffling the indexes. Note that we do need labels in order to distinguish non-similar features.
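Here is the sampler run standalone on toy data (the function is re-declared so the snippet runs on its own, drawing from np.unique so the positive and negative labels are guaranteed to differ):

```python
import numpy as np

def get_triplets(data, labels):
    # draw two distinct labels: one positive, one negative
    pos_label, neg_label = np.random.choice(np.unique(labels), 2, replace=False)
    pos_indexes = np.where(labels == pos_label)[0]
    neg_indexes = np.where(labels == neg_label)[0]
    anchor = data[pos_indexes[0]]
    positive = data[pos_indexes[-1]]
    negative = data[neg_indexes[0]]
    return (anchor, positive, negative)

# 8 fake one-dimensional "features": rows 0-3 belong to object 1, rows 4-7 to object 2
data = np.arange(8).reshape(8, 1)
labels = (np.arange(8) // 4) + 1

anchor, positive, negative = get_triplets(data, labels)
# anchor and positive always share a label; negative comes from the other object
print(labels[anchor[0]] == labels[positive[0]])  # True
print(labels[anchor[0]] == labels[negative[0]])  # False
```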

def dump_features(images_dir, labels, hdf5_path, feature_extractor, image_formats=("jpg",)):

    image_paths = []
    for image_format in image_formats:
        image_paths += glob.glob("{}/*.{}".format(images_dir, image_format))

    image_paths = sorted(image_paths)
    db = h5py.File(hdf5_path, mode="w")

    # e.g. ((1000,), (9, 9, 512)) flattened to [1000, 9, 9, 512]
    features_shape = ((len(labels),), feature_extractor.output_shape[1:])
    features_shape = [dim for sublist in features_shape for dim in sublist]

    imageIDDB = db.create_dataset("image_ids", shape=(len(labels),),
        dtype=h5py.special_dtype(vlen=str))
    featuresDB = db.create_dataset("features",
        shape=features_shape, dtype="float")
    labelsDB = db.create_dataset("labels",
        shape=(len(labels),), dtype="int")

    for i in range(0, len(labels), 32):
        start, end = i, i + 32
        image_ids = [path.split("/")[-1] for path in image_paths[start:end]]
        images = [cv2.imread(path, 1) for path in image_paths[start:end]]
        features = feature_extractor.extract(images)

        imageIDDB[start:end] = image_ids
        featuresDB[start:end] = features
        labelsDB[start:end] = labels[start:end]
        print("Extracted {}/{}".format(min(end, len(labels)), len(labels)))

    db.close()


Next we write a function dump_features, which takes the path to a directory of images and their corresponding labels, extracts features, and saves them to disk. We use the HDF5 format to store the extracted features. The reason we choose HDF5 is that the features are very large, and HDF5 lets us access parts of an array without loading the entire array into memory, which gives us a smooth way of organizing things. We create three datasets within the HDF5 database, namely image_ids, features, and labels, to which we write the image names, the extracted features, and their corresponding labels. The features come from a feature extractor passed in as an argument; we will see why we need a standalone feature extractor instead of one end-to-end network in the sections below. Finally, the features are extracted from the images batch-wise and stored in the HDF5 database.
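As a quick illustration of what HDF5 buys us (this sketch assumes `h5py` is installed and uses a throwaway temp file), a dataset on disk can be sliced without reading the whole array into memory:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "features.h5")

# write a 1000 x 512 float array in one go
with h5py.File(path, "w") as db:
    dset = db.create_dataset("features", shape=(1000, 512), dtype="float")
    dset[:] = np.random.rand(1000, 512)

# read back only one 32-row batch; the rest stays on disk
with h5py.File(path, "r") as db:
    batch = db["features"][32:64]

print(batch.shape)  # (32, 512)
```

This partial-read behavior is exactly why the training and evaluation scripts later can load features batch by batch without exhausting memory.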

def extract_features(hdf5_path):
    db = h5py.File(hdf5_path,mode="r")
    features = db["features"][:]
    labels = db["labels"][:]

    return (features, labels)

def extract_embeddings(features, model):
    embeddings = model.predict([features, features, features])
    return embeddings[:,:,0]

The two functions above, extract_features and extract_embeddings, are helpers we will use in later modules.

Extracting features from images

Extracting features from images is pretty important, as it plays a key role in learning their representations. We could train a neural network from scratch, but it would take very long to converge, or at least to learn important spatial representations such as image gradients. So it is always better to use a pre-trained model to extract features from the image, and then feed those features as inputs to another network that we design. This not only saves a lot of time but is also very efficient in terms of memory usage.

You might wonder why we extract features with a pre-trained model, store them on disk, and then feed them into another network, instead of building a network that continues from the existing model by freezing some layers and training the rest. That is a reasonable approach, but not the better one here. If we build our network on top of the pre-trained model, then during training every image has to pass through that gigantic network several times, which not only slows things down but also keeps all of its weights in memory. VGG16’s weights alone are over 500 MB, which will eat up your GPU memory even though you are not training those layers.

Building a single continuous network is a good idea only if you want to backpropagate gradients through all the layers, and remember that in most cases this is not required. So I consider the method presented here well optimized for this experiment.

Let’s continue coding in a new file by importing the necessary libraries.

from keras.applications import VGG16, InceptionV3, VGG19, ResNet50, Xception
from keras.applications import imagenet_utils
from keras.applications.inception_v3 import preprocess_input
import cv2
import numpy as np
import keras.backend as K
from keras.models import Model
from keras.models import Sequential
from keras.layers import Lambda, Input, Dense, GlobalAveragePooling2D, Dropout

Then let’s start building a class named ImageNetFeatureExtractor which wraps up several pre-trained models.

class ImageNetFeatureExtractor(object):
    def __init__(self, model="vgg16", resize_to=(224, 224)):
        MODEL_DICT = {"vgg16": VGG16, "vgg19": VGG19, "inception": InceptionV3,
                      "resnet": ResNet50, "xception": Xception}
        self.model_name = model.lower()
        network = MODEL_DICT[self.model_name]
        self.model = network(include_top=False)
        self.preprocess_input = imagenet_utils.preprocess_input
        self.imageSize = resize_to
        # Inception and Xception expect inputs rescaled to [-1, 1]
        if self.model_name in ["inception", "xception"]:
            self.preprocess_input = preprocess_input

    def extract(self, images):
        images = self.preprocess(images)
        return self.model.predict(images)

    @property
    def output_shape(self):
        return self.model.compute_output_shape(
            (None, self.imageSize[0], self.imageSize[1], 3))

    def resize_images(self, images):
        return np.array([cv2.resize(image, (self.imageSize[0], self.imageSize[1]))
                         for image in images])

    def preprocess(self, images):
        images = self.resize_images(images)
        return self.preprocess_input(images.astype("float"))

The ImageNetFeatureExtractor constructor takes two optional arguments, model and resize_to. The primary purpose of this class is to wrap a pre-trained ImageNet model and extract features from images via ImageNetFeatureExtractor.extract. Since the input image size can vary widely with the application, and the output shape of the network varies with it, it’s handy to have a property called output_shape, which helps later when building the top network that takes this network’s output as its input. Also, images passed through the network must be preprocessed the way the pre-trained model expects, so we resize each image to a fixed size and apply the appropriate preprocessing. The images themselves can be of any size, but they all must be resized to the same fixed size to produce output of consistent shape.
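For the record, the Inception-style preprocess_input essentially rescales pixel values from [0, 255] to [-1, 1]; a minimal NumPy equivalent (a sketch, not the library implementation):

```python
import numpy as np

def inception_preprocess(images):
    # map uint8 pixel values in [0, 255] to floats in [-1, 1]
    return images.astype("float32") / 127.5 - 1.0

batch = np.zeros((1, 299, 299, 3), dtype="uint8")
batch[0, 0, 0] = [0, 128, 255]

out = inception_preprocess(batch)
print(out.min(), out.max())  # -1.0 1.0
```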

def concat_tensors(tensors, axis=-1):
    return K.concatenate([K.expand_dims(t, axis=axis) for t in tensors])

def get_small_network(input_shape=(None, 5, 5, 2048)):
    model = Sequential()
    # average over the spatial dimensions, e.g. (5, 5, 2048) -> (2048,)
    model.add(GlobalAveragePooling2D(input_shape=input_shape[1:]))
    model.add(Dense(512, activation="relu"))
    model.add(Dense(512, activation="relu"))
    model.add(Dense(256, activation="relu"))
    return model

concat_tensors is a helper function that concatenates tensors along a given axis, and get_small_network is the simple architecture we will use for this post; by the way, you can develop and test your own networks too. Observe that we take the extracted features as input and apply global average pooling, which averages over the height and width dimensions of the output. For example, if the feature shape is (None, 5, 5, 2048), it is averaged down to (None, 2048). This produces an output vector of the same size for a given network regardless of the input image size, and it keeps the network much smaller than flattening the feature maps would; in some cases it has also proven to fit better than flattening. Finally, we produce a 256-dimensional embedding as the output, so any image can be represented by just 256 numbers!
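Global average pooling itself is just a mean over the spatial axes; in NumPy:

```python
import numpy as np

# a batch of 2 convolutional feature maps of shape (5, 5, 2048)
features = np.random.rand(2, 5, 5, 2048)

# global average pooling: average each channel over height and width
pooled = features.mean(axis=(1, 2))
print(pooled.shape)  # (2, 2048)
```

Whatever the spatial resolution of the feature maps, the pooled vector always has one value per channel.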

def get_triplet_network(input_shape=(None, 5,5,2048)):
    base_model = get_small_network(input_shape=input_shape)

    anchor_input = Input(input_shape[1:])
    positive_input = Input(input_shape[1:])
    negative_input = Input(input_shape[1:])

    anchor_embeddings = base_model(anchor_input)
    positive_embeddings = base_model(positive_input)
    negative_embeddings = base_model(negative_input)

    output = Lambda(concat_tensors)([anchor_embeddings, positive_embeddings, negative_embeddings])
    model = Model([anchor_input, positive_input, negative_input], output)

    return model

Now comes the most interesting part, i.e. building a network on top of our base model that accepts three inputs and produces their respective embeddings. We use a Keras Lambda layer to concatenate the anchor/query, positive, and negative embeddings into a single tensor of shape (None, embedding_size, 3). Note that we pass the query, positive, and negative features through the same base_model instance, so the weights are shared across all three branches. If you build the network in TensorFlow directly, don’t forget to create all the weights and biases under the same variable scope.

Now it’s time to define our loss function. Let’s open up a new file and start coding.

import keras.backend as K

def euclidean_distance(a, b):
    # clip before the sqrt for numerical stability at zero distance
    return K.sqrt(K.maximum(K.sum(K.square(a - b), axis=1), K.epsilon()))

def triplet_loss(y_true, anchor_positive_negative_tensor):
    anchor = anchor_positive_negative_tensor[:, :, 0]
    positive = anchor_positive_negative_tensor[:, :, 1]
    negative = anchor_positive_negative_tensor[:, :, 2]

    Dp = euclidean_distance(anchor, positive)
    Dn = euclidean_distance(anchor, negative)

    # hinge per triplet (margin = 1), averaged over the batch
    return K.mean(K.maximum(0.0, 1.0 + Dp - Dn))

We define the triplet loss as discussed above: we take the output tensor, slice out the query, positive, and negative embeddings, compute the Euclidean distances, and apply the hinge. Note that in Keras a loss function must take y_true and y_pred as arguments, so we will later pass a dummy array of targets into the fit call.
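The round trip between concat_tensors and the slicing inside the loss is easy to check with NumPy: stacking three (batch, 256) embeddings along a new last axis gives (batch, 256, 3), and slicing index 1 recovers the positive branch:

```python
import numpy as np

a = np.full((4, 256), 1.0)   # anchor embeddings
p = np.full((4, 256), 2.0)   # positive embeddings
n = np.full((4, 256), 3.0)   # negative embeddings

# mirror of concat_tensors: expand_dims then concatenate on the last axis
merged = np.concatenate([x[..., None] for x in (a, p, n)], axis=-1)
print(merged.shape)           # (4, 256, 3)
print(merged[:, :, 1][0, 0])  # 2.0: the positive slice comes back out
```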

Let’s write a simple script to extract features and dump them to an HDF5 database as described above.

import argparse
import numpy as np
from models import ImageNetFeatureExtractor
from utils import dump_features

ap = argparse.ArgumentParser()
ap.add_argument("-d","--dataset", help="Path to dataset", required=True)
ap.add_argument("-f","--features-db", help="Path to save extracted features", required=True)
ap.add_argument("-m","--model", help="(vgg16, vgg19, inception, resnet, xception)", default="inception")
ap.add_argument("-r","--resize", help="resize to", default=299, type=int)

args = vars(ap.parse_args())

feature_extractor = ImageNetFeatureExtractor(model=args["model"],
                                             resize_to=(int(args["resize"]), int(args["resize"])))
print("[+] Successfully loaded pre-trained model")

dump_features(args["dataset"], labels=(np.arange(1000) // 4) + 1,
              hdf5_path=args["features_db"], feature_extractor=feature_extractor)

The code itself is pretty straightforward and self-explanatory. Note that we supply the labels explicitly: since the 1000 images cover 250 objects in consecutive groups of four, every block of four images gets the same label. Let’s start extracting features by firing up the script.
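The labeling trick is just integer floor division; for the first eight images:

```python
import numpy as np

# floor-divide the image index by 4 so each block of four shares one object id
labels = (np.arange(8) // 4) + 1
print(labels)  # [1 1 1 1 2 2 2 2]
```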

For this experiment, we will use VGG16 to extract the intermediate features, resizing every image to (299, 299), so the output shape of our features will be (None, 9, 9, 512); we take the output of the last convolutional block. Please refer to this link for a detailed architecture of VGG16.

Now let’s write a training script to teach our network the representations, moving the embeddings of similar images closer together and the others farther apart.

import argparse
from models import get_triplet_network
from utils import extract_features, triplet_loss, get_triplets
import numpy as np
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint

ap = argparse.ArgumentParser()
ap.add_argument("-f","--features-db", help="Path to saved features db")
ap.add_argument("-o","--output", help="Path to save the model checkpoints")
args = vars(ap.parse_args())

features, labels = extract_features(args["features_db"])
print("[+] Finished loading extracted features")

model = get_triplet_network(features.shape)

data = []
for i in range(len(features)):
    anchor, positive, negative = get_triplets(features, labels)
    data.append([anchor, positive, negative])

data = np.array(data)
targets = np.zeros(shape=(1000,256,3))

callback = ModelCheckpoint(args["output"], period=1, monitor="val_loss")
X_train, X_test, Y_train, Y_test = train_test_split(data, targets)

model.compile(Adam(1e-4), triplet_loss)
model.fit([X_train[:,0], X_train[:,1], X_train[:,2]], Y_train, epochs=10,
          validation_data=([X_test[:,0], X_test[:,1], X_test[:,2]], Y_test),
          callbacks=[callback], batch_size=32)

We use get_triplet_network to build the network, which accepts three inputs and outputs the three embeddings concatenated. Note that we compile with the custom triplet_loss function we built earlier. We use the Adam optimizer with a learning rate of 0.0001 and fit the model for 10 epochs with a batch size of 32. We create dummy target values even though the loss never uses them.

Now that we have trained our model, let’s write a script that shows the images most similar to a given query image. Let’s open up a new file and start coding.

import numpy as np
import h5py
import argparse
from utils import triplet_loss, extract_features
from keras.models import load_model
from sklearn.metrics import pairwise_distances
import cv2
from imutils import build_montages

ap = argparse.ArgumentParser()
ap.add_argument("-m","--model", help="Path to trained model")
ap.add_argument("-f","--features-db", help="Path to saved features database")
ap.add_argument("-d","--dataset", help="Path to dataset")
ap.add_argument("-i","--image", help="Path to query image")

args = vars(ap.parse_args())

image_ids = h5py.File(args["features_db"], mode="r")["image_ids"][:]

def get_image_index():
    filename = args["image"].split("/")[-1]
    return np.where(image_ids == filename)[0][0]

def get_image_path(index):
    return args["dataset"].rstrip("/") + "/" + str(image_ids[index])

model = load_model(args["model"], custom_objects={"triplet_loss":triplet_loss})
features, labels = extract_features(args["features_db"])

embeddings = model.predict([features, features, features])
embeddings = embeddings[:,:,2]

image_id = get_image_index()
query = embeddings[image_id]

distances = pairwise_distances(query.reshape(1,-1), embeddings)
indices = np.argsort(distances)[0][:12]
images = [cv2.imread(get_image_path(index)) for index in indices]
images = [cv2.resize(image, (200,200)) for image in images]
result = build_montages(images, (200, 200), (4,3))[0]

cv2.imshow("Result", result)
cv2.waitKey(0)

First we locate the query image and load its pre-extracted features from the HDF5 database we saved earlier. Then we pass those features through the final network to get the embedding, and compare it with all the other embeddings using the pairwise_distances function, which computes pairwise Euclidean distances. Note that we should use the same metric here that we used in the loss function, and that we have to provide the loss function explicitly via custom_objects when loading the model. For displaying the results we use build_montages from the imutils module, which helps us visualize them more vividly.
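Ranking itself is nothing more than sorting the distances; a toy version with plain NumPy (four 2-d embeddings, row 0 as the query):

```python
import numpy as np

embeddings = np.array([[0.0, 0.0],   # the query itself
                       [0.1, 0.0],   # very close
                       [5.0, 5.0],   # far away
                       [0.0, 0.2]])  # close

query = embeddings[0]
distances = np.linalg.norm(embeddings - query, axis=1)
ranking = np.argsort(distances)
print(ranking)  # [0 1 3 2]: the query ranks first, the farthest image last
```

This is why the query image always appears in the top-left of the montage: its distance to itself is zero.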

Let’s see some of the results. Please note that the top left image is the query image.

The results look very promising and accurate, so it is evident that our network is able to learn the visual similarity between images.

Evaluating the network

Undoubtedly our network performed very well, but let’s evaluate it properly over the dataset we have been working with. We fire up a new script and start coding.

import numpy as np
import argparse
from utils import triplet_loss, extract_features
from keras.models import load_model
from sklearn.metrics import pairwise_distances

ap = argparse.ArgumentParser()
ap.add_argument("-f","--features-db", help="Path to extracted features")
ap.add_argument("-m","--model", help="Path to trained model")
args = vars(ap.parse_args())

def get_similar_image_indices(embeddings, index, num_results=4):
    query = embeddings[index]
    distances = pairwise_distances(query.reshape(1, -1), embeddings)
    indices = np.argsort(distances)[0][:num_results]
    return indices

def find_num_correct(true_indices, predicted_indices):
    num_correct = 0
    for i in true_indices:
        if i in predicted_indices:
            num_correct += 1
    return num_correct

model = load_model(args["model"],custom_objects={"triplet_loss":triplet_loss})
features, labels = extract_features(args["features_db"])
embeddings = model.predict([features, features, features])
embeddings = embeddings[:,:,2]
num_correct = 0

labels = (np.arange(1000) // 4) + 1
for i in range(1000):
    similar_indices = get_similar_image_indices(embeddings, i)
    true_indices = np.where(labels == (i // 4) + 1)[0].tolist()
    num_correct += find_num_correct(true_indices, similar_indices) / 4.0

print("Accuracy", num_correct / 1000.0)

We iterate over all the features extracted from the pre-trained model, pass them through our network, and get the indices of the predicted similar images. Then we compare them with the true targets, which are just the labels.

Running the script with the required arguments gives an accuracy of 94.68%. That’s really great! Just remember that we evaluate using only the top four predicted indices, and it is not the case that only the four views of an object are similar to each other while everything else is dissimilar. This evaluation metric is therefore limited: beyond the four variants of each object, we do not know which images are truly similar to which, and as the montage above shows, many images in the dataset share structural similarities with each other.
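Concretely, the per-image score is the overlap between the four ground-truth indices and the four retrieved ones, divided by four:

```python
true_indices = [0, 1, 2, 3]   # the four views of one object
predicted = [0, 2, 9, 3]      # top-4 retrieved; index 9 is a miss

num_correct = sum(1 for i in true_indices if i in predicted)
print(num_correct / 4.0)  # 0.75
```

Averaging this fraction over all 1000 query images gives the final accuracy figure.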

So our network is likely even more accurate than the evaluated figure, i.e. >94.68%.

Finally, we have built a reverse image search engine that ranks images by their visual similarity, which is both exciting and challenging. The same technique can be used to build effective image recommendation engines, facial recognition systems, aerial localization systems, and more.

Further improvements

We have built a system that learns the similarity between images. But what if we also need representations for the exact opposite of an image? For example, if the query image shows a desert on a sunny day, the “opposite” results would be regions covered with ice. It’s important then not only to pull similar images together but also to push opposite images farther away, and remember that “not similar” does not mean “opposite.” Everything else stays the same; what changes is the choice of distance metric, and this is where it becomes crucial. We used the Euclidean distance to model similar images as close and dissimilar ones as far apart; for this task we should instead use cosine similarity, which ranges over [-1, 1]. If we build the network with a cosine-based metric and we have labels for opposite images (which are very difficult to collect), we can train the network to separate not just dissimilar images but truly opposite ones.
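Cosine similarity makes the notion of “opposite” expressible, since antiparallel vectors score -1:

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product of the vectors divided by the product of their norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cosine_similarity(a, np.array([2.0, 0.0])))   # 1.0: same direction
print(cosine_similarity(a, np.array([0.0, 1.0])))   # 0.0: orthogonal, unrelated
print(cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0: opposite direction
```

With Euclidean distance, by contrast, “very far” and “opposite” are indistinguishable.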

If you are developing a visual similarity engine for fashion products, where in-class items must be recommended effectively, then building a shallow convolutional network in parallel with the pre-trained one and concatenating their output features gives higher accuracy, because the shallow convnet learns fine-grained similarities such as simple colors, textures, and patterns. For example, Flipkart published a paper named Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce, where they address exactly this problem. The architecture they used is shown below.

Final words

I know this is a very long blog post, but remember that there is still a lot of research going on in this field. We have built a reverse image search engine with decent accuracy, and the same technique can be applied to facial recognition systems. It can also be used for image classification, where the nearest/most similar images determine the class the query image belongs to.

You can download the code from my GitHub. Thanks a lot, and keep checking back for more exciting content.
